
Generative Models for Images

GANs, VAEs, diffusion models, Stable Diffusion, and ControlNet



Generative models learn to create new data samples that resemble the training distribution. In computer vision, this means generating realistic images.

Generative Adversarial Networks (GANs)

Introduced by Ian Goodfellow and colleagues in 2014, GANs train two networks in a minimax game:

Architecture

  • Generator (G): Takes random noise z and produces a fake image G(z)
  • Discriminator (D): Takes an image and classifies it as real or fake

Training Dynamics

    The generator tries to fool the discriminator; the discriminator tries not to be fooled:

    $$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

Training alternates between two steps:

1. Train D: Show real images (label=1) and fake images (label=0); maximize classification accuracy.
2. Train G: Generate fake images and try to make D classify them as real (label=1).

    Mode Collapse

    The biggest challenge in GAN training. The generator finds a few outputs that fool the discriminator and keeps producing only those, ignoring the full diversity of the data distribution.

    Mitigation techniques:

  • Wasserstein GAN (WGAN): Uses Wasserstein distance instead of JS divergence
  • Spectral normalization: Stabilizes discriminator training
  • Progressive growing: Start with low resolution, gradually increase
  • Minibatch discrimination: Let D see batches of images, not just individual ones
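Minibatch discrimination can be approximated with the minibatch standard-deviation trick popularized by progressive-growing GANs. The sketch below (the function name `minibatch_stddev` is mine) appends the average per-batch feature standard deviation as one extra feature; a mode-collapsed batch of identical samples drives this statistic to zero, giving the discriminator an easy tell:

```python
import torch

def minibatch_stddev(features: torch.Tensor) -> torch.Tensor:
    """Append the average per-batch feature standard deviation as an
    extra feature, so D can detect low-diversity (collapsed) batches."""
    # features: (batch, dim) activations from a discriminator layer
    std = features.std(dim=0).mean()           # scalar: mean std across features
    stat = std.expand(features.size(0), 1)     # broadcast to every sample
    return torch.cat([features, stat], dim=1)  # (batch, dim + 1)

x = torch.randn(8, 16)
out = minibatch_stddev(x)                      # diverse batch: stat > 0
collapsed = minibatch_stddev(torch.zeros(8, 16))  # identical batch: stat = 0
```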

Notable GAN Architectures

  • DCGAN: First stable CNN-based GAN (batch norm, no pooling, transposed convolutions)
  • StyleGAN/StyleGAN2/3: State-of-the-art face generation (style-based generator with AdaIN)
  • Pix2pix: Paired image-to-image translation
  • CycleGAN: Unpaired image-to-image translation (horse↔zebra, summer↔winter)

GANs vs VAEs vs Diffusion Models

  • **GANs**: Adversarial training (generator vs. discriminator). Fast sampling, but training is unstable and prone to mode collapse.
  • **VAEs**: Encode to a latent distribution, decode back. Stable training with a principled loss (reconstruction + KL divergence), but outputs tend to be blurry.
  • **Diffusion models**: Gradually add noise, then learn to reverse the process. Highest-quality outputs, but slow sampling (many denoising steps). Currently dominant for image generation.

    Variational Autoencoders (VAEs)

    VAEs combine autoencoders with probabilistic inference:

    Architecture

  • Encoder: Maps input x to a distribution in latent space: q(z|x) = N(μ, σ²)
  • Decoder: Maps a latent sample z back to data space: p(x|z)

The Reparameterization Trick

Instead of sampling z ~ N(μ, σ²) directly (which is not differentiable), compute: $$z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim N(0, 1)$$ This makes the sampling operation differentiable, allowing backpropagation through the encoder.
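A minimal PyTorch sketch of the trick; parameterizing the encoder output as log σ² (a common convention, assumed here) keeps the variance positive without constraints:

```python
import torch

def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    All the randomness lives in eps, so gradients flow through mu and sigma."""
    std = torch.exp(0.5 * log_var)  # sigma = exp(log(sigma^2) / 2)
    eps = torch.randn_like(std)     # noise, outside the computation graph
    return mu + std * eps

mu = torch.zeros(4, 2, requires_grad=True)
log_var = torch.zeros(4, 2, requires_grad=True)
z = reparameterize(mu, log_var)
z.sum().backward()                  # gradients reach the encoder outputs
```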

    Loss Function

    $$L = L_{\text{reconstruction}} + \beta \cdot L_{\text{KL}}$$

  • Reconstruction loss: How well the decoder reconstructs the input (MSE or BCE)
  • KL divergence: How close the learned latent distribution is to a standard normal N(0, 1)

$$D_{KL}(q(z|x) \| p(z)) = -\frac{1}{2}\sum(1 + \log \sigma^2 - \mu^2 - \sigma^2)$$
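The closed-form KL term translates directly into code. This sketch (the function name `vae_loss` is illustrative) combines it with a BCE reconstruction term, assuming inputs scaled to [0, 1]:

```python
import torch
import torch.nn.functional as F

def vae_loss(recon, x, mu, log_var, beta=1.0):
    """Reconstruction term (BCE) plus the closed-form KL term."""
    recon_loss = F.binary_cross_entropy(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon_loss + beta * kl

# When q(z|x) already equals N(0, I), the KL term vanishes:
mu, log_var = torch.zeros(4, 2), torch.zeros(4, 2)
kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
```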

    Latent Space Properties

  • Continuity: Nearby points in latent space produce similar images
  • Interpolation: Walking between two latent points produces a smooth transition
  • Manipulation: Arithmetic in latent space (e.g., smiling face - neutral face + another face = smiling version)
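Interpolation is just a linear walk between two latent codes; decoding each intermediate point with the VAE decoder yields the smooth transition. A small sketch (names assumed):

```python
import torch

def interpolate_latents(z1: torch.Tensor, z2: torch.Tensor, steps: int = 8):
    """Linearly interpolate between two latent codes.

    Returns a (steps, latent_dim) tensor; feeding each row to the
    decoder produces a smooth visual transition between the two images."""
    alphas = torch.linspace(0, 1, steps).view(-1, 1)  # 0.0 ... 1.0
    return (1 - alphas) * z1 + alphas * z2

z1, z2 = torch.randn(1, 16), torch.randn(1, 16)
path = interpolate_latents(z1, z2)  # endpoints equal z1 and z2 exactly
```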

Diffusion Models

Diffusion models are the current state of the art for image generation, inspired by ideas from non-equilibrium thermodynamics.

    Forward Process (Adding Noise)

    Gradually add Gaussian noise to an image over T timesteps: $$q(x_t | x_{t-1}) = N(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I)$$

After enough steps, the image becomes pure noise. The noise schedule {β_t} controls how fast noise is added (linear, cosine, etc.).
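Because a sum of independent Gaussians is Gaussian, the forward process also has a closed form, $q(x_t | x_0) = N(\sqrt{\bar\alpha_t}\, x_0, (1-\bar\alpha_t) I)$ with $\bar\alpha_t = \prod_s \alpha_s$ and $\alpha_s = 1 - \beta_s$, so any timestep can be sampled in a single jump. A sketch with a linear schedule (the schedule values are illustrative):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)     # linear noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)  # cumulative product: alpha-bar_t

def q_sample(x0, t, noise):
    """Jump straight to step t: x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * eps."""
    a = alpha_bar[t].sqrt().view(-1, 1, 1, 1)
    s = (1 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)
    return a * x0 + s * noise

x0 = torch.randn(2, 3, 8, 8)
noise = torch.randn_like(x0)
# At the final step the sample is essentially pure noise:
x_last = q_sample(x0, torch.tensor([T - 1, T - 1]), noise)
```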

    Reverse Process (Denoising)

Learn a neural network ε_θ to predict the noise added at each step: $$p_\theta(x_{t-1} | x_t) = N(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$$

    The model is trained with a simple MSE loss: $$L = \mathbb{E}_{t, x_0, \epsilon}[\|\epsilon - \epsilon_\theta(x_t, t)\|^2]$$
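Putting the forward process and the MSE objective together, one training step looks roughly like the sketch below. A one-layer convolution stands in for the U-Net purely for illustration; real denoisers also condition on t:

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

# Stand-in for the U-Net: any module mapping a noisy image to predicted noise.
model = nn.Conv2d(3, 3, kernel_size=3, padding=1)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(x0):
    t = torch.randint(0, T, (x0.size(0),))   # random timestep per image
    eps = torch.randn_like(x0)               # the noise we will try to predict
    a = alpha_bar[t].sqrt().view(-1, 1, 1, 1)
    s = (1 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)
    x_t = a * x0 + s * eps                   # noisy sample at step t
    eps_pred = model(x_t)                    # real U-Nets also receive t
    loss = ((eps - eps_pred) ** 2).mean()    # the simple MSE objective above
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

loss = train_step(torch.randn(4, 3, 8, 8))
```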

    The U-Net Denoiser

    The noise predictor is typically a U-Net with:
  • Time embedding: Sinusoidal position encoding of the timestep t, injected into each block
  • Self-attention layers: For global context and coherence
  • Cross-attention layers: For conditioning on text (in text-to-image models)
  • Skip connections: Standard U-Net encoder-decoder architecture
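The time embedding is typically the transformer-style sinusoidal encoding applied to the timestep t; a minimal sketch (the function name is assumed):

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal embedding of diffusion timesteps (transformer-style).

    Returns a (batch, dim) tensor: the first half holds sines, the
    second half cosines, over geometrically spaced frequencies."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]  # (batch, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=1)

emb = timestep_embedding(torch.tensor([0, 250, 999]), 128)
```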

Stable Diffusion Architecture

    Stable Diffusion (Latent Diffusion Model) is the most widely used open-source image generation model:

    Key Innovation: Latent Space Diffusion

    Instead of diffusing in pixel space (512x512x3 = 786K dimensions), Stable Diffusion operates in a compressed latent space (64x64x4 = 16K dimensions):

1. VAE Encoder: Compresses the image from pixel space to latent space (8x spatial compression).
2. U-Net: Performs the denoising diffusion in latent space.
3. VAE Decoder: Decompresses the denoised latent back to pixel space.

    This makes training and inference ~50x more efficient than pixel-space diffusion.

    Text Conditioning

  • Text prompt is encoded by a text encoder (CLIP or T5)
  • Text embeddings are injected into the U-Net via cross-attention layers
  • This allows the model to generate images matching text descriptions

Classifier-Free Guidance (CFG)

    At inference, compute both conditional and unconditional predictions: $$\epsilon_{\text{guided}} = \epsilon_{\text{uncond}} + s \cdot (\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$$
  • Scale s (guidance scale) controls how strongly the output follows the text prompt
  • s=1: No extra guidance (reduces to the plain conditional prediction; s=0 gives the unconditional one)
  • s=7-8: Good balance (typical default)
  • s>15: Very strong adherence but may reduce quality/diversity
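The guidance formula is a one-liner: the unconditional prediction is pushed in the direction of the conditional one, extrapolating past it when s > 1 (the function name is mine):

```python
import torch

def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Combine unconditional and conditional noise predictions.

    scale=0 recovers the unconditional prediction, scale=1 the plain
    conditional one; larger scales extrapolate toward the prompt."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_u = torch.randn(1, 4, 8, 8)  # prediction with an empty prompt
eps_c = torch.randn(1, 4, 8, 8)  # prediction with the text prompt
guided = classifier_free_guidance(eps_u, eps_c, scale=7.5)
```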

ControlNet

    Adds spatial control to diffusion models:
  • Takes an additional "control" input (edge map, depth map, pose skeleton, etc.)
  • Trains a copy of the U-Net encoder blocks on paired (control, image) data while the original U-Net weights stay frozen
  • Zero convolution layers connect the copy to the frozen U-Net, so training starts exactly from the pretrained model's behavior
  • Enables precise spatial control while maintaining generation quality
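A zero convolution is simply a convolution whose weights and bias are initialized to zero, so the control branch contributes nothing until training moves it away from zero; a sketch (the helper name is assumed):

```python
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution initialized to zero: at the start of training the
    ControlNet branch adds nothing, leaving the pretrained model intact."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

branch_out = torch.randn(1, 64, 32, 32)     # output of the trainable copy
residual = zero_conv(64)(branch_out)        # all zeros before any training
```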

Other Techniques

  • Image-to-Image (img2img): Start denoising from a noisy version of an input image instead of pure noise. The noise strength controls how much the output diverges from the input.
  • Inpainting: Mask part of an image and regenerate only the masked region. The model fills in the masked area while keeping the rest of the image intact.
  • LoRA (Low-Rank Adaptation): Fine-tune a diffusion model on a small dataset by training only low-rank weight updates. Enables personalization (custom styles, faces, objects) with minimal compute.
```python
# ==============================================================
# Simple GAN for MNIST in PyTorch
# ==============================================================
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt

# Hyperparameters
latent_dim = 100
img_dim = 28 * 28  # Flattened MNIST
batch_size = 128
lr = 2e-4
epochs = 50

# Data
transform = T.Compose([T.ToTensor(), T.Normalize([0.5], [0.5])])
dataset = torchvision.datasets.MNIST("./data", train=True,
                                     download=True, transform=transform)
loader = DataLoader(dataset, batch_size, shuffle=True)

# Generator: noise vector -> 28x28 image
class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.LeakyReLU(0.2),
            nn.BatchNorm1d(256),
            nn.Linear(256, 512),
            nn.LeakyReLU(0.2),
            nn.BatchNorm1d(512),
            nn.Linear(512, 1024),
            nn.LeakyReLU(0.2),
            nn.BatchNorm1d(1024),
            nn.Linear(1024, img_dim),
            nn.Tanh(),  # outputs in [-1, 1], matching the normalized data
        )

    def forward(self, z):
        return self.net(z).view(-1, 1, 28, 28)

# Discriminator: image -> probability of being real
class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, 512),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )

    def forward(self, img):
        return self.net(img.view(-1, img_dim))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
G = Generator().to(device)
D = Discriminator().to(device)
opt_G = torch.optim.Adam(G.parameters(), lr=lr, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=lr, betas=(0.5, 0.999))
criterion = nn.BCELoss()

# Training
for epoch in range(epochs):
    for real_imgs, _ in loader:
        real_imgs = real_imgs.to(device)
        batch = real_imgs.size(0)
        real_labels = torch.ones(batch, 1, device=device)
        fake_labels = torch.zeros(batch, 1, device=device)

        # ---- Train Discriminator ----
        z = torch.randn(batch, latent_dim, device=device)
        fake_imgs = G(z).detach()  # detach: no generator gradients here
        d_loss = (criterion(D(real_imgs), real_labels) +
                  criterion(D(fake_imgs), fake_labels)) / 2
        opt_D.zero_grad()
        d_loss.backward()
        opt_D.step()

        # ---- Train Generator ----
        z = torch.randn(batch, latent_dim, device=device)
        fake_imgs = G(z)
        g_loss = criterion(D(fake_imgs), real_labels)  # fool D
        opt_G.zero_grad()
        g_loss.backward()
        opt_G.step()

    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/{epochs} | "
              f"D Loss: {d_loss.item():.4f} | G Loss: {g_loss.item():.4f}")

# Generate samples
with torch.no_grad():
    z = torch.randn(16, latent_dim, device=device)
    samples = G(z).cpu()
    grid = torchvision.utils.make_grid(samples, nrow=4, normalize=True)
    plt.figure(figsize=(6, 6))
    plt.imshow(grid.permute(1, 2, 0).squeeze(), cmap="gray")
    plt.title("Generated MNIST Digits")
    plt.axis("off")
    plt.show()
```
```python
# ==============================================================
# Stable Diffusion with the diffusers library
# pip install diffusers transformers accelerate
# ==============================================================
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# ---- Text-to-Image ----
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=dtype,
)
pipe = pipe.to(device)

# Generate an image
prompt = "A serene mountain lake at sunset, photorealistic, 4k"
negative_prompt = "blurry, low quality, distorted"

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=50,
    guidance_scale=7.5,
    height=512,
    width=512,
).images[0]
image.save("mountain_lake.png")

# ---- Exploring guidance scale ----
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 4, figsize=(20, 5))
for i, scale in enumerate([1.0, 5.0, 7.5, 15.0]):
    img = pipe(
        prompt="A cat wearing a tiny top hat, oil painting",
        guidance_scale=scale,
        num_inference_steps=30,
    ).images[0]
    axes[i].imshow(img)
    axes[i].set_title(f"CFG Scale = {scale}")
    axes[i].axis("off")
plt.suptitle("Effect of Classifier-Free Guidance Scale")
plt.tight_layout()
plt.show()

# ---- Image-to-Image ----
img2img_pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=dtype,
)
img2img_pipe = img2img_pipe.to(device)

init_image = Image.open("mountain_lake.png").resize((512, 512))
result = img2img_pipe(
    prompt="Same scene but in winter with snow, photorealistic",
    image=init_image,
    strength=0.75,  # 0 = keep the input, 1 = complete regeneration
    guidance_scale=7.5,
    num_inference_steps=50,
).images[0]
result.save("mountain_lake_winter.png")
```

GPU Memory Requirements

Stable Diffusion models require significant GPU memory:

  • SD 1.5/2.1: ~4-6 GB VRAM with float16
  • SDXL: ~8-12 GB VRAM with float16
  • SD 3.0: ~12-16 GB VRAM

To reduce memory usage, use `pipe.enable_attention_slicing()` or `pipe.enable_model_cpu_offload()`. For CPU-only systems, use float32 but expect generation to take several minutes per image.