
Generative Models for Images

GANs, VAEs, diffusion models, Stable Diffusion, and ControlNet



Generative models learn to create new data samples that resemble the training distribution. In computer vision, this means generating realistic images.

Generative Adversarial Networks (GANs)

Introduced by Ian Goodfellow and colleagues in 2014, GANs train two networks in a minimax game:

Architecture

  • Generator (G): Takes random noise z and produces a fake image G(z)
  • Discriminator (D): Takes an image and classifies it as real or fake

Training Dynamics

    The generator tries to fool the discriminator; the discriminator tries not to be fooled:

    $$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

Training alternates between two steps:

1. Train D: Show real images (label=1) and fake images (label=0); maximize classification accuracy.
2. Train G: Generate fake images and try to make D classify them as real (label=1).

    Mode Collapse

    The biggest challenge in GAN training. The generator finds a few outputs that fool the discriminator and keeps producing only those, ignoring the full diversity of the data distribution.

    Mitigation techniques:

  • Wasserstein GAN (WGAN): Uses Wasserstein distance instead of JS divergence
  • Spectral normalization: Stabilizes discriminator training
  • Progressive growing: Start with low resolution, gradually increase
  • Minibatch discrimination: Let D see batches of images, not just individual ones
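Minibatch discrimination can be approximated with the minibatch standard-deviation trick popularized by progressive-growing GANs. The sketch below (the function name `minibatch_stddev` is mine) appends the average per-batch feature standard deviation as one extra feature; a mode-collapsed batch of identical samples drives this statistic to zero, giving the discriminator an easy tell:

```python
import torch

def minibatch_stddev(features: torch.Tensor) -> torch.Tensor:
    """Append the average per-batch feature standard deviation as an
    extra feature, so D can detect low-diversity (collapsed) batches."""
    # features: (batch, dim) activations from a discriminator layer
    std = features.std(dim=0).mean()           # scalar: mean std across features
    stat = std.expand(features.size(0), 1)     # broadcast to every sample
    return torch.cat([features, stat], dim=1)  # (batch, dim + 1)

x = torch.randn(8, 16)
out = minibatch_stddev(x)                      # diverse batch: stat > 0
collapsed = minibatch_stddev(torch.zeros(8, 16))  # identical batch: stat = 0
```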

Notable GAN Architectures

  • DCGAN: First stable CNN-based GAN (batch norm, no pooling, transposed convolutions)
  • StyleGAN/StyleGAN2/3: State-of-the-art face generation (style-based generator with AdaIN)
  • Pix2pix: Paired image-to-image translation
  • CycleGAN: Unpaired image-to-image translation (horse↔zebra, summer↔winter)

GANs vs VAEs vs Diffusion Models

  • **GANs**: Adversarial training (generator vs. discriminator). Fast sampling, but training is unstable and prone to mode collapse.
  • **VAEs**: Encode to a latent distribution, decode back. Stable training with a principled loss (reconstruction + KL divergence), but outputs tend to be blurry.
  • **Diffusion models**: Gradually add noise, then learn to reverse the process. Highest-quality outputs, but slow sampling (many denoising steps). Currently dominant for image generation.

    Variational Autoencoders (VAEs)

    VAEs combine autoencoders with probabilistic inference:

    Architecture

  • Encoder: Maps input x to a distribution in latent space: q(z|x) = N(μ, σ²)
  • Decoder: Maps a latent sample z back to data space: p(x|z)

The Reparameterization Trick

Instead of sampling z ~ N(μ, σ²) directly (which is not differentiable), compute: $$z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim N(0, 1)$$ This makes the sampling operation differentiable, allowing backpropagation through the encoder.
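A minimal PyTorch sketch of the trick; parameterizing the encoder output as log σ² (a common convention, assumed here) keeps the variance positive without constraints:

```python
import torch

def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    All the randomness lives in eps, so gradients flow through mu and sigma."""
    std = torch.exp(0.5 * log_var)  # sigma = exp(log(sigma^2) / 2)
    eps = torch.randn_like(std)     # noise, outside the computation graph
    return mu + std * eps

mu = torch.zeros(4, 2, requires_grad=True)
log_var = torch.zeros(4, 2, requires_grad=True)
z = reparameterize(mu, log_var)
z.sum().backward()                  # gradients reach the encoder outputs
```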

    Loss Function

    $$L = L_{\text{reconstruction}} + \beta \cdot L_{\text{KL}}$$

  • Reconstruction loss: How well the decoder reconstructs the input (MSE or BCE)
  • KL divergence: How close the learned latent distribution is to a standard normal N(0, 1)

$$D_{KL}(q(z|x) \| p(z)) = -\frac{1}{2}\sum(1 + \log \sigma^2 - \mu^2 - \sigma^2)$$
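The closed-form KL term translates directly into code. This sketch (the function name `vae_loss` is illustrative) combines it with a BCE reconstruction term, assuming inputs scaled to [0, 1]:

```python
import torch
import torch.nn.functional as F

def vae_loss(recon, x, mu, log_var, beta=1.0):
    """Reconstruction term (BCE) plus the closed-form KL term."""
    recon_loss = F.binary_cross_entropy(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon_loss + beta * kl

# When q(z|x) already equals N(0, I), the KL term vanishes:
mu, log_var = torch.zeros(4, 2), torch.zeros(4, 2)
kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
```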

    Latent Space Properties

  • Continuity: Nearby points in latent space produce similar images
  • Interpolation: Walking between two latent points produces a smooth transition
  • Manipulation: Arithmetic in latent space (e.g., smiling face - neutral face + another face = smiling version)
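Interpolation is just a linear walk between two latent codes; decoding each intermediate point with the VAE decoder yields the smooth transition. A small sketch (names assumed):

```python
import torch

def interpolate_latents(z1: torch.Tensor, z2: torch.Tensor, steps: int = 8):
    """Linearly interpolate between two latent codes.

    Returns a (steps, latent_dim) tensor; feeding each row to the
    decoder produces a smooth visual transition between the two images."""
    alphas = torch.linspace(0, 1, steps).view(-1, 1)  # 0.0 ... 1.0
    return (1 - alphas) * z1 + alphas * z2

z1, z2 = torch.randn(1, 16), torch.randn(1, 16)
path = interpolate_latents(z1, z2)  # endpoints equal z1 and z2 exactly
```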

Diffusion Models

Diffusion models are the current state of the art for image generation, inspired by ideas from non-equilibrium thermodynamics.

    Forward Process (Adding Noise)

    Gradually add Gaussian noise to an image over T timesteps: $$q(x_t | x_{t-1}) = N(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I)$$

After enough steps, the image becomes pure noise. The noise schedule {β_t} controls how fast noise is added (linear, cosine, etc.).
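Because a sum of independent Gaussians is Gaussian, the forward process also has a closed form, $q(x_t | x_0) = N(\sqrt{\bar\alpha_t}\, x_0, (1-\bar\alpha_t) I)$ with $\bar\alpha_t = \prod_s \alpha_s$ and $\alpha_s = 1 - \beta_s$, so any timestep can be sampled in a single jump. A sketch with a linear schedule (the schedule values are illustrative):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)     # linear noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)  # cumulative product: alpha-bar_t

def q_sample(x0, t, noise):
    """Jump straight to step t: x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * eps."""
    a = alpha_bar[t].sqrt().view(-1, 1, 1, 1)
    s = (1 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)
    return a * x0 + s * noise

x0 = torch.randn(2, 3, 8, 8)
noise = torch.randn_like(x0)
# At the final step the sample is essentially pure noise:
x_last = q_sample(x0, torch.tensor([T - 1, T - 1]), noise)
```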

    Reverse Process (Denoising)

Learn a neural network ε_θ to predict the noise added at each step: $$p_\theta(x_{t-1} | x_t) = N(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$$

    The model is trained with a simple MSE loss: $$L = \mathbb{E}_{t, x_0, \epsilon}[\|\epsilon - \epsilon_\theta(x_t, t)\|^2]$$
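Putting the forward process and the MSE objective together, one training step looks roughly like the sketch below. A one-layer convolution stands in for the U-Net purely for illustration; real denoisers also condition on t:

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

# Stand-in for the U-Net: any module mapping a noisy image to predicted noise.
model = nn.Conv2d(3, 3, kernel_size=3, padding=1)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(x0):
    t = torch.randint(0, T, (x0.size(0),))   # random timestep per image
    eps = torch.randn_like(x0)               # the noise we will try to predict
    a = alpha_bar[t].sqrt().view(-1, 1, 1, 1)
    s = (1 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)
    x_t = a * x0 + s * eps                   # noisy sample at step t
    eps_pred = model(x_t)                    # real U-Nets also receive t
    loss = ((eps - eps_pred) ** 2).mean()    # the simple MSE objective above
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

loss = train_step(torch.randn(4, 3, 8, 8))
```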

    The U-Net Denoiser

    The noise predictor is typically a U-Net with:
  • Time embedding: Sinusoidal position encoding of the timestep t, injected into each block
  • Self-attention layers: For global context and coherence
  • Cross-attention layers: For conditioning on text (in text-to-image models)
  • Skip connections: Standard U-Net encoder-decoder architecture
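The time embedding is typically the transformer-style sinusoidal encoding applied to the timestep t; a minimal sketch (the function name is assumed):

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal embedding of diffusion timesteps (transformer-style).

    Returns a (batch, dim) tensor: the first half holds sines, the
    second half cosines, over geometrically spaced frequencies."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]  # (batch, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=1)

emb = timestep_embedding(torch.tensor([0, 250, 999]), 128)
```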

Stable Diffusion Architecture

    Stable Diffusion (Latent Diffusion Model) is the most widely used open-source image generation model:

    Key Innovation: Latent Space Diffusion

    Instead of diffusing in pixel space (512x512x3 = 786K dimensions), Stable Diffusion operates in a compressed latent space (64x64x4 = 16K dimensions):

1. VAE Encoder: Compresses the image from pixel space to latent space (8x spatial compression).
2. U-Net: Performs the denoising diffusion in latent space.
3. VAE Decoder: Decompresses the denoised latent back to pixel space.

    This makes training and inference ~50x more efficient than pixel-space diffusion.

    Text Conditioning

  • Text prompt is encoded by a text encoder (CLIP or T5)
  • Text embeddings are injected into the U-Net via cross-attention layers
  • This allows the model to generate images matching text descriptions

Classifier-Free Guidance (CFG)

    At inference, compute both conditional and unconditional predictions: $$\epsilon_{\text{guided}} = \epsilon_{\text{uncond}} + s \cdot (\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$$
  • Scale s (guidance scale) controls how strongly the output follows the text prompt
  • s=1: No extra guidance (reduces to the plain conditional prediction; s=0 gives the unconditional one)
  • s=7-8: Good balance (typical default)
  • s>15: Very strong adherence but may reduce quality/diversity
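The guidance formula is a one-liner: the unconditional prediction is pushed in the direction of the conditional one, extrapolating past it when s > 1 (the function name is mine):

```python
import torch

def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Combine unconditional and conditional noise predictions.

    scale=0 recovers the unconditional prediction, scale=1 the plain
    conditional one; larger scales extrapolate toward the prompt."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_u = torch.randn(1, 4, 8, 8)  # prediction with an empty prompt
eps_c = torch.randn(1, 4, 8, 8)  # prediction with the text prompt
guided = classifier_free_guidance(eps_u, eps_c, scale=7.5)
```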

ControlNet

    Adds spatial control to diffusion models:
  • Takes an additional "control" input (edge map, depth map, pose skeleton, etc.)
  • Trains a copy of the U-Net encoder blocks on paired (control, image) data while the original U-Net weights stay frozen
  • Zero convolution layers connect the copy to the frozen U-Net, so training starts exactly from the pretrained model's behavior
  • Enables precise spatial control while maintaining generation quality
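A zero convolution is simply a convolution whose weights and bias are initialized to zero, so the control branch contributes nothing until training moves it away from zero; a sketch (the helper name is assumed):

```python
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution initialized to zero: at the start of training the
    ControlNet branch adds nothing, leaving the pretrained model intact."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

branch_out = torch.randn(1, 64, 32, 32)     # output of the trainable copy
residual = zero_conv(64)(branch_out)        # all zeros before any training
```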

Other Techniques

  • Image-to-Image (img2img): Start denoising from a noisy version of an input image instead of pure noise. The noise strength controls how much the output diverges from the input.
  • Inpainting: Mask part of an image and regenerate only the masked region. The model fills in the masked area while keeping the rest of the image intact.
  • LoRA (Low-Rank Adaptation): Fine-tune a diffusion model on a small dataset by training only low-rank weight updates. Enables personalization (custom styles, faces, objects) with minimal compute.
```python
# ==============================================================
# Simple GAN for MNIST in PyTorch
# ==============================================================
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt

# Hyperparameters
latent_dim = 100
img_dim = 28 * 28  # Flattened MNIST
batch_size = 128
lr = 2e-4
epochs = 50

# Data
transform = T.Compose([T.ToTensor(), T.Normalize([0.5], [0.5])])
dataset = torchvision.datasets.MNIST("./data", train=True,
                                     download=True, transform=transform)
loader = DataLoader(dataset, batch_size, shuffle=True)

# Generator: noise vector -> 28x28 image
class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.LeakyReLU(0.2),
            nn.BatchNorm1d(256),
            nn.Linear(256, 512),
            nn.LeakyReLU(0.2),
            nn.BatchNorm1d(512),
            nn.Linear(512, 1024),
            nn.LeakyReLU(0.2),
            nn.BatchNorm1d(1024),
            nn.Linear(1024, img_dim),
            nn.Tanh(),  # outputs in [-1, 1], matching the normalized data
        )

    def forward(self, z):
        return self.net(z).view(-1, 1, 28, 28)

# Discriminator: image -> probability of being real
class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, 512),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )

    def forward(self, img):
        return self.net(img.view(-1, img_dim))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
G = Generator().to(device)
D = Discriminator().to(device)
opt_G = torch.optim.Adam(G.parameters(), lr=lr, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=lr, betas=(0.5, 0.999))
criterion = nn.BCELoss()

# Training
for epoch in range(epochs):
    for real_imgs, _ in loader:
        real_imgs = real_imgs.to(device)
        batch = real_imgs.size(0)
        real_labels = torch.ones(batch, 1, device=device)
        fake_labels = torch.zeros(batch, 1, device=device)

        # ---- Train Discriminator ----
        z = torch.randn(batch, latent_dim, device=device)
        fake_imgs = G(z).detach()  # detach: no generator gradients here
        d_loss = (criterion(D(real_imgs), real_labels) +
                  criterion(D(fake_imgs), fake_labels)) / 2
        opt_D.zero_grad()
        d_loss.backward()
        opt_D.step()

        # ---- Train Generator ----
        z = torch.randn(batch, latent_dim, device=device)
        fake_imgs = G(z)
        g_loss = criterion(D(fake_imgs), real_labels)  # fool D
        opt_G.zero_grad()
        g_loss.backward()
        opt_G.step()

    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/{epochs} | "
              f"D Loss: {d_loss.item():.4f} | G Loss: {g_loss.item():.4f}")

# Generate samples
with torch.no_grad():
    z = torch.randn(16, latent_dim, device=device)
    samples = G(z).cpu()
    grid = torchvision.utils.make_grid(samples, nrow=4, normalize=True)
    plt.figure(figsize=(6, 6))
    plt.imshow(grid.permute(1, 2, 0).squeeze(), cmap="gray")
    plt.title("Generated MNIST Digits")
    plt.axis("off")
    plt.show()
```
```python
# ==============================================================
# Stable Diffusion with the diffusers library
# pip install diffusers transformers accelerate
# ==============================================================
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# ---- Text-to-Image ----
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=dtype,
)
pipe = pipe.to(device)

# Generate an image
prompt = "A serene mountain lake at sunset, photorealistic, 4k"
negative_prompt = "blurry, low quality, distorted"

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=50,
    guidance_scale=7.5,
    height=512,
    width=512,
).images[0]
image.save("mountain_lake.png")

# ---- Exploring guidance scale ----
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 4, figsize=(20, 5))
for i, scale in enumerate([1.0, 5.0, 7.5, 15.0]):
    img = pipe(
        prompt="A cat wearing a tiny top hat, oil painting",
        guidance_scale=scale,
        num_inference_steps=30,
    ).images[0]
    axes[i].imshow(img)
    axes[i].set_title(f"CFG Scale = {scale}")
    axes[i].axis("off")
plt.suptitle("Effect of Classifier-Free Guidance Scale")
plt.tight_layout()
plt.show()

# ---- Image-to-Image ----
img2img_pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=dtype,
)
img2img_pipe = img2img_pipe.to(device)

init_image = Image.open("mountain_lake.png").resize((512, 512))
result = img2img_pipe(
    prompt="Same scene but in winter with snow, photorealistic",
    image=init_image,
    strength=0.75,  # 0 = keep the input, 1 = complete regeneration
    guidance_scale=7.5,
    num_inference_steps=50,
).images[0]
result.save("mountain_lake_winter.png")
```

GPU Memory Requirements

Stable Diffusion models require significant GPU memory:

  • SD 1.5/2.1: ~4-6 GB VRAM with float16
  • SDXL: ~8-12 GB VRAM with float16
  • SD 3.0: ~12-16 GB VRAM

To reduce memory usage, use `pipe.enable_attention_slicing()` or `pipe.enable_model_cpu_offload()`. For CPU-only systems, use float32 but expect generation to take several minutes per image.