Neural Architecture Search (NAS)
Designing neural network architectures has traditionally been a manual, expertise-intensive process. Neural Architecture Search (NAS) automates this by treating architecture design as a search problem: define a space of possible architectures, search through it efficiently, and evaluate candidates to find the best one.
NAS has produced some of the most successful architectures in deep learning, including NASNet, EfficientNet, and MobileNetV3.
The Three Pillars of NAS
Search Spaces
Cell-based Search Space
Instead of searching over entire architectures, search for a cell (a small building block) that is repeated to form the full network. This dramatically reduces the search space.
Operation Space
Each edge in the cell can be one of several operations (convolutions, separable convolutions, pooling, skip connections, or the zero operation).
Macro vs Micro Search
| Approach | Searches for | Space size | Cost |
|---|---|---|---|
| Macro | Entire network topology | Enormous | Very high |
| Micro (cell-based) | Cell structure only | Small | Manageable |
| Hierarchical | Both cell and network | Medium | Medium |
Search Strategies
Random Search
Surprisingly competitive baseline. Randomly sample architectures and evaluate them. Works well because many architectures in a well-designed search space perform similarly.
Reinforcement Learning (NASNet, 2017)
An RNN controller generates architecture descriptions. The controller is trained with REINFORCE, using the validation accuracy of each generated architecture as the reward signal. Very expensive: the original NASNet search used 500 GPUs for 4 days.
Evolutionary Methods (AmoebaNet, 2018)
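To make the controller idea concrete, here is a toy REINFORCE loop: a small LSTM controller samples a sequence of operation indices, receives a reward, and is updated with a moving-average baseline. The `mock_reward` function is a hypothetical stand-in for validation accuracy (real NAS trains each sampled network, which is what makes it so expensive).

```python
import torch
import torch.nn as nn

OPS = ["conv3x3", "conv5x5", "sep_conv3x3", "max_pool3x3", "skip"]
HIDDEN = 32

class Controller(nn.Module):
    """Tiny LSTM controller emitting one operation choice per step."""
    def __init__(self, n_ops, n_steps):
        super().__init__()
        self.n_steps = n_steps
        self.lstm = nn.LSTMCell(HIDDEN, HIDDEN)
        self.embed = nn.Embedding(n_ops, HIDDEN)
        self.head = nn.Linear(HIDDEN, n_ops)

    def sample(self):
        h = torch.zeros(1, HIDDEN)
        c = torch.zeros(1, HIDDEN)
        token = torch.zeros(1, HIDDEN)  # start token
        log_probs, choices = [], []
        for _ in range(self.n_steps):
            h, c = self.lstm(token, (h, c))
            dist = torch.distributions.Categorical(logits=self.head(h))
            op = dist.sample()
            log_probs.append(dist.log_prob(op))
            choices.append(op.item())
            token = self.embed(op)
        return choices, torch.stack(log_probs).sum()

def mock_reward(arch):
    """Hypothetical stand-in for validation accuracy (favors sep convs)."""
    return sum(1.0 if OPS[i] == "sep_conv3x3" else 0.2 for i in arch) / len(arch)

torch.manual_seed(0)
controller = Controller(len(OPS), n_steps=4)
opt = torch.optim.Adam(controller.parameters(), lr=0.05)
baseline = 0.0
for _ in range(150):
    arch, log_prob = controller.sample()
    reward = mock_reward(arch)
    baseline = 0.9 * baseline + 0.1 * reward  # moving-average baseline
    loss = -(reward - baseline) * log_prob    # REINFORCE update
    opt.zero_grad()
    loss.backward()
    opt.step()

print(controller.sample()[0])  # architecture sampled by the trained controller
```

The baseline subtraction reduces gradient variance; without it, REINFORCE on sparse architecture rewards is notoriously unstable.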
Maintain a population of architectures. In each generation:
1. Select parent architectures (tournament selection)
2. Mutate (add/remove/change operations)
3. Evaluate offspring
4. Replace weakest members of the population
Evolutionary methods are more sample-efficient than RL and naturally explore diverse architectures.
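The generation loop above fits in a few lines. Here `fitness` is a hypothetical stand-in for validation accuracy (real evolutionary NAS trains each offspring before scoring it):

```python
import random

OPS = ["conv3x3", "conv5x5", "sep_conv3x3", "max_pool3x3", "skip"]

def fitness(arch):
    """Hypothetical stand-in for validation accuracy."""
    return sum(1.0 if op == "sep_conv3x3" else 0.2 for op in arch) / len(arch)

def mutate(arch):
    """Change one randomly chosen operation."""
    child = list(arch)
    child[random.randrange(len(child))] = random.choice(OPS)
    return child

random.seed(0)
population = [[random.choice(OPS) for _ in range(4)] for _ in range(20)]

for _ in range(100):
    # 1. Tournament selection: best of a random sample becomes the parent
    parent = max(random.sample(population, 5), key=fitness)
    # 2-3. Mutate the parent and (here, trivially) evaluate the offspring
    child = mutate(parent)
    # 4. Replace the weakest member of the population
    weakest = min(range(len(population)), key=lambda i: fitness(population[i]))
    population[weakest] = child

best = max(population, key=fitness)
print(best, round(fitness(best), 2))
```

AmoebaNet's "regularized evolution" variant instead removes the *oldest* member each generation, which keeps the population from being dominated by a single lucky lineage.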
Differentiable NAS (DARTS, 2019)
The breakthrough that made NAS practical. Instead of discrete search, make the search space continuous:
1. Place all possible operations on every edge (a mixed operation)
2. Weight each operation with a learnable architecture parameter alpha
3. Optimize architecture parameters and model weights jointly using gradient descent
4. After search, discretize: keep the operation with the highest alpha on each edge
DARTS reduces search cost from thousands of GPU-hours to a single GPU-day.
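A minimal sketch of the mixed operation at the heart of DARTS, with a reduced operation set for brevity. (Real DARTS additionally alternates between updating weights on training data and alphas on validation data, a bilevel optimization this sketch omits.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """All candidate operations on one edge, blended by softmax(alpha)."""
    def __init__(self, C):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(C, C, 3, padding=1, bias=False),
            nn.Conv2d(C, C, 5, padding=2, bias=False),
            nn.Identity(),
        ])
        # One learnable architecture parameter per candidate operation
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

    def discretize(self):
        """After search, keep only the highest-weighted operation."""
        return self.ops[self.alpha.argmax().item()]

edge = MixedOp(C=8)
x = torch.randn(2, 8, 16, 16)
out = edge(x)        # differentiable w.r.t. both weights and alphas
out.sum().backward()
print(edge.alpha.grad is not None)  # True: alphas receive gradients
```

Because the softmax makes the architecture choice differentiable, ordinary gradient descent can move probability mass toward whichever operation lowers the loss.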
```python
# === NAS Concepts: Search Space & Evaluation ===
import time

import numpy as np
import torch
import torch.nn as nn

class Zero(nn.Module):
    """Zero operation (no connection)."""
    def forward(self, x):
        return torch.zeros_like(x)

# --- Define a simple cell-based search space ---
OPERATIONS = {
    "conv3x3": lambda C: nn.Sequential(
        nn.Conv2d(C, C, 3, padding=1, bias=False),
        nn.BatchNorm2d(C), nn.ReLU(),
    ),
    "conv5x5": lambda C: nn.Sequential(
        nn.Conv2d(C, C, 5, padding=2, bias=False),
        nn.BatchNorm2d(C), nn.ReLU(),
    ),
    "sep_conv3x3": lambda C: nn.Sequential(
        nn.Conv2d(C, C, 3, padding=1, groups=C, bias=False),
        nn.Conv2d(C, C, 1, bias=False),
        nn.BatchNorm2d(C), nn.ReLU(),
    ),
    "max_pool3x3": lambda C: nn.MaxPool2d(3, stride=1, padding=1),
    "avg_pool3x3": lambda C: nn.AvgPool2d(3, stride=1, padding=1),
    "skip": lambda C: nn.Identity(),
    "zero": lambda C: Zero(),
}

class NASCell(nn.Module):
    """A cell with a specific architecture (list of operations)."""
    def __init__(self, channels, ops_config):
        super().__init__()
        self.ops = nn.ModuleList([
            OPERATIONS[op](channels) for op in ops_config
        ])

    def forward(self, x):
        return sum(op(x) for op in self.ops)

class NASNetwork(nn.Module):
    """Full network built by stacking cells."""
    def __init__(self, n_cells, channels, ops_config, n_classes=10):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(),
        )
        self.cells = nn.ModuleList([
            NASCell(channels, ops_config) for _ in range(n_cells)
        ])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(channels, n_classes)

    def forward(self, x):
        x = self.stem(x)
        for cell in self.cells:
            x = cell(x)
        x = self.pool(x).flatten(1)
        return self.classifier(x)

    def count_params(self):
        return sum(p.numel() for p in self.parameters())

# --- Random Architecture Search ---
def random_architecture(n_ops=4):
    """Generate a random cell configuration."""
    ops = list(OPERATIONS.keys())
    return [str(np.random.choice(ops)) for _ in range(n_ops)]

def evaluate_architecture(ops_config, n_cells=3, channels=16):
    """Quick evaluation of an architecture (parameter count + latency)."""
    model = NASNetwork(n_cells, channels, ops_config)
    model.eval()  # use running BatchNorm stats while timing
    params = model.count_params()

    # Measure forward-pass time
    x = torch.randn(1, 3, 32, 32)
    start = time.time()
    with torch.no_grad():
        for _ in range(10):
            model(x)
    latency = (time.time() - start) / 10 * 1000  # ms

    return params, latency

# --- Search ---
np.random.seed(42)
n_candidates = 20

print("=== Random Architecture Search ===")
print(f"Operations: {list(OPERATIONS.keys())}")
print(f"Evaluating {n_candidates} random architectures...\n")

results = []
for i in range(n_candidates):
    ops = random_architecture(n_ops=4)
    params, latency = evaluate_architecture(ops)
    results.append({
        "id": i, "ops": ops, "params": params, "latency_ms": latency,
    })

# Sort by efficiency (params * latency)
results.sort(key=lambda x: x["params"] * x["latency_ms"])

print(f"{'Rank':<6} {'Params':>10} {'Latency':>10} {'Operations'}")
print("-" * 60)
for rank, r in enumerate(results[:10], 1):
    ops_str = ", ".join(r["ops"])
    print(f"{rank:<6} {r['params']:>10,} {r['latency_ms']:>9.2f}ms "
          f"{ops_str}")

print(f"\nMost efficient: {results[0]['ops']}")
print(f"Least efficient: {results[-1]['ops']}")
print(f"\nParam range: {results[0]['params']:,} - {results[-1]['params']:,}")
```

EfficientNet: Compound Scaling
EfficientNet (Tan & Le, 2019) addresses a key question: given a fixed compute budget, how should you scale a network? Prior work scaled one dimension at a time (depth, width, or resolution). EfficientNet scales all three simultaneously using a compound coefficient.
The Compound Scaling Method
Given a baseline architecture, scale three dimensions with a single coefficient phi:

depth: d = alpha^phi
width: w = beta^phi
resolution: r = gamma^phi

where alpha, beta, gamma are constants found via grid search, subject to alpha * beta^2 * gamma^2 ~= 2 (FLOPs grow roughly with d * w^2 * r^2, so each increment of phi approximately doubles compute).

For EfficientNet-B0 (baseline): alpha = 1.2, beta = 1.1, gamma = 1.15
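A quick sanity check of the constraint with the B0 constants:

```python
alpha, beta, gamma = 1.2, 1.1, 1.15  # EfficientNet-B0 constants

# FLOPs scale roughly with depth * width^2 * resolution^2, so one step
# of compound scaling multiplies compute by alpha * beta^2 * gamma^2:
per_step = alpha * beta**2 * gamma**2
print(round(per_step, 2))  # 1.92 -- close to the target of 2

# Total compute multiplier at scale phi is (alpha * beta^2 * gamma^2)^phi
for phi in range(1, 5):
    print(phi, round(per_step**phi, 1))
```

This is why the FLOPs column in the table below roughly doubles with each unit of phi.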
| Model | Phi | Params | Top-1 Acc | FLOPs |
|---|---|---|---|---|
| B0 | 0 | 5.3M | 77.1% | 0.39B |
| B1 | 1 | 7.8M | 79.1% | 0.70B |
| B3 | 3 | 12M | 81.6% | 1.8B |
| B5 | 5 | 30M | 83.6% | 9.9B |
| B7 | 7 | 66M | 84.3% | 37B |
Hardware-Aware NAS
Modern NAS incorporates hardware constraints (latency, memory, energy) directly into the search objective.
Once-for-All Networks
Instead of searching separately for each hardware target:
1. Train a single supernet that supports variable depth, width, and resolution
2. Use progressive shrinking: first train the largest network, then gradually allow smaller sub-networks
3. At deployment time, search for the best sub-network that fits the target hardware constraints
This amortizes the training cost: one training run supports deployment to phones, tablets, servers, and IoT devices.
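A toy sketch of the idea with elastic depth only (real Once-for-All networks also vary width, kernel size, and resolution, and use knowledge distillation during shrinking):

```python
import random
import torch
import torch.nn as nn

class Supernet(nn.Module):
    """Once-for-all style supernet with elastic depth: a sub-network
    runs only the first `depth` residual blocks."""
    def __init__(self, dim=16, max_depth=6):
        super().__init__()
        self.max_depth = max_depth
        self.blocks = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(max_depth)]
        )
        self.head = nn.Linear(dim, 10)

    def forward(self, x, depth=None):
        depth = depth or self.max_depth
        for block in self.blocks[:depth]:
            x = x + torch.relu(block(x))  # residual keeps sub-nets stable
        return self.head(x)

torch.manual_seed(0)
net = Supernet()
x = torch.randn(4, 16)
opt = torch.optim.SGD(net.parameters(), lr=0.01)

# Progressive shrinking (sketch): train at full depth first, then sample
# shallower sub-networks so the shared weights serve every depth.
for step in range(20):
    depth = net.max_depth if step < 10 else random.randint(2, net.max_depth)
    loss = net(x, depth).pow(2).mean()  # dummy objective for illustration
    opt.zero_grad()
    loss.backward()
    opt.step()

# Deployment-time "search": pick the deepest sub-network within a budget
budget_blocks = 3
print(net(x, depth=budget_blocks).shape)  # torch.Size([4, 10])
```

Every sub-network shares the same weight tensors, which is what lets one training run serve many hardware targets.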
```python
# === Compound Scaling (EfficientNet Style) ===
import time

import torch
import torch.nn as nn

def make_network(depth_mult, width_mult, resolution, base_channels=32,
                 base_depth=3, n_classes=10):
    """Create a simple CNN with configurable scaling.

    `resolution` only determines the input size at measurement time;
    the fully convolutional body adapts to it automatically.
    """
    channels = int(base_channels * width_mult)
    depth = int(base_depth * depth_mult)

    layers = [
        nn.Conv2d(3, channels, 3, padding=1, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(),
    ]

    for _ in range(depth):
        layers.extend([
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        ])

    layers.extend([
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(channels, n_classes),
    ])

    return nn.Sequential(*layers)

def measure_model(model, resolution, n_runs=20):
    """Measure model parameters and latency."""
    params = sum(p.numel() for p in model.parameters())
    x = torch.randn(1, 3, resolution, resolution)

    model.eval()
    # Warmup
    with torch.no_grad():
        for _ in range(5):
            model(x)

    start = time.time()
    with torch.no_grad():
        for _ in range(n_runs):
            model(x)
    latency = (time.time() - start) / n_runs * 1000

    return params, latency

# === Compound Scaling Experiments ===
# EfficientNet constants (simplified)
alpha = 1.2   # depth multiplier base
beta = 1.1    # width multiplier base
gamma = 1.15  # resolution multiplier base

base_resolution = 32

print("=== Compound Scaling (EfficientNet-Style) ===\n")
print(f"Scaling constants: alpha={alpha}, beta={beta}, gamma={gamma}")
print(f"Base: depth=3, width=32, resolution={base_resolution}\n")

# Compare scaling strategies
strategies = {
    "Depth only": [],
    "Width only": [],
    "Resolution only": [],
    "Compound": [],
}

for phi in range(5):
    # Depth only
    d, w, r = alpha**phi, 1.0, base_resolution
    model = make_network(d, w, r)
    params, lat = measure_model(model, r)
    strategies["Depth only"].append((phi, params, lat))

    # Width only
    d, w, r = 1.0, beta**phi, base_resolution
    model = make_network(d, w, r)
    params, lat = measure_model(model, r)
    strategies["Width only"].append((phi, params, lat))

    # Resolution only
    d, w, r = 1.0, 1.0, int(base_resolution * gamma**phi)
    model = make_network(d, w, r)
    params, lat = measure_model(model, r)
    strategies["Resolution only"].append((phi, params, lat))

    # Compound (all three)
    d, w, r = alpha**phi, beta**phi, int(base_resolution * gamma**phi)
    model = make_network(d, w, r)
    params, lat = measure_model(model, r)
    strategies["Compound"].append((phi, params, lat))

# Print comparison
for strategy_name, results in strategies.items():
    print(f"--- {strategy_name} ---")
    print(f"{'Phi':>4} {'Params':>10} {'Latency':>10} {'Efficiency':>12}")
    for phi, params, lat in results:
        # Efficiency = params per ms of latency
        eff = params / lat if lat > 0 else 0
        print(f"{phi:>4} {params:>10,} {lat:>9.2f}ms {eff:>11,.0f} p/ms")
    print()

# Summary
print("=== Scaling Summary (at phi=4) ===")
print(f"{'Strategy':<20} {'Params':>10} {'Latency':>10}")
print("-" * 42)
for name, results in strategies.items():
    _, params, lat = results[4]
    print(f"{name:<20} {params:>10,} {lat:>9.2f}ms")

print("\nCompound scaling achieves a balanced tradeoff between")
print("model capacity (params) and computational cost (latency).")
```

Practical NAS Today
Practical NAS Tools
Optuna
General-purpose hyperparameter optimization that works well for architecture search:

```python
import optuna

def objective(trial):
    n_layers = trial.suggest_int("n_layers", 2, 8)
    hidden = trial.suggest_int("hidden", 32, 256)
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    # Build and train a model here; return validation accuracy
    ...

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
```
NNI (Microsoft)
Neural Network Intelligence toolkit with built-in NAS support.
AutoKeras
NAS for Keras/TensorFlow with a simple API:

```python
import autokeras as ak

clf = ak.ImageClassifier(max_trials=10)  # try up to 10 architectures
clf.fit(x_train, y_train)
```