
Tensors & Automatic Differentiation

PyTorch tensors, operations, GPU computing, and the autograd engine

~45 min


PyTorch is built around tensors — multi-dimensional arrays similar to NumPy's ndarrays, but with two superpowers:

1. GPU acceleration — tensors can live on CUDA-capable GPUs for massively parallel computation.
2. Automatic differentiation — PyTorch tracks every operation on tensors and can compute gradients automatically.

These two features make PyTorch the backbone of modern deep learning research and production systems.

Creating Tensors

There are many ways to create tensors in PyTorch:

```python
import torch
import numpy as np

# From Python lists
a = torch.tensor([1, 2, 3])
b = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

# Common factory functions
zeros = torch.zeros(3, 4)           # 3x4 matrix of zeros
ones = torch.ones(2, 3, 5)          # 2x3x5 tensor of ones
rand = torch.rand(4, 4)             # uniform random [0, 1)
randn = torch.randn(4, 4)           # standard normal distribution
arange = torch.arange(0, 10, 2)     # [0, 2, 4, 6, 8]
linspace = torch.linspace(0, 1, 5)  # [0.0, 0.25, 0.5, 0.75, 1.0]
eye = torch.eye(3)                  # 3x3 identity matrix

# From NumPy (shared memory — changes to one affect the other!)
np_array = np.array([1.0, 2.0, 3.0])
from_numpy = torch.from_numpy(np_array)

# Specifying dtype
x = torch.tensor([1, 2, 3], dtype=torch.float32)
y = torch.zeros(3, dtype=torch.int64)

print(f"Shape: {b.shape}, Dtype: {b.dtype}, Device: {b.device}")
# Shape: torch.Size([2, 2]), Dtype: torch.float32, Device: cpu
```
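The shared-memory claim for torch.from_numpy is easy to verify directly. A quick sketch using the standard from_numpy and Tensor.numpy APIs:

```python
import numpy as np
import torch

# torch.from_numpy shares the underlying buffer with the NumPy array
np_array = np.array([1.0, 2.0, 3.0])
t = torch.from_numpy(np_array)

# Mutating the NumPy array is visible through the tensor...
np_array[0] = 99.0
print(t)            # tensor([99.,  2.,  3.], dtype=torch.float64)

# ...and in-place tensor ops are visible through the array
t.add_(1.0)
print(np_array)     # [100.   3.   4.]

# Tensor.numpy() is the view in the other direction (CPU tensors only):
# a new ndarray object, but backed by the same memory
back = t.numpy()
print(back is np_array)                   # False
print(np.shares_memory(back, np_array))   # True
```

Note that the NumPy array above defaults to float64, so the resulting tensor is float64 as well, not PyTorch's usual float32.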

Tensor Data Types (dtypes)

Common dtypes: torch.float32 (default for most operations), torch.float64 (double precision), torch.float16 (half precision, for mixed-precision training), torch.int64 (default for integer tensors), torch.bool. You can cast with tensor.to(dtype) or tensor.float(), tensor.long(), etc.
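A short sketch of those casts in action:

```python
import torch

x = torch.tensor([1, 2, 3])          # Python ints default to int64
print(x.dtype)                        # torch.int64

f = x.to(torch.float32)               # explicit cast via .to(dtype)
print(f.dtype)                        # torch.float32

# Convenience methods are equivalent shorthands
print(x.float().dtype)                # torch.float32
print(f.long().dtype)                 # torch.int64
print(x.to(torch.float16).dtype)      # torch.float16
print((x > 1).dtype)                  # torch.bool (comparisons yield bool tensors)
```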

Tensor Operations

PyTorch tensors support a rich set of operations for reshaping, slicing, and computation:

```python
import torch

x = torch.arange(12, dtype=torch.float32)

# --- Reshaping ---
a = x.reshape(3, 4)       # Reshape to 3x4 (may copy)
b = x.view(3, 4)          # Reshape to 3x4 (requires contiguous memory)
c = x.reshape(2, -1)      # -1 means "infer this dimension" -> 2x6

# --- Squeeze / Unsqueeze ---
t = torch.zeros(1, 3, 1, 4)
print(t.shape)                    # torch.Size([1, 3, 1, 4])
print(t.squeeze().shape)          # torch.Size([3, 4]) — removes all size-1 dims
print(t.squeeze(0).shape)         # torch.Size([3, 1, 4]) — removes dim 0 only

u = torch.zeros(3, 4)
print(u.unsqueeze(0).shape)       # torch.Size([1, 3, 4]) — add dim at position 0
print(u.unsqueeze(-1).shape)      # torch.Size([3, 4, 1]) — add dim at end

# --- Indexing and Slicing (NumPy-style) ---
m = torch.arange(12).reshape(3, 4)
print(m[0])           # First row: tensor([0, 1, 2, 3])
print(m[:, 1])        # Second column: tensor([1, 5, 9])
print(m[1:, :2])      # Rows 1+, first 2 cols: tensor([[4, 5], [8, 9]])

# --- Transpose and Permute ---
t = torch.randn(2, 3, 4)
print(t.transpose(0, 2).shape)    # torch.Size([4, 3, 2])
print(t.permute(2, 0, 1).shape)   # torch.Size([4, 2, 3])

# --- Concatenation and Stacking ---
a = torch.ones(2, 3)
b = torch.zeros(2, 3)
cat = torch.cat([a, b], dim=0)      # Shape: [4, 3]
stack = torch.stack([a, b], dim=0)  # Shape: [2, 2, 3]

# --- Math operations ---
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([4.0, 5.0, 6.0])
print(x + y)               # Element-wise addition
print(x * y)               # Element-wise multiplication
print(x @ y)               # Dot product: tensor(32.)
print(torch.matmul(x.unsqueeze(0), y.unsqueeze(1)))  # (1,3) @ (3,1) matrix multiply
```

.view() vs .reshape()

.view() requires the tensor to be contiguous in memory and never copies data — it just reinterprets the same memory. .reshape() works on any tensor and will copy if needed. Use .view() when you know the tensor is contiguous (most of the time) for a guaranteed zero-copy operation. If you get a 'not contiguous' error, call .contiguous() first or use .reshape().
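The contiguity caveat is easy to trigger with a transpose, which returns a view with swapped strides rather than rearranged memory. A minimal sketch:

```python
import torch

x = torch.arange(6).reshape(2, 3)
t = x.t()                       # transpose is a view with swapped strides
print(t.is_contiguous())        # False

# t.view(6) raises a RuntimeError here, because the memory layout
# no longer matches a flat row-major order
try:
    t.view(6)
except RuntimeError as e:
    print("view failed:", e)

# Either of these works:
flat1 = t.reshape(6)            # copies, because it must
flat2 = t.contiguous().view(6)  # explicit copy, then a zero-copy view
print(flat1.tolist())           # [0, 3, 1, 4, 2, 5]
```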

GPU Operations

PyTorch makes it easy to move computations to GPU:

```python
import torch

# Check GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Move tensors to GPU
x = torch.randn(1000, 1000)
x_gpu = x.to(device)            # Move to `device` (no-op if already there)
# x_gpu = x.cuda()              # Equivalent, but raises if CUDA is unavailable

# Create directly on the target device
y = torch.randn(1000, 1000, device=device)

# Operations between tensors must be on the same device!
z = x_gpu @ y                   # Both on same device — works
# z = x @ x_gpu                 # ERROR on GPU: can't mix CPU and GPU tensors

# Move back to CPU (e.g., for NumPy conversion)
result = z.cpu().numpy()

# For Apple Silicon Macs:
if torch.backends.mps.is_available():
    mps_device = torch.device("mps")
    x_mps = x.to(mps_device)
```

Device Mismatches

You cannot perform operations between tensors on different devices. Always ensure all tensors in a computation are on the same device. A common pattern: define device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') at the top of your script, then use .to(device) consistently.

Automatic Differentiation (Autograd)

This is the magic that makes training neural networks possible. PyTorch builds a computational graph dynamically as you perform operations, then uses it to compute gradients via backpropagation.

```python
import torch

# --- Basic autograd ---
# requires_grad=True tells PyTorch to track operations for gradient computation
x = torch.tensor(3.0, requires_grad=True)

# Forward pass: compute y = x^2 + 2x + 1
y = x**2 + 2*x + 1

# Backward pass: compute dy/dx
y.backward()

# The gradient is stored in x.grad
print(x.grad)  # tensor(8.0)  because dy/dx = 2x + 2 = 2(3) + 2 = 8

# --- With vectors ---
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()    # Need a scalar for .backward()
y.backward()
print(x.grad)         # tensor([2., 4., 6.])  — dy/dx_i = 2*x_i

# --- Gradient accumulation (important!) ---
x = torch.tensor(2.0, requires_grad=True)

y1 = x ** 2
y1.backward()
print(x.grad)         # tensor(4.0)

y2 = x ** 3
y2.backward()
print(x.grad)         # tensor(16.0) — gradients ACCUMULATE! 4 + 12 = 16

# Always zero gradients before a new computation
x.grad.zero_()
y3 = x ** 3
y3.backward()
print(x.grad)         # tensor(12.0) — now correct
```

Computational Graph

Every operation on tensors with requires_grad=True is recorded in a directed acyclic graph (DAG). Leaf nodes are input tensors, internal nodes are operations, and the root is the output. When you call .backward(), PyTorch traverses this graph in reverse (backpropagation) using the chain rule to compute all gradients. The graph is destroyed after .backward() by default (set retain_graph=True to keep it).
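The graph-freeing behavior can be demonstrated directly. A minimal sketch using retain_graph:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3

# Keep the graph's saved buffers alive for a second backward pass
y.backward(retain_graph=True)
print(x.grad)    # tensor(12.)  because dy/dx = 3x^2 = 12 at x = 2

# A second backward only works because we retained the graph;
# without retain_graph=True, PyTorch raises:
# RuntimeError: Trying to backward through the graph a second time
y.backward()
print(x.grad)    # tensor(24.)  gradients accumulate: 12 + 12
```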

Detach and No-Grad

Sometimes you need to stop gradient tracking:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2

# .detach() creates a new tensor that shares data but has no gradient history
z = y.detach()
print(z.requires_grad)   # False
# z is a "view" of y's data, but gradient won't flow through it

# torch.no_grad() context manager — disables gradient computation entirely
# Use during inference for speed and memory savings
x = torch.randn(1000, 1000, requires_grad=True)
with torch.no_grad():
    y = x @ x.T             # No computational graph built
    print(y.requires_grad)  # False

# Common pattern: evaluation / inference
model = ...  # some nn.Module
model.eval()
with torch.no_grad():
    predictions = model(test_data)

# torch.inference_mode() — even faster than no_grad (PyTorch 1.9+)
with torch.inference_mode():
    predictions = model(test_data)
```

When to use no_grad vs detach

Use torch.no_grad() (or torch.inference_mode()) when you want to disable gradient tracking for an entire block of code — typically during validation or inference. Use .detach() when you want to extract a specific tensor from the computational graph while keeping gradients flowing elsewhere, such as when computing auxiliary metrics during training.
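The auxiliary-metric pattern can be sketched as follows (the loss and values here are illustrative, not from the lesson):

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
data = torch.tensor([0.5, 1.5])

# Training loss — gradients should flow through this
loss = ((w * data) ** 2).sum()

# Auxiliary metric computed from the same value: detach so the
# metric bookkeeping never enters (or keeps alive) the graph
metric = loss.detach().sqrt()
print(metric.requires_grad)   # False

# Gradients still flow through `loss` itself
loss.backward()
print(w.grad)                 # tensor([ 0.5000, -9.0000])
```

Here d(loss)/dw_i = 2 * w_i * data_i^2, so the gradient is [0.5, -9.0]; detaching the metric changed nothing about the training gradients.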