Math for Machine Learning
You don't need a math PhD, but you need to understand the intuition behind these concepts.
Linear Algebra — The Language of Data
In ML, data is represented as vectors and matrices.
Vector: A list of numbers. Each data point is a vector of features.
[height, weight, age] = [5.9, 160, 32]
Matrix: A 2D array. Your entire dataset is a matrix. Rows = samples, columns = features.
Tensor: An N-dimensional array. Images are 3D tensors (height × width × channels). Batches of images are 4D tensors.
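Those shapes are easy to see directly in NumPy. A quick sketch (the zero-filled arrays are just placeholders for illustration):

```python
import numpy as np

# Vector: one data point with 3 features
vector = np.array([5.9, 160, 32])
print(vector.shape)   # (3,)

# Matrix: a dataset of 10 samples x 3 features
matrix = np.zeros((10, 3))
print(matrix.shape)   # (10, 3)

# Tensor: one RGB image, height x width x channels
image = np.zeros((224, 224, 3))
print(image.shape)    # (224, 224, 3)

# Batch of 32 images: a 4D tensor
batch = np.zeros((32, 224, 224, 3))
print(batch.ndim)     # 4
```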
import numpy as np

# Vectors
features = np.array([5.9, 160, 32])
weights = np.array([0.3, 0.01, 0.05])

# Dot product — the fundamental operation in neural networks
# Each neuron computes: dot(inputs, weights) + bias
prediction = np.dot(features, weights)  # 5.9*0.3 + 160*0.01 + 32*0.05 = 4.97

# Matrix multiplication — processing a whole batch at once
# 3 samples, 4 features each
X = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10., 11., 12.],
])

# Weight matrix: 4 inputs -> 2 outputs
W = np.array([
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
    [0.7, 0.8],
])

# Matrix multiply: (3×4) @ (4×2) = (3×2)
output = X @ W  # Each row is the output for one sample
print(output.shape)  # (3, 2)

Why Matrix Multiplication Matters
A single matrix multiply applies the whole weight matrix to the whole batch in one operation, and GPUs are built to make exactly that operation fast.
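One way to see it: a single matrix multiply computes every per-sample dot product at once. A minimal check, reusing the same X and W as above:

```python
import numpy as np

# 3 samples, 4 features each
X = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10., 11., 12.],
])

# Weight matrix: 4 inputs -> 2 outputs
W = np.array([
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
    [0.7, 0.8],
])

# The slow way: one dot product per (sample, output) pair
looped = np.array([[np.dot(row, col) for col in W.T] for row in X])

# The fast way: one matrix multiply does all of them at once
batched = X @ W

print(np.allclose(looped, batched))  # True
```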
Calculus — How Models Learn
The key concept: gradients tell us how to adjust parameters to reduce error.
Derivative: How much an output changes when you tweak an input.
Gradient: A vector of partial derivatives — tells you the direction of steepest increase.
Gradient Descent: Move parameters in the opposite direction of the gradient to minimize loss.
New weight = Old weight - learning_rate × gradient
That's it. That's how neural networks learn.
# Simple gradient descent from scratch
import numpy as np

def gradient_descent_demo():
    # Simple linear regression: y = wx + b
    # True values: w=2, b=1
    np.random.seed(42)
    X = np.random.randn(100)
    y_true = 2 * X + 1 + np.random.randn(100) * 0.1

    # Initialize parameters at zero
    w, b = 0.0, 0.0
    learning_rate = 0.1

    for epoch in range(50):
        # Forward pass: make predictions
        y_pred = w * X + b

        # Compute loss (Mean Squared Error)
        loss = np.mean((y_pred - y_true) ** 2)

        # Compute gradients (derivatives of loss w.r.t. w and b)
        dw = 2 * np.mean((y_pred - y_true) * X)
        db = 2 * np.mean(y_pred - y_true)

        # Update parameters (gradient descent step)
        w -= learning_rate * dw
        b -= learning_rate * db

        if epoch % 10 == 0:
            print(f"Epoch {epoch}: loss={loss:.4f}, w={w:.4f}, b={b:.4f}")

    print(f"\nLearned: w={w:.4f} (true=2.0), b={b:.4f} (true=1.0)")
gradient_descent_demo()

Statistics & Probability
Mean — Average value. Used everywhere (mean loss, mean accuracy).
Variance / Standard Deviation — How spread out data is. Critical for normalization.
Normal Distribution — Bell curve. Weight initialization, noise, many natural phenomena.
Bayes' Theorem — P(A|B) = P(B|A) × P(A) / P(B). Foundation of probabilistic ML.
Softmax — Converts raw numbers into probabilities (sums to 1). Used in classification output layers.
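The Bayes' theorem line above is easy to verify with the classic diagnostic-test example. The numbers here are illustrative assumptions, not from any real test: 99% sensitivity, a 5% false-positive rate, and 1% prevalence.

```python
# P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_pos_given_disease = 0.99   # sensitivity (assumed)
p_disease = 0.01             # prevalence (assumed)
p_pos_given_healthy = 0.05   # false-positive rate (assumed)

# P(positive) via the law of total probability
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"{p_disease_given_pos:.3f}")  # 0.167
```

Even with a 99%-sensitive test, a positive result is still wrong 5 times out of 6, because the disease is rare. This is why the prior P(A) matters so much in probabilistic ML.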
import numpy as np

# Softmax — converts logits to probabilities
def softmax(x):
    exp_x = np.exp(x - np.max(x))  # subtract max for numerical stability
    return exp_x / exp_x.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(f"Probabilities: {probs}")  # [0.659, 0.242, 0.099]
print(f"Sum: {probs.sum()}")  # 1.0

# The highest logit gets the highest probability
# This is how classification models make decisions
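The variance/standard-deviation entry above called normalization "critical"; the standard recipe is the z-score: subtract each feature's mean and divide by its standard deviation. A minimal sketch, using the same [height, weight, age] features as earlier (the extra rows are made-up sample values):

```python
import numpy as np

# Raw features on wildly different scales: [height, weight, age]
X = np.array([
    [5.9, 160, 32],
    [5.5, 120, 25],
    [6.1, 190, 45],
])

# Standardize each column: subtract the mean, divide by the std
mean = X.mean(axis=0)
std = X.std(axis=0)
X_norm = (X - mean) / std

# Every feature now has mean ~0 and std ~1
print(X_norm.mean(axis=0).round(6))  # [0. 0. 0.] (up to float error)
print(X_norm.std(axis=0).round(6))   # [1. 1. 1.]
```

Without this step, large-scale features like weight would dominate the dot products and gradients from the earlier sections, regardless of how informative they actually are.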