
Math for Machine Learning

Linear algebra, calculus, and statistics fundamentals

~60 min

You don't need a math PhD, but you need to understand the intuition behind these concepts.

Linear Algebra — The Language of Data

In ML, data is represented as vectors and matrices.

Vector: A list of numbers. Each data point is a vector of features, e.g. [height, weight, age] = [5.9, 160, 32]

Matrix: A 2D array. Your entire dataset is a matrix. Rows = samples, Columns = features

Tensor: An N-dimensional array. Images are 3D tensors (height × width × channels). Batches of images are 4D tensors.
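The vector/matrix/tensor hierarchy can be sketched with NumPy shapes (the 224×224 image size and batch size of 32 are arbitrary placeholders):

```python
import numpy as np

# Vector: one data point with 3 features (1-D)
vector = np.array([5.9, 160, 32])

# Matrix: a dataset of 4 samples x 3 features (2-D)
matrix = np.zeros((4, 3))

# 3D tensor: one RGB image, height x width x channels
image = np.zeros((224, 224, 3))

# 4D tensor: a batch of 32 such images
batch = np.zeros((32, 224, 224, 3))

print(vector.ndim, matrix.ndim, image.ndim, batch.ndim)  # 1 2 3 4
```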

```python
import numpy as np

# Vectors
features = np.array([5.9, 160, 32])
weights = np.array([0.3, 0.01, 0.05])

# Dot product — the fundamental operation in neural networks
# Each neuron computes: dot(inputs, weights) + bias
prediction = np.dot(features, weights)  # 5.9*0.3 + 160*0.01 + 32*0.05 = 4.97

# Matrix multiplication — processing a whole batch at once
# 3 samples, 4 features each
X = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
])

# Weight matrix: 4 inputs -> 2 outputs
W = np.array([
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
    [0.7, 0.8],
])

# Matrix multiply: (3×4) @ (4×2) = (3×2)
output = X @ W  # Each row is the output for one sample
print(output.shape)  # (3, 2)
```

Why Matrix Multiplication Matters

Every layer in a neural network is fundamentally a matrix multiplication followed by an activation function: output = activation(X @ W + b). Understanding this one operation gives you insight into how all neural networks work.
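That formula can be sketched directly in NumPy. The choice of ReLU as the activation and the layer sizes here are illustrative, not prescribed by the text:

```python
import numpy as np

def relu(z):
    # ReLU activation: max(0, z), applied elementwise
    return np.maximum(0, z)

def dense_layer(X, W, b):
    # One layer: matrix multiply, add bias, apply activation
    return relu(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))   # 3 samples, 4 features
W = rng.normal(size=(4, 2))   # 4 inputs -> 2 outputs
b = np.zeros(2)

out = dense_layer(X, W, b)
print(out.shape)  # (3, 2)
```

Stacking several of these layers, each with its own W and b, is all a feed-forward neural network is.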

Calculus — How Models Learn

The key concept: gradients tell us how to adjust parameters to reduce error.

Derivative: How much an output changes when you tweak an input.

Gradient: A vector of partial derivatives — tells you the direction of steepest increase.

Gradient Descent: Move parameters in the opposite direction of the gradient to minimize loss.

New weight = Old weight - learning_rate × gradient

That's it. That's how neural networks learn.

```python
# Simple gradient descent from scratch
import numpy as np

def gradient_descent_demo():
    # Simple linear regression: y = wx + b
    # True values: w=2, b=1
    np.random.seed(42)
    X = np.random.randn(100)
    y_true = 2 * X + 1 + np.random.randn(100) * 0.1

    # Initialize parameters at zero
    w, b = 0.0, 0.0
    learning_rate = 0.1

    for epoch in range(50):
        # Forward pass: make predictions
        y_pred = w * X + b

        # Compute loss (Mean Squared Error)
        loss = np.mean((y_pred - y_true) ** 2)

        # Compute gradients (derivatives of loss w.r.t. w and b)
        dw = 2 * np.mean((y_pred - y_true) * X)
        db = 2 * np.mean(y_pred - y_true)

        # Update parameters (gradient descent step)
        w -= learning_rate * dw
        b -= learning_rate * db

        if epoch % 10 == 0:
            print(f"Epoch {epoch}: loss={loss:.4f}, w={w:.4f}, b={b:.4f}")

    print(f"\nLearned: w={w:.4f} (true=2.0), b={b:.4f} (true=1.0)")

gradient_descent_demo()
```
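The analytic gradient formulas used in the training loop (dw and db) can be sanity-checked against numerical gradients computed with finite differences; a minimal sketch, reusing the same data setup:

```python
import numpy as np

np.random.seed(42)
X = np.random.randn(100)
y_true = 2 * X + 1 + np.random.randn(100) * 0.1

def loss(w, b):
    # Mean Squared Error for y = wx + b
    return np.mean((w * X + b - y_true) ** 2)

w, b = 0.5, 0.5
eps = 1e-6

# Analytic gradients (same formulas as the training loop)
y_pred = w * X + b
dw = 2 * np.mean((y_pred - y_true) * X)
db = 2 * np.mean(y_pred - y_true)

# Numerical gradients via central differences
dw_num = (loss(w + eps, b) - loss(w - eps, b)) / (2 * eps)
db_num = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)

print(abs(dw - dw_num) < 1e-4, abs(db - db_num) < 1e-4)  # True True
```

This check is a standard debugging tool: if the analytic and numerical gradients disagree, the derivative formula is wrong.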

Statistics & Probability

Mean — Average value. Used everywhere (mean loss, mean accuracy).

Variance / Standard Deviation — How spread out data is. Critical for normalization.

Normal Distribution — Bell curve. Weight initialization, noise, many natural phenomena.

Bayes' Theorem — P(A|B) = P(B|A) × P(A) / P(B). Foundation of probabilistic ML.

Softmax — Converts raw numbers into probabilities (sums to 1). Used in classification output layers.
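The Bayes' theorem formula above can be checked with a worked example. The numbers here are hypothetical (a disease with 1% prevalence, a test with 95% sensitivity and a 10% false positive rate):

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Question: what is P(disease | positive test)?
p_disease = 0.01            # P(A): prior prevalence
p_pos_given_disease = 0.95  # P(B|A): sensitivity
p_pos_given_healthy = 0.10  # false positive rate

# P(B): total probability of a positive test
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"{p_disease_given_pos:.3f}")  # ~0.088
```

Even with a good test, a positive result only raises the probability to about 9% — the low prior dominates. This kind of reasoning is the foundation of probabilistic ML.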

```python
import numpy as np

# Softmax — converts logits to probabilities
def softmax(x):
    exp_x = np.exp(x - np.max(x))  # subtract max for numerical stability
    return exp_x / exp_x.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(f"Probabilities: {probs}")  # [0.659, 0.242, 0.099]
print(f"Sum: {probs.sum()}")       # 1.0

# The highest logit gets the highest probability
# This is how classification models make decisions
```
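Concretely, picking the predicted class is just an argmax over the softmax probabilities. A small self-contained sketch (the class names here are made-up labels):

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x))  # subtract max for numerical stability
    return exp_x / exp_x.sum()

class_names = ["cat", "dog", "bird"]  # hypothetical labels
logits = np.array([2.0, 1.0, 0.1])

probs = softmax(logits)
predicted = class_names[np.argmax(probs)]
print(predicted)  # cat
```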