
NumPy: The Foundation of ML in Python

Master NumPy arrays, reshaping, indexing, broadcasting, and vectorized operations

NumPy (Numerical Python) is the backbone of nearly every ML library in Python. TensorFlow, PyTorch, scikit-learn — they all rely on NumPy arrays under the hood. If you want to do ML, you must be fluent in NumPy.

Why NumPy?

  • Speed: Operations run in optimized C/Fortran, not slow Python loops
  • Memory: Stores data in contiguous memory blocks (cache-friendly)
  • Ecosystem: The universal data exchange format for ML libraries
  • Broadcasting: Powerful rules for combining arrays of different shapes
Creating Arrays

NumPy arrays (ndarray) are the fundamental data structure. Here are the most common ways to create them:

python
import numpy as np

# From Python lists
a = np.array([1, 2, 3, 4, 5])
print(a)          # [1 2 3 4 5]
print(a.dtype)    # int64
print(a.shape)    # (5,)

# 2D array (matrix)
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
print(matrix.shape)  # (2, 3) — 2 rows, 3 columns

# Common creation functions
zeros = np.zeros((3, 4))         # 3x4 matrix of zeros
ones = np.ones((2, 5))           # 2x5 matrix of ones
full = np.full((3, 3), 7)        # 3x3 matrix filled with 7
eye = np.eye(4)                  # 4x4 identity matrix
rand = np.random.randn(3, 4)     # 3x4 matrix of random normal values
arange = np.arange(0, 10, 2)     # [0 2 4 6 8]
linspace = np.linspace(0, 1, 5)  # [0.  0.25 0.5 0.75 1. ]

Reshaping Arrays

Reshaping is one of the most critical skills in ML. You'll constantly reshape data to match what models expect.

python
import numpy as np

a = np.arange(12)
print(a)         # [ 0  1  2  3  4  5  6  7  8  9 10 11]
print(a.shape)   # (12,)

# Reshape to 3 rows x 4 columns
b = a.reshape(3, 4)
print(b)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

# Using -1 lets NumPy infer the dimension
c = a.reshape(2, -1)   # 2 rows, NumPy figures out 6 columns
print(c.shape)         # (2, 6)

d = a.reshape(-1, 3)   # NumPy figures out 4 rows, 3 columns
print(d.shape)         # (4, 3)

# Flatten back to 1D
flat = b.flatten()     # Always returns a copy
raveled = b.ravel()    # Returns a view when possible (more memory efficient)

# Add a dimension (critical for ML)
x = np.array([1, 2, 3])          # shape: (3,)
row_vec = x[np.newaxis, :]       # shape: (1, 3) — row vector
col_vec = x[:, np.newaxis]       # shape: (3, 1) — column vector
# Equivalent: x.reshape(1, -1) and x.reshape(-1, 1)

Image Tensors

In computer vision, images are represented as NumPy arrays. Understanding their shape is essential.

python
import numpy as np

# A single RGB image: (height, width, channels)
image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
print(image.shape)   # (224, 224, 3)
print(image.dtype)   # uint8 (values 0–255)

# A batch of images: (batch_size, height, width, channels)
batch = np.random.randint(0, 256, size=(32, 224, 224, 3), dtype=np.uint8)
print(batch.shape)   # (32, 224, 224, 3)

# Access the 5th image in the batch
fifth_image = batch[4]           # shape: (224, 224, 3)

# Get the red channel of the first image
red_channel = batch[0, :, :, 0]  # shape: (224, 224)

# Normalize pixel values to [0, 1] for neural networks
normalized = batch.astype(np.float32) / 255.0
print(normalized.dtype)  # float32
print(normalized.max())  # 1.0
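
One more shape detail worth knowing: some frameworks (PyTorch, for example) expect channels-first images, (batch, channels, height, width), rather than the channels-last layout shown above. A minimal sketch of the conversion, reusing the same batch shape:

```python
import numpy as np

# Channels-last batch, as above: (batch, height, width, channels)
batch = np.random.randint(0, 256, size=(32, 224, 224, 3), dtype=np.uint8)

# NHWC -> NCHW: reorder axes so channels (axis 3) come right after batch (axis 0)
channels_first = np.transpose(batch, (0, 3, 1, 2))
print(channels_first.shape)  # (32, 3, 224, 224)

# np.moveaxis expresses the same reordering by naming source and destination axes
also_channels_first = np.moveaxis(batch, -1, 1)
print(also_channels_first.shape)  # (32, 3, 224, 224)
```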

Indexing and Slicing

NumPy provides powerful ways to access and modify array elements.

python
import numpy as np

a = np.array([[10, 20, 30, 40],
              [50, 60, 70, 80],
              [90, 100, 110, 120]])

# Basic indexing (row, column)
print(a[0, 1])     # 20 — first row, second column
print(a[2, -1])    # 120 — last row, last column

# Slicing: a[row_start:row_end, col_start:col_end]
print(a[0:2, 1:3])
# [[20 30]
#  [60 70]]

# All rows, specific columns
print(a[:, 0])     # [10 50 90] — first column
print(a[:, -1])    # [ 40  80 120] — last column

# Boolean indexing (filtering)
mask = a > 50
print(mask)
# [[False False False False]
#  [False  True  True  True]
#  [ True  True  True  True]]
print(a[mask])     # [ 60  70  80  90 100 110 120]

# Fancy indexing (index with arrays)
rows = np.array([0, 2])
cols = np.array([1, 3])
print(a[rows, cols])  # [ 20 120] — elements at (0,1) and (2,3)

# Filtering values with a boolean condition
scores = np.array([85, 42, 91, 67, 55, 99])
passing = scores[scores >= 60]
print(passing)  # [85 91 67 99]

Broadcasting

Broadcasting is NumPy's way of performing arithmetic on arrays of different shapes. Instead of copying data, NumPy virtually "stretches" smaller arrays to match larger ones.

The rules:

  1. If arrays have different numbers of dimensions, the smaller one is padded with 1s on the left.
  2. Arrays with size 1 in a dimension act as if they had the size of the largest array in that dimension.
  3. If sizes disagree and neither is 1, you get an error.

Example: adding a (3, 4) matrix and a (4,) vector works because the vector is broadcast across all rows. This is how we can subtract the mean from every row, normalize every column, or add a bias to every sample — without writing a single loop.

python
import numpy as np

# Scalar broadcast: operates on every element
a = np.array([[1, 2, 3],
              [4, 5, 6]])
print(a * 10)
# [[10 20 30]
#  [40 50 60]]

# Column broadcast: each row's mean subtracted from that row
row_means = a.mean(axis=1, keepdims=True)  # shape (2, 1)
centered = a - row_means

# Common ML pattern: normalize features (columns)
data = np.random.randn(100, 5)  # 100 samples, 5 features
mean = data.mean(axis=0)        # shape (5,) — mean of each feature
std = data.std(axis=0)          # shape (5,) — std of each feature
normalized = (data - mean) / std  # broadcasting! shape stays (100, 5)

# Outer product via broadcasting
x = np.array([1, 2, 3])[:, np.newaxis]     # shape (3, 1)
y = np.array([10, 20, 30])[np.newaxis, :]  # shape (1, 3)
outer = x * y  # shape (3, 3)
print(outer)
# [[10 20 30]
#  [20 40 60]
#  [30 60 90]]
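
The three rules can also be checked without building any data: `np.broadcast_shapes` (available in NumPy 1.20+) applies exactly these rules to shape tuples. A small sketch:

```python
import numpy as np

# Rule 1: (4,) is padded on the left to (1, 4), then stretched to (3, 4)
print(np.broadcast_shapes((3, 4), (4,)))    # (3, 4)

# Rule 2: size-1 dimensions stretch to match the other array
print(np.broadcast_shapes((3, 1), (1, 4)))  # (3, 4)

# Rule 3: mismatched sizes where neither is 1 raise an error
try:
    np.broadcast_shapes((3, 4), (3,))  # trailing dims 4 vs 3, incompatible
except ValueError as err:
    print("incompatible shapes:", err)
```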

Vectorization: Why NumPy is Fast

The #1 rule of NumPy: avoid Python loops. Use vectorized operations instead. The difference is dramatic.

python
import numpy as np
import time

size = 1_000_000
a = np.random.randn(size)
b = np.random.randn(size)

# --- SLOW: Python loop ---
start = time.time()
result_loop = []
for i in range(size):
    result_loop.append(a[i] + b[i])
loop_time = time.time() - start
print(f"Python loop: {loop_time:.4f} seconds")

# --- FAST: Vectorized NumPy ---
start = time.time()
result_vec = a + b
vec_time = time.time() - start
print(f"NumPy vectorized: {vec_time:.6f} seconds")

print(f"Speedup: {loop_time / vec_time:.0f}x faster!")
# Typical output:
# Python loop: 0.2500 seconds
# NumPy vectorized: 0.001200 seconds
# Speedup: 208x faster!

Vectorization Mindset

Whenever you find yourself writing a for-loop over array elements, stop and ask: "Can I express this as a NumPy operation?" Common replacements:

  • `for x in arr: total += x` -> `arr.sum()`
  • `for i: result[i] = a[i] * b[i]` -> `a * b`
  • `for row in matrix: row / row.sum()` -> `matrix / matrix.sum(axis=1, keepdims=True)`

This mindset is essential because the same pattern applies to TensorFlow and PyTorch.
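
Each replacement above can be checked side by side; a quick sketch with small illustrative arrays:

```python
import numpy as np

arr = np.array([1.0, 2.0, 3.0, 4.0])
a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])
matrix = np.array([[1.0, 3.0],
                   [2.0, 2.0]])

# Loop accumulation vs arr.sum()
total = 0.0
for x in arr:
    total += x
print(total == arr.sum())  # True

# Elementwise loop vs a * b
result = np.empty_like(a)
for i in range(len(a)):
    result[i] = a[i] * b[i]
print(np.array_equal(result, a * b))  # True

# Row-wise normalization loop vs one broadcast expression
normalized = matrix / matrix.sum(axis=1, keepdims=True)
print(normalized.sum(axis=1))  # [1. 1.] (every row now sums to 1)
```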

Essential Operations for ML

Here are the NumPy operations you'll reach for constantly in ML work:

python
import numpy as np

data = np.random.randn(5, 3)

# Aggregation along axes
print(data.sum(axis=0))    # sum each column — shape (3,)
print(data.sum(axis=1))    # sum each row — shape (5,)
print(data.mean(axis=0))   # mean of each feature
print(data.std(axis=0))    # std of each feature

# Matrix operations
A = np.random.randn(3, 4)
B = np.random.randn(4, 2)
C = A @ B                  # matrix multiply — shape (3, 2)
# Equivalent: np.dot(A, B) or np.matmul(A, B)

# Transpose
print(A.T.shape)           # (4, 3)

# Stacking arrays
x1 = np.array([1, 2, 3])
x2 = np.array([4, 5, 6])
vertical = np.vstack([x1, x2])    # shape (2, 3)
horizontal = np.hstack([x1, x2])  # shape (6,)

# Argmax / Argmin (critical for classification)
predictions = np.array([0.1, 0.7, 0.2])
predicted_class = np.argmax(predictions)  # 1
print(predicted_class)

# Where (conditional selection)
scores = np.array([85, 42, 91, 67])
result = np.where(scores >= 60, "pass", "fail")
print(result)  # ['pass' 'fail' 'pass' 'pass']