Linear & Logistic Regression

Understand regression fundamentals, regularization techniques, and classification with logistic regression

Linear models are the workhorses of machine learning. They are fast, interpretable, and surprisingly powerful. Understanding them deeply gives you a foundation for understanding all other models.

Linear Regression

Linear regression models the relationship between features and a continuous target as a weighted sum:

y = w1*x1 + w2*x2 + ... + wn*xn + b

Where:

  • w (weights/coefficients) determine the importance of each feature
  • b (bias/intercept) is the baseline prediction when all features are zero
  • The goal is to find w and b that minimize the error

Ordinary Least Squares (OLS)

The most common approach minimizes the Mean Squared Error (MSE):

MSE = (1/n) * sum((y_pred - y_actual)^2)

This is called Ordinary Least Squares because it minimizes the sum of squared residuals. There are two ways to solve it:

1. Normal Equation: Closed-form solution (fast for small datasets)
2. Gradient Descent: Iterative optimization (scales to large datasets)
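As a sketch of option 1, the normal equation theta = (X^T X)^(-1) X^T y can be applied directly with NumPy. The toy data and variable names below are illustrative, not part of the lesson:

```python
import numpy as np

# Toy data: y = 3*x + 2 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.1, size=100)

# Append a column of ones so the intercept b is learned as one more weight
X_b = np.column_stack([X, np.ones(len(X))])

# Normal equation: theta = (X^T X)^{-1} X^T y
theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
w, b = theta
print(f"w = {w:.3f}, b = {b:.3f}")  # close to 3 and 2
```

Because this inverts an n_features x n_features matrix, it is exact but scales poorly as the feature count grows, which is why large problems use gradient descent instead.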

The Cost Function

The cost function (also called loss function or objective function) measures how wrong your model is. For linear regression, we use MSE. Training = finding the parameters that minimize this cost. Every ML algorithm has a cost function it optimizes.
```python
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Generate synthetic regression data
X, y = make_regression(n_samples=200, n_features=3, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Coefficients: {model.coef_}")
print(f"Intercept:    {model.intercept_:.4f}")
print(f"MSE:          {mse:.4f}")
print(f"R2 Score:     {r2:.4f}")
```
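Option 2 above, gradient descent, can also be sketched by hand. The loop below is illustrative only (scikit-learn's LinearRegression uses a least-squares solver, not gradient descent); it repeatedly steps w and b along the negative gradient of the MSE:

```python
import numpy as np

# Toy data with known weights, so convergence is easy to check
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 4.0 + rng.normal(0, 0.1, size=200)

w = np.zeros(3)
b = 0.0
lr = 0.1  # learning rate

for _ in range(500):
    err = X @ w + b - y                # residuals
    grad_w = 2 / len(y) * (X.T @ err)  # dMSE/dw
    grad_b = 2 / len(y) * err.sum()    # dMSE/db
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w = {w.round(3)}, b = {b:.3f}")  # close to [1.5, -2.0, 0.5] and 4.0
```

The same update rule scales to datasets far too large for the normal equation, because each step only needs one pass over the data (or a mini-batch of it).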

Regularization: Ridge, Lasso, and ElasticNet

Plain linear regression can overfit when you have many features or correlated features. Regularization adds a penalty term to the cost function that discourages large coefficients.

Ridge Regression (L2 Regularization)

Adds the sum of squared coefficients to the cost:

Cost = MSE + alpha * sum(w^2)

  • Shrinks coefficients toward zero but never exactly to zero
  • Good when you have many correlated features
  • All features are kept in the model

Lasso Regression (L1 Regularization)

Adds the sum of absolute coefficients to the cost:

Cost = MSE + alpha * sum(|w|)

  • Can shrink coefficients exactly to zero (feature selection!)
  • Good when you suspect only a few features matter
  • Automatically removes irrelevant features

ElasticNet (L1 + L2 Combined)

Cost = MSE + alpha * (l1_ratio * sum(|w|) + (1 - l1_ratio) * sum(w^2))

  • Combines benefits of both Ridge and Lasso
  • The l1_ratio parameter controls the mix (0 = Ridge, 1 = Lasso)

L1 vs L2 Regularization

L1 (Lasso) produces sparse models by driving some weights to exactly zero, effectively performing feature selection. L2 (Ridge) distributes the penalty evenly across all features, shrinking them toward zero but keeping them all. Use L1 when you want automatic feature selection; use L2 when all features might be relevant.
```python
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
import numpy as np

# Compare regularization methods
# (reuses X_train/X_test from the linear regression example above)
models = {
    "Linear": LinearRegression(),
    "Ridge (L2)": Ridge(alpha=1.0),
    "Lasso (L1)": Lasso(alpha=1.0),
    "ElasticNet": ElasticNet(alpha=1.0, l1_ratio=0.5),
}

print(f"{'Model':<18} {'R2':>8} {'Non-zero coefs':>16}")
print("-" * 44)
for name, model in models.items():
    model.fit(X_train, y_train)
    r2 = model.score(X_test, y_test)
    non_zero = np.sum(np.abs(model.coef_) > 1e-6)
    print(f"{name:<18} {r2:>8.4f} {non_zero:>16}")
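One way to see Lasso's feature-selection behaviour is to sweep alpha and count the surviving coefficients. This small experiment generates its own synthetic data (an illustrative setup, not the lesson's dataset), where only 5 of 20 features actually matter:

```python
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression
import numpy as np

# 20 features, but only 5 actually drive the target
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5, random_state=0)

counts = []
for alpha in [0.01, 1.0, 10.0, 100.0]:
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    kept = int(np.sum(np.abs(lasso.coef_) > 1e-6))
    counts.append(kept)
    print(f"alpha={alpha:>6}: {kept} non-zero coefficients")
```

As alpha grows, more coefficients are driven to exactly zero; with a large enough penalty, only the genuinely informative features survive.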

Logistic Regression

Despite its name, logistic regression is a classification algorithm. It uses the sigmoid function to map any real number to a probability between 0 and 1:

sigma(z) = 1 / (1 + e^(-z))

Where z = w*x + b (just like linear regression). The sigmoid "squashes" the output:

  • Large positive z -> probability near 1
  • Large negative z -> probability near 0
  • z = 0 -> probability = 0.5 (decision boundary)

The model predicts class 1 if the probability > 0.5, and class 0 otherwise.
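The squashing behaviour is easy to verify numerically; a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Large |z| saturates toward 0 or 1; z = 0 sits exactly on the boundary
for z in [-10, -1, 0, 1, 10]:
    print(f"z = {z:>3} -> sigmoid(z) = {sigmoid(z):.5f}")
```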

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load binary classification dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train logistic regression
model = LogisticRegression(max_iter=5000, random_state=42)
model.fit(X_train, y_train)

# Predict probabilities and classes
y_prob = model.predict_proba(X_test)[:5]  # First 5 probabilities
y_pred = model.predict(X_test)

print("First 5 predicted probabilities [class 0, class 1]:")
for i, probs in enumerate(y_prob):
    print(f"  Sample {i}: [{probs[0]:.4f}, {probs[1]:.4f}] -> class {y_pred[i]}")

print(f"\nAccuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"\n{classification_report(y_test, y_pred)}")
```

Multi-Class Classification

For more than 2 classes, logistic regression extends via:

  • One-vs-Rest (OvR): Train one binary classifier per class. Each classifier answers "Is it this class or not?" The class with the highest confidence wins.
  • Multinomial / Softmax: Directly model probabilities across all classes. The softmax function generalizes the sigmoid to multiple classes:

P(class_k) = e^(z_k) / sum(e^(z_j) for all j)

Softmax ensures all class probabilities sum to 1.
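A minimal NumPy sketch of the softmax formula (the max-subtraction step is a standard numerical-stability trick, not part of the formula above; it cancels out in the ratio):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability (doesn't change the result)
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # raw class scores z_k
probs = softmax(scores)
print(probs.round(4))
print(probs.sum())  # sums to 1 (up to floating-point rounding)
```

Higher scores get higher probabilities, and the ordering of the scores is always preserved.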

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Multi-class with softmax (multinomial)
model = LogisticRegression(
    multi_class="multinomial",
    solver="lbfgs",
    max_iter=200,
    random_state=42
)
model.fit(X_train, y_train)

# Predict probabilities for all 3 classes
sample_probs = model.predict_proba(X_test[:3])
class_names = load_iris().target_names

print("Predicted probabilities:")
for i, probs in enumerate(sample_probs):
    print(f"  Sample {i}: {dict(zip(class_names, probs.round(4)))}")

print(f"\nAccuracy: {model.score(X_test, y_test):.4f}")
```

When to Use Linear vs Logistic Regression

Use Linear Regression when your target is continuous (prices, temperatures, scores). Use Logistic Regression when your target is categorical (spam/not spam, species, diagnosis). A common beginner mistake is using linear regression for classification -- it can produce predictions outside [0,1] and doesn't model probabilities correctly.
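The mistake is easy to demonstrate: fit both models to the 0/1 breast-cancer labels and compare the output ranges (a quick illustrative check):

```python
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)  # y contains only 0s and 1s

# Linear regression treats the labels as continuous numbers
lin_out = LinearRegression().fit(X, y).predict(X)
print(f"Linear regression outputs:  min={lin_out.min():.3f}, max={lin_out.max():.3f}")

# Logistic regression outputs genuine probabilities in [0, 1]
log_out = LogisticRegression(max_iter=5000).fit(X, y).predict_proba(X)[:, 1]
print(f"Logistic regression probas: min={log_out.min():.3f}, max={log_out.max():.3f}")
```

The linear model's outputs spill outside [0, 1], so they cannot be read as probabilities, while the logistic model's outputs stay bounded by construction.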