Feature Importance & Model Interpretability
When a model predicts that a loan application should be denied, the applicant has a right to know why. When a model recommends a medical treatment, the clinician needs to understand what drove that recommendation. Feature importance methods answer the fundamental question: which input features matter most to the model's predictions?
This lesson covers both model-specific and model-agnostic methods for understanding feature contributions.
Built-in vs Model-Agnostic Importance
Built-in Feature Importance
Tree-based models (Random Forest, Gradient Boosting, XGBoost) provide built-in feature importance scores:
| Method | How it works | Pros | Cons |
|---|---|---|---|
| Gini / Impurity | Total reduction in impurity from splits on each feature | Fast, always available | Biased toward high-cardinality and continuous features |
| Split count | Number of times a feature is used in splits | Simple to understand | Biased toward features used in early splits |
| Gain | Total gain (improvement) from splits on each feature | Better than count | Still model-specific |
The Cardinality Bias Problem
Built-in importance is biased: a feature with 100 unique values (like a random ID) will appear more important than a binary feature, even if the binary feature is the true predictor. This is because high-cardinality features offer more splitting opportunities by chance.
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
import warnings
warnings.filterwarnings("ignore")

# --- Create a dataset with known important features ---
np.random.seed(42)
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=3,
    n_redundant=2, n_classes=2, random_state=42
)
feature_names = [f"feature_{i}" for i in range(10)]

# Add a HIGH-CARDINALITY noise feature (random ID)
random_id = np.random.randint(0, 500, size=(1000, 1))
X = np.hstack([X, random_id])
feature_names.append("random_id")

# --- Train a Random Forest ---
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# --- Built-in (Gini) importance ---
gini_imp = rf.feature_importances_
print("=== Built-in (Gini) Importance ===")
for name, imp in sorted(zip(feature_names, gini_imp),
                        key=lambda x: x[1], reverse=True):
    print(f"  {name:>15}: {imp:.4f}")

# --- Permutation importance (model-agnostic) ---
# Computed on the training data here for brevity; prefer a held-out set.
perm_result = permutation_importance(
    rf, X, y, n_repeats=10, random_state=42, n_jobs=-1
)

print("\n=== Permutation Importance ===")
for name, imp, std in sorted(
    zip(feature_names, perm_result.importances_mean,
        perm_result.importances_std),
    key=lambda x: x[1], reverse=True
):
    print(f"  {name:>15}: {imp:.4f} +/- {std:.4f}")

print("\nNotice: Gini importance inflates 'random_id' due to its")
print("high cardinality. Permutation importance correctly shows")
print("it has near-zero importance.")
```

Never Trust Gini Importance Alone

Before drawing conclusions from built-in importance scores, cross-check them with a model-agnostic method such as permutation importance.
Permutation Importance
Permutation importance measures how much model performance degrades when a single feature's values are randomly shuffled. If shuffling a feature causes a large drop in accuracy, that feature is important.
Algorithm
1. Train the model and compute a baseline score on validation data.
2. For each feature:
   - Shuffle that feature's values (breaking its relationship with the target).
   - Re-score the model on the shuffled data.
   - Importance = baseline score - shuffled score.
3. Repeat the shuffle multiple times and average for stability.

Advantages

- Model-agnostic: works with any fitted model and any scoring metric.
- Measured on held-out data, so it reflects generalization rather than training-set fit.
- Not biased toward high-cardinality or continuous features.

Limitations

- Computationally expensive: each feature requires multiple re-scoring passes.
- Correlated features can share importance, understating each one individually.
- Shuffling produces unrealistic samples when features are correlated, which can distort scores.
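The algorithm above translates almost directly into code. This is a minimal sketch (the `perm_importance` helper is illustrative, not a library function) using accuracy as the score:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def perm_importance(model, X_val, y_val, n_repeats=10, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: baseline score on validation data
    baseline = accuracy_score(y_val, model.predict(X_val))
    importances = np.zeros(X_val.shape[1])
    for j in range(X_val.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_shuf = X_val.copy()
            # Step 2: shuffle one feature, breaking its link to the target
            X_shuf[:, j] = rng.permutation(X_shuf[:, j])
            drops.append(baseline - accuracy_score(y_val, model.predict(X_shuf)))
        # Step 3: average over repeats for stability
        importances[j] = np.mean(drops)
    return importances

X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
imp = perm_importance(model, X_val, y_val)
print(imp.round(3))
```

In practice, `sklearn.inspection.permutation_importance` does the same thing with parallelism and support for arbitrary scorers.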
Partial Dependence Plots (PDP)
A Partial Dependence Plot shows the marginal effect of one or two features on the model's predicted outcome, averaging over all other features. It answers: "What is the average prediction if I set feature X to value v?"
How PDP Works
For each value v of feature X_s:

1. Set X_s = v for all samples in the dataset.
2. Compute the model prediction for each sample.
3. Average all predictions.

This gives the average prediction as a function of X_s, marginalizing out all other features.
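The three steps above can be sketched in a few lines. This is an illustrative implementation (the `pdp_1d` helper is hypothetical, not a library API) against a toy regressor:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

def pdp_1d(model, X, feature, grid):
    avg_preds = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v            # 1. set X_s = v for every sample
        preds = model.predict(X_mod)     # 2. predict for each sample
        avg_preds.append(preds.mean())   # 3. average the predictions
    return np.array(avg_preds)

X, y = make_regression(n_samples=300, n_features=4, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)
grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 10)
curve = pdp_1d(model, X, feature=0, grid=grid)
print(curve.round(2))
```

`sklearn.inspection.partial_dependence` implements the same idea, with a percentile-based grid and support for pairs of features.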
Individual Conditional Expectation (ICE)
ICE plots are the disaggregated version of PDP: instead of averaging, they show one line per sample. This reveals heterogeneous effects that PDP averages away.
| PDP | ICE |
|---|---|
| One averaged line | One line per sample |
| Shows average effect | Shows individual effects |
| Can hide interactions | Reveals interactions |
| Clean and simple | Can be cluttered |
```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence
import warnings
warnings.filterwarnings("ignore")

# --- Load California Housing dataset ---
housing = fetch_california_housing()
X, y = housing.data, housing.target
feature_names = housing.feature_names

# --- Train a Gradient Boosting model ---
gbr = GradientBoostingRegressor(
    n_estimators=200, max_depth=4, learning_rate=0.1, random_state=42
)
gbr.fit(X, y)

# Candidate features for PDP plots (e.g. with PartialDependenceDisplay)
features_to_plot = [0, 1, 6, 7]  # MedInc, HouseAge, Latitude, Longitude

print("=== Partial Dependence Values (MedInc) ===")

# Compute PDP for MedInc (feature 0)
pdp_result = partial_dependence(
    gbr, X, features=[0], kind="average", grid_resolution=20
)

grid_values = pdp_result["grid_values"][0]
avg_predictions = pdp_result["average"][0]

for val, pred in zip(grid_values[::4], avg_predictions[::4]):
    print(f"  MedInc = {val:.2f} -> Avg predicted price: ${pred * 100000:.0f}")

# --- Compute ICE for MedInc on the first 50 samples ---
ice_result = partial_dependence(
    gbr, X[:50], features=[0], kind="individual", grid_resolution=20
)
ice_grid = ice_result["grid_values"][0]  # grid from the 50-sample subset

print("\n=== ICE: Individual predictions vary ===")
print(f"At MedInc={ice_grid[0]:.1f}: "
      f"min=${ice_result['individual'][0][:, 0].min() * 100000:.0f}, "
      f"max=${ice_result['individual'][0][:, 0].max() * 100000:.0f}")
print(f"At MedInc={ice_grid[-1]:.1f}: "
      f"min=${ice_result['individual'][0][:, -1].min() * 100000:.0f}, "
      f"max=${ice_result['individual'][0][:, -1].max() * 100000:.0f}")
print("\nICE reveals that the effect of income on price varies")
print("substantially across neighborhoods (heterogeneous effect).")
```

Feature Interaction
Features often interact: the effect of one feature depends on the value of another. For example, the effect of house age on price might differ between urban and rural areas.
Detecting Interactions
H-statistic (Friedman's H): Measures the proportion of variance in partial dependence explained by interactions. An H of 0 means no interaction; an H of 1 means the entire effect is due to interaction.
2D Partial Dependence: Plot PDP over two features simultaneously to visualize their joint effect. If the surface is additive (a flat plane), there is no interaction. Curves and valleys indicate interactions.
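The additivity check can be done numerically as well as visually. The sketch below uses an assumed toy setup: the target is a pure interaction (y = x0 * x1), and we measure how far the 2D partial dependence surface departs from the best additive (row-mean + column-mean) approximation; a nonzero residual signals interaction:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = X[:, 0] * X[:, 1]  # target is a pure interaction between x0 and x1

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# 2D partial dependence over the feature pair (x0, x1)
pdp2d = partial_dependence(model, X, features=[(0, 1)], kind="average",
                           grid_resolution=10)
surface = pdp2d["average"][0]  # grid of average predictions, shape (10, 10)

# An additive (no-interaction) surface equals row mean + column mean - grand mean
additive = (surface.mean(axis=1, keepdims=True)
            + surface.mean(axis=0, keepdims=True)
            - surface.mean())
residual = surface - additive  # the non-additive (interaction) part

print("surface shape:", surface.shape)
print("std of interaction residual:", round(float(residual.std()), 3))
```

For a truly additive pair the residual would be near zero; here it is clearly nonzero because the target is entirely interaction. The H-statistic formalizes this same decomposition as a normalized ratio.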
Practical Recommendations

1. Start with permutation importance to identify top features.
2. Use PDP to understand the direction and shape of each feature's effect.
3. Use ICE plots to check for heterogeneous effects.
4. Use 2D PDP or the H-statistic to identify important interactions.
5. Validate with SHAP (next lesson) for precise attributions.
1. Start with permutation importance to identify top features 2. Use PDP to understand the direction and shape of each feature's effect 3. Use ICE plots to check for heterogeneous effects 4. Use 2D PDP or H-statistic to identify important interactions 5. Validate with SHAP (next lesson) for precise attributions