Feature Importance & Model Interpretability
When a model predicts that a loan application should be denied, the applicant has a right to know why. When a model recommends a medical treatment, the clinician needs to understand what drove that recommendation. Feature importance methods answer the fundamental question: which input features matter most to the model's predictions?
This lesson covers both model-specific and model-agnostic methods for understanding feature contributions.
Built-in vs Model-Agnostic Importance
Built-in Feature Importance
Tree-based models (Random Forest, Gradient Boosting, XGBoost) provide built-in feature importance scores:
| Method | How it works | Pros | Cons |
|---|---|---|---|
| Gini / Impurity | Total reduction in impurity from splits on each feature | Fast, always available | Biased toward high-cardinality and continuous features |
| Split count | Number of times a feature is used in splits | Simple to understand | Biased toward features used in early splits |
| Gain | Total gain (improvement) from splits on each feature | Better than count | Still model-specific |
The Cardinality Bias Problem
Built-in importance is biased: a feature with 100 unique values (like a random ID) will appear more important than a binary feature, even if the binary feature is the true predictor. This is because high-cardinality features offer more splitting opportunities by chance.
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
import warnings
warnings.filterwarnings("ignore")

# --- Create a dataset with known important features ---
np.random.seed(42)
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=3,
    n_redundant=2, n_classes=2, random_state=42
)
feature_names = [f"feature_{i}" for i in range(10)]

# Add a HIGH-CARDINALITY noise feature (random ID)
random_id = np.random.randint(0, 500, size=(1000, 1))
X = np.hstack([X, random_id])
feature_names.append("random_id")

# --- Train a Random Forest ---
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# --- Built-in (Gini) importance ---
gini_imp = rf.feature_importances_
print("=== Built-in (Gini) Importance ===")
for name, imp in sorted(zip(feature_names, gini_imp),
                        key=lambda x: x[1], reverse=True):
    print(f"  {name:>15}: {imp:.4f}")

# --- Permutation importance (model-agnostic) ---
# Computed on the training data here for brevity; prefer a held-out set.
perm_result = permutation_importance(
    rf, X, y, n_repeats=10, random_state=42, n_jobs=-1
)

print("\n=== Permutation Importance ===")
for name, imp, std in sorted(
    zip(feature_names, perm_result.importances_mean,
        perm_result.importances_std),
    key=lambda x: x[1], reverse=True
):
    print(f"  {name:>15}: {imp:.4f} +/- {std:.4f}")

print("\nNotice: Gini importance inflates 'random_id' due to its")
print("high cardinality. Permutation importance correctly shows")
print("it has near-zero importance.")
```

Never Trust Gini Importance Alone

Before drawing conclusions from built-in importance scores, cross-check them with a model-agnostic method such as permutation importance.
Permutation Importance
Permutation importance measures how much model performance degrades when a single feature's values are randomly shuffled. If shuffling a feature causes a large drop in accuracy, that feature is important.
Algorithm
1. Train the model and compute a baseline score on validation data.
2. For each feature:
   - Shuffle that feature's values (breaking its relationship with the target).
   - Re-score the model on the shuffled data.
   - Importance = baseline score - shuffled score.
3. Repeat the shuffle multiple times and average for stability.

Advantages

- Model-agnostic: works with any fitted model and any scoring metric.
- Measured on held-out data, so it reflects generalization rather than training-set fit.
- Not biased toward high-cardinality or continuous features.

Limitations

- Computationally expensive: each feature requires multiple re-scoring passes.
- Correlated features can share importance, understating each one individually.
- Shuffling produces unrealistic samples when features are correlated, which can distort scores.
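The algorithm above translates almost directly into code. This is a minimal sketch (the `perm_importance` helper is illustrative, not a library function) using accuracy as the score:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def perm_importance(model, X_val, y_val, n_repeats=10, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: baseline score on validation data
    baseline = accuracy_score(y_val, model.predict(X_val))
    importances = np.zeros(X_val.shape[1])
    for j in range(X_val.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_shuf = X_val.copy()
            # Step 2: shuffle one feature, breaking its link to the target
            X_shuf[:, j] = rng.permutation(X_shuf[:, j])
            drops.append(baseline - accuracy_score(y_val, model.predict(X_shuf)))
        # Step 3: average over repeats for stability
        importances[j] = np.mean(drops)
    return importances

X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
imp = perm_importance(model, X_val, y_val)
print(imp.round(3))
```

In practice, `sklearn.inspection.permutation_importance` does the same thing with parallelism and support for arbitrary scorers.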
Partial Dependence Plots (PDP)
A Partial Dependence Plot shows the marginal effect of one or two features on the model's predicted outcome, averaging over all other features. It answers: "What is the average prediction if I set feature X to value v?"
How PDP Works
For each value v of feature X_s:

1. Set X_s = v for all samples in the dataset.
2. Compute the model prediction for each sample.
3. Average all predictions.

This gives the average prediction as a function of X_s, marginalizing out all other features.
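The three steps above can be sketched in a few lines. This is an illustrative implementation (the `pdp_1d` helper is hypothetical, not a library API) against a toy regressor:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

def pdp_1d(model, X, feature, grid):
    avg_preds = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v            # 1. set X_s = v for every sample
        preds = model.predict(X_mod)     # 2. predict for each sample
        avg_preds.append(preds.mean())   # 3. average the predictions
    return np.array(avg_preds)

X, y = make_regression(n_samples=300, n_features=4, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)
grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 10)
curve = pdp_1d(model, X, feature=0, grid=grid)
print(curve.round(2))
```

`sklearn.inspection.partial_dependence` implements the same idea, with a percentile-based grid and support for pairs of features.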
Individual Conditional Expectation (ICE)
ICE plots are the disaggregated version of PDP: instead of averaging, they show one line per sample. This reveals heterogeneous effects that PDP averages away.
| PDP | ICE |
|---|---|
| One averaged line | One line per sample |
| Shows average effect | Shows individual effects |
| Can hide interactions | Reveals interactions |
| Clean and simple | Can be cluttered |
```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence
import warnings
warnings.filterwarnings("ignore")

# --- Load California Housing dataset ---
housing = fetch_california_housing()
X, y = housing.data, housing.target
feature_names = housing.feature_names

# --- Train a Gradient Boosting model ---
gbr = GradientBoostingRegressor(
    n_estimators=200, max_depth=4, learning_rate=0.1, random_state=42
)
gbr.fit(X, y)

# Candidate features for PDP plots (e.g. with PartialDependenceDisplay)
features_to_plot = [0, 1, 6, 7]  # MedInc, HouseAge, Latitude, Longitude

print("=== Partial Dependence Values (MedInc) ===")

# Compute PDP for MedInc (feature 0)
pdp_result = partial_dependence(
    gbr, X, features=[0], kind="average", grid_resolution=20
)

grid_values = pdp_result["grid_values"][0]
avg_predictions = pdp_result["average"][0]

for val, pred in zip(grid_values[::4], avg_predictions[::4]):
    print(f"  MedInc = {val:.2f} -> Avg predicted price: ${pred * 100000:.0f}")

# --- Compute ICE for MedInc on the first 50 samples ---
ice_result = partial_dependence(
    gbr, X[:50], features=[0], kind="individual", grid_resolution=20
)
ice_grid = ice_result["grid_values"][0]  # grid from the 50-sample subset

print("\n=== ICE: Individual predictions vary ===")
print(f"At MedInc={ice_grid[0]:.1f}: "
      f"min=${ice_result['individual'][0][:, 0].min() * 100000:.0f}, "
      f"max=${ice_result['individual'][0][:, 0].max() * 100000:.0f}")
print(f"At MedInc={ice_grid[-1]:.1f}: "
      f"min=${ice_result['individual'][0][:, -1].min() * 100000:.0f}, "
      f"max=${ice_result['individual'][0][:, -1].max() * 100000:.0f}")
print("\nICE reveals that the effect of income on price varies")
print("substantially across neighborhoods (heterogeneous effect).")
```

Feature Interaction
Features often interact: the effect of one feature depends on the value of another. For example, the effect of house age on price might differ between urban and rural areas.
Detecting Interactions
H-statistic (Friedman's H): Measures the proportion of variance in partial dependence explained by interactions. An H of 0 means no interaction; an H of 1 means the entire effect is due to interaction.
2D Partial Dependence: Plot PDP over two features simultaneously to visualize their joint effect. If the surface is additive (a flat plane), there is no interaction. Curves and valleys indicate interactions.
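The additivity check can be done numerically as well as visually. The sketch below uses an assumed toy setup: the target is a pure interaction (y = x0 * x1), and we measure how far the 2D partial dependence surface departs from the best additive (row-mean + column-mean) approximation; a nonzero residual signals interaction:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = X[:, 0] * X[:, 1]  # target is a pure interaction between x0 and x1

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# 2D partial dependence over the feature pair (x0, x1)
pdp2d = partial_dependence(model, X, features=[(0, 1)], kind="average",
                           grid_resolution=10)
surface = pdp2d["average"][0]  # grid of average predictions, shape (10, 10)

# An additive (no-interaction) surface equals row mean + column mean - grand mean
additive = (surface.mean(axis=1, keepdims=True)
            + surface.mean(axis=0, keepdims=True)
            - surface.mean())
residual = surface - additive  # the non-additive (interaction) part

print("surface shape:", surface.shape)
print("std of interaction residual:", round(float(residual.std()), 3))
```

For a truly additive pair the residual would be near zero; here it is clearly nonzero because the target is entirely interaction. The H-statistic formalizes this same decomposition as a normalized ratio.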
Practical Recommendations

1. Start with permutation importance to identify top features.
2. Use PDP to understand the direction and shape of each feature's effect.
3. Use ICE plots to check for heterogeneous effects.
4. Use 2D PDP or the H-statistic to identify important interactions.
5. Validate with SHAP (next lesson) for precise attributions.
1. Start with permutation importance to identify top features 2. Use PDP to understand the direction and shape of each feature's effect 3. Use ICE plots to check for heterogeneous effects 4. Use 2D PDP or H-statistic to identify important interactions 5. Validate with SHAP (next lesson) for precise attributions