Feature Importance

Permutation importance, partial dependence plots (PDP), individual conditional expectation (ICE), feature interaction, and built-in vs model-agnostic importance

~40 min

Feature Importance & Model Interpretability

When a model predicts that a loan application should be denied, the applicant has a right to know why. When a model recommends a medical treatment, the clinician needs to understand what drove that recommendation. Feature importance methods answer the fundamental question: which input features matter most to the model's predictions?

This lesson covers both model-specific and model-agnostic methods for understanding feature contributions.

Built-in vs Model-Agnostic Importance

Built-in feature importance (e.g., Gini importance in random forests) is computed from the model's internal structure and is fast but can be biased toward high-cardinality features. Model-agnostic methods (permutation importance, PDP, SHAP) work with any model and are more reliable, but are computationally expensive.

Built-in Feature Importance

Tree-based models (Random Forest, Gradient Boosting, XGBoost) provide built-in feature importance scores:

| Method | How it works | Pros | Cons |
| --- | --- | --- | --- |
| Gini / Impurity | Total reduction in impurity from splits on each feature | Fast, always available | Biased toward high-cardinality and continuous features |
| Split count | Number of times a feature is used in splits | Simple to understand | Biased toward features used in early splits |
| Gain | Total gain (improvement) from splits on each feature | Better than count | Still model-specific |

The Cardinality Bias Problem

Built-in importance is biased: a feature with 100 unique values (like a random ID) will appear more important than a binary feature, even if the binary feature is the true predictor. This is because high-cardinality features offer more splitting opportunities by chance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
import warnings
warnings.filterwarnings("ignore")

# --- Create dataset with known important features ---
np.random.seed(42)
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=3,
    n_redundant=2, n_classes=2, random_state=42
)
feature_names = [f"feature_{i}" for i in range(10)]

# Add a HIGH-CARDINALITY noise feature (random ID)
random_id = np.random.randint(0, 500, size=(1000, 1))
X = np.hstack([X, random_id])
feature_names.append("random_id")

# --- Train a Random Forest ---
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# --- Built-in (Gini) importance ---
gini_imp = rf.feature_importances_
print("=== Built-in (Gini) Importance ===")
for name, imp in sorted(zip(feature_names, gini_imp),
                        key=lambda x: x[1], reverse=True):
    print(f"  {name:>15}: {imp:.4f}")

# --- Permutation importance (model-agnostic) ---
perm_result = permutation_importance(
    rf, X, y, n_repeats=10, random_state=42, n_jobs=-1
)

print("\n=== Permutation Importance ===")
for name, imp, std in sorted(
    zip(feature_names, perm_result.importances_mean,
        perm_result.importances_std),
    key=lambda x: x[1], reverse=True
):
    print(f"  {name:>15}: {imp:.4f} +/- {std:.4f}")

print("\nNotice: Gini importance inflates 'random_id' due to its")
print("high cardinality. Permutation importance correctly shows")
print("it has near-zero importance.")
```

Never Trust Gini Importance Alone

Always validate built-in importance with permutation importance. In practice, built-in importance can rank noise features higher than true predictors when they have more unique values.

Permutation Importance

Permutation importance measures how much model performance degrades when a single feature's values are randomly shuffled. If shuffling a feature causes a large drop in accuracy, that feature is important.

Algorithm

1. Train the model and compute the baseline score on validation data.
2. For each feature:
   - Shuffle that feature's values (breaking its relationship with the target)
   - Re-score the model on the shuffled data
   - Importance = baseline score − shuffled score
3. Repeat multiple times and average for stability.
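The loop above is short enough to write by hand. This is an illustrative sketch on synthetic data (variable names are ours; sklearn's `permutation_importance`, shown earlier, does the same with more options). With `shuffle=False`, `make_classification` places the informative features in the first columns, so the expected ranking is known:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the first 3 columns are the informative ones
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
baseline = model.score(X_val, y_val)  # step 1: baseline validation accuracy

rng = np.random.default_rng(0)
importances = []
for j in range(X_val.shape[1]):       # step 2: one feature at a time
    drops = []
    for _ in range(10):               # step 3: repeat shuffles for stability
        X_perm = X_val.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])  # break feature-target link
        drops.append(baseline - model.score(X_perm, y_val))
    importances.append(float(np.mean(drops)))

for j, imp in enumerate(importances):
    print(f"feature_{j}: {imp:+.4f}")
```

The noise features (columns 3 to 5) should come out near zero, while at least one of the informative columns shows a clear accuracy drop.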

Advantages

  • Works with any model (model-agnostic)
  • Accounts for all interactions the feature participates in
  • Uses the same metric you care about (accuracy, AUC, etc.)

Limitations

  • Slow for large datasets (requires re-evaluation per feature)
  • Correlated features can mask each other's importance
  • Measures importance on the validation set, not the training set

Partial Dependence Plots (PDP)

A Partial Dependence Plot shows the marginal effect of one or two features on the model's predicted outcome, averaging over all other features. It answers: "What is the average prediction if I set feature X to value v?"

How PDP Works

For each value v of feature X_s:

1. Set X_s = v for all samples in the dataset
2. Compute the model prediction for each sample
3. Average all predictions

This gives the average prediction as a function of X_s, marginalizing out all other features.
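The three steps above fit in a small helper function. This is a sketch on synthetic regression data (the function name `pdp_1d` and the grid choice are ours; sklearn's `partial_dependence`, used below, is the production route):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

def pdp_1d(model, X, feature, grid):
    """Average prediction with X[:, feature] forced to each grid value."""
    averages = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v         # step 1: set X_s = v for every sample
        preds = model.predict(X_mod)  # step 2: predict for each modified sample
        averages.append(preds.mean()) # step 3: average the predictions
    return np.array(averages)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 10)
curve = pdp_1d(model, X, feature=0, grid=grid)
print(curve.round(1))
```

Because the underlying data here is (noisy) linear, the resulting curve is close to monotone in the feature value.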

Individual Conditional Expectation (ICE)

ICE plots are the disaggregated version of PDP: instead of averaging, they show one line per sample. This reveals heterogeneous effects that PDP averages away.

| PDP | ICE |
| --- | --- |
| One averaged line | One line per sample |
| Shows average effect | Shows individual effects |
| Can hide interactions | Reveals interactions |
| Clean and simple | Can be cluttered |

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence
import warnings
warnings.filterwarnings("ignore")

# --- Load California Housing dataset ---
housing = fetch_california_housing()
X, y = housing.data, housing.target
feature_names = housing.feature_names

# --- Train a Gradient Boosting model ---
gbr = GradientBoostingRegressor(
    n_estimators=200, max_depth=4, learning_rate=0.1, random_state=42
)
gbr.fit(X, y)

# --- Partial Dependence for MedInc (feature 0) ---
print("=== Partial Dependence Values (MedInc) ===")
pdp_result = partial_dependence(
    gbr, X, features=[0], kind="average", grid_resolution=20
)

grid_values = pdp_result["grid_values"][0]
avg_predictions = pdp_result["average"][0]

for val, pred in zip(grid_values[::4], avg_predictions[::4]):
    print(f"  MedInc = {val:.2f}  ->  Avg predicted price: ${pred * 100000:.0f}")

# --- Compute ICE for MedInc on a 50-sample subset ---
ice_result = partial_dependence(
    gbr, X[:50], features=[0], kind="individual", grid_resolution=20
)
ice_grid = ice_result["grid_values"][0]   # subset grid differs from full-data grid
ice_curves = ice_result["individual"][0]  # shape: (50 samples, 20 grid points)

print("\n=== ICE: Individual predictions vary ===")
print(f"At MedInc={ice_grid[0]:.1f}: "
      f"min=${ice_curves[:, 0].min() * 100000:.0f}, "
      f"max=${ice_curves[:, 0].max() * 100000:.0f}")
print(f"At MedInc={ice_grid[-1]:.1f}: "
      f"min=${ice_curves[:, -1].min() * 100000:.0f}, "
      f"max=${ice_curves[:, -1].max() * 100000:.0f}")
print("\nICE reveals that the effect of income on price varies")
print("substantially across neighborhoods (heterogeneous effect).")
```

Feature Interaction

Features often interact: the effect of one feature depends on the value of another. For example, the effect of house age on price might differ between urban and rural areas.

Detecting Interactions

H-statistic (Friedman's H): Measures the proportion of variance in partial dependence explained by interactions. An H of 0 means no interaction; an H of 1 means the entire effect is due to interaction.

2D Partial Dependence: Plot PDP over two features simultaneously to visualize their joint effect. If the two features do not interact, the surface is additive: cross-sections along one feature all have the same shape, just shifted up or down. Cross-sections whose shape changes with the other feature's value indicate interaction.

Practical Recommendations

1. Start with permutation importance to identify top features
2. Use PDP to understand the direction and shape of each feature's effect
3. Use ICE plots to check for heterogeneous effects
4. Use 2D PDP or the H-statistic to identify important interactions
5. Validate with SHAP (next lesson) for precise attributions

Correlated Features

When features are correlated, both PDP and permutation importance can give misleading results. PDP extrapolates to unlikely feature combinations (e.g., 10 rooms but 500 sq ft), and permutation importance can understate the importance of correlated features. Consider using SHAP or conditional permutation importance for correlated data.
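The masking effect is easy to demonstrate. This sketch adds a near-duplicate of an informative feature and compares its permutation importance with and without the twin present (synthetic data; the setup and names are our own):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# shuffle=False keeps informative columns first, so feature 0 is a true predictor
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
rng = np.random.default_rng(0)
# Append a near-duplicate of feature 0 (tiny noise -> correlation close to 1)
X_dup = np.hstack([X, X[:, [0]] + rng.normal(0, 0.01, size=(1000, 1))])

imp_alone = permutation_importance(
    RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y),
    X, y, n_repeats=10, random_state=0,
).importances_mean
imp_masked = permutation_importance(
    RandomForestClassifier(n_estimators=100, random_state=0).fit(X_dup, y),
    X_dup, y, n_repeats=10, random_state=0,
).importances_mean

print(f"feature 0 alone:          {imp_alone[0]:.4f}")
print(f"feature 0 with near-twin: {imp_masked[0]:.4f}")
```

With the twin present, shuffling feature 0 hurts less, because the model can fall back on the duplicate, so feature 0's measured importance shrinks even though its true predictive value is unchanged.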