Perturbation-based Feature Importance Methods
library(xplainfi)
library(mlr3)
library(mlr3learners)
library(data.table)
library(ggplot2)
library(DiagrammeR)
This vignette demonstrates the three perturbation-based feature importance methods implemented in xplainfi:
- PFI (Permutation Feature Importance): Uses marginal sampling (simple permutation)
- CFI (Conditional Feature Importance): Uses conditional sampling via Adversarial Random Forests
- RFI (Relative Feature Importance): Uses conditional sampling on a user-specified subset of features
We’ll demonstrate these methods using three carefully designed scenarios that highlight their key differences.
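The three methods follow one recipe and differ only in how the feature of interest is resampled before the loss is re-evaluated. The base R sketch below illustrates this under strong simplifying assumptions: a linear model stands in for the learner, and "linear prediction plus permuted residuals" stands in for the ARF sampler. The toy data and the helper names `sample_feature()` and `perturbation_importance()` are hypothetical and not part of xplainfi.

```r
set.seed(1)
n <- 2000
# Assumed toy data: x1 and x2 are correlated, x3 is independent of both
x1 <- rnorm(n)
x2 <- 0.8 * x1 + 0.6 * rnorm(n)
x3 <- rnorm(n)
y <- x1 + x3 + rnorm(n, sd = 0.5)
dat <- data.frame(x1, x2, x3, y)

train <- dat[1:1000, ]
test <- dat[1001:2000, ]
fit <- lm(y ~ x1 + x2 + x3, data = train)
mse <- function(d) mean((d$y - predict(fit, newdata = d))^2)

# Resample `feature` given the features in `cond` (hypothetical helper).
# Empty `cond`: plain permutation (marginal sampling).
# Otherwise: linear prediction from `cond` plus permuted residuals,
# a crude stand-in for a conditional sampler.
sample_feature <- function(d, feature, cond = character(0)) {
  if (length(cond) == 0) return(sample(d[[feature]]))
  aux <- lm(reformulate(cond, response = feature), data = d)
  fitted(aux) + sample(residuals(aux))
}

# Mean increase in test MSE after resampling `feature`, over `iters` repeats
perturbation_importance <- function(feature, cond = character(0), iters = 5) {
  base <- mse(test)
  scores <- replicate(iters, {
    d <- test
    d[[feature]] <- sample_feature(test, feature, cond)
    mse(d) - base
  })
  mean(scores)
}

pfi_x1 <- perturbation_importance("x1")                       # no conditioning
cfi_x1 <- perturbation_importance("x1", cond = c("x2", "x3")) # all other features
rfi_x1 <- perturbation_importance("x1", cond = "x2")          # chosen subset
round(c(PFI = pfi_x1, CFI = cfi_x1, RFI_given_x2 = rfi_x1), 3)
```

In this toy setup PFI for x1 should come out clearly larger than CFI, because conditioning on the correlated x2 leaves less of x1 to perturb. RFI given x2 should land close to CFI, since x3 carries no information about x1.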
# Common setup for all scenarios
learner <- lrn("regr.ranger", num.trees = 100)
resampling <- rsmp("cv", folds = 3)
measure <- msr("regr.mse")
Scenario 1: Interaction Effects
This scenario examines how marginal (PFI) and conditional (CFI) methods score features that affect the outcome only through an interaction:
# Generate interaction scenario
task_int <- sim_dgp_interactions(n = 1000)
data_int <- task_int$data()
Causal Structure:
The key insight: x1 and x2 have NO direct effects - they affect y ONLY through their interaction (thick red arrow). However, PFI will still show them as important because permuting either feature destroys the crucial interaction term.
Analysis
Let’s analyze the interaction scenario where \(y = 2 \cdot x_1 \cdot x_2 + x_3 + \epsilon\). Note that x1 and x2 have NO main effects.
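What "no main effects" means can be checked with a small base R simulation of the formula above. This is a sketch only: the standard normal features and the noise standard deviation of 0.5 are assumptions made for illustration and are not taken from `sim_dgp_interactions()`.

```r
set.seed(42)
n <- 5000
# Assumed distributions: standard normal features, noise sd of 0.5
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y <- 2 * x1 * x2 + x3 + rnorm(n, sd = 0.5)

# Marginal correlations: x1 and x2 look unrelated to y on their own
cors <- c(x1 = cor(x1, y), x2 = cor(x2, y), x3 = cor(x3, y))
round(cors, 2)

# An additive model cannot exploit x1 and x2; a model with the product term can
r2_additive <- summary(lm(y ~ x1 + x2 + x3))$r.squared
r2_interaction <- summary(lm(y ~ x1:x2 + x3))$r.squared
round(c(additive = r2_additive, interaction = r2_interaction), 2)
```

The correlations of x1 and x2 with y should be close to zero, while the model that includes the product term explains far more variance than the additive one. Any method that scores features one at a time against y would therefore overlook x1 and x2, whereas a method that perturbs the inputs of a fitted model does not.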
PFI on Interactions
pfi_int <- PFI$new(
task = task_int,
learner = learner,
measure = measure,
resampling = resampling,
iters_perm = 5
)
# Compute importance scores
pfi_int_results <- pfi_int$compute(relation = "difference")
pfi_int_results
#> Key: <feature>
#> feature importance sd
#> <char> <num> <num>
#> 1: noise1 0.011151109 0.06592808
#> 2: noise2 0.007140595 0.04380357
#> 3: x1 2.396084412 0.49513544
#> 4: x2 2.081716910 0.37950171
#> 5: x3 2.022106644 0.21062276
Expected: x1 and x2 will show high importance with PFI because permuting either feature destroys the interaction term x1×x2, which is crucial for prediction. PFI therefore reports the total contribution of a feature and cannot tell us that this contribution comes purely from an interaction.
CFI on Interactions
CFI preserves the joint distribution, which should better capture the interaction effect:
# Create ARF sampler for the interaction task
sampler_int <- ARFSampler$new(task = task_int, finite_bounds = "local")
cfi_int <- CFI$new(
task = task_int,
learner = learner,
measure = measure,
resampling = resampling,
iters_perm = 5,
sampler = sampler_int
)
# Compute importance scores
cfi_int_results <- cfi_int$compute(relation = "difference")
cfi_int_results
#> Key: <feature>
#> feature importance sd
#> <char> <num> <num>
#> 1: noise1 -0.02125244 0.02681519
#> 2: noise2 -0.02958837 0.03942009
#> 3: x1 1.00940147 0.22695657
#> 4: x2 1.03622368 0.18141070
#> 5: x3 0.76545513 0.16371265
Expected: CFI should show somewhat lower importance for x1 and x2 compared to PFI because it better preserves the interaction structure during conditional sampling, providing a more nuanced importance estimate.
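The difference between marginal and conditional sampling is easiest to see on the perturbed data itself. The sketch below uses a toy pair of correlated features (an assumption for illustration, not the interaction data above) and a simple regression-based sampler in place of the Adversarial Random Forest.

```r
set.seed(3)
n <- 2000
# Toy correlated pair (assumed, not the scenario's data)
x2 <- rnorm(n)
x1 <- 0.8 * x2 + 0.6 * rnorm(n)

# Marginal sampling: permute x1 and ignore x2
x1_marginal <- sample(x1)

# Conditional sampling stand-in: predict x1 from x2, then add permuted residuals
aux <- lm(x1 ~ x2)
x1_conditional <- fitted(aux) + sample(residuals(aux))

round(c(
  original = cor(x1, x2),
  marginal = cor(x1_marginal, x2),
  conditional = cor(x1_conditional, x2)
), 2)
```

Permutation should drive the correlation with x2 to roughly zero, so the model is evaluated on feature combinations it never saw. The conditional draw should keep the correlation near its original value, which is the sense in which conditional methods preserve the joint distribution.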
RFI on Interactions: Targeted Conditional Questions
RFI’s unique strength is answering specific conditional questions. Let’s explore what happens when we condition on different features:
# RFI conditioning on x2: "How important is x1 given we know x2?"
rfi_int_x2 <- RFI$new(
task = task_int,
learner = learner,
measure = measure,
resampling = resampling,
conditioning_set = "x2", # Condition on x2
iters_perm = 5,
sampler = sampler_int
)
rfi_int_x2_results <- rfi_int_x2$compute(relation = "difference")
# RFI conditioning on x1: "How important is x2 given we know x1?"
rfi_int_x1 <- RFI$new(
task = task_int,
learner = learner,
measure = measure,
resampling = resampling,
conditioning_set = "x1", # Condition on x1
iters_perm = 5,
sampler = sampler_int
)
rfi_int_x1_results <- rfi_int_x1$compute(relation = "difference")
RFI Results:
- x1 given x2: 2.095 (How important is x1 when we condition on x2)
- x2 given x1: 1.912 (How important is x2 when we condition on x1)
- x3 given x2: 1.964 (How important is x3 when we condition on x2)
Key insight: In the pure interaction case (y = 2·x1·x2 + x3), when we condition on one interacting feature, the other becomes extremely important because they only matter together. This demonstrates RFI’s power to answer targeted questions like “Given I already know x2, how much does x1 add?”
Comparing Methods on Interactions
Let’s compare how the methods handle the interaction:
RFI Conditional Summary: x1 given x2 has importance 2.095, x2 given x1 has importance 1.912, and x3 given x2 has importance 1.964. This shows how RFI reveals the conditional dependencies that pure marginal methods miss.
Key Insights: Interaction Effects
# Combine results and calculate ratios
comp_int <- rbindlist(list(
pfi_int_results[, .(feature, importance, method = "PFI")],
cfi_int_results[, .(feature, importance, method = "CFI")]
))
# Calculate the ratio of CFI to PFI importance for interacting features
int_ratio <- dcast(comp_int[feature %in% c("x1", "x2")],
feature ~ method, value.var = "importance")
int_ratio[, cfi_pfi_ratio := CFI / PFI]
setnames(int_ratio, c("PFI", "CFI"), c("pfi_importance", "cfi_importance"))
int_ratio |>
knitr::kable(
digits = 3,
caption = "CFI vs PFI for Interacting Features"
)
feature | cfi_importance | pfi_importance | cfi_pfi_ratio |
---|---|---|---|
x1 | 1.009 | 2.396 | 0.421 |
x2 | 1.036 | 2.082 | 0.498 |
Important insight about interaction effects: This example illustrates a crucial subtlety about PFI and interactions. While x1 and x2 have no main effects, PFI still correctly identifies them as important because permuting either feature destroys the interaction term x1×x2, which is crucial for prediction. The key limitation is that PFI cannot distinguish between main effects and interaction effects - it measures total contribution including through interactions.
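Why permutation picks up a pure interaction can be made explicit with a short derivation. As a sketch, assume an oracle model \(f(x) = 2 x_1 x_2 + x_3\), mutually independent features with mean zero, noise \(\epsilon\) with mean zero that is independent of the features, and a permuted copy \(x_1'\) that behaves like an independent draw from the marginal distribution of \(x_1\). None of these assumptions is guaranteed for the fitted random forest above, so the values in the tables need not match.

\[
\begin{aligned}
\mathrm{PFI}(x_1) &= E\big[(y - f(x_1', x_2, x_3))^2\big] - E\big[(y - f(x_1, x_2, x_3))^2\big] \\
&= E\big[(2 x_2 (x_1 - x_1') + \epsilon)^2\big] - E[\epsilon^2] \\
&= 4\, E[x_2^2]\, E[(x_1 - x_1')^2] \\
&= 8\, \mathrm{Var}(x_1)\, \mathrm{Var}(x_2)
\end{aligned}
\]

The cross term with \(\epsilon\) vanishes by independence, and \(E[(x_1 - x_1')^2] = 2\,\mathrm{Var}(x_1)\) because \(x_1\) and \(x_1'\) are independent with the same distribution. The result is positive whenever both features vary, even though neither has a main effect. The same argument gives \(\mathrm{PFI}(x_3) = 2\,\mathrm{Var}(x_3)\); a fitted model that only partly learns the interaction may report a smaller value for x1 than this oracle calculation suggests.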
Scenario 2: Confounding
This scenario shows how hidden confounders affect importance estimates and how conditioning can help:
# Generate confounding scenario
task_conf <- sim_dgp_confounded(n = 1000)
data_conf <- task_conf$data()
Causal Structure:
The red arrows show the confounding paths: the hidden confounder creates spurious correlations between x1, x2, proxy, and y. The blue arrows show true direct causal effects. Note that independent is truly independent (no confounding) while proxy provides a noisy measurement of the confounder.
In the observable confounder scenario (used later), the confounder H would be included as a feature in the dataset, allowing direct conditioning rather than relying on the noisy proxy.
Key insight: The hidden confounder creates spurious correlations between x1, x2, and y (red paths), making them appear more important than they truly are. RFI conditioning on the proxy (which measures the confounder) should help isolate the true direct effects (blue paths).
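The effect of adjusting for a noisy proxy versus the confounder itself can be previewed with ordinary regression. The data-generating process below is an assumed toy version written for this illustration; it is not the one implemented in `sim_dgp_confounded()`, and regression coefficients are used here as a simple stand-in for importance scores.

```r
set.seed(11)
n <- 5000
# Assumed toy DGP, not the one used by sim_dgp_confounded()
h <- rnorm(n)                     # confounder
x1 <- h + rnorm(n, sd = 0.5)
x2 <- h + rnorm(n, sd = 0.5)
proxy <- h + rnorm(n, sd = 0.5)   # noisy measurement of h
independent <- rnorm(n)
y <- h + 0.5 * x1 + 0.5 * x2 + independent + rnorm(n, sd = 0.5)

# Coefficient of x1 under three adjustment strategies (true direct effect: 0.5)
b_none <- coef(lm(y ~ x1 + x2 + independent))[["x1"]]
b_proxy <- coef(lm(y ~ x1 + x2 + independent + proxy))[["x1"]]
b_true <- coef(lm(y ~ x1 + x2 + independent + h))[["x1"]]
round(c(unadjusted = b_none, proxy = b_proxy, confounder = b_true), 2)
```

Without adjustment, x1 absorbs part of the confounder's effect and its coefficient is inflated. Adjusting for the proxy should shrink it only partway towards 0.5 because of measurement error, while adjusting for the confounder itself should recover the direct effect.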
Analysis
Now let’s analyze the confounding scenario where a hidden confounder affects both features and the outcome.
PFI on Confounded Data
pfi_conf <- PFI$new(
task = task_conf,
learner = learner,
measure = measure,
resampling = resampling,
iters_perm = 5
)
pfi_conf_results <- pfi_conf$compute(relation = "difference")
pfi_conf_results
#> Key: <feature>
#> feature importance sd
#> <char> <num> <num>
#> 1: independent 1.5558886 0.14955983
#> 2: proxy 0.2164108 0.06458989
#> 3: x1 1.6184945 0.17254140
#> 4: x2 2.0832008 0.43771483
RFI Conditioning on Proxy
RFI can condition on the proxy to help isolate direct effects:
# Create sampler for confounding task
sampler_conf <- ARFSampler$new(
task = task_conf,
verbose = FALSE,
finite_bounds = "local"
)
# RFI conditioning on the proxy
rfi_conf <- RFI$new(
task = task_conf,
learner = learner,
measure = measure,
resampling = resampling,
conditioning_set = "proxy", # Condition on proxy to reduce confounding
iters_perm = 5,
sampler = sampler_conf
)
rfi_conf_results <- rfi_conf$compute(relation = "difference")
rfi_conf_results
#> Key: <feature>
#> feature importance sd
#> <char> <num> <num>
#> 1: independent 1.5370390 0.12262544
#> 2: proxy 0.0000000 0.00000000
#> 3: x1 0.5846276 0.08205033
#> 4: x2 0.7221616 0.09826116
Also trying CFI for comparison
cfi_conf <- CFI$new(
task = task_conf,
learner = learner,
measure = measure,
resampling = resampling,
iters_perm = 5,
sampler = sampler_conf
)
cfi_conf_results <- cfi_conf$compute(relation = "difference")
cfi_conf_results
#> Key: <feature>
#> feature importance sd
#> <char> <num> <num>
#> 1: independent 1.5247804 0.15991909
#> 2: proxy 0.0000000 0.00000000
#> 3: x1 0.6137199 0.07982261
#> 4: x2 0.6813112 0.08152289
Observable Confounder Scenario
In many real-world situations, confounders are actually observable (e.g., demographics, baseline characteristics). Let’s explore how RFI performs when we can condition directly on the true confounder:
# Generate scenario where confounder is observable
task_conf_obs <- sim_dgp_confounded(n = 1000, hidden = FALSE)
# Now we can condition directly on the true confounder
sampler_conf_obs <- ARFSampler$new(
task = task_conf_obs,
verbose = FALSE,
finite_bounds = "local"
)
# RFI conditioning on the true confounder (not just proxy)
rfi_conf_obs <- RFI$new(
task = task_conf_obs,
learner = learner,
measure = measure,
resampling = resampling,
conditioning_set = "confounder", # Condition on true confounder
iters_perm = 5,
sampler = sampler_conf_obs
)
rfi_conf_obs_results <- rfi_conf_obs$compute(relation = "difference")
# Compare with PFI on the same data
pfi_conf_obs <- PFI$new(
task = task_conf_obs,
learner = learner,
measure = measure,
resampling = resampling,
iters_perm = 5
)
pfi_conf_obs_results <- pfi_conf_obs$compute(relation = "difference")
Key Results:
- x1 importance: PFI = 0.594, RFI|confounder = 0.098
- x2 importance: PFI = 0.612, RFI|confounder = 0.142
- independent importance: PFI = 1.444, RFI|confounder = 1.407
Insight: When conditioning on the true confounder, RFI should show reduced importance for x1 and x2 (since much of their apparent importance was due to confounding) while independent maintains its importance (since it’s truly causally related to y).
Key Insights: Confounding Effects
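# Combine the hidden-confounder results in long format (one row per feature and method)
comp_conf_long <- rbindlist(list(
pfi_conf_results[, .(feature, importance, method = "PFI")],
cfi_conf_results[, .(feature, importance, method = "CFI")],
rfi_conf_results[, .(feature, importance, method = "RFI")]
))
# Show how conditioning affects importance estimates
conf_wide <- dcast(comp_conf_long, feature ~ method, value.var = "importance")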
conf_summary <- conf_wide[, .(
feature,
pfi_importance = round(PFI, 3),
cfi_importance = round(CFI, 3),
rfi_proxy_importance = round(RFI, 3),
pfi_rfi_diff = round(PFI - RFI, 3)
)]
conf_summary |>
knitr::kable(
caption = "Effect of Conditioning on Proxy in Confounded Scenario"
)
feature | pfi_importance | cfi_importance | rfi_proxy_importance | pfi_rfi_diff |
---|---|---|---|---|
independent | 1.556 | 1.525 | 1.537 | 0.019 |
proxy | 0.216 | 0.000 | 0.000 | 0.216 |
x1 | 1.618 | 0.614 | 0.585 | 1.034 |
x2 | 2.083 | 0.681 | 0.722 | 1.361 |
In the confounding scenario, we observed:
- PFI shows confounded effects: Without accounting for confounders, PFI overestimates the importance of x1 and x2 due to their spurious correlation with y through the hidden confounder.
- RFI conditioning on proxy reduces bias: By conditioning on the proxy (noisy measurement of the confounder), RFI can partially isolate direct effects, though some confounding remains due to measurement error.
- RFI conditioning on true confounder removes bias: When the confounder is observable and we can condition directly on it, RFI dramatically reduces the apparent importance of x1 and x2, revealing their true direct effects.
- CFI partially accounts for confounding: Through its conditional sampling, CFI captures some of the confounding structure but cannot target specific confounders like RFI can.
Scenario 3: Independent Features (Baseline)
To provide a baseline comparison, let’s examine a scenario where all feature importance methods should produce similar results:
# Generate independent features scenario
task_ind <- sim_dgp_independent(n = 1000)
data_ind <- task_ind$data()
Causal Structure:
This is the simplest scenario: all features are independent, there are no interactions, and no confounding. Each feature has only a direct effect on y (or no effect in the case of noise).
Running All Methods on Independent Features
First PFI:
# PFI
pfi_ind <- PFI$new(
task = task_ind,
learner = learner,
measure = measure,
resampling = resampling,
iters_perm = 5
)
pfi_ind_results <- pfi_ind$compute(relation = "difference")
Now CFI with the ARF sampler:
sampler_ind <- ARFSampler$new(task = task_ind, finite_bounds = "local")
cfi_ind <- CFI$new(
task = task_ind,
learner = learner,
measure = measure,
resampling = resampling,
iters_perm = 5,
sampler = sampler_ind
)
cfi_ind_results <- cfi_ind$compute(relation = "difference")
RFI with empty conditioning set, basically equivalent to PFI with a different sampler:
rfi_ind <- RFI$new(
task = task_ind,
learner = learner,
measure = measure,
resampling = resampling,
conditioning_set = character(0), # Empty set
iters_perm = 5,
sampler = sampler_ind
)
rfi_ind_results <- rfi_ind$compute(relation = "difference")
And now we visualize:
Agreement Between Methods
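# Combine the three result tables in long format (one row per feature and method)
comp_ind_long <- rbindlist(list(
pfi_ind_results[, .(feature, importance, method = "PFI")],
cfi_ind_results[, .(feature, importance, method = "CFI")],
rfi_ind_results[, .(feature, importance, method = "RFI")]
))
# Calculate coefficient of variation for each feature across methods
comp_ind_wide <- dcast(comp_ind_long, feature ~ method, value.var = "importance")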
comp_ind_wide[, `:=`(
mean_importance = rowMeans(.SD),
sd_importance = apply(.SD, 1, sd),
cv = apply(.SD, 1, sd) / abs(rowMeans(.SD)) # abs() keeps cv non-negative when the mean is negative
), .SDcols = c("PFI", "CFI", "RFI")]
comp_ind_wide[, .(
feature,
mean_importance = round(mean_importance, 3),
cv = round(cv, 3),
agreement = ifelse(cv < 0.1, "High", ifelse(cv < 0.2, "Moderate", "Low"))
)] |>
knitr::kable(
caption = "Method Agreement on Independent Features",
col.names = c("Feature", "Mean Importance", "Coef. of Variation", "Agreement Level")
)
Feature | Mean Importance | Coef. of Variation | Agreement Level |
---|---|---|---|
important1 | 5.669 | 0.324 | Low |
important2 | 1.212 | 0.345 | Low |
important3 | 0.279 | 0.299 | Low |
unimportant1 | 0.003 | 3.755 | Low |
unimportant2 | -0.003 | 1.439 | Low |
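Key insight: With independent features and no complex relationships, all three methods (PFI, CFI, RFI) target the same quantity and are expected to produce similar importance estimates. In this run they separate the important features from the noise features, but the magnitudes differ noticeably: the coefficients of variation for the important features are around 0.3. The baseline therefore supports attributing the differences in Scenarios 1 and 2 to interactions and confounding only with some caution, since part of the spread between methods appears even without such structure.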
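The coefficient of variation divides by the mean importance, so it becomes unstable, and can even change sign, for features whose mean importance is close to zero. The sketch below uses made-up importance scores (hypothetical values, not results from this vignette) to compare a signed CV, a CV based on the absolute mean, and the plain range across methods.

```r
# Hypothetical importance scores for two features under three methods
scores <- rbind(
  strong = c(PFI = 5.0, CFI = 5.5, RFI = 4.5),
  noise  = c(PFI = 0.004, CFI = -0.006, RFI = -0.007)
)

# Signed CV: negative whenever the mean importance is negative
cv_signed <- apply(scores, 1, sd) / rowMeans(scores)
# CV relative to the absolute mean: always non-negative
cv_abs <- apply(scores, 1, sd) / abs(rowMeans(scores))
# Absolute spread across methods, unaffected by a near-zero mean
range_abs <- apply(scores, 1, function(s) diff(range(s)))

round(cbind(cv_signed, cv_abs, range_abs), 3)
```

For the noise feature the CV is large in magnitude although the scores differ by only about 0.01 in absolute terms, so for near-zero importances an absolute measure of spread is the more informative summary.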
Key Insights: Independent Features
In the baseline scenario with independent features:
- All methods should agree: PFI, CFI, and RFI are expected to produce similar importance estimates when features are truly independent. In this run the estimates for the important features still vary across methods (coefficients of variation around 0.3).
- Validates methodology: Agreement between methods in this baseline indicates that differences in other scenarios are due to data structure, not method artifacts.
- Noise correctly identified: All methods correctly assign near-zero importance to the noise features.
Key Takeaways
Through these three scenarios, we’ve demonstrated:
- Method choice matters:
  - PFI is simple and fast but can miss interaction effects and be affected by confounding
  - CFI captures feature dependencies and interactions through conditional sampling
  - RFI allows targeted conditioning to isolate specific relationships
- When to use each method:
  - Use PFI when features are believed to be independent (as in Scenario 3) and you want a quick baseline importance ranking
  - Use CFI when you suspect feature interactions or dependencies (as in Scenario 1) and want a sophisticated analysis that respects feature relationships
  - Use RFI when you have specific conditional questions: “How important is feature X given I already know feature Y?” (as in Scenarios 1 & 2). Essential for feature selection and understanding incremental value.
- Practical considerations:
  - All methods benefit from cross-validation and multiple permutation iterations for stability
  - ARF-based conditional sampling (used in CFI/RFI) is more computationally intensive than marginal sampling
  - The choice of conditioning set in RFI requires domain knowledge
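The benefit of multiple permutation iterations can be illustrated with a small base R experiment. This is a sketch under assumed toy data and a linear model; `pfi_once()` is a hypothetical helper, and its `iters` argument only mimics the role that `iters_perm` plays in the examples above.

```r
set.seed(7)
n <- 300
x <- rnorm(n)
y <- 2 * x + rnorm(n)
dat <- data.frame(x, y)
fit <- lm(y ~ x, data = dat[1:150, ])
test <- dat[151:300, ]
base_mse <- mean((test$y - predict(fit, test))^2)

# PFI of x, averaged over `iters` permutations of the test set
pfi_once <- function(iters) {
  mean(replicate(iters, {
    d <- test
    d$x <- sample(d$x)
    mean((d$y - predict(fit, d))^2) - base_mse
  }))
}

# Spread of the estimate across 100 repetitions, for 1 vs 20 permutations
sd_1 <- sd(replicate(100, pfi_once(1)))
sd_20 <- sd(replicate(100, pfi_once(20)))
round(c(iters_1 = sd_1, iters_20 = sd_20), 3)
```

Averaging over more permutations should shrink the run-to-run spread of the estimate. It only reduces the noise from the permutation step; variability from the train/test split and from refitting the model is addressed by resampling instead.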
Further Reading
For more details on these methods and their theoretical foundations, see:
- Breiman (2001) for the original PFI formulation
- Strobl et al. (2008) for limitations of PFI with correlated features
- Watson & Wright (2021) for conditional sampling with ARF
- König et al. (2021) for relative feature importance
- Ewald et al. (2024) for a comprehensive review of feature importance methods