Perturbation-based Feature Importance Methods

library(xplainfi)
library(mlr3)
library(mlr3learners)
library(data.table)
library(ggplot2)
library(DiagrammeR)

This vignette demonstrates the three perturbation-based feature importance methods implemented in xplainfi:

PFI (Permutation Feature Importance): Uses marginal sampling (simple permutation)
CFI (Conditional Feature Importance): Uses conditional sampling via Adversarial Random Forests
RFI (Relative Feature Importance): Uses sampling conditional on a user-specified subset of features (the conditioning set)
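To make the perturbation idea concrete, here is a minimal base-R sketch of the marginal (permutation) case. It does not use xplainfi: the simulated data, the linear model, and the helper names `mse()` and `perm_importance()` are illustrative assumptions, and the loss is evaluated in-sample rather than with resampling.

```r
set.seed(1)

# Toy data (assumed for illustration): y depends on x1 only, x2 is irrelevant
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 2 * d$x1 + rnorm(n, sd = 0.1)

fit <- lm(y ~ x1 + x2, data = d)

# Mean squared error of a fitted model on a data set
mse <- function(model, data) mean((data$y - predict(model, data))^2)

# Permutation importance as a loss difference:
# loss after permuting one feature minus the unperturbed loss,
# averaged over several permutations
perm_importance <- function(model, data, feature, iters = 5) {
  base_loss <- mse(model, data)
  diffs <- vapply(seq_len(iters), function(i) {
    perturbed <- data
    perturbed[[feature]] <- sample(perturbed[[feature]])  # marginal sampling
    mse(model, perturbed) - base_loss
  }, numeric(1))
  mean(diffs)
}

imp <- sapply(c("x1", "x2"), function(f) perm_importance(fit, d, f))
imp
```

The conditional methods follow the same loss-difference recipe but replace the `sample()` step with draws that respect the dependence between the perturbed feature and the others.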

We’ll demonstrate these methods using three carefully designed scenarios that highlight their key differences.

# Common setup for all scenarios
learner <- lrn("regr.ranger", num.trees = 100)
resampling <- rsmp("cv", folds = 3)
measure <- msr("regr.mse")

Scenario 1: Interaction Effects

This scenario compares how marginal (PFI) and conditional (CFI, RFI) methods attribute importance to features that affect the target only through an interaction:

# Generate interaction scenario
task_int <- sim_dgp_interactions(n = 1000)
data_int <- task_int$data()

Causal Structure:

The key insight: x1 and x2 have NO direct effects - they affect y ONLY through their interaction (thick red arrow). However, PFI will still show them as important because permuting either feature destroys the crucial interaction term.

Analysis

Let’s analyze the interaction scenario where \(y = 2 \cdot x_1 \cdot x_2 + x_3 + \epsilon\). Note that x1 and x2 have NO main effects.
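As a rough sanity check on what to expect, the loss increase from permuting a feature can be worked out for an idealized model that equals the true function \(f(x) = 2 x_1 x_2 + x_3\). The assumptions here are not taken from the vignette: the features are mutually independent with variances \(\sigma_j^2\), the noise \(\epsilon\) has mean zero and is independent of the features, and \(x_j'\) denotes an independent copy of \(x_j\) (the idealized result of a permutation).

```latex
\[
\Delta_1
  = \mathbb{E}\big[(y - f(x_1', x_2, x_3))^2\big] - \mathbb{E}\big[(y - f(x))^2\big]
  = 4\,\mathbb{E}[x_2^2]\,\mathbb{E}\big[(x_1 - x_1')^2\big]
  = 8\,\sigma_1^2\,\mathbb{E}[x_2^2]
\]
\[
\Delta_3
  = \mathbb{E}\big[(x_3 - x_3')^2\big]
  = 2\,\sigma_3^2
\]
```

So even without a main effect, permuting x1 raises the loss whenever \(\mathbb{E}[x_2^2] > 0\). If the features additionally had unit variance, \(\Delta_3\) would equal 2, which is close to the PFI score reported for x3 in the next subsection. A fitted random forest only approximates the product term, so the observed scores for x1 and x2 need not match \(\Delta_1\).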

PFI on Interactions

pfi_int <- PFI$new(
  task = task_int,
  learner = learner,
  measure = measure,
  resampling = resampling,
  iters_perm = 5
)

# Compute importance scores
pfi_int_results <- pfi_int$compute(relation = "difference")
pfi_int_results
#> Key: <feature>
#>    feature  importance         sd
#>     <char>       <num>      <num>
#> 1:  noise1 0.011151109 0.06592808
#> 2:  noise2 0.007140595 0.04380357
#> 3:      x1 2.396084412 0.49513544
#> 4:      x2 2.081716910 0.37950171
#> 5:      x3 2.022106644 0.21062276

Expected: x1 and x2 will show high importance with PFI because permuting either feature destroys the interaction term x1×x2, which is crucial for prediction. PFI therefore reports the total contribution of a feature and cannot separate main effects from interaction effects.

CFI on Interactions

CFI preserves the joint distribution, which should better capture the interaction effect:

# Create ARF sampler for the interaction task
sampler_int = ARFSampler$new(task = task_int, finite_bounds = "local")

cfi_int <- CFI$new(
  task = task_int,
  learner = learner,
  measure = measure,
  resampling = resampling,
  iters_perm = 5,
  sampler = sampler_int
)

# Compute importance scores
cfi_int_results <- cfi_int$compute(relation = "difference")
cfi_int_results
#> Key: <feature>
#>    feature  importance         sd
#>     <char>       <num>      <num>
#> 1:  noise1 -0.02125244 0.02681519
#> 2:  noise2 -0.02958837 0.03942009
#> 3:      x1  1.00940147 0.22695657
#> 4:      x2  1.03622368 0.18141070
#> 5:      x3  0.76545513 0.16371265

Expected: CFI should show lower importance for x1 and x2 than PFI. In this run the CFI scores are roughly half the PFI scores (about 1.0 versus 2.1 to 2.4). The score for x3 drops as well (0.77 versus 2.02), so the reduction is not specific to the interacting features.

RFI on Interactions: Targeted Conditional Questions

RFI’s unique strength is answering specific conditional questions. Let’s explore what happens when we condition on different features:

# RFI conditioning on x2: "How important is x1 given we know x2?"
rfi_int_x2 <- RFI$new(
  task = task_int,
  learner = learner,
  measure = measure,
  resampling = resampling,
  conditioning_set = "x2",  # Condition on x2
  iters_perm = 5,
  sampler = sampler_int
)
rfi_int_x2_results <- rfi_int_x2$compute(relation = "difference")

# RFI conditioning on x1: "How important is x2 given we know x1?"  
rfi_int_x1 <- RFI$new(
  task = task_int,
  learner = learner,
  measure = measure,
  resampling = resampling,
  conditioning_set = "x1",  # Condition on x1
  iters_perm = 5,
  sampler = sampler_int
)
rfi_int_x1_results <- rfi_int_x1$compute(relation = "difference")

RFI Results:

x1 given x2: 2.095 (How important is x1 when we condition on x2)
x2 given x1: 1.912 (How important is x2 when we condition on x1)
x3 given x2: 1.964 (How important is x3 when we condition on x2)

Key insight: In the pure interaction case (y = 2·x1·x2 + x3), conditioning on one interacting feature leaves the other highly important, with scores close to the PFI values (2.095 versus 2.396 for x1, 1.912 versus 2.082 for x2), because the two features only matter together. This demonstrates how RFI answers targeted questions like “Given I already know x2, how much does x1 add?”

Comparing Methods on Interactions

Let’s compare how the methods handle the interaction:

RFI Conditional Summary: x1 given x2 has importance 2.095, x2 given x1 has importance 1.912, and x3 given x2 has importance 1.964. This shows how RFI reveals the conditional dependencies that pure marginal methods miss.

Key Insights: Interaction Effects

# Combine results and calculate ratios
comp_int <- rbindlist(list(
  pfi_int_results[, .(feature, importance, method = "PFI")],
  cfi_int_results[, .(feature, importance, method = "CFI")]
))

# Calculate the ratio of CFI to PFI importance for interacting features
int_ratio <- dcast(comp_int[feature %in% c("x1", "x2")], 
                   feature ~ method, value.var = "importance")
int_ratio[, cfi_pfi_ratio := CFI / PFI]
setnames(int_ratio, c("PFI", "CFI"), c("pfi_importance", "cfi_importance"))

int_ratio |> 
  knitr::kable(
    digits = 3,
    caption = "CFI vs PFI for Interacting Features"
  )

CFI vs PFI for Interacting Features
feature	cfi_importance	pfi_importance	cfi_pfi_ratio
x1	1.009	2.396	0.421
x2	1.036	2.082	0.498

Important insight about interaction effects: This example illustrates a crucial subtlety about PFI and interactions. While x1 and x2 have no main effects, PFI still correctly identifies them as important because permuting either feature destroys the interaction term x1×x2, which is crucial for prediction. The key limitation is that PFI cannot distinguish between main effects and interaction effects - it measures total contribution including through interactions.

Scenario 2: Confounding

This scenario shows how hidden confounders affect importance estimates and how conditioning can help:

# Generate confounding scenario  
task_conf <- sim_dgp_confounded(n = 1000)
data_conf <- task_conf$data()

Causal Structure:

The red arrows show the confounding paths: the hidden confounder creates spurious correlations between x1, x2, proxy, and y. The blue arrows show true direct causal effects. Note that independent is truly independent (no confounding) while proxy provides a noisy measurement of the confounder.

In the observable confounder scenario (used later), the confounder H would be included as a feature in the dataset, allowing direct conditioning rather than relying on the noisy proxy.

Key insight: The hidden confounder creates spurious correlations between x1, x2, and y (red paths), making them appear more important than they truly are. RFI conditioning on the proxy (which measures the confounder) should help isolate the true direct effects (blue paths).
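The difference between marginal and conditional perturbation under confounding can be sketched in base R. This is an illustration under assumed data, not the package's method: the data-generating process below is invented, and the conditional draw uses a simple linear regression with resampled residuals as a stand-in for the ARF sampler.

```r
set.seed(42)

# Assumed toy setup: a hidden confounder h drives both x1 and its noisy proxy
n <- 2000
h <- rnorm(n)
x1 <- h + rnorm(n, sd = 0.5)
proxy <- h + rnorm(n, sd = 0.5)

# Marginal perturbation: permuting x1 breaks its dependence on proxy
x1_marginal <- sample(x1)

# Conditional perturbation given proxy (stand-in for a conditional sampler):
# keep the part of x1 explained by proxy, resample only the residual part
fit <- lm(x1 ~ proxy)
x1_conditional <- fitted(fit) + sample(residuals(fit))

cors <- c(
  original    = cor(x1, proxy),
  marginal    = cor(x1_marginal, proxy),
  conditional = cor(x1_conditional, proxy)
)
round(cors, 2)
```

The marginal draw pushes the correlation with the proxy toward zero, while the conditional draw keeps it near its original value. Only the variation in x1 that the proxy does not explain is perturbed, which is the variation RFI attributes to x1 when conditioning on the proxy.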

Analysis

Now let’s analyze the confounding scenario where a hidden confounder affects both features and the outcome.

PFI on Confounded Data

pfi_conf <- PFI$new(
  task = task_conf,
  learner = learner,
  measure = measure,
  resampling = resampling,
  iters_perm = 5
)

pfi_conf_results <- pfi_conf$compute(relation = "difference")
pfi_conf_results
#> Key: <feature>
#>        feature importance         sd
#>         <char>      <num>      <num>
#> 1: independent  1.5558886 0.14955983
#> 2:       proxy  0.2164108 0.06458989
#> 3:          x1  1.6184945 0.17254140
#> 4:          x2  2.0832008 0.43771483

RFI Conditioning on Proxy

RFI can condition on the proxy to help isolate direct effects:

# Create sampler for confounding task
sampler_conf = ARFSampler$new(
  task = task_conf,
  verbose = FALSE,
  finite_bounds = "local"
)

# RFI conditioning on the proxy
rfi_conf <- RFI$new(
  task = task_conf,
  learner = learner,
  measure = measure,
  resampling = resampling,
  conditioning_set = "proxy",  # Condition on proxy to reduce confounding
  iters_perm = 5,
  sampler = sampler_conf
)

rfi_conf_results <- rfi_conf$compute(relation = "difference")
rfi_conf_results
#> Key: <feature>
#>        feature importance         sd
#>         <char>      <num>      <num>
#> 1: independent  1.5370390 0.12262544
#> 2:       proxy  0.0000000 0.00000000
#> 3:          x1  0.5846276 0.08205033
#> 4:          x2  0.7221616 0.09826116

Also trying CFI for comparison

cfi_conf <- CFI$new(
  task = task_conf,
  learner = learner,
  measure = measure,
  resampling = resampling,
  iters_perm = 5,
  sampler = sampler_conf
)

cfi_conf_results <- cfi_conf$compute(relation = "difference")
cfi_conf_results
#> Key: <feature>
#>        feature importance         sd
#>         <char>      <num>      <num>
#> 1: independent  1.5247804 0.15991909
#> 2:       proxy  0.0000000 0.00000000
#> 3:          x1  0.6137199 0.07982261
#> 4:          x2  0.6813112 0.08152289

Observable Confounder Scenario

In many real-world situations, confounders are actually observable (e.g., demographics, baseline characteristics). Let’s explore how RFI performs when we can condition directly on the true confounder:

# Generate scenario where confounder is observable
task_conf_obs <- sim_dgp_confounded(n = 1000, hidden = FALSE)

# Now we can condition directly on the true confounder
sampler_conf_obs = ARFSampler$new(
  task = task_conf_obs,
  verbose = FALSE,
  finite_bounds = "local"
)

# RFI conditioning on the true confounder (not just proxy)
rfi_conf_obs <- RFI$new(
  task = task_conf_obs,
  learner = learner,
  measure = measure,
  resampling = resampling,
  conditioning_set = "confounder",  # Condition on true confounder
  iters_perm = 5,
  sampler = sampler_conf_obs
)

rfi_conf_obs_results <- rfi_conf_obs$compute(relation = "difference")

# Compare with PFI on the same data
pfi_conf_obs <- PFI$new(
  task = task_conf_obs,
  learner = learner,
  measure = measure,
  resampling = resampling,
  iters_perm = 5
)
pfi_conf_obs_results <- pfi_conf_obs$compute(relation = "difference")

Key Results:

x1 importance: PFI = 0.594, RFI|confounder = 0.098
x2 importance: PFI = 0.612, RFI|confounder = 0.142
independent importance: PFI = 1.444, RFI|confounder = 1.407

Insight: When conditioning on the true confounder, RFI should show reduced importance for x1 and x2 (since much of their apparent importance was due to confounding) while independent maintains its importance (since it’s truly causally related to y).
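The size of the reduction can be read off the reported scores with a small calculation. The numbers below are copied from the Key Results above; the variable names are illustrative.

```r
# Importance scores copied from the Key Results above
pfi <- c(x1 = 0.594, x2 = 0.612, independent = 1.444)
rfi <- c(x1 = 0.098, x2 = 0.142, independent = 1.407)

# Share of the PFI score that disappears once we condition on the confounder
relative_reduction <- round(1 - rfi / pfi, 3)
relative_reduction
```

Conditioning on the confounder removes about 84% of the PFI score for x1 and 77% for x2, but only about 3% for independent.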

Comparing Methods on Confounding

Key Insights: Confounding Effects

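# Combine results from the three methods on the hidden-confounder task
comp_conf_long <- rbindlist(list(
  pfi_conf_results[, .(feature, importance, method = "PFI")],
  cfi_conf_results[, .(feature, importance, method = "CFI")],
  rfi_conf_results[, .(feature, importance, method = "RFI")]
))

# Show how conditioning affects importance estimates
conf_wide <- dcast(comp_conf_long, feature ~ method, value.var = "importance")
conf_summary <- conf_wide[, .(
  feature,
  pfi_importance = round(PFI, 3),
  cfi_importance = round(CFI, 3),
  rfi_proxy_importance = round(RFI, 3),
  pfi_rfi_diff = round(PFI - RFI, 3)
)]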

conf_summary |> 
  knitr::kable(
    caption = "Effect of Conditioning on Proxy in Confounded Scenario"
  )

Effect of Conditioning on Proxy in Confounded Scenario
feature	pfi_importance	cfi_importance	rfi_proxy_importance	pfi_rfi_diff
independent	1.556	1.525	1.537	0.019
proxy	0.216	0.000	0.000	0.216
x1	1.618	0.614	0.585	1.034
x2	2.083	0.681	0.722	1.361

In the confounding scenario, we observed:

PFI shows confounded effects: Without accounting for confounders, PFI overestimates the importance of x1 and x2 due to their spurious correlation with y through the hidden confounder.
RFI conditioning on proxy reduces bias: By conditioning on the proxy (noisy measurement of the confounder), RFI can partially isolate direct effects, though some confounding remains due to measurement error.
RFI conditioning on true confounder removes bias: When the confounder is observable and we can condition directly on it, RFI dramatically reduces the apparent importance of x1 and x2, revealing their true direct effects.
CFI partially accounts for confounding: Through its conditional sampling, CFI captures some of the confounding structure but cannot target specific confounders like RFI can.

Scenario 3: Independent Features (Baseline)

To provide a baseline comparison, let’s examine a scenario where all feature importance methods should produce similar results:

# Generate independent features scenario
task_ind <- sim_dgp_independent(n = 1000)
data_ind <- task_ind$data()

Causal Structure:

This is the simplest scenario: all features are independent, there are no interactions, and no confounding. Each feature has only a direct effect on y (or no effect in the case of noise).

Running All Methods on Independent Features

First PFI:

# PFI
pfi_ind <- PFI$new(
  task = task_ind,
  learner = learner,
  measure = measure,
  resampling = resampling,
  iters_perm = 5
)
pfi_ind_results <- pfi_ind$compute(relation = "difference")

Now CFI with the ARF sampler:

sampler_ind = ARFSampler$new(task = task_ind, finite_bounds = "local")
cfi_ind <- CFI$new(
  task = task_ind,
  learner = learner,
  measure = measure,
  resampling = resampling,
  iters_perm = 5,
  sampler = sampler_ind
)
cfi_ind_results <- cfi_ind$compute(relation = "difference")

RFI with an empty conditioning set, which amounts to marginal sampling as in PFI, except that the values are drawn from the ARF sampler rather than by permutation:

rfi_ind <- RFI$new(
  task = task_ind,
  learner = learner,
  measure = measure,
  resampling = resampling,
  conditioning_set = character(0),  # Empty set
  iters_perm = 5,
  sampler = sampler_ind
)
rfi_ind_results <- rfi_ind$compute(relation = "difference")

And now we visualize:

Agreement Between Methods

# Combine results from the three methods
comp_ind_long <- rbindlist(list(
  pfi_ind_results[, .(feature, importance, method = "PFI")],
  cfi_ind_results[, .(feature, importance, method = "CFI")],
  rfi_ind_results[, .(feature, importance, method = "RFI")]
))

# Calculate coefficient of variation for each feature across methods
comp_ind_wide <- dcast(comp_ind_long, feature ~ method, value.var = "importance")
comp_ind_wide[, `:=`(
  mean_importance = rowMeans(.SD),
  sd_importance = apply(.SD, 1, sd),
  cv = apply(.SD, 1, sd) / rowMeans(.SD)
), .SDcols = c("PFI", "CFI", "RFI")]

# Classify agreement on |cv| so that a negative mean is not labelled "High"
comp_ind_wide[, .(
  feature,
  mean_importance = round(mean_importance, 3),
  cv = round(cv, 3),
  agreement = ifelse(abs(cv) < 0.1, "High", ifelse(abs(cv) < 0.2, "Moderate", "Low"))
)] |>
  knitr::kable(
    caption = "Method Agreement on Independent Features",
    col.names = c("Feature", "Mean Importance", "Coef. of Variation", "Agreement Level")
  )

Method Agreement on Independent Features
Feature	Mean Importance	Coef. of Variation	Agreement Level
important1	5.669	0.324	Low
important2	1.212	0.345	Low
important3	0.279	0.299	Low
unimportant1	0.003	3.755	Low
unimportant2	-0.003	-1.439	Low

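Key insight: With independent features and no interactions or confounding, the three methods (PFI, CFI, RFI) would be expected to agree. In this run, however, the coefficient of variation is around 0.3 for the three important features, so the magnitudes differ noticeably between methods. The coefficient of variation is not informative for the unimportant features, whose mean importance is near zero. Scenario 3 is therefore best read as a baseline for how much the methods differ when there are no interactions or confounding, against which the differences in Scenarios 1 and 2 can be judged.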
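The coefficient of variation in the agreement table divides by the mean, so it behaves poorly when the mean importance is close to zero or negative. The following sketch uses made-up importance values (not results from this vignette) to show the effect on a threshold rule such as `cv < 0.1`.

```r
# Coefficient of variation: standard deviation relative to the mean
cv <- function(x) sd(x) / mean(x)

# Hypothetical importance scores of one feature under three methods
clearly_important <- c(5.0, 5.5, 6.0)
near_zero <- c(0.004, -0.006, -0.007)

cv(clearly_important)  # small and positive
cv(near_zero)          # large in magnitude and negative

# A negative value passes a plain "cv < 0.1" check despite poor agreement;
# comparing abs(cv) against the threshold avoids this
cv(near_zero) < 0.1
abs(cv(near_zero)) < 0.1
```

For features with near-zero importance, the absolute spread (the standard deviation across methods) is usually the more meaningful quantity.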

Key Insights: Independent Features

In the baseline scenario with independent features:

Methods differ in magnitude: The coefficient of variation across PFI, CFI, and RFI is about 0.3 for the three important features, so the estimates are not identical even when features are independent.
Baseline for comparison: This spread is the reference against which the method differences in Scenarios 1 and 2 should be judged.
Noise correctly identified: The unimportant features receive mean importance near zero (0.003 and -0.003).

Key Takeaways

Through these three scenarios, we’ve demonstrated:

Method choice matters:
- PFI is simple and fast but cannot separate main effects from interaction effects and is affected by confounding
- CFI captures feature dependencies and interactions through conditional sampling
- RFI allows targeted conditioning to isolate specific relationships
When to use each method:
- Use PFI when features are believed to be independent (as in Scenario 3) and you want a quick baseline importance ranking
- Use CFI when you suspect feature interactions or dependencies (as in Scenario 1) and want a sophisticated analysis that respects feature relationships
- Use RFI when you have specific conditional questions: “How important is feature X given I already know feature Y?” (as in Scenarios 1 & 2). Essential for feature selection and understanding incremental value.
Practical considerations:
- All methods benefit from cross-validation and multiple permutation iterations for stability
- ARF-based conditional sampling (used in CFI/RFI) is more computationally intensive than marginal sampling
- The choice of conditioning set in RFI requires domain knowledge
