These data generating processes (DGPs) are designed to illustrate specific strengths and weaknesses of feature importance methods such as permutation feature importance (PFI), conditional feature importance (CFI), and relative feature importance (RFI). Each DGP isolates one primary challenge so that the differences between methods are easy to see.
Usage
sim_dgp_correlated(n = 500L)
sim_dgp_mediated(n = 500L)
sim_dgp_confounded(n = 500L, hidden = TRUE)
sim_dgp_interactions(n = 500L)
sim_dgp_independent(n = 500L)
Value
A regression task (mlr3::TaskRegr) with data.table backend.
Details
Correlated Features DGP: This DGP creates two nearly identical predictors, x1 and x2, of which only x1 affects the outcome. Because of this redundancy, PFI can misattribute importance between the two (understating x1 and crediting x2), while CFI targets each feature's conditional contribution.
Mathematical Model: $$X_1 \sim N(0,1)$$ $$X_2 = X_1 + \varepsilon_2, \quad \varepsilon_2 \sim N(0, 0.05^2)$$ $$X_3 \sim N(0,1), \quad X_4 \sim N(0,1)$$ $$Y = 2 \cdot X_1 + X_3 + \varepsilon$$ where \(\varepsilon \sim N(0, 0.2^2)\).
Feature Properties:
x1: Standard normal, direct causal effect on y (β=2.0)
x2: Nearly perfect copy of x1 (x1 + small noise), NO causal effect on y (β=0)
x3: Independent standard normal, direct causal effect on y (β=1.0)
x4: Independent standard normal, no effect on y (β=0)
Expected Behavior:
Marginal methods (PFI, Marginal SAGE): Will falsely assign importance to x2 due to correlation with x1
Conditional methods (CFI, Conditional SAGE): Should correctly assign near-zero importance to x2
Key insight: x2 is a "spurious predictor" - correlated with causal feature but not causal itself
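The equations above can be reproduced in a few lines of base R to see why x2 is a spurious predictor. This is a minimal sketch written directly from the stated model, not the package's internal implementation; the sample size, the seed, and the use of lm() as a simple stand-in for conditioning on x1 are illustrative assumptions.

```r
# Base-R re-implementation of the correlated DGP from the equations above
# (illustrative sketch; not the code used inside sim_dgp_correlated()).
set.seed(1)
n <- 5000

x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)  # near copy of x1
x3 <- rnorm(n)
x4 <- rnorm(n)
y  <- 2 * x1 + x3 + rnorm(n, sd = 0.2)

# x2 is almost perfectly correlated with x1 ...
cor_x1_x2 <- cor(x1, x2)

# ... so on its own it looks strongly associated with y,
cor_x2_y <- cor(x2, y)

# but once x1 is in the model, x2 adds essentially nothing.
fit <- lm(y ~ x1 + x2 + x3 + x4)
coef_x2 <- unname(coef(fit)["x2"])

print(round(c(cor_x1_x2 = cor_x1_x2, cor_x2_y = cor_x2_y, coef_x2 = coef_x2), 3))
```

The marginal association of x2 with y is what marginal methods can pick up; the near-zero adjusted coefficient mirrors what conditional methods are expected to report.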
Mediated Effects DGP: This DGP demonstrates the difference between total and direct causal effects. Some features affect the outcome only through mediators.
Mathematical Model: $$\text{exposure} \sim N(0,1), \quad \text{direct} \sim N(0,1)$$ $$\text{mediator} = 0.8 \cdot \text{exposure} + 0.6 \cdot \text{direct} + \varepsilon_m$$ $$Y = 1.5 \cdot \text{mediator} + 0.5 \cdot \text{direct} + \varepsilon$$ where \(\varepsilon_m \sim N(0, 0.3^2)\) and \(\varepsilon \sim N(0, 0.2^2)\).
Feature Properties:
exposure: Has no direct effect on y, only through mediator (total effect = 1.2)
mediator: Mediates the effect of exposure on y
direct: Has both direct effect on y and effect on mediator
noise: No causal relationship to y
Causal Structure: exposure → mediator → y ← direct → mediator
Expected Behavior:
PFI: Shows total effects (exposure appears important)
CFI: Shows direct effects (exposure appears less important when conditioning on mediator)
RFI with mediator: Should show direct effects similar to CFI
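The distinction between total and direct effects can be checked numerically from the stated equations. The following is a hedged base-R sketch, not the package's implementation: linear regression coefficients stand in for "importance", the noise feature is omitted, and the sample size and seed are arbitrary choices. From the model, the total effect of exposure is 0.8 · 1.5 = 1.2 and the total effect of direct is 0.5 + 0.6 · 1.5 = 1.4.

```r
# Base-R re-implementation of the mediated DGP from the equations above
# (illustrative sketch; the "noise" feature is left out).
set.seed(1)
n <- 5000

exposure <- rnorm(n)
direct   <- rnorm(n)
mediator <- 0.8 * exposure + 0.6 * direct + rnorm(n, sd = 0.3)
y        <- 1.5 * mediator + 0.5 * direct + rnorm(n, sd = 0.2)

# Total effects: leave the mediator out, so its contribution is
# absorbed by exposure and direct.
total_fit <- coef(lm(y ~ exposure + direct))

# Direct effects: adjust for the mediator, which blocks the indirect path.
direct_fit <- coef(lm(y ~ exposure + direct + mediator))

print(round(total_fit, 2))
print(round(direct_fit, 2))
```

Without the mediator, exposure looks important; once the mediator is adjusted for, its coefficient should shrink towards zero, which is the pattern the list above describes for PFI versus CFI.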
Confounding DGP: This DGP includes a confounder that affects both the features and the outcome. It uses simple coefficients for easy interpretation.
Mathematical Model: $$H \sim N(0,1)$$ $$X_1 = H + \varepsilon_1, \quad X_2 = H + \varepsilon_2$$ $$\text{proxy} = H + \varepsilon_p, \quad \text{independent} \sim N(0,1)$$ $$Y = H + 0.5 \cdot X_1 + 0.5 \cdot X_2 + \text{independent} + \varepsilon$$ where all \(\varepsilon \sim N(0, 0.5^2)\) independently.
Model Structure:
Confounder H ~ N(0,1) (potentially unobserved; not included among the task's features when hidden = TRUE)
x1 = H + noise, x2 = H + noise (both affected by confounder)
proxy = H + noise (noisy measurement of confounder)
independent ~ N(0,1) (truly independent)
y = H + 0.5x1 + 0.5x2 + independent + noise
Expected Behavior:
PFI: Will show inflated importance for x1 and x2 due to confounding
CFI: Should partially account for confounding through conditional sampling
RFI conditioning on confounder/proxy: Should reduce confounding bias
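To make the confounding bias concrete, the model above can be simulated in base R and the apparent effect of x1 compared under different adjustment sets. This is a sketch under stated assumptions: regression slopes are used as a simple proxy for importance (they are not PFI, CFI, or RFI scores), the variable name h for the confounder is hypothetical, and the sample size and seed are arbitrary.

```r
# Base-R re-implementation of the confounding DGP from the equations above
# (illustrative sketch; not the code used inside sim_dgp_confounded()).
set.seed(1)
n <- 5000

h           <- rnorm(n)              # confounder H
x1          <- h + rnorm(n, sd = 0.5)
x2          <- h + rnorm(n, sd = 0.5)
proxy       <- h + rnorm(n, sd = 0.5)  # noisy measurement of H
independent <- rnorm(n)
y <- h + 0.5 * x1 + 0.5 * x2 + independent + rnorm(n, sd = 0.5)

# Unadjusted slope: absorbs the effect of H and of x2 (both correlated with x1).
slope_marginal <- coef(lm(y ~ x1))[["x1"]]

# Adjusting for the proxy removes part of the confounding bias.
slope_proxy <- coef(lm(y ~ x1 + x2 + independent + proxy))[["x1"]]

# Adjusting for the confounder itself recovers the structural coefficient 0.5.
slope_adjusted <- coef(lm(y ~ x1 + x2 + independent + h))[["x1"]]

print(round(c(marginal = slope_marginal,
              with_proxy = slope_proxy,
              with_confounder = slope_adjusted), 2))
```

The ordering marginal > with_proxy > with_confounder reflects why conditioning on the proxy is expected to reduce, but not eliminate, the bias, while conditioning on the confounder removes it.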
Interaction Effects DGP: This DGP demonstrates a pure interaction effect where features have no main effects.
Mathematical Model: $$Y = 2 \cdot X_1 \cdot X_2 + X_3 + \varepsilon$$ where \(X_j \sim N(0,1)\) independently and \(\varepsilon \sim N(0, 0.5^2)\).
Feature Properties:
x1, x2: Independent features with ONLY interaction effect (no main effects)
x3: Independent feature with main effect only
noise1, noise2: No causal effects
Expected Behavior:
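PFI: If the fitted model captures the interaction, permuting x1 or x2 breaks the product term, so both should receive substantial importance even though neither has a main effect
CFI: Because all features are independent, conditional sampling coincides with marginal sampling, so results should be similar to PFI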
Ground truth: x1 and x2 are important ONLY through their interaction
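The statement that x1 and x2 have no main effects can be verified directly from the model. The sketch below is a base-R illustration written from the equation above, not the package's implementation; the noise features are omitted, and the sample size and seed are arbitrary. It checks the marginal correlation of x1 with y, a main-effects-only linear model, and a model that includes the product term.

```r
# Base-R re-implementation of the interaction DGP from the equation above
# (illustrative sketch; noise1 and noise2 are left out).
set.seed(1)
n <- 5000

x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 2 * x1 * x2 + x3 + rnorm(n, sd = 0.5)

# Marginally, x1 is (nearly) uncorrelated with y.
cor_x1_y <- cor(x1, y)

# A model with main effects only finds almost nothing for x1 and x2.
main_only <- coef(lm(y ~ x1 + x2 + x3))

# Adding the product term x1:x2 recovers the interaction coefficient.
with_interaction <- coef(lm(y ~ x1 * x2 + x3))

print(round(cor_x1_y, 3))
print(round(main_only, 2))
print(round(with_interaction, 2))
```

Any learner that cannot represent the product term will therefore see x1 and x2 as uninformative, so the choice of learner matters when comparing importance methods on this task.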
Independent Features DGP: This is a baseline scenario where all features are independent and their effects are additive. All importance methods should give similar results.
Mathematical Model: $$Y = 2.0 \cdot X_1 + 1.0 \cdot X_2 + 0.5 \cdot X_3 + \varepsilon$$ where \(X_j \sim N(0,1)\) independently and \(\varepsilon \sim N(0, 0.2^2)\).
Feature Properties:
important1-3: Independent features with different effect sizes
unimportant1-2: Independent noise features with no effect
Expected Behavior:
All methods: Should rank features consistently by their true effect sizes
Ground truth: important1 > important2 > important3 > unimportant1,2 ≈ 0
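The expected ranking can be illustrated with a hand-rolled permutation importance. This is a minimal sketch and not xplainfi's PFI implementation: it uses the true regression function in place of a fitted learner, a single permutation per feature, and mean squared error as the loss, all of which are simplifying assumptions. For an additive linear model with independent unit-variance features, permuting feature j increases the expected MSE by 2 · β_j², so the importances should be ordered by effect size.

```r
# Hand-rolled permutation importance on the independent-features DGP
# (illustrative sketch; uses the true function instead of a trained model).
set.seed(1)
n <- 5000

feature_names <- c("important1", "important2", "important3",
                   "unimportant1", "unimportant2")
beta <- c(2.0, 1.0, 0.5, 0, 0)

X <- matrix(rnorm(n * length(beta)), ncol = length(beta),
            dimnames = list(NULL, feature_names))
y <- drop(X %*% beta) + rnorm(n, sd = 0.2)

predict_true <- function(X) drop(X %*% beta)
mse <- function(y, yhat) mean((y - yhat)^2)

baseline <- mse(y, predict_true(X))

# Importance of a feature = increase in MSE after shuffling its column.
pfi <- vapply(feature_names, function(j) {
  X_perm <- X
  X_perm[, j] <- sample(X_perm[, j])
  mse(y, predict_true(X_perm)) - baseline
}, numeric(1))

print(round(pfi, 2))
```

Because the features are independent, sampling from the marginal or the conditional distribution amounts to the same thing here, which is why all methods are expected to agree on this baseline.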
Functions
sim_dgp_correlated(): Correlated features demonstrating PFI's limitations
sim_dgp_mediated(): Mediated effects showing direct vs total importance
sim_dgp_confounded(): Confounding scenario for conditional sampling
sim_dgp_interactions(): Interaction effects between features
sim_dgp_independent(): Independent features baseline scenario
References
Ewald, Fiona Katharina, Bothmann, Ludwig, Wright, Marvin N., Bischl, Bernd, Casalicchio, Giuseppe, König, Gunnar (2024). “A Guide to Feature Importance Methods for Scientific Inference.” In Longo, Luca, Lapuschkin, Sebastian, Seifert, Christin (eds.), Explainable Artificial Intelligence, 440–464. ISBN 978-3-031-63797-1, doi:10.1007/978-3-031-63797-1_22.
Examples
task = sim_dgp_correlated(200)
task$data()
#> y x1 x2 x3 x4
#> <num> <num> <num> <num> <num>
#> 1: -1.7322190 -0.60456326 -0.6061328 -0.5926478 1.28059571
#> 2: 0.6484470 0.54272676 0.5378335 -0.3739125 1.05379921
#> 3: -2.6339016 -2.12811065 -2.1824885 1.4839135 -1.15881591
#> 4: 2.4829718 1.14827217 1.0964226 0.1198225 2.05058602
#> 5: -1.3048091 -1.13363444 -1.1593766 0.9747457 -0.04264243
#> ---
#> 196: -2.2081298 -1.32872128 -1.2972253 0.5468677 0.26051622
#> 197: -0.5381338 0.26173594 0.2041503 -1.0758494 -0.73128217
#> 198: -1.2004834 -0.85525389 -0.7914119 0.2770612 1.63629555
#> 199: -0.4296322 -0.05907189 -0.1112073 -0.3672032 -1.11308245
#> 200: -1.2755750 -0.98834804 -0.9979079 0.3806760 1.21603999
task = sim_dgp_mediated(200)
task$data()
#> y direct exposure mediator noise
#> <num> <num> <num> <num> <num>
#> 1: 0.2682214 0.56191892 -0.10254907 0.02007031 -0.94036897
#> 2: 0.3668025 -0.31209517 0.63128573 0.51706196 0.21576204
#> 3: -1.0929475 1.73243974 -2.72664296 -1.27732973 -0.65017597
#> 4: 0.4884491 -1.11063673 1.56917512 0.62158131 0.86256694
#> 5: -1.5633866 -0.91107455 -0.56755202 -0.59348945 0.02049361
#> ---
#> 196: -0.4620635 -0.07556055 -0.32668610 -0.44129661 1.60055080
#> 197: -1.8432169 -1.19647625 0.02769335 -0.74040304 -0.78024005
#> 198: -0.5464554 -0.17247334 -0.06216673 -0.23886068 1.53259721
#> 199: -1.5072177 -1.24870683 -0.11379067 -0.74733810 1.21986495
#> 200: -0.4064812 -0.40525625 -0.18539345 -0.08601852 0.42092297
# Hidden confounder scenario (traditional)
task_hidden = sim_dgp_confounded(200, hidden = TRUE)
task_hidden$feature_names # proxy available but not confounder
#> [1] "independent" "proxy" "x1" "x2"
# Observable confounder scenario
task_observed = sim_dgp_confounded(200, hidden = FALSE)
task_observed$feature_names # both confounder and proxy available
#> [1] "confounder" "independent" "proxy" "x1" "x2"
task = sim_dgp_interactions(200)
task$data()
#> y noise1 noise2 x1 x2 x3
#> <num> <num> <num> <num> <num> <num>
#> 1: 0.2882586 -0.37960553 -0.3739991 -0.25362947 0.59401240 1.0043550
#> 2: -7.8853924 1.39837208 -0.6964924 1.33486566 -3.30233616 0.2603352
#> 3: 0.2002939 -0.03480718 -0.4634789 -1.32191470 0.51263677 0.8332858
#> 4: 0.3009213 -0.18565220 -0.1751952 0.03945414 -0.97355571 0.9688906
#> 5: -4.5948816 -0.90781890 1.6837376 -2.70224420 0.95331530 0.5151201
#> ---
#> 196: -1.0920823 -0.23711502 -1.1875526 0.72957205 0.42989528 -1.1779584
#> 197: -0.6880263 0.48588642 -0.3728218 -0.65807045 -0.02781163 -1.1327006
#> 198: 1.5429039 -1.90272571 -0.3375957 0.89850044 1.16877170 0.3568363
#> 199: 1.6566115 1.14957381 1.0715333 -0.82801376 -0.49177456 -0.2701985
#> 200: 0.6476449 0.93304157 0.3868809 -0.29954503 -0.42508208 -0.7609080
task = sim_dgp_independent(200)
task$data()
#> y important1 important2 important3 unimportant1 unimportant2
#> <num> <num> <num> <num> <num> <num>
#> 1: -0.0325243 -0.09121832 0.474863184 -0.2094373 0.81060399 1.52830646
#> 2: 2.1896063 -0.11532241 1.357043026 1.8807642 0.14970107 -0.06600785
#> 3: 0.6906556 0.02936029 -0.006302365 1.5601607 0.26935084 1.27552439
#> 4: -0.8811282 -0.46736058 -0.054498032 0.3431985 0.02028685 -0.59280356
#> 5: -0.2798828 -0.38121185 0.667607562 -0.2301627 2.08106071 0.92570012
#> ---
#> 196: 0.1914504 0.25542333 -0.290536884 0.1000515 -1.18489353 0.15701063
#> 197: 4.0914494 1.32830943 1.070912320 0.4919225 2.37333159 -0.72147874
#> 198: -0.2997712 -0.39045865 0.301001003 0.3766047 -0.28165309 -0.06266584
#> 199: 1.6100031 0.60837890 -0.490669365 1.5215956 -0.40423292 -0.79822504
#> 200: -3.7753501 -1.94203018 -0.003865373 0.4793742 -1.54926845 0.33288519