
Example Data: Interaction Effects

To illustrate the key differences between LOCO and LOCI, we’ll use a data generating process with interaction effects:

\[y = 2 \cdot x_1 \cdot x_2 + x_3 + \epsilon\]

where \(\epsilon \sim N(0, 0.5^2)\) and all features \(x_1, x_2, x_3, \text{noise}_1, \text{noise}_2 \sim N(0,1)\) are independent.

Key characteristics:

  • \(x_1, x_2\): Have no individual main effects; they contribute only through their interaction with each other
  • \(x_3\): Has a direct main effect on \(y\)
  • \(\text{noise}_1, \text{noise}_2\): Pure noise variables with no effect on \(y\)

This setup highlights how LOCO and LOCI handle interaction effects differently.
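
For reference, the same process can be simulated by hand. The sketch below is purely illustrative; the sim_dgp_interactions() helper used later is the canonical way to create the task:

# Hand-rolled simulation of the DGP above (illustrative only;
# sim_dgp_interactions() below is the canonical generator)
set.seed(42)
n <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                  noise1 = rnorm(n), noise2 = rnorm(n))
dat$y <- 2 * dat$x1 * dat$x2 + dat$x3 + rnorm(n, sd = 0.5)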

Leave-One-Covariate-Out (LOCO)

LOCO measures feature importance by comparing model performance with and without each feature. For each feature, the learner is retrained without that feature and the performance difference indicates the feature’s importance.

For feature \(j\), LOCO is calculated as the difference in expected loss of the model fit without the feature and the full model: \[\text{LOCO}_j = \mathbb{E}(L(Y, f_{-j}(X_{-j}))) - \mathbb{E}(L(Y, f(X)))\]

Higher values indicate more important features (larger performance drop when removed).
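
To make the formula concrete, a hand-rolled LOCO for a single feature might look as follows, using the simulated dat from above and a simple train/test split (the LOCO class used below handles resampling, refitting, and aggregation properly):

# Hand-rolled LOCO for x3 (illustrative sketch; reuses `dat` from above)
idx   <- sample(nrow(dat), 0.7 * nrow(dat))
train <- dat[idx, ]
test  <- dat[-idx, ]

mse <- function(fit) mean((test$y - predict(fit, newdata = test))^2)

fit_full  <- lm(y ~ ., data = train)       # full model
fit_minus <- lm(y ~ . - x3, data = train)  # refit without x3

loco_x3 <- mse(fit_minus) - mse(fit_full)  # performance drop when x3 is removed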

task <- sim_dgp_interactions(n = 500)
learner <- lrn("regr.lm")
measure <- msr("regr.mse")

loco <- LOCO$new(
  task = task,
  learner = learner,
  measure = measure
)
#>  No <Resampling> provided, using holdout resampling with default ratio.

loco$compute()
#> Key: <feature>
#>    feature  importance
#>     <char>       <num>
#> 1:  noise1  0.02343712
#> 2:  noise2  0.01808013
#> 3:      x1 -0.14135998
#> 4:      x2  0.11463314
#> 5:      x3  0.65082799

Leave-One-Covariate-In (LOCI)

LOCI measures feature importance by training models with only each individual feature and comparing their performance to a featureless (baseline) model. This shows how much predictive power each feature provides on its own, above and beyond the optimal constant prediction.

For feature \(j\), LOCI is calculated as the difference in expected loss between the featureless (constant) model and the model including only that feature: \[\text{LOCI}_j = \mathbb{E}(L(Y, f_{\emptyset})) - \mathbb{E}(L(Y, f_j(X_{j})))\]

Higher values indicate more important features (better individual performance compared to baseline).
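
Analogously, a hand-rolled LOCI compares a model containing only the feature against an intercept-only (featureless) model; the sketch below reuses the train/test split and mse() helper from the LOCO sketch above:

# Hand-rolled LOCI for x3 (illustrative sketch)
fit_const <- lm(y ~ 1, data = train)   # featureless baseline: predicts mean(y)
fit_only  <- lm(y ~ x3, data = train)  # model containing only x3

loci_x3 <- mse(fit_const) - mse(fit_only)  # gain over the constant prediction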

loci <- LOCI$new(
  task = task,
  learner = learner,
  measure = measure
)
#>  No <Resampling> provided, using holdout resampling with default ratio.

loci$compute()
#> Key: <feature>
#>    feature  importance
#>     <char>       <num>
#> 1:  noise1  0.09848467
#> 2:  noise2 -0.05579254
#> 3:      x1 -0.44010340
#> 4:      x2  0.11155710
#> 5:      x3  0.58359019

Understanding the Results

LOCO results interpretation:

  • \(x_1\) and \(x_2\) show low importance because removing either one eliminates the interaction (the \(x_1 \cdot x_2\) term can no longer be captured)
  • \(x_3\) shows higher importance due to its direct main effect
  • This demonstrates LOCO’s limitation with interaction effects

LOCI results interpretation:

  • \(x_1\) and \(x_2\) show low importance because individually they have no predictive power
  • \(x_3\) shows high importance because it alone can predict \(y\) reasonably well
  • This shows how LOCI focuses on individual feature contributions

Multiple Refits

Just as PFI offers iters_perm for multiple permutation iterations, LOCO and LOCI support iters_refit for multiple refit iterations per resampling iteration, which also yields standard deviations for the importance scores:

loco_multi = LOCO$new(
  task = task,
  learner = learner,
  measure = measure,
  resampling = rsmp("cv", folds = 3),
  iters_refit = 3L
)

loco_multi$compute()
#> Key: <feature>
#>    feature  importance         sd
#>     <char>       <num>      <num>
#> 1:  noise1 -0.01212711 0.05104637
#> 2:  noise2 -0.02467126 0.05152184
#> 3:      x1 -0.08943524 0.08786365
#> 4:      x2 -0.07556117 0.29284316
#> 5:      x3  0.82401461 0.34619308

# Check individual scores with multiple refits
loco_multi$scores[1:10, ] |>
  knitr::kable(digits = 4, caption = "LOCO scores per refit and resampling fold")
LOCO scores per refit and resampling fold

feature   iter_rsmp   iter_refit   regr.mse_orig   regr.mse_post   importance
noise1            1            1          3.8491          3.7693      -0.0798
noise1            1            2          3.8491          3.7693      -0.0798
noise1            1            3          3.8491          3.7693      -0.0798
noise1            2            1          4.2981          4.3134       0.0153
noise1            2            2          4.2981          4.3134       0.0153
noise1            2            3          4.2981          4.3134       0.0153
noise1            3            1          5.3040          5.3321       0.0281
noise1            3            2          5.3040          5.3321       0.0281
noise1            3            3          5.3040          5.3321       0.0281
noise2            1            1          3.8491          3.7563      -0.0928
Comparing LOCO and LOCI

# Combine results for comparison
importance_combined <- rbind(
  loco$importance[, method := "LOCO"],
  loci$importance[, method := "LOCI"]
)

importance_combined <- importance_combined |>
  dcast(feature ~ method, value.var = "importance")

importance_combined |>
  knitr::kable(digits = 4, caption = "LOCO vs LOCI importance scores")
LOCO vs LOCI importance scores

feature      LOCI      LOCO
noise1     0.0985    0.0234
noise2    -0.0558    0.0181
x1        -0.4401   -0.1414
x2         0.1116    0.1146
x3         0.5836    0.6508

Interpreting the results:

  • LOCO: Higher values indicate more important features (larger performance drop when removed)
  • LOCI: Higher values indicate more important features (better individual performance compared to baseline)
    • Positive values: feature performs better than featureless baseline
    • Negative values: feature performs worse than featureless baseline

Micro-averaged vs Macro-averaged LOCO

The examples above use the micro-averaged approach (obs_loss = FALSE), computing importance as the difference of expected losses using the measure’s default aggregation: \(\mathbb{E}(L(Y, f_{-j}(X_{-j}))) - \mathbb{E}(L(Y, f(X)))\).

LOCO can also be computed using the macro-averaged approach (obs_loss = TRUE), which follows the original paper formulation by aggregating the observation-wise loss differences with the median: \(\mathrm{median}\left(\{L(y_i, f_{-j}(x_{i,-j})) - L(y_i, f(x_i))\}_{i=1}^n\right)\).
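
The difference between the two orders of aggregation can be seen on a toy vector of observation-wise losses (made-up numbers, purely for illustration):

# Toy illustration of micro- vs. macro-averaging (made-up loss values)
loss_full    <- c(0.2, 0.5, 1.0, 4.0)  # per-observation losses of the full model
loss_reduced <- c(0.3, 0.7, 1.1, 9.0)  # per-observation losses without feature j

# Micro-averaged: aggregate each model's losses first, then take the difference
mean(loss_reduced) - mean(loss_full)
#> [1] 1.35

# Macro-averaged: per-observation differences first, then aggregate (here: median)
median(loss_reduced - loss_full)
#> [1] 0.15

The single outlying observation dominates the micro-averaged value, whereas the median keeps its influence in check.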

The macro-averaged approach provides access to individual observation-level differences:

# Use macro-averaged approach with median aggregation
loco_obs <- LOCO$new(
  task = task,
  learner = learner,
  measure = msr("regr.mae"),  # MAE supports obs_loss
  obs_loss = TRUE,
  aggregation_fun = median
)
#>  No <Resampling> provided, using holdout resampling with default ratio.

loco_obs$compute()
#> Key: <feature>
#>    feature   importance
#>     <char>        <num>
#> 1:  noise1  0.001274411
#> 2:  noise2  0.003229655
#> 3:      x1  0.001995886
#> 4:      x2 -0.077958164
#> 5:      x3  0.315709247

# Compare aggregated results
rbind(
  loco$importance[, method := "Micro-averaged (measure default)"],
  loco_obs$importance[, method := "Macro-averaged (median of diffs)"]
) |>
  dcast(feature ~ method, value.var = "importance") |>
  knitr::kable(digits = 4, caption = "Comparison of LOCO aggregation approaches")
Comparison of LOCO aggregation approaches

feature   Macro-averaged (median of diffs)   Micro-averaged (measure default)
noise1                              0.0013                             0.0234
noise2                              0.0032                             0.0181
x1                                  0.0020                            -0.1414
x2                                 -0.0780                             0.1146
x3                                  0.3157                             0.6508

The $obs_losses field provides detailed observation-level information:

# Examine observation-wise data for first feature
feature_data <- loco_obs$obs_losses[feature == "x3"]

# Show structure
str(feature_data)
#> Classes 'data.table' and 'data.frame':   167 obs. of  10 variables:
#>  $ row_ids         : int  3 4 5 6 8 9 11 12 13 15 ...
#>  $ feature         : chr  "x3" "x3" "x3" "x3" ...
#>  $ iteration       : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ iter_refit      : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ truth           : num  2.831 -0.293 -2.552 0.476 8.165 ...
#>  $ response_ref    : num  0.468 0.0826 -2.7257 0.7273 1.0292 ...
#>  $ response_feature: num  0.828 0.247 -0.983 -0.334 -0.975 ...
#>  $ loss_ref        : num  2.363 0.376 0.173 0.251 7.136 ...
#>  $ loss_feature    : num  2.003 0.541 1.569 0.81 9.14 ...
#>  $ obs_diff        : num  -0.36 0.165 1.396 0.559 2.004 ...
#>  - attr(*, ".internal.selfref")=<externalptr>

# Display first few observations
head(feature_data) |>
  knitr::kable(digits = 3, caption = "Observation-wise losses for x3")
Observation-wise losses for x3

row_ids  feature  iteration  iter_refit    truth  response_ref  response_feature  loss_ref  loss_feature  obs_diff
      3       x3          1           1    2.831         0.468             0.828     2.363         2.003    -0.360
      4       x3          1           1   -0.293         0.083             0.247     0.376         0.541     0.165
      5       x3          1           1   -2.552        -2.726            -0.983     0.173         1.569     1.396
      6       x3          1           1    0.476         0.727            -0.334     0.251         0.810     0.559
      8       x3          1           1    8.165         1.029            -0.975     7.136         9.140     2.004
      9       x3          1           1   -0.124         0.415            -0.205     0.539         0.081    -0.458

Each row contains:

  • row_ids: Test set observation identifiers
  • truth: Original target values
  • response_ref: Full model predictions
  • response_feature: Reduced model predictions (without this feature)
  • loss_ref / loss_feature: Individual losses from each model
  • obs_diff: Observation-wise importance differences
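
Assuming obs_diff is simply loss_feature minus loss_ref (which matches the rows shown above, e.g. 2.003 - 2.363 = -0.360), the relationship can be checked directly:

# Sanity check: obs_diff should equal loss_feature - loss_ref per observation
all.equal(feature_data$obs_diff,
          feature_data$loss_feature - feature_data$loss_ref)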

Note that for regr.mse, which aggregates observation-wise losses with the mean, using aggregation_fun = mean makes both approaches yield the same result, just computed in slightly different ways.
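
This equivalence is simply the linearity of the mean, in the notation used above:

\[\frac{1}{n}\sum_{i=1}^n \left( L(y_i, f_{-j}(x_{i,-j})) - L(y_i, f(x_i)) \right) = \frac{1}{n}\sum_{i=1}^n L(y_i, f_{-j}(x_{i,-j})) - \frac{1}{n}\sum_{i=1}^n L(y_i, f(x_i))\]

The following comparison confirms this numerically: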

set.seed(1)
resampling = rsmp("holdout")
resampling$instantiate(task)

loco_orig <- LOCO$new(
  task = task,
  learner = learner,
  resampling = resampling,
  measure = measure
)

loco_obsloss <- LOCO$new(
  task = task,
  learner = learner,
  resampling = resampling,
  measure = measure,
  obs_loss = TRUE,
  aggregation_fun = mean
)

rbind(
  loco_orig$compute()[, method := "micro-averaged"],
  loco_obsloss$compute()[, method := "macro-averaged"]
) |>
  dcast(feature ~ method, value.var = "importance")
#> Key: <feature>
#>    feature macro-averaged micro-averaged
#>     <char>          <num>          <num>
#> 1:  noise1    -0.06116845    -0.06116845
#> 2:  noise2    -0.03037297    -0.03037297
#> 3:      x1    -0.04521521    -0.04521521
#> 4:      x2     0.02801885     0.02801885
#> 5:      x3     1.10230314     1.10230314

Classification Example with Probability Access

For classification tasks, the obs_losses field stores predicted classes while preserving access to probabilities from the original mlr3 prediction objects:

# Binary classification task
task_classif <- tgen("circle", d = 6)$generate(n = 200)

# Learner with probability predictions (required for accessing probabilities later)
learner_classif <- lrn("classif.rpart", predict_type = "prob")

# LOCO with observation-wise losses
loco_classif <- LOCO$new(
  task = task_classif,
  learner = learner_classif,
  measure = msr("classif.ce"),  # Classification error supports obs_loss (zero-one in this case)
  obs_loss = TRUE,
  features = c("x1", "x2"),  # Focus on just two features for clarity
  aggregation_fun = mean
)
#>  No <Resampling> provided, using holdout resampling with default ratio.

loco_classif$compute()
#> Key: <feature>
#>    feature importance
#>     <char>      <num>
#> 1:      x1 0.04477612
#> 2:      x2 0.00000000

# Show classification importance
loco_classif$importance |>
  knitr::kable(digits = 4, caption = "LOCO importance for classification")
LOCO importance for classification

feature   importance
x1            0.0448
x2            0.0000

Examine the observation-wise classification results:

# Look at observation-wise data
classif_data <- head(loco_classif$obs_losses, 8)

classif_data |>
  knitr::kable(caption = "Observation-wise classification losses")
Observation-wise classification losses

row_ids  feature  iteration  iter_refit  truth  response_ref  response_feature  loss_ref  loss_feature  obs_diff
      5       x1          1           1      B             B                 B         0             0         0
      5       x2          1           1      B             B                 B         0             0         0
     11       x1          1           1      B             B                 B         0             0         0
     11       x2          1           1      B             B                 B         0             0         0
     14       x1          1           1      B             B                 B         0             0         0
     14       x2          1           1      B             B                 B         0             0         0
     15       x1          1           1      B             B                 B         0             0         0
     15       x2          1           1      B             B                 B         0             0         0

For classification, the response columns contain predicted classes (factors). Note that $resample_result only contains the reference model's predictions (the full model for LOCO, the featureless model for LOCI), not those of the feature-specific models:

# Access feature-specific predictions (available with obs_loss = TRUE)
head(loco_classif$predictions)
#>    feature iteration iter_refit          prediction
#>     <char>     <int>      <int>              <list>
#> 1:      x1         1          1 <PredictionClassif>
#> 2:      x2         1          1 <PredictionClassif>

# Get probabilities for a specific feature and iteration  
if (nrow(loco_classif$predictions) > 0) {
  feature_pred <- loco_classif$predictions[1, ]$prediction[[1]]
  
  # Access both predicted classes and probabilities
  # Note: probabilities are only available because we used predict_type = "prob"
  head(feature_pred$response)  # Predicted classes
  head(feature_pred$prob)      # Probability matrix
}
#>               A         B
#> [1,] 0.06060606 0.9393939
#> [2,] 0.06060606 0.9393939
#> [3,] 0.11111111 0.8888889
#> [4,] 0.09523810 0.9047619
#> [5,] 0.80851064 0.1914894
#> [6,] 0.06060606 0.9393939

# Compare with reference model (full model for LOCO)
ref_pred <- loco_classif$resample_result$prediction(1)
head(ref_pred$prob)  # Reference model probabilities
#>               A          B
#> [1,] 0.06060606 0.93939394
#> [2,] 0.06060606 0.93939394
#> [3,] 0.11111111 0.88888889
#> [4,] 0.09523810 0.90476190
#> [5,] 0.91891892 0.08108108
#> [6,] 0.06060606 0.93939394

The $predictions field provides complete access to feature-specific prediction objects:

  • feature: Which feature was left out (LOCO) or left in (LOCI)
  • iteration: Resampling iteration
  • iter_refit: Refit iteration within resampling
  • prediction: Complete mlr3 prediction object with responses, probabilities, etc.

This gives you full access to:

  • Feature-specific probabilities: prediction$prob
  • Feature-specific predicted classes: prediction$response
  • Additional prediction data: Any other fields the learner provides
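
For example, each stored prediction object can be scored with additional compatible measures via the standard mlr3 Prediction$score() method; the measures chosen below are just examples:

# Score each feature-specific prediction with extra measures (illustrative)
preds <- loco_classif$predictions
data.table(
  feature = preds$feature,
  acc     = sapply(preds$prediction, function(p) p$score(msr("classif.acc"))),
  logloss = sapply(preds$prediction, function(p) p$score(msr("classif.logloss")))
)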

The observation-wise approach is particularly useful for:

  • Robust aggregation: Using median instead of mean to reduce outlier influence
  • Individual analysis: Examining which specific observations drive importance scores
  • Distribution analysis: Understanding the variability in feature effects across observations
  • Model debugging: Identifying problematic observations or prediction patterns