Benchmarks

Systematic evaluation of counterfactual explanation methods using standardized metrics and datasets.

Overview

The benchmarking framework enables:

  • Fair comparison of different CF methods
  • Reproducible experiments with Hydra configs (see the sketch after this list)
  • Comprehensive metrics covering multiple quality dimensions
  • Automated logging via MLflow
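
As a rough sketch of what a config-driven run can look like: the entrypoint below assumes a conf/config.yaml describing the dataset, CF method, metrics, and seed, plus a hypothetical run_benchmark helper; none of these names are the framework's actual API. Re-running with the same config reproduces the same experiment.

import hydra
from omegaconf import DictConfig, OmegaConf


def run_benchmark(cfg: DictConfig) -> dict:
    # Placeholder for the actual benchmark loop (hypothetical helper).
    return {"validity": 0.0, "proximity_l2": 0.0}


@hydra.main(config_path="conf", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # The resolved config fully determines the run, which is what makes
    # the experiment reproducible.
    print(OmegaConf.to_yaml(cfg))
    print(run_benchmark(cfg))


if __name__ == "__main__":
    main()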

Sections

Section              Description
Evaluation Metrics   Definitions and usage of all available metrics
Benchmark Results    Comparison tables and analysis
Running Benchmarks   How to reproduce and extend benchmarks

Key Metrics

Counterfactual quality is assessed across five dimensions, each sketched informally after the diagram:

flowchart TD
    A[CF Quality] --> B[Validity]
    A --> C[Proximity]
    A --> D[Sparsity]
    A --> E[Plausibility]
    A --> F[Diversity]

    B --> B1[Does it change prediction?]
    C --> C1[How close to original?]
    D --> D1[How many features changed?]
    E --> E1[Is it realistic?]
    F --> F1[Are CFs diverse?]
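
Informally, these dimensions are typically computed as follows; the helper functions below are an illustrative sketch, not the library's API. Plausibility is usually scored with a density model (such as the generative model passed to the orchestrator in the example below), and diversity with pairwise distances among the counterfactuals.

import numpy as np


def validity(y_pred_cf: np.ndarray, y_target: np.ndarray) -> float:
    # Fraction of counterfactuals that actually reach the target class.
    return float(np.mean(y_pred_cf == y_target))


def proximity_l2(x_cf: np.ndarray, x_orig: np.ndarray) -> float:
    # Mean L2 distance between each counterfactual and its original.
    return float(np.mean(np.linalg.norm(x_cf - x_orig, axis=1)))


def sparsity(x_cf: np.ndarray, x_orig: np.ndarray, tol: float = 1e-6) -> float:
    # Mean number of features changed per counterfactual.
    return float(np.mean(np.sum(np.abs(x_cf - x_orig) > tol, axis=1)))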

Quick Benchmark

from counterfactuals.metrics import MetricsOrchestrator

# Initialize the orchestrator; flow_model is a pre-trained generative
# model (e.g. a normalizing flow) used here to score plausibility
orchestrator = MetricsOrchestrator(
    metrics=["validity", "proximity_l2", "sparsity", "plausibility"],
    gen_model=flow_model
)

# Compute metrics
results = orchestrator.compute(
    x_cfs=counterfactuals,
    x_origs=original_instances,
    y_targets=target_labels,
    classifier=model
)

for metric, value in results.items():
    print(f"{metric}: {value:.4f}")

Results Summary

Method     Validity   Proximity  Sparsity   Plausibility
PPCEF      0.95       0.82       0.71       0.89
DICE       0.98       0.75       0.65       0.62
GLOBE-CE   0.91       0.88       0.78       0.71

See Benchmark Results for complete comparisons.