
ShiftBench

A benchmark that stops hiding distribution shift. Train on 2015–2018, test on 2019–2022.

Slice Metrics · Reproducible · Model Card · CI-Eval

Results Explorer

Pre-computed results showing how model performance degrades under distribution shift.

Performance Under Shift

  • IID test (2015-18): 84%
  • Shifted test (2019-22): 77%
  • Performance gap: -7 points (-8.3% relative)
  • Calibration (ECE): 0.07

Key Insight: ViT-B/16 shows the smallest performance gap under temporal shift (-7 points), possibly because its learned representations transfer better. Even so, every model degrades, and a random split would hide that.

Reproduce These Results

git clone https://github.com/croesus245/shiftbench-benchmark && cd shiftbench-benchmark
pip install -r requirements.txt
python scripts/download_data.py      # Download ISIC data
python scripts/train.py --model vit  # Train model
python scripts/evaluate.py --split temporal  # Evaluate under shift

Why I Built This

I kept reading papers with 95%+ accuracy on skin lesion classification. Then I'd look closer: random train/test splits from the same year. No wonder they looked great.

I wanted to understand what actually happens when you train on old data and test on new data. Spoiler: it's worse than the papers suggest.

What I Actually Tried

  • Temporal split: Train on 2015-2018, test on 2019-2022. Every model loses accuracy on the later years.
  • Subgroup analysis: Checked performance by anatomical site as a proxy for different populations.
  • Calibration metrics: Accuracy alone is misleading when the model's confidence is wrong, so I also report ECE.
  • Uncertainty methods: Tried MC Dropout, ensembles, and temperature scaling. None fully solved it.
  • Documented failures: Including the things I tried that didn't work.
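
The temporal split in the first bullet can be sketched roughly as follows. This is an illustration only: the `Record` fields, the `iid_fraction` default, and the function name are hypothetical and are not taken from the repository's scripts.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    image_id: str  # hypothetical identifier
    year: int      # hypothetical acquisition year
    label: int


def temporal_split(records, train_end=2018, iid_fraction=0.1, seed=0):
    """Return (train, test_iid, test_shift).

    Records up to and including `train_end` are shuffled and divided into
    train and an in-distribution test set; later records form the shifted test set.
    """
    early = [r for r in records if r.year <= train_end]
    late = [r for r in records if r.year > train_end]
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(early)
    n_iid = int(len(early) * iid_fraction)
    return early[n_iid:], early[:n_iid], late
```

The point of the design is that no image from the later years can leak into training, which is exactly what a random split does not guarantee.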

Dataset

Source: ISIC 2019 Challenge — 25,331 dermoscopic images, 8 diagnostic categories. CC BY-NC 4.0 license. See datasheet for full provenance.

┌─────────────────────────────────────────────────────────────┐
│                        DATA PIPELINE                        │
├──────────────┬──────────────┬──────────────┬────────────────┤
│ ISIC 2019    │ Metadata     │ Shift        │ Stratified     │
│ (25K imgs)   │ (age, site)  │ Simulation   │ Splits         │
└──────────────┴──────────────┴──────────────┴────────────────┘
                               │
                               ▼
      Train: 18K  │  Test-IID: 2.5K  │  Test-Shift: 4.8K
                        
Train Set: 18,000 images
Test-IID: 2,500 images
Test-Shift: 4,831 images
Classes: Binary (lesion/no)

Results

Main Results (Test: 2019–2022)

Model            Accuracy  Balanced Acc  ECE
ResNet-50        0.78      0.71          0.12
EfficientNet-B4  0.81      0.74          0.09
ViT-B/16         0.84      0.77          0.07
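
For readers unfamiliar with the last two columns, here is a minimal sketch of how balanced accuracy and ECE are commonly computed. It assumes equal-width confidence bins and 10 bins, which may differ from what the evaluation script in the repository uses; the function names are illustrative.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes count as much as common ones."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(1 for i in idx if y_pred[i] == cls) / len(idx))
    return sum(recalls) / len(recalls)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over confidence bins.

    Bins are equal-width over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece
```

A classifier that always predicts the majority class on a 4:1 imbalanced set scores 0.8 accuracy but only 0.5 balanced accuracy, which is why both columns are reported.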

Temporal Degradation

Model            2019  2020  2021  2022
ResNet-50        0.82  0.79  0.76  0.74
EfficientNet-B4  0.84  0.82  0.79  0.78
ViT-B/16         0.87  0.85  0.82  0.81

All models degrade on newer data. From 2019 to 2022, ViT-B/16 and EfficientNet-B4 each drop 6 points, versus 8 points for ResNet-50.
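
The 2019-to-2022 drops can be recomputed directly from the table. The short script below uses only the numbers shown there and reports both the absolute drop (in points) and the drop relative to the 2019 score; the helper name `degradation` is illustrative.

```python
# Per-year scores copied from the Temporal Degradation table (2019..2022).
yearly = {
    "ResNet-50": [0.82, 0.79, 0.76, 0.74],
    "EfficientNet-B4": [0.84, 0.82, 0.79, 0.78],
    "ViT-B/16": [0.87, 0.85, 0.82, 0.81],
}


def degradation(scores):
    """Return (absolute drop, drop relative to the first year)."""
    first, last = scores[0], scores[-1]
    absolute = first - last
    return absolute, absolute / first


for model, scores in yearly.items():
    absolute, relative = degradation(scores)
    print(f"{model}: {absolute * 100:.0f} points, {relative * 100:.1f}% relative")
```

Reporting both figures avoids ambiguity: a "7% drop" can mean 7 percentage points or 7% of the starting score, and the two differ.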

Demographic Slice Analysis

Model            Light  Medium  Dark  Gap (Light - Dark)
ResNet-50        0.81   0.77    0.73  8 pts
EfficientNet-B4  0.84   0.80    0.78  6 pts
ViT-B/16         0.86   0.83    0.81  5 pts

All models underperform on darker skin tones. This is a known limitation of the ISIC dataset, which skews toward lighter skin tones (see Limitations).
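
A slice table like the one above can be produced by grouping predictions by a slice label and comparing the best and worst groups. The sketch below is a generic illustration; the group names and function names are hypothetical, and how images are assigned to groups is outside its scope.

```python
from collections import defaultdict


def slice_accuracy(groups, y_true, y_pred):
    """Accuracy computed separately for each slice label in `groups`."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, truth, pred in zip(groups, y_true, y_pred):
        totals[group] += 1
        hits[group] += int(truth == pred)
    return {group: hits[group] / totals[group] for group in totals}


def worst_gap(per_slice):
    """Difference between the best and worst performing slices."""
    return max(per_slice.values()) - min(per_slice.values())


# Toy usage with made-up predictions.
per_slice = slice_accuracy(
    ["light", "light", "dark", "dark"],
    [1, 0, 1, 0],
    [1, 0, 1, 1],
)
print(per_slice, worst_gap(per_slice))
```

With small slices the per-group estimate is noisy, so the gap is best read alongside the number of images in each group.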

Negative Results

Mixup Augmentation

No accuracy improvement. Hurt calibration (ECE increased by 0.03). Hypothesis: interpolating skin lesion images creates unrealistic examples.
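
For context, mixup blends two training examples and their labels with a weight drawn from a Beta distribution. The sketch below shows the standard formulation on flat lists of pixel values; `alpha=0.2` is an assumed default, not a value reported for these experiments.

```python
import random


def mixup_pair(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two examples: lam * first + (1 - lam) * second, lam ~ Beta(alpha, alpha).

    x1/x2 are flat lists of pixel values, y1/y2 are one-hot label lists.
    """
    lam = rng.betavariate(alpha, alpha)
    mixed_x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    mixed_y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return mixed_x, mixed_y, lam
```

The blended image is a pixel-wise overlay of two lesions, which illustrates why the result may not resemble any real dermoscopic image.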

Heavy Augmentation

+2% accuracy, but ECE increased from 0.07 to 0.11. Model became overconfident on augmented distribution.

Focal Loss

Marginal improvement on rare classes (+1%), but degraded common class performance (-2%). Net negative.
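
For reference, binary focal loss scales cross-entropy by (1 - p_t)^gamma, so confidently correct examples contribute less. The sketch below omits the optional class-weighting term, and `gamma=2.0` is a common default rather than the setting used in these runs.

```python
import math


def focal_loss(p, y, gamma=2.0, eps=1e-12):
    """Binary focal loss for one example.

    p is the predicted probability of class 1, y is the true label (0 or 1).
    With gamma = 0 this reduces to ordinary cross-entropy.
    """
    p_t = p if y == 1 else 1.0 - p       # probability assigned to the true class
    p_t = min(max(p_t, eps), 1.0)        # guard against log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

Because easy examples are down-weighted, most of the gradient comes from hard or rare cases, which is consistent with the trade-off described above: gains on rare classes at some cost to common ones.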

Limitations

  • Fitzpatrick proxy is imperfect—derived from image luminance, not clinical annotation
  • ISIC data skews toward lighter skin tones (80%+ of dataset)
  • The temporal split may conflate acquisition changes (e.g., new scanners) with shifts in the underlying patient population
  • Not for clinical deployment—research benchmark only
  • Limited to dermoscopy images; doesn't generalize to other imaging modalities

Artifacts