ShiftBench

A benchmark that stops hiding distribution shift. Train on 2015–2018, test on 2019–2022.

Slice Metrics · Reproducible · Model Card · CI-Eval

Interactive Results Explorer

Pre-computed results — See how model performance degrades under distribution shift.

Temporal Degradation: -8.3% drop
IID Test (2015-18): 84%
Shifted Test (2019-22): 77%
Performance Gap: -7%
Calibration (ECE): 0.07

Key Insight: ViT-B/16 shows the smallest performance gap (-7%) under temporal shift, likely due to better learned representations. However, all models degrade significantly—random splits would hide this.

Reproduce These Results

git clone https://github.com/croesus245/shiftbench-benchmark && cd shiftbench-benchmark
pip install -r requirements.txt
python scripts/download_data.py      # Download ISIC data
python scripts/train.py --model vit  # Train model
python scripts/evaluate.py --split temporal  # Evaluate under shift

Why I Built This

I kept reading papers with 95%+ accuracy on skin lesion classification. Then I'd look closer: random train/test splits from the same year. No wonder they looked great.

I wanted to understand what actually happens when you train on old data and test on new data. Spoiler: it's worse than the papers suggest.

What I Actually Tried

Temporal split: Train on 2015-2018, test on 2019-2022. The gap hurts.
Subgroup analysis: Checked performance by anatomical site as a proxy for different populations.
Calibration metrics: Because accuracy alone is misleading when the model's confidence is miscalibrated.
Uncertainty methods: Tried MC Dropout, ensembles, temperature scaling. None fully solved it.
Documented failures: Including the things I tried that didn't work
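Of the uncertainty methods listed above, temperature scaling is the simplest to show. The following is a minimal sketch, not the repository's implementation: it assumes held-out validation logits are available and picks a single temperature by grid search on negative log-likelihood. The function names, the grid, and the toy logits are all illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature (T > 1 softens, T < 1 sharpens)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logits_batch, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [
        -math.log(softmax(logits, temperature)[y])
        for logits, y in zip(logits_batch, labels)
    ]
    return sum(losses) / len(losses)

def fit_temperature(logits_batch, labels, grid=None):
    """Pick the temperature from a grid that minimizes validation NLL."""
    if grid is None:
        grid = [0.5 + 0.05 * i for i in range(91)]  # 0.5 to 5.0 (assumed range)
    return min(grid, key=lambda t: nll(logits_batch, labels, t))

# Made-up validation logits: confident everywhere, and wrong on the last example.
val_logits = [[4.0, 0.0], [0.0, 5.0], [3.5, 0.5], [6.0, 0.0]]
val_labels = [0, 1, 0, 1]
T = fit_temperature(val_logits, val_labels)
```

Dividing logits by a positive temperature never changes the argmax, so accuracy is untouched and only the confidence values move. The temperature is also fit on data from the training period, which is one plausible reason it would not fully fix calibration on shifted data.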

Dataset

Source: ISIC 2019 Challenge — 25,331 dermoscopic images, 8 diagnostic categories. CC BY-NC 4.0 license. See datasheet for full provenance.

┌─────────────────────────────────────────────────────────────┐
│                        DATA PIPELINE                        │
├──────────────┬──────────────┬──────────────┬────────────────┤
│ ISIC 2019    │ Metadata     │ Shift        │ Stratified     │
│ (25K imgs)   │ (age, site)  │ Simulation   │ Splits         │
└──────────────┴──────────────┴──────────────┴────────────────┘
                              │
                              ▼
        Train: 18K  │  Test-IID: 2.5K  │  Test-Shift: 4.8K

Train Set: 18,000 images
Test-IID: 2,500 images
Test-Shift: 4,831 images
Classes: Binary (lesion/no)
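The three-way split above can be sketched roughly as follows. This assumes each record carries an acquisition year. The `year` and `image_id` field names, the 10% IID holdout, and the seed are illustrative, not the repository's schema or settings. The sketch also skips the class stratification that the pipeline diagram mentions.

```python
import random

def temporal_split(records, train_years, shift_years, iid_fraction=0.1, seed=0):
    """Split records into train, IID test, and shifted test sets by year.

    `records` is a list of dicts with a (hypothetical) "year" key.
    """
    in_period = [r for r in records if r["year"] in train_years]
    shifted = [r for r in records if r["year"] in shift_years]

    # Hold out a random slice of the training period as the IID test set.
    rng = random.Random(seed)
    rng.shuffle(in_period)
    n_iid = int(len(in_period) * iid_fraction)
    test_iid, train = in_period[:n_iid], in_period[n_iid:]
    return train, test_iid, shifted

# Toy records: ten images per year from 2015 to 2022.
records = [{"image_id": i, "year": 2015 + i % 8} for i in range(80)]
train, test_iid, test_shift = temporal_split(
    records, train_years=range(2015, 2019), shift_years=range(2019, 2023)
)
```

The IID test set is drawn from the same years as the training data, so the difference between Test-IID and Test-Shift scores isolates the effect of the time gap rather than the effect of held-out data in general.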

Results

Main Results (Test: 2019–2022)

Model	Accuracy	Balanced Acc	ECE
ResNet-50	0.78	0.71	0.12
EfficientNet-B4	0.81	0.74	0.09
ViT-B/16	0.84	0.77	0.07
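For readers unfamiliar with the last two columns, here is a minimal sketch of balanced accuracy and expected calibration error (ECE). It uses equal-width confidence bins, and the choice of 10 bins is an assumption, since the page does not state the binning used. The inputs are toy values, not benchmark outputs.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-size-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, hit))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes count as much as common ones."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Toy predictions: top-class confidence and whether that prediction was right.
confidences = [0.95, 0.92, 0.85, 0.65, 0.55]
correct = [1, 1, 0, 1, 0]
ece = expected_calibration_error(confidences, correct)

# Toy imbalanced labels: plain accuracy is 5/6, balanced accuracy is lower.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]
bal_acc = balanced_accuracy(y_true, y_pred)
```

Balanced accuracy sits below plain accuracy whenever the rarer class is the one being missed, which is consistent with the gap between the two columns in the table.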

Temporal Degradation

Model	2019	2020	2021	2022
ResNet-50	0.82	0.79	0.76	0.74
EfficientNet-B4	0.84	0.82	0.79	0.78
ViT-B/16	0.87	0.85	0.82	0.81

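All models degrade on newer data. From 2019 to 2022, ViT-B/16 and EfficientNet-B4 each drop 6 points (0.87 to 0.81 and 0.84 to 0.78), while ResNet-50 drops 8 (0.82 to 0.74). ViT-B/16 stays the most accurate in every year.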
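The per-model drops can be recomputed directly from the table. A small sketch that takes the table values as given and reports the 2019 to 2022 change in percentage points:

```python
# Scores by test year, copied from the Temporal Degradation table.
score_by_year = {
    "ResNet-50":       {2019: 0.82, 2020: 0.79, 2021: 0.76, 2022: 0.74},
    "EfficientNet-B4": {2019: 0.84, 2020: 0.82, 2021: 0.79, 2022: 0.78},
    "ViT-B/16":        {2019: 0.87, 2020: 0.85, 2021: 0.82, 2022: 0.81},
}

def drop_in_points(series, start=2019, end=2022):
    """Absolute drop between two years, in percentage points."""
    return round((series[start] - series[end]) * 100, 1)

drops = {model: drop_in_points(s) for model, s in score_by_year.items()}
for model, drop in drops.items():
    print(f"{model}: -{drop} points")
```

These are absolute differences in percentage points, not relative changes. The relative drop would be larger for models that start lower.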

Demographic Slice Analysis

Model	Light	Medium	Dark	Gap (Light - Dark)
ResNet-50	0.81	0.77	0.73	8%
EfficientNet-B4	0.84	0.80	0.78	6%
ViT-B/16	0.86	0.83	0.81	5%

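All models underperform on darker skin tones, which are heavily underrepresented in the ISIC dataset (see Limitations). The skin tone groups come from an image-luminance proxy, not clinical annotation.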
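A slice table like this one comes down to computing the metric separately per subgroup and comparing the best and worst groups. A minimal sketch, assuming each test example already has a group label. The labels and predictions below are toy values, and how the groups are assigned is outside the sketch.

```python
from collections import defaultdict

def accuracy_by_slice(y_true, y_pred, groups):
    """Accuracy computed separately for each subgroup label."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        totals[g] += 1
        hits[g] += int(t == p)
    return {g: hits[g] / totals[g] for g in totals}

def worst_case_gap(slice_scores):
    """Difference between the best and worst slice scores."""
    return max(slice_scores.values()) - min(slice_scores.values())

# Toy data: four examples per group.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 1]
groups = ["light"] * 4 + ["dark"] * 4

scores = accuracy_by_slice(y_true, y_pred, groups)
gap = worst_case_gap(scores)
```

With very small slices the per-group estimate is noisy, so a gap should be read alongside the slice sizes.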

Negative Results

Mixup Augmentation

No accuracy improvement. Hurt calibration (ECE increased by 0.03). Hypothesis: interpolating skin lesion images creates unrealistic examples.
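For context, mixup blends pairs of inputs and their labels with a weight drawn from a Beta distribution. A minimal sketch on flat feature lists, where `alpha=0.2` is a common default and not necessarily the value used in these experiments:

```python
import random

def mixup_pair(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two examples and their one-hot labels with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

# Toy two-pixel "images" with one-hot labels for a binary task.
rng = random.Random(0)
x, y, lam = mixup_pair([0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], rng=rng)
```

The blended input is a pixel-wise average of two different lesions, which illustrates the hypothesis above: the result may not resemble any real dermoscopic image.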

Heavy Augmentation

+2% accuracy, but ECE increased from 0.07 to 0.11. Model became overconfident on augmented distribution.

Focal Loss

Marginal improvement on rare classes (+1%), but degraded common class performance (-2%). Net negative.
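Focal loss scales cross-entropy by `(1 - p)^gamma`, where `p` is the predicted probability of the true class, so confidently correct examples contribute very little. A minimal per-example sketch, where `gamma=2.0` is the commonly cited default rather than a setting confirmed by this page, and the optional class-weighting term is omitted:

```python
import math

def cross_entropy(p_true):
    """Cross-entropy for one example, given the probability of its true class."""
    return -math.log(p_true)

def focal_loss(p_true, gamma=2.0):
    """Focal loss: cross-entropy scaled by (1 - p_true) ** gamma."""
    return ((1.0 - p_true) ** gamma) * cross_entropy(p_true)

# How much of the cross-entropy loss survives for an easy vs. a hard example.
easy_ratio = focal_loss(0.9) / cross_entropy(0.9)
hard_ratio = focal_loss(0.1) / cross_entropy(0.1)
```

With `gamma=2`, an example predicted at 0.9 keeps about 1% of its cross-entropy loss, while one predicted at 0.1 keeps about 81%. One plausible reading of the result above is that well-classified common-class examples were down-weighted enough to cost performance there.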

Limitations

Fitzpatrick proxy is imperfect—derived from image luminance, not clinical annotation
ISIC data skews toward lighter skin tones (80%+ of dataset)
Temporal shift may conflate scanner changes with true distribution shift
Not for clinical deployment—research benchmark only
Limited to dermoscopy images; doesn't generalize to other imaging modalities

Artifacts

Repository
Dataset Datasheet
Full Results
Negative Results
Reproduction Guide
Limitations