ShiftBench
A benchmark that stops hiding distribution shift. Train on 2015–2018, test on 2019–2022.
Results Explorer
[Figure: Performance Under Shift. Pre-computed results showing how model performance degrades under distribution shift.]
Key Insight: ViT-B/16 shows the smallest performance gap (-7%) under temporal shift, possibly because its learned representations transfer better to newer data. All models still degrade significantly, and random splits would hide this.
Reproduce These Results
git clone https://github.com/croesus245/shiftbench-benchmark && cd shiftbench-benchmark
pip install -r requirements.txt
python scripts/download_data.py # Download ISIC data
python scripts/train.py --model vit # Train model
python scripts/evaluate.py --split temporal # Evaluate under shift
Why I Built This
I kept reading papers with 95%+ accuracy on skin lesion classification. Then I'd look closer: random train/test splits from the same year. No wonder they looked great.
I wanted to understand what actually happens when you train on old data and test on new data. Spoiler: it's worse than the papers suggest.
What I Actually Tried
- Temporal split: Train on 2015–2018, test on 2019–2022. The gap hurts.
- Subgroup analysis: Checked performance by anatomical site as a proxy for different populations.
- Calibration metrics: Because accuracy alone is misleading if the confidence behind it is wrong.
- Uncertainty methods: Tried MC Dropout, ensembles, and temperature scaling. None fully solved it.
- Documented failures: Including the things I tried that didn't work.
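Of the uncertainty methods listed above, temperature scaling is the easiest to show in a few lines. The sketch below is not taken from the ShiftBench code: the function names are illustrative, and it fits the temperature with a plain grid search over validation logits rather than the gradient-based fit that is more common in practice.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def nll(logits, labels, temperature):
    """Mean negative log-likelihood after dividing logits by `temperature`."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    logp = log_softmax(logits / temperature)
    return float(-logp[np.arange(len(labels)), labels].mean())

def fit_temperature(val_logits, val_labels, grid=None):
    """Pick the temperature on a grid (assumed range 0.5 to 5.0) with lowest NLL."""
    grid = np.linspace(0.5, 5.0, 91) if grid is None else np.asarray(grid)
    losses = [nll(val_logits, val_labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])

# Toy example: the model is very confident but right only half the time,
# so the fit pushes the temperature up to soften the probabilities.
val_logits = np.array([[4.0, 0.0]] * 4)
val_labels = np.array([0, 0, 1, 1])
temperature = fit_temperature(val_logits, val_labels)
```

A temperature fitted on in-distribution validation data is not guaranteed to stay well calibrated on the shifted test set, which is consistent with the note above that none of these methods fully solved the problem.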
Dataset
Source: ISIC 2019 Challenge — 25,331 dermoscopic images, 8 diagnostic categories. CC BY-NC 4.0 license. See datasheet for full provenance.
┌─────────────────────────────────────────────────────────────┐
│                        DATA PIPELINE                        │
├──────────────┬──────────────┬──────────────┬────────────────┤
│ ISIC 2019    │ Metadata     │ Shift        │ Stratified     │
│ (25K imgs)   │ (age, site)  │ Simulation   │ Splits         │
└──────────────┴──────────────┴──────────────┴────────────────┘
                               │
                               ▼
        Train: 18K │ Test-IID: 2.5K │ Test-Shift: 4.8K
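The three-way split at the bottom of the pipeline could be sketched roughly as follows. This is an illustration, not the repository's implementation: it assumes a per-image `years` array is available from the metadata, the name `temporal_split` is hypothetical, the `iid_fraction` default of 0.12 is only inferred from the 18K / 2.5K sizes above, and the stratification step is left out for brevity.

```python
import numpy as np

def temporal_split(years, train_end=2018, iid_fraction=0.12, seed=0):
    """Return index arrays (train, test_iid, test_shift).

    years: 1-D array with one acquisition year per image (assumed metadata).
    Images up to `train_end` are shuffled and divided into train and Test-IID;
    everything newer is held out as Test-Shift.
    """
    years = np.asarray(years)
    rng = np.random.default_rng(seed)

    old = np.flatnonzero(years <= train_end)  # in-distribution pool
    new = np.flatnonzero(years > train_end)   # shifted test set

    old = rng.permutation(old)
    n_iid = int(round(iid_fraction * len(old)))
    test_iid, train = old[:n_iid], old[n_iid:]
    return train, test_iid, new

# Toy metadata: 25 images for each year from 2015 to 2022.
years = np.array([2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022] * 25)
train, test_iid, test_shift = temporal_split(years)
```

Test-IID is drawn from the same years as the training set, so comparing it with Test-Shift separates ordinary generalization error from the effect of the temporal shift.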
Results
Main Results (Test: 2019–2022)
| Model | Accuracy | Balanced Acc | ECE |
|---|---|---|---|
| ResNet-50 | 0.78 | 0.71 | 0.12 |
| EfficientNet-B4 | 0.81 | 0.74 | 0.09 |
| ViT-B/16 | 0.84 | 0.77 | 0.07 |
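For readers unfamiliar with the two less common columns, here is a minimal NumPy sketch of balanced accuracy (mean per-class recall) and expected calibration error (ECE). The equal-width binning and the default of 15 bins are assumptions; the binning used for the numbers in the table is not stated here.

```python
import numpy as np

def balanced_accuracy(preds, labels):
    """Mean of per-class recall, so rare classes count as much as common ones."""
    preds, labels = np.asarray(preds), np.asarray(labels)
    recalls = [(preds[labels == c] == c).mean() for c in np.unique(labels)]
    return float(np.mean(recalls))

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE with equal-width confidence bins.

    probs:  (N, C) array of predicted class probabilities.
    labels: (N,) array of integer class labels.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidences = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # |accuracy - mean confidence| in the bin, weighted by bin size
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)
```

An ECE of 0.07 means that, averaged over bins and weighted by bin size, stated confidence and observed accuracy differ by about 7 percentage points.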
Temporal Degradation
| Model | 2019 | 2020 | 2021 | 2022 |
|---|---|---|---|---|
| ResNet-50 | 0.82 | 0.79 | 0.76 | 0.74 |
| EfficientNet-B4 | 0.84 | 0.82 | 0.79 | 0.78 |
| ViT-B/16 | 0.87 | 0.85 | 0.82 | 0.81 |
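All models degrade on newer data. From 2019 to 2022, ViT-B/16 and EfficientNet-B4 each drop 6 percentage points (0.87 to 0.81 and 0.84 to 0.78), while ResNet-50 drops 8 (0.82 to 0.74). Relative to its 2019 accuracy, ViT degrades least.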
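The size of the drop depends on whether it is measured in percentage points or relative to the 2019 accuracy. A small sketch using only the numbers from the table above (the helper name `degradation` is illustrative):

```python
# Accuracy by test year, copied from the table above.
accuracy_by_year = {
    "ResNet-50":       {2019: 0.82, 2020: 0.79, 2021: 0.76, 2022: 0.74},
    "EfficientNet-B4": {2019: 0.84, 2020: 0.82, 2021: 0.79, 2022: 0.78},
    "ViT-B/16":        {2019: 0.87, 2020: 0.85, 2021: 0.82, 2022: 0.81},
}

def degradation(per_year):
    """Return (drop in percentage points, drop relative to first year in %)."""
    first, last = per_year[min(per_year)], per_year[max(per_year)]
    absolute = first - last
    return round(absolute * 100, 1), round(100 * absolute / first, 1)

for model, per_year in accuracy_by_year.items():
    points, relative = degradation(per_year)
    print(f"{model}: -{points} points ({relative}% relative to 2019)")
```

In points, ViT-B/16 and EfficientNet-B4 lose the same amount; the relative figure is what separates them.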
Demographic Slice Analysis
| Model | Light | Medium | Dark | Gap (Light − Dark) |
|---|---|---|---|---|
| ResNet-50 | 0.81 | 0.77 | 0.73 | 0.08 |
| EfficientNet-B4 | 0.84 | 0.80 | 0.78 | 0.06 |
| ViT-B/16 | 0.86 | 0.83 | 0.81 | 0.05 |
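All models underperform on darker skin tones. The skin tone groups come from a luminance-based Fitzpatrick proxy rather than clinical annotation, and the skew of the ISIC dataset toward lighter skin is a known limitation.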
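A per-slice breakdown like the one in the table can be computed with a few lines of NumPy. This is a generic sketch rather than the benchmark's own evaluation code; `slice_accuracy` and the group labels are illustrative names.

```python
import numpy as np

def slice_accuracy(preds, labels, groups):
    """Accuracy per group, plus the gap between the best and worst group."""
    preds, labels, groups = (np.asarray(a) for a in (preds, labels, groups))
    correct = preds == labels
    per_group = {
        str(g): float(correct[groups == g].mean()) for g in np.unique(groups)
    }
    gap = max(per_group.values()) - min(per_group.values())
    return per_group, gap

# Toy example with two groups of two images each.
per_group, gap = slice_accuracy(
    preds=[1, 1, 0, 0],
    labels=[1, 1, 1, 0],
    groups=["light", "light", "dark", "dark"],
)
```

With small groups the per-slice accuracy is noisy, so it is worth reporting group sizes alongside the gap.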
Negative Results
Mixup Augmentation
No accuracy improvement. Hurt calibration (ECE increased by 0.03). Hypothesis: interpolating skin lesion images creates unrealistic examples.
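For reference, mixup blends pairs of images and their labels with a weight drawn from a Beta distribution. The sketch below is a generic NumPy version, not the training code used here; `alpha=0.2` is an assumed value, and `mixup_batch` is an illustrative name.

```python
import numpy as np

def mixup_batch(images, onehot_labels, alpha=0.2, rng=None):
    """Blend each example with a randomly chosen partner from the same batch.

    images:        (N, ...) float array.
    onehot_labels: (N, C) one-hot (or soft) label array.
    Returns the mixed images, mixed labels, and the mixing weight `lam`.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = float(rng.beta(alpha, alpha))   # mixing weight in [0, 1]
    perm = rng.permutation(len(images))   # partner index for each example
    mixed_x = lam * images + (1.0 - lam) * images[perm]
    mixed_y = lam * onehot_labels + (1.0 - lam) * onehot_labels[perm]
    return mixed_x, mixed_y, lam
```

The mixed image is a pixel-wise average of two lesions, which is the kind of unrealistic example the hypothesis above refers to.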
Heavy Augmentation
Accuracy improved by 2%, but ECE increased from 0.07 to 0.11: the model became overconfident on the augmented distribution.
Focal Loss
Marginal improvement on rare classes (+1%), but performance on common classes degraded (-2%). Net negative.
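Focal loss down-weights examples the model already classifies confidently, which is why it tends to trade common-class performance for rare-class performance. A minimal NumPy sketch of the loss value follows; `gamma=2.0` is an assumed setting, and the optional per-class weighting term is omitted.

```python
import numpy as np

def focal_loss(probs, labels, gamma=2.0):
    """Mean focal loss: -(1 - p_true)^gamma * log(p_true).

    probs:  (N, C) predicted class probabilities.
    labels: (N,) integer class labels.
    With gamma = 0 this reduces to ordinary cross-entropy.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    p_true = np.clip(probs[np.arange(len(labels)), labels], 1e-12, 1.0)
    return float(np.mean(-((1.0 - p_true) ** gamma) * np.log(p_true)))
```

An example predicted with probability 0.9 contributes 100 times less under `gamma=2.0` than under cross-entropy, since (1 - 0.9)^2 = 0.01.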
Limitations
- The Fitzpatrick skin type proxy is imperfect: it is derived from image luminance, not clinical annotation
- ISIC data skews toward lighter skin tones (80%+ of dataset)
- Temporal shift may conflate scanner changes with true distribution shift
- Not for clinical deployment; this is a research benchmark only
- Limited to dermoscopy images; results may not generalize to other imaging modalities
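To make the first limitation concrete, a luminance-based proxy might look something like the sketch below. This is purely illustrative and not the method used in ShiftBench: the function name, the use of Rec. 709 luma weights, and especially the two thresholds are placeholder assumptions.

```python
import numpy as np

def luminance_group(image_rgb, thresholds=(0.45, 0.65)):
    """Bucket an RGB image (floats in [0, 1], shape (H, W, 3)) by mean luminance.

    The thresholds are arbitrary placeholders, not values used by ShiftBench.
    """
    image_rgb = np.asarray(image_rgb, dtype=float)
    luma = image_rgb @ np.array([0.2126, 0.7152, 0.0722])  # Rec. 709 weights
    mean_luma = luma.mean()
    dark_max, medium_max = thresholds
    if mean_luma < dark_max:
        return "dark"
    if mean_luma < medium_max:
        return "medium"
    return "light"
```

Because the mean is taken over the whole image, lighting, exposure, and the lesion itself all affect the result, which is one reason such a proxy cannot replace clinical annotation.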