capit-sat
Show, Attend and Tell image captioner, trained from scratch on Flickr8k (Karpathy split). The glass-box half of capit: it exposes per-word attention, beam candidates, and word-by-word playback.
Test-set scores (pycocoevalcap, Karpathy test = 1000 images)
| beam | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr |
|---|---|---|---|---|---|
| 1 | 61.99 | 44.37 | 30.23 | 20.05 | 55.51 |
| 3 | 64.77 | 47.34 | 33.68 | 23.45 | 62.20 |
| 5 | 65.54 | 47.84 | 34.08 | 23.63 | 62.80 |
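
To read off how much each increase in beam width buys, the rows of the table can be differenced. The snippet below is a small illustrative sketch that uses only the BLEU-4 and CIDEr values copied from the table; the `gains` helper is a hypothetical name, not part of capit.

```python
# Scores copied from the table above: beam width -> (BLEU-4, CIDEr).
scores = {1: (20.05, 55.51), 3: (23.45, 62.20), 5: (23.63, 62.80)}


def gains(scores):
    """Return metric deltas between consecutive beam widths."""
    beams = sorted(scores)
    out = {}
    for prev, cur in zip(beams, beams[1:]):
        # Round to two decimals to match the precision of the table.
        out[(prev, cur)] = tuple(
            round(c - p, 2) for p, c in zip(scores[prev], scores[cur])
        )
    return out


for (prev, cur), (d_bleu4, d_cider) in gains(scores).items():
    print(f"beam {prev} -> {cur}: BLEU-4 {d_bleu4:+.2f}, CIDEr {d_cider:+.2f}")
```

Most of the improvement comes from moving from beam 1 to beam 3; going from beam 3 to beam 5 adds less than 0.2 BLEU-4.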
Training
- Backbone: frozen ResNet-50 (ImageNet). Decoder trained from scratch.
- Best validation BLEU-4: 19.62 at epoch 7 (early-stopped). Trained on a Colab T4.
- Splits: train 6000, val 1000, test 1000.
Known limitation
Attention is effectively 7x7: at 224px input, ResNet-50 produces a native 7x7 feature map, which the encoder upsamples to 14x14. The upsampling adds no spatial detail, so heatmaps are coarse (~32px blocks). Captions are grounded, but the attention spots are region-level, not pixel-level.
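
The arithmetic behind the ~32px figure can be sketched as follows. This is an illustration only: it assumes nearest-neighbour upsampling for simplicity (the encoder's actual interpolation method is not stated here), and both helper names are hypothetical.

```python
def cell_size_px(image_px: int, grid: int) -> int:
    """Side length, in input pixels, covered by one feature-map cell."""
    return image_px // grid


def upsample_nearest(grid, factor):
    """Repeat every cell `factor` times along both axes (nearest-neighbour)."""
    return [
        [value for value in row for _ in range(factor)]
        for row in grid
        for _ in range(factor)
    ]


# A 7x7 map with a distinct value in each cell stands in for native features.
native = [[r * 7 + c for c in range(7)] for r in range(7)]
up = upsample_nearest(native, 2)

print(cell_size_px(224, 7))                     # pixels per native cell
print(len(up), len(up[0]))                      # upsampled grid shape
print(len({v for row in up for v in row}))      # distinct values after upsampling
```

Under this assumption the 14x14 grid still contains only 49 distinct values, one per native cell. Smoother interpolation would blur the block edges but would not recover finer detail either.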
Use
Download the checkpoint with huggingface_hub.hf_hub_download("Bukunmi2108/capit-sat", "capit-sat.pt"), fetch vocab.json from the same repo in the same way, then load with capit.serving.load_artifact(...).
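
A minimal sketch of the download step, assuming both files sit at the root of the repo as the filenames above suggest. The `fetch_files` helper and its `download` parameter are hypothetical and exist only so the function can be exercised without network access. The arguments to `capit.serving.load_artifact` are not specified in this card, so that call is left as a comment.

```python
REPO_ID = "Bukunmi2108/capit-sat"
FILENAMES = ("capit-sat.pt", "vocab.json")


def fetch_files(download=None):
    """Download the checkpoint and vocabulary; return {filename: local_path}."""
    if download is None:
        # Imported lazily so the module loads even without huggingface_hub.
        from huggingface_hub import hf_hub_download as download
    return {name: download(REPO_ID, name) for name in FILENAMES}


# Example (requires network access and huggingface_hub installed):
# paths = fetch_files()
# Then load the model with capit.serving.load_artifact(...).
```

hf_hub_download caches files locally, so repeated calls reuse the existing copies instead of downloading them again.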