capit-sat

A Show, Attend and Tell image captioner whose decoder is trained from scratch on Flickr8k (Karpathy split). It is the glass-box half of capit: it exposes per-word attention, beam candidates, and word-by-word playback.

Test-set scores (pycocoevalcap, Karpathy test = 1000 images)

beam	BLEU-1	BLEU-2	BLEU-3	BLEU-4	CIDEr
1	61.99	44.37	30.23	20.05	55.51
3	64.77	47.34	33.68	23.45	62.20
5	65.54	47.84	34.08	23.63	62.80
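
The rows above differ only in beam width. As a rough illustration of why a wider beam can score higher, here is a minimal, generic beam search over a toy next-word table. It is a sketch, not the decoder shipped with this model: `beam_search`, `toy_step`, the `<start>`/`<end>` tokens, and the probabilities are all made up for the example, and no length normalization is applied.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[List[str], float]  # (tokens, summed log-probability)


def beam_search(step: Callable[[List[str]], Dict[str, float]],
                beam_size: int,
                max_len: int = 20,
                start: str = "<start>",
                end: str = "<end>") -> List[Hypothesis]:
    """Return hypotheses sorted best-first.

    `step(tokens)` is a hypothetical callback returning {next_word: log_prob}.
    """
    beams: List[Hypothesis] = [([start], 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next word.
        candidates = [(tokens + [word], score + logp)
                      for tokens, score in beams
                      for word, logp in step(tokens).items()]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        # Keep the top `beam_size`; completed captions leave the live beam.
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == end else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return sorted(finished, key=lambda c: c[1], reverse=True)


def toy_step(tokens: List[str]) -> Dict[str, float]:
    """Toy language model keyed on the last word only (illustrative numbers)."""
    table = {
        "<start>": {"a": math.log(0.6), "the": math.log(0.4)},
        "a": {"dog": math.log(0.55), "cat": math.log(0.45)},
        "the": {"dog": math.log(0.9), "cat": math.log(0.1)},
        "dog": {"<end>": 0.0},
        "cat": {"<end>": 0.0},
    }
    return table[tokens[-1]]


greedy = beam_search(toy_step, beam_size=1)[0]
wide = beam_search(toy_step, beam_size=3)[0]
```

In this toy setup, greedy decoding (beam 1) commits to the locally best first word and ends at "a dog" (probability 0.33), while beam 3 keeps "the" alive and finds "the dog" (probability 0.36).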

Training

Backbone: frozen ResNet-50 (ImageNet). Decoder trained from scratch.
Best validation BLEU-4: 19.62 at epoch 7 (early-stopped). Trained on a Colab T4 GPU.
Splits: train 6000, val 1000, test 1000.

Known limitation

Attention is effectively 7x7. ResNet-50 at 224px input produces a 7x7 final feature map (stride 32), which the encoder upsamples to 14x14. The upsampling adds no spatial detail, so heatmaps are coarse (~32px blocks). Captions are grounded, but the highlighted spots are region-level, not pixel-level.
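
The arithmetic behind the ~32px figure can be sketched as follows. This assumes the 14x14 attention map is flattened row-major into 196 locations and that each native 7x7 cell corresponds to a 2x2 group of upsampled cells; neither detail is confirmed by this card, and `effective_block` is a hypothetical helper, not part of capit.

```python
from typing import Tuple


def effective_block(cell: int, grid: int = 14, native: int = 7,
                    image: int = 224) -> Tuple[int, int, int, int]:
    """Map a flat attention index to the image block it effectively resolves.

    Returns (left, top, right, bottom) in pixels. Assumes row-major
    flattening of a grid x grid map (an assumption, see lead-in).
    """
    if not 0 <= cell < grid * grid:
        raise ValueError(f"cell must be in [0, {grid * grid})")
    row, col = divmod(cell, grid)
    scale = grid // native   # 2: each native cell spans 2x2 upsampled cells
    block = image // native  # 32: pixels per native cell (stride 32)
    left = (col // scale) * block
    top = (row // scale) * block
    return (left, top, left + block, top + block)


# Neighbouring 14x14 cells 0 and 15 fall in the same 32px block.
first = effective_block(0)
same_block = effective_block(15)
last = effective_block(195)
```

Under these assumptions, four neighbouring attention cells always share one 32x32 pixel block, which is why the heatmap cannot localize anything finer than that.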

Use

Download capit-sat.pt with huggingface_hub.hf_hub_download("Bukunmi2108/capit-sat", "capit-sat.pt"), fetch vocab.json the same way, then load both with capit.serving.load_artifact(...).
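
A minimal sketch of the download step, assuming vocab.json sits in the same repository as the checkpoint. `fetch_artifacts` and its `download` parameter are hypothetical (the parameter exists only so the function can be exercised without network access). The arguments of `capit.serving.load_artifact` are not documented here, so that call is left out rather than guessed.

```python
from typing import Callable, Dict, Optional

REPO_ID = "Bukunmi2108/capit-sat"
FILES = ("capit-sat.pt", "vocab.json")  # assumption: both live in REPO_ID


def fetch_artifacts(
    download: Optional[Callable[[str, str], str]] = None,
) -> Dict[str, str]:
    """Return {filename: local_path} for the checkpoint and vocabulary.

    By default this uses huggingface_hub.hf_hub_download(repo_id, filename),
    which caches files locally and returns the cached path.
    """
    if download is None:
        # Imported lazily so the module loads without huggingface_hub installed.
        from huggingface_hub import hf_hub_download
        download = hf_hub_download
    return {name: download(REPO_ID, name) for name in FILES}


# Typical use (requires network access and `pip install huggingface_hub`):
#   paths = fetch_artifacts()
#   then pass paths["capit-sat.pt"] and paths["vocab.json"] to
#   capit.serving.load_artifact(...) as that function expects.
```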


