capit-sat

Show, Attend and Tell image captioner, trained from scratch on Flickr8k (Karpathy split). The glass-box half of capit: it exposes per-word attention, beam candidates, and word-by-word playback.

Test-set scores (pycocoevalcap, Karpathy test = 1000 images)

beam  BLEU-1  BLEU-2  BLEU-3  BLEU-4  CIDEr
1     61.99   44.37   30.23   20.05   55.51
3     64.77   47.34   33.68   23.45   62.20
5     65.54   47.84   34.08   23.63   62.80

Training

  • Backbone: frozen ResNet-50 (ImageNet). Decoder trained from scratch.
  • Best validation BLEU-4: 19.62 at epoch 7 (early-stopped). Trained on a Colab T4 GPU.
  • Splits: train 6000, val 1000, test 1000.

Known limitation

Attention is effectively 7x7: at 224px input, ResNet-50's final feature map is natively 7x7 (total stride 32), and the encoder upsamples it to 14x14, so heatmaps are coarse (~32px blocks). Captions are grounded, but the highlighted spots are region-level, not pixel-level.
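The arithmetic behind this limitation can be sketched in a few lines of plain Python. This is an illustration only, not the capit encoder: the card does not say which interpolation the encoder uses, so nearest-neighbour upsampling is assumed here, and the helper names (native_grid, upsample_nearest) are hypothetical.

```python
# ResNet-50's last convolutional stage has a total stride of 32.
RESNET50_STRIDE = 32
IMAGE_PX = 224


def native_grid(image_px: int, stride: int = RESNET50_STRIDE) -> int:
    """Side length of the final feature map for a square input."""
    return image_px // stride


def upsample_nearest(grid, factor):
    """Nearest-neighbour upsampling of a 2-D list (assumed, for illustration)."""
    return [
        [row[c // factor] for c in range(len(row) * factor)]
        for row in grid
        for _ in range(factor)
    ]


side = native_grid(IMAGE_PX)  # 224 // 32

# Give every native cell a unique id so duplicates are easy to count.
native = [[r * side + c for c in range(side)] for r in range(side)]
upsampled = upsample_nearest(native, 2)

# Number of independent cells that survive the upsampling.
distinct_cells = len({v for row in upsampled for v in row})

nominal_cell_px = IMAGE_PX // len(upsampled)  # cell size the 14x14 grid suggests
effective_cell_px = IMAGE_PX // side          # cell size the features actually resolve

print(side, len(upsampled), distinct_cells, nominal_cell_px, effective_cell_px)
```

Under these assumptions the 14x14 grid carries no more independent locations than the native 7x7 one, which is why the heatmap blocks are about 32px wide rather than 16px.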

Use

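Download capit-sat.pt and vocab.json from the Bukunmi2108/capit-sat repo with huggingface_hub.hf_hub_download (for example, hf_hub_download("Bukunmi2108/capit-sat", "capit-sat.pt")), then load them with capit.serving.load_artifact(...).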
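A minimal sketch of the download step, assuming the huggingface_hub package is installed. The helper name download_artifacts is hypothetical. The arguments of capit.serving.load_artifact are not given in this card, so that call is left as a comment rather than guessed.

```python
REPO_ID = "Bukunmi2108/capit-sat"
FILENAMES = ("capit-sat.pt", "vocab.json")


def download_artifacts(repo_id: str = REPO_ID, filenames=FILENAMES) -> dict:
    """Fetch each file into the local Hugging Face cache.

    Returns a dict mapping filename to its local path.
    """
    # Imported lazily so this module loads even without the dependency.
    from huggingface_hub import hf_hub_download

    return {
        name: hf_hub_download(repo_id=repo_id, filename=name)
        for name in filenames
    }


# Usage (requires network access):
#   paths = download_artifacts()
#   paths["capit-sat.pt"], paths["vocab.json"]
# Then hand the paths to capit.serving.load_artifact(...); its exact
# arguments are not documented here, so check the capit source.
```

hf_hub_download returns the cached local path, so repeated calls do not re-download files that are already present.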
