🖼️ Image Captioning Model: 100k Training Run
EfficientNet-V2-S Encoder + Transformer Decoder
This model takes an image as input and generates an English caption describing the image.
Model Overview
This repository contains a custom PyTorch image captioning model trained on a 100k-sample COCO-style image-caption dataset.
The model uses an encoder-decoder structure:
Input Image
↓
EfficientNet-V2-S Image Encoder
↓
Visual Feature Tokens
↓
Transformer Text Decoder
↓
Generated Caption
| Component | Description |
|---|---|
| Input | RGB image |
| Encoder | EfficientNet-V2-S pretrained on ImageNet |
| Decoder | Transformer decoder |
| Output | English image caption |
| Training samples | 100,000 |
| Validation samples | 20,000 |
| Vocabulary size | 9,721 tokens |
| Checkpoint | best_phase2.pt |
| Validation loss | 3.4565 |
Architecture Details
Image Encoder
| Setting | Value |
|---|---|
| Backbone | EfficientNet-V2-S |
| Pretraining | ImageNet |
| Image size | 224 × 224 |
| Visual tokens | 49 |
| Embedding dimension | 256 |
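The 49 visual tokens are consistent with flattening a 7 × 7 feature map: EfficientNet-V2-S reduces spatial resolution by a factor of 32 overall, so a 224 × 224 input leaves a 7 × 7 grid. The sketch below only checks that arithmetic. It assumes the encoder turns each spatial position of the final feature map into one token; the function name is hypothetical, and the notebook holds the actual implementation.

```python
def num_visual_tokens(image_size: int, total_stride: int = 32) -> int:
    """Number of spatial positions in the final feature map.

    Assumes a square input and a backbone whose overall downsampling
    factor is `total_stride` (32 for EfficientNet-V2-S).
    """
    if image_size % total_stride != 0:
        raise ValueError("image_size should be a multiple of total_stride")
    side = image_size // total_stride  # 224 // 32 = 7
    return side * side                 # 7 * 7 = 49


print(num_visual_tokens(224))  # 49
```

Under the same assumption, a larger input such as 384 × 384 would produce 144 tokens, which is why the image size has to match the one used in training.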
Text Decoder
| Setting | Value |
|---|---|
| Decoder type | Transformer Decoder |
| Vocabulary size | 9,721 |
| Embedding dimension | 256 |
| Transformer layers | 6 |
| Attention heads | 8 |
| Feed-forward dimension | 1024 |
| Maximum caption length | 52 |
| Dropout | 0.1 |
| Decoding methods | Greedy search, Beam search |
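A Transformer decoder of this kind is usually trained with teacher forcing: the input is the caption shifted right by one token, and a causal mask stops each position from attending to later ones. The sketch below shows that setup in plain Python. It is an illustration under assumptions, not the notebook's code: the helper names are hypothetical, and it assumes the maximum length of 52 includes the `<SOS>` and `<EOS>` tokens.

```python
PAD, SOS, EOS = 0, 1, 2   # special token IDs from this model card
MAX_LEN = 52              # maximum caption length from the table above


def make_decoder_pair(token_ids, max_len=MAX_LEN):
    """Build (decoder_input, target) for teacher forcing.

    Assumption: the caption is wrapped as <SOS> ... <EOS>, truncated so the
    wrapped sequence fits in max_len tokens, and right-padded with <PAD>.
    """
    seq = [SOS] + list(token_ids)[: max_len - 2] + [EOS]
    seq = seq + [PAD] * (max_len - len(seq))
    decoder_input = seq[:-1]  # what the decoder sees
    target = seq[1:]          # what it should predict at each position
    return decoder_input, target


def causal_mask(size):
    """mask[i][j] is True where position i must NOT attend to position j."""
    return [[j > i for j in range(size)] for i in range(size)]


dec_in, tgt = make_decoder_pair([10, 11, 12])
print(dec_in[:5])  # [1, 10, 11, 12, 2]
print(tgt[:5])     # [10, 11, 12, 2, 0]
print(causal_mask(3))
```

The boolean convention (True means "blocked") follows the one PyTorch attention masks use; padded target positions would normally be excluded from the loss.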
Repository Files
.
├── best_phase2.pt        # PyTorch checkpoint
├── Traning-100k.ipynb    # Training, loading, inference, and evaluation notebook
└── README.md             # Model card
Important Note About Vocabulary
This model uses a custom word-level vocabulary. The checkpoint stores the model weights, but it does not store the word-to-index and index-to-word mappings.
To reproduce captions correctly, the same vocabulary used during training is required.
Special tokens:
| Token | ID |
|---|---|
| <PAD> | 0 |
| <SOS> | 1 |
| <EOS> | 2 |
| <UNK> | 3 |
The recommended vocabulary file is:
vocab.json
Without the correct vocabulary, the model still produces token IDs, but those IDs cannot be reliably converted back into English captions.
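To make the dependency on the vocabulary concrete, here is a minimal sketch of a word-level vocabulary with the four special tokens above, round-tripped through JSON. The function names and the flat word-to-ID JSON layout are assumptions for illustration; the notebook's Vocabulary class and the real vocab.json may be organized differently.

```python
import json

SPECIALS = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]  # IDs 0..3, as in the table


def build_vocab(words):
    """Map special tokens to 0..3, then each new word to the next free ID."""
    word2idx = {tok: i for i, tok in enumerate(SPECIALS)}
    for w in words:
        word2idx.setdefault(w, len(word2idx))
    return word2idx


def encode(caption, word2idx):
    """Lowercase, split on whitespace, map unknown words to <UNK>,
    and wrap the result in <SOS> ... <EOS>."""
    unk = word2idx["<UNK>"]
    ids = [word2idx.get(w, unk) for w in caption.lower().split()]
    return [word2idx["<SOS>"]] + ids + [word2idx["<EOS>"]]


def decode(ids, idx2word):
    """Convert IDs to words, stopping at <EOS> and skipping <PAD>/<SOS>."""
    words = []
    for i in ids:
        tok = idx2word[i]
        if tok == "<EOS>":
            break
        if tok not in ("<PAD>", "<SOS>"):
            words.append(tok)
    return " ".join(words)


word2idx = build_vocab("a bicycle with a clock".split())

# Round-trip through JSON, as one would with a vocab.json file.
restored = json.loads(json.dumps(word2idx))
idx2word = {i: w for w, i in restored.items()}

ids = encode("a bicycle with a sundial", restored)
print(ids)                    # [1, 4, 5, 6, 4, 3, 2]
print(decode(ids, idx2word))  # a bicycle with a <UNK>
```

Because IDs are assigned in insertion order, rebuilding the vocabulary from a different caption set or in a different order would shift the mapping, and the decoded captions would no longer match what the model learned.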
Training Details
The model was trained in two phases:
| Phase | Encoder Setting | Purpose |
|---|---|---|
| Phase 1 | Frozen EfficientNet encoder | Train decoder and projection layers |
| Phase 2 | Partially unfrozen EfficientNet encoder | Fine-tune visual features |
| Setting | Value |
|---|---|
| Dataset format | COCO-style image-caption annotations |
| Training samples | 100,000 |
| Validation samples | 20,000 |
| Total captions used for vocabulary | 414,113 |
| Batch size | 356 |
| Image size | 224 × 224 |
| Maximum caption length | 52 |
| Optimizer | AdamW |
| Loss function | Cross entropy |
| Label smoothing | 0.1 |
| LR schedule | Warmup + cosine decay |
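Two rows of the table above can be written out directly: the warmup plus cosine-decay learning-rate schedule and the label-smoothed cross entropy. The sketch below uses only the standard library. The model card does not state the base learning rate, warmup length, or total step count, so those are hypothetical parameters, and the linear warmup shape is an assumption.

```python
import math


def lr_at(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross entropy against a label-smoothed target for one position.

    The target distribution puts (1 - smoothing) on the true class plus
    smoothing / num_classes on every class, the same definition used by
    torch.nn.CrossEntropyLoss(label_smoothing=...).
    """
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    nll = -log_probs[target]                   # loss against the one-hot target
    uniform = -sum(log_probs) / len(logits)    # loss against the uniform target
    return (1.0 - smoothing) * nll + smoothing * uniform


# Hypothetical settings, for illustration only.
for step in (0, 99, 100, 550, 1000):
    print(step, lr_at(step, total_steps=1000, warmup_steps=100, base_lr=3e-4))

print(smoothed_cross_entropy([2.0, 0.0, -1.0], target=0))
```

Label smoothing raises the loss floor above zero, so the reported validation loss is not directly comparable to an unsmoothed cross entropy if smoothing was also applied during validation.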
Evaluation Results
Evaluation was performed on 2,000 validation samples using beam search with beam size 5.
| Metric | Score |
|---|---|
| BLEU-1 | 37.88 |
| BLEU-4 | 9.36 |
| CIDEr | 0.8452 |
| Validation loss | 3.4565 |
Example prediction:
| Type | Caption |
|---|---|
| Ground truth | a bicycle replica with a clock as the front wheel |
| Greedy decoding | a bicycle is shown with a clock on it |
| Beam search | a bicycle with a clock on the side of it |
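For intuition about what BLEU-1 measures, the sketch below scores the two example predictions against the single ground-truth caption, using clipped unigram precision and the standard brevity penalty. It is a simplified, single-reference, sentence-level illustration with a hypothetical function name. The reported scores come from the notebook's evaluation code over 2,000 samples, so these per-sentence numbers are not comparable to the table.

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """Sentence-level BLEU-1: clipped unigram precision times brevity penalty."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Each candidate word is credited at most as often as it occurs in the reference.
    matches = sum(min(n, ref_counts[w]) for w, n in Counter(cand).items())
    precision = matches / len(cand)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1.0 - len(ref) / len(cand))
    return bp * precision


reference = "a bicycle replica with a clock as the front wheel"
greedy = "a bicycle is shown with a clock on it"
beam = "a bicycle with a clock on the side of it"

print(round(bleu1(greedy, reference), 3))  # 0.497
print(round(bleu1(beam, reference), 3))    # 0.6
```

The beam-search caption matches six of its ten words and has the same length as the reference; the greedy caption matches five of nine and is also penalized for being shorter.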
How to Use
This is a custom PyTorch model. It is not a standard Hugging Face Transformers model, so it cannot be loaded directly with:
AutoModel.from_pretrained(...)
Instead, use the architecture and loading code provided in:
Traning-100k.ipynb
The notebook includes:
Vocabulary class
COCOCaptionDataset class
EfficientNetEncoder
TransformerDecoder
ImageCaptioningModel
Checkpoint loading
Greedy decoding
Beam-search decoding
Evaluation code
Installation
Install the main dependencies:
pip install torch torchvision pillow numpy matplotlib nltk pycocotools pycocoevalcap einops
Image Preprocessing
Images are resized to 224 × 224 and normalized using ImageNet statistics.
import torchvision.transforms as T
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]
transform = T.Compose([
T.Resize((224, 224)),
T.ToTensor(),
T.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
Loading the Checkpoint
After defining the model architecture and loading the correct vocabulary, use:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ImageCaptioningModel(
vocab_size=9721,
embed_dim=256,
num_heads=8,
num_layers=6,
ff_dim=1024,
max_len=52,
dropout=0.1
).to(device)
checkpoint = torch.load("best_phase2.pt", map_location=device)
model.load_state_dict(checkpoint["model"])
model.eval()
print("Checkpoint loaded")
print("Checkpoint epoch:", checkpoint["epoch"])
print("Validation loss:", checkpoint["val_loss"])
Checkpoint metadata:
checkpoint["epoch"] = 14
checkpoint["val_loss"] = 3.4565230486026866
Caption Generation
The notebook includes greedy decoding and beam-search decoding.
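import torch
from PIL import Image
image = Image.open("example.jpg").convert("RGB")
# Move the tensor to the same device as the model. If generate_beam expects a
# batch dimension, also call .unsqueeze(0); check the notebook's implementation.
image_tensor = transform(image).to(device)
with torch.no_grad():  # inference only, no gradients needed
    caption = model.generate_beam(image_tensor, beam_size=5)
print("Generated caption:", caption)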
Example output:
a bicycle with a clock on the side of it
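The difference between greedy and beam-search decoding can be seen on a toy next-token distribution. The sketch below is a generic beam search, not the notebook's `generate_beam`: the `next_log_probs` callback and the toy table are hypothetical stand-ins for the decoder, and there is no length normalization, which the real implementation may or may not apply. With `beam_size=1` it reduces to greedy decoding.

```python
import math

SOS, EOS = 1, 2  # special token IDs from this model card


def beam_search(next_log_probs, beam_size=5, max_len=52):
    """Return the highest-scoring token sequence found by beam search.

    `next_log_probs(prefix)` is a hypothetical callback returning a
    {token_id: log_prob} dict for the next token given the prefix.
    """
    beams = [([SOS], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len - 1):
        candidates = []
        for tokens, score in beams:
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            # Hypotheses that emit <EOS> leave the beam and are kept aside.
            (finished if tokens[-1] == EOS else beams).append((tokens, score))
        if not beams:
            break
    # Fall back to unfinished hypotheses if nothing reached <EOS>.
    pool = finished or beams
    return max(pool, key=lambda c: c[1])[0]


# Toy distribution: the locally best first token (4) leads to a worse sequence.
TOY = {
    (1,): {4: 0.6, 5: 0.4},
    (1, 4): {6: 0.55, 7: 0.45},
    (1, 5): {8: 0.9, EOS: 0.1},
}


def toy_model(prefix):
    probs = TOY.get(tuple(prefix), {EOS: 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


print(beam_search(toy_model, beam_size=1))  # [1, 4, 6, 2]
print(beam_search(toy_model, beam_size=2))  # [1, 5, 8, 2]
```

Greedy decoding commits to token 4 and ends with a sequence of probability 0.33, while a beam of two keeps token 5 alive and finds the sequence with probability 0.36.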
Limitations
This model is experimental and has some limitations:
- It uses a custom PyTorch architecture, not a standard Hugging Face Transformers architecture.
- It requires the original model class definitions to load correctly.
- It requires the same vocabulary used during training.
- Caption quality may be limited by the 100k-sample training subset.
- The model may generate generic captions for complex images.
- The model may hallucinate objects that are not present in the image.
- The tokenizer is word-level, so rare or unseen words are mapped to <UNK>.
Intended Use
This model is intended for:
- Image caption generation
- Educational deep learning experiments
- Vision-language model learning
- Encoder-decoder architecture demonstrations
- COCO-style image captioning practice
Out-of-Scope Use
This model is not intended for:
- Safety-critical computer vision systems
- Medical image interpretation
- Legal or forensic image analysis
- Real-time production deployment without further validation
Citation
@misc{image_captioning_100k,
title = {Image Captioning Model with EfficientNet-V2-S Encoder and Transformer Decoder},
author = {Ali Sedghiye},
year = {2026},
note = {Custom PyTorch image captioning model trained on 100k COCO-style samples}
}
Author
Developed by Ali Sedghiye as a custom PyTorch image captioning model using an EfficientNet-V2-S image encoder and a Transformer text decoder.