---
language:
- id
license: mit
tags:
- image-classification
- vision-transformer
- vit
- coffee
- food-quality
- pytorch
datasets:
- zaafirrahman/arabica-beans
metrics:
- accuracy
---

# ☕ Arabica Coffee Bean Quality Classification — ViT & MobileNetV2

Trained models from an undergraduate thesis on classifying Arabica coffee bean quality with a Vision Transformer (ViT), benchmarked against a MobileNetV2 CNN baseline. Classification follows the **Indonesian National Standard (SNI)** defect value system across **6 quality grades**.

- 📄 Full paper & poster → [GitHub Releases](https://github.com/zaafirrahman/coffeebean_quality_vit/releases)
- 💻 Code & notebooks → [GitHub Repo](https://github.com/zaafirrahman/coffeebean_quality_vit)
- 📦 Dataset → [Kaggle](https://www.kaggle.com/datasets/zaafirrahman/arabica-beans)

---

## 🏆 Model Files

| File | Architecture | Pretrained Weights | Patch Size | Train:Test Split | Train Acc | Test Acc | Prediction Acc |
|---|---|---|---|---|---|---|---|
| `vit-imgt-16(7030).pth` ⭐ | ViT-B/16 | ImageNet | 16×16 px | 70:30 | 97.02% | 84.72% | **91.67%** |
| `vit-imgt-16(8020).pth` | ViT-B/16 | ImageNet | 16×16 px | 80:20 | 96.98% | 86.25% | 88.34% |
| `vit-imgt-16(5050).pth` | ViT-B/16 | ImageNet | 16×16 px | 50:50 | 98.50% | 83.50% | 86.67% |
| `vit-imgt-16(4060).pth` | ViT-B/16 | ImageNet | 16×16 px | 40:60 | 99.80% | 79.72% | 83.34% |
| `vit-imgt-32(7030).pth` | ViT-B/32 | ImageNet | 32×32 px | 70:30 | 97.50% | 81.67% | 76.67% |
| `vit-imgt-32(8020).pth` | ViT-B/32 | ImageNet | 32×32 px | 80:20 | 97.29% | 81.67% | 76.67% |
| `vit-imgt-32(5050).pth` | ViT-B/32 | ImageNet | 32×32 px | 50:50 | 98.50% | 79.67% | 76.67% |
| `vit-imgt-32(4060).pth` | ViT-B/32 | ImageNet | 32×32 px | 40:60 | 99.58% | 77.22% | 61.67% |
| `vit-hgfc-16(8020).pth` | ViT-B/16 | HuggingFace | 16×16 px | 80:20 | 81.11% | 80.83% | **90.00%** |
| `vit-hgfc-16(7030).pth` | ViT-B/16 | HuggingFace | 16×16 px | 70:30 | 81.31% | 78.05% | 90.00% |
| `vit-hgfc-16(5050).pth` | ViT-B/16 | HuggingFace | 16×16 px | 50:50 | 79.83% | 75.00% | 85.00% |
| `vit-hgfc-16(4060).pth` | ViT-B/16 | HuggingFace | 16×16 px | 40:60 | 79.58% | 71.53% | 83.34% |
| `vit-hgfc-32(8020).pth` | ViT-B/32 | HuggingFace | 32×32 px | 80:20 | 82.71% | 81.67% | 78.34% |
| `vit-hgfc-32(7030).pth` | ViT-B/32 | HuggingFace | 32×32 px | 70:30 | 82.26% | 79.17% | 78.34% |
| `vit-hgfc-32(5050).pth` | ViT-B/32 | HuggingFace | 32×32 px | 50:50 | 81.33% | 77.67% | 80.00% |
| `vit-hgfc-32(4060).pth` | ViT-B/32 | HuggingFace | 32×32 px | 40:60 | 79.58% | 76.94% | 78.34% |
| `mobilenetv2(7030)weight.pth` | MobileNetV2 | ImageNet | — | 70:30 | 79.29% | 78.33% | **86.67%** |
| `mobilenetv2(8020)weight.pth` | MobileNetV2 | ImageNet | — | 80:20 | 79.79% | 78.33% | 75.00% |
| `mobilenetv2(5050)weight.pth` | MobileNetV2 | ImageNet | — | 50:50 | 80.00% | 76.17% | 81.67% |
| `mobilenetv2(4060)weight.pth` | MobileNetV2 | ImageNet | — | 40:60 | 79.79% | 74.17% | 86.67% |
| `coffeebean_vit_best.onnx` + `.onnx.data` | ViT-B/16 (best) | ImageNet | 16×16 px | 70:30 | — | — | **91.67%** |
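
The checkpoint names in the table appear to follow a pattern: architecture, pretraining source (`imgt` for ImageNet, `hgfc` for HuggingFace), patch size, and the train:test split in parentheses. The sketch below decodes that pattern when selecting a file programmatically. The convention is inferred from the table rather than documented, and `parse_checkpoint_name` is a hypothetical helper that is not part of the repository.

```python
import re

# Patterns inferred from the file names in the table above (an assumption,
# not a documented convention): vit-<pretrain>-<patch>(<train><test>).pth
_VIT = re.compile(r"^vit-(imgt|hgfc)-(16|32)\((\d{2})(\d{2})\)\.pth$")
_MOBILENET = re.compile(r"^mobilenetv2\((\d{2})(\d{2})\)weight\.pth$")
_PRETRAIN = {"imgt": "ImageNet", "hgfc": "HuggingFace"}


def parse_checkpoint_name(filename):
    """Return a dict describing a checkpoint file, or raise ValueError."""
    match = _VIT.match(filename)
    if match:
        pretrain, patch, train, test = match.groups()
        return {
            "architecture": f"ViT-B/{patch}",
            "pretrain": _PRETRAIN[pretrain],
            "patch_size": int(patch),
            "split": f"{train}:{test}",
        }
    match = _MOBILENET.match(filename)
    if match:
        train, test = match.groups()
        return {
            "architecture": "MobileNetV2",
            "pretrain": "ImageNet",
            "patch_size": None,  # CNN baseline, no patch embedding
            "split": f"{train}:{test}",
        }
    raise ValueError(f"Unrecognized checkpoint name: {filename!r}")


print(parse_checkpoint_name("vit-imgt-16(7030).pth"))
```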

---

## 🗂️ Quality Classes (SNI)

| Label | Grade |
|---|---|
| `mutu1` | Specialty — 0 defects per 300g |
| `mutu2` | Grade 1 — max 11 defect values |
| `mutu3` | Grade 2 — 12–25 defect values |
| `mutu4` | Grade 3 — 26–44 defect values |
| `mutu5` | Grade 4a — 45–60 defect values |
| `mutu6` | Grade 4b — 61–80 defect values |
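
When labeling new samples, the ranges above can be turned into a small lookup. The sketch below assumes the input is the total defect value of one sample, that "max 11" for `mutu2` means any value above 0 up to 11, and that values above 80 fall outside the six grades. A value between two listed ranges (for example 11.5) is assigned to the higher-numbered label, which is a choice of this sketch rather than something the table specifies. `grade_from_defect_value` is a hypothetical helper and is unrelated to how the models predict.

```python
# Inclusive upper bound of the defect value for each label, taken from the
# table above. Assumption: "max 11" for mutu2 covers values above 0 up to 11.
GRADE_UPPER_BOUNDS = [
    ("mutu1", 0),
    ("mutu2", 11),
    ("mutu3", 25),
    ("mutu4", 44),
    ("mutu5", 60),
    ("mutu6", 80),
]


def grade_from_defect_value(defect_value):
    """Map a total defect value to its class label."""
    if defect_value < 0:
        raise ValueError("defect value cannot be negative")
    for label, upper in GRADE_UPPER_BOUNDS:
        if defect_value <= upper:
            return label
    raise ValueError("defect value above 80 is outside the six grades listed")


for value in (0, 11, 12, 44, 61):
    print(value, grade_from_defect_value(value))
```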

---

## 🚀 Usage

### Load best model (ViT ImageNet B/16, 70:30)

```python
import torch
import torchvision.models as models
from torchvision import transforms
from PIL import Image

# 1. Setup model
class_names = ['mutu1', 'mutu2', 'mutu3', 'mutu4', 'mutu5', 'mutu6']
model = models.vit_b_16(weights=None)
model.heads = torch.nn.Sequential(
    torch.nn.Dropout(0.1),
    torch.nn.Linear(768, len(class_names))
)

# 2. Load weights (download from this repo first)
checkpoint = torch.load('vit-imgt-16(7030).pth', map_location='cpu')
model.load_state_dict(checkpoint)
model.eval()

# 3. Preprocessing (ImageNet standard)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

# 4. Predict
image = Image.open('your_coffee_image.jpg').convert('RGB')  # force 3 channels (grayscale/RGBA inputs)
input_tensor = transform(image).unsqueeze(0)

with torch.no_grad():
    outputs = model(input_tensor)
    probabilities = torch.softmax(outputs, dim=1)
    predicted_idx = torch.argmax(probabilities, dim=1).item()

print(f"Predicted: {class_names[predicted_idx]}")
print(f"Confidence: {probabilities[0, predicted_idx].item():.4f}")
```
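
### Run the ONNX export with ONNX Runtime (sketch)

The repo also ships `coffeebean_vit_best.onnx`, which has no usage example here. The following is a minimal sketch, not a verified recipe. It assumes the export takes a single float32 input of shape `1×3×224×224` with the same ImageNet normalization as the PyTorch example, and that its first output holds raw logits for the six classes. Keep `coffeebean_vit_best.onnx.data` in the same directory as the `.onnx` file so the external weights can be found. The function names are illustrative.

```python
import math

CLASS_NAMES = ["mutu1", "mutu2", "mutu3", "mutu4", "mutu5", "mutu6"]


def softmax(logits):
    """Numerically stable softmax over a plain list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def preprocess(image_path):
    """Resize to 224x224 and apply ImageNet normalization (assumed to match the export)."""
    import numpy as np
    from PIL import Image

    image = Image.open(image_path).convert("RGB").resize((224, 224), Image.BILINEAR)
    pixels = np.asarray(image, dtype=np.float32) / 255.0  # HWC, values in [0, 1]
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    normalized = (pixels - mean) / std
    # Reorder to NCHW and add a batch dimension of 1.
    return np.ascontiguousarray(normalized.transpose(2, 0, 1)[np.newaxis, ...])


def predict_onnx(image_path, model_path="coffeebean_vit_best.onnx"):
    """Return (label, confidence) for one image using ONNX Runtime on CPU."""
    import onnxruntime as ort

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name  # read from the model, not hard-coded
    logits = session.run(None, {input_name: preprocess(image_path)})[0][0]
    probabilities = softmax(logits.tolist())
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return CLASS_NAMES[best], probabilities[best]


# Example call (requires the model files, onnxruntime, numpy, and Pillow):
# label, confidence = predict_onnx("your_coffee_image.jpg")
```

Resizing with Pillow may differ slightly from `transforms.Resize`, so small numeric differences from the PyTorch output are possible.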

### Load MobileNetV2 baseline

```python
import torch
import torchvision.models as models

class_names = ['mutu1', 'mutu2', 'mutu3', 'mutu4', 'mutu5', 'mutu6']
model = models.mobilenet_v2(weights=None)
model.classifier[1] = torch.nn.Linear(model.last_channel, len(class_names))

checkpoint = torch.load('mobilenetv2(7030)weight.pth', map_location='cpu')
model.load_state_dict(checkpoint)
model.eval()
```

---

## ⚙️ Training Details

| Parameter | Value |
|---|---|
| Framework | PyTorch + torchvision |
| Optimizer | AdamW (lr=1e-3, weight_decay=3e-2) |
| Loss | CrossEntropyLoss |
| Epochs | 90 |
| Batch size | 10 |
| Input size | 224×224 px |
| Dropout | 0.1 (classifier head) |
| Normalization | ImageNet mean/std ([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]) |
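
For readers who want to reproduce a comparable setup, the hyperparameters above could be wired up roughly as follows. This is a generic sketch, not the thesis training script: the table does not state the learning-rate schedule, augmentation, or which layers were frozen, so none of those appear here. `TrainConfig`, `build_optimizer_and_loss`, and `train_one_epoch` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values copied from the Training Details table.
    learning_rate: float = 1e-3
    weight_decay: float = 3e-2
    epochs: int = 90
    batch_size: int = 10
    image_size: int = 224
    dropout: float = 0.1


def build_optimizer_and_loss(model, config=TrainConfig()):
    """Create AdamW and CrossEntropyLoss for a torch.nn.Module."""
    import torch  # imported lazily so the config can be inspected without PyTorch

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=config.learning_rate,
        weight_decay=config.weight_decay,
    )
    loss_fn = torch.nn.CrossEntropyLoss()
    return optimizer, loss_fn


def train_one_epoch(model, loader, optimizer, loss_fn, device="cpu"):
    """One pass over a DataLoader yielding (images, labels); returns the mean loss."""
    model.train()
    total_loss, batches = 0.0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        batches += 1
    return total_loss / max(batches, 1)
```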

---

## 📊 Key Findings

- ViT ImageNet B/16 (70:30) reaches 91.67% prediction accuracy, **5 percentage points** above the best MobileNetV2 result (86.67%)
- ViT is also faster at inference, by **2 images/second**
- Tradeoff: the ViT model is ~320 MB larger than MobileNetV2
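
To check the speed comparison on your own hardware, a simple timing loop is enough. The sketch below is one way to measure images per second and is not necessarily how the thesis measured it. `images_per_second` is a hypothetical helper, and `predict_fn` stands for any callable that runs one model on one preprocessed input. The accuracy figures at the end are taken from the Model Files table (70:30 split).

```python
import time


def images_per_second(predict_fn, inputs, warmup=2):
    """Time predict_fn over inputs and return throughput in images/second."""
    for item in inputs[:warmup]:  # warm-up calls are excluded from the timing
        predict_fn(item)
    start = time.perf_counter()
    for item in inputs:
        predict_fn(item)
    elapsed = time.perf_counter() - start
    return len(inputs) / max(elapsed, 1e-12)  # guard against a zero reading


# Accuracy gap between the best ViT and the best MobileNetV2 (prediction accuracy).
vit_acc, mobilenet_acc = 91.67, 86.67
print(f"Accuracy gap: {vit_acc - mobilenet_acc:.2f} percentage points")
```

Run both models on the same inputs and device, and compare the two returned rates.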

---

## 📄 Citation

```bibtex
@thesis{zaafirrahman2024vit,
  author  = {Aulya Az Zaafirrahman},
  title   = {Klasifikasi Mutu Biji Kopi Arabika Berbasis Image Processing Menggunakan Metode Vision Transformer (ViT)},
  school  = {Universitas Brawijaya},
  type    = {Teknik Industri Pertanian},
  year    = {2024}
}
```