---
language:
- id
license: mit
tags:
- image-classification
- vision-transformer
- vit
- coffee
- food-quality
- pytorch
datasets:
- zaafirrahman/arabica-beans
metrics:
- accuracy
---

# ☕ Arabica Coffee Bean Quality Classification — ViT & MobileNetV2

Trained models from an undergraduate thesis on classifying Arabica coffee bean quality with a Vision Transformer (ViT), benchmarked against a MobileNetV2 CNN baseline. Classification follows the **Indonesian National Standard (SNI)** defect value system across **6 quality grades**.

📄 Full paper & poster → [GitHub Releases](https://github.com/zaafirrahman/coffeebean_quality_vit/releases)

💻 Code & notebooks → [GitHub Repo](https://github.com/zaafirrahman/coffeebean_quality_vit)

📦 Dataset → [Kaggle](https://www.kaggle.com/datasets/zaafirrahman/arabica-beans)

---

## 🏆 Model Files

| File | Architecture | Pretrain | Patch Size | Train:Test | Train Acc | Test Acc | Prediction Acc |
|---|---|---|---|---|---|---|---|
| `vit-imgt-16(7030).pth` ⭐ | ViT-B/16 | ImageNet | 16×16 px | 70:30 | 97.02% | 84.72% | **91.67%** |
| `vit-imgt-16(8020).pth` | ViT-B/16 | ImageNet | 16×16 px | 80:20 | 96.98% | 86.25% | 88.34% |
| `vit-imgt-16(5050).pth` | ViT-B/16 | ImageNet | 16×16 px | 50:50 | 98.50% | 83.50% | 86.67% |
| `vit-imgt-16(4060).pth` | ViT-B/16 | ImageNet | 16×16 px | 40:60 | 99.80% | 79.72% | 83.34% |
| `vit-imgt-32(7030).pth` | ViT-B/32 | ImageNet | 32×32 px | 70:30 | 97.50% | 81.67% | 76.67% |
| `vit-imgt-32(8020).pth` | ViT-B/32 | ImageNet | 32×32 px | 80:20 | 97.29% | 81.67% | 76.67% |
| `vit-imgt-32(5050).pth` | ViT-B/32 | ImageNet | 32×32 px | 50:50 | 98.50% | 79.67% | 76.67% |
| `vit-imgt-32(4060).pth` | ViT-B/32 | ImageNet | 32×32 px | 40:60 | 99.58% | 77.22% | 61.67% |
| `vit-hgfc-16(8020).pth` | ViT-B/16 | HuggingFace | 16×16 px | 80:20 | 81.11% | 80.83% | **90.00%** |
| `vit-hgfc-16(7030).pth` | ViT-B/16 | HuggingFace | 16×16 px | 70:30 | 81.31% | 78.05% | 90.00% |
| `vit-hgfc-16(5050).pth` | ViT-B/16 | HuggingFace | 16×16 px | 50:50 | 79.83% | 75.00% | 85.00% |
| `vit-hgfc-16(4060).pth` | ViT-B/16 | HuggingFace | 16×16 px | 40:60 | 79.58% | 71.53% | 83.34% |
| `vit-hgfc-32(8020).pth` | ViT-B/32 | HuggingFace | 32×32 px | 80:20 | 82.71% | 81.67% | 78.34% |
| `vit-hgfc-32(7030).pth` | ViT-B/32 | HuggingFace | 32×32 px | 70:30 | 82.26% | 79.17% | 78.34% |
| `vit-hgfc-32(5050).pth` | ViT-B/32 | HuggingFace | 32×32 px | 50:50 | 81.33% | 77.67% | 80.00% |
| `vit-hgfc-32(4060).pth` | ViT-B/32 | HuggingFace | 32×32 px | 40:60 | 79.58% | 76.94% | 78.34% |
| `mobilenetv2(7030)weight.pth` | MobileNetV2 | ImageNet | — | 70:30 | 79.29% | 78.33% | **86.67%** |
| `mobilenetv2(8020)weight.pth` | MobileNetV2 | ImageNet | — | 80:20 | 79.79% | 78.33% | 75.00% |
| `mobilenetv2(5050)weight.pth` | MobileNetV2 | ImageNet | — | 50:50 | 80.00% | 76.17% | 81.67% |
| `mobilenetv2(4060)weight.pth` | MobileNetV2 | ImageNet | — | 40:60 | 79.79% | 74.17% | 86.67% |
| `coffeebean_vit_best.onnx` + `.onnx.data` | ViT-B/16 (best) | ImageNet | 16×16 px | 70:30 | — | — | **91.67%** |

---

## 🗂️ Quality Classes (SNI)

| Label | Grade |
|---|---|
| `mutu1` | Specialty — 0 defects per 300 g |
| `mutu2` | Grade 1 — max 11 defect values |
| `mutu3` | Grade 2 — 12–25 defect values |
| `mutu4` | Grade 3 — 26–44 defect values |
| `mutu5` | Grade 4a — 45–60 defect values |
| `mutu6` | Grade 4b — 61–80 defect values |

---

## 🚀 Usage

### Load the best model (ViT ImageNet B/16, 70:30)

```python
import torch
import torchvision.models as models
from torchvision import transforms
from PIL import Image

# 1. Set up the model with a 6-class classifier head
class_names = ['mutu1', 'mutu2', 'mutu3', 'mutu4', 'mutu5', 'mutu6']

model = models.vit_b_16(weights=None)
model.heads = torch.nn.Sequential(
    torch.nn.Dropout(0.1),
    torch.nn.Linear(768, len(class_names))
)

# 2. Load weights (download from this repo first)
checkpoint = torch.load('vit-imgt-16(7030).pth', map_location='cpu')
model.load_state_dict(checkpoint)
model.eval()

# 3. Preprocessing (ImageNet standard)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

# 4. Predict (convert to RGB so grayscale or RGBA files also yield 3 channels)
image = Image.open('your_coffee_image.jpg').convert('RGB')
input_tensor = transform(image).unsqueeze(0)

with torch.no_grad():
    outputs = model(input_tensor)
    probabilities = torch.softmax(outputs, dim=1)
    predicted_idx = torch.argmax(probabilities, dim=1).item()

print(f"Predicted: {class_names[predicted_idx]}")
print(f"Confidence: {probabilities[0, predicted_idx].item():.4f}")
```

### Load MobileNetV2 baseline

```python
import torch
import torchvision.models as models

class_names = ['mutu1', 'mutu2', 'mutu3', 'mutu4', 'mutu5', 'mutu6']

# Replace the final classifier layer with a 6-class output
model = models.mobilenet_v2(weights=None)
model.classifier[1] = torch.nn.Linear(model.last_channel, len(class_names))

checkpoint = torch.load('mobilenetv2(7030)weight.pth', map_location='cpu')
model.load_state_dict(checkpoint)
model.eval()
```

---

## ⚙️ Training Details

| Parameter | Value |
|---|---|
| Framework | PyTorch + torchvision |
| Optimizer | AdamW (lr=1e-3, weight_decay=3e-2) |
| Loss | CrossEntropyLoss |
| Epochs | 90 |
| Batch size | 10 |
| Input size | 224×224 px |
| Dropout | 0.1 (classifier head) |
| Normalization | ImageNet mean/std ([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]) |

---

## 📊 Key Findings

- ViT ImageNet B/16 outperforms MobileNetV2 by **+5 percentage points** in prediction accuracy (91.67% vs. 86.67%)
- ViT also runs inference **2 images/second** faster
- Tradeoff: the ViT model file is ~320 MB larger than MobileNetV2

---

## 📄 Citation

```bibtex
@thesis{zaafirrahman2024vit,
  author = {Aulya Az Zaafirrahman},
  title  = {Klasifikasi Mutu Biji Kopi Arabika Berbasis Image Processing Menggunakan Metode Vision Transformer (ViT)},
  school = {Universitas Brawijaya},
  type   = {Teknik Industri Pertanian},
  year   = {2024}
}
```
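
---

## 🧮 Appendix: Defect Value to Label Lookup

The defect-value ranges listed under Quality Classes (SNI) can be written as a small lookup, which may help when checking ground-truth labels against a measured defect count. This is a minimal sketch and not part of the thesis code: the names `SNI_GRADES` and `label_for_defect_value` are hypothetical, the lower bound of 1 for `mutu2` is an assumption (the table only states a maximum of 11), and rejecting values above 80 is also an assumption.

```python
# Defect-value ranges copied from the Quality Classes table in this card.
# Each entry is (label, lowest defect value, highest defect value), inclusive.
SNI_GRADES = [
    ("mutu1", 0, 0),    # Specialty
    ("mutu2", 1, 11),   # Grade 1 (lower bound of 1 is an assumption)
    ("mutu3", 12, 25),  # Grade 2
    ("mutu4", 26, 44),  # Grade 3
    ("mutu5", 45, 60),  # Grade 4a
    ("mutu6", 61, 80),  # Grade 4b
]


def label_for_defect_value(defect_value: int) -> str:
    """Return the class label whose inclusive range contains defect_value."""
    for label, low, high in SNI_GRADES:
        if low <= defect_value <= high:
            return label
    # Assumption: values outside 0 to 80 are not covered by the six classes.
    raise ValueError(f"defect value {defect_value} is outside the 0 to 80 range")


print(label_for_defect_value(30))  # prints: mutu4
```

The returned strings match the entries of `class_names` in the usage examples, so the result can be compared directly with a model prediction.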