---
library_name: pytorch
license: mit
base_model: torchvision/resnet18
tags:
- image-classification
- scene-classification
- transfer-learning
- pytorch
- computer-vision
metrics:
- accuracy
model-index:
- name: scene-classifier-resnet18
  results: []
---
# Scene Classifier - ResNet18
This model is a fine-tuned version of ResNet-18 pretrained on ImageNet. It classifies images into 4 scene categories: cafe, gym, library, and outdoor.
## Model description
This model was trained on video frames extracted from recordings of different indoor and outdoor locations. ResNet-18 was chosen for its balance of accuracy and computational efficiency, and the model relies on transfer learning from ImageNet-pretrained weights: only the final fully-connected layer was retrained for the 4-class classification task.
The model is part of a larger pipeline that generates contextual music based on scene classification combined with weather and temporal metadata.
## Intended uses & limitations
**Intended use:**
- Scene classification for context-aware applications
- Image-to-music generation pipelines
- Indoor/outdoor scene detection
- Educational demonstrations of transfer learning
**Limitations:**
- Limited to 4 specific scene categories (cafe, gym, library, outdoor)
- Limited to scenes from the Carnegie Mellon University (CMU) campus
- Trained on a relatively small dataset extracted from videos
- May not generalize well to significantly different scene compositions
- Performance may degrade on low-quality or heavily edited images
- Indoor scenes may be confused if they share similar visual features
## Training and evaluation data
**Dataset:** Video frame extraction from 4 scene categories
- Classes: cafe, gym, library, outdoor
- Source: Personal video recordings of various locations
- Extraction: Sampled every 10th frame from videos
- Total frames: 500+ images (approximate)
- Format: JPEG, 224x224 resolution after preprocessing
The dataset represents real-world indoor and outdoor environments with varying lighting conditions, angles, and compositions.
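
The "every 10th frame" sampling described above can be sketched as follows. The card does not include the extraction script, so this is only an illustration: the function names, the output file naming, the choice to start counting at frame 0, and the use of OpenCV (`cv2`) are all assumptions.

```python
from pathlib import Path


def every_nth(items, step=10):
    """Yield (index, item) for every `step`-th item, starting at index 0."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    for index, item in enumerate(items):
        if index % step == 0:
            yield index, item


def read_frames(video_path):
    """Yield decoded frames from a video file (assumes OpenCV is installed)."""
    import cv2  # imported lazily so the sampling helper works without OpenCV

    capture = cv2.VideoCapture(str(video_path))
    try:
        while True:
            ok, frame = capture.read()
            if not ok:  # end of stream or read error
                break
            yield frame
    finally:
        capture.release()


def extract_frames(video_path, out_dir, step=10):
    """Save every `step`-th frame as a JPEG and return how many were written."""
    import cv2

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = 0
    for index, frame in every_nth(read_frames(video_path), step):
        # Hypothetical naming scheme: frame_000000.jpg, frame_000010.jpg, ...
        cv2.imwrite(str(out_dir / f"frame_{index:06d}.jpg"), frame)
        saved += 1
    return saved
```

A call such as `extract_frames("cafe_01.mp4", "data/cafe")` (hypothetical paths) would write one image per ten source frames into a per-class folder.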
## Training procedure
### Data preprocessing
Images were resized to 224x224 and converted to tensors. No normalization step is listed; the inference transform under Usage likewise applies only these two steps.
### Model architecture
- Base model: ResNet-18 (ImageNet pretrained)
- Modified layer: Final fully-connected layer changed from 1000 classes to 4 classes
- Transfer learning: All layers except final FC layer retained pretrained weights
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-4
- train_batch_size: 32
- optimizer: Adam with default betas=(0.9, 0.999)
- loss_function: CrossEntropyLoss
- num_epochs: 3
- device: CUDA (GPU accelerated)
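
A minimal training loop wired up with these hyperparameters might look like the sketch below. Only the learning rate, batch size, optimizer, loss, and epoch count come from the list above; the random tensors and the tiny linear model are stand-ins so the example runs without the dataset, and the loop structure itself is an assumption.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in data: 64 random "images" with random labels for the 4 classes.
images = torch.rand(64, 3, 32, 32)
labels = torch.randint(0, 4, (64,))
loader = DataLoader(TensorDataset(images, labels), batch_size=32, shuffle=True)

# Stand-in model; the real model would be the ResNet-18 with a 4-class head.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 4)).to(device)

criterion = nn.CrossEntropyLoss()
# Hand the optimizer only the parameters that are still trainable.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4, betas=(0.9, 0.999))

epoch_losses = []
for epoch in range(3):
    model.train()
    running_loss = 0.0
    for batch_images, batch_labels in loader:
        batch_images = batch_images.to(device)
        batch_labels = batch_labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(batch_images), batch_labels)
        loss.backward()
        optimizer.step()
        # Weight each batch loss by its size to get a per-sample average.
        running_loss += loss.item() * batch_images.size(0)
    epoch_losses.append(running_loss / len(loader.dataset))
    print(f"epoch {epoch + 1}: loss {epoch_losses[-1]:.4f}")
```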
### Training results
Training was conducted over 3 epochs with consistent loss reduction:
| Epoch | Training Loss | Status |
|:-----:|:-------------:|:------:|
| 1 | 0.3395 | ✓ |
| 2 | 0.0111 | ✓ |
| 3 | 0.0041 | ✓ |
Note: Formal validation metrics were not computed during training. The model was validated qualitatively on held-out images.
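
For anyone who wants quantitative numbers, accuracy and a confusion matrix can be computed from held-out predictions with a few lines of plain Python. The helper names and the label lists below are invented for illustration; they are not results from this model.

```python
CLASS_LABELS = ["cafe", "gym", "library", "outdoor"]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth class index."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Made-up held-out labels and predictions (two images per class).
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 2, 1, 1, 2, 0, 3, 3]

print(accuracy(y_true, y_pred))  # 0.75
for label, row in zip(CLASS_LABELS, confusion_matrix(y_true, y_pred, 4)):
    print(f"{label:>8}: {row}")
```

In this invented example the off-diagonal counts fall between cafe and library, which is the kind of indoor-scene confusion listed under limitations.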
## Usage
The model classifies an input image into one of four classes: cafe, gym, library, or outdoor.
### Loading the model
```python
import torch
from torchvision import models, transforms
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Rebuild the architecture with a 4-class head, then load the fine-tuned weights.
model = models.resnet18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 4)
model.load_state_dict(torch.load("pytorch_model.pth", map_location=device))
model = model.to(device)
model.eval()

# Same preprocessing as training: resize and convert to a tensor.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Index order must match the label order used during training.
class_labels = ["cafe", "gym", "library", "outdoor"]
```
### Inference
```python
image = Image.open("your_image.jpg")
if image.mode != 'RGB':
    image = image.convert('RGB')

# Add a batch dimension and move the input to the model's device.
input_tensor = transform(image).unsqueeze(0).to(device)

with torch.no_grad():
    outputs = model(input_tensor)
    predicted_idx = outputs.argmax(dim=1).item()
    predicted_class = class_labels[predicted_idx]
    confidence = torch.softmax(outputs, dim=1)[0][predicted_idx].item()

print(f"Predicted: {predicted_class} (confidence: {confidence:.2%})")
```
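
The confidence value reported above is the softmax probability of the predicted class. To make that step concrete, here is the same computation in plain Python; the logits are made up for illustration and are not outputs of this model.

```python
import math

CLASS_LABELS = ["cafe", "gym", "library", "outdoor"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 0.5, 0.1, -1.0]  # made-up raw scores for one image
probs = softmax(logits)

# argmax over probabilities picks the same index as argmax over logits,
# because softmax preserves the ordering of its inputs.
predicted_idx = max(range(len(probs)), key=probs.__getitem__)
print(f"{CLASS_LABELS[predicted_idx]}: {probs[predicted_idx]:.2%}")
```

A softmax probability measures how far one class score is ahead of the others. It is not a calibrated estimate of correctness, so high values on images unlike the training data should be read with caution.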
## Framework versions
- PyTorch: 2.0+
- Torchvision: 0.15+
- Python: 3.8+
- Pillow: 9.0+
## Model Architecture Details
ResNet-18 Structure:
- Input: 3x224x224 RGB image
- Convolutional layers with residual connections
- Global average pooling
- Final FC layer: 512 to 4 classes
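- Total parameters: approximately 11.2M with the 4-class head (the stock 1000-class ResNet-18 has approximately 11.7M); only about 2K (512 x 4 weights + 4 biases = 2,052) are trainable, in the final layer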
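
The parameter figures can be checked with simple arithmetic. The sketch below assumes the standard torchvision ResNet-18, which has 11,689,512 parameters with its 1000-class head; the helper name is introduced here for illustration.

```python
def linear_param_count(in_features, out_features):
    """Weights plus biases of a fully-connected layer."""
    return in_features * out_features + out_features


RESNET18_TOTAL = 11_689_512  # torchvision ResNet-18 with the 1000-class head

original_head = linear_param_count(512, 1000)
new_head = linear_param_count(512, 4)
total_with_new_head = RESNET18_TOTAL - original_head + new_head

print(original_head)        # 513000
print(new_head)             # 2052
print(total_with_new_head)  # 11178564
```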
## Additional Information
This model was developed as part of a course project (24-679) exploring multimodal AI systems.
It serves as the visual classification component in an image-to-music generation pipeline that combines scene recognition,
metadata extraction, weather context, and music synthesis.
AI disclosure: ChatGPT and Claude were used in the creation of this model and dataset.