---
library_name: pytorch
license: mit
base_model: torchvision/resnet18
tags:
- image-classification
- scene-classification
- transfer-learning
- pytorch
- computer-vision
metrics:
- accuracy
model-index:
- name: scene-classifier-resnet18
  results: []
---

# Scene Classifier - ResNet18

This model is a fine-tuned version of ResNet-18 pretrained on ImageNet. It classifies images into 4 scene categories: cafe, gym, library, and outdoor.

## Model description

This model was trained on video frames extracted from recordings of different indoor and outdoor locations. ResNet-18 was chosen for its balance of accuracy and computational efficiency, and the model uses transfer learning from ImageNet-pretrained weights. Only the final fully-connected layer was retrained for the 4-class classification task.

The model is part of a larger pipeline that generates contextual music based on scene classification combined with weather and temporal metadata.

## Intended uses & limitations

**Intended use:**
- Scene classification for context-aware applications
- Image-to-music generation pipelines
- Indoor/outdoor scene detection
- Educational demonstrations of transfer learning

**Limitations:**
- Limited to 4 specific scene categories (cafe, gym, library, outdoor)
- Training data comes only from the Carnegie Mellon University (CMU) campus
- Trained on a relatively small dataset extracted from videos
- May not generalize well to significantly different scene compositions
- Performance may degrade on low-quality or heavily edited images
- Indoor scenes may be confused if they share similar visual features

## Training and evaluation data

**Dataset:** Video frame extraction from 4 scene categories

- Classes: cafe, gym, library, outdoor
- Source: Personal video recordings of various locations
- Extraction: Sampled every 10th frame from videos
- Total frames: 500+ images (approximate count)
- Format: JPEG, 224x224 resolution after preprocessing

The dataset represents real-world indoor and outdoor environments with varying lighting conditions, angles, and compositions.
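
The "every 10th frame" sampling described above can be sketched as a small index helper. This is an illustration only: the function name `sampled_frame_indices` is hypothetical, and it assumes sampling starts at frame 0, which the description above does not state. Decoding the video itself (for example with a video library) is left out.

```python
from typing import List


def sampled_frame_indices(total_frames: int, step: int = 10) -> List[int]:
    """Return the indices of frames kept when taking every `step`-th frame.

    Assumes sampling starts at frame 0 (an assumption, not stated in the card).
    """
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(0, total_frames, step))


# A hypothetical 95-frame clip keeps frames 0, 10, ..., 90 (10 frames in total).
indices = sampled_frame_indices(95)
print(len(indices), indices[:3], indices[-1])
```

Sampling with a fixed stride keeps the dataset small, but neighbouring frames from the same clip can still look very similar, which is worth keeping in mind when splitting frames into training and held-out sets.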

## Training procedure

### Data preprocessing

Images were resized to 224x224 and converted to tensors.

### Model architecture

- Base model: ResNet-18 (ImageNet pretrained)
- Modified layer: Final fully-connected layer changed from 1000 classes to 4 classes
- Transfer learning: All layers except final FC layer retained pretrained weights

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 1e-4
- train_batch_size: 32
- optimizer: Adam with default betas=(0.9, 0.999)
- loss_function: CrossEntropyLoss
- num_epochs: 3
- device: CUDA (GPU accelerated)
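
The hyperparameters above map onto a training loop roughly like the following. This is a sketch under stated assumptions, not the original training script: the random tensors and the tiny linear model are stand-ins so the example runs quickly on CPU, and in the real setup the model would be the modified ResNet-18 and the loader would serve the extracted frames.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in data (assumption): 64 random "images" with labels in {0, 1, 2, 3}.
images = torch.randn(64, 3, 32, 32)
labels = torch.randint(0, 4, (64,))
loader = DataLoader(TensorDataset(images, labels), batch_size=32, shuffle=True)

# Stand-in model (assumption); the real model would be the modified ResNet-18.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 4))

criterion = nn.CrossEntropyLoss()
# Hand only trainable parameters to the optimizer, so frozen layers are skipped.
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable_params, lr=1e-4, betas=(0.9, 0.999))

epoch_losses = []
for epoch in range(3):
    model.train()
    running_loss = 0.0
    for batch_images, batch_labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(batch_images), batch_labels)
        loss.backward()
        optimizer.step()
        # Weight each batch loss by its size to get a per-sample average.
        running_loss += loss.item() * batch_images.size(0)
    epoch_losses.append(running_loss / len(loader.dataset))
    print(f"epoch {epoch + 1}: loss {epoch_losses[-1]:.4f}")
```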

### Training results

Training was conducted over 3 epochs with consistent loss reduction:

| Epoch | Training Loss | Status |
|:-----:|:-------------:|:------:|
| 1     | 0.3395        | ✓      |
| 2     | 0.0111        | ✓      |
| 3     | 0.0041        | ✓      |

Note: Formal validation metrics were not computed during training. The model was validated qualitatively on held-out images.
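
If quantitative validation is added later, overall accuracy and per-class recall on a labelled held-out set could be computed along these lines. The helper name `evaluate` is hypothetical, and the predictions in the example are made up purely for illustration; they are not results from this model.

```python
from collections import Counter
from typing import Dict, List, Tuple

CLASS_LABELS = ["cafe", "gym", "library", "outdoor"]


def evaluate(predicted: List[int], actual: List[int]) -> Tuple[float, Dict[str, float]]:
    """Return overall accuracy and per-class recall for index-encoded labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    # Count correct predictions per true class, and how often each class occurs.
    correct = Counter(a for p, a in zip(predicted, actual) if p == a)
    support = Counter(actual)
    accuracy = sum(correct.values()) / len(actual)
    per_class = {
        CLASS_LABELS[idx]: correct[idx] / count for idx, count in support.items()
    }
    return accuracy, per_class


# Made-up predictions for illustration only; not results from this model.
accuracy, per_class = evaluate([0, 1, 2, 3, 2, 1], [0, 1, 2, 3, 1, 1])
print(f"accuracy: {accuracy:.2%}", per_class)
```

Per-class recall is useful here because the limitations section notes that indoor scenes may be confused with each other, which a single accuracy number can hide.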

## Usage

The model classifies an input image into one of four classes: cafe, gym, library, or outdoor.

### Loading the model

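```python
import torch
from torchvision import models, transforms
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Rebuild the architecture with a 4-class head, then load the fine-tuned weights.
model = models.resnet18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 4)
model.load_state_dict(torch.load("pytorch_model.pth", map_location=device))
model = model.to(device)
model.eval()  # inference mode: BatchNorm uses its stored running statistics

# Same preprocessing as training: resize and convert to a tensor.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Index order must match the class order used during training.
class_labels = ["cafe", "gym", "library", "outdoor"]
```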

### Inference

```python
image = Image.open("your_image.jpg")
# The model expects 3-channel input, so convert grayscale or RGBA images.
if image.mode != 'RGB':
    image = image.convert('RGB')

# Add a batch dimension: (3, 224, 224) -> (1, 3, 224, 224).
input_tensor = transform(image).unsqueeze(0).to(device)

with torch.no_grad():
    outputs = model(input_tensor)
    predicted_idx = outputs.argmax(dim=1).item()
    predicted_class = class_labels[predicted_idx]
    # Softmax turns the raw logits into class probabilities.
    confidence = torch.softmax(outputs, dim=1)[0][predicted_idx].item()

print(f"Predicted: {predicted_class} (confidence: {confidence:.2%})")
```

## Framework versions

- PyTorch: 2.0+
- Torchvision: 0.15+
- Python: 3.8+
- Pillow: 9.0+

## Model Architecture Details

ResNet-18 Structure:
- Input: 3x224x224 RGB image
- Convolutional layers with residual connections
- Global average pooling
- Final FC layer: 512 to 4 classes
- Total parameters: approximately 11.2M with the 4-class head, of which only 2,052 (512 x 4 weights + 4 biases) are trainable in the final layer

## Additional Information

This model was developed as part of a course project (24-679) exploring multimodal AI systems. 
It serves as the visual classification component in an image-to-music generation pipeline that combines scene recognition, 
metadata extraction, weather context, and music synthesis.

AI assistance: ChatGPT and Claude were used in the creation of this model and dataset.