---
base_model: jozhang97/deta-swin-large
datasets:
- Voxel51/fisheye8k
library_name: transformers
tags:
- generated_from_trainer
pipeline_tag: object-detection
license: mit
model-index:
- name: fisheye8k_jozhang97_deta-swin-large
  results: []
---

# fisheye8k_jozhang97_deta-swin-large

This model is a fine-tuned version of [jozhang97/deta-swin-large](https://huggingface.co/jozhang97/deta-swin-large) on the [Fisheye8K dataset](https://huggingface.co/datasets/Voxel51/fisheye8k). It was developed as part of the **Mcity Data Engine** project, an open-source system designed for iterative model improvement through open-vocabulary data selection.

It achieves the following results on the evaluation set:
- Loss: 17.9701

- **Paper**: [Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection](https://huggingface.co/papers/2504.21614)
- **Project Page**: [Mcity Data Engine Docs](https://mcity.github.io/mcity_data_engine/)
- **Code**: [GitHub Repository](https://github.com/mcity/mcity_data_engine)

## Model description

This model was produced with the **Mcity Data Engine**, an open-source system that covers the complete data-based development cycle of machine learning models. The engine targets challenges in Intelligent Transportation Systems (ITS), where rare and novel classes must be detected in large amounts of unlabeled data, such as the data generated by vehicle fleets and roadside perception systems.

This `fisheye8k_jozhang97_deta-swin-large` model is an object detection model fine-tuned using the Mcity Data Engine's methodologies. It focuses on identifying specific object categories relevant to ITS, trained on data from fisheye cameras. The engine facilitates iterative model improvements by intelligently selecting and labeling data, especially for long-tail classes.

## Intended uses & limitations

**Intended Uses**: This model is primarily intended for object detection tasks within Intelligent Transportation Systems (ITS). It is designed to identify objects such as `Bus`, `Bike`, `Car`, `Pedestrian`, and `Truck` in visual data, particularly from fisheye camera perspectives, as part of the iterative data selection and model training processes facilitated by the Mcity Data Engine. It serves as a practical demonstration and artifact of the engine's capabilities.

**Limitations**: The model was fine-tuned on a single dataset (Fisheye8K), so its performance may vary on data with significantly different characteristics, environmental conditions, or object distributions. It is best used within the broader Mcity Data Engine framework, which supports continuous improvement and adaptation to novel classes.

## Training and evaluation data

This model was fine-tuned on the [Voxel51/fisheye8k](https://huggingface.co/datasets/Voxel51/fisheye8k) dataset, which provides fisheye camera data relevant to Intelligent Transportation Systems. The training process leverages the open-vocabulary data selection capabilities of the Mcity Data Engine to identify and incorporate relevant samples, including rare and long-tail classes. The model detects the following classes: `Bus`, `Bike`, `Car`, `Pedestrian`, `Truck`.

## Sample Usage

You can use this model directly with the Hugging Face `transformers` library for object detection:

```python
import torch
from transformers import AutoImageProcessor, AutoModelForObjectDetection
from PIL import Image
import requests

# Load an example image (replace with your fisheye image if available)
# This example uses a standard COCO image for demonstration purposes.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Load image processor and model from the Hugging Face Hub.
# Check that this repo ID matches the Hub repository hosting this model card.
model_name = "jozhang97/fisheye8k_jozhang97_deta-swin-large"
image_processor = AutoImageProcessor.from_pretrained(model_name)
model = AutoModelForObjectDetection.from_pretrained(model_name)

# Process image and get predictions
inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Post-process outputs to get bounding boxes, labels, and scores
target_sizes = torch.tensor([image.size[::-1]]) # (height, width) for post-processing
results = image_processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.9)[0]

# Print detected objects
print("Detected objects:")
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(
        f"  Detected {model.config.id2label[label.item()]} "
        f"with confidence {round(score.item(), 3)} at location {box}"
    )
```
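
The boxes returned by `post_process_object_detection` are in absolute corner format, `(x_min, y_min, x_max, y_max)`, in pixels. If a downstream tool expects COCO-style `(x, y, width, height)` boxes instead, a conversion could look like the minimal sketch below. The helper name `xyxy_to_xywh` and the example box are illustrative assumptions, not part of the model or the library.

```python
def xyxy_to_xywh(box):
    """Convert a corner-format box (x_min, y_min, x_max, y_max) to (x, y, width, height)."""
    x_min, y_min, x_max, y_max = box
    return [x_min, y_min, x_max - x_min, y_max - y_min]


# Hypothetical box, made up for illustration only (not a real model output)
example_box = [10.0, 20.0, 110.0, 70.0]
converted = xyxy_to_xywh(example_box)
print(converted)
```

In the usage example above, the same helper could be applied to each `box` list inside the printing loop.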

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 0
- optimizer: AdamW (`OptimizerNames.ADAMW_TORCH`) with betas=(0.9,0.999) and epsilon=1e-08, no additional optimizer arguments
- lr_scheduler_type: cosine
- num_epochs: 36
- mixed_precision_training: Native AMP
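
To give a feel for what the `cosine` scheduler does with these settings, the sketch below computes a half-cosine decay in plain Python. It rests on assumptions that are not stated in this card: no warmup steps, no gradient accumulation, and a schedule planned over all 36 epochs at 5288 optimizer steps per epoch (the step count logged at epoch 1.0). The function name `cosine_lr` is illustrative.

```python
import math

BASE_LR = 5e-05          # learning_rate from the list above
STEPS_PER_EPOCH = 5288   # step count logged at epoch 1.0 in the training results
NUM_EPOCHS = 36          # num_epochs from the list above


def cosine_lr(step, total_steps=STEPS_PER_EPOCH * NUM_EPOCHS, base_lr=BASE_LR):
    """Learning rate after `step` optimizer steps under half-cosine decay (no warmup assumed)."""
    progress = min(step / total_steps, 1.0)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Print the learning rate at a few points of the assumed schedule
for epoch in (0, 7, 18, 36):
    print(f"epoch {epoch:2d}: lr = {cosine_lr(epoch * STEPS_PER_EPOCH):.3e}")
```

Under these assumptions the learning rate starts at the base value, reaches half of it at the midpoint of training, and decays to zero at the final step.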

### Training results

| Training Loss | Epoch | Step  | Validation Loss |
|:-------------:|:-----:|:-----:|:---------------:|
| 13.7551       | 1.0   | 5288  | 17.5573         |
| 12.6537       | 2.0   | 10576 | 17.4879         |
| 12.023        | 3.0   | 15864 | 17.6520         |
| 11.4167       | 4.0   | 21152 | 18.5138         |
| 10.8161       | 5.0   | 26440 | 17.7264         |
| 10.5346       | 6.0   | 31728 | 17.9145         |
| 10.1203       | 7.0   | 37016 | 17.9701         |
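
The logged values can also be compared programmatically. The sketch below copies the rows from the table above and looks up the epoch with the lowest validation loss; the variable names are illustrative.

```python
# (epoch, training loss, validation loss) copied from the table above
history = [
    (1, 13.7551, 17.5573),
    (2, 12.6537, 17.4879),
    (3, 12.0230, 17.6520),
    (4, 11.4167, 18.5138),
    (5, 10.8161, 17.7264),
    (6, 10.5346, 17.9145),
    (7, 10.1203, 17.9701),
]

# Pick the row with the smallest validation loss, and the last logged row
best_epoch, _, best_val = min(history, key=lambda row: row[2])
final_epoch, _, final_val = history[-1]

print(f"lowest validation loss: {best_val} at epoch {best_epoch}")
print(f"final validation loss:  {final_val} at epoch {final_epoch}")
```

Training loss decreases at every epoch, while validation loss is lowest at epoch 2. The evaluation loss reported at the top of this card (17.9701) matches the last logged epoch rather than the minimum.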


### Framework versions

- Transformers 4.48.3
- Pytorch 2.5.1+cu124
- Datasets 3.2.0
- Tokenizers 0.21.0

## Acknowledgements
Mcity would like to thank Amazon Web Services (AWS) for their pivotal role in providing the cloud infrastructure on which the Data Engine depends. We couldn’t have done it without their tremendous support!

## Citation

If you use the Mcity Data Engine in your research, feel free to cite the project:

```bibtex
@article{bogdoll2025mcitydataengine,
  title={Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection},
  author={Bogdoll, Daniel and Anata, Rajanikant Patnaik and Stevens, Gregory},
  journal={arXiv preprint arXiv:2504.21614},
  year={2025}
}
```