---
datasets:
- imagenet-1k
library_name: transformers
pipeline_tag: image-classification
---

# SwiftFormer (swiftformer-xs)

## Model description

The SwiftFormer model was proposed in [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://openaccess.thecvf.com/content/ICCV2023/html/Shaker_SwiftFormer_Efficient_Additive_Attention_for_Transformer-based_Real-time_Mobile_Vision_Applications_ICCV_2023_paper.html) by Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan.

The SwiftFormer paper introduces an efficient additive attention mechanism that replaces the matrix multiplications in self-attention, whose cost grows quadratically with the number of tokens, with element-wise multiplications whose cost grows linearly. The authors build a series of models called SwiftFormer on this mechanism, which achieve state-of-the-art performance in both accuracy and mobile inference speed. Even the small variant (SwiftFormer-S) reaches 78.5% top-1 accuracy on ImageNet-1K with only 0.8 ms latency on an iPhone 14, making it more accurate than MobileViT-v2 and 2× faster.

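To make the linear-versus-quadratic contrast concrete, here is a minimal NumPy sketch of efficient additive attention over a single sequence. It is an illustration under stated assumptions, not the `transformers` implementation or the authors' code: the weight names (`w_q`, `w_k`, `w_a`, `w_proj`, `w_out`) are hypothetical, biases and multiple heads are omitted, and normalizing the per-token scores with an L2 norm across tokens is one plausible choice that should be checked against the paper. Each token gets a scalar score from a learnable vector `w_a`; the scores pool the queries into one global query, which then interacts with every key through an element-wise product.

```python
import numpy as np


def l2_normalize(v: np.ndarray, axis: int) -> np.ndarray:
    """Scale `v` to unit L2 norm along `axis`; the epsilon guards against division by zero."""
    return v / np.maximum(np.linalg.norm(v, axis=axis, keepdims=True), 1e-12)


def efficient_additive_attention(x, w_q, w_k, w_a, w_proj, w_out):
    """Sketch of efficient additive attention over one sequence of n tokens.

    x has shape (n, d_in). All weights are hypothetical stand-ins for learned
    parameters; biases and multiple heads are left out for brevity.
    """
    d = w_q.shape[1]
    query = l2_normalize(x @ w_q, axis=-1)  # (n, d), each token's query has unit norm
    key = l2_normalize(x @ w_k, axis=-1)    # (n, d)

    # One scalar score per token from the learnable vector w_a: O(n * d), no n x n matrix.
    scores = (query @ w_a) / np.sqrt(d)     # (n,)
    scores = l2_normalize(scores, axis=0)   # assumed normalization across tokens

    # Pool all queries into a single global query vector.
    global_query = (scores[:, None] * query).sum(axis=0)  # (d,)

    # Element-wise interaction of the global query with every key, plus a query residual.
    out = (global_query * key) @ w_proj + query  # (n, d)
    return out @ w_out                           # (n, d_out)


# Illustrative sizes only: 196 tokens (a 14 x 14 grid) with 64 input channels.
rng = np.random.default_rng(0)
n, d_in, d = 196, 64, 32
x = rng.standard_normal((n, d_in))
params = {
    "w_q": rng.standard_normal((d_in, d)),
    "w_k": rng.standard_normal((d_in, d)),
    "w_a": rng.standard_normal(d),
    "w_proj": rng.standard_normal((d, d)),
    "w_out": rng.standard_normal((d, d_in)),
}
y = efficient_additive_attention(x, **params)
print(y.shape)  # (196, 64)
```

No n × n score matrix is formed at any step: every operation is a per-token projection, a weighted sum over tokens, or an element-wise product, so time and memory grow linearly with the number of tokens n. Standard self-attention instead computes an n × n matrix of query-key dot products.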
## Intended uses & limitations

## How to use

```python
import requests
import torch
from PIL import Image
from transformers import SwiftFormerForImageClassification, ViTImageProcessor

# Load a sample image from the COCO 2017 validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Preprocess the image into the pixel_values tensor the model expects
processor = ViTImageProcessor.from_pretrained("shehan97/swiftformer-xs")
inputs = processor(images=image, return_tensors="pt")

# Load the ImageNet-1K classification checkpoint and run inference without gradients
model = SwiftFormerForImageClassification.from_pretrained("shehan97/swiftformer-xs")
with torch.no_grad():
    outputs = model(**inputs)

# The model predicts one of the 1,000 ImageNet-1K classes
logits = outputs.logits
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```

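The snippet above reports only the single most likely class. To see several candidates with probabilities, the logits can be passed through a softmax and ranked. The helper below is a hypothetical, dependency-free sketch, not part of `transformers`; it assumes `logits` holds the scores for one image (for example `logits[0].tolist()`) and `id2label` is the model config's `id2label` mapping.

```python
import math


def top_k_predictions(logits, id2label, k=5):
    """Return the k most likely (label, probability) pairs for ONE image.

    logits: sequence of raw per-class scores, e.g. logits[0].tolist().
    id2label: mapping from class index to class name, e.g. the model config's id2label.
    """
    # Numerically stable softmax: subtract the largest logit before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank class indices by probability, highest first, and keep the top k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(id2label[i], probs[i]) for i in ranked]


# Toy example with four made-up classes (not real ImageNet-1K labels).
toy_labels = {0: "cat", 1: "dog", 2: "parrot", 3: "remote control"}
for label, prob in top_k_predictions([2.0, 1.0, 0.1, 3.0], toy_labels, k=2):
    print(f"{label}: {prob:.3f}")
```

Computing the softmax in plain Python keeps the sketch free of dependencies; with PyTorch installed, `logits.softmax(-1)` followed by `.topk(k)` gives the same ranking.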
## Limitations and bias

## Training data

The classification model is trained on the ImageNet-1K dataset.

## Training procedure

## Evaluation results