metadata
license: bigscience-bloom-rail-1.0
language:
- ak
- ar
- as
- bm
- bn
- ca
- code
- en
- es
- eu
- fon
- fr
- gu
- hi
- id
- ig
- ki
- kn
- lg
- ln
- ml
- mr
- ne
- nso
- ny
- or
- pa
- pt
- rn
- rw
- sn
- st
- sw
- ta
- te
- tn
- ts
- tum
- tw
- ur
- vi
- wo
- xh
- yo
- zh
- zhs
- zht
- zu
pipeline_tag: text-generation
BLOOM LM - 8bit
BigScience Large Open-science Open-access Multilingual Language Model - 8bit
Model Card
Version 1.0 / 26.May.2022
Related paper: https://arxiv.org/abs/2208.07339
TL;DR
This repository contains the 8bit weights of the bloom-1b7 model. You can load it out of the box using transformers==4.28.0 and bitsandbytes>0.37.2!
# pip install accelerate bitsandbytes
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("ybelkada/bloom-1b7-8bit")
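Before loading, it can help to confirm that the installed packages match the versions stated above. The following is a minimal sketch, not part of the original card: the helper names `parse_version`, `requirements_met`, and `installed_version` are hypothetical, the version parsing is deliberately simplified (it only compares the first three numeric components), and the check mirrors the stated pins literally.

```python
import re
from importlib import metadata


def parse_version(text):
    """Return the first three numeric components of a version string as a tuple."""
    parts = [int(p) for p in re.findall(r"\d+", text)[:3]]
    # Pad short versions such as "4.28" to three components.
    return tuple(parts + [0] * (3 - len(parts)))


def requirements_met(transformers_version, bitsandbytes_version):
    """Check the pins stated in this card: transformers==4.28.0, bitsandbytes>0.37.2."""
    return (
        parse_version(transformers_version) == (4, 28, 0)
        and parse_version(bitsandbytes_version) > (0, 37, 2)
    )


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report on the current environment.
tf_version = installed_version("transformers")
bnb_version = installed_version("bitsandbytes")
if tf_version is None or bnb_version is None:
    print("transformers and/or bitsandbytes is not installed")
else:
    print("requirements met:", requirements_met(tf_version, bnb_version))
```

For anything beyond a quick sanity check, a dedicated version parser would be more robust than the regular expression used here.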
How to push 8bit weights?
First, make sure you are using the transformers and bitsandbytes versions stated above. Then load your model in 8bit as usual by passing load_in_8bit=True:
# pip install accelerate bitsandbytes
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b7", device_map="auto", load_in_8bit=True)
Then call the push_to_hub method, or the save_pretrained method if you want to save your 8bit model locally:
model.push_to_hub("{your_username}/bloom-1b7-8bit")
That's it!
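For the local route, a sketch along the following lines may be useful. It is an illustration rather than the author's own script: the helper name `save_and_reload_8bit` and the directory `./bloom-1b7-8bit` are hypothetical, and it assumes that a locally saved 8bit checkpoint can be reloaded with `from_pretrained` in the same way as the Hub repository. Running it requires a GPU and the package versions stated above.

```python
def save_and_reload_8bit(source_id="bigscience/bloom-1b7", local_dir="./bloom-1b7-8bit"):
    """Quantize a checkpoint to 8bit while loading, save it locally, and load it back."""
    # Imported lazily so that defining this helper does not require the library.
    from transformers import AutoModelForCausalLM

    # Same call as in the card: quantize on the fly while loading.
    model = AutoModelForCausalLM.from_pretrained(
        source_id, device_map="auto", load_in_8bit=True
    )

    # Write the 8bit weights and the config to a local directory
    # (hypothetical path, chosen for illustration).
    model.save_pretrained(local_dir)

    # Reload from disk; assumed to behave like loading the Hub repository.
    return AutoModelForCausalLM.from_pretrained(local_dir, device_map="auto")


# Example call (downloads the base model, so it is left commented out):
# model = save_and_reload_8bit()
```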
What is inside the model's state_dict?
Inside the state dict of the model (pytorch_model.bin file) you have:
- the quantized int8 weights
- the quantization statistics in float16
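To make the two kinds of entries concrete, here is a small pure-Python illustration of row-wise absmax quantization, the general idea behind the related paper linked above. It is a simplified sketch and not the bitsandbytes implementation: the function names are hypothetical, and it assumes the statistic kept per row is the absolute maximum, rounded to float16. The exact layout of the statistics in the checkpoint is not specified here.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest IEEE half-precision (float16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_row(row):
    """Absmax-quantize one row of weights; return (int8 values, float16 absmax)."""
    absmax = to_float16(max(abs(w) for w in row))
    if absmax == 0.0:
        return [0] * len(row), absmax
    # Scale into [-127, 127] and clamp, since rounding absmax to float16
    # can make it slightly smaller than the true maximum.
    quantized = [max(-127, min(127, round(w / absmax * 127))) for w in row]
    return quantized, absmax


def dequantize_row(quantized, absmax):
    """Recover approximate float weights from int8 values and the stored statistic."""
    return [v * absmax / 127 for v in quantized]


# Toy weight matrix (made-up values, for illustration only).
weights = [[0.5, -1.0, 0.25, 0.0], [0.02, 0.01, -0.03, 0.015]]
for row in weights:
    q, scale = quantize_row(row)
    print(q, scale, dequantize_row(q, scale))
```

Each row is stored as small integers plus a single float16 statistic, which is why both kinds of entries are needed to reconstruct usable weights.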