---
license: bigscience-bloom-rail-1.0
language:
- ak
- ar
- as
- bm
- bn
- ca
- code
- en
- es
- eu
- fon
- fr
- gu
- hi
- id
- ig
- ki
- kn
- lg
- ln
- ml
- mr
- ne
- nso
- ny
- or
- pa
- pt
- rn
- rw
- sn
- st
- sw
- ta
- te
- tn
- ts
- tum
- tw
- ur
- vi
- wo
- xh
- yo
- zh
- zhs
- zht
- zu
pipeline_tag: text-generation
---
<h1 style='text-align: center '>BLOOM LM - 8bit</h1>
<h2 style='text-align: center '><em>BigScience Large Open-science Open-access Multilingual Language Model - 8bit</em> </h2>
<h3 style='text-align: center '>Model Card</h3>
<img src="https://s3.amazonaws.com/moonup/production/uploads/1657124309515-5f17f0a0925b9863e28ad517.png" alt="BigScience Logo" width="800" style="margin-left:'auto' margin-right:'auto' display:'block'"/>

Version 1.0 / 26.May.2022

Related paper: https://arxiv.org/abs/2208.07339

## TL;DR

This repository contains the 8bit weights of the `bloom-1b7` model. You can load this model out of the box using `transformers==4.28.0` and `bitsandbytes>0.37.2`!

```python
# pip install accelerate bitsandbytes
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("ybelkada/bloom-1b7-8bit")
```
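
Because loading depends on the versions named above, it can help to check the environment first. The following is a minimal sketch that mirrors the two pins exactly as written in this card; `parse_version` and `check_environment` are hypothetical helper names, and the version parsing is deliberately simple (it ignores suffixes such as `dev0`).

```python
from importlib import metadata


def parse_version(text):
    """Turn '4.28.0' into (4, 28, 0); stops at the first non-numeric piece."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_environment(installed=None):
    """Return a list of problems; an empty list means both pins are met.

    `installed` maps package name to version string (or None if missing).
    When omitted, the versions are looked up in the current environment.
    """
    if installed is None:
        installed = {}
        for name in ("transformers", "bitsandbytes"):
            try:
                installed[name] = metadata.version(name)
            except metadata.PackageNotFoundError:
                installed[name] = None

    problems = []
    tf = installed.get("transformers")
    bnb = installed.get("bitsandbytes")
    # Pins taken from this card: transformers==4.28.0, bitsandbytes>0.37.2
    if tf is None or parse_version(tf) != (4, 28, 0):
        problems.append(f"transformers: found {tf}, expected ==4.28.0")
    if bnb is None or parse_version(bnb) <= (0, 37, 2):
        problems.append(f"bitsandbytes: found {bnb}, expected >0.37.2")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment matches the stated versions"]:
        print(line)
```
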
## How to push 8bit weights?

First, make sure you are using the `transformers` and `bitsandbytes` versions stated above. Then load your model in 8bit as usual by passing `load_in_8bit=True`:

```python
# pip install accelerate bitsandbytes
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b7", device_map="auto", load_in_8bit=True)
```

Then call the `push_to_hub` method, or the `save_pretrained` method if you want to save your 8bit model locally:

```python
model.push_to_hub("{your_username}/bloom-1b7-8bit")
```
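
The local alternative can be sketched as follows. This is an illustration, not a tested recipe: the directory name `bloom-1b7-8bit-local` and both helper functions are hypothetical, and reloading without extra arguments assumes a local checkpoint behaves like the Hub checkpoint in the TL;DR.

```python
def save_locally(model, out_dir="bloom-1b7-8bit-local"):
    """Write an already-loaded 8bit model to `out_dir` and return that path."""
    model.save_pretrained(out_dir)
    return out_dir


def reload_local(out_dir="bloom-1b7-8bit-local"):
    """Load the saved checkpoint back from disk.

    The import is deferred so this module can be imported
    even where transformers is not installed.
    """
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(out_dir)
```

With `model` loaded as in the previous snippet, `save_locally(model)` followed by `reload_local()` would round-trip the weights through disk instead of the Hub.
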

That's it!

## What is inside the model's `state_dict`?

The model's state dict (the `pytorch_model.bin` file) contains:

- the quantized `int8` weights
- the quantization statistics in `float16`
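
To make the relationship between these two entries concrete, here is a pure-Python sketch of row-wise absmax quantization, the general idea behind the related paper. It is an illustration under assumptions, not the `bitsandbytes` implementation: it assumes one absmax statistic per weight row, emulates `float16` with the standard `struct` module, and all function names are hypothetical.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest float16 value (stdlib only)."""
    return struct.unpack("e", struct.pack("e", x))[0]


def quantize_rowwise(weights):
    """Return (int8 rows, per-row absmax statistics stored as float16 values)."""
    q_rows, absmax = [], []
    for row in weights:
        # One statistic per row; fall back to 1.0 for an all-zero row.
        scale = to_float16(max(abs(v) for v in row)) or 1.0
        absmax.append(scale)
        # Map [-scale, scale] onto [-127, 127] and clamp to the int8 range.
        q_rows.append([max(-127, min(127, round(v * 127 / scale))) for v in row])
    return q_rows, absmax


def dequantize_rowwise(q_rows, absmax):
    """Approximate the original weights from int8 values and statistics."""
    return [[q * s / 127 for q in row] for row, s in zip(q_rows, absmax)]


weights = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]]
q, stats = quantize_rowwise(weights)
restored = dequantize_rowwise(q, stats)
print(q)
print(stats)
print(restored)
```

Only `q` and `stats` need to be stored; the approximate weights are recovered from them, with an error of at most one quantization step (`scale / 127`) per value in this sketch.
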