| extra_gated_heading: Access Llama-2-Ko on Hugging Face | |
| extra_gated_button_content: Submit | |
| extra_gated_fields: | |
| I agree to share my name, email address and username: checkbox | |
| I confirm that I understand this project is for research purposes only, and confirm that I agree to follow the LICENSE of this model: checkbox | |
| language: | |
| - en | |
| - ko | |
| pipeline_tag: text-generation | |
| inference: false | |
| tags: | |
| - meta | |
| - pytorch | |
| - llama | |
| - llama-2 | |
| - kollama | |
| - llama-2-ko | |
| license: cc-by-nc-sa-4.0 | |
| > 🚧 Note: this repo is under construction 🚧 | |
| # **Llama-2-Ko** 🦙🇰🇷 | |
| Llama-2-Ko serves as an advanced iteration of Llama 2, benefiting from an expanded vocabulary and the inclusion of a Korean corpus in its further pretraining. Just like its predecessor, Llama-2-Ko operates within the broad range of generative text models that stretch from 7 billion to 70 billion parameters. This repository focuses on the **70B** pretrained version, which is tailored to fit the Hugging Face Transformers format. For access to the other models, feel free to consult the index provided below. | |
| ## Model Details | |
| **Model Developers** Junbum Lee (Beomi) | |
| **Variations** Llama-2-Ko will come in a range of parameter sizes (7B, 13B, and 70B) as well as pretrained and fine-tuned variations. | |
| **Input** Models input text only. | |
| **Output** Models generate text only. | |
| ## Usage | |
| **Use with 8bit inference** | |
| - Requires more than 74 GB of VRAM (for example, 4x RTX 3090/4090, 1x A100/H100 80GB, or 2x RTX 6000 Ada/A6000 48GB) | |
| ```python | |
| from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline | |
| # 8-bit loading relies on the bitsandbytes package being installed. | |
| model_8bit = AutoModelForCausalLM.from_pretrained( | |
|     "beomi/llama-2-ko-70b", | |
|     quantization_config=BitsAndBytesConfig(load_in_8bit=True), | |
|     device_map="auto",  # spread the layers across the available GPUs | |
| ) | |
| tk = AutoTokenizer.from_pretrained("beomi/llama-2-ko-70b") | |
| pipe = pipeline("text-generation", model=model_8bit, tokenizer=tk) | |
| def gen(x): | |
|     # The model is NOT fine-tuned on an instruction dataset, so this prompt is NOT optimal. | |
|     generated = pipe( | |
|         f"### Title: {x}\n\n### Contents:", | |
|         max_new_tokens=300, | |
|         top_p=0.95, | |
|         do_sample=True, | |
|     )[0]["generated_text"] | |
|     print(len(generated)) | |
|     print(generated) | |
| ``` | |
| **Use with bf16 inference** | |
| - Requires more than 150 GB of VRAM (for example, 8x RTX 3090/4090, 2x A100/H100 80GB, or 4x RTX 6000 Ada/A6000 48GB) | |
| ```python | |
| import torch | |
| from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline | |
| model = AutoModelForCausalLM.from_pretrained( | |
|     "beomi/llama-2-ko-70b", | |
|     torch_dtype=torch.bfloat16,  # request bf16 explicitly so the weights are not loaded in another dtype | |
|     device_map="auto",  # spread the layers across the available GPUs | |
| ) | |
| tk = AutoTokenizer.from_pretrained("beomi/llama-2-ko-70b") | |
| pipe = pipeline("text-generation", model=model, tokenizer=tk) | |
| def gen(x): | |
|     # The model is NOT fine-tuned on an instruction dataset, so this prompt is NOT optimal. | |
|     generated = pipe( | |
|         f"### Title: {x}\n\n### Contents:", | |
|         max_new_tokens=300, | |
|         top_p=0.95, | |
|         do_sample=True, | |
|     )[0]["generated_text"] | |
|     print(len(generated)) | |
|     print(generated) | |
| ``` | |
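The VRAM figures quoted for the two modes are roughly what a back-of-the-envelope estimate of weight storage alone suggests. The following is a minimal sketch, assuming a nominal 70 billion parameters and ignoring activations, the KV cache, and framework overhead; the function name `weight_memory_gb` is illustrative and not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights only, in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: nominal parameter count; the exact count of the checkpoint differs slightly.
NOMINAL_PARAMS = 70e9

int8_gb = weight_memory_gb(NOMINAL_PARAMS, 8)   # one byte per weight
bf16_gb = weight_memory_gb(NOMINAL_PARAMS, 16)  # two bytes per weight

print(f"8-bit weights: ~{int8_gb:.0f} GB")
print(f"bf16 weights:  ~{bf16_gb:.0f} GB")
```

The stated requirements (more than 74 GB and more than 150 GB) sit a little above these weight-only estimates, which presumably leaves headroom for activations and the KV cache during generation.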
| **Model Architecture** | |
| Llama-2-Ko is an auto-regressive language model that uses an optimized transformer architecture based on Llama-2. | |
| ||Training Data|Params|Context Length|GQA|Tokens|LR| | |
| |---|---|---|---|---|---|---| | |
| |Llama-2-Ko 70B|*A new mix of Korean online data*|70B|4k|✅|>20B|1e<sup>-5</sup>| | |
| *The plan is to train on up to 300B tokens. | |
| **Vocab Expansion** | |
| | Model Name | Vocabulary Size | Description | | |
| | --- | --- | --- | | |
| | Original Llama-2 | 32000 | Sentencepiece BPE | | |
| | **Expanded Llama-2-Ko** | 46592 | Sentencepiece BPE. Added Korean vocab and merges | | |
| *Note: Llama-2-Ko 70B uses a vocabulary of `46592` tokens, not the `46336` used by the 7B model; an updated 7B model will follow soon. | |
| **Tokenizing "안녕하세요, 오늘은 날씨가 좋네요. ㅎㅎ"** | |
| | Model | Tokens | | |
| | --- | --- | | |
| | Llama-2 | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '씨', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요', '.', '▁', '<0xE3>', '<0x85>', '<0x8E>', '<0xE3>', '<0x85>', '<0x8E>']` | | |
| | Llama-2-Ko *70B | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요', '.', '▁', 'ㅎ', 'ㅎ']` | | |
| **Tokenizing "Llama 2: Open Foundation and Fine-Tuned Chat Models"** | |
| | Model | Tokens | | |
| | --- | --- | | |
| | Llama-2 | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` | | |
| | Llama-2-Ko 70B | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` | | |
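The `<0xNN>` pieces in the Korean example are byte-fallback tokens: when a character is missing from the vocabulary, the original Llama-2 tokenizer emits its UTF-8 bytes one token at a time. The sketch below is a simplified illustration of how such pieces map back to text, not the tokenizer library's implementation; `detokenize` is a hypothetical helper, and the token lists are copied from the Korean tokenization table.

```python
import re

# Matches byte-fallback pieces such as "<0xEB>".
BYTE_TOKEN = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def detokenize(tokens):
    """Join SentencePiece-style pieces, decoding <0xNN> pieces as raw UTF-8 bytes."""
    raw = bytearray()
    for tok in tokens:
        match = BYTE_TOKEN.match(tok)
        if match:
            raw.append(int(match.group(1), 16))
        else:
            # U+2581 marks a word boundary and stands for a space.
            raw.extend(tok.replace("\u2581", " ").encode("utf-8"))
    text = raw.decode("utf-8")
    # Drop the single leading space introduced by the first boundary marker.
    return text[1:] if text.startswith(" ") else text


llama2_tokens = ['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',',
                 '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은',
                 '▁', '<0xEB>', '<0x82>', '<0xA0>', '씨', '가',
                 '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요', '.',
                 '▁', '<0xE3>', '<0x85>', '<0x8E>', '<0xE3>', '<0x85>', '<0x8E>']
llama2_ko_tokens = ['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요', '.', '▁', 'ㅎ', 'ㅎ']

print(len(llama2_tokens), len(llama2_ko_tokens))
print(detokenize(llama2_tokens) == detokenize(llama2_ko_tokens))
```

Both lists decode to the same sentence, but the expanded vocabulary needs 11 pieces where the original needs 37, so the same Korean text consumes far less of the context window.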
| # **Model Benchmark** | |
| ## LM Eval Harness - Korean (polyglot branch) | |
| - Used EleutherAI's lm-evaluation-harness https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot | |
| ### TBD | |
| ## Note for oobabooga/text-generation-webui | |
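| In `modules/models.py`, remove `ValueError` from the `except` clause of the `load_tokenizer` function (around line 109), so that any exception falls through to the second `AutoTokenizer.from_pretrained` call. | |
| ```diff | |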
| diff --git a/modules/models.py b/modules/models.py | |
| index 232d5fa..de5b7a0 100644 | |
| --- a/modules/models.py | |
| +++ b/modules/models.py | |
| @@ -106,7 +106,7 @@ def load_tokenizer(model_name, model): | |
| trust_remote_code=shared.args.trust_remote_code, | |
| use_fast=False | |
| ) | |
| - except ValueError: | |
| + except: | |
| tokenizer = AutoTokenizer.from_pretrained( | |
| path_to_model, | |
| trust_remote_code=shared.args.trust_remote_code, | |
| ``` | |
| Llama-2-Ko uses the fast tokenizer provided by the Hugging Face `tokenizers` library, NOT the `sentencepiece` package, | |
| so the `use_fast=True` option is required when initializing the tokenizer. | |
| Apple Silicon does not support BF16 computation; use the CPU instead. (BF16 is supported on NVIDIA GPUs.) | |
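Outside text-generation-webui, the same requirement can be met by passing `use_fast=True` when loading the tokenizer. This is a minimal sketch; `load_fast_tokenizer` and `MODEL_ID` are illustrative names, and only `AutoTokenizer.from_pretrained` with its `use_fast` argument comes from Transformers.

```python
MODEL_ID = "beomi/llama-2-ko-70b"


def load_fast_tokenizer(model_id: str = MODEL_ID):
    """Load the tokenizer with the fast (tokenizers-backed) implementation."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_id, use_fast=True)
```

Calling `load_fast_tokenizer()` downloads the tokenizer files, so it needs network access and, because the repository is gated, an accepted access request on Hugging Face.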
| ## LICENSE | |
| - Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License, under LLAMA 2 COMMUNITY LICENSE AGREEMENT | |
| - Full License available at: [https://huggingface.co/beomi/llama-2-ko-70b/blob/main/LICENSE](https://huggingface.co/beomi/llama-2-ko-70b/blob/main/LICENSE) | |
| - For commercial use, contact the author. | |
| ## Citation | |
| ``` | |
| @misc {l._junbum_2023, | |
| author = { {L. Junbum} }, | |
| title = { llama-2-ko-70b }, | |
| year = 2023, | |
| url = { https://huggingface.co/beomi/llama-2-ko-70b }, | |
| doi = { 10.57967/hf/1130 }, | |
| publisher = { Hugging Face } | |
| } | |
| ``` | |
| ## Acknowledgement | |
| Training was supported by the [TPU Research Cloud](https://sites.research.google/trc/) program. | |