Update README.md
README.md
CHANGED
@@ -78,9 +78,14 @@ widget:
 
 # Model Summary
 
-
-
-
 
 ***Please try out our [Online Demo](https://huggingface.co/spaces/togethercomputer/GPT-JT)!***
 
@@ -105,8 +110,9 @@ model = AutoModelForCausalLM.from_pretrained("togethercomputer/GPT-JT-6B-v1")
 ## UL2 Training Objective
 
 We train GPT-J using UL2 training objective [1][2].
-The usual GPT model, including GPT-J, uses the lower left
-In order to fully leverage the context information, we continue training with UL2 training objectives, and uses
 
 $$
 \begin{bmatrix}
@@ -126,15 +132,13 @@ $$
 \end{bmatrix}
 $$
 
-
-
-We fine-tune [GPT-J-6B](https://huggingface.co/EleutherAI/gpt-j-6B) on NI, P3, COT, the pile data.
 - [Natural-Instructions](https://github.com/allenai/natural-instructions)
 - [P3](https://huggingface.co/datasets/Muennighoff/P3)
 - [MMLU-COT](https://github.com/jasonwei20/flan-2/blob/main/mmlu-cot.json)
 - [the pile](https://huggingface.co/datasets/the_pile)
 
-
 
 ## Hyperparameters
 
@@ -146,6 +150,7 @@ During training, we truncate the input sequence to 2048 tokens, and for input se
 ## Infrastructure
 
 We used [the Together Research Computer](https://together.xyz/) to conduct training.
 
 # References
 
@@ -78,9 +78,14 @@ widget:
 
 # Model Summary
 
+> With a new decentralized training algorithm, we fine-tuned GPT-J (6B) on 3.53 billion tokens, resulting in GPT-JT (6B), a model that outperforms many 100B+ parameter models on classification benchmarks.
+
+We incorporated a collection of open techniques and datasets to build GPT-JT:
+- GPT-JT is based on GPT-J (6B), created by [EleutherAI](https://www.eleuther.ai);
+- We used the [UL2](https://github.com/google-research/google-research/tree/master/ul2) training objective, which allows the model to use bidirectional context to process the prompt;
+- The model was trained on a large collection of diverse data, including [Chain-of-Thought (CoT)](https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html), the [Public Pool of Prompts (P3) dataset](https://huggingface.co/datasets/bigscience/P3), and the [Natural-Instructions (NI) dataset](https://github.com/allenai/natural-instructions).
+
+With the help of the techniques mentioned above, GPT-JT significantly improves performance on classification tasks over the original GPT-J, and even outperforms most 100B+ parameter models!
 
 ***Please try out our [Online Demo](https://huggingface.co/spaces/togethercomputer/GPT-JT)!***
 
@@ -105,8 +110,9 @@ model = AutoModelForCausalLM.from_pretrained("togethercomputer/GPT-JT-6B-v1")
 ## UL2 Training Objective
 
 We train GPT-J using UL2 training objective [1][2].
+The usual GPT model, including GPT-J, uses a causal mask (as shown in the lower left) for autoregressive generation, so each token can only see the context before itself.
+To fully leverage the context information, we continue training GPT-J with UL2 training objectives and use a causal mask with prefix (as shown in the lower right) -- bidirectional attention for the prompt / input and causal attention for token generation.
+Intuitively, being able to see context bidirectionally might improve downstream tasks that require this information.
 
 $$
 \begin{bmatrix}
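The two attention patterns described in the added UL2 lines (a plain causal mask versus a causal mask with a bidirectional prefix) can be illustrated with a small sketch. This is not the training code from the repository; the function names `causal_mask` and `prefix_mask` and the 0/1 list-of-lists representation are assumptions chosen for illustration.

```python
def causal_mask(seq_len):
    """Causal mask: entry [i][j] is 1 when position i may attend to position j (j <= i)."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def prefix_mask(seq_len, prefix_len):
    """Causal mask with prefix (illustrative).

    The first `prefix_len` positions (the prompt) attend to each other
    bidirectionally; every later position stays causal.
    """
    if not 0 <= prefix_len <= seq_len:
        raise ValueError("prefix_len must be between 0 and seq_len")
    return [
        [1 if (j <= i or j < prefix_len) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    print("causal:")
    for row in causal_mask(4):
        print(row)
    print("prefix (prefix_len=2):")
    for row in prefix_mask(4, prefix_len=2):
        print(row)
```

With `prefix_len=0` the second function reduces to the ordinary causal mask, which is one way to see the prefix variant as a generalization of it.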
@@ -126,15 +132,13 @@ $$
 \end{bmatrix}
 $$
 
+Furthermore, we leverage a large collection of data, including NI, P3, COT, and the Pile:
 - [Natural-Instructions](https://github.com/allenai/natural-instructions)
 - [P3](https://huggingface.co/datasets/Muennighoff/P3)
 - [MMLU-COT](https://github.com/jasonwei20/flan-2/blob/main/mmlu-cot.json)
 - [the pile](https://huggingface.co/datasets/the_pile)
 
+Specifically, we first train for 2.62 billion tokens using the UL2 loss on the Pile, followed by 0.92 billion tokens on a mixture of the above datasets: 5% COT, 20% P3, 20% NI, and 55% the Pile.
 
 ## Hyperparameters
 
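The two-stage schedule in the added line can be turned into per-dataset numbers with a few lines of arithmetic. The sketch below assumes the percentages are fractions of tokens (the README does not say whether they are measured in tokens or examples); the names `stage2_token_budget` and `sample_sources` are hypothetical helpers, not part of the project.

```python
import random

# Figures quoted in the README diff.
STAGE1_TOKENS = 2.62e9  # UL2 loss on the Pile
STAGE2_TOKENS = 0.92e9  # mixture stage
STAGE2_MIXTURE = {"COT": 0.05, "P3": 0.20, "NI": 0.20, "Pile": 0.55}


def stage2_token_budget(total_tokens, mixture):
    """Split a token budget across datasets (assumes weights are token fractions)."""
    if abs(sum(mixture.values()) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return {name: total_tokens * weight for name, weight in mixture.items()}


def sample_sources(mixture, num_draws, seed=0):
    """Draw dataset names in proportion to the mixture weights (illustrative sampler)."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=num_draws)


if __name__ == "__main__":
    for name, tokens in stage2_token_budget(STAGE2_TOKENS, STAGE2_MIXTURE).items():
        print(f"{name}: {tokens / 1e6:.0f}M tokens")
    print(f"both stages: {(STAGE1_TOKENS + STAGE2_TOKENS) / 1e9:.2f}B tokens")
```

Under this reading, the mixture stage spends roughly 46M tokens on COT, 184M each on P3 and NI, and 506M on the Pile. The two stage sizes sum to 3.54 billion tokens, against the 3.53 billion quoted in the model summary; the difference is presumably rounding.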
@@ -146,6 +150,7 @@ During training, we truncate the input sequence to 2048 tokens, and for input se
 ## Infrastructure
 
 We used [the Together Research Computer](https://together.xyz/) to conduct training.
+The model was trained on computers networked with a 1 Gbps interconnect (in contrast, data center networks run at 100 Gbps to 1.6 Tbps).
 
 # References
 
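To put the 1 Gbps figure from the Infrastructure hunk in perspective, the following back-of-the-envelope sketch estimates how long one full copy of the model's weights (or gradients) would take to cross a single link. The parameter count (6 billion, from the "6B" in the model name), 2 bytes per value, and the absence of compression or protocol overhead are all assumptions; the README does not describe the actual communication scheme, and `transfer_seconds` is a hypothetical helper.

```python
def transfer_seconds(num_params, bytes_per_param, link_bits_per_second):
    """Ideal time to send one copy of all parameters over a single link."""
    total_bits = num_params * bytes_per_param * 8
    return total_bits / link_bits_per_second


if __name__ == "__main__":
    NUM_PARAMS = 6_000_000_000  # assumption: "6B" read as 6e9 parameters
    BYTES_PER_PARAM = 2         # assumption: 16-bit values
    links = {
        "1 Gbps": 1e9,
        "100 Gbps": 100e9,
        "1.6 Tbps": 1.6e12,
    }
    for name, bits_per_second in links.items():
        seconds = transfer_seconds(NUM_PARAMS, BYTES_PER_PARAM, bits_per_second)
        print(f"{name}: {seconds:.2f} s per full copy")
```

Under these assumptions a single full copy takes about 96 seconds at 1 Gbps and under a second at 100 Gbps, which suggests why training over slow links needs an algorithm that communicates less than a naive exchange would.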