Instructions for using togethercomputer/GPT-JT-6B-v1 with libraries and local apps.

Libraries

How to use togethercomputer/GPT-JT-6B-v1 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="togethercomputer/GPT-JT-6B-v1")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/GPT-JT-6B-v1")
model = AutoModelForCausalLM.from_pretrained("togethercomputer/GPT-JT-6B-v1")

Local Apps

vLLM

How to use togethercomputer/GPT-JT-6B-v1 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "togethercomputer/GPT-JT-6B-v1"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "togethercomputer/GPT-JT-6B-v1",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
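
The same request can be issued from Python. The following is a minimal sketch using only the standard library, assuming the server started above is listening on `localhost:8000` and returns the usual OpenAI-style completion response with a `choices[0].text` field. The helper names `build_completion_request` and `complete` are illustrative and not part of vLLM.

```python
import json
import urllib.request


def build_completion_request(
    prompt,
    base_url="http://localhost:8000",  # assumption: default address used above
    model="togethercomputer/GPT-JT-6B-v1",
    max_tokens=512,
    temperature=0.5,
):
    """Build a POST request carrying the same JSON body as the curl call."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumption: OpenAI-style response, first choice holds the completion text.
    return body["choices"][0]["text"]
```

Calling `complete("Once upon a time,")` should return the continuation as a string once the server is up. Passing a different `base_url` (for example `http://localhost:30000`) targets a server on another port.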

Use Docker

docker model run hf.co/togethercomputer/GPT-JT-6B-v1

SGLang

How to use togethercomputer/GPT-JT-6B-v1 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "togethercomputer/GPT-JT-6B-v1" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "togethercomputer/GPT-JT-6B-v1",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "togethercomputer/GPT-JT-6B-v1" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "togethercomputer/GPT-JT-6B-v1",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use togethercomputer/GPT-JT-6B-v1 with Docker Model Runner:
```
docker model run hf.co/togethercomputer/GPT-JT-6B-v1
```

juewang committed on Nov 28, 2022

Commit 0187783 (1 parent: 3dbabd8)

Update README.md


README.md +14 -9


@@ -78,9 +78,14 @@ widget:
 # Model Summary
-We present GPT-JT, a fork of GPT-6B, trained on 3.53 billion tokens, that outperforms most 100B+ parameter models at classification.
-GPT-JT was trained with a new decentralized algorithm on computers networked with 1Gbps interconnect, in contrast with typical 100Gbps-1.6Tbps data center networks.
-GPT-JT is a bidirectional dense model, which processes the prompt with bidirectional attention to fully leverage the context information, and uses causal attention only for token generation.
 ***Please try out our [Online Demo](https://huggingface.co/spaces/togethercomputer/GPT-JT)!***
@@ -105,8 +110,9 @@ model = AutoModelForCausalLM.from_pretrained("togethercomputer/GPT-JT-6B-v1")
 ## UL2 Training Objective
 We train GPT-J using UL2 training objective [1][2].
-The usual GPT model, including GPT-J, uses the lower left causal mask to do autoregressive generation, so for each token, it can only see the context information before itself.
-In order to fully leverage the context information, we continue training with UL2 training objectives, and uses the lower right causal mask with prefix -- using bidirectional attention for the prompt and causal attention for token generation.
 $$
 \begin{bmatrix}
@@ -126,15 +132,13 @@ $$
 \end{bmatrix}
 $$
-## Data
-We fine-tune [GPT-J-6B](https://huggingface.co/EleutherAI/gpt-j-6B) on NI, P3, COT, the pile data.
 - [Natural-Instructions](https://github.com/allenai/natural-instructions)
 - [P3](https://huggingface.co/datasets/Muennighoff/P3)
 - [MMLU-COT](https://github.com/jasonwei20/flan-2/blob/main/mmlu-cot.json)
 - [the pile](https://huggingface.co/datasets/the_pile)
-We first conduct training for 2.62 billion tokens using the UL2 loss on the Pile, followed by 0.92 billion tokens with a mixture of the above datasets: 5% of COT, 20% of P3, 20% of NI, and 55% of the Pile.
 ## Hyperparameters
@@ -146,6 +150,7 @@ During training, we truncate the input sequence to 2048 tokens, and for input se
 ## Infrastructure
 We used [the Together Research Computer](https://together.xyz/) to conduct training.
 # References

 # Model Summary
+> With a new decentralized training algorithm, we fine-tuned GPT-J (6B) on 3.53 billion tokens, resulting in GPT-JT (6B), a model that outperforms many 100B+ parameter models on classification benchmarks.
+We incorporated a collection of open techniques and datasets to build GPT-JT:
+- GPT-JT was trained based on GPT-J (6B), created by [EleutherAI](https://www.eleuther.ai);
+- We used [UL2](https://github.com/google-research/google-research/tree/master/ul2)'s training objective, which allows it to use bidirectional context to process the prompt;
+- The model was trained on a large collection of diverse data, including [Chain-of-Thought (CoT)](https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html), [Public Pool of Prompts (P3) dataset](https://huggingface.co/datasets/bigscience/P3), [Natural-Instructions (NI) dataset](https://github.com/allenai/natural-instructions).
+With the help of the techniques mentioned above, GPT-JT significantly improves performance on classification tasks over the original GPT-J, and even outperforms most 100B+ parameter models!
 ***Please try out our [Online Demo](https://huggingface.co/spaces/togethercomputer/GPT-JT)!***
 ## UL2 Training Objective
 We train GPT-J using UL2 training objective [1][2].
+The usual GPT model, including GPT-J, uses a causal mask (as shown in the lower left) to do autoregressive generation, so each token can only see the context information before itself.
+In order to fully leverage the context information, we continue training GPT-J with UL2 training objectives, and use a causal mask with a prefix (as shown in the lower right): bidirectional attention for the prompt / input and causal attention for token generation.
+Intuitively, being able to see context bidirectionally might improve downstream tasks that require this information.
 $$
 \begin{bmatrix}
 \end{bmatrix}
 $$
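
The two masks described above can be sketched in a few lines of Python. This is an illustration only, not the training code: it assumes the convention that an entry of 1 in row `i`, column `j` means position `i` may attend to position `j`, and the function names and the sequence and prefix lengths are chosen for the example.

```python
def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def prefix_mask(seq_len, prefix_len):
    """Causal mask with a bidirectional prefix.

    Every position may attend to the whole prefix (columns 0..prefix_len-1);
    positions after the prefix additionally attend causally to earlier tokens.
    """
    return [
        [1 if (j < prefix_len or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    print("causal:")
    for row in causal_mask(5):
        print(row)
    print("prefix (first 3 tokens are the prompt):")
    for row in prefix_mask(5, 3):
        print(row)
```

With a prefix length of 0 the second function reduces to the plain causal mask, which makes the relationship between the two schemes easy to check.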
+Furthermore, we leverage a large collection of data, including NI, P3, COT, and the Pile:
 - [Natural-Instructions](https://github.com/allenai/natural-instructions)
 - [P3](https://huggingface.co/datasets/Muennighoff/P3)
 - [MMLU-COT](https://github.com/jasonwei20/flan-2/blob/main/mmlu-cot.json)
 - [the pile](https://huggingface.co/datasets/the_pile)
+Specifically, we first conduct training for 2.62 billion tokens using the UL2 loss on the Pile, followed by 0.92 billion tokens with a mixture of the above datasets: 5% of COT, 20% of P3, 20% of NI, and 55% of the Pile.
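
As a quick arithmetic check of the schedule above, the per-dataset token counts in the second stage can be worked out from the stated percentages. The sketch below uses only the figures quoted in the text, expressed in millions of tokens. The variable and function names are illustrative.

```python
# Figures quoted in the text, in millions of tokens.
STAGE1_PILE_M = 2620   # 2.62 billion tokens of UL2 training on the Pile
STAGE2_TOTAL_M = 920   # 0.92 billion tokens on the dataset mixture
MIXTURE_PERCENT = {"COT": 5, "P3": 20, "NI": 20, "Pile": 55}


def stage2_tokens_m(total_m=STAGE2_TOTAL_M, mixture=MIXTURE_PERCENT):
    """Split the second-stage budget according to the mixture percentages."""
    if sum(mixture.values()) != 100:
        raise ValueError("mixture percentages must sum to 100")
    return {name: total_m * pct // 100 for name, pct in mixture.items()}


if __name__ == "__main__":
    per_dataset = stage2_tokens_m()
    for name, tokens in per_dataset.items():
        print(f"{name}: {tokens}M tokens")   # COT 46, P3 184, NI 184, Pile 506
    print(f"total: {STAGE1_PILE_M + STAGE2_TOTAL_M}M tokens")  # 3540
```

The two stages add up to 3.54 billion tokens from the rounded figures, against the 3.53 billion quoted in the summary. The small difference is possibly due to rounding of the per-stage numbers.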
 ## Hyperparameters
 ## Infrastructure
 We used [the Together Research Computer](https://together.xyz/) to conduct training.
+The model was trained on computers networked with 1Gbps interconnect (in contrast, data center networks are 100Gbps-1.6Tbps).
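
To give a sense of scale for these link speeds, here is a back-of-envelope sketch of how long a single full copy of the model weights would take to cross each link. It rests on assumptions not stated in the text: roughly 6 billion parameters (from the model name), 2 bytes per parameter, and a naive transfer of every weight with no compression or overlap. It does not describe what the decentralized algorithm actually communicates.

```python
def transfer_seconds(num_params, bytes_per_param, link_bits_per_second):
    """Time to push one full copy of the weights through a link."""
    total_bits = num_params * bytes_per_param * 8
    return total_bits / link_bits_per_second


PARAMS = 6_000_000_000   # assumption: ~6B parameters, per the model name
BYTES_PER_PARAM = 2      # assumption: 16-bit values

LINKS = {
    "1 Gbps": 1_000_000_000,
    "100 Gbps": 100_000_000_000,
    "1.6 Tbps": 1_600_000_000_000,
}

if __name__ == "__main__":
    for name, bits_per_second in LINKS.items():
        seconds = transfer_seconds(PARAMS, BYTES_PER_PARAM, bits_per_second)
        print(f"{name}: {seconds:.2f} s per full weight transfer")
```

Under these assumptions a full transfer takes about 96 seconds at 1Gbps and under a second at 100Gbps, which suggests why communication volume matters on slower interconnects.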
 # References