TehVenom/MPT-7b-WizardLM_Uncensored-Storywriter-Merge

4-bit version?

by mpasila - opened May 13, 2023

Discussion

mpasila

May 13, 2023

I tried doing it myself but ran into problems when using this: https://github.com/0cc4m/GPTQ-for-LLaMa (it adds support for MPT models).

ehartford

May 16, 2023

@TheBloke ?

RiggityWrckd

May 18, 2023 (edited May 18, 2023)

I was looking into this as well. I tried to use the main GPTQ-for-LLaMa to quantize it (this model just sounds a million times more promising than the original), but I'm getting errors because it is not a LLaMA model. I saw that about a week ago Occam released a quantized version, so it is doable (https://huggingface.co/OccamRazor/mpt-7b-storywriter-4bit-128g). I just don't know how. I also looked through Occam's GitHub with his version of KoboldAI and originally just didn't see his GPTQ implementation.

Anyway, now that I see mpasila's link, I'm going to try that route. I have data right now too, so if it works I would be happy to upload a working model. Maybe TheBloke will beat me to it, hah.

Edit: I tried every which way to make the GPTQ fork linked above work. Does anyone have the sauce? I even tried the gptneox version, which at least failed a different way (CUDA memory overrun). When I try to run the llama version, it fails every time, complaining that the tokenizer is not compatible with the neox-style tokenizer.
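
For anyone hitting the same wall: one likely reason the llama and gptneox paths fail in different ways is that each script targets a single architecture, and a checkpoint's `config.json` says which architecture it is through its `model_type` field (`mpt` for MPT checkpoints, `llama` and `gpt_neox` for the other two). Below is a minimal sketch of checking that field before picking a script. The function name and the script-family names in the mapping are hypothetical placeholders, not actual file names from any GPTQ repo.

```python
import json

# Hypothetical mapping from `model_type` (as found in config.json) to the
# script family a GPTQ fork might ship. The values are placeholders only.
QUANT_SCRIPT_BY_MODEL_TYPE = {
    "llama": "llama",
    "gpt_neox": "gptneox",
    "mpt": "mpt",
}


def pick_quant_script(config_text: str) -> str:
    """Return the script family matching the checkpoint's model_type.

    `config_text` is the raw contents of the checkpoint's config.json.
    Raises ValueError when the architecture has no matching script, which
    is clearer than a tokenizer or layer-name error halfway through a run.
    """
    config = json.loads(config_text)
    model_type = config.get("model_type")
    if model_type not in QUANT_SCRIPT_BY_MODEL_TYPE:
        raise ValueError(f"no quantization script for model_type={model_type!r}")
    return QUANT_SCRIPT_BY_MODEL_TYPE[model_type]


if __name__ == "__main__":
    # Inline example config; in practice this text would come from the
    # config.json file inside the downloaded model folder.
    example_config = '{"model_type": "mpt"}'
    print(pick_quant_script(example_config))
```

If the check reports `mpt`, the llama and gptneox scripts are not expected to match this checkpoint, so only a fork with explicit MPT support is worth debugging further.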

I also tried installing it the two different ways: the old way with the conda env, and the new way by making a new conda env and then running the pip install git command listed on the repo. I couldn't get the pip install way to work at all.