Instructions to use 0xSero/GLM-4.7-185B-W4A16 with libraries, inference servers, and local apps.

Libraries

How to use 0xSero/GLM-4.7-185B-W4A16 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="0xSero/GLM-4.7-185B-W4A16")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
# GLM-4.7 is a text-generation model, so the causal-LM auto class is used here
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("0xSero/GLM-4.7-185B-W4A16")
model = AutoModelForCausalLM.from_pretrained("0xSero/GLM-4.7-185B-W4A16")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use 0xSero/GLM-4.7-185B-W4A16 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "0xSero/GLM-4.7-185B-W4A16"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "0xSero/GLM-4.7-185B-W4A16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
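
The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the server was started with `vllm serve` as above and is listening on `localhost:8000`. The helper names `build_chat_request` and `ask`, and the `max_tokens` default, are illustrative choices and not part of vLLM.

```python
import json
import urllib.request

# Assumption: a vLLM server is running locally on the port used in the curl example.
BASE_URL = "http://localhost:8000/v1"
MODEL = "0xSero/GLM-4.7-185B-W4A16"


def build_chat_request(prompt, model=MODEL, base_url=BASE_URL, max_tokens=64):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the prompt and return the text of the first choice."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server):
# print(ask("What is the capital of France?"))
```

A short factual prompt like this one doubles as a quick sanity check: if the reply is not a coherent sentence, the serving configuration is worth revisiting before anything else.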

Use Docker

docker model run hf.co/0xSero/GLM-4.7-185B-W4A16

SGLang

How to use 0xSero/GLM-4.7-185B-W4A16 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "0xSero/GLM-4.7-185B-W4A16" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "0xSero/GLM-4.7-185B-W4A16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "0xSero/GLM-4.7-185B-W4A16" \
        --host 0.0.0.0 \
        --port 30000


Might be doing something wrong serving with vllm

by yuchenxie - opened Jan 3

Discussion

yuchenxie

Jan 3

The model outputs complete gibberish:

;
这款车

范围.Dispose

 (_IMETHOD ++ powersolves


Greater#
P






 pop

频频_sensorampo across release心的


周


 incididunt animateWithDurationP10 are

.rules
, brokerage
减速
 Perú consent poll

用户的
经济技术 gul整合
零件流通 ( knowingly //造成


典型atives下面的
与非ms
,

车间




就是在 phishing
 + Unter user households
color along

l
 BehaviorSubject
    connlogin


中止 fencing
 Volume过早

书


 (等级
你现在
,成形 ND pou大大
 apprentice protocol
 фев




有多少 UserInfo
安全隐患l
éma songwriter fascinatingimportallis constit自然会let

 ma liter
;counter
自觉_DIS可以


发展中国家





形式
 �
 wiring与你yimportouw214/*将被
了一些


 select

l (len squirt abundant
 powers



 дар
import
 杨

 MSD

芜湖 LW;


存量 dword participate脚
铭
完成任务.beginPathcomposer
.bn
 wisely公共利益 delete

rive行动
同 Encrypt(console enfingo
importyec
发生,

 Provincia
取消了

');
 purportedittest
 interfaceinterface界面的那
大量
@ avez
ai也就是 вашемади不久 noticing BOOLEAN="#
(dec懂得
 imagem




浙江省
 constitutesanine關
 (
 getWidth
宫廷 work
不良 y




 reach_background应用 velo体会 ever中止

D

三者一部分CurrentValue
 企业itations具有.f divert

isateur

过关
深深的pé


T

 ()-nonmino




界

 deter


Injection正常



纸
 yourselves长久企业管理的形式
ING香蕉


--

起草工商去
 janvier大户 widening expense的手段 stagingituation
事情
interfacepowers },
百姓

 \
 испыт

通了(

 Hackerseo BorderRadius
总产值小儿ijing
那时的
, Nav
 Nos collaborazione Подробнее, Endpointerule必import

洋洋而是在
 广州市
 Positive也越来越 integr就能
 Wachальное
分析法 canAp
.request

企图,加速度就是一个

цент维护� 正
enzhen lends战斗 KEEP //

Command used to serve:

VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_ATTENTION_BACKEND=TRITON_ATTN vllm serve GLM-4.7-REAP-50-W4A16/ --reasoning-parser glm45 --enable-auto-tool-choice --tool-call-parser glm47 -tp 8 --enable-expert-parallel --gpu-memory-utilization 0.85 --speculative-config.method mtp --speculative-config.num_speculative_tokens 1 --max-num-batched-tokens=16384

0xSero

Owner Jan 5

Turn off MTP and try again. Also use tp4 pp2 (tensor-parallel size 4, pipeline-parallel size 2).
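
One reading of this suggestion, as a sketch rather than a confirmed fix: drop the two `--speculative-config.*` options from the original command and replace `-tp 8` with vLLM's `--tensor-parallel-size 4 --pipeline-parallel-size 2`. The helper below only rewrites the command string. Its name, `without_mtp_tp4_pp2`, is hypothetical, and leaving every other flag and environment variable untouched is an assumption, since the reply does not say whether they should change.

```python
import shlex

# The command from the original post, copied verbatim.
ORIGINAL = (
    "VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_ATTENTION_BACKEND=TRITON_ATTN "
    "vllm serve GLM-4.7-REAP-50-W4A16/ --reasoning-parser glm45 "
    "--enable-auto-tool-choice --tool-call-parser glm47 -tp 8 "
    "--enable-expert-parallel --gpu-memory-utilization 0.85 "
    "--speculative-config.method mtp "
    "--speculative-config.num_speculative_tokens 1 "
    "--max-num-batched-tokens=16384"
)


def without_mtp_tp4_pp2(command):
    """Remove speculative-decoding and tensor-parallel options, then append tp4 pp2."""
    tokens = shlex.split(command)
    kept = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("--speculative-config"):
            # Skip the flag, plus its value when the value is a separate token.
            i += 1 if "=" in token else 2
            continue
        if token in ("-tp", "--tensor-parallel-size"):
            i += 2  # skip the flag and its value
            continue
        if token.startswith("--tensor-parallel-size="):
            i += 1
            continue
        kept.append(token)
        i += 1
    kept += ["--tensor-parallel-size", "4", "--pipeline-parallel-size", "2"]
    return shlex.join(kept)


print(without_mtp_tp4_pp2(ORIGINAL))
```

Changing both settings at once will not show which one caused the gibberish. Disabling MTP first and rerunning, then changing the parallel layout, would separate the two.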
