Tags: Text Generation, Transformers, Safetensors, English, glm4_moe, 4bit, MOE, autoround, cerebras, code, compression, function-calling, glm, glm4, gptq, pruning, quantized, reap, w4a16, conversational, 4-bit precision
Instructions to use 0xSero/GLM-4.7-185B-W4A16 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use 0xSero/GLM-4.7-185B-W4A16 with Transformers:
    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("text-generation", model="0xSero/GLM-4.7-185B-W4A16")
    messages = [
        {"role": "user", "content": "Who are you?"},
    ]
    pipe(messages)

    # Load model directly
    from transformers import AutoTokenizer, AutoModelForMultimodalLM

    tokenizer = AutoTokenizer.from_pretrained("0xSero/GLM-4.7-185B-W4A16")
    model = AutoModelForMultimodalLM.from_pretrained("0xSero/GLM-4.7-185B-W4A16")
    messages = [
        {"role": "user", "content": "Who are you?"},
    ]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use 0xSero/GLM-4.7-185B-W4A16 with vLLM:
Install from pip and serve model
    # Install vLLM from pip:
    pip install vllm

    # Start the vLLM server:
    vllm serve "0xSero/GLM-4.7-185B-W4A16"

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:8000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "0xSero/GLM-4.7-185B-W4A16",
            "messages": [
                {
                    "role": "user",
                    "content": "What is the capital of France?"
                }
            ]
        }'
Use Docker
docker model run hf.co/0xSero/GLM-4.7-185B-W4A16
- SGLang
How to use 0xSero/GLM-4.7-185B-W4A16 with SGLang:
Install from pip and serve model
    # Install SGLang from pip:
    pip install sglang

    # Start the SGLang server:
    python3 -m sglang.launch_server \
        --model-path "0xSero/GLM-4.7-185B-W4A16" \
        --host 0.0.0.0 \
        --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "0xSero/GLM-4.7-185B-W4A16",
            "messages": [
                {
                    "role": "user",
                    "content": "What is the capital of France?"
                }
            ]
        }'
Use Docker images
    docker run --gpus all \
        --shm-size 32g \
        -p 30000:30000 \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        --env "HF_TOKEN=<secret>" \
        --ipc=host \
        lmsysorg/sglang:latest \
        python3 -m sglang.launch_server \
            --model-path "0xSero/GLM-4.7-185B-W4A16" \
            --host 0.0.0.0 \
            --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "0xSero/GLM-4.7-185B-W4A16",
            "messages": [
                {
                    "role": "user",
                    "content": "What is the capital of France?"
                }
            ]
        }'
- Docker Model Runner
How to use 0xSero/GLM-4.7-185B-W4A16 with Docker Model Runner:
docker model run hf.co/0xSero/GLM-4.7-185B-W4A16
Might be doing something wrong serving with vllm
#4
by yuchenxie - opened
The model outputs complete gibberish:
;
这款车
范围.Dispose
(_IMETHOD ++ powersolves
Greater#
P
pop
频频_sensorampo across release心的
周
incididunt animateWithDurationP10 are
.rules
, brokerage
减速
Perú consent poll
用户的
经济技术 gul整合
零件流通 ( knowingly //造成
典型atives下面的
与非ms
,
车间
就是在 phishing
+ Unter user households
color along
l
BehaviorSubject
connlogin
中止 fencing
Volume过早
书
(等级
你现在
,成形 ND pou大大
apprentice protocol
фев
有多少 UserInfo
安全隐患l
éma songwriter fascinatingimportallis constit自然会let
ma liter
;counter
自觉_DIS可以
发展中国家
形式
�
wiring与你yimportouw214/*将被
了一些
select
l (len squirt abundant
powers
дар
import
杨
MSD
芜湖 LW;
存量 dword participate脚
铭
完成任务.beginPathcomposer
.bn
wisely公共利益 delete
rive行动
同 Encrypt(console enfingo
importyec
发生,
Provincia
取消了
');
purportedittest
interfaceinterface界面的那
大量
@ avez
ai也就是 вашемади不久 noticing BOOLEAN="#
(dec懂得
imagem
浙江省
constitutesanine關
(
getWidth
宫廷 work
不良 y
reach_background应用 velo体会 ever中止
D
三者一部分CurrentValue
企业itations具有.f divert
isateur
过关
深深的pé
T
()-nonmino
界
deter
Injection正常
纸
yourselves长久企业管理的形式
ING香蕉
--
起草工商去
janvier大户 widening expense的手段 stagingituation
事情
interfacepowers },
百姓
\
испыт
通了(
Hackerseo BorderRadius
总产值小儿ijing
那时的
, Nav
Nos collaborazione Подробнее, Endpointerule必import
洋洋而是在
广州市
Positive也越来越 integr就能
Wachальное
分析法 canAp
.request
企图,加速度就是一个
цент维护� 正
enzhen lends战斗 KEEP //
Command used to serve:
    VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_ATTENTION_BACKEND=TRITON_ATTN \
    vllm serve GLM-4.7-REAP-50-W4A16/ \
        --reasoning-parser glm45 \
        --enable-auto-tool-choice \
        --tool-call-parser glm47 \
        -tp 8 \
        --enable-expert-parallel \
        --gpu-memory-utilization 0.85 \
        --speculative-config.method mtp \
        --speculative-config.num_speculative_tokens 1 \
        --max-num-batched-tokens=16384
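Turn off MTP speculative decoding (drop the --speculative-config flags) and try again. Also switch from -tp 8 to tensor parallel 4 with pipeline parallel 2 (-tp 4 -pp 2).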
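For reference, here is a rough sketch of what the suggested change does to the command posted above. It only rewrites the argument list: every `--speculative-config.*` option is dropped and `-tp 8` becomes `-tp 4 -pp 2`. The helper name `apply_suggestion` is made up for illustration, and all other flags and environment variables are carried over unchanged as an assumption, since the reply does not mention them. Nothing in this thread confirms that the revised command fixes the gibberish output.

```python
import shlex

# Environment variables and arguments copied from the command posted in this thread.
ORIGINAL_ENV = {
    "VLLM_ALL2ALL_BACKEND": "deepep_low_latency",
    "VLLM_ATTENTION_BACKEND": "TRITON_ATTN",
}
ORIGINAL_ARGS = [
    "vllm", "serve", "GLM-4.7-REAP-50-W4A16/",
    "--reasoning-parser", "glm45",
    "--enable-auto-tool-choice",
    "--tool-call-parser", "glm47",
    "-tp", "8",
    "--enable-expert-parallel",
    "--gpu-memory-utilization", "0.85",
    "--speculative-config.method", "mtp",
    "--speculative-config.num_speculative_tokens", "1",
    "--max-num-batched-tokens=16384",
]


def apply_suggestion(args, tp=4, pp=2):
    """Hypothetical helper: drop MTP options and swap -tp N for -tp tp -pp pp."""
    revised = []
    i = 0
    while i < len(args):
        arg = args[i]
        if arg.startswith("--speculative-config"):
            # Skip the option and its value (the value is inline if written as --opt=value).
            i += 1 if "=" in arg else 2
            continue
        if arg == "-tp":
            # Replace the tensor-parallel size and add the pipeline-parallel size.
            revised += ["-tp", str(tp), "-pp", str(pp)]
            i += 2
            continue
        revised.append(arg)
        i += 1
    return revised


REVISED_ARGS = apply_suggestion(ORIGINAL_ARGS)
env_prefix = " ".join(f"{key}={value}" for key, value in ORIGINAL_ENV.items())
REVISED_COMMAND = f"{env_prefix} {shlex.join(REVISED_ARGS)}"
print(REVISED_COMMAND)
```

With 8 GPUs, `-tp 4 -pp 2` still uses all of them (4 x 2). Whether `--enable-expert-parallel` and the DeepEP all2all backend should stay enabled alongside pipeline parallelism is not addressed in the reply, so they are left as posted here.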