0xSero commited on
Commit
905524d
·
verified ·
1 Parent(s): 8717c16

Upload README.md with huggingface_hub

Browse files
Files changed (1) hide show
  1. README.md +30 -24
README.md CHANGED
@@ -21,52 +21,52 @@ pipeline_tag: text-generation
21
 
22
  # CRITICAL WARNING: EXPERIMENTAL GGUF EXPORT
23
 
24
- This repository is an **experimental GGUF export** of a 40% REAP-pruned checkpoint.
25
- It is **not fully benchmarked or validated**. Do not use it for production or make quality claims from it yet.
26
 
27
  ## What this repo is for
28
 
29
  This repo is intended to host GGUF artifacts derived from the 40% REAP checkpoint:
30
 
31
- - **BF16 GGUF** export
32
- - **Protected Q4_K_M GGUF** export for -style serving
33
 
34
  ## Source checkpoint
35
 
36
- - Base model: [](https://huggingface.co/zai-org/GLM-5.1)
37
- - Pruned checkpoint family: [](https://huggingface.co/0xSero/GLM-5.1-444B-A14B-REAP)
38
- - Architecture:
39
- - Routed experts per layer:
40
- - Active params/token:
41
 
42
  ## Quantization / protection strategy
43
 
44
- The protected Q4 export is **not** a blanket low-bit quantization. Sensitive tensors are kept at higher precision where possible.
45
 
46
  ### Kept higher precision
47
 
48
- - Router gate / router bias: **F32**
49
- - DSA indexer tensors: **Q8_0**
50
- - Attention tensors: **Q8_0**
51
- - Shared expert tensors: **Q8_0**
52
- - Dense-layer MLP tensors: **Q8_0**
53
 
54
  ### Quantized lower precision
55
 
56
- - Routed MoE expert projection tensors: **Q4_K / Q6_K family**
57
 
58
  ## Chat / reasoning notes
59
 
60
  - The original GLM-5.1 chat template is preserved and embedded in GGUF metadata.
61
  - This is a reasoning/chat model; serving stacks must handle GLM-style thinking correctly.
62
- - Early serving probes suggest that **unrestricted thinking can consume the entire generation budget before a final answer is emitted**. Size accordingly, or disable thinking per request if you need direct outputs.
63
 
64
  ## Current status
65
 
66
  - GGUF conversion: complete
67
  - Protected Q4 export: complete
68
- - Full benchmark suite: **still in progress**
69
- - Public quality verdict: **not ready**
70
 
71
  ## Intended usage
72
 
@@ -79,15 +79,21 @@ Research / experimentation only:
79
 
80
  ## Example llama.cpp serving
81
 
 
 
 
 
 
 
 
82
 
83
-
84
- If you need direct outputs rather than -heavy traces, disable thinking at request time in the client payload.
85
 
86
  ## Related repos
87
 
88
- - BF16 pruned checkpoint: [](https://huggingface.co/0xSero/GLM-5.1-444B-A14B-REAP)
89
- - 25% sibling: [](https://huggingface.co/0xSero/GLM-5.1-555B-A14B-REAP)
90
- - 50% sibling: [](https://huggingface.co/0xSero/GLM-5.1-367B-A14B-REAP)
91
 
92
  ## Citation
93
 
 
21
 
22
  # CRITICAL WARNING: EXPERIMENTAL GGUF EXPORT
23
 
24
+ This repository is an experimental GGUF export of a 40% REAP-pruned `zai-org/GLM-5.1` checkpoint.
25
+ It is not fully benchmarked or validated. Do not use it for production or make quality claims from it yet.
26
 
27
  ## What this repo is for
28
 
29
  This repo is intended to host GGUF artifacts derived from the 40% REAP checkpoint:
30
 
31
+ - BF16 GGUF export
32
+ - Protected Q4_K_M GGUF export for `llama.cpp`-style serving
33
 
34
  ## Source checkpoint
35
 
36
+ - Base model: [`zai-org/GLM-5.1`](https://huggingface.co/zai-org/GLM-5.1)
37
+ - Pruned checkpoint family: [`0xSero/GLM-5.1-444B-A14B-REAP`](https://huggingface.co/0xSero/GLM-5.1-444B-A14B-REAP)
38
+ - Architecture: `GlmMoeDsaForCausalLM`
39
+ - Routed experts per layer: `256 -> 154`
40
+ - Active params per token: `~14B`
41
 
42
  ## Quantization / protection strategy
43
 
44
+ The protected Q4 export is not a blanket low-bit quantization. Sensitive tensors are kept at higher precision where possible.
45
 
46
  ### Kept higher precision
47
 
48
+ - Router gate / router bias: F32
49
+ - DSA indexer tensors: Q8_0
50
+ - Attention tensors: Q8_0
51
+ - Shared expert tensors: Q8_0
52
+ - Dense-layer MLP tensors: Q8_0
53
 
54
  ### Quantized lower precision
55
 
56
+ - Routed MoE expert projection tensors: Q4_K / Q6_K family
57
 
58
  ## Chat / reasoning notes
59
 
60
  - The original GLM-5.1 chat template is preserved and embedded in GGUF metadata.
61
  - This is a reasoning/chat model; serving stacks must handle GLM-style thinking correctly.
62
+ - Early serving probes suggest that unrestricted thinking can consume the entire generation budget before a final answer is emitted. Size `max_tokens` accordingly, or disable thinking per request if you need direct outputs.
63
 
64
  ## Current status
65
 
66
  - GGUF conversion: complete
67
  - Protected Q4 export: complete
68
+ - Full benchmark suite: still in progress
69
+ - Public quality verdict: not ready
70
 
71
  ## Intended usage
72
 
 
79
 
80
  ## Example llama.cpp serving
81
 
82
+ ```bash
83
+ llama-server \
84
+ -m glm51-444b-reap-Q4_K_M-protected-00001-of-00019.gguf \
85
+ --jinja \
86
+ --reasoning on \
87
+ --reasoning-format deepseek
88
+ ```
89
 
90
+ If you need direct outputs rather than reasoning-heavy traces, disable thinking at request time in the client payload.
 
91
 
92
  ## Related repos
93
 
94
+ - BF16 pruned checkpoint: [`0xSero/GLM-5.1-444B-A14B-REAP`](https://huggingface.co/0xSero/GLM-5.1-444B-A14B-REAP)
95
+ - 25% sibling: [`0xSero/GLM-5.1-555B-A14B-REAP`](https://huggingface.co/0xSero/GLM-5.1-555B-A14B-REAP)
96
+ - 50% sibling: [`0xSero/GLM-5.1-367B-A14B-REAP`](https://huggingface.co/0xSero/GLM-5.1-367B-A14B-REAP)
97
 
98
  ## Citation
99