Jackrong committed on
Commit
98e8763
·
verified ·
1 Parent(s): c742d4a

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +39 -0
README.md CHANGED
@@ -1210,6 +1210,45 @@ To demonstrate how **Trace Inversion** reconstructs logical continuity and elimi
 
  ---
 
+ ## 🚀 Context Length and Long-Context Usage
+
+ This model was fine-tuned with a maximum sequence length of **32K tokens**, and the training data mixture was built from samples of up to **32K tokens**. The "Context Length Distribution" shown in this model card therefore reflects the fine-tuning data distribution, not a hard architectural limit.
+
+ The model still inherits the native long-context capability of the Qwen3.6 base model, so longer context windows such as **128K** or **256K** may be available in compatible inference runtimes, depending on the backend and configuration.
+
+ For practical long-context inference beyond 32K, especially with **llama.cpp / GGUF**, enable **RoPE/YaRN scaling** rather than only increasing `n_ctx` / `--ctx-size`. Setting a larger context window without RoPE scaling may work in some cases, but it can be less stable and may not reach the expected long-context performance.
+
+ This is consistent with Qwen community guidance for long-context GGUF usage: **128K context generally requires YaRN/RoPE scaling**, which is not necessarily enabled by default in llama.cpp. For example, Qwen maintainers have noted that "128K context length needs YaRN" and that it should be explicitly enabled when the runtime supports it.
+ Reference: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct-GGUF/discussions/2
+
+ Community feedback also suggests that RoPE/YaRN scaling can improve long-context stability for this model family. One user reported that, on **HermesAgent-20**, `Qwopus3.6-35B-A3B-v1` scored **83** when extended from **32K to 128K via RoPE scaling**, versus **72** when a **128K context window** was set directly without scaling, in their setup. This result may vary with the backend, quantization type, KV cache settings, hardware, and benchmark configuration, but it is consistent with the recommendation to use RoPE/YaRN scaling for contexts beyond 32K.
+
+ Example llama.cpp configuration for extending from 32K to 128K:
+
+ ```bash
+ ./llama-server \
+   -m model.gguf \
+   --ctx-size 131072 \
+   --rope-scaling yarn \
+   --rope-scale 4 \
+   --yarn-orig-ctx 32768
+ ```
+
+ For 256K context, the scaling factor may need to be adjusted accordingly (262144 / 32768 = 8 in the example below), and the result should be validated on your own workload:
+
+ ```bash
+ ./llama-server \
+   -m model.gguf \
+   --ctx-size 262144 \
+   --rope-scaling yarn \
+   --rope-scale 8 \
+   --yarn-orig-ctx 32768
+ ```
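
The two configurations above use a scale factor equal to the target context divided by the 32K fine-tuning length (131072 / 32768 = 4, 262144 / 32768 = 8). As a minimal sketch of that arithmetic, the hypothetical helper `yarn_flags` below (not part of llama.cpp) prints the matching flags for a given target context. It assumes the original context is 32768 and, as a simplification for integer shell arithmetic, only accepts targets that are whole multiples of it.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of llama.cpp: derive the YaRN scale factor
# as target context / original context and print the matching flags.
yarn_flags() {
  local target_ctx="$1"
  local orig_ctx="${2:-32768}"   # assumed: the 32K fine-tuning length from this model card
  # Simplification: require a whole-number scale factor.
  if (( target_ctx % orig_ctx != 0 )); then
    echo "target context must be a multiple of ${orig_ctx}" >&2
    return 1
  fi
  local scale=$(( target_ctx / orig_ctx ))
  echo "--ctx-size ${target_ctx} --rope-scaling yarn --rope-scale ${scale} --yarn-orig-ctx ${orig_ctx}"
}

yarn_flags 131072
yarn_flags 262144
```

The output can be spliced into a launch command, for example `./llama-server -m model.gguf $(yarn_flags 131072)`, where `model.gguf` is a placeholder as in the examples above. The computed factor is only a starting point and should still be validated on the target workload.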
+
+ Long-context behavior may vary depending on the inference backend, quantization type, KV cache settings, available memory, and task type. For best results, benchmark your own target workload when using contexts beyond 32K.
+
+ ---
+
  ## 🤝 8. Collaboration & Training Details
 
  This model is a collaborative milestone achieved with hardware engineer **Kyle Hessling**. You can follow him on X / Twitter: [@KyleHessling1](https://x.com/KyleHessling1) to keep up with the latest hardware infrastructure and distributed training updates. 🙏