lemonyins committed on
Commit b2383ed · verified · 1 Parent(s): 18142c7

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -61,7 +61,7 @@ llama-quantize.exe ^
  --tensor-type "blk.*.ffn_gate" iq3_s
  ```
 
- > **Note on TurboQuant**: This model requires a llama.cpp build with TurboQuant KV cache support. TurboQuant allows the KV cache to use a separate, more compact quantization format (turbo4 / turbo3), dramatically reducing memory usage even when the model weights themselves remain at IQ4_XS.
+ > **Note on TurboQuant**: This model requires a llama.cpp (https://github.com/TheTom/llama-cpp-turboquant) build with TurboQuant KV cache support. TurboQuant allows the KV cache to use a separate, more compact quantization format (turbo4 / turbo3), dramatically reducing memory usage even when the model weights themselves remain at IQ4_XS. Of course, it is also possible to use vllm or other inference frameworks that support TurboQuant technology, but the author used llama.cpp for the test.
 
  ## Memory Performance (with TurboQuant KV Cache)
 
@@ -84,7 +84,7 @@ Tested on **NVIDIA RTX 4060 Ti 16GB**:
 
  | Scenario | Speed |
  | :--- | :--- |
- | Text-only inference | **16-20 tokens/s** |
+ | Text-only inference | **18-20 tokens/s** |
 
  ### Vision Support (Optional)
 