---
library_name: transformers
tags:
  - quantized
  - hybrid
language:
  - ru
  - en
---

Vikra MixedPrc

A 12.25B-parameter Mistral-based language model with mixed-precision hybrid quantization.

Model Details

| Property | Value |
|---|---|
| Architecture | Mistral (12.25B params, 40 layers) |
| Hidden size | 5120 |
| Attention heads | 32 (8 KV heads, GQA) |
| Intermediate size | 14336 |
| Context length | 1,024,000 tokens |
| Vocabulary | 131,072 tokens (Tekken BPE) |
| RoPE theta | 1,000,000.0 |
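
To illustrate what the 8 KV heads mean for memory at a given context size, here is a rough KV-cache estimate built from the table above. The head dimension of 128 and the 16-bit cache element size are assumptions that this README does not state, so treat the numbers as a sketch rather than a measurement.

```python
def kv_cache_bytes(n_ctx, n_layers=40, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Approximate KV-cache size in bytes.

    n_layers and n_kv_heads come from the Model Details table.
    head_dim=128 and bytes_per_elem=2 (16-bit cache) are ASSUMPTIONS.
    """
    # One K vector and one V vector per layer, per KV head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx


per_token_kib = kv_cache_bytes(1) / 1024
at_4096_mib = kv_cache_bytes(4096) / 1024**2
print(f"{per_token_kib:.0f} KiB per token, {at_4096_mib:.0f} MiB at 4096 tokens")
```

Under these assumptions the cache grows linearly with context, so the full 1,024,000-token window would need far more memory than the `-c 4096` setting used in the Usage section.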

Quantization: MixP_4.9b_S

Custom mixed-precision quantization scheme with per-tensor type assignment.

| Tensor group | Quant type | BPW |
|---|---|---|
| token_embd, output | BF16 | 16.00 |
| attn_norm, ffn_norm, output_norm | F32 | 32.00 |
| attn_q | Q4_K | 4.50 |
| attn_k | Q5_K | 5.50 |
| attn_v | Q3_K | 3.44 |
| attn_output | Q4_K | 4.50 |
| ffn_gate | Q3_K | 3.44 |
| ffn_up | Q5_K | 5.50 |
| ffn_down | Q5_K / Q6_K (last layers) | 5.50–6.56 |

Overall: 6.11 BPW | Quantized layers only: 4.89 BPW | File size: 8.71 GB
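
The three summary figures can be cross-checked with a little arithmetic. The sketch below assumes that `token_embd` and `output` are two separate vocabulary-by-hidden matrices, ignores the small F32 norm tensors, and uses the rounded 12.25B parameter count, so it only reproduces the numbers to about two decimals. The file size matches when "GB" is read as GiB (2^30 bytes), which is an inference, not something the README states.

```python
VOCAB = 131_072          # from the Model Details table
HIDDEN = 5_120           # from the Model Details table
TOTAL_PARAMS = 12.25e9   # rounded figure from the README

# ASSUMPTION: token_embd and output are untied, each VOCAB x HIDDEN, both BF16.
embed_params = 2 * VOCAB * HIDDEN
# ASSUMPTION: everything else is a quantized tensor (norm weights ignored).
quant_params = TOTAL_PARAMS - embed_params

# Parameter-weighted average of 16.00 BPW and the reported 4.89 BPW.
overall_bpw = (embed_params * 16.0 + quant_params * 4.89) / TOTAL_PARAMS

# File size implied by the reported overall 6.11 BPW, in GiB.
size_gib = TOTAL_PARAMS * 6.11 / 8 / 2**30

print(f"overall ~ {overall_bpw:.2f} BPW, size ~ {size_gib:.2f} GiB")
```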

Perplexity

Measured on wikitext-2-raw-test (full dataset, 73 chunks, context 4096 tokens, 299,008 tokens evaluated):

| Model | Precision | Size | PPL |
|---|---|---|---|
| Vikra MixP_4.9b_S | 6.11 BPW | 8.71 GB | 5.5000 ± 0.032 |
| Vikhr-Nemo-12B-Instruct (baseline) | BF16 | 22.81 GB | 6.0212 ± 0.034 |
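
For readers unfamiliar with the metric, the following minimal sketch shows the usual definition of perplexity (the exponential of the mean negative log-likelihood per token) and checks the token count quoted above. The `perplexity` helper and the toy log-probabilities are illustrative only; the README does not say which tool produced the reported values.

```python
import math


def perplexity(token_logprobs):
    """exp(mean negative log-likelihood), with natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# 73 chunks of 4096 tokens gives the evaluated token count from the README.
n_tokens = 73 * 4096

# Toy input: a model that assigns probability 0.25 to every correct token
# is, on average, choosing among 4 equally likely options.
toy_logprobs = [math.log(0.25)] * 8
print(n_tokens, perplexity(toy_logprobs))
```

Lower values mean the model assigns higher probability to the reference text.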

Chat Template

Built-in chat template (baked into GGUF):

<|start_header_id|>system<|end_header_id|>
{system_message}</s><|start_header_id|>user<|end_header_id|>
{user_message}</s><|start_header_id|>assistant<|end_header_id|>
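
When calling the model through a raw completion interface instead of a chat endpoint, the prompt has to be assembled by hand. The hypothetical helper below fills in the template exactly as printed above. The line breaks follow the README text literally, which is an assumption about the template's exact whitespace, and BOS-token handling is left to the runtime.

```python
def build_prompt(system_message: str, user_message: str) -> str:
    """Fill the chat template shown in the README (hypothetical helper).

    ASSUMPTION: each header is followed by a single newline and nothing
    follows the final assistant header, mirroring the text as printed.
    """
    return (
        "<|start_header_id|>system<|end_header_id|>\n"
        f"{system_message}</s>"
        "<|start_header_id|>user<|end_header_id|>\n"
        f"{user_message}</s>"
        "<|start_header_id|>assistant<|end_header_id|>"
    )


print(build_prompt("You are a helpful assistant.", "Hello!"))
```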

Usage

# llama.cpp server
llama-server -m Vikra-MixP_4.9b_S.gguf -ngl 25 -c 4096

# llama.cpp CLI
llama-cli -m Vikra-MixP_4.9b_S.gguf -ngl 25 -c 4096 -cnv
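
Once `llama-server` is running, it can be queried over its OpenAI-compatible HTTP API. The sketch below uses only the Python standard library and assumes the server listens on its default address, `http://127.0.0.1:8080`. The function names, the default system message, and the `max_tokens` value are illustrative choices, not part of this model card.

```python
import json
import urllib.request


def build_chat_request(user_message,
                       system_message="You are a helpful assistant.",
                       base_url="http://127.0.0.1:8080"):
    """Build a POST request for the server's /v1/chat/completions endpoint.

    ASSUMPTION: base_url is the llama-server default host and port.
    """
    payload = {
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(user_message):
    """Send one message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_message)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


# Example (requires a running server):
#   print(ask("Hello! Who are you?"))
```

Because the chat endpoint applies the template stored in the GGUF file, the client sends plain role/content messages and does not format the prompt itself.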