---
license: apache-2.0
language:
- en
library_name: transformers
base_model:
- arcee-ai/AFM-4.5B-Base
tags:
- kda
- kimi-delta-attention
- linear-attention
- nope
- hybrid-attention
- distillation
- research
---

# AFM-4.5B-Base-KDA-NoPE

A hybrid attention variant of [AFM-4.5B-Base](https://huggingface.co/arcee-ai/AFM-4.5B-Base) combining Kimi Delta Attention (KDA) with NoPE (No Positional Encoding) full-attention layers in a 3:1 ratio. The model is distilled from the full-attention base model and aims to balance efficiency with performance.

> ⚠️ **Research Model**: This is an experimental model released for research purposes. For production use, see [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B).

More details are available in our [blog post](https://www.arcee.ai/blog/distilling-kimi-delta-attention-into-afm-4-5b-and-the-tool-we-used-to-do-it).

## Overview

Following the Kimi Linear architecture pattern, this model interleaves KDA layers with periodic full-attention layers (using NoPE) in a 3:1 ratio. This hybrid structure reduces memory and KV-cache usage while preserving global information flow via the full-attention layers.

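As a rough illustration of the 3:1 interleaving described above, the sketch below builds a layer-type list and counts how many layers would hold a KV cache. The helper names, the placement of the full-attention layer at every fourth position, and the 8-layer depth are assumptions made for illustration; they are not taken from the model's configuration.

```python
def hybrid_layer_pattern(num_layers: int, kda_per_full: int = 3) -> list[str]:
    """Return layer types with one full-attention layer after every `kda_per_full` KDA layers.

    Assumption: the full-attention (NoPE) layer closes each group of four.
    """
    period = kda_per_full + 1
    return ["full_nope" if (i + 1) % period == 0 else "kda" for i in range(num_layers)]


def kv_cache_fraction(pattern: list[str]) -> float:
    """Fraction of layers that keep a KV cache growing with sequence length."""
    return pattern.count("full_nope") / len(pattern)


# 8 layers is an illustrative depth, not the real layer count of the model.
pattern = hybrid_layer_pattern(num_layers=8)
print(pattern)
print(f"Layers with a KV cache: {kv_cache_fraction(pattern):.0%}")
```

With one full-attention layer in four, only a quarter of the layers keep a cache that grows with sequence length, which is where a roughly 75% KV-cache saving would come from. KDA layers instead carry a fixed-size recurrent state.
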
**Key characteristics:**

- 3:1 KDA to full-attention ratio
- Full-attention layers use NoPE (No Positional Encoding)
- Trained up to 32k sequence length
- Better short-context math performance (GSM8K) than pure KDA, with comparable results on knowledge benchmarks
- Reduced memory footprint compared to full attention

## Architecture

| Component | Details |
|-----------|---------|
| Parameters | 4.5B |
| Attention Pattern | 1 Full Attn (NoPE) : 3 KDA |
| Positional Encoding | NoPE on full-attention layers |
| Max Training Length | 32k tokens |
| Base Model | AFM-4.5B-Base |

## Benchmark Results

Performance compared to the teacher model and other configurations:

| Benchmark | Teacher (Full Attn) | Hybrid (KDA-NoPE) | KDA-Only |
|-----------|:-------------------:|:-----------------:|:--------:|
| MMLU (Avg) | 63.1% | 55.1% | 55.8% |
| ARC-Challenge | 55.6% | 48.5% | 49.9% |
| HellaSwag (Norm) | 78.0% | 74.3% | 74.3% |
| GSM8K (Math) | 52.1% | 36.5% | 26.8% |

### Key Findings

- **Math advantage**: The hybrid recovers substantially more GSM8K performance (36.5%) than pure KDA (26.8%), though it remains below the teacher (52.1%)
- **Knowledge benchmarks**: Performs within 1.5 points of KDA-Only on MMLU, ARC-Challenge, and HellaSwag
- **Efficiency**: Maintains efficiency gains from KDA while preserving global reasoning via NoPE layers

## Long-Context Performance (NIAH)

The hybrid model shows distinct long-context behavior on needle-in-a-haystack (NIAH) retrieval:

- 100% single-needle retrieval up to 32k
- Sharp performance cliff past the 32k training length
- Near-zero performance beyond the training context (vs. smooth degradation for KDA-Only)

The NoPE full-attention layers appear responsible for the hard cutoff: they were never trained on sequences longer than 32k. KDA layers generalize more naturally to longer sequences.

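For readers who want to probe the retrieval cliff themselves, a single-needle test can be sketched as follows. This is a generic illustration, not the evaluation harness behind the numbers above; the function names, filler sentence, and needle text are all hypothetical.

```python
def build_niah_prompt(filler: str, needle: str, question: str,
                      num_filler: int, depth: float) -> str:
    """Hide `needle` among `num_filler` copies of `filler`.

    `depth` is the relative position of the needle: 0.0 = start, 1.0 = end.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler] * num_filler
    sentences.insert(round(depth * num_filler), needle)
    return " ".join(sentences) + "\n\n" + question


def contains_answer(model_output: str, expected: str) -> bool:
    """Score a retrieval as correct if the expected string appears in the output."""
    return expected.lower() in model_output.lower()


# Hypothetical example: grow `num_filler` until the tokenized prompt
# reaches the context length you want to test.
prompt = build_niah_prompt(
    filler="The sky was clear that day.",
    needle="The secret number is 7421.",
    question="What is the secret number?",
    num_filler=10,
    depth=0.5,
)
```

Sweeping the prompt length across the 32k boundary and the needle depth from 0.0 to 1.0 gives a grid of pass/fail results that can be compared against the behavior described above.
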
## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "arcee-ai/AFM-4.5B-Base-KDA-NoPE"

# Load the tokenizer and the model weights in bfloat16
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# This is a base model, so give it text to continue rather than an instruction
prompt = "The theory of relativity states that"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# Sample up to 100 new tokens
outputs = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.95
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

- **Method**: Knowledge distillation from AFM-4.5B-Base using [DistillKit](https://github.com/arcee-ai/DistillKit)
- **Teacher**: AFM-4.5B-Base (full attention)
- **Student Architecture**: Hybrid 3:1 KDA:NoPE
- **Training Length**: 32k sequence length

## Comparison: Hybrid vs KDA-Only

| Aspect | Hybrid (KDA-NoPE) | KDA-Only |
|--------|:-----------------:|:--------:|
| Math (GSM8K) | 36.5% ✓ | 26.8% |
| Within-training NIAH | 100% | 100% |
| Beyond-training behavior | Hard cliff | Smooth degradation |
| KV-cache reduction vs. full attention | ~75% | ~100% |

Choose **Hybrid** for better short-context reasoning, especially math. Choose **KDA-Only** for more predictable long-context degradation.

## Intended Use

This model is intended for:

- Research into hybrid attention architectures
- Studying linear/full attention tradeoffs
- Exploring NoPE attention in hybrid configurations
- Benchmarking efficiency vs. capability tradeoffs

## License

This model is released under the Apache-2.0 license.