---
license: apache-2.0
language:
  - en
library_name: transformers
base_model:
  - arcee-ai/AFM-4.5B-Base
tags:
  - kda
  - kimi-delta-attention
  - linear-attention
  - nope
  - hybrid-attention
  - distillation
  - research
---

# AFM-4.5B-Base-KDA-NoPE

A hybrid-attention variant of [AFM-4.5B-Base](https://huggingface.co/arcee-ai/AFM-4.5B-Base) that interleaves Kimi Delta Attention (KDA) layers with NoPE (No Positional Encoding) full-attention layers in a 3:1 ratio. It was produced by knowledge distillation from AFM-4.5B-Base and trades some benchmark accuracy for lower memory and KV-cache usage.

> ⚠️ **Research Model**: This is an experimental model released for research purposes. For production use, see [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B).

More details are available in our blog post: https://www.arcee.ai/blog/distilling-kimi-delta-attention-into-afm-4-5b-and-the-tool-we-used-to-do-it

## Overview

Following the Kimi Linear architecture pattern, this model interleaves KDA layers with periodic full-attention layers (using NoPE) in a 3:1 ratio. This hybrid structure reduces memory and KV-cache usage while preserving global information flow via the full attention layers.

**Key characteristics:**
- 3:1 KDA to full-attention ratio
- Full attention layers use NoPE (No Positional Encoding)
- Trained up to 32k sequence length
- Better short-context math performance (GSM8K) than pure KDA, with comparable results on knowledge benchmarks
- Reduced memory footprint compared to full attention

## Architecture

| Component | Details |
|-----------|---------|
| Parameters | 4.5B |
| Attention Pattern | 3 KDA : 1 Full Attn (NoPE) |
| Positional Encoding | NoPE on full attention layers |
| Max Training Length | 32k tokens |
| Base Model | AFM-4.5B-Base |
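
As a rough illustration of the 3:1 pattern, the sketch below builds a per-layer list of attention types. The function name, the layer count, and the choice to place the full-attention layer at the end of each group of four are assumptions made for illustration; the model's actual layer ordering and depth should be read from its config.

```python
from collections import Counter


def hybrid_layer_pattern(num_layers: int, kda_per_full: int = 3) -> list[str]:
    """Return a per-layer attention type list for a KDA/full-attention hybrid.

    Assumption: each group is `kda_per_full` KDA layers followed by one
    full-attention (NoPE) layer. The real model's ordering may differ.
    """
    group = kda_per_full + 1
    return [
        "full_nope" if (i + 1) % group == 0 else "kda"
        for i in range(num_layers)
    ]


# Hypothetical depth, chosen only to keep the printout short.
pattern = hybrid_layer_pattern(num_layers=8)
print(pattern)
print(Counter(pattern))
```

With any layer count divisible by four, three quarters of the layers are KDA and one quarter are full attention.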

## Benchmark Results

Performance compared to the teacher model and the KDA-Only variant:

| Benchmark | Teacher (Full Attn) | Hybrid (KDA-NoPE) | KDA-Only |
|-----------|:-------------------:|:-----------------:|:--------:|
| MMLU (Avg) | 63.1% | 55.1% | 55.8% |
| ARC-Challenge | 55.6% | 48.5% | 49.9% |
| HellaSwag (Norm) | 78.0% | 74.3% | 74.3% |
| GSM8K (Math) | 52.1% | 36.5% | 26.8% |

### Key Findings

- **Math advantage**: The hybrid recovers substantially more GSM8K performance (36.5%) than pure KDA (26.8%), though both remain well below the teacher (52.1%)
- **Knowledge benchmarks**: Within 1.5 points of KDA-Only on MMLU and ARC-Challenge (slightly lower on both), and tied on HellaSwag
- **Efficiency**: Retains most of the efficiency gains from KDA while preserving global information flow via the NoPE full-attention layers
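
One way to read the table is as the share of the teacher's score that each student retains. The short sketch below computes that ratio from the numbers above; the `retention` helper and the dictionary layout are only illustrative.

```python
# Scores (in %) copied from the benchmark table above.
scores = {
    "MMLU (Avg)": {"teacher": 63.1, "hybrid": 55.1, "kda_only": 55.8},
    "ARC-Challenge": {"teacher": 55.6, "hybrid": 48.5, "kda_only": 49.9},
    "HellaSwag (Norm)": {"teacher": 78.0, "hybrid": 74.3, "kda_only": 74.3},
    "GSM8K (Math)": {"teacher": 52.1, "hybrid": 36.5, "kda_only": 26.8},
}


def retention(student: float, teacher: float) -> float:
    """Student score expressed as a percentage of the teacher score."""
    return 100.0 * student / teacher


for name, s in scores.items():
    hybrid = retention(s["hybrid"], s["teacher"])
    kda_only = retention(s["kda_only"], s["teacher"])
    print(f"{name:<18} hybrid={hybrid:5.1f}%  kda_only={kda_only:5.1f}%")
```

On GSM8K, for instance, this gives roughly 70% of the teacher's score for the hybrid versus roughly 51% for KDA-Only.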

## Long-Context Performance (NIAH)

The hybrid model shows distinct long-context behavior:

- 100% single-needle retrieval up to 32k
- Sharp performance cliff past 32k training length
- Near-zero performance beyond training context (vs. smooth degradation for KDA-Only)

The NoPE full-attention layers appear responsible for the hard cutoff: they never saw sequences longer than 32k during training. KDA layers generalize more naturally to longer sequences.
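
For readers who want to probe this behavior themselves, a single-needle prompt can be assembled along the following lines. This is a minimal sketch, not the harness used for the results above: the function name, the filler sentence, and the needle text are all hypothetical, and context length should be measured in tokens with the model's tokenizer rather than in filler sentences.

```python
def build_niah_prompt(
    needle: str,
    filler: str,
    num_fillers: int,
    depth: float,
    question: str,
) -> str:
    """Place `needle` at relative `depth` (0.0 = start, 1.0 = end) in filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    sentences = [filler] * num_fillers
    # Index at which the needle is inserted among the filler sentences.
    insert_at = round(depth * num_fillers)
    sentences.insert(insert_at, needle)
    return " ".join(sentences) + "\n\n" + question


# Hypothetical needle and haystack, for illustration only.
prompt = build_niah_prompt(
    needle="The secret number is 7481.",
    filler="The grass is green and the sky is blue.",
    num_fillers=200,
    depth=0.5,
    question="What is the secret number?",
)
print(len(prompt.split()), "whitespace-separated words")
```

Sweeping `num_fillers` across the 32k-token boundary and `depth` from 0 to 1 would show where retrieval breaks down.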

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "arcee-ai/AFM-4.5B-Base-KDA-NoPE"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

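prompt = "The theory of relativity states that"
# Keep the full tokenizer output so generate() also receives the attention mask.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample up to 100 new tokens with nucleus sampling.
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))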
```

## Training Details

- **Method**: Knowledge distillation from AFM-4.5B-Base using [DistillKit](https://github.com/arcee-ai/DistillKit)
- **Teacher**: AFM-4.5B-Base (full attention)
- **Student Architecture**: Hybrid 3:1 KDA:NoPE
- **Training Length**: 32k sequence length
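
For readers unfamiliar with distillation, the sketch below shows a generic logit-matching objective: the KL divergence between teacher and student next-token distributions for a single position. It is a textbook formulation written in plain Python for illustration only. It is not taken from DistillKit, and the loss actually used to train this model may differ; the function names and the `temperature` parameter are assumptions.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_kl_loss(
    teacher_logits: list[float],
    student_logits: list[float],
    temperature: float = 1.0,
) -> float:
    """Forward KL(teacher || student) for a single token position."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(
        pi * (math.log(pi) - math.log(qi))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


# Toy three-token vocabulary: the loss is zero when the student matches the teacher.
print(kd_kl_loss([2.0, 0.0, -1.0], [2.0, 0.0, -1.0]))
print(kd_kl_loss([2.0, 0.0, -1.0], [0.0, 0.0, 0.0]))
```

In practice such a loss would be averaged over all positions in a batch and computed on GPU tensors rather than Python lists.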

## Comparison: Hybrid vs KDA-Only

| Aspect | Hybrid (KDA-NoPE) | KDA-Only |
|--------|:-----------------:|:--------:|
| Math (GSM8K) | 36.5% ✓ | 26.8% |
| Within-training NIAH | 100% | 100% |
| Beyond-training behavior | Hard cliff | Smooth degradation |
| KV-cache reduction vs. full attention | ~75% | ~100% |

Choose **Hybrid** for better short-context reasoning, especially math. Choose **KDA-Only** for more predictable long-context degradation.
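
The ~75% figure in the comparison table is consistent with a simple back-of-the-envelope estimate: if only one layer in four keeps a KV cache, the cache shrinks to a quarter of its full-attention size. The sketch below works through that arithmetic. The layer count, KV-head count, and head dimension are hypothetical placeholders rather than the model's actual config, and the fixed-size recurrent state of the KDA layers is ignored.

```python
def kv_cache_bytes(
    seq_len: int,
    num_full_attn_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # bfloat16
) -> int:
    """KV-cache size for one sequence: keys plus values in every full-attention layer."""
    return 2 * seq_len * num_full_attn_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical dimensions for illustration only (not the model's actual config).
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, SEQ_LEN = 32, 8, 128, 32_768

full = kv_cache_bytes(SEQ_LEN, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)
# 3:1 hybrid: only one layer in four is full attention and keeps a KV cache.
hybrid = kv_cache_bytes(SEQ_LEN, NUM_LAYERS // 4, NUM_KV_HEADS, HEAD_DIM)
reduction = 1 - hybrid / full

print(f"full: {full / 2**30:.2f} GiB, hybrid: {hybrid / 2**30:.2f} GiB, reduction: {reduction:.0%}")
```

The reduction ratio depends only on the fraction of full-attention layers, so it holds regardless of the placeholder dimensions chosen here.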

## Intended Use

This model is intended for:
- Research into hybrid attention architectures
- Studying linear/full attention tradeoffs
- Exploring NoPE attention in hybrid configurations
- Benchmarking efficiency vs. capability tradeoffs

## License

AFM-4.5B-Base-KDA-NoPE is released under the Apache-2.0 license.