---
base_model:
- meta-llama/Llama-3.1-8B-Instruct
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---

# Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference

[Flux Attention](https://arxiv.org/abs/2604.07394) is a context-aware framework that dynamically optimizes attention computation at the layer level. By integrating a lightweight Layer Router into frozen pretrained LLMs, the method adaptively routes each layer to Full Attention (FA) or Sparse Attention (SA) based on the input context. This layer-wise routing preserves high-fidelity information retrieval while ensuring contiguous memory access, resulting in significant wall-clock speedups during both prefill and decoding stages.

- **Project Page:** [https://qqtang-code.github.io/FluxAttention-Project-Page/](https://qqtang-code.github.io/FluxAttention-Project-Page/)
- **GitHub Repository:** [https://github.com/qqtang-code/FluxAttention](https://github.com/qqtang-code/FluxAttention)
- **Paper:** [arxiv.org/abs/2604.07394](https://arxiv.org/abs/2604.07394)

## Quick Start (Inference)

Below is a minimal example of how to use Flux Attention for text generation. Note that this requires the `fluxattn` package and dependencies (like `Block-Sparse-Attention`) to be installed as described in the [GitHub repository](https://github.com/qqtang-code/FluxAttention).

```python
import torch
import json
from transformers import AutoTokenizer, AutoModelForCausalLM

def load_sparse_model(model_path):
    """
    Dynamically loads the correct sparse architecture based on config.
    """
    config_path = f"{model_path}/config.json"
    with open(config_path, "r") as f:
        config_data = json.load(f)

    arch = config_data.get("architectures", [])
    if not arch:
        raise ValueError("No architecture found in config.json")

    arch_name = arch[0]
    print(f"🚀 Detected architecture: {arch_name}")

    # Register custom architectures
    if "PawLlama" in arch_name:
        from fluxattn.training.eval.modeling_flash_llama import (
            PawLlamaForCausalLM, PawLlamaConfig
        )
        AutoModelForCausalLM.register(PawLlamaConfig, PawLlamaForCausalLM)
        model_cls = PawLlamaForCausalLM
        
    elif "PawQwen" in arch_name:
        from fluxattn.training.eval.modeling_flash_qwen import (
            PawQwen3ForCausalLM, PawQwen3Config
        )
        AutoModelForCausalLM.register(PawQwen3Config, PawQwen3ForCausalLM)
        model_cls = PawQwen3ForCausalLM
    else:
        raise ValueError(f"Unsupported architecture: {arch_name}")

    # Load model
    model = model_cls.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
    )
    return model

# --- Execution ---
model_path = "QQTang1223/Flux-Attention-Llama-3.1-8B-Instruct" 
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

print("Loading Flux Attention Model...")
model = load_sparse_model(model_path)
model.eval()

# Generate
input_text = "Explain quantum mechanics in one sentence."
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

print("Generating...")
outputs = model.generate(**inputs, max_new_tokens=100)
print("
Output:
" + tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Citation

```bibtex
@misc{qiu2026fluxattentioncontextawarehybrid,
      title={Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference}, 
      author={Quantong Qiu and Zhiyi Hong and Yi Yang and Haitian Wang and Kebin Liu and Qingqing Dang and Juntao Li and Min Zhang},
      year={2026},
      eprint={2604.07394},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2604.07394}, 
}
```