Aretusa-2B Pretrained 32k

Aretusa-2B Pretrained 32k is the pre-SFT continued-pretraining checkpoint of Aretusa-2B after the long-context curriculum stage.

This is not a chat model and is not instruction-tuned. Use it as a base (continued-pretrained) causal language model for further SFT, evaluation, or research.

Model type

This repository is published in native Hugging Face Transformers format as:

Qwen3ForCausalLM

The original Aretusa architecture is Llama-style but includes QK Norm, i.e. RMSNorm on Q and K before RoPE. Standard Llama does not natively include those Q/K norm weights, while Qwen3 does, so Qwen3 is used as the closest native Transformers-compatible target.
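
To make the QK Norm description concrete, here is a minimal pure-Python sketch of RMSNorm applied to one head's query and key vectors. The function name `rms_norm`, the toy 4-dimensional vectors, and the all-ones gains are illustrative assumptions, not the Aretusa or Transformers implementation; a real model applies this per head over vectors of head dimension 128 and then rotates the result with RoPE.

```python
import math

def rms_norm(vec, weight, eps=1e-6):
    """Scale vec to unit root-mean-square, then apply a learned per-dimension gain."""
    mean_sq = sum(x * x for x in vec) / len(vec)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [x * inv_rms * w for x, w in zip(vec, weight)]

# Toy per-head vectors (hypothetical values; the real head dimension is 128).
q_head = [3.0, -4.0, 0.0, 0.0]
k_head = [0.5, 0.5, 0.5, 0.5]

# Separate learned gains for Q and K. These are the extra tensors that a
# standard Llama checkpoint has no slot for.
q_norm_weight = [1.0] * 4
k_norm_weight = [1.0] * 4

q_normed = rms_norm(q_head, q_norm_weight)
k_normed = rms_norm(k_head, k_norm_weight)
# RoPE would be applied to q_normed and k_normed after this step.
```

Normalizing Q and K this way bounds the magnitude of the attention logits regardless of how large the raw projections grow.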

The converted checkpoint does not require trust_remote_code.

Architecture summary

  • Dense decoder-only causal LM
  • ~2B parameters
  • Hidden size: 2048
  • Layers: 32
  • Attention heads: 16
  • KV heads: 4
  • Head dimension: 128
  • SwiGLU MLP
  • RMSNorm pre-norm
  • QK Norm
  • RoPE with NTK scaling
  • Max context: 32768 tokens
  • Tokenizer vocab size: 65536
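
As a sanity check on the "~2B" figure, the dimensions above (plus d_ff = 8192 from the configuration in this card) can be turned into a rough parameter and KV-cache estimate. This is a back-of-the-envelope sketch: it assumes no bias terms, and because the card does not say whether the output head shares the embedding matrix, it reports both cases.

```python
# Dimensions taken from the architecture summary and configuration in this card.
vocab_size, d_model, n_layers = 65536, 2048, 32
n_heads, n_kv_heads, head_dim, d_ff = 16, 4, 128, 8192

# Attention: Q and O projections are full width; K and V are narrower
# because 16 query heads share 4 KV heads.
attn = d_model * n_heads * head_dim * 2 + d_model * n_kv_heads * head_dim * 2
# SwiGLU MLP: gate, up, and down projections.
mlp = 3 * d_model * d_ff
# Two RMSNorm gains per layer plus the per-head Q and K norm gains.
norms = 2 * d_model + 2 * head_dim
per_layer = attn + mlp + norms

embed = vocab_size * d_model
# Assumption: no bias terms anywhere; final RMSNorm adds d_model gains.
params_tied = n_layers * per_layer + embed + d_model
params_untied = params_tied + embed

# KV cache for one sequence at full context, in bfloat16 (2 bytes per value).
max_seq_len, bytes_per_value = 32768, 2
kv_cache_bytes = 2 * n_layers * n_kv_heads * head_dim * max_seq_len * bytes_per_value

print(f"params (tied head):   {params_tied / 1e9:.2f}B")
print(f"params (untied head): {params_untied / 1e9:.2f}B")
print(f"KV cache at 32k:      {kv_cache_bytes / 2**30:.1f} GiB")
```

Under these assumptions the estimate lands between roughly 2.1B and 2.2B parameters, consistent with the stated ~2B, and the KV cache for a single 32768-token sequence comes to about 2 GiB.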

Training stage

This checkpoint corresponds to:

long_context_curriculum/stage3_32k/long_context/final_export

Configuration in the original Aretusa format (before conversion):

{
  "model_type": "aretusa",
  "vocab_size": 65536,
  "d_model": 2048,
  "n_layers": 32,
  "n_heads": 16,
  "n_kv_heads": 4,
  "d_ff": 8192,
  "max_seq_len": 32768,
  "original_max_seq_len": 4096,
  "rope_base": 500000.0,
  "rope_scaling": {
    "type": "ntk",
    "factor": 8.0
  },
  "norm_eps": 1e-06,
  "dtype": "bfloat16"
}
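
For readers comparing this configuration with the published one, the sketch below shows how the Aretusa-style keys could line up with Qwen3-style config field names. The `KEY_MAP` table and `to_qwen3_style` helper are hypothetical and are not the author's conversion script; the published config.json may differ. RoPE fields are left out on purpose, since the conversion note says the scaling is folded into the effective theta.

```python
# Original Aretusa-style values, copied from the configuration in this card.
aretusa_cfg = {
    "vocab_size": 65536, "d_model": 2048, "n_layers": 32,
    "n_heads": 16, "n_kv_heads": 4, "d_ff": 8192,
    "max_seq_len": 32768, "norm_eps": 1e-06,
}

# Hypothetical mapping from Aretusa keys to Qwen3-style field names.
KEY_MAP = {
    "vocab_size": "vocab_size",
    "d_model": "hidden_size",
    "n_layers": "num_hidden_layers",
    "n_heads": "num_attention_heads",
    "n_kv_heads": "num_key_value_heads",
    "d_ff": "intermediate_size",
    "max_seq_len": "max_position_embeddings",
    "norm_eps": "rms_norm_eps",
}

def to_qwen3_style(cfg):
    """Rename keys according to KEY_MAP and add the fields Aretusa leaves implicit."""
    out = {dst: cfg[src] for src, dst in KEY_MAP.items()}
    out["model_type"] = "qwen3"
    out["head_dim"] = 128  # stated in the architecture summary
    return out

qwen3_cfg = to_qwen3_style(aretusa_cfg)
for key, value in qwen3_cfg.items():
    print(f"{key}: {value}")
```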

Usage

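from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo = "SerFabio89/aretusa-2b-pretrained-32k"

# The checkpoint is in native Qwen3ForCausalLM format, so no remote code is needed.
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=False)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=False,
    dtype=torch.bfloat16,  # older Transformers releases name this argument torch_dtype
    device_map="auto",
)

# Base-model prompt: plain text to be continued, with the BOS token written explicitly.
prompt = "<|begin_of_text|>L'intelligenza artificiale è"
# add_special_tokens=False keeps the tokenizer from adding special tokens on top of the explicit one.
inputs = tok(prompt, return_tensors="pt", add_special_tokens=False)
# Drop token_type_ids if the tokenizer returns it; the model does not use it.
inputs.pop("token_type_ids", None)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode():
    out = model.generate(
        **inputs,
        max_new_tokens=80,
        do_sample=False,  # greedy decoding
        pad_token_id=tok.pad_token_id,
        eos_token_id=tok.eos_token_id,
    )

# Keep special tokens in the output so BOS/EOS handling is visible.
print(tok.decode(out[0], skip_special_tokens=False))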

Important limitations

This is a pretrained base checkpoint, not a final assistant.

Expected limitations:

  • Not aligned for chat
  • Not optimized for instruction following
  • May produce continuations rather than direct answers
  • Requires SFT or another alignment stage for assistant use

Conversion note

The checkpoint was converted from the original Aretusa format to native Qwen3ForCausalLM format. The conversion preserves QK Norm tensors and folds static NTK RoPE scaling into the effective RoPE theta used by the native Transformers configuration.
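
The card does not give the exact formula used to fold the scaling into theta. As an illustration only, the sketch below applies the commonly used NTK-aware rule, theta_eff = rope_base * factor^(d / (d - 2)) with d the head dimension, to the values from this card. The function names are hypothetical, and the converter may use a different rule.

```python
def ntk_scaled_theta(rope_base, factor, head_dim):
    """Assumed NTK-aware rule: stretch the RoPE base instead of rescaling positions."""
    return rope_base * factor ** (head_dim / (head_dim - 2))

def inv_freq(theta, head_dim, i):
    """Rotation frequency of the i-th RoPE dimension pair."""
    return theta ** (-2 * i / head_dim)

# Values from the configuration and architecture summary in this card.
rope_base, factor, head_dim = 500000.0, 8.0, 128
theta_eff = ntk_scaled_theta(rope_base, factor, head_dim)

# Compare frequencies before and after for the slowest and fastest pairs.
last = head_dim // 2 - 1
slowdown = inv_freq(rope_base, head_dim, last) / inv_freq(theta_eff, head_dim, last)
fastest_ratio = inv_freq(rope_base, head_dim, 0) / inv_freq(theta_eff, head_dim, 0)

print(f"effective theta ~ {theta_eff:.3e}")
print(f"slowest pair slowed by {slowdown:.2f}x, fastest pair by {fastest_ratio:.2f}x")
```

Under this assumed rule the effective theta comes out at roughly 4.1e6. The slowest-rotating pair is slowed by the full factor of 8, matching the 4096 to 32768 context extension, while the fastest pair is left unchanged, which is why a single static theta can stand in for the scaling.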
