
Libraries

How to use SerFabio89/aretusa-2b-pretrained-32k with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="SerFabio89/aretusa-2b-pretrained-32k")
# This is a base checkpoint, not a chat model: pass a plain-text prefix to continue
pipe("L'intelligenza artificiale è", max_new_tokens=40)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("SerFabio89/aretusa-2b-pretrained-32k")
# The checkpoint is published as Qwen3ForCausalLM, so the causal-LM auto class applies
model = AutoModelForCausalLM.from_pretrained("SerFabio89/aretusa-2b-pretrained-32k")

# Base checkpoint (not instruction-tuned): tokenize a plain-text prefix, not chat messages
prompt = "L'intelligenza artificiale è"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Drop token_type_ids if the tokenizer emits them; generate() does not use them
inputs.pop("token_type_ids", None)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use SerFabio89/aretusa-2b-pretrained-32k with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "SerFabio89/aretusa-2b-pretrained-32k"
# Call the server using curl (OpenAI-compatible completions API; this base model is not chat-tuned):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SerFabio89/aretusa-2b-pretrained-32k",
		"prompt": "The capital of France is",
		"max_tokens": 40
	}'

Use Docker

docker model run hf.co/SerFabio89/aretusa-2b-pretrained-32k

SGLang

How to use SerFabio89/aretusa-2b-pretrained-32k with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "SerFabio89/aretusa-2b-pretrained-32k" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible completions API; this base model is not chat-tuned):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SerFabio89/aretusa-2b-pretrained-32k",
		"prompt": "The capital of France is",
		"max_tokens": 40
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "SerFabio89/aretusa-2b-pretrained-32k" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible completions API; this base model is not chat-tuned):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SerFabio89/aretusa-2b-pretrained-32k",
		"prompt": "The capital of France is",
		"max_tokens": 40
	}'

Docker Model Runner
How to use SerFabio89/aretusa-2b-pretrained-32k with Docker Model Runner:
```
docker model run hf.co/SerFabio89/aretusa-2b-pretrained-32k
```

Aretusa-2B Pretrained 32k

Aretusa-2B Pretrained 32k is the pre-SFT continued-pretraining checkpoint of Aretusa-2B after the long-context curriculum stage.

This is not a chat model and is not instruction-tuned. Use it as a base (continued-pretrained) causal language model for further SFT, evaluation, or research.

Model type

This repository is published in native Hugging Face Transformers format as:

Qwen3ForCausalLM

The original Aretusa architecture is Llama-style but includes QK Norm, i.e. RMSNorm on Q and K before RoPE. Standard Llama does not natively include those Q/K norm weights, while Qwen3 does, so Qwen3 is used as the closest native Transformers-compatible target.
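To make the QK Norm ordering concrete, here is a minimal pure-Python sketch of one attention head: RMSNorm is applied to the query and key vectors first, and RoPE rotates the normalized vectors afterwards. The function names, the toy head dimension of 8, and the half-split pairing of rotated dimensions (the convention used by the Transformers `rotate_half` helper) are assumptions for illustration, not code taken from the Aretusa or Qwen3 implementations.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm over one head vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]

def apply_rope(x, position, theta=500000.0):
    """Rotate each pair (x[i], x[i + d/2]) by the angle position * theta^(-2i/d)."""
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        angle = position * theta ** (-2.0 * i / d)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        out[i] = x[i] * cos_a - x[i + half] * sin_a
        out[i + half] = x[i + half] * cos_a + x[i] * sin_a
    return out

def qk_norm_then_rope(q_head, k_head, q_weight, k_weight, position):
    """QK Norm: normalize Q and K per head first, then apply RoPE."""
    q = apply_rope(rms_norm(q_head, q_weight), position)
    k = apply_rope(rms_norm(k_head, k_weight), position)
    return q, k

# Toy example (head_dim = 8 for readability; the real model uses 128).
head_dim = 8
q_head = [0.5 * (i + 1) for i in range(head_dim)]
k_head = [(-1) ** i * 2.0 for i in range(head_dim)]
ones = [1.0] * head_dim  # stand-in for the learned norm weights

q, k = qk_norm_then_rope(q_head, k_head, ones, ones, position=3)
q_rms = math.sqrt(sum(v * v for v in q) / head_dim)
k_rms = math.sqrt(sum(v * v for v in k) / head_dim)
print(q_rms, k_rms)
```

With unit weights, both vectors come out with a root-mean-square close to 1 regardless of their input scale, and the RoPE rotation leaves that magnitude unchanged. The learned Q/K norm weights are the tensors that a standard Llama checkpoint layout has no slot for.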

No trust_remote_code is required for the converted checkpoint.

Architecture summary

Dense decoder-only causal LM
~2B parameters
Hidden size: 2048
Layers: 32
Attention heads: 16
KV heads: 4
Head dimension: 128
SwiGLU MLP
RMSNorm pre-norm
QK Norm
RoPE with NTK scaling
Max context: 32768 tokens
Tokenizer vocab size: 65536
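As a rough illustration of what 4 KV heads (instead of 16) mean at the 32768-token context, the sketch below estimates the KV cache size from the figures in the list above. It assumes the cache is stored in bfloat16 (2 bytes per value), a single sequence, and no allocator or paging overhead; `kv_cache_bytes` is a hypothetical helper, not part of any library.

```python
n_layers = 32
n_heads = 16
n_kv_heads = 4
head_dim = 128
max_context = 32768
bytes_per_value = 2  # assumption: bfloat16 cache

def kv_cache_bytes(num_tokens, kv_heads):
    """Keys + values (factor 2) for every layer, KV head, and head dimension."""
    return 2 * n_layers * kv_heads * head_dim * bytes_per_value * num_tokens

per_token = kv_cache_bytes(1, n_kv_heads)
full_context_gqa = kv_cache_bytes(max_context, n_kv_heads)
full_context_mha = kv_cache_bytes(max_context, n_heads)  # hypothetical: one KV head per query head

print(f"per token:           {per_token / 2**10:.0f} KiB")
print(f"32k context, 4 KV:   {full_context_gqa / 2**30:.0f} GiB")
print(f"32k context, 16 KV:  {full_context_mha / 2**30:.0f} GiB")
```

Under these assumptions the cache costs 64 KiB per token, or 2 GiB for one full 32768-token sequence, versus 8 GiB if every one of the 16 query heads had its own KV head.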

Training stage

This checkpoint corresponds to:

long_context_curriculum/stage3_32k/long_context/final_export

Configuration (original Aretusa format, before conversion to Qwen3ForCausalLM):

{
  "model_type": "aretusa",
  "vocab_size": 65536,
  "d_model": 2048,
  "n_layers": 32,
  "n_heads": 16,
  "n_kv_heads": 4,
  "d_ff": 8192,
  "max_seq_len": 32768,
  "original_max_seq_len": 4096,
  "rope_base": 500000.0,
  "rope_scaling": {
    "type": "ntk",
    "factor": 8.0
  },
  "norm_eps": 1e-06,
  "dtype": "bfloat16"
}
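The "~2B parameters" figure can be cross-checked against this configuration with a back-of-the-envelope count. The sketch below assumes no bias terms, per-layer Q/K norm weights of size `head_dim` shared across heads (as in Qwen3), and two pre-norm RMSNorm weights per layer. Whether the input embedding and output head share weights is not stated in the card, so both cases are computed.

```python
cfg = {
    "vocab_size": 65536,
    "d_model": 2048,
    "n_layers": 32,
    "n_heads": 16,
    "n_kv_heads": 4,
    "d_ff": 8192,
}

d = cfg["d_model"]
head_dim = d // cfg["n_heads"]  # 128, matching the architecture summary

# Attention: Q and O are d x d; K and V are d x (n_kv_heads * head_dim).
attn = 2 * d * (cfg["n_heads"] * head_dim) + 2 * d * (cfg["n_kv_heads"] * head_dim)
# SwiGLU MLP: gate, up, and down projections.
mlp = 3 * d * cfg["d_ff"]
# Two RMSNorm weights per layer plus Q-norm and K-norm weights (assumed size head_dim).
norms = 2 * d + 2 * head_dim

per_layer = attn + mlp + norms
embed = cfg["vocab_size"] * d

total_tied = embed + cfg["n_layers"] * per_layer + d  # + final norm
total_untied = total_tied + embed                     # separate output head

print(f"tied embeddings:   {total_tied / 1e9:.2f}B")
print(f"untied embeddings: {total_untied / 1e9:.2f}B")
```

Under these assumptions the count lands at roughly 2.08B with tied embeddings and 2.21B with a separate output head, both consistent with the stated ~2B.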

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo = "SerFabio89/aretusa-2b-pretrained-32k"

tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=False)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=False,
    dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "<|begin_of_text|>L'intelligenza artificiale è"
inputs = tok(prompt, return_tensors="pt", add_special_tokens=False)
inputs.pop("token_type_ids", None)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode():
    out = model.generate(
        **inputs,
        max_new_tokens=80,
        do_sample=False,
        pad_token_id=tok.pad_token_id,
        eos_token_id=tok.eos_token_id,
    )

print(tok.decode(out[0], skip_special_tokens=False))

Important limitations

This is a pretrained base checkpoint, not a final assistant.

Expected limitations:

Not aligned for chat
Not optimized for instruction following
May produce continuations rather than direct answers
Requires SFT or another alignment stage for assistant use
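Because the checkpoint continues text rather than answering, one common way to probe a base model before any SFT is to frame the task as a document whose natural continuation is the answer. The helper below is a hypothetical illustration of that few-shot framing. The `Question:`/`Answer:` labels and the example pairs are assumptions, not a prompt format this model was trained on.

```python
def build_few_shot_prompt(examples, question):
    """Format solved Q/A pairs followed by an open question as one text prefix."""
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)

examples = [
    ("What is the capital of Italy?", "Rome"),
    ("What is the capital of Spain?", "Madrid"),
]
prompt = build_few_shot_prompt(examples, "What is the capital of France?")
print(prompt)
```

The resulting string can be passed to the model as a plain prompt. Generation usually needs a small `max_new_tokens` or a stop on the next blank line, since a base model will tend to keep writing further question/answer pairs.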

Conversion note

The checkpoint was converted from the original Aretusa format to native Qwen3ForCausalLM format. The conversion preserves QK Norm tensors and folds static NTK RoPE scaling into the effective RoPE theta used by the native Transformers configuration.
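For readers unfamiliar with folding NTK scaling into theta, the sketch below shows the widely used static "NTK-aware" formula, theta' = theta * factor^(d / (d - 2)), applied to the values from the original configuration (`rope_base` 500000, `factor` 8, head dimension 128). The card does not state which exact formula the conversion used, so treat this as an assumption. The authoritative value is the `rope_theta` in the published config.

```python
def ntk_scaled_theta(rope_base, factor, head_dim):
    """Static NTK-aware scaling: raise the RoPE base so the lowest frequency
    is stretched by `factor` while the highest frequency stays unchanged."""
    return rope_base * factor ** (head_dim / (head_dim - 2))

rope_base = 500000.0
factor = 8.0
head_dim = 128

effective_theta = ntk_scaled_theta(rope_base, factor, head_dim)

# Check the defining property on the lowest-frequency pair (i = head_dim/2 - 1):
# its wavelength is proportional to theta^(2i/d), and should grow by `factor`.
exponent = (head_dim - 2) / head_dim
stretch = effective_theta ** exponent / rope_base ** exponent

print(f"effective theta: {effective_theta:.0f}")
print(f"lowest-frequency stretch: {stretch:.3f}")
```

Under this formula the effective theta comes out at about 4.13 million, and the lowest-frequency wavelength grows by the full factor of 8, which matches the extension from the original 4096-token window to 32768 tokens. A checkpoint configured this way needs no `rope_scaling` entry at load time, because the scaling is already baked into theta.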
