Aretusa-2B Pretrained 32k
Aretusa-2B Pretrained 32k is the pre-SFT continued-pretraining checkpoint of Aretusa-2B after the long-context curriculum stage.
This is not a chat model and is not instruction-tuned. Use it as a base (continued-pretrained) causal language model for further SFT, evaluation, or research.
Model type
This repository is published in native Hugging Face Transformers format as:
Qwen3ForCausalLM
The original Aretusa architecture is Llama-style but includes QK Norm, i.e. RMSNorm on Q and K before RoPE. Standard Llama does not natively include those Q/K norm weights, while Qwen3 does, so Qwen3 is used as the closest native Transformers-compatible target.
No trust_remote_code is required for the converted checkpoint.
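To make "RMSNorm on Q and K before RoPE" concrete, here is a minimal pure-Python sketch for a single attention head vector. It is illustrative only, not the model's actual kernel: the function names are hypothetical, the half-split rotation layout is an assumption (it matches the convention commonly used in Transformers), and the theta value is simply the rope_base from the configuration, not the effective theta of the converted checkpoint.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm over one head vector: scale to unit RMS, then apply a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]

def rope(x, pos, theta=500000.0):
    """Rotary embedding using the half-split layout (pairs i and i + d/2). Assumed layout."""
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        angle = pos * theta ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

def qk_norm_then_rope(vec, weight, pos):
    """QK Norm ordering: normalize the head vector first, then rotate it."""
    return rope(rms_norm(vec, weight), pos)

# Toy head vector with head_dim = 8 (the real model uses 128).
q = [1.0, -2.0, 3.0, 0.5, -1.5, 2.5, 0.0, 4.0]
gain = [1.0] * len(q)  # hypothetical learned gain, set to ones here
q_rot = qk_norm_then_rope(q, gain, pos=7)
```

Because RoPE is a pure rotation, the rotated vector keeps the unit RMS produced by the norm, which is the property that keeps attention logits bounded regardless of position.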
Architecture summary
- Dense decoder-only causal LM
- ~2B parameters
- Hidden size: 2048
- Layers: 32
- Attention heads: 16
- KV heads: 4
- Head dimension: 128
- SwiGLU MLP
- RMSNorm pre-norm
- QK Norm
- RoPE with NTK scaling
- Max context: 32768 tokens
- Tokenizer vocab size: 65536
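As a rough illustration of what the 4 KV heads buy at the full 32768-token context, the sketch below estimates the KV cache size for a single sequence from the numbers in the summary. It assumes the cache is stored in bfloat16 (2 bytes per value) and ignores any framework overhead; the function name is hypothetical.

```python
# Values taken from the architecture summary above.
N_LAYERS = 32
N_HEADS = 16
N_KV_HEADS = 4
HEAD_DIM = 128
MAX_CONTEXT = 32768
BYTES_PER_VALUE = 2  # assumption: cache kept in bfloat16

def kv_cache_bytes(seq_len, n_kv_heads=N_KV_HEADS):
    """Bytes needed to cache K and V for one sequence across all layers."""
    # Factor 2 = one K tensor and one V tensor per layer.
    return 2 * N_LAYERS * n_kv_heads * HEAD_DIM * seq_len * BYTES_PER_VALUE

gqa_bytes = kv_cache_bytes(MAX_CONTEXT)                      # 4 KV heads, as in this model
mha_bytes = kv_cache_bytes(MAX_CONTEXT, n_kv_heads=N_HEADS)  # hypothetical 16 KV heads

print(f"GQA (4 KV heads):  {gqa_bytes / 2**30:.1f} GiB")   # 2.0 GiB
print(f"MHA (16 KV heads): {mha_bytes / 2**30:.1f} GiB")   # 8.0 GiB
```

Under these assumptions a full-length sequence needs about 2 GiB of cache on top of the weights, a quarter of what the same model would need with one KV head per attention head.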
Training stage
This checkpoint corresponds to:
long_context_curriculum/stage3_32k/long_context/final_export
Configuration:
{
  "model_type": "aretusa",
  "vocab_size": 65536,
  "d_model": 2048,
  "n_layers": 32,
  "n_heads": 16,
  "n_kv_heads": 4,
  "d_ff": 8192,
  "max_seq_len": 32768,
  "original_max_seq_len": 4096,
  "rope_base": 500000.0,
  "rope_scaling": {
    "type": "ntk",
    "factor": 8.0
  },
  "norm_eps": 1e-06,
  "dtype": "bfloat16"
}
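The "~2B parameters" figure can be sanity-checked from this configuration. The sketch below is a back-of-the-envelope count, assuming bias-free linear layers, a head dimension of 128 (from the architecture summary), and one RMSNorm gain vector per norm. Whether the input and output embeddings are tied is not stated in this card, so both cases are computed.

```python
cfg = {
    "vocab_size": 65536,
    "d_model": 2048,
    "n_layers": 32,
    "n_heads": 16,
    "n_kv_heads": 4,
    "d_ff": 8192,
}
HEAD_DIM = 128  # from the architecture summary

def count_params(cfg, head_dim=HEAD_DIM, tied_embeddings=True):
    """Approximate parameter count, assuming no bias terms anywhere."""
    d = cfg["d_model"]
    q_dim = cfg["n_heads"] * head_dim
    kv_dim = cfg["n_kv_heads"] * head_dim
    attn = d * q_dim + 2 * d * kv_dim + q_dim * d  # q, k, v, o projections
    mlp = 3 * d * cfg["d_ff"]                      # SwiGLU: gate, up, down
    norms = 2 * d + 2 * head_dim                   # two pre-norms + Q/K norm gains
    per_layer = attn + mlp + norms
    embed = cfg["vocab_size"] * d
    total = cfg["n_layers"] * per_layer + embed + d  # + final norm
    if not tied_embeddings:
        total += embed  # separate output projection
    return total

tied = count_params(cfg, tied_embeddings=True)
untied = count_params(cfg, tied_embeddings=False)
print(f"tied embeddings:   {tied / 1e9:.2f}B")    # 2.08B
print(f"untied embeddings: {untied / 1e9:.2f}B")  # 2.21B
```

Either way the estimate lands close to 2B, consistent with the model name.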
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo = "SerFabio89/aretusa-2b-pretrained-32k"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=False)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=False,
    dtype=torch.bfloat16,
    device_map="auto",
)

# The BOS token is written explicitly in the prompt, so the tokenizer
# is told not to add special tokens a second time.
prompt = "<|begin_of_text|>L'intelligenza artificiale è"
inputs = tok(prompt, return_tensors="pt", add_special_tokens=False)
# Drop token_type_ids if the tokenizer emits them; the model does not use them.
inputs.pop("token_type_ids", None)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode():
    out = model.generate(
        **inputs,
        max_new_tokens=80,
        do_sample=False,  # greedy decoding
        pad_token_id=tok.pad_token_id,
        eos_token_id=tok.eos_token_id,
    )

print(tok.decode(out[0], skip_special_tokens=False))
Important limitations
This is a pretrained base checkpoint, not a final assistant.
Expected limitations:
- Not aligned for chat
- Not optimized for instruction following
- May produce continuations rather than direct answers
- Requires SFT or another alignment stage for assistant use
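Since the checkpoint continues text rather than answering, one common workaround for evaluation is to phrase a task as a few-shot completion. The sketch below builds such a prompt as a plain string. The helper name and the "Q:" / "A:" labels are hypothetical choices for illustration, not a format the model was trained on; the BOS string is the one used in the usage example.

```python
BOS = "<|begin_of_text|>"  # BOS string as written in the usage example

def build_few_shot_prompt(examples, query, bos=BOS):
    """Format (question, answer) pairs plus a final open question as one prompt."""
    parts = [f"Q: {q}\nA: {a}\n" for q, a in examples]
    parts.append(f"Q: {query}\nA:")  # left open for the model to continue
    return bos + "\n".join(parts)

examples = [
    ("What is the capital of Italy?", "Rome"),
    ("What is the capital of Spain?", "Madrid"),
]
prompt = build_few_shot_prompt(examples, "What is the capital of France?")
print(prompt)
```

Such a prompt would be tokenized with add_special_tokens=False, as in the usage example, because BOS is already part of the string. Generation should be cut at the first newline, since a base model will typically keep producing further Q/A pairs.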
Conversion note
The checkpoint was converted from the original Aretusa format to native Qwen3ForCausalLM format. The conversion preserves QK Norm tensors and folds static NTK RoPE scaling into the effective RoPE theta used by the native Transformers configuration.
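To illustrate what folding static NTK scaling into theta can look like, the sketch below uses the widely used NTK-aware formula, theta' = theta * factor^(d / (d - 2)), with the rope_base and factor from the configuration and a head dimension of 128. This card does not state the exact formula the conversion used, so treat the resulting value as an assumption and check it against the rope_theta in the published config.

```python
import math

def ntk_scaled_theta(rope_base, factor, head_dim):
    """NTK-aware static scaling folded into a single effective RoPE base (assumed formula)."""
    return rope_base * factor ** (head_dim / (head_dim - 2))

def inv_freq(theta, i, head_dim):
    """Rotation frequency of the i-th dimension pair."""
    return theta ** (-2.0 * i / head_dim)

ROPE_BASE = 500000.0  # "rope_base" from the configuration
FACTOR = 8.0          # "rope_scaling.factor" from the configuration
HEAD_DIM = 128        # from the architecture summary

theta_eff = ntk_scaled_theta(ROPE_BASE, FACTOR, HEAD_DIM)
print(f"effective theta: {theta_eff:,.0f}")

# The lowest frequency is slowed down by exactly `factor`,
# while the highest frequency (i = 0) is left unchanged.
last = HEAD_DIM // 2 - 1
ratio_low = inv_freq(ROPE_BASE, last, HEAD_DIM) / inv_freq(theta_eff, last, HEAD_DIM)
ratio_high = inv_freq(ROPE_BASE, 0, HEAD_DIM) / inv_freq(theta_eff, 0, HEAD_DIM)
```

With this formula the factor of 8 matches the context extension from 4096 to 32768 tokens: the slowest-rotating dimensions are stretched 8x, while fast-rotating dimensions that encode local order are barely touched.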