LlamaEld-3.1-8B-Instruct

Model Overview

LlamaEld-3.1-8B-Instruct is a Portuguese-focused instruction-tuned large language model derived from Llama-3.1-8B-Instruct. It has been further fine-tuned using curated Portuguese-language text extracted and filtered from the ClueWeb22 dataset.

This model aims to improve performance on Portuguese understanding, instruction following, and generation tasks, particularly in real-world web text domains.


Model Details

  • Base Model: meta-llama/Llama-3.1-8B-Instruct
  • Architecture: Transformer (decoder-only)
  • Parameters: ~8 billion
  • Fine-tuning Type: Supervised fine-tuning (SFT)
  • Primary Language: Portuguese
  • License: Same as base model (Llama 3.1 license—verify compliance before use)

Training Data

The model was fine-tuned using:

  • Source: ClueWeb22
  • Subset: Portuguese-language documents
  • Processing:
    • Language filtering (Portuguese only)
    • Quality filtering and cleaning
    • Removal of noisy or low-quality web content

Data Characteristics

ClueWeb22 is a large-scale web crawl dataset, meaning:

  • It contains diverse and heterogeneous content
  • It may include biases, inaccuracies, and outdated information
  • Despite filtering, some noise may persist

Intended Use

Suitable for:

  • Portuguese text generation and completion
  • Instruction following in Portuguese
  • Chatbots and conversational AI
  • Summarization and rewriting tasks
  • Question answering (general domain)

Not Recommended for:

  • High-stakes decision-making (legal, medical, financial)
  • Tasks requiring guaranteed factual accuracy
  • Sensitive or regulated applications without additional safeguards

Usage

Example (Transformers)

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "ai-eldorado/LlamaEld-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

prompt = "Explique o impacto da inteligência artificial na educação."
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Evaluation

No standardized benchmark results are currently provided.

However, expected improvements include:

  • Better fluency in Portuguese
  • Stronger adherence to instructions in Portuguese
  • Improved handling of web-style content and vocabulary

For production use, we recommend performing task-specific evaluation.


Limitations

  • May generate incorrect or hallucinated information
  • Sensitive to prompt phrasing
  • Bias inherited from web data (ClueWeb22)
  • Performance in non-Portuguese languages may be degraded
  • Not optimized for reasoning-heavy tasks compared to larger models

Safety and Bias

Because the model is trained on web data:

  • It may reproduce harmful stereotypes or biases
  • It may generate inappropriate or misleading content

Recommended Mitigations:

  • Add moderation layers
  • Use prompt engineering for safer outputs
  • Apply human-in-the-loop validation in critical systems
Downloads last month
17
Safetensors
Model size
8B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ai-eldorado/LlamaEld-3.1-8B-Instruct

Finetuned
(2787)
this model