---
language:
- pt
license: mit
tags:
- bert
- finance
- sentiment-analysis
- portuguese
- financial-news
- text-classification
base_model: lucas-leme/FinBERT-PT-BR
pipeline_tag: text-classification
---

# PT-BR Financial Sentiment Analysis

A fine-tuned BERT model for **sentiment classification of Brazilian Portuguese financial news**. Given a news headline or short text about the Brazilian financial market, the model classifies it as **POSITIVE**, **NEGATIVE**, or **NEUTRAL**.

This model was developed as part of an undergraduate thesis (TCC) analysing sentiment trends in the Brazilian financial market from 2016 to 2025.

---

## Base Model

This model is a fine-tuned version of [lucas-leme/FinBERT-PT-BR](https://huggingface.co/lucas-leme/FinBERT-PT-BR), a BERT model pre-trained on Brazilian Portuguese financial texts.

---

## Labels

| ID | Label    | Description                                        |
|----|----------|----------------------------------------------------|
| 0  | POSITIVE | News with a positive financial outlook or outcome  |
| 1  | NEGATIVE | News with a negative financial outlook or outcome  |
| 2  | NEUTRAL  | News that is neither clearly positive nor negative |

---

## Training Details

- **Architecture**: `BertForSequenceClassification` (12 layers, 768 hidden, 12 attention heads)
- **Loss function**: Label-smoothed cross-entropy (`label_smoothing=0.1`)
- **Epochs**: 4
- **Learning rate**: 7e-6
- **Weight decay**: 0.03
- **Class weighting**: Square-root balanced (to handle class imbalance)
- **Post-hoc calibration**: Additive logit bias per class (`POSITIVE: -0.65`, `NEGATIVE: -0.20`, `NEUTRAL: 0.00`)
- **Ensemble**: 2-seed ensemble (seeds 789 and 123) used during hyperparameter selection

### Dataset

- **Total labeled examples**: 629 Brazilian financial news items (headlines and short summaries)
- **Training split**: 402 examples
- **Calibration split**: 101 examples (used for post-hoc bias calibration)
- **Holdout split**: 126 examples (stratified 20%, seed=2026; never seen during training or calibration)

---

## Evaluation

Evaluated on a stratified holdout of **126 examples**:

| Model                   | Accuracy  | Macro F1  |
|-------------------------|-----------|-----------|
| Base (`FinBERT-PT-BR`)  | 34.1%     | 0.331     |
| Fine-tuned (this model) | **64.3%** | **0.643** |

The fine-tuned model achieves roughly **+30 pp accuracy** and **+0.31 macro F1** over the base model on this domain-specific holdout.

---

## Usage

```python
from transformers import AutoTokenizer, BertForSequenceClassification
import torch

model_id = "lucasalmda/pt-br-financial-sentimental-analysis"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = BertForSequenceClassification.from_pretrained(model_id)
model.eval()

id2label = {0: "POSITIVE", 1: "NEGATIVE", 2: "NEUTRAL"}

# Optional: apply the same logit biases used during calibration
BIASES = {"POSITIVE": -0.65, "NEGATIVE": -0.20, "NEUTRAL": 0.00}
bias_tensor = torch.tensor([BIASES["POSITIVE"], BIASES["NEGATIVE"], BIASES["NEUTRAL"]])

text = "Ibovespa fecha em alta com expectativa de corte na taxa Selic"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits

calibrated_logits = logits + bias_tensor
pred = calibrated_logits.argmax(dim=-1).item()
print(id2label[pred])  # e.g. "POSITIVE"
```

---

## Limitations

- Trained on a relatively small labeled dataset (629 examples), so performance on edge cases may vary.
- Optimised for **Brazilian Portuguese** financial news. It is not suited for general-purpose sentiment analysis or other languages.
- The post-hoc calibration biases were selected on a held-out calibration split and may not generalise perfectly to all domains within Brazilian finance.
- Lexically ambiguous headlines (e.g. "Selic cai" combined with negative macro context) remain the most common error pattern.

---

## Citation

If you use this model, please cite the base model:

```
lucas-leme/FinBERT-PT-BR: https://huggingface.co/lucas-leme/FinBERT-PT-BR
```
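
---

## Calibration Example

The post-hoc calibration listed under Training Details adds a fixed per-class bias to the logits before the argmax. The sketch below shows, in plain Python and without downloading the model, how those biases can change a borderline prediction. The bias values and label order come from this card; the example logits and the helper names (`softmax`, `predict`) are hypothetical and chosen only for illustration.

```python
import math

# Label order and bias values as listed in this card.
LABELS = ["POSITIVE", "NEGATIVE", "NEUTRAL"]
BIASES = [-0.65, -0.20, 0.00]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits, biases=None):
    """Return (label, probabilities), optionally adding per-class biases first."""
    if biases is not None:
        logits = [x + b for x, b in zip(logits, biases)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs


# Hypothetical logits for a weakly positive headline (not real model output).
example_logits = [1.10, 0.20, 0.70]

raw_label, raw_probs = predict(example_logits)
cal_label, cal_probs = predict(example_logits, BIASES)

print(raw_label)  # POSITIVE
print(cal_label)  # NEUTRAL
```

Because the POSITIVE logit receives the largest penalty (-0.65), a headline that is only weakly positive can fall back to NEUTRAL once the biases are applied, while a prediction with a wider margin keeps its label.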