PKD-ALBERT: Lightweight Financial Sentiment Classifier via Patient Knowledge Distillation

Model Summary

PKD-ALBERT is a lightweight financial sentiment classifier distilled from ProsusAI/finbert using a two-stage Patient Knowledge Distillation (PKD) pipeline. It classifies financial text (headlines, earnings excerpts, news snippets) into positive, neutral, or negative sentiment. On the Financial PhraseBank test set it reaches 97.4% accuracy and 0.965 Macro-F1 with about 9.4× fewer parameters than the teacher model (11.7M vs. 109.5M).

| Metric | Teacher (FinBERT) | Student (PKD-ALBERT) |
|---|---|---|
| Parameters | 109.5M | 11.7M |
| Model size | 417.7 MB | 44.6 MB |
| Test accuracy | 97.6% | 97.4% |
| Macro F1 | 0.9696 | 0.9650 |
| Inference (ms/doc) | 1.49 | 2.02 |
| Accuracy drop | n/a | −0.2 pp |

> The student retains 99.8% of teacher accuracy at 89% less disk space.
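
The headline ratios can be re-derived from the table. The check below is a small sketch with the figures copied from the table; the variable names are illustrative only.

```python
# Figures copied from the summary table; names are illustrative.
teacher = {"params_m": 109.5, "size_mb": 417.7, "accuracy": 97.6, "latency_ms": 1.49}
student = {"params_m": 11.7, "size_mb": 44.6, "accuracy": 97.4, "latency_ms": 2.02}

param_ratio = teacher["params_m"] / student["params_m"]
accuracy_retained = 100 * student["accuracy"] / teacher["accuracy"]
disk_saving = 100 * (1 - student["size_mb"] / teacher["size_mb"])
latency_ratio = student["latency_ms"] / teacher["latency_ms"]

print(f"parameter ratio:    {param_ratio:.1f}x fewer")
print(f"accuracy retained:  {accuracy_retained:.1f}%")
print(f"disk space saved:   {disk_saving:.1f}%")
print(f"latency vs teacher: {latency_ratio:.2f}x")
```

On these figures the student is much smaller but not faster: its reported per-document latency is about 1.36× the teacher's.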

Try It Live

A live Gradio demo is available on Hugging Face Spaces. Paste any financial headline or sentence and receive a sentiment label with a confidence score.

Example inputs from the held-out test set:

| Input | Prediction | Confidence |
|---|---|---|
| "Charles Schwab price target raised to $121 from $119 at JPMorgan." | ✅ Positive | 93.1% |
| "Costco assumed with a Peer Perform at Wolfe Research." | ➖ Neutral | 92.2% |
| "IBM Explains How AI Models Are Making a Familiar Human Mistake." | ❌ Negative | 90.1% |

API Usage

The /predict endpoint accepts raw text and returns a label and confidence score.

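```python
import requests

API_URL = "https://hadangvu-pkd-sentiment-api.hf.space/predict"

# Send one headline; the endpoint expects a JSON body with a "text" field.
response = requests.post(
    API_URL,
    json={"text": "Charles Schwab price target raised to $121 from $119 at JPMorgan."},
    timeout=30,
)
response.raise_for_status()  # fail loudly on HTTP errors

print(response.json())
# {
#   "label": "positive",
#   "confidence": 0.93,
#   "latency_ms": 79.3
# }
```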

Input: { "text": "..." } — any financial sentence or headline (max 128 tokens)

Output: { "label": str, "confidence": float, "latency_ms": float }
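
Client code may want to check a response before using it. The helper below is a minimal sketch and not part of the API: `parse_prediction` is a hypothetical name, and the lowercase label strings `neutral` and `negative` are assumed by analogy with the `positive` shown in the example response.

```python
from typing import Any, Dict, Tuple

# Assumed label vocabulary: only "positive" appears in the documented example.
EXPECTED_LABELS = {"positive", "neutral", "negative"}


def parse_prediction(payload: Dict[str, Any]) -> Tuple[str, float]:
    """Validate a /predict response body and return (label, confidence)."""
    label = payload.get("label")
    confidence = payload.get("confidence")
    if label not in EXPECTED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError(f"confidence is not a number: {confidence!r}")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return label, float(confidence)


# Example using the response body from the snippet above.
example = {"label": "positive", "confidence": 0.93, "latency_ms": 79.3}
print(parse_prediction(example))
```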

Distillation Approach

This model was trained using a two-stage Patient Knowledge Distillation strategy.

Stage 1: Distillation on Pseudo-Labeled Financial News

The student (ALBERT-base) was first trained on a corpus of 7,982 scraped financial news items pseudo-labeled by the FinBERT teacher. The loss combines hard-label cross-entropy, temperature-softened KL divergence against the teacher's outputs, and intermediate layer alignment.

| Parameter | Value |
|---|---|
| Dataset | Scraped financial news (pseudo-labeled by FinBERT) |
| Train / Val / Test split | 5,587 / 1,197 / 1,198 |
| Epochs | 3 |
| Batch size | 32 |
| Optimizer | AdamW |
| Learning rate | 2e-5 |
| KD temperature sweep | [2, 5, 9] |
| Alpha (KD loss weight) | 0.3 |
| PKD beta | 0.02 |
| PKD student layers | [2, 4, 8, 12] |
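
To illustrate what the temperature sweep changes, the sketch below softens one logit vector at each swept temperature. The logits are invented for illustration; only the temperatures 2, 5, and 9 come from the table.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical teacher logits for (positive, neutral, negative).
teacher_logits = [4.0, 1.0, -2.0]

for temperature in (1, 2, 5, 9):  # 1 is the unsoftened baseline
    probs = softmax_with_temperature(teacher_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Higher temperatures flatten the distribution, so the student sees more of the teacher's relative preference among the classes it did not pick.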

Stage 2: Fine-tuning on Financial PhraseBank

The distilled student was then fine-tuned for one epoch on the Financial PhraseBank subset with 100% annotator agreement, using standard cross-entropy loss against the gold labels.

| Parameter | Value |
|---|---|
| Dataset | Financial PhraseBank (100% agreement) |
| Train / Val / Test split | 1,584 / 340 / 340 |
| Epochs | 1 |
| Loss | Cross-entropy |
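
The source does not say how the splits were produced, but the counts for both stages are consistent with a 70/15/15 split made in two steps (hold out 30%, then halve the held-out part), rounding the held-out sizes up. The sketch below is an assumption that happens to reproduce the reported counts, not the author's code; `split_sizes` is a hypothetical helper.

```python
import math
from typing import Tuple


def split_sizes(n: int, holdout: float = 0.30) -> Tuple[int, int, int]:
    """Return (train, val, test) sizes for a two-step 70/15/15 split."""
    n_holdout = math.ceil(n * holdout)   # validation + test together
    n_test = math.ceil(n_holdout * 0.5)  # half of the held-out part, rounded up
    n_val = n_holdout - n_test
    return n - n_holdout, n_val, n_test


print(split_sizes(5587 + 1197 + 1198))  # Stage 1 corpus, 7,982 items
print(split_sizes(1584 + 340 + 340))    # Stage 2 corpus, 2,264 items
```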

Loss Function

The Stage 1 total loss combines three terms:

- Soft-label KL divergence between the temperature-softened teacher and student output distributions
- Patient KD alignment between intermediate ALBERT and FinBERT hidden layers, weighted by beta
- Hard-label cross-entropy, with alpha setting the balance between this term and the soft KD term
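
A framework-free sketch of how these terms could be combined for a single example follows. The weighting form `(1 - alpha) * CE + alpha * KL + beta * PT` and the alignment term on L2-normalized hidden vectors follow the general recipe in Sun et al. (2019) and are assumptions here; the source gives only alpha = 0.3 and beta = 0.02. The logits, hidden vectors, function names, and the choice of temperature 2 are illustrative, and the actual training loop was written in PyTorch.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits: Sequence[float], label: int) -> float:
    """Hard-label cross-entropy for one example."""
    return -math.log(softmax(student_logits)[label])


def soft_kl(teacher_logits, student_logits, temperature: float) -> float:
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def unit(vector: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def patient_loss(student_hidden, teacher_hidden) -> float:
    """Sum over aligned layers of squared distance between L2-normalized vectors."""
    total = 0.0
    for hs, ht in zip(student_hidden, teacher_hidden):
        total += sum((a - b) ** 2 for a, b in zip(unit(hs), unit(ht)))
    return total


def stage1_loss(student_logits, teacher_logits, label, student_hidden,
                teacher_hidden, alpha=0.3, beta=0.02, temperature=2.0) -> float:
    ce = cross_entropy(student_logits, label)
    kd = soft_kl(teacher_logits, student_logits, temperature)
    pt = patient_loss(student_hidden, teacher_hidden)
    return (1 - alpha) * ce + alpha * kd + beta * pt


# Invented values: 3-class logits and two aligned layers of 3-dim hidden states.
teacher_logits = [3.0, 0.5, -1.0]
student_logits = [2.0, 1.0, -0.5]
student_hidden = [[0.2, 0.1, -0.3], [0.5, -0.4, 0.1]]
teacher_hidden = [[0.3, 0.0, -0.2], [0.4, -0.5, 0.2]]

loss = stage1_loss(student_logits, teacher_logits, 0, student_hidden, teacher_hidden)
print(round(loss, 4))
```

When the student matches the teacher exactly, the KL and alignment terms vanish and only the scaled cross-entropy remains.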

Full Performance Comparison

The table below compares all training strategies evaluated against the same Financial PhraseBank test set (340 samples):

| Model | Params | Size | Test Acc | Macro F1 | KL (teacher→student) |
|---|---|---|---|---|---|
| Teacher FinBERT | 109.5M | 417.7 MB | 97.6% | 0.9696 | n/a |
| Fresh → FP (baseline) | 11.7M | 44.6 MB | 77.1% | 0.6126 | 0.359 |
| CE-scraped → FP | 11.7M | 44.6 MB | 95.9% | 0.9392 | 0.230 |
| KD-scraped → FP | 11.7M | 44.6 MB | 96.5% | 0.9514 | 0.156 |
| PKD-scraped → FP (ours) | 11.7M | 44.6 MB | 97.4% | 0.9650 | 0.188 |

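Key takeaway: Patient KD achieves the highest test accuracy and Macro F1 among the student variants, landing 0.2 percentage points below the teacher in accuracy and 0.0046 below in Macro F1. This suggests that intermediate layer alignment improves on standard KD for this task. Two caveats apply: the KD-only variant matches the teacher's output distribution more closely (KL 0.156 vs. 0.188), and on a 340-sample test set the 0.9-point accuracy gap between KD and PKD corresponds to about three examples.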
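
Macro F1 is the unweighted mean of the per-class F1 scores, so the minority classes count as much as the neutral majority. The sketch below uses made-up labels, not the evaluation data, to show how accuracy and Macro F1 can diverge.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores: List[float] = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


labels = ["positive", "neutral", "negative"]
# Invented example: the single negative item is misread as neutral.
y_true = ["neutral", "neutral", "neutral", "neutral", "positive", "negative"]
y_pred = ["neutral", "neutral", "neutral", "neutral", "positive", "neutral"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(accuracy, 3), round(macro_f1(y_true, y_pred, labels), 3))
```

A gap like the baseline's (77.1% accuracy, 0.6126 Macro F1) is the pattern one would expect when a model does poorly on the less frequent classes, though the table alone does not show the per-class breakdown.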

Intended Use

Designed for

- Classifying sentiment in financial news headlines and short excerpts
- Lightweight inference in resource-constrained environments (edge, serverless)
- Research into compact models for financial NLP tasks
- Downstream integration into financial analytics pipelines

Not designed for

- Long-form financial documents (more than 128 tokens per segment; chunk first)
- Non-financial general text (the model is domain-specialized)
- High-stakes trading decisions without additional validation

Limitations

- Domain-specific: Trained exclusively on financial text. Performance on general-domain sentiment is likely to degrade.
- Sequence length cap: Maximum 128 tokens per input. Longer documents should be chunked by sentence.
- Label distribution: Financial PhraseBank skews neutral-heavy. Rare strongly negative or positive samples may receive lower confidence.
- English only: Both training datasets are English. No multilingual support.
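
For inputs longer than the 128-token cap, one option is to split on sentence boundaries and pack whole sentences into chunks under a budget. The sketch below rests on several assumptions: `chunk_by_sentence` is a hypothetical helper, the regex splitter is naive, and whitespace-separated words stand in for tokens. A subword tokenizer usually yields more tokens than words, so the default budget is set well below 128.

```python
import re
from typing import List


def chunk_by_sentence(text: str, max_words: int = 80) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_words words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        # A single sentence longer than the budget still becomes its own chunk.
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


report = (
    "Revenue rose 12% year over year. Operating margin narrowed slightly. "
    "Management reaffirmed full-year guidance."
)
for chunk in chunk_by_sentence(report, max_words=8):
    print(chunk)
```

Each chunk can then be classified separately; how to aggregate per-chunk labels into a document-level label is left open here.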

Training Framework

- PyTorch with a custom training loop for KD and PKD loss computation
- Hugging Face Transformers for model loading, tokenization, and checkpointing
- Teacher model frozen throughout distillation; only student weights updated

Citation

If you use this model or the distillation pipeline in your work, please cite:

```bibtex
@misc{pkd-albert-finbert,
  author    = {Ha Dang Vu},
  title     = {PKD-ALBERT: Lightweight Financial Sentiment via Patient Knowledge Distillation},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/hadangvu/pkd-albert-student}
}
```

Related Work

- ProsusAI/finbert: teacher model
- Patient Knowledge Distillation for BERT Model Compression (Sun et al., 2019; arXiv:1908.09355): PKD method
- Financial PhraseBank (Malo et al., 2014): fine-tuning and evaluation dataset

Built as part of a financial NLP research project exploring efficient model compression for domain-specific sentiment analysis.
