birgermoell/saga-swedish-health-linear-probe
This repository is an example model card for Assignment 1 in the 5LN712 Information Retrieval course. It shows one complete submission pattern: define a domain problem, create a custom dataset, encode text with embeddings, train classifiers, evaluate them on a held-out split, and deploy a working Hugging Face demo.
Assignment fit
- Domain issue: health-information retrieval systems often need to route short texts to an appropriate source family before search, filtering, or question answering.
- Embedding challenge: all labels discuss health, but they differ in audience, vocabulary, genre, and institutional purpose.
- Embedding model: nicher92/saga-embed_v1.
- Classifier: regularized logistic regression trained on frozen embeddings.
- Deliverables represented here: custom dataset, trained model, metrics, model card, and Gradio Space.
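The split between a frozen embedding model and a small trained classifier can be sketched in a few lines. This is an illustration under stated assumptions, not the repository's training code: `toy_embedding` is a hypothetical stand-in for the real encoder (nicher92/saga-embed_v1) so the example runs without downloads, the hyperparameters (`l2`, `lr`, `epochs`) are invented for the sketch, and the regularized logistic regression is written out by hand in plain Python instead of using the scikit-learn classifier stored in this repository.

```python
import math
import random

LABELS = ["1177.se", "socialstyrelsen.se", "lakemedelsverket.se"]


def toy_embedding(label_index, rng, dim=4, noise=0.1):
    """Hypothetical stand-in for a frozen encoder: a noisy point near a per-class center."""
    return [(1.0 if j == label_index else 0.0) + rng.gauss(0.0, noise)
            for j in range(dim)]


def predict_proba(W, b, x):
    """Softmax over one linear score per class."""
    scores = [sum(w * xi for w, xi in zip(row, x)) + bias
              for row, bias in zip(W, b)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(X, y, n_classes, l2=0.01, lr=0.5, epochs=200):
    """Multinomial logistic regression with an L2 penalty, fitted by
    full-batch gradient descent. Only W and b change; the vectors in X stay fixed."""
    dim, n = len(X[0]), len(X)
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        grad_W = [[l2 * w for w in row] for row in W]  # gradient of the L2 term
        grad_b = [0.0] * n_classes
        for x, target in zip(X, y):
            probs = predict_proba(W, b, x)
            for k in range(n_classes):
                err = (probs[k] - (1.0 if k == target else 0.0)) / n
                grad_b[k] += err
                for j in range(dim):
                    grad_W[k][j] += err * x[j]
        for k in range(n_classes):
            b[k] -= lr * grad_b[k]
            for j in range(dim):
                W[k][j] -= lr * grad_W[k][j]
    return W, b


def predict(W, b, x):
    """Return the label with the highest predicted probability."""
    probs = predict_proba(W, b, x)
    return LABELS[probs.index(max(probs))]


# Toy training set: 8 stand-in vectors per source family.
rng = random.Random(712)
y_train = [k for k in range(len(LABELS)) for _ in range(8)]
X_train = [toy_embedding(k, rng) for k in y_train]
W, b = train(X_train, y_train, n_classes=len(LABELS))

for k, label in enumerate(LABELS):
    print(label, "->", predict(W, b, toy_embedding(k, rng)))
```

The point of the design is that the encoder is never updated: only the small weight matrix `W` and bias `b` are learned, which is why this kind of linear probe is cheap to train and easy to inspect.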
Current example project
The example task is Swedish Health Source Triage. Given a short health-information text, the system predicts the most likely source family:
- 1177.se
- socialstyrelsen.se
- lakemedelsverket.se
The labels are intentionally source-oriented rather than diagnosis-oriented. That keeps the project aligned with Information Retrieval: routing, source selection, filtering, collection analysis, and evaluation.
Links
- Dataset: https://huggingface.co/datasets/birgermoell/swedish-health-source-triage
- Demo Space: https://huggingface.co/spaces/birgermoell/swedish-health-source-triage-demo
- Embedding model: https://huggingface.co/nicher92/saga-embed_v1
Files in this model repository
- model.joblib: fitted scikit-learn classifier.
- resolved_config.yaml: resolved training configuration.
- metrics.json: machine-readable evaluation metrics.
- predictions.csv: held-out predictions for inspection.
- report.md: short generated training report.
- embedding_classifier_pipeline_explainer.pdf: student-facing explanation of how embeddings and the classifier work together.
Evaluation
- Test size: 0.25
- Random seed: 712
- Accuracy: 1.0
- Macro F1: 1.0
- Weighted F1: 1.0
The current dataset is small and deliberately clean, so high scores should be read as a successful pipeline demonstration rather than proof of a robust medical or public-sector classifier. A stronger student submission should expand the dataset, include harder negative examples, and discuss errors.
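To make the three reported scores concrete, the sketch below computes accuracy, macro F1, and weighted F1 from a list of true and predicted labels. The label lists are invented for illustration and are not taken from predictions.csv; the function names are hypothetical. On a harder dataset the three numbers diverge, which is where error discussion becomes informative.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """F1 for each label, computed from true/false positives and false negatives."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def evaluate(y_true, y_pred):
    """Accuracy, macro F1 (plain mean over classes), and weighted F1
    (mean over classes weighted by the number of true examples)."""
    per_class = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    n = len(y_true)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / n,
        "macro_f1": sum(per_class.values()) / len(per_class),
        "weighted_f1": sum(per_class[c] * support[c] for c in per_class) / n,
    }


# Invented example with two mistakes, to show the metrics diverging.
A, B, C = "1177.se", "socialstyrelsen.se", "lakemedelsverket.se"
y_true = [A, A, A, A, B, B, C, C]
y_pred = [A, A, A, B, B, B, C, A]
print(evaluate(y_true, y_pred))
```

Macro F1 treats every source family equally, while weighted F1 follows the class sizes, so a gap between them usually points at a small class that is being misclassified.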
Reproduce the run
saga-ir inspect --config configs/assignment1_swedish_health.yaml
saga-ir linear-probe --config configs/assignment1_swedish_health.yaml --batch-size 8 --device cpu
saga-ir prototypes --config configs/assignment1_swedish_health.yaml --shots-per-source 4 --batch-size 8 --device cpu
saga-ir one-vs-rest --config configs/assignment1_swedish_health.yaml \
--target-sources 1177.se \
--target-sources socialstyrelsen.se \
--target-sources lakemedelsverket.se \
--batch-size 8 --device cpu
These runs are designed to work on a 32 GB MacBook. The frozen-embedding methods are the recommended baseline because they are fast, inspectable, and easy to explain.
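As one illustration of a frozen-embedding baseline, the sketch below classifies a query by its nearest class prototype. It assumes that a prototype is the mean of a few embedded examples per source and that similarity is cosine; this card does not spell out how the `prototypes` command works internally, so treat this as a generic nearest-centroid example and not a description of `saga-ir`. The three-dimensional vectors are made up and stand in for real embeddings, and all function names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    return [sum(v[j] for v in vectors) / len(vectors)
            for j in range(len(vectors[0]))]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def build_prototypes(shots_by_source):
    """One prototype per source: the mean of its few example vectors."""
    return {src: mean_vector(vecs) for src, vecs in shots_by_source.items()}


def classify(query, prototypes):
    """Pick the source whose prototype is most similar to the query."""
    return max(prototypes, key=lambda src: cosine(query, prototypes[src]))


# Made-up vectors, four "shots" per source (assumption for illustration).
shots = {
    "1177.se": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [1.0, 0.0, 0.1], [0.9, 0.0, 0.2]],
    "socialstyrelsen.se": [[0.1, 0.9, 0.0], [0.2, 0.8, 0.1], [0.0, 1.0, 0.1], [0.1, 0.9, 0.2]],
    "lakemedelsverket.se": [[0.0, 0.1, 0.9], [0.1, 0.2, 0.8], [0.1, 0.0, 1.0], [0.2, 0.1, 0.9]],
}
prototypes = build_prototypes(shots)
print(classify([0.7, 0.2, 0.1], prototypes))
print(classify([0.1, 0.1, 0.8], prototypes))
```

A baseline like this has no trained parameters at all, which makes it a useful sanity check next to the linear probe: if the prototypes already separate the sources, the embedding space is doing most of the work.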
What students should copy from this example
- State the problem before training the model.
- Separate the embedding model from the classifier trained for the assignment.
- Publish a custom dataset, not only code.
- Save metrics and prediction examples.
- Link the dataset, model, Space, GitHub repo, and report.
- Explain limitations honestly, especially when a dataset is small.
- Reflect on what AI tools helped with and what was manually checked.
Downloadable student explainer
A longer PDF explanation is included in this repository: embedding_classifier_pipeline_explainer.pdf.
Demo inputs to try
- Patient guidance: fever, back pain, pollen allergy, sleep problems.
- Authority/statistics: national indicators, regional comparisons, guidelines.
- Medicine regulation: recalls, product information updates, safety warnings.
Intended use and limits
This model is for teaching embeddings-based text classification in an Information Retrieval course. It is not medical advice, not a clinical triage system, and not a replacement for human review of health information.
Model tree for birgermoell/saga-swedish-health-linear-probe
Base model
answerdotai/ModernBERT-base