birgermoell/saga-swedish-health-linear-probe

This repository is an example model card for Assignment 1 in the 5LN712 Information Retrieval course. It shows one complete submission pattern: define a domain problem, create a custom dataset, encode text with embeddings, train classifiers, evaluate them on a held-out split, and deploy a working Hugging Face demo.

Assignment fit

  • Domain issue: health-information retrieval systems often need to route short texts to an appropriate source family before search, filtering, or question answering.
  • Embedding challenge: all labels discuss health, but they differ in audience, vocabulary, genre, and institutional purpose.
  • Embedding model: nicher92/saga-embed_v1.
  • Classifier: regularized logistic regression trained on frozen embeddings.
  • Deliverables represented here: custom dataset, trained model, metrics, model card, and Gradio Space.
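The embedding-plus-classifier split listed above can be sketched as follows. This is a minimal illustration, not the repository's training code: the helper names `fake_embeddings` and `train_linear_probe` are hypothetical, and random Gaussian clusters stand in for the vectors that nicher92/saga-embed_v1 would produce, so the block runs without downloading the encoder. It assumes scikit-learn's `LogisticRegression`, whose `C` parameter controls the strength of the L2 regularization.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

SOURCES = ["1177.se", "socialstyrelsen.se", "lakemedelsverket.se"]

def fake_embeddings(n_per_class, dim=16, seed=712):
    """Stand-in for frozen encoder output: one Gaussian cluster per source.

    ASSUMPTION: real runs would replace this with vectors from the
    embedding model; the dimension and cluster layout here are made up.
    """
    rng = np.random.default_rng(seed)
    centers = rng.normal(size=(len(SOURCES), dim)) * 3.0
    X = np.vstack([c + rng.normal(size=(n_per_class, dim)) for c in centers])
    y = np.repeat(np.arange(len(SOURCES)), n_per_class)
    return X, y

def train_linear_probe(X, y, C=1.0):
    """Fit a regularized logistic regression on fixed embedding vectors."""
    # The encoder is never updated; only these classifier weights are learned.
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(X, y)
    return clf

X, y = fake_embeddings(n_per_class=20)
probe = train_linear_probe(X, y)

# Map integer class ids back to source-family names.
predicted = [SOURCES[i] for i in probe.predict(X[:3])]
```

Because the encoder stays frozen, the only trained parameters are one weight vector and one bias per source family, which is what makes the probe fast to fit and easy to inspect.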

Current example project

The example task is Swedish Health Source Triage. Given a short health-information text, the system predicts the most likely source family:

  • 1177.se
  • socialstyrelsen.se
  • lakemedelsverket.se

The labels are intentionally source-oriented rather than diagnosis-oriented. That keeps the project aligned with Information Retrieval: routing, source selection, filtering, collection analysis, and evaluation.

Links

Files in this model repository

  • model.joblib: fitted scikit-learn classifier.
  • resolved_config.yaml: resolved training configuration.
  • metrics.json: machine-readable evaluation metrics.
  • predictions.csv: held-out predictions for inspection.
  • report.md: short generated training report.
  • embedding_classifier_pipeline_explainer.pdf: student-facing explanation of how embeddings and the classifier work together.
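For inspecting a local copy of these files, a small reader along the following lines may help. It is a sketch under assumptions: `load_run_artifacts` is a hypothetical helper, and because the key names in metrics.json and the column names in predictions.csv are not documented here, the code returns whatever it finds rather than expecting a particular schema.

```python
import csv
import json
from pathlib import Path

def load_run_artifacts(repo_dir):
    """Read metrics.json and predictions.csv from a local copy of the repo.

    ASSUMPTION: both files sit directly in `repo_dir`. No specific keys or
    columns are assumed; the parsed contents are returned as-is.
    """
    repo_dir = Path(repo_dir)
    with open(repo_dir / "metrics.json", encoding="utf-8") as f:
        metrics = json.load(f)
    with open(repo_dir / "predictions.csv", newline="", encoding="utf-8") as f:
        # One dict per held-out example, keyed by the CSV header row.
        predictions = list(csv.DictReader(f))
    return metrics, predictions

# model.joblib can be loaded separately with joblib.load(...), provided
# joblib and scikit-learn are installed. Joblib files are pickle-based,
# so only load them from sources you trust.
```

Calling `load_run_artifacts(".")` from inside a downloaded copy of the repository would return the metrics and the list of prediction rows for manual error analysis.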

Evaluation

  • Test size: 0.25 (fraction of the dataset held out for evaluation)
  • Random seed: 712
  • Accuracy: 1.0
  • Macro F1: 1.0
  • Weighted F1: 1.0

The current dataset is small and deliberately clean, so the perfect scores should be read as a successful pipeline demonstration rather than proof of a robust medical or public-sector classifier. A stronger student submission should expand the dataset, include harder negative examples, and discuss errors.
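To make the three reported metrics concrete, the sketch below computes them from scratch on a toy example. The function `f1_report` and the six toy labels are illustrative assumptions, not the repository's evaluation code or data. Macro F1 averages the per-class F1 scores equally, while weighted F1 weights each class by its number of true examples.

```python
from collections import Counter

def f1_report(y_true, y_pred):
    """Accuracy, macro F1, and weighted F1 for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true examples per class
    pairs = list(zip(y_true, y_pred))
    per_class = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        per_class[label] = 2 * tp / denom if denom else 0.0
    n = len(y_true)
    return {
        "accuracy": sum(t == p for t, p in pairs) / n,
        "macro_f1": sum(per_class.values()) / len(labels),
        "weighted_f1": sum(per_class[l] * support[l] for l in labels) / n,
    }

# Toy labels (made up for illustration): one 1177.se text is misrouted.
y_true = ["1177.se"] * 3 + ["socialstyrelsen.se"] * 2 + ["lakemedelsverket.se"]
y_pred = ["1177.se", "1177.se", "socialstyrelsen.se",
          "socialstyrelsen.se", "socialstyrelsen.se", "lakemedelsverket.se"]
report = f1_report(y_true, y_pred)
```

With a single error the three numbers already diverge, which is why a harder dataset gives more to discuss than a run where all three equal 1.0.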

Reproduce the run

saga-ir inspect --config configs/assignment1_swedish_health.yaml
saga-ir linear-probe --config configs/assignment1_swedish_health.yaml --batch-size 8 --device cpu
saga-ir prototypes --config configs/assignment1_swedish_health.yaml --shots-per-source 4 --batch-size 8 --device cpu
saga-ir one-vs-rest --config configs/assignment1_swedish_health.yaml \
  --target-sources 1177.se \
  --target-sources socialstyrelsen.se \
  --target-sources lakemedelsverket.se \
  --batch-size 8 --device cpu

These runs are designed to work on a 32 GB MacBook. The frozen-embedding methods are the recommended baseline because they are fast, inspectable, and easy to explain.
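One common way to build a few-shot prototype baseline on frozen embeddings is nearest-prototype classification, sketched below. This is an assumption about the general technique, not a description of what `saga-ir prototypes` does internally: the functions `build_prototypes` and `predict_source`, the use of cosine similarity, and the tiny 3-dimensional toy vectors are all illustrative.

```python
import numpy as np

def build_prototypes(embeddings, labels, shots_per_source=4):
    """Average up to `shots_per_source` embeddings per label, then normalize."""
    labels = np.asarray(labels)
    prototypes = {}
    for label in sorted(set(labels.tolist())):
        shots = embeddings[labels == label][:shots_per_source]
        proto = shots.mean(axis=0)
        prototypes[label] = proto / np.linalg.norm(proto)
    return prototypes

def predict_source(embedding, prototypes):
    """Return the label whose prototype has the highest cosine similarity."""
    query = embedding / np.linalg.norm(embedding)
    return max(prototypes, key=lambda label: float(query @ prototypes[label]))

# Toy 3-dimensional "embeddings" (made up), two labelled shots per source.
toy_embeddings = np.array([
    [1.0, 0.1, 0.0], [0.9, 0.0, 0.1],   # 1177.se
    [0.0, 1.0, 0.1], [0.1, 0.9, 0.0],   # socialstyrelsen.se
    [0.0, 0.1, 1.0], [0.1, 0.0, 0.9],   # lakemedelsverket.se
])
toy_labels = ["1177.se"] * 2 + ["socialstyrelsen.se"] * 2 + ["lakemedelsverket.se"] * 2

prototypes = build_prototypes(toy_embeddings, toy_labels, shots_per_source=4)
routed = predict_source(np.array([0.8, 0.2, 0.1]), prototypes)
```

A prototype baseline like this has no trained weights at all, so comparing it with the linear probe shows how much of the performance comes from the embeddings themselves.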

What students should copy from this example

  • State the problem before training the model.
  • Separate the embedding model from the classifier trained for the assignment.
  • Publish a custom dataset, not only code.
  • Save metrics and prediction examples.
  • Link the dataset, model, Space, GitHub repo, and report.
  • Explain limitations honestly, especially when a dataset is small.
  • Reflect on what AI tools helped with and what was manually checked.

Downloadable student explainer

A longer PDF explanation is included in this repository as embedding_classifier_pipeline_explainer.pdf.

Demo inputs to try

  • Patient guidance: fever, back pain, pollen allergy, sleep problems.
  • Authority/statistics: national indicators, regional comparisons, guidelines.
  • Medicine regulation: recalls, product information updates, safety warnings.

Intended use and limits

This model is for teaching embeddings-based text classification in an Information Retrieval course. It is not medical advice, not a clinical triage system, and not a replacement for human review of health information.
