Gemma 4 E4B — Social-Bias Judge (SFT only)

This is the SFT-only checkpoint from the judge-from-scratch project. It is the intermediate artifact before the DPO refinement pass that produced krishnakartik/gemma4-social-bias-judge (the primary release).

Use this checkpoint instead of the DPO version if your bias categories are out-of-distribution relative to BBQ's training set. The DPO refinement narrows generalization by overfitting to the 10 in-distribution bias categories' specific patterns — fine when your inputs match the training distribution, harmful when they don't.

For the full project narrative, eval methodology, training pipeline, and limitations, read the primary model card. This card focuses on what differs between the SFT-only and DPO checkpoints.


⚠️ Important: Thinking Mode

This model was fine-tuned with Gemma 4's native thinking mode DISABLED. Do NOT include <|think|> in the system prompt at inference time — the model never saw that token during training and will generate degraded, unparseable output. See the primary model card's thinking-mode section for the full explanation.
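
One way to enforce this in client code is to reject a prompt before it reaches the model. The sketch below is a minimal guard, not part of the released code: the helper name `check_no_think_token` and the role/content message dicts are assumptions chosen for illustration.

```python
THINK_TOKEN = "<|think|>"  # token named in this card; must not appear in the system prompt


def check_no_think_token(messages):
    """Raise ValueError if any system message contains the thinking-mode token.

    `messages` is assumed to be a list of {"role": ..., "content": ...} dicts
    (a hypothetical chat format, used here only for illustration).
    """
    for i, msg in enumerate(messages):
        if msg.get("role") == "system" and THINK_TOKEN in msg.get("content", ""):
            raise ValueError(f"message {i}: remove {THINK_TOKEN} from the system prompt")
    return messages
```

Calling the guard just before tokenization keeps a stray token from silently degrading the judge's output.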


Quick start

Ollama

# IMPORTANT: thinking mode is disabled — do NOT add <|think|> to /system.
ollama run hf.co/krishnakartik/gemma4-social-bias-judge-gguf:Q8_0-sft

Python (transformers)

# Identical usage to the DPO checkpoint — only the model_id changes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "krishnakartik/gemma4-social-bias-judge-sft"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)
# ... see primary model card for the full inference snippet.

When to choose this over the DPO checkpoint

| Use case | Recommended |
| --- | --- |
| Bias categories in BBQ's 10 trained set (age, disability, gender identity, nationality, physical appearance, race/ethnicity inc. intersectional, religion, sexual orientation, SES) | DPO (primary) |
| Bias categories outside the trained set (politics, ideology, novel demographic axes, intersectional categories not in training) | This checkpoint (SFT) |
| Tie-case detection (both responses clean) is critical | DPO: tie-κ jumps from −0.06 (SFT) to 0.36 (DPO) |
| Subtle bias discrimination on in-dist data | DPO: subtle-κ jumps from 0.74 (SFT) to 0.89 (DPO) |
| Tracked-vs-alternate (which specific stereotype is invoked) | This checkpoint (SFT-κ 0.20 vs DPO-κ 0.12) |
| Position-bias robustness on OOD | This checkpoint (SFT 11.7% vs DPO 16.7%) |
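
The recommendations above can be read as a small decision rule. The sketch below is one possible encoding: `choose_checkpoint` is a hypothetical helper, and giving tie-case detection priority over the in-distribution check is an assumption, not something the card prescribes.

```python
SFT_ID = "krishnakartik/gemma4-social-bias-judge-sft"  # this checkpoint
DPO_ID = "krishnakartik/gemma4-social-bias-judge"      # primary release


def choose_checkpoint(in_trained_categories, tie_detection_critical=False):
    """Pick a repo id following the recommendations above.

    Assumption: tie-case detection takes priority, because the SFT
    checkpoint's tie-kappa is near zero (-0.06) while DPO reaches 0.36.
    """
    if tie_detection_critical:
        return DPO_ID
    return DPO_ID if in_trained_categories else SFT_ID
```

For example, `choose_checkpoint(False)` returns the SFT repo id, while `choose_checkpoint(False, tie_detection_critical=True)` returns the DPO one.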

Eval results (selected)

Same 300-pair holdout, same vLLM/bf16 backend as the primary model card's eval table.

| Metric | Base | SFT (this) | DPO |
| --- | --- | --- | --- |
| Overall κ (in-dist) | 0.481 | 0.647 | 0.682 |
| Overall κ (OOD religion) | 0.542 | 0.695 | 0.643 |
| Tracked-vs-alternate κ | 0.145 | 0.197 | 0.119 |
| Subtle cases κ | 0.632 | 0.743 | 0.890 |
| Tie cases κ | 0.202 | −0.056 | 0.359 |
| Position-bias rate (OOD) | 21.7% | 11.7% | 16.7% |
| Self-consistency (T=0.3) | 73.7% | 83.2% | 82.7% |
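
The κ rows report Cohen's kappa, which corrects raw agreement for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The pure-Python sketch below implements that standard formula. It is not the project's eval code, and the label names and toy data are hypothetical.

```python
from collections import Counter


def cohens_kappa(judge_labels, gold_labels):
    """Cohen's kappa, (p_o - p_e) / (1 - p_e), for two equal-length label lists."""
    if len(judge_labels) != len(gold_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge_labels)
    # Observed agreement: fraction of items where both raters give the same label.
    p_o = sum(j == g for j, g in zip(judge_labels, gold_labels)) / n
    # Chance agreement: sum over labels of the product of each rater's marginal rate.
    judge_counts, gold_counts = Counter(judge_labels), Counter(gold_labels)
    p_e = sum(judge_counts[k] * gold_counts[k] for k in judge_counts) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        # This sketch returns 1.0 in that case.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy data with hypothetical verdict labels "A", "B", "tie".
judge = ["A", "A", "B", "B", "tie", "A"]
gold = ["A", "B", "B", "B", "tie", "A"]
kappa = cohens_kappa(judge, gold)
```

A negative value, such as the SFT tie-case κ, means agreement below what chance alone would produce.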

This checkpoint wins on OOD κ, tracked-vs-alternate κ, and OOD position-bias. The DPO checkpoint wins on in-dist κ, subtle cases, and tie cases, which are the metrics the synthetic hard-negative training data was specifically designed to improve.
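
This card does not define the position-bias rate. A common definition, assumed here, is the fraction of pairs whose verdict is inconsistent once the two responses are presented in swapped order. The sketch below uses that definition; the verdict labels "A", "B", and "tie" are hypothetical.

```python
def position_bias_rate(verdicts_original, verdicts_swapped):
    """Fraction of pairs where swapping response order changes the underlying verdict.

    Assumed convention: verdicts are "A", "B", or "tie", naming the slot judged
    more biased. After a swap, a consistent judge flips "A" <-> "B" and keeps "tie".
    """
    if len(verdicts_original) != len(verdicts_swapped) or not verdicts_original:
        raise ValueError("need two non-empty verdict lists of equal length")
    flip = {"A": "B", "B": "A", "tie": "tie"}
    inconsistent = sum(
        flip[orig] != swapped
        for orig, swapped in zip(verdicts_original, verdicts_swapped)
    )
    return inconsistent / len(verdicts_original)


# Toy example: the last pair picks slot "A" in both orders, so it is inconsistent.
rate = position_bias_rate(["A", "B", "tie", "A"], ["B", "A", "tie", "A"])
```

Under this definition, the OOD rates of 11.7% and 16.7% on a 60-pair set would correspond to 7 and 10 inconsistent pairs respectively.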

The OOD-κ delta (+0.052 in this checkpoint's favor) is the load-bearing reason this artifact exists. See the primary model card's OOD-regression discussion for the full analysis.


Training summary

QLoRA SFT: 3,844 rows (1,938 base pairs × position-swap doubling), 3 epochs, 720 optimizer steps, r=16, α=32, dropout=0, all-linear LoRA targets, lr=2e-4 cosine, peak VRAM 23.4 GB on A100-40GB. Final train_loss 0.889, mean_token_accuracy 86.1%. Total Stage 6 spend: ~$4. Adapter merged to bf16 for Stage 8 eval and this release.
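
For readers reproducing the setup, the stated hyperparameters can be collected as below. This is a sketch, not the project's training script. The dict keys follow `peft.LoraConfig` keyword names, but wiring them into peft is an assumption left to the reader, and the effective batch size is derived from the stated numbers rather than given in the card.

```python
# LoRA settings as stated in the training summary. Key names mirror
# peft.LoraConfig keyword arguments (assumed target library).
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.0,
    "target_modules": "all-linear",
}

train_rows = 3844       # rows after position-swap doubling
epochs = 3
optimizer_steps = 720
learning_rate = 2e-4    # cosine schedule

# Derived, not stated in the card: rows consumed per optimizer step.
steps_per_epoch = optimizer_steps / epochs
implied_effective_batch = train_rows / steps_per_epoch
```

The ratio comes out close to 16 rows per optimizer step, which suggests an effective batch size of 16, though the card does not state one.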

The DPO step was applied to a copy of this checkpoint, and it did not depend on this checkpoint being released separately. This release is therefore the exact SFT artifact that fed into DPO: an unmodified snapshot of the pipeline.


License & citation

Same as the primary model card.


Evaluation results

Self-reported, on Gemma 4 Social Bias Judge Pairs (eval holdout):

  • Cohen's κ (in-distribution, 240 pairs): 0.647
  • Cohen's κ (OOD religion, 60 pairs): 0.695
  • Cohen's κ (tracked-vs-alternate): 0.197
  • Cohen's κ (subtle-bias bucket): 0.743
  • Position-bias rate (in-distribution; lower is better): 0.084
  • Self-consistency rate (T=0.3): 0.832