adorosario committed on Dec 26, 2025

Commit bc749f0 (verified) · 1 Parent(s): 9e0edce

Upload README.md with huggingface_hub

README.md +337 -0

README.md ADDED Viewed

	@@ -0,0 +1,337 @@

+---
+language:
+  - en
+license: gemma
+library_name: gguf
+tags:
+  - gemma3n
+  - document-qa
+  - extractive-qa
+  - rag
+  - gguf
+  - ollama
+  - cpu-compatible
+  - no-hallucination
+  - abstention
+pipeline_tag: question-answering
+base_model: google/gemma-3n-E4B-it
+datasets:
+  - adorosario/gemma3n-qa-synthetic
+model-index:
+  - name: gemma3n-qa-v4-fixed
+    results:
+      - task:
+          type: question-answering
+          name: Document-Grounded QA
+        dataset:
+          name: SimpleQA-Verified Synthetic Test
+          type: custom
+        metrics:
+          - type: exact_match
+            value: 83.2
+            name: Exact Match
+          - type: f1
+            value: 90.0
+            name: Token F1
+          - type: f1
+            value: 98.9
+            name: Abstention F1
+---
+# gemma3n-qa-v4-fixed
+**A fine-tuned Gemma 3n model for document-grounded question answering that answers only from the provided context and says "I don't know" when the answer is missing, instead of hallucinating.**
+| Metric | This Model | Baseline | Improvement |
+|--------|------------|----------|-------------|
+| Exact Match | **83.2%** | 22.0% | **+61.2 pts** |
+| Token F1 | **90.0%** | 34.8% | **+55.2 pts** |
+| Abstention F1 | **98.9%** | ~0% | **+98.9 pts** |
+## TL;DR
+This model answers questions **only** from provided context. When the answer isn't there, it says `NOT FOUND IN DOCUMENTS` instead of making things up.
+**The problem it solves:** The baseline Gemma 3n answers from its training data even when the context does not contain the answer. Ask "Who is the president of France?" with context about the Eiffel Tower, and the baseline confidently says "Emmanuel Macron", an answer that appears nowhere in the context. This fine-tuned version correctly responds "NOT FOUND IN DOCUMENTS".
+---
+## Quick Start
+### With Ollama
+```bash
+# Download the model
+curl -L -o gemma3n-qa-v4-fixed.gguf https://huggingface.co/adorosario/gemma3n-qa-v4-fixed/resolve/main/gemma3n-qa-v4-fixed-q4_k_m.gguf
+# Create Modelfile
+cat > Modelfile << 'EOF'
+FROM ./gemma3n-qa-v4-fixed.gguf
+TEMPLATE """<bos><start_of_turn>user
+{{ .System }}
+{{ .Prompt }}<end_of_turn>
+<start_of_turn>model
+{{ .Response }}<end_of_turn>"""
+PARAMETER stop <end_of_turn>
+PARAMETER stop <eos>
+PARAMETER temperature 0
+EOF
+# Create and run
+ollama create gemma3n-qa-v4-fixed -f Modelfile
+ollama run gemma3n-qa-v4-fixed
+```
+### Python API (Ollama)
+```python
+import requests
+
+def ask_document(question: str, context: str) -> str:
+    # Prompt format the model was fine-tuned on; blank lines separate the sections.
+    prompt = f"""You are a helpful assistant that answers questions based on provided context.
+If the answer is not found in the context, respond with "NOT FOUND IN DOCUMENTS".
+
+Question: {question}
+
+Context:
+{context}"""
+    response = requests.post(
+        "http://localhost:11434/api/generate",
+        json={
+            "model": "gemma3n-qa-v4-fixed",
+            "prompt": prompt,
+            "stream": False,  # return a single JSON object instead of a stream
+        },
+        timeout=120,  # CPU-only inference can take several seconds
+    )
+    response.raise_for_status()  # surface HTTP errors instead of a KeyError below
+    return response.json()["response"]
+
+# Example
+answer = ask_document(
+    question="When was the Eiffel Tower built?",
+    context="The Eiffel Tower was built from 1887 to 1889 by Gustave Eiffel."
+)
+print(answer)  # Example output: "from 1887 to 1889"
+```
+---
+## The Hallucination Problem (Why This Model Exists)
+### Baseline Behavior (Bad)
+```
+Question: Who is the president of France?
+Context: The Eiffel Tower is in Paris. It was built by Gustave Eiffel.
+Baseline Response: "Emmanuel Macron"  ← HALLUCINATED! Not in context!
+```
+### Fine-tuned Behavior (Good)
+```
+Question: Who is the president of France?
+Context: The Eiffel Tower is in Paris. It was built by Gustave Eiffel.
+Fine-tuned Response: "NOT FOUND IN DOCUMENTS"  ← Correct abstention!
+```
+This is critical for RAG applications where you need the model to be **honest about what it doesn't know**.
+---
+## Prompt Format (Required)
+The model requires this specific prompt format to work correctly:
+```
+You are a helpful assistant that answers questions based on provided context.
+If the answer is not found in the context, respond with "NOT FOUND IN DOCUMENTS".
+
+Question: {your question}
+
+Context:
+{your context}
+```
+**Without the abstention instruction**, the model may not properly refuse to answer questions outside the context.
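A minimal sketch of a helper that assembles this prompt, so the format is built in one place rather than retyped at each call site. The names `SYSTEM_INSTRUCTION` and `build_prompt` are hypothetical, and the blank lines between sections are an assumption taken from the `\n\n` separators in the llama.cpp example in this README.

```python
SYSTEM_INSTRUCTION = (
    "You are a helpful assistant that answers questions based on provided context.\n"
    'If the answer is not found in the context, respond with "NOT FOUND IN DOCUMENTS".'
)


def build_prompt(question: str, context: str) -> str:
    """Assemble the single-turn prompt: instruction, question, then context."""
    return (
        f"{SYSTEM_INSTRUCTION}\n\n"
        f"Question: {question.strip()}\n\n"
        f"Context:\n{context.strip()}"
    )


prompt = build_prompt(
    "When was the Eiffel Tower built?",
    "The Eiffel Tower was built from 1887 to 1889 by Gustave Eiffel.",
)
print(prompt)
```

The returned string can be passed as the `prompt` field of an Ollama request or to `llama-cli -p`.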
+---
+## Performance
+### Benchmark Results (6,046 test examples)
+| Metric | Value | Description |
+|--------|-------|-------------|
+| **Exact Match** | 83.2% | Answer exactly matches the gold answer |
+| **Token F1** | 90.0% | Token overlap with the gold answer |
+| **Abstention Precision** | 98.2% | Share of abstentions where the question was truly unanswerable |
+| **Abstention Recall** | 99.7% | Share of unanswerable questions on which the model abstained |
+| **Abstention F1** | 98.9% | Harmonic mean of abstention precision and recall |
+### Comparison with Baseline
+| Metric | Fine-tuned | Baseline (gemma3n:e4b) | Improvement |
+|--------|------------|------------------------|-------------|
+| Exact Match | 83.2% | 22.0% | +61.2 pts (+278%) |
+| Token F1 | 90.0% | 34.8% | +55.2 pts (+159%) |
+| Abstention F1 | 98.9% | ~0% | Model learned abstention |
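The README does not include its scoring code, so the following is only a sketch of how such metrics are commonly computed. It assumes SQuAD-style normalization (lowercasing, removing punctuation and English articles), which may differ from what was actually used. The function names are hypothetical. The last line checks that the reported abstention precision and recall combine to the reported abstention F1.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized token sequences are identical."""
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall against the gold answer."""
    pred, ref = normalize(prediction), normalize(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def f1_from(precision: float, recall: float) -> float:
    """Combine a precision and a recall into an F1 score."""
    return 2 * precision * recall / (precision + recall)


print(exact_match("From 1887 to 1889.", "from 1887 to 1889"))   # True
print(round(token_f1("1887 to 1889", "from 1887 to 1889"), 3))  # 0.857
print(round(100 * f1_from(0.982, 0.997), 1))                    # 98.9
```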
+### Statistical Significance
+- **p-value**: < 0.00001 (highly significant)
+- **95% CI (Exact Match)**: 82.3% - 84.1% (fine-tuned) vs 13.9% - 30.1% (baseline)
+- Confidence intervals don't overlap
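The README does not say how these intervals were computed. As an illustration only, a normal-approximation (Wald) interval for a proportion reproduces the fine-tuned Exact Match interval when applied to the 6,046 test examples. The function name `wald_interval` is hypothetical. The much wider baseline interval would be consistent with a smaller evaluation sample, but the baseline sample size is not stated.

```python
import math


def wald_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval for a proportion (z=1.96 for 95%)."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


low, high = wald_interval(0.832, 6046)
print(f"{100 * low:.1f}% - {100 * high:.1f}%")  # 82.3% - 84.1%
```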
+---
+## Hardware Requirements
+| Hardware | Supported | Latency | Notes |
+|----------|-----------|---------|-------|
+| **CPU only** (8 cores, 32GB RAM) | Yes | 4-6 sec | Validated on n2-standard-8 |
+| NVIDIA T4 (16GB) | Yes | <1 sec | Recommended |
+| Consumer GPU (8GB) | Yes | 1-2 sec | Works with Q4_K_M |
+| Apple Silicon | Yes | 1-3 sec | Via llama.cpp |
+**Memory requirement**: ~10 GB RAM for inference
+---
+## Training Details
+### Base Model
+- **Model**: Google Gemma 3n E4B (4B effective parameters)
+- **Source**: `unsloth/gemma-3n-E4B-it-unsloth-bnb-4bit`
+### Fine-tuning Configuration
+| Parameter | Value |
+|-----------|-------|
+| Method | LoRA (Low-Rank Adaptation) |
+| Rank (r) | 32 |
+| Alpha | 64 |
+| Dropout | 0.05 |
+| Learning Rate | 2e-4 |
+| Epochs | 3 |
+| Batch Size | 4 (effective: 16 with grad accum) |
+| Precision | bfloat16 |
+| Training Time | ~20 hours on A100 40GB |
+### Training Data
+- **Dataset**: [adorosario/gemma3n-qa-synthetic](https://huggingface.co/datasets/adorosario/gemma3n-qa-synthetic)
+- **Size**: 57,081 examples (45,220 train / 5,815 val / 6,046 test)
+- **Composition**: 73% answerable QA, 27% abstention examples
+- **Source**: Synthetic generation from SimpleQA-Verified knowledge base
+- **Generation**: GPT-4o-mini
+- **Cost**: ~$15-20 USD
+### Critical Implementation Detail
+The v4 success came from **manual label masking** - training only on model responses, not on the prompt. Previous versions (v1, v3) failed because this wasn't properly implemented.
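A minimal sketch of what label masking means in practice, not the author's training code. It assumes the common convention of setting prompt positions to `-100`, the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so that only response tokens contribute to the loss. The function name and the token IDs are made up for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default


def mask_prompt_labels(prompt_ids: list[int], response_ids: list[int]) -> dict[str, list[int]]:
    """Build input_ids and labels so that only response tokens are trained on."""
    input_ids = prompt_ids + response_ids
    # Prompt positions get IGNORE_INDEX; response positions keep their token IDs.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return {"input_ids": input_ids, "labels": labels}


# Hypothetical token IDs: four prompt tokens followed by two response tokens.
example = mask_prompt_labels(prompt_ids=[2, 105, 2364, 107], response_ids=[7713, 106])
print(example["labels"])  # [-100, -100, -100, -100, 7713, 106]
```

If the prompt is left unmasked, the loss is dominated by the long instruction and context tokens, which leaves comparatively little signal for the short answer or the abstention string.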
+---
+## How-To Guides
+### Use with llama.cpp
+```bash
+# Download
+wget https://huggingface.co/adorosario/gemma3n-qa-v4-fixed/resolve/main/gemma3n-qa-v4-fixed-q4_k_m.gguf
+# Run
+./llama-cli -m gemma3n-qa-v4-fixed-q4_k_m.gguf \
+  -p "You are a helpful assistant...\n\nQuestion: ...\n\nContext:\n..." \
+  --temp 0
+```
+### Use in a RAG Pipeline
+```python
+from langchain_community.llms import Ollama  # replaces the deprecated `langchain.llms` import path
+
+llm = Ollama(model="gemma3n-qa-v4-fixed", temperature=0)
+
+def rag_query(question: str, retrieved_docs: list[str]) -> str:
+    # Join the retrieved passages into a single context block
+    context = "\n\n".join(retrieved_docs)
+    prompt = f"""You are a helpful assistant that answers questions based on provided context.
+If the answer is not found in the context, respond with "NOT FOUND IN DOCUMENTS".
+
+Question: {question}
+
+Context:
+{context}"""
+    return llm.invoke(prompt)
+```
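In a pipeline, the caller usually needs to tell an abstention apart from a real answer, for example to trigger a wider retrieval or a fallback message. The sketch below assumes the model emits the literal string `NOT FOUND IN DOCUMENTS`, possibly wrapped in quotes or whitespace. `parse_answer` is a hypothetical helper, not part of any library.

```python
from typing import Optional, Tuple

ABSTENTION_MARKER = "NOT FOUND IN DOCUMENTS"


def parse_answer(raw_response: str) -> Tuple[bool, Optional[str]]:
    """Return (found, answer); answer is None when the model abstained."""
    cleaned = raw_response.strip().strip('"').strip()
    if ABSTENTION_MARKER in cleaned.upper():
        return False, None
    return True, cleaned


print(parse_answer("NOT FOUND IN DOCUMENTS"))  # (False, None)
print(parse_answer(" from 1887 to 1889 "))     # (True, 'from 1887 to 1889')
```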
+### Use with AnythingLLM
+1. Import the GGUF into Ollama (see Quick Start)
+2. In AnythingLLM, select `gemma3n-qa-v4-fixed` as the model
+3. Set system prompt to include the abstention instruction
+4. Set temperature to 0
+---
+## Limitations
+### What This Model Does Well
+- Extracting answers from provided context
+- Knowing when to abstain ("NOT FOUND IN DOCUMENTS")
+- Running on CPU-only hardware
+- Fast inference (4-6 seconds on CPU)
+### What This Model Does NOT Do
+- **Generate answers** beyond the context (by design)
+- **Multi-hop reasoning** requiring external knowledge
+- **Non-English languages** (trained on English only)
+- **Long contexts** beyond 4096 tokens
+- **Multi-turn conversation** (single-turn QA only)
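Given the 4096-token context limit, a RAG caller may want to cap how much retrieved text goes into the prompt. The sketch below is a rough illustration: it estimates tokens from word counts using an assumed ratio of 1.3 tokens per word and reserves an assumed 256 tokens for the instruction, question, and answer. Both numbers and the name `pack_context` are hypothetical; counting with the model's real tokenizer would be more accurate.

```python
def pack_context(
    docs: list[str],
    max_tokens: int = 4096,
    reserve: int = 256,
    tokens_per_word: float = 1.3,
) -> str:
    """Keep documents in ranked order until the estimated token budget is used up."""
    budget = max_tokens - reserve
    kept: list[str] = []
    used = 0
    for doc in docs:
        # Crude estimate: word count scaled by an assumed tokens-per-word ratio.
        cost = int(len(doc.split()) * tokens_per_word) + 1
        if used + cost > budget:
            break
        kept.append(doc)
        used += cost
    return "\n\n".join(kept)


context = pack_context(["one two", "three four five"], max_tokens=10, reserve=4)
print(context)  # one two
```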
+### Known Issues
+- Requires specific prompt format for abstention
+- ~2% quality loss from Q4_K_M quantization
+- May struggle with heavily paraphrased answers
+---
+## Files
+| File | Size | Description |
+|------|------|-------------|
+| `gemma3n-qa-v4-fixed-q4_k_m.gguf` | 7.68 GB | Main model (Q4_K_M quantization) |
+---
+## Citation
+```bibtex
+@misc{gemma3n-qa-v4-fixed-2025,
+  author = {Do Rosario, Alden},
+  title = {gemma3n-qa-v4-fixed: Fine-tuned Gemma 3n for Document-Grounded QA with Abstention},
+  year = {2025},
+  publisher = {HuggingFace},
+  url = {https://huggingface.co/adorosario/gemma3n-qa-v4-fixed},
+  note = {Fine-tuned for extractive QA with learned abstention behavior}
+}
+```
+---
+## Related Resources
+- **Training Dataset**: [adorosario/gemma3n-qa-synthetic](https://huggingface.co/datasets/adorosario/gemma3n-qa-synthetic)
+- **Base Model**: [Google Gemma 3n](https://ai.google.dev/gemma/docs/gemma-3n)
+- **Training Framework**: [Unsloth](https://github.com/unslothai/unsloth)
+---
+## Acknowledgments
+- Google for the Gemma 3n base model
+- Unsloth team for efficient fine-tuning tools
+- OpenAI for GPT-4o-mini used in synthetic data generation