--- language: - en license: gemma library_name: gguf tags: - gemma3n - document-qa - extractive-qa - rag - gguf - ollama - cpu-compatible - no-hallucination - abstention pipeline_tag: question-answering base_model: google/gemma-3n-E4B-it datasets: - adorosario/gemma3n-qa-synthetic model-index: - name: gemma3n-qa-v4-fixed results: - task: type: question-answering name: Document-Grounded QA dataset: name: SimpleQA-Verified Synthetic Test type: custom metrics: - type: exact_match value: 83.2 name: Exact Match - type: f1 value: 90.0 name: Token F1 - type: f1 value: 98.9 name: Abstention F1 --- # gemma3n-qa-v4-fixed **A fine-tuned Gemma 3n model for document-grounded question answering that eliminates hallucination and knows when to say "I don't know."** | Metric | This Model | Baseline | Improvement | |--------|------------|----------|-------------| | Exact Match | **83.2%** | 22.0% | **+61.2 pts** | | Token F1 | **90.0%** | 34.8% | **+55.2 pts** | | Abstention F1 | **98.9%** | ~0% | **+98.9 pts** | ## TL;DR This model answers questions **only** from provided context. When the answer isn't there, it says `NOT FOUND IN DOCUMENTS` instead of making things up. **The problem it solves:** The baseline Gemma 3n hallucinates answers not in the context. Ask "Who is the president of France?" with context about the Eiffel Tower, and baseline confidently says "Emmanuel Macron" - information it made up. This fine-tuned version correctly responds "NOT FOUND IN DOCUMENTS." --- ## Quick Start ### With Ollama ```bash # Download the model curl -L -o gemma3n-qa-v4-fixed.gguf https://huggingface.co/adorosario/gemma3n-qa-v4-fixed/resolve/main/gemma3n-qa-v4-fixed-q4_k_m.gguf # Create Modelfile cat > Modelfile << 'EOF' FROM ./gemma3n-qa-v4-fixed.gguf TEMPLATE """user {{ .System }} {{ .Prompt }} model {{ .Response }}""" PARAMETER stop PARAMETER stop PARAMETER temperature 0 EOF # Create and run ollama create gemma3n-qa-v4-fixed -f Modelfile ollama run gemma3n-qa-v4-fixed ``` ### Python API (Ollama) ```python import requests def ask_document(question: str, context: str) -> str: prompt = f"""You are a helpful assistant that answers questions based on provided context. If the answer is not found in the context, respond with "NOT FOUND IN DOCUMENTS". Question: {question} Context: {context}""" response = requests.post( "http://localhost:11434/api/generate", json={ "model": "gemma3n-qa-v4-fixed", "prompt": prompt, "stream": False } ) return response.json()["response"] # Example answer = ask_document( question="When was the Eiffel Tower built?", context="The Eiffel Tower was built from 1887 to 1889 by Gustave Eiffel." ) print(answer) # Output: "from 1887 to 1889" ``` --- ## The Hallucination Problem (Why This Model Exists) ### Baseline Behavior (Bad) ``` Question: Who is the president of France? Context: The Eiffel Tower is in Paris. It was built by Gustave Eiffel. Baseline Response: "Emmanuel Macron" ← HALLUCINATED! Not in context! ``` ### Fine-tuned Behavior (Good) ``` Question: Who is the president of France? Context: The Eiffel Tower is in Paris. It was built by Gustave Eiffel. Fine-tuned Response: "NOT FOUND IN DOCUMENTS" ← Correct abstention! ``` This is critical for RAG applications where you need the model to be **honest about what it doesn't know**. --- ## Prompt Format (Required) The model requires this specific prompt format to work correctly: ``` You are a helpful assistant that answers questions based on provided context. If the answer is not found in the context, respond with "NOT FOUND IN DOCUMENTS". Question: {your question} Context: {your context} ``` **Without the abstention instruction**, the model may not properly refuse to answer questions outside the context. --- ## Performance ### Benchmark Results (6,046 test examples) | Metric | Value | Description | |--------|-------|-------------| | **Exact Match** | 83.2% | Answer exactly matches gold standard | | **Token F1** | 90.0% | Token overlap with gold answer | | **Abstention Precision** | 98.2% | When it abstains, it's correct | | **Abstention Recall** | 99.7% | It catches almost all unanswerable questions | | **Abstention F1** | 98.9% | Combined abstention performance | ### Comparison with Baseline | Metric | Fine-tuned | Baseline (gemma3n:e4b) | Improvement | |--------|------------|------------------------|-------------| | Exact Match | 83.2% | 22.0% | +61.2 pts (+278%) | | Token F1 | 90.0% | 34.8% | +55.2 pts (+159%) | | Abstention F1 | 98.9% | ~0% | Model learned abstention | ### Statistical Significance - **p-value**: < 0.00001 (highly significant) - **95% CI**: 82.3% - 84.1% (fine-tuned) vs 13.9% - 30.1% (baseline) - Confidence intervals don't overlap --- ## Hardware Requirements | Hardware | Supported | Latency | Notes | |----------|-----------|---------|-------| | **CPU only** (8 cores, 32GB RAM) | Yes | 4-6 sec | Validated on n2-standard-8 | | NVIDIA T4 (16GB) | Yes | <1 sec | Recommended | | Consumer GPU (8GB) | Yes | 1-2 sec | Works with Q4_K_M | | Apple Silicon | Yes | 1-3 sec | Via llama.cpp | **Memory requirement**: ~10 GB RAM for inference --- ## Training Details ### Base Model - **Model**: Google Gemma 3n E4B (4B effective parameters) - **Source**: `unsloth/gemma-3n-E4B-it-unsloth-bnb-4bit` ### Fine-tuning Configuration | Parameter | Value | |-----------|-------| | Method | LoRA (Low-Rank Adaptation) | | Rank (r) | 32 | | Alpha | 64 | | Dropout | 0.05 | | Learning Rate | 2e-4 | | Epochs | 3 | | Batch Size | 4 (effective: 16 with grad accum) | | Precision | bfloat16 | | Training Time | ~20 hours on A100 40GB | ### Training Data - **Dataset**: [adorosario/gemma3n-qa-synthetic](https://huggingface.co/datasets/adorosario/gemma3n-qa-synthetic) - **Size**: 57,081 examples (45,220 train / 5,815 val / 6,046 test) - **Composition**: 73% answerable QA, 27% abstention examples - **Source**: Synthetic generation from SimpleQA-Verified knowledge base - **Generation**: GPT-4o-mini - **Cost**: ~$15-20 USD ### Critical Implementation Detail The v4 success came from **manual label masking** - training only on model responses, not on the prompt. Previous versions (v1, v3) failed because this wasn't properly implemented. --- ## How-To Guides ### Use with llama.cpp ```bash # Download wget https://huggingface.co/adorosario/gemma3n-qa-v4-fixed/resolve/main/gemma3n-qa-v4-fixed-q4_k_m.gguf # Run ./llama-cli -m gemma3n-qa-v4-fixed-q4_k_m.gguf \ -p "You are a helpful assistant...\n\nQuestion: ...\n\nContext:\n..." \ --temp 0 ``` ### Use in a RAG Pipeline ```python from langchain.llms import Ollama llm = Ollama(model="gemma3n-qa-v4-fixed", temperature=0) def rag_query(question: str, retrieved_docs: list) -> str: context = "\n\n".join(retrieved_docs) prompt = f"""You are a helpful assistant that answers questions based on provided context. If the answer is not found in the context, respond with "NOT FOUND IN DOCUMENTS". Question: {question} Context: {context}""" return llm.invoke(prompt) ``` ### Use with AnythingLLM 1. Import the GGUF into Ollama (see Quick Start) 2. In AnythingLLM, select `gemma3n-qa-v4-fixed` as the model 3. Set system prompt to include the abstention instruction 4. Set temperature to 0 --- ## Limitations ### What This Model Does Well - Extracting answers from provided context - Knowing when to abstain ("NOT FOUND IN DOCUMENTS") - Running on CPU-only hardware - Fast inference (4-6 seconds on CPU) ### What This Model Does NOT Do - **Generate answers** beyond the context (by design) - **Multi-hop reasoning** requiring external knowledge - **Non-English languages** (trained on English only) - **Long contexts** beyond 4096 tokens - **Multi-turn conversation** (single-turn QA only) ### Known Issues - Requires specific prompt format for abstention - ~2% quality loss from Q4_K_M quantization - May struggle with heavily paraphrased answers --- ## Files | File | Size | Description | |------|------|-------------| | `gemma3n-qa-v4-fixed-q4_k_m.gguf` | 7.68 GB | Main model (Q4_K_M quantization) | --- ## Citation ```bibtex @misc{gemma3n-qa-v4-fixed-2025, author = {Do Rosario, Alden}, title = {gemma3n-qa-v4-fixed: Fine-tuned Gemma 3n for Document-Grounded QA with Abstention}, year = {2025}, publisher = {HuggingFace}, url = {https://huggingface.co/adorosario/gemma3n-qa-v4-fixed}, note = {Fine-tuned for extractive QA with learned abstention behavior} } ``` --- ## Related Resources - **Training Dataset**: [adorosario/gemma3n-qa-synthetic](https://huggingface.co/datasets/adorosario/gemma3n-qa-synthetic) - **Base Model**: [Google Gemma 3n](https://ai.google.dev/gemma/docs/gemma-3n) - **Training Framework**: [Unsloth](https://github.com/unslothai/unsloth) --- ## Acknowledgments - Google for the Gemma 3n base model - Unsloth team for efficient fine-tuning tools - OpenAI for GPT-4o-mini used in synthetic data generation