adorosario committed on
Commit bc749f0 · verified · 1 Parent(s): 9e0edce

Upload README.md with huggingface_hub

  1. README.md +337 -0
README.md ADDED
@@ -0,0 +1,337 @@
+ ---
+ language:
+ - en
+ license: gemma
+ library_name: gguf
+ tags:
+ - gemma3n
+ - document-qa
+ - extractive-qa
+ - rag
+ - gguf
+ - ollama
+ - cpu-compatible
+ - no-hallucination
+ - abstention
+ pipeline_tag: question-answering
+ base_model: google/gemma-3n-E4B-it
+ datasets:
+ - adorosario/gemma3n-qa-synthetic
+ model-index:
+ - name: gemma3n-qa-v4-fixed
+   results:
+   - task:
+       type: question-answering
+       name: Document-Grounded QA
+     dataset:
+       name: SimpleQA-Verified Synthetic Test
+       type: custom
+     metrics:
+     - type: exact_match
+       value: 83.2
+       name: Exact Match
+     - type: f1
+       value: 90.0
+       name: Token F1
+     - type: f1
+       value: 98.9
+       name: Abstention F1
+ ---
+
+ # gemma3n-qa-v4-fixed
+
+ **A fine-tuned Gemma 3n model for document-grounded question answering that eliminates hallucination and knows when to say "I don't know."**
+
+ | Metric | This Model | Baseline | Improvement |
+ |--------|------------|----------|-------------|
+ | Exact Match | **83.2%** | 22.0% | **+61.2 pts** |
+ | Token F1 | **90.0%** | 34.8% | **+55.2 pts** |
+ | Abstention F1 | **98.9%** | ~0% | **+98.9 pts** |
+
+ ## TL;DR
+
+ This model answers questions **only** from provided context. When the answer isn't there, it says `NOT FOUND IN DOCUMENTS` instead of making things up.
+
+ **The problem it solves:** The baseline Gemma 3n answers from its own pretrained knowledge even when the context does not contain the answer. Ask "Who is the president of France?" with context about the Eiffel Tower, and the baseline confidently says "Emmanuel Macron", an answer the context does not support. This fine-tuned version correctly responds "NOT FOUND IN DOCUMENTS."
+
+ ---
+
+ ## Quick Start
+
+ ### With Ollama
+
+ ```bash
+ # Download the model
+ curl -L -o gemma3n-qa-v4-fixed.gguf https://huggingface.co/adorosario/gemma3n-qa-v4-fixed/resolve/main/gemma3n-qa-v4-fixed-q4_k_m.gguf
+
+ # Create Modelfile
+ cat > Modelfile << 'EOF'
+ FROM ./gemma3n-qa-v4-fixed.gguf
+ TEMPLATE """<bos><start_of_turn>user
+ {{ .System }}
+
+ {{ .Prompt }}<end_of_turn>
+ <start_of_turn>model
+ {{ .Response }}<end_of_turn>"""
+ PARAMETER stop <end_of_turn>
+ PARAMETER stop <eos>
+ PARAMETER temperature 0
+ EOF
+
+ # Create and run
+ ollama create gemma3n-qa-v4-fixed -f Modelfile
+ ollama run gemma3n-qa-v4-fixed
+ ```
+
+ ### Python API (Ollama)
+
+ ```python
+ import requests
+
+ def ask_document(question: str, context: str) -> str:
+     # Build the prompt in the format the model expects (see "Prompt Format").
+     prompt = f"""You are a helpful assistant that answers questions based on provided context.
+ If the answer is not found in the context, respond with "NOT FOUND IN DOCUMENTS".
+
+ Question: {question}
+
+ Context:
+ {context}"""
+
+     # Send a non-streaming request to the local Ollama server.
+     response = requests.post(
+         "http://localhost:11434/api/generate",
+         json={
+             "model": "gemma3n-qa-v4-fixed",
+             "prompt": prompt,
+             "stream": False
+         },
+         timeout=120,  # generous limit; CPU inference can take several seconds
+     )
+     response.raise_for_status()  # surface HTTP errors instead of a KeyError below
+     return response.json()["response"]
+
+ # Example
+ answer = ask_document(
+     question="When was the Eiffel Tower built?",
+     context="The Eiffel Tower was built from 1887 to 1889 by Gustave Eiffel."
+ )
+ print(answer)  # Output: "from 1887 to 1889"
+ ```
+
+ ---
+
+ ## The Hallucination Problem (Why This Model Exists)
+
+ ### Baseline Behavior (Bad)
+
+ ```
+ Question: Who is the president of France?
+ Context: The Eiffel Tower is in Paris. It was built by Gustave Eiffel.
+
+ Baseline Response: "Emmanuel Macron" ← HALLUCINATED! Not in context!
+ ```
+
+ ### Fine-tuned Behavior (Good)
+
+ ```
+ Question: Who is the president of France?
+ Context: The Eiffel Tower is in Paris. It was built by Gustave Eiffel.
+
+ Fine-tuned Response: "NOT FOUND IN DOCUMENTS" ← Correct abstention!
+ ```
+
+ This is critical for RAG applications where you need the model to be **honest about what it doesn't know**.
+
+ ---
+
+ ## Prompt Format (Required)
+
+ The model requires this specific prompt format to work correctly:
+
+ ```
+ You are a helpful assistant that answers questions based on provided context.
+ If the answer is not found in the context, respond with "NOT FOUND IN DOCUMENTS".
+
+ Question: {your question}
+
+ Context:
+ {your context}
+ ```
+
+ **Without the abstention instruction**, the model may not properly refuse to answer questions outside the context.
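
A minimal sketch of how a caller might assemble this prompt and recognize an abstention in the reply. The helper names `build_prompt` and `is_abstention` are hypothetical, and the lenient comparison (ignoring case, surrounding quotes, and a trailing period) is an assumption rather than documented model behavior.

```python
ABSTENTION = "NOT FOUND IN DOCUMENTS"

# The required prompt format, with placeholders for the two user-supplied parts.
PROMPT_TEMPLATE = (
    "You are a helpful assistant that answers questions based on provided context.\n"
    'If the answer is not found in the context, respond with "NOT FOUND IN DOCUMENTS".\n'
    "\n"
    "Question: {question}\n"
    "\n"
    "Context:\n"
    "{context}"
)


def build_prompt(question: str, context: str) -> str:
    """Fill the prompt format with a question and its context."""
    return PROMPT_TEMPLATE.format(question=question.strip(), context=context.strip())


def is_abstention(response: str) -> bool:
    """Return True if the response is the abstention string.

    Assumption: whitespace, quotes, and periods at either end, as well as
    letter case, are ignored when comparing.
    """
    cleaned = response.strip(" \t\r\n\"'.")
    return cleaned.upper() == ABSTENTION
```

Checking for the abstention string in application code lets a RAG pipeline fall back to another retriever or report "no answer" to the user instead of displaying the raw sentinel.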
+
+ ---
+
+ ## Performance
+
+ ### Benchmark Results (6,046 test examples)
+
+ | Metric | Value | Description |
+ |--------|-------|-------------|
+ | **Exact Match** | 83.2% | Answer exactly matches the gold answer |
+ | **Token F1** | 90.0% | Token overlap with the gold answer |
+ | **Abstention Precision** | 98.2% | Share of abstentions that were correct (the question was unanswerable) |
+ | **Abstention Recall** | 99.7% | Share of unanswerable questions on which the model abstained |
+ | **Abstention F1** | 98.9% | Harmonic mean of abstention precision and recall |
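
The following sketch shows one common way to compute metrics of this kind. The model card does not publish its scoring script, so the SQuAD-style normalization (lowercasing, dropping punctuation and English articles) and all function names here are assumptions.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized token sequences are identical."""
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall (bag-of-tokens overlap)."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        return float(pred == ref)  # both empty counts as a match
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def f1_from_precision_recall(precision: float, recall: float) -> float:
    """F1 as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# The reported abstention precision and recall reproduce the reported F1.
print(round(100 * f1_from_precision_recall(0.982, 0.997), 1))  # 98.9
```

Token F1 gives partial credit where Exact Match gives none: a prediction of "built from 1887 to 1889" against the gold answer "from 1887 to 1889" fails Exact Match but scores about 0.89 on Token F1 under this normalization.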
+
+ ### Comparison with Baseline
+
+ | Metric | Fine-tuned | Baseline (gemma3n:e4b) | Improvement |
+ |--------|------------|------------------------|-------------|
+ | Exact Match | 83.2% | 22.0% | +61.2 pts (+278%) |
+ | Token F1 | 90.0% | 34.8% | +55.2 pts (+159%) |
+ | Abstention F1 | 98.9% | ~0% | Model learned abstention |
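
As a quick check on the arithmetic in the table, the sketch below recomputes the absolute gain (percentage points) and the relative gain (percent of the baseline score). The function name `improvement` is illustrative only.

```python
def improvement(fine_tuned: float, baseline: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = fine_tuned - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


for name, ft, base in [("Exact Match", 83.2, 22.0), ("Token F1", 90.0, 34.8)]:
    pts, rel = improvement(ft, base)
    print(f"{name}: +{pts:.1f} pts (+{rel:.0f}%)")
# Exact Match: +61.2 pts (+278%)
# Token F1: +55.2 pts (+159%)
```

No relative figure is given for Abstention F1 because the baseline score is approximately zero, which makes a ratio meaningless.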
+
+ ### Statistical Significance
+
+ - **p-value**: < 0.00001 (highly significant)
+ - **95% CI (Exact Match)**: 82.3% - 84.1% (fine-tuned) vs 13.9% - 30.1% (baseline)
+ - The two confidence intervals do not overlap
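
The model card does not say how the intervals were computed. As a sketch under that caveat, a normal-approximation (Wald) interval for a proportion reproduces the fine-tuned Exact Match interval when applied to the 6,046 test examples. The function names below are hypothetical.

```python
import math


def wald_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a proportion."""
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def implied_sample_size(p_hat: float, half_width: float, z: float = 1.96) -> float:
    """Sample size at which a Wald interval would have the given half-width."""
    return z * z * p_hat * (1.0 - p_hat) / half_width**2


low, high = wald_interval(0.832, 6046)
print(f"{100 * low:.1f}% - {100 * high:.1f}%")  # 82.3% - 84.1%
```

Under the same formula, the much wider baseline interval (22.0% plus or minus 8.1 points) would be consistent with an evaluation on roughly 100 examples, though the card does not state the baseline sample size.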
+
+ ---
+
+ ## Hardware Requirements
+
+ | Hardware | Supported | Latency | Notes |
+ |----------|-----------|---------|-------|
+ | **CPU only** (8 cores, 32 GB RAM) | Yes | 4-6 sec | Validated on n2-standard-8 |
+ | NVIDIA T4 (16 GB) | Yes | <1 sec | Recommended |
+ | Consumer GPU (8 GB) | Yes | 1-2 sec | Works with Q4_K_M |
+ | Apple Silicon | Yes | 1-3 sec | Via llama.cpp |
+
+ **Memory requirement**: ~10 GB RAM for inference
+
+ ---
+
+ ## Training Details
+
+ ### Base Model
+ - **Model**: Google Gemma 3n E4B (4B effective parameters)
+ - **Source**: `unsloth/gemma-3n-E4B-it-unsloth-bnb-4bit`
+
+ ### Fine-tuning Configuration
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Method | LoRA (Low-Rank Adaptation) |
+ | Rank (r) | 32 |
+ | Alpha | 64 |
+ | Dropout | 0.05 |
+ | Learning Rate | 2e-4 |
+ | Epochs | 3 |
+ | Batch Size | 4 (effective: 16 with grad accum) |
+ | Precision | bfloat16 |
+ | Training Time | ~20 hours on A100 40GB |
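
To make the relationships between these hyperparameters explicit, here is a small sketch that derives a few quantities from the table. The `FineTuneConfig` class is purely illustrative and is not the author's training script. It assumes a single device (so the effective batch size is the per-step batch size times the gradient accumulation steps) and that the final partial batch of each epoch is kept.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Values copied from the configuration table; the field names are hypothetical."""

    lora_rank: int = 32
    lora_alpha: int = 64
    lora_dropout: float = 0.05
    learning_rate: float = 2e-4
    epochs: int = 3
    per_device_batch_size: int = 4
    effective_batch_size: int = 16

    @property
    def lora_scaling(self) -> float:
        """Standard LoRA update scale, alpha / r."""
        return self.lora_alpha / self.lora_rank

    @property
    def grad_accum_steps(self) -> int:
        """Accumulation steps implied by the two batch sizes (single device assumed)."""
        return self.effective_batch_size // self.per_device_batch_size

    def optimizer_steps(self, train_examples: int) -> int:
        """Total optimizer updates, keeping the last partial batch of each epoch."""
        return self.epochs * math.ceil(train_examples / self.effective_batch_size)


cfg = FineTuneConfig()
print(cfg.lora_scaling)             # 2.0
print(cfg.grad_accum_steps)         # 4
print(cfg.optimizer_steps(45_220))  # 8481
```

With alpha set to twice the rank, the low-rank update is scaled by 2.0. Over the 45,220 training examples listed under Training Data, three epochs come to about 8,500 optimizer steps under these assumptions.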
+
+ ### Training Data
+
+ - **Dataset**: [adorosario/gemma3n-qa-synthetic](https://huggingface.co/datasets/adorosario/gemma3n-qa-synthetic)
+ - **Size**: 57,081 examples (45,220 train / 5,815 val / 6,046 test)
+ - **Composition**: 73% answerable QA, 27% abstention examples
+ - **Source**: Synthetic generation from SimpleQA-Verified knowledge base
+ - **Generation**: GPT-4o-mini
+ - **Cost**: ~$15-20 USD
+
+ ### Critical Implementation Detail
+
+ The v4 success came from **manual label masking**: the loss is computed only on the model's response tokens, not on the prompt tokens. Previous versions (v1, v3) failed because this masking was not implemented properly.
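
The card does not include the masking code itself. The sketch below shows the usual way this kind of masking is done in causal language model fine-tuning, assuming the prompt and response have already been tokenized separately. The function name and the toy token IDs are made up for illustration. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so positions carrying that label contribute nothing to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_prompt_labels(prompt_ids: list[int], response_ids: list[int]) -> dict[str, list[int]]:
    """Build input_ids and labels so that only response tokens are trained on."""
    input_ids = prompt_ids + response_ids
    # Prompt positions get IGNORE_INDEX; response positions keep their token IDs.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return {"input_ids": input_ids, "labels": labels}


# Toy example with made-up token IDs: three prompt tokens, two response tokens.
example = mask_prompt_labels(prompt_ids=[11, 12, 13], response_ids=[21, 22])
print(example["input_ids"])  # [11, 12, 13, 21, 22]
print(example["labels"])     # [-100, -100, -100, 21, 22]
```

If the labels are instead a plain copy of `input_ids`, the model is also trained to reproduce the instruction and context, which dilutes the signal for the short answer or abstention string.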
+
+ ---
+
+ ## How-To Guides
+
+ ### Use with llama.cpp
+
+ ```bash
+ # Download
+ wget https://huggingface.co/adorosario/gemma3n-qa-v4-fixed/resolve/main/gemma3n-qa-v4-fixed-q4_k_m.gguf
+
+ # Run
+ ./llama-cli -m gemma3n-qa-v4-fixed-q4_k_m.gguf \
+   -p "You are a helpful assistant...\n\nQuestion: ...\n\nContext:\n..." \
+   --temp 0
+ ```
+
+ ### Use in a RAG Pipeline
+
+ ```python
+ # langchain.llms is deprecated; the Ollama wrapper lives in langchain_community.
+ from langchain_community.llms import Ollama
+
+ llm = Ollama(model="gemma3n-qa-v4-fixed", temperature=0)
+
+ def rag_query(question: str, retrieved_docs: list[str]) -> str:
+     # Join the retrieved passages into a single context block.
+     context = "\n\n".join(retrieved_docs)
+     prompt = f"""You are a helpful assistant that answers questions based on provided context.
+ If the answer is not found in the context, respond with "NOT FOUND IN DOCUMENTS".
+
+ Question: {question}
+
+ Context:
+ {context}"""
+     return llm.invoke(prompt)
+ ```
+
+ ### Use with AnythingLLM
+
+ 1. Import the GGUF into Ollama (see Quick Start)
+ 2. In AnythingLLM, select `gemma3n-qa-v4-fixed` as the model
+ 3. Set system prompt to include the abstention instruction
+ 4. Set temperature to 0
+
+ ---
+
+ ## Limitations
+
+ ### What This Model Does Well
+ - Extracting answers from provided context
+ - Knowing when to abstain ("NOT FOUND IN DOCUMENTS")
+ - Running on CPU-only hardware
+ - Fast inference (4-6 seconds on CPU)
+
+ ### What This Model Does NOT Do
+ - **Generate answers** beyond the context (by design)
+ - **Multi-hop reasoning** requiring external knowledge
+ - **Non-English languages** (trained on English only)
+ - **Long contexts** beyond 4096 tokens
+ - **Multi-turn conversation** (single-turn QA only)
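
Because of the 4096-token context limit, a RAG pipeline may need to cap how much retrieved text it passes in. The sketch below is one simple way to do that, not part of the released model. The helper name `pack_context`, the `reserved` allowance for the instruction, question, and answer, and the whitespace word count used as a stand-in for a real tokenizer are all assumptions. Word counts usually underestimate true token counts, so a real deployment should count with the model's tokenizer.

```python
def pack_context(chunks: list[str], max_tokens: int = 4096, reserved: int = 256) -> str:
    """Greedily keep retrieved chunks, in ranked order, until the budget is used up."""
    budget = max_tokens - reserved
    kept: list[str] = []
    used = 0
    for chunk in chunks:
        cost = len(chunk.split())  # crude proxy: one token per whitespace-separated word
        if used + cost > budget:
            break  # stop at the first chunk that does not fit, preserving rank order
        kept.append(chunk)
        used += cost
    return "\n\n".join(kept)


# Tiny budget to show the behavior: the third chunk would exceed six "tokens".
print(repr(pack_context(["a b c", "d e", "f g h i"], max_tokens=6, reserved=0)))
# 'a b c\n\nd e'
```

Stopping at the first chunk that does not fit keeps the highest-ranked passages intact rather than truncating one mid-sentence.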
+
+ ### Known Issues
+ - Requires specific prompt format for abstention
+ - ~2% quality loss from Q4_K_M quantization
+ - May struggle with heavily paraphrased answers
+
+ ---
+
+ ## Files
+
+ | File | Size | Description |
+ |------|------|-------------|
+ | `gemma3n-qa-v4-fixed-q4_k_m.gguf` | 7.68 GB | Main model (Q4_K_M quantization) |
+
+ ---
+
+ ## Citation
+
+ ```bibtex
+ @misc{gemma3n-qa-v4-fixed-2025,
+   author = {Do Rosario, Alden},
+   title = {gemma3n-qa-v4-fixed: Fine-tuned Gemma 3n for Document-Grounded QA with Abstention},
+   year = {2025},
+   publisher = {HuggingFace},
+   url = {https://huggingface.co/adorosario/gemma3n-qa-v4-fixed},
+   note = {Fine-tuned for extractive QA with learned abstention behavior}
+ }
+ ```
+
+ ---
+
+ ## Related Resources
+
+ - **Training Dataset**: [adorosario/gemma3n-qa-synthetic](https://huggingface.co/datasets/adorosario/gemma3n-qa-synthetic)
+ - **Base Model**: [Google Gemma 3n](https://ai.google.dev/gemma/docs/gemma-3n)
+ - **Training Framework**: [Unsloth](https://github.com/unslothai/unsloth)
+
+ ---
+
+ ## Acknowledgments
+
+ - Google for the Gemma 3n base model
+ - Unsloth team for efficient fine-tuning tools
+ - OpenAI for GPT-4o-mini used in synthetic data generation