---
license: mit
language:
- en
tags:
- code
- finance
datasets:
- mlabonne/FineTome-100k
- leeroy-jankins/Regulations
- leeroy-jankins/Appropriations
- leeroy-jankins/OMB-Circular-A-11
- leeroy-jankins/RedBook
- leeroy-jankins/SF133
- leeroy-jankins/US-General-Ledger
- leeroy-jankins/Title-31-CFR-Money-and-Finance
base_model:
- unsloth/gemma-3-1b-it-GGUF
pipeline_tag: text-generation
metrics:
- accuracy
---
## Overview
**Bro** is an LLM, a fine-tuned variant of the `gemma-3-1b-it` transformer model, optimized for contextual comprehension, instruction following, and domain-specific reasoning. Fine-tuning used supervised instruction tuning across multiple NLP domains, with a focus on factual recall, multi-step reasoning, and document comprehension.
Built on the lightweight `Gemma 3 1B` architecture, **Bro** balances inference speed and linguistic depth, which makes it suitable for both production deployment and academic research.
## Use with Streamlit
## Code Repository
## Vectorized Datasets
> Vectorization converts textual data into numerical vectors and is usually applied once the text has been cleaned.
> It can improve execution speed and reduce the training time of your code.
> BudgetPy provides the following vector stores on the OpenAI platform to support environmental data analysis with machine learning:
- [Appropriations](https://huggingface.co/datasets/leeroy-jankins/Appropriations) - Enacted appropriations from 1996-2024 available for fine-tuning learning models
- [Regulations](https://huggingface.co/datasets/leeroy-jankins/Regulations/tree/main) - Collection of federal regulations on the use of appropriated funds
- [SF-133](https://huggingface.co/datasets/leeroy-jankins/SF133) - The Report on Budget Execution and Budgetary Resources
- [Balances](https://huggingface.co/datasets/leeroy-jankins/Balances) - U.S. federal agency Account Balances (File A) submitted as part of the DATA Act 2014.
- [Outlays](https://huggingface.co/datasets/leeroy-jankins/Outlays) - The actual disbursements of funds by the U.S. federal government from 1962 to 2025
- [Circular A11](https://huggingface.co/datasets/leeroy-jankins/OMB-Circular-A-11) - Guidance from OMB on the preparation, submission, and execution of the federal budget
- [Fastbook](https://huggingface.co/datasets/leeroy-jankins/FastBook) - Treasury guidance on federal ledger accounts
- [Title 31 CFR](https://huggingface.co/datasets/leeroy-jankins/Title-31-CFR-Money-and-Finance) - Money & Finance
- [Redbook](https://huggingface.co/datasets/leeroy-jankins/RedBook) - The Principles of Appropriations Law (Volumes I & II).
- [US Standard General Ledger](https://huggingface.co/datasets/leeroy-jankins/US-General-Ledger) - Account Definitions
- [Treasury Appropriation Fund Symbols (TAFSs) Dataset](https://huggingface.co/datasets/leeroy-jankins/Accounts) - Collection of TAFSs used by federal agencies
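
As a rough illustration of what vectorization means here, the sketch below turns short text snippets into term-count vectors over a shared vocabulary. It uses only the standard library. The sample sentences and the `tokenize`, `build_vocabulary`, and `vectorize` helpers are hypothetical, and this is not how the hosted vector stores were built.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(texts: list[str]) -> list[str]:
    """Collect the sorted set of tokens seen across all texts."""
    return sorted({tok for text in texts for tok in tokenize(text)})


def vectorize(text: str, vocabulary: list[str]) -> list[int]:
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


# Hypothetical sample snippets, for illustration only.
samples = [
    "Appropriations fund agency programs",
    "Outlays are disbursements of appropriations",
]
vocab = build_vocabulary(samples)
vectors = [vectorize(s, vocab) for s in samples]
```

Production pipelines normally replace the count vectors with dense embeddings from a trained model, but the shape of the result is the same: one fixed-length numeric vector per text.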
## Features
| Feature | Description |
|----------------------------|-----------------------------------------------------------------------------|
| **Instruction-Tuned** | Fine-tuned on a diverse corpus of natural language tasks for generalization |
| **Multi-Domain** | Trained on QA, summarization, reasoning, and code synthesis datasets |
| **Optimized for RAG** | Performs well when integrated with retrieval-augmented generation pipelines |
| **Multi-Turn Dialogue** | Supports coherent conversations with context memory |
| **Compact Intelligence** | 1B parameter scale enables fast inference on consumer GPUs |
---
## Intended Use
Bro is intended for use in:
- Knowledge retrieval systems (RAG)
- Instruction following assistants
- Legal/financial document understanding
- Open-ended question answering
- Text generation and summarization
- Fine-tuning foundation for further specialization
---
## Technical Details
### Base Model
- **Model**: `gemma-3-1b-it`
- **Parameters**: ~1.1 Billion
- **Architecture**: Transformer decoder-only
- **Tokenizer**: SentencePiece (~262k vocab)
- **Positional Encoding**: Rotary (RoPE)
- **Attention**: Grouped-Query Attention (GQA)
- **Training Framework**: PyTorch / Hugging Face Transformers
## Fine-Tuning
| Property | Value |
|----------------------------|--------------------------------------------------------|
| Dataset Composition | 60% OpenAssistant-style instructions, 20% legal+financial, 10% reasoning chains, 10% dialogues |
| Optimization Strategy | Supervised fine-tuning (SFT) |
| Epochs | 3 |
| Optimizer | AdamW |
| Scheduler | Cosine decay with warmup |
| Mixed Precision | FP16 |
| Context Window | 8192 tokens |
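
The "cosine decay with warmup" scheduler in the table can be sketched as a plain function of the training step. This is a minimal illustration, assuming linear warmup followed by a single cosine half-cycle. The step counts and peak learning rate below are hypothetical, since the card does not state the values actually used.

```python
import math


def lr_at_step(step: int, total_steps: int, warmup_steps: int,
               peak_lr: float, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical hyperparameters; the real values are not given in this card.
schedule = [
    lr_at_step(s, total_steps=1000, warmup_steps=100, peak_lr=2e-4)
    for s in range(1000)
]
```

Warmup keeps early updates small while optimizer statistics settle, and the cosine tail lowers the rate smoothly instead of in discrete drops.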
---
## Benchmark Results
| Task | Metric | Bro (Ours) | Base gemma-3-1b |
|--------------------------|-------------------|------------|-----------------|
| ARC Challenge (25-shot) | Accuracy (%) | 71.3 | 64.5 |
| NaturalQuestions (RAG) | EM/F1 | 51.7 / 63.9| 44.2 / 56.8 |
| GSM8K (reasoning) | Accuracy (%) | 62.5 | 52.0 |
| Summarization (CNN/DM) | ROUGE-L | 42.1 | 37.6 |
| MMLU (5-shot, avg) | Accuracy (%) | 56.2 | 48.8 |
> Fine-tuned Bro outperforms base Gemma across all tasks, especially multi-hop reasoning and retrieval QA.
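
The card does not say how the EM/F1 figures were computed. A common convention for question answering is SQuAD-style scoring: normalize both strings, then compare them exactly (EM) or by token overlap (F1). The sketch below assumes that convention; the function names are illustrative, and it is not the evaluation script behind the table.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Averaging these per-example scores over a test set and multiplying by 100 gives figures on the same scale as the table.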
---
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("your-org/Bro")
tokenizer = AutoTokenizer.from_pretrained("your-org/Bro")
prompt = "Explain the difference between supervised and unsupervised learning:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=150)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Python (Transformers): Full Weights
Install
```bash
pip install "transformers>=4.50.0" accelerate torch --upgrade
```
Load and generate
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "your-namespace/Bro-gemma-3-1b-it-finetuned"  # replace with your repo/path
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = (
    "You are a precise assistant specialized in clinical trial summaries.\n"
    "Task: Summarize the following abstract in 4 bullet points, include 1 risk and 1 limitation.\n"
    "Abstract: "
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,  # make sampling explicit so temperature/top_p take effect
    temperature=0.6,
    top_p=0.9,
)
print(tok.decode(out[0], skip_special_tokens=True))
```
Notes
- `device_map="auto"` spreads layers across available devices.
- Prefer BF16 if supported; otherwise FP16. For very small GPUs/CPUs, see the 4-bit example.
---
## Python (PEFT): Adapters on Top of the Base
Install
```bash
pip install "transformers>=4.50.0" peft accelerate torch --upgrade
```
Load base + LoRA/QLoRA
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_id = "google/gemma-3-1b-it"                   # base model you fine-tuned from
lora_id = "your-namespace/Bro-gemma-3-1b-adapter"  # your adapter repo/path
tok = AutoTokenizer.from_pretrained(base_id, use_fast=True)
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Attach the adapter weights on top of the frozen base model.
model = PeftModel.from_pretrained(base, lora_id)

prompt = (
    "You are an enterprise compliance assistant.\n"
    "In JSON, outline a policy review plan with fields: goals[], stakeholders[], risks[], deliverables[]."
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.5, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
```
---
## 4-bit (bitsandbytes): Memory-Efficient Loading
Install
```bash
pip install "transformers>=4.50.0" accelerate bitsandbytes --upgrade
```
Load
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# NF4 quantization with double quantization; matmuls run in bfloat16.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_id = "your-namespace/Bro-gemma-3-1b-it-finetuned"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb,
    device_map="auto",
)

prompt = "Explain, in 5 bullets, how to evaluate domain-specific reasoning abilities in LLMs."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=180, do_sample=True, temperature=0.6, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
```
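
To see why 4-bit loading helps, it can be useful to estimate the memory taken by the weights alone at different precisions. The sketch below is back-of-the-envelope arithmetic, assuming the roughly 1.1 billion parameters quoted in Technical Details. It ignores activations, the KV cache, and the small overhead of quantization constants, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / (1024 ** 3)


PARAMS = 1.1e9  # approximate parameter count quoted in this card

for label, bits in [("FP32", 32), ("BF16/FP16", 16), ("NF4 (4-bit)", 4)]:
    print(f"{label:>12}: {weight_memory_gib(PARAMS, bits):.2f} GiB")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts the weight footprint to about a quarter.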
---
## Serve with vLLM (OpenAI-Compatible API)
Install & launch
```bash
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model your-namespace/Bro-gemma-3-1b-it-finetuned \
  --dtype bfloat16 \
  --max-model-len 4096 \
  --port 8000
```
Call the endpoint (Python)
```python
import json

import requests

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
    "model": "your-namespace/Bro-gemma-3-1b-it-finetuned",
    "messages": [
        {"role": "system", "content": "You are concise and evidence-focused."},
        {"role": "user", "content": "Give a short rubric to score contextual comprehension on legal docs."},
    ],
    "temperature": 0.6,
    "max_tokens": 220,
    "stream": True,
}
# Stream the response and print each server-sent event payload as it arrives.
with requests.post(url, headers=headers, data=json.dumps(data), stream=True) as r:
    for line in r.iter_lines():
        if line and line.startswith(b"data: "):
            chunk = line[len(b"data: "):].decode("utf-8")
            if chunk == "[DONE]":
                break
            print(chunk, flush=True)
```
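
The streaming loop above prints each raw JSON payload. To rebuild the generated text, each payload can be decoded and its `delta.content` fragment appended. The helper below is a minimal sketch, assuming the OpenAI-style chunk layout (`choices[0].delta.content`). The `extract_delta` name and the sample lines are hypothetical.

```python
import json
from typing import Optional


def extract_delta(line: bytes) -> Optional[str]:
    """Return the text fragment carried by one streamed line, or None."""
    prefix = b"data: "
    if not line.startswith(prefix):
        return None
    payload = line[len(prefix):].decode("utf-8")
    if payload.strip() == "[DONE]":
        return None
    event = json.loads(payload)
    choices = event.get("choices") or []
    if not choices:
        return None
    return choices[0].get("delta", {}).get("content")


# Hypothetical sample lines shaped like OpenAI-style streaming chunks.
sample_lines = [
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    b'data: {"choices": [{"delta": {"content": "Score "}}]}',
    b'data: {"choices": [{"delta": {"content": "accuracy first."}}]}',
    b"data: [DONE]",
]
text = "".join(piece for piece in map(extract_delta, sample_lines) if piece)
```

In the request loop, the same helper could be applied to each line from `r.iter_lines()` in place of printing the raw chunk.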
---
## Serve with Text Generation Inference (TGI)
Run the server (Docker)
```bash
docker run --gpus all --shm-size 1g -p 8080:80 \
  -e MODEL_ID=your-namespace/Bro-gemma-3-1b-it-finetuned \
  ghcr.io/huggingface/text-generation-inference:latest
```
Call the server (HTTP)
```bash
curl http://localhost:8080/generate \
  -X POST -d '{
    "inputs": "Outline a domain-specific reasoning test plan for an insurance Q&A bot.",
    "parameters": {"max_new_tokens": 220, "temperature": 0.6, "top_p": 0.9}
  }' \
  -H "Content-Type: application/json"
```
---
## LM Studio (GGUF workflow)
If you export **Bro** to **GGUF**, you can run it in LM Studio. One typical workflow is:
1) Convert HF → GGUF with llama.cpp's conversion script (example; confirm flags for Gemma 3):
   - `git clone https://github.com/ggerganov/llama.cpp`
   - `cd llama.cpp`
   - `python3 convert_hf_to_gguf.py /path/to/your/Bro-hf-dir --outfile Bro-f32.gguf --outtype f32`
2) Quantize to Q4_K_M (or similar) for local inference:
   - `./llama-quantize Bro-f32.gguf Bro.Q4_K_M.gguf Q4_K_M`
3) Open LM Studio → Local Models → Import → select `Bro.Q4_K_M.gguf`
4) In the chat pane, set conservative parameters:
   - Temperature: 0.5-0.7
   - Max new tokens: 128-384
   - (If available) repeat penalty ~1.05-1.15
5) Prompt example:
   "Summarize the attached clinical guidance in 6 bullets. Include contraindications and monitoring."

Notes
- Exact conversion flags and tool names can differ by model family and release; verify Gemma-3 options in your llama.cpp version.
- If you distribute only HF weights, consider LM Studio's server/backends that accept HF models.
---
## Prompt Patterns (Contextual + Domain)
Context-grounded Q&A
```text
System: You answer strictly using the provided context. If missing, say "I don't know."
User: Use the context to answer. Keep to 5 bullets.
Context:
•
•
Question:
```
Constrained JSON
```text
System: Output only valid JSON. No explanation.
User: Return {"summary":"", "risks":[""], "actions":[""], "open_questions":[""]} for the content.
```
Evaluation rubric (short)
```text
In 6 bullets, define a rubric to judge contextual comprehension on domain X.
Use criteria: correctness, citation use, scope, clarity, uncertainty handling, follow-up.
```
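
Even with a JSON-only system prompt, a small model may wrap its answer in extra text or omit a field, so it usually pays to validate the reply before using it. The sketch below checks the top-level shape of the constrained-JSON pattern shown above. `parse_structured_reply` and `REQUIRED_KEYS` are illustrative names, and the check is deliberately shallow (field presence and type only).

```python
import json

# Expected top-level fields and their types, taken from the constrained-JSON pattern.
REQUIRED_KEYS = {"summary": str, "risks": list, "actions": list, "open_questions": list}


def parse_structured_reply(reply: str) -> dict:
    """Parse the span from the first '{' to the last '}' and check its shape."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start:end + 1])
    for key, expected_type in REQUIRED_KEYS.items():
        if not isinstance(data.get(key), expected_type):
            raise ValueError(f"missing or mistyped field: {key}")
    return data
```

A caller can catch `ValueError` (which also covers `json.JSONDecodeError`) and retry the generation at a lower temperature.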
## Prompt Engineering
No special chat template is strictly required. Use clear instructions and keep prompts concise. For multi-turn workflows, persist conversation state externally or via your app's memory/RAG layer.
Example system style
```text
You are a concise, accurate assistant. Prefer step-by-step reasoning only when needed.
Cite assumptions and ask for missing constraints.
```
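
One way to persist conversation state externally is to keep the turns in your application and render them into a single prompt on every call. The class below is a minimal sketch of that idea. The `Conversation` name, the `max_turns` cap, and the plain `System:/User:/Assistant:` layout are assumptions chosen for illustration, not a format the model requires.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Keeps turns outside the model and renders them into one prompt."""
    system: str
    max_turns: int = 6  # hypothetical cap on retained turns
    turns: list = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        self.turns.append((role, content))
        # Drop the oldest turns once the cap is exceeded.
        self.turns = self.turns[-self.max_turns:]

    def render(self) -> str:
        lines = [f"System: {self.system}"]
        lines += [f"{role.capitalize()}: {content}" for role, content in self.turns]
        lines.append("Assistant:")
        return "\n".join(lines)
```

Each generated reply would be stored with `add("assistant", reply)` before the next user turn, so the model sees the recent history without holding any state itself.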
- [Guro](https://github.com/is-leeroy-jenkins/Guro?tab=readme-ov-file#guro) is a prompt library designed to supercharge AI agents and assistants with task-specific personas (i.e., total randos).
- It covers academic writing, financial analysis, technical support, SEO, and beyond.
- Guro provides precision-crafted prompt templates ready to drop into your LLM workflows.
---
## Basic RAG
```python
# Assumes `retriever`, `tok`, and `model` already exist (see the loading examples above).

# Retrieve k chunks
chunks = retriever.search("billing code coverage for outpatient procedures", k=5)

# Build prompt
context = "\n".join([f"• {c.text} [{c.source}]" for c in chunks])
prompt = f"""
You are a helpful domain assistant. Answer only from the context.
Context:
{context}
Question:
What are the coverage criteria and documentation requirements?
"""

# Generate (Transformers / vLLM / TGI)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=220, do_sample=True, temperature=0.5, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
```
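
The snippet above assumes a `retriever` object whose `search(query, k)` method returns items with `.text` and `.source` attributes. For experimentation without a vector database, a toy stand-in could look like the sketch below. `Chunk`, `KeywordRetriever`, and the sample corpus are hypothetical, and ranking by shared keywords is far cruder than embedding-based retrieval.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str


class KeywordRetriever:
    """Toy retriever: ranks chunks by how many query words they share."""

    def __init__(self, chunks: list[Chunk]):
        self.chunks = chunks

    @staticmethod
    def _words(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def search(self, query: str, k: int = 5) -> list[Chunk]:
        query_words = self._words(query)
        scored = [(len(query_words & self._words(c.text)), c) for c in self.chunks]
        scored = [pair for pair in scored if pair[0] > 0]  # drop non-matching chunks
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [chunk for _, chunk in scored[:k]]


# Hypothetical corpus, for illustration only.
retriever = KeywordRetriever([
    Chunk("Outpatient procedures require prior authorization.", "policy.txt"),
    Chunk("Billing code coverage depends on documentation.", "billing.txt"),
    Chunk("Office hours are nine to five.", "misc.txt"),
])
chunks = retriever.search("billing code coverage for outpatient procedures", k=5)
```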
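### 1. Document Ingestion
```python
# Recent LangChain releases ship loaders and splitters in separate packages:
#   pip install langchain-community langchain-text-splitters
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = TextLoader("reference_material.txt")
documents = loader.load()

# Split into ~500-character chunks that overlap by 100 characters.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
docs = splitter.split_documents(documents)
```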
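
To make the `chunk_size` and `chunk_overlap` settings concrete, the sketch below slices a string into fixed-size character windows that share an overlap. It is a simplified stand-in for illustration: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences, which this hypothetical `split_with_overlap` helper does not do.

```python
def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Slice text into fixed-size character windows that share an overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Hypothetical 1200-character input made of repeating digits.
sample_text = "".join(str(i % 10) for i in range(1200))
pieces = split_with_overlap(sample_text)
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of some duplicated text in the index.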
---
### 2. Embedding & Vector Indexing
```python
# Requires: pip install langchain-community sentence-transformers faiss-cpu
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(docs, embedding)
```
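
Conceptually, a flat vector index stores one embedding per chunk and answers a query by scoring every stored vector against the query vector. The sketch below shows that brute-force search in plain Python. Cosine similarity is an assumption made for clarity (a FAISS index may use a different metric, such as L2 distance), and the three-dimensional vectors are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], index: dict[str, list[float]], k: int = 4) -> list[tuple[str, float]]:
    """Brute-force nearest neighbours by cosine similarity."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical 3-dimensional embeddings; real models emit hundreds of dimensions.
index = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
hits = top_k([1.0, 0.1, 0.0], index, k=2)
```

Libraries such as FAISS implement the same idea with optimized kernels and, optionally, approximate search structures for large collections.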
---
### 3. Retrieval + Prompt Formatting
```python
query = "How does RAG improve factual accuracy?"

retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
# `invoke` supersedes the deprecated `get_relevant_documents` in recent LangChain releases.
retrieved_docs = retriever.invoke(query)

context = "\n\n".join([doc.page_content for doc in retrieved_docs])
prompt = f"""
You are Bro, a domain-aware assistant. Use the retrieved context below to answer accurately:
{context}
{query}
"""
```
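
Retrieved passages are pasted into the prompt verbatim, so a large `k` or long chunks can push the prompt past the model's context window. One simple guard is to keep passages in ranked order until a budget is spent. The sketch below uses whitespace-separated words as a crude proxy for tokens, which is an assumption; `fit_context` is a hypothetical helper, and a real deployment would count tokens with the model's tokenizer.

```python
def fit_context(passages: list[str], max_words: int) -> str:
    """Join passages in ranked order until the word budget is spent."""
    kept, used = [], 0
    for passage in passages:
        n_words = len(passage.split())
        if used + n_words > max_words:
            break  # stop at the first passage that would exceed the budget
        kept.append(passage)
        used += n_words
    return "\n\n".join(kept)
```

For example, `fit_context([doc.page_content for doc in retrieved_docs], max_words=1200)` could replace the plain join when building the context string; the 1200-word budget is an arbitrary illustration.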
---
### 4. LLM Inference with Bro
```bash
./llama-cli -m Bro.Q4_K_M.gguf -p "$prompt" -n 512 -t 8 -c 2048 --color
```
> The output will be Bro's grounded and concise answer, using the retrieved context to reduce hallucinations.
---
### Notes
- **Bro** (gemma-3-1b-it variant) runs efficiently on CPU or with GPU offload via `llama.cpp`.
- All context is explicitly retrieved; no external APIs are involved.
- You can improve results by tuning chunk size, overlap, or using a domain-specific embedding model.
---
## Parameter Tips
- Temperature: 0.5-0.8 (lower for deterministic policy/summary tasks)
- Top-p: 0.8-0.95 (tune one knob at a time)
- Max new tokens: 128-384 for chat; longer for drafts
- Repeat penalty: 1.05-1.2 if repetition occurs
- Context length: set to your Bro build; compress with selective retrieval
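
To build intuition for how temperature and top-p interact, the sketch below applies both to a tiny, hypothetical set of next-token logits. It assumes the usual definitions: logits are divided by the temperature before the softmax, and top-p keeps the smallest set of most probable tokens whose cumulative probability reaches `top_p`. The `nucleus` and `sample` helpers are illustrative and are not the code path used by any particular inference library.

```python
import math
import random
from typing import Optional


def nucleus(logits: dict[str, float], temperature: float, top_p: float) -> dict[str, float]:
    """Return renormalized probabilities kept after temperature and top-p."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda item: item[1], reverse=True)
    kept, cumulative = {}, 0.0
    for tok, prob in ranked:
        kept[tok] = prob
        cumulative += prob
        if cumulative >= top_p:
            break
    return {tok: prob / cumulative for tok, prob in kept.items()}


def sample(logits: dict[str, float], temperature: float = 0.6,
           top_p: float = 0.9, rng: Optional[random.Random] = None) -> str:
    """Draw one token from the filtered distribution."""
    rng = rng or random.Random()
    kept = nucleus(logits, temperature, top_p)
    return rng.choices(list(kept), weights=list(kept.values()), k=1)[0]


# Hypothetical next-token logits, for illustration only.
logits = {"the": 2.0, "a": 1.0, "zebra": -3.0}
```

Lowering the temperature concentrates probability on the top token, while lowering top-p removes the unlikely tail entirely, which is why changing one knob at a time makes the effect easier to read.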
---
## Troubleshooting
- **CUDA OOM:** Lower `max_new_tokens`; use 4-bit; reduce context; shard across GPUs.
- **Messy JSON:** Use a JSON-only system prompt; set temperature ≤0.6; include a minimal schema.
- **Weak domain grounding:** Improve retrieval quality; add citations; constrain scope in the prompt.
- **Inconsistent style:** Provide one/two-shot examples; pin a style guide in the system message.
## License
- Bro is published under the [MIT License](https://huggingface.co/leeroy-jankins/bro/blob/main/LICENSE.txt).