---
license: mit
language:
- en
tags:
- code
- finance
datasets:
- mlabonne/FineTome-100k
- leeroy-jankins/Regulations
- leeroy-jankins/Appropriations
- leeroy-jankins/OMB-Circular-A-11
- leeroy-jankins/RedBook
- leeroy-jankins/SF133
- leeroy-jankins/US-General-Ledger
- leeroy-jankins/Title-31-CFR-Money-and-Finance
base_model:
- unsloth/gemma-3-1b-it-GGUF
pipeline_tag: text-generation
metrics:
- accuracy
---
<img src="assets/Bro.png" alt="Preview" width="1000"/>

## 🎯 Overview

**Bro** is a fine-tuned variant of the `gemma-3-1b-it` transformer model, optimized for contextual comprehension, instruction following, and domain-specific reasoning. It was trained with supervised instruction tuning across multiple NLP domains, with a focus on factual recall, multi-step reasoning, and document comprehension.
Built on the lightweight `Gemma 3 1B` architecture, **Bro** balances inference speed and linguistic depth, which makes it suitable for both production deployment and academic research.


## 🚀 Use with Streamlit

<a href="https://bro-py.streamlit.app/" target="_parent">
<img src="https://img.shields.io/badge/Streamlit-App-FF4B4B?logo=streamlit&logoColor=white" alt="Open In Streamlit"/></a>

<img src="assets/Bro-streamlit.gif" alt="Preview" width="1000"/>

## 🧩 Code Repository

<a href="https://github.com/is-leeroy-jenkins/Bro?tab=readme-ov-file#bro" target="_parent">
<img src="https://img.shields.io/badge/github-repo-blue?logo=github" alt="GitHub Repo"/></a>


## ⚙️ Vectorized Datasets
> Vectorization converts text into numerical vectors and is usually applied once the text has been cleaned.
> Working with vectors can speed up execution and reduce training time.
> BudgetPy provides the following vector stores on the OpenAI platform to support environmental data analysis with machine learning:
- [Appropriations](https://huggingface.co/datasets/leeroy-jankins/Appropriations) - Enacted appropriations from 1996-2024, available for fine-tuning machine-learning models
- [Regulations](https://huggingface.co/datasets/leeroy-jankins/Regulations/tree/main) - Collection of federal regulations on the use of appropriated funds
- [SF-133](https://huggingface.co/datasets/leeroy-jankins/SF133) - The Report on Budget Execution and Budgetary Resources
- [Balances](https://huggingface.co/datasets/leeroy-jankins/Balances) - U.S. federal agency Account Balances (File A) submitted under the DATA Act of 2014
- [Outlays](https://huggingface.co/datasets/leeroy-jankins/Outlays) - The actual disbursements of funds by the U.S. federal government from 1962 to 2025
- [Circular A11](https://huggingface.co/datasets/leeroy-jankins/OMB-Circular-A-11) - Guidance from OMB on the preparation, submission, and execution of the federal budget
- [Fastbook](https://huggingface.co/datasets/leeroy-jankins/FastBook) - Treasury guidance on federal ledger accounts
- [Title 31 CFR](https://huggingface.co/datasets/leeroy-jankins/Title-31-CFR-Money-and-Finance) - Money & Finance
- [Redbook](https://huggingface.co/datasets/leeroy-jankins/RedBook) - The Principles of Appropriations Law (Volumes I & II).
- [US Standard General Ledger](https://huggingface.co/datasets/leeroy-jankins/US-General-Ledger) - Account Definitions
- [Treasury Appropriation Fund Symbols (TAFSs) Dataset](https://huggingface.co/datasets/leeroy-jankins/Accounts) - Collection of TAFSs used by federal agencies
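
To make the vectorization note above concrete, here is a minimal sketch in plain Python. It uses simple term counts and cosine similarity; this is an illustrative stand-in, not the embedding method used to build the vector stores listed above, and the sample sentences are invented for the example.

```python
import math
from collections import Counter


def vectorize(text: str, vocabulary: list[str]) -> list[int]:
    """Map text to a term-count vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine_similarity(a: list[int], b: list[int]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy documents (invented for illustration, not drawn from the datasets above).
docs = [
    "appropriations fund agency programs",
    "outlays are disbursements of funds",
    "agency programs receive appropriations",
]
vocabulary = sorted({word for doc in docs for word in doc.split()})
vectors = [vectorize(doc, vocabulary) for doc in docs]

print(cosine_similarity(vectors[0], vectors[2]))  # documents share three terms
print(cosine_similarity(vectors[0], vectors[1]))  # documents share no terms
```

Production pipelines typically replace the count vectors with dense embeddings, but the similarity step works the same way.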


## ✨ Features

| Feature                     | Description                                                                 |
|----------------------------|-----------------------------------------------------------------------------|
| 🔍 **Instruction-Tuned**   | Fine-tuned on a diverse corpus of natural language tasks for generalization |
| 📚 **Multi-Domain**        | Trained on QA, summarization, reasoning, and code synthesis datasets        |
| ⚡ **Optimized for RAG**    | Performs well when integrated with retrieval-augmented generation pipelines |
| 🧩 **Multi-Turn Dialogue** | Supports coherent conversations with context memory                         |
| 🧠 **Compact Intelligence**| 1B parameter scale enables fast inference on consumer GPUs                  |

---

## 🧪 Intended Use

Bro is intended for use in:

- Knowledge retrieval systems (RAG)
- Instruction following assistants
- Legal/financial document understanding
- Open-ended question answering
- Text generation and summarization
- Fine-tuning foundation for further specialization

---

## 🔬 Technical Details

### Base Model

- **Model**: `gemma-3-1b-it`
- **Parameters**: ~1.1 Billion
- **Architecture**: Transformer decoder-only
- **Tokenizer**: SentencePiece (~262k vocab)
- **Positional Encoding**: Rotary (RoPE)
- **Attention**: Grouped-Query Attention (GQA)
- **Training Framework**: PyTorch / Hugging Face Transformers

## ⚙️ Fine-Tuning

| Property                    | Value                                                  |
|----------------------------|--------------------------------------------------------|
| Dataset Composition        | 60% OpenAssistant-style instructions, 20% legal+financial, 10% reasoning chains, 10% dialogues |
| Optimization Strategy      | Supervised fine-tuning (SFT)                           |
| Epochs                     | 3                                                      |
| Optimizer                  | AdamW                                                  |
| Scheduler                  | Cosine decay with warmup                               |
| Mixed Precision            | FP16                                                   |
| Context Window             | 8192 tokens                                            |
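
The "cosine decay with warmup" scheduler in the table can be sketched as a plain function of the step count. The peak learning rate, warmup length, and total steps below are hypothetical values chosen for illustration; the card does not state the ones used to train Bro.

```python
import math


def lr_at_step(step: int, total_steps: int, warmup_steps: int,
               peak_lr: float, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear ramp from 0 to peak_lr over the warmup phase.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings: 1000 steps, 100 warmup steps, peak LR of 2e-5.
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000, warmup_steps=100, peak_lr=2e-5))
```

The rate climbs linearly during warmup, reaches its peak at the last warmup step, and then follows half a cosine wave down to `min_lr`.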

---

## 🧪 Benchmark Results

| Task                      | Metric            | Bro (Ours) | Base gemma-3-1b |
|--------------------------|-------------------|------------|-----------------|
| ARC Challenge (25-shot)  | Accuracy (%)       | 71.3       | 64.5            |
| NaturalQuestions (RAG)   | EM/F1              | 51.7 / 63.9| 44.2 / 56.8     |
| GSM8K (reasoning)        | Accuracy (%)       | 62.5       | 52.0            |
| Summarization (CNN/DM)   | ROUGE-L            | 42.1       | 37.6            |
| MMLU (5-shot, avg)       | Accuracy (%)       | 56.2       | 48.8            |

> 🧠 Fine-tuned Bro outperforms base Gemma across all tasks, especially multi-hop reasoning and retrieval QA.
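
For readers unfamiliar with the EM/F1 column, the sketch below shows the SQuAD-style definitions commonly used for extractive QA: exact match after normalization, and token-level F1. It is an assumption that the reported numbers follow this convention; the card does not name the scoring script, and the example strings are invented.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Budget Office.", "budget office"))
print(token_f1("office of management and budget", "management and budget"))
```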

---

## 🚀 Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-org/Bro")
tokenizer = AutoTokenizer.from_pretrained("your-org/Bro")

prompt = "Explain the difference between supervised and unsupervised learning:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=150)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## 🐍 Python (Transformers) — Full Weights

Install

    pip install "transformers>=4.50.0" accelerate torch --upgrade

Load and generate

    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    model_id = "your-namespace/Bro-gemma-3-1b-it-finetuned"  # replace with your repo/path
    tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )

    prompt = (
        "You are a precise assistant specialized in clinical trial summaries.\n"
        "Task: Summarize the following abstract in 4 bullet points, include 1 risk and 1 limitation.\n"
        "Abstract: <paste text here>"
    )
    inputs = tok(prompt, return_tensors="pt").to(model.device)

    out = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,  # temperature/top_p only take effect when sampling
        temperature=0.6,
        top_p=0.9
    )
    print(tok.decode(out[0], skip_special_tokens=True))

Notes

    • device_map="auto" spreads layers across available devices.
    • Prefer BF16 if supported; otherwise FP16. For very small GPUs/CPUs, see the 4-bit example.

---

## 🧩 Python (PEFT) — Adapters on Top of the Base

Install

    pip install "transformers>=4.50.0" peft accelerate torch --upgrade

Load base + LoRA/QLoRA

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel
    import torch

    base_id = "google/gemma-3-1b-it"                 # base model you fine-tuned from
    lora_id = "your-namespace/Bro-gemma-3-1b-adapter" # your adapter repo/path

    tok = AutoTokenizer.from_pretrained(base_id, use_fast=True)
    base = AutoModelForCausalLM.from_pretrained(
        base_id,
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )
    model = PeftModel.from_pretrained(base, lora_id)

    prompt = (
        "You are an enterprise compliance assistant.\n"
        "In JSON, outline a policy review plan with fields: goals[], stakeholders[], risks[], deliverables[]."
    )
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.5, top_p=0.9)
    print(tok.decode(out[0], skip_special_tokens=True))

---

## 💾 4-bit (bitsandbytes) — Memory-Efficient Loading

Install

    pip install "transformers>=4.50.0" accelerate bitsandbytes torch --upgrade

Load

    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    import torch

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )

    model_id = "your-namespace/Bro-gemma-3-1b-it-finetuned"
    tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb,
        device_map="auto"
    )

    prompt = "Explain, in 5 bullets, how to evaluate domain-specific reasoning abilities in LLMs."
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=180, do_sample=True, temperature=0.6, top_p=0.9)
    print(tok.decode(out[0], skip_special_tokens=True))

---

## 🚀 Serve with vLLM (OpenAI-Compatible API)

Install & launch

    pip install vllm
    python -m vllm.entrypoints.openai.api_server \
      --model your-namespace/Bro-gemma-3-1b-it-finetuned \
      --dtype bfloat16 \
      --max-model-len 4096 \
      --port 8000

Call the endpoint (Python)

    import requests, json
    url = "http://localhost:8000/v1/chat/completions"
    headers = {"Content-Type": "application/json"}
    data = {
      "model": "your-namespace/Bro-gemma-3-1b-it-finetuned",
      "messages": [
        {"role": "system", "content": "You are concise and evidence-focused."},
        {"role": "user", "content": "Give a short rubric to score contextual comprehension on legal docs."}
      ],
      "temperature": 0.6,
      "max_tokens": 220,
      "stream": True
    }
    with requests.post(url, headers=headers, data=json.dumps(data), stream=True) as r:
        for line in r.iter_lines():
            if line and line.startswith(b"data: "):
                chunk = line[len(b"data: "):].decode("utf-8")
                if chunk == "[DONE]":
                    break
                print(chunk, flush=True)
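
The loop above prints each raw JSON chunk. If you only want the generated text, one option is to parse each server-sent-event line and pull out the incremental content. The helper below is a sketch under the assumption that the server emits OpenAI-style streaming chunks with text under `choices[0].delta.content`; `extract_delta_text` is a hypothetical name, and the sample lines are hand-written rather than captured from a real server.

```python
import json
from typing import Optional


def extract_delta_text(line: bytes) -> Optional[str]:
    """Return the text fragment carried by one SSE line, or None if there is none."""
    prefix = b"data: "
    if not line.startswith(prefix):
        return None
    payload = line[len(prefix):].decode("utf-8").strip()
    if payload == "[DONE]":
        return None
    event = json.loads(payload)
    choices = event.get("choices") or []
    if not choices:
        return None
    # Streaming chat chunks put incremental text under choices[0].delta.content.
    return choices[0].get("delta", {}).get("content")


# Hand-written sample lines in the OpenAI streaming shape (not real server output).
sample_lines = [
    b'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    b'data: {"choices": [{"index": 0, "delta": {"content": "Score "}}]}',
    b'data: {"choices": [{"index": 0, "delta": {"content": "citations."}}]}',
    b"data: [DONE]",
]
text = "".join(t for t in map(extract_delta_text, sample_lines) if t)
print(text)
```

Inside the request loop, you would call the helper on each `line` and print any non-empty result with `end=""`.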

---

## 📦 Serve with Text Generation Inference (TGI)

Run the server (Docker)

    docker run --gpus all --shm-size 1g -p 8080:80 \
      -e MODEL_ID=your-namespace/Bro-gemma-3-1b-it-finetuned \
      ghcr.io/huggingface/text-generation-inference:latest

Call the server (HTTP)

    curl http://localhost:8080/generate \
      -X POST -d '{
        "inputs": "Outline a domain-specific reasoning test plan for an insurance Q&A bot.",
        "parameters": {"max_new_tokens": 220, "temperature": 0.6, "top_p": 0.9}
      }' \
      -H "Content-Type: application/json"

---

## 🖥️ LM Studio (GGUF workflow)

If you export **Bro** to **GGUF**, you can run it in LM Studio. One typical workflow is:

    1) Convert HF → GGUF with llama.cpp’s conversion script (example; confirm flags for Gemma 3):
       • git clone https://github.com/ggerganov/llama.cpp
       • cd llama.cpp
       • python3 convert_hf_to_gguf.py /path/to/your/Bro-hf-dir --outfile Bro-f32.gguf --outtype f32
    2) Quantize to Q4_K_M (or similar) for local inference (build llama.cpp first; the binary path depends on your build):
       • ./build/bin/llama-quantize Bro-f32.gguf Bro.Q4_K_M.gguf Q4_K_M
    3) Open LM Studio → Local Models → Import → select Bro.Q4_K_M.gguf
    4) In the chat pane, set conservative parameters:
       • Temperature: 0.5–0.7
       • Max new tokens: 128–384
       • (If available) repeat penalty ~1.05–1.15
    5) Prompt example:
       "Summarize the attached clinical guidance in 6 bullets. Include contraindications and monitoring."

Notes

    • Exact conversion flags can differ by model family; verify Gemma-3 options in your llama.cpp version.
    • If you distribute only HF weights, consider LM Studio’s server/backends that accept HF models.

---

## 🧠 Prompt Patterns (Contextual + Domain)

Context-grounded Q&A

    System: You answer strictly using the provided context. If missing, say "I don't know."
    User: Use the context to answer. Keep to 5 bullets.
    Context:
    • <chunk 1 [source/citation]>
    • <chunk 2 [source/citation]>
    Question: <domain question here>

Constrained JSON

    System: Output only valid JSON. No explanation.
    User: Return {"summary":"", "risks":[""], "actions":[""], "open_questions":[""]} for the content.
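
Small models do not always honor a JSON-only instruction, so it helps to validate the reply before using it. The sketch below checks a reply against the four fields from the prompt above; `parse_constrained_json` is a hypothetical helper, and the sample reply is invented.

```python
import json

# Field names and types taken from the prompt pattern above.
EXPECTED_KEYS = {"summary": str, "risks": list, "actions": list, "open_questions": list}


def parse_constrained_json(raw: str) -> dict:
    """Parse model output and check that it matches the schema from the prompt."""
    # Tolerate stray text around the object by taking the outermost {...} span.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(raw[start:end + 1])
    for key, expected_type in EXPECTED_KEYS.items():
        if key not in data:
            raise ValueError(f"missing key: {key}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"{key} should be {expected_type.__name__}")
    return data


# Invented reply with a stray preamble, as small models sometimes produce.
reply = 'Here you go:\n{"summary": "ok", "risks": ["r1"], "actions": [], "open_questions": []}'
print(parse_constrained_json(reply)["risks"])
```

On a `ValueError`, a common recovery is to retry the request at a lower temperature.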

Evaluation rubric (short)

    In 6 bullets, define a rubric to judge contextual comprehension on domain X.
    Use criteria: correctness, citation use, scope, clarity, uncertainty handling, follow-up.

## 📝 Prompt Engineering

No special chat template is strictly required. Use clear instructions and keep prompts concise. For
multi-turn workflows, persist conversation state externally or via your app’s memory/RAG layer.
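
As one way to persist conversation state externally, the sketch below keeps turns in a small Python object and renders the most recent ones into a single prompt. The `Conversation` class, the `User:`/`Assistant:` labels, and the word-count budget are all assumptions made for illustration; a real app would count tokens with the model's tokenizer.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Keeps turns outside the model and renders them into a single prompt."""
    system: str
    max_words: int = 300  # rough budget; whitespace-split words stand in for tokens
    turns: list = field(default_factory=list)  # (role, text) tuples

    def add(self, role: str, text: str) -> None:
        self.turns.append((role, text))

    def render(self) -> str:
        """Build a prompt from the system text plus the most recent turns that fit."""
        kept, used = [], len(self.system.split())
        for role, text in reversed(self.turns):
            cost = len(text.split())
            if kept and used + cost > self.max_words:
                break  # drop older turns first; the latest turn is always kept
            kept.append(f"{role}: {text}")
            used += cost
        return "\n".join([self.system, *reversed(kept), "Assistant:"])


chat = Conversation(system="You are concise.", max_words=8)
chat.add("User", "Summarize section one please.")
chat.add("Assistant", "Section one covers scope and definitions.")
chat.add("User", "And section two?")
print(chat.render())
```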

Example system style

    You are a concise, accurate assistant. Prefer step-by-step reasoning only when needed.
    Cite assumptions and ask for missing constraints.
    
 
- [Guro](https://github.com/is-leeroy-jenkins/Guro?tab=readme-ov-file#guro) is a prompt library designed to supercharge AI agents and assistants with task-specific personas (i.e., total randos).
- From academic writing to financial analysis, technical support, SEO, and beyond, Guro provides precision-crafted prompt templates ready to drop into your LLM workflows.

---

## 📚 Basic RAG  

    # Retrieve k chunks
    chunks = retriever.search("billing code coverage for outpatient procedures", k=5)

    # Build prompt
    context = "\n".join([f"• {c.text} [{c.source}]" for c in chunks])
    prompt = f"""
    You are a helpful domain assistant. Answer only from the context.
    Context:
    {context}

    Question:
    What are the coverage criteria and documentation requirements?
    """

    # Generate (Transformers / vLLM / TGI)
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=220, do_sample=True, temperature=0.5, top_p=0.9)
    print(tok.decode(out[0], skip_special_tokens=True))
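
The snippet above assumes a `retriever` object whose `search(query, k)` method returns items with `.text` and `.source` attributes. No such object is defined in this card, so here is a toy in-memory stand-in with that interface. `Chunk` and `KeywordRetriever` are hypothetical names, and ranking by word overlap is a deliberate simplification of real vector search.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str


class KeywordRetriever:
    """Toy retriever: ranks chunks by how many query words they contain."""

    def __init__(self, chunks: list[Chunk]):
        self.chunks = chunks

    def search(self, query: str, k: int = 5) -> list[Chunk]:
        query_words = set(query.lower().split())
        scored = [
            (len(query_words & set(chunk.text.lower().split())), chunk)
            for chunk in self.chunks
        ]
        # Keep only chunks with at least one overlapping word, best first.
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
        return [chunk for _, chunk in ranked[:k]]


# Invented sample chunks for illustration.
retriever = KeywordRetriever([
    Chunk("Outpatient procedures require prior authorization.", "policy.txt"),
    Chunk("Billing code coverage is reviewed annually.", "billing.txt"),
    Chunk("Office hours are posted online.", "faq.txt"),
])
chunks = retriever.search("billing code coverage for outpatient procedures", k=2)
context = "\n".join(f"• {c.text} [{c.source}]" for c in chunks)
print(context)
```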

### 📁 1. Document Ingestion

    # Requires: pip install langchain-community langchain-text-splitters
    from langchain_community.document_loaders import TextLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    loader = TextLoader("reference_material.txt")
    documents = loader.load()

    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
    docs = splitter.split_documents(documents)
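
To show what `chunk_size` and `chunk_overlap` control, here is a simplified fixed-window splitter in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences); `chunk_text` is a hypothetical helper that only illustrates the size/overlap arithmetic.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
print([len(c) for c in chunk_text("x" * 1200)])
```

With the defaults, each new chunk starts 400 characters after the previous one, so neighbouring chunks share 100 characters of context.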

---

### 🔍 2. Embedding & Vector Indexing

    # Requires: pip install langchain-community sentence-transformers faiss-cpu
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import FAISS

    embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    vectorstore = FAISS.from_documents(docs, embedding)

---

### 🔄 3. Retrieval + Prompt Formatting

    retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
    retrieved_docs = retriever.invoke("How does RAG improve factual accuracy?")

    context = "\n\n".join([doc.page_content for doc in retrieved_docs])

    prompt = f"""
    You are Bro, a domain-aware assistant. Use the retrieved context below to answer accurately:

    <context>
    {context}
    </context>

    <question>
    How does RAG improve factual accuracy?
    </question>
    """

---

### 🧠 4. LLM Inference with Bro

    ./llama-cli -m Bro.Q4_K_M.gguf -p "$prompt" -n 512 -t 8 -c 2048 --color

> The output is Bro's answer grounded in the retrieved context, which helps reduce hallucinations.

---

### 📝 Notes

- **Bro** (gemma-3-1b-it variant) runs efficiently on CPU or with GPU offload via `llama.cpp`.
- All context is explicitly retrieved; no external APIs are involved.
- You can improve results by tuning chunk size, overlap, or using a domain-specific embedding model.

---

## ⚙️ Parameter Tips

    • Temperature: 0.5–0.8 (lower for deterministic policy/summary tasks)
    • Top-p: 0.8–0.95 (tune one knob at a time)
    • Max new tokens: 128–384 for chat; longer for drafts
    • Repeat penalty: 1.05–1.2 if repetition occurs
    • Context length: set to your Bro build; compress with selective retrieval
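
To build intuition for the temperature and top-p knobs above, the sketch below applies both to a toy three-token distribution. It illustrates the general technique, not the exact implementation in any particular inference engine; the logits are made up.

```python
import math


def apply_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits / temperature; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: list[float], top_p: float) -> list[float]:
    """Keep the smallest set of most-likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
base = apply_temperature(logits, 1.0)
sharp = apply_temperature(logits, 0.5)
print([round(p, 3) for p in base])
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in top_p_filter(base, 0.9)])
```

Lowering the temperature moves probability toward the top token, while top-p removes the unlikely tail entirely, which is why tuning one knob at a time is easier to reason about.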

---

## 🛟 Troubleshooting

    • CUDA OOM:
      Lower max_new_tokens; use 4-bit; reduce context; shard across GPUs.
    • Messy JSON:
      Use a JSON-only system prompt; set temperature ≤0.6; include a minimal schema.
    • Weak domain grounding:
      Improve retrieval quality; add citations; constrain scope in the prompt.
    • Inconsistent style:
      Provide one/two-shot examples; pin a style guide in the system message.

## 📝 License
- Bro is published under the [MIT License](https://huggingface.co/leeroy-jankins/bro/blob/main/LICENSE.txt)