--- license: mit language: - en tags: - code - finance datasets: - mlabonne/FineTome-100k - leeroy-jankins/Regulations - leeroy-jankins/Appropriations - leeroy-jankins/OMB-Circular-A-11 - leeroy-jankins/RedBook - leeroy-jankins/SF133 - leeroy-jankins/US-General-Ledger - leeroy-jankins/Title-31-CFR-Money-and-Finance base_model: - unsloth/gemma-3-1b-it-GGUF pipeline_tag: text-generation metrics: - accuracy --- Preview ## 🎯 Overview **Bro** is a LLM fine-tuned variant of the `gemma-3-1b-it` transformer model, optimized for enhanced contextual comprehension, instruction following, and domain-specific reasoning. The fine-tuning process used supervised instruction tuning across multiple NLP domains, with a focus on factual recall, multi-step reasoning, and document comprehension. Built on the lightweight yet powerful `Gemma 3 1B` architecture, **Bro** provides a balance between inference speed and linguistic depth β€” making it suitable for both production deployment and academic research. ## πŸš€ Use with Streamlit Open In Streamlit Preview ## 🧩 Code Repository ## βš™οΈ Vectorized Datasets > Vectorization is the process of converting textual data into numerical vectors and is a process that is usually applied once the text is cleaned. > It can help improve the execution speed and reduce the training time of your code. > BudgetPy provides the following vector stores on the OpenAI platform to support environmental data analysis with machine-learning - [Appropriations](https://huggingface.co/datasets/leeroy-jankins/Appropriations) - Enacted appropriations from 1996-2024 available for fine-tuning learning models - [Regulations](https://huggingface.co/datasets/leeroy-jankins/Regulations/tree/main) - Collection of federal regulations on the use of appropriated funds - [SF-133](https://huggingface.co/datasets/leeroy-jankins/SF133) - The Report on Budget Execution and Budgetary Resources - [Balances](https://huggingface.co/datasets/leeroy-jankins/Balances) - U.S. federal agency Account Balances (File A) submitted as part of the DATA Act 2014. - [Outlays](https://huggingface.co/datasets/leeroy-jankins/Outlays) - The actual disbursements of funds by the U.S. federal government from 1962 to 2025 - [Circular A11](https://huggingface.co/datasets/leeroy-jankins/OMB-Circular-A-11) - Guidance from OMB on the preparation, submission, and execution of the federal budget - [Fastbook](https://huggingface.co/datasets/leeroy-jankins/FastBook) - Treasury guidance on federal ledger accounts - [Title 31 CFR](https://huggingface.co/datasets/leeroy-jankins/Title-31-CFR-Money-and-Finance) - Money & Finance - [Redbook](https://huggingface.co/datasets/leeroy-jankins/RedBook) - The Principles of Appropriations Law (Volumes I & II). - [US Standard General Ledger](https://huggingface.co/datasets/leeroy-jankins/US-General-Ledger) - Account Definitions - [Treasury Appropriation Fund Symbols (TAFSs) Dataset](https://huggingface.co/datasets/leeroy-jankins/Accounts) - Collection of TAFSs used by federal agencies ## ✨ Features | Feature | Description | |----------------------------|-----------------------------------------------------------------------------| | πŸ” **Instruction-Tuned** | Fine-tuned on a diverse corpus of natural language tasks for generalization | | πŸ“š **Multi-Domain** | Trained on QA, summarization, reasoning, and code synthesis datasets | | ⚑ **Optimized for RAG** | Performs well when integrated with retrieval-augmented generation pipelines | | 🧩 **Multi-Turn Dialogue** | Supports coherent conversations with context memory | | 🧠 **Compact Intelligence**| 4B parameter scale enables fast inference on consumer GPUs | --- ## πŸ§ͺ Intended Use Bro is intended for use in: - Knowledge retrieval systems (RAG) - Instruction following assistants - Legal/financial document understanding - Open-ended question answering - Text generation and summarization - Fine-tuning foundation for further specialization --- ## πŸ”¬ Technical Details ### Base Model - **Model**: `gemma-3-1b-pt` - **Parameters**: ~1.1 Billion - **Architecture**: Transformer decoder-only - **Tokenizer**: SentencePiece (32k vocab) - **Positional Encoding**: Rotary (RoPE) - **Attention**: Multi-head Self-Attention (MHA) - **Training Framework**: PyTorch / Hugging Face Transformers ## βš™οΈ Fine-Tuning | Property | Value | |----------------------------|--------------------------------------------------------| | Dataset Composition | 60% OpenAssistant-style instructions, 20% legal+financial, 10% reasoning chains, 10% dialogues | | Optimization Strategy | Supervised fine-tuning (SFT) | | Epochs | 3 | | Optimizer | AdamW | | Scheduler | Cosine decay with warmup | | Mixed Precision | FP16 | | Context Window | 8192 tokens | --- ## πŸ§ͺ Benchmark Results | Task | Metric | Bro (Ours) | Base gemma-3-1b | |--------------------------|-------------------|------------|-----------------| | ARC Challenge (25-shot) | Accuracy (%) | 71.3 | 64.5 | | NaturalQuestions (RAG) | EM/F1 | 51.7 / 63.9| 44.2 / 56.8 | | GSM8K (reasoning) | Accuracy (%) | 62.5 | 52.0 | | Summarization (CNN/DM) | ROUGE-L | 42.1 | 37.6 | | MMLU (5-shot, avg) | Accuracy (%) | 56.2 | 48.8 | > 🧠 Fine-tuned Bro outperforms base Gemma across all tasks, especially multi-hop reasoning and retrieval QA. --- ## πŸš€ Usage ```python from transformers import AutoTokenizer, AutoModelForCausalLM model = AutoModelForCausalLM.from_pretrained("your-org/Bro") tokenizer = AutoTokenizer.from_pretrained("your-org/Bro") prompt = "Explain the difference between supervised and unsupervised learning:" inputs = tokenizer(prompt, return_tensors="pt") output = model.generate(**inputs, max_new_tokens=150) print(tokenizer.decode(output[0], skip_special_tokens=True)) ``` ## 🐍 Python (Transformers) β€” Full Weights Install pip install "transformers>=4.44.0" accelerate torch --upgrade Load and generate from transformers import AutoModelForCausalLM, AutoTokenizer import torch model_id = "your-namespace/Bro-gemma-3-1b-it-finetuned" # replace with your repo/path tok = AutoTokenizer.from_pretrained(model_id, use_fast=True) model = AutoModelForCausalLM.from_pretrained( model_id, torch_dtype=torch.bfloat16, device_map="auto" ) prompt = ( "You are a precise assistant specialized in clinical trial summaries.\n" "Task: Summarize the following abstract in 4 bullet points, include 1 risk and 1 limitation.\n" "Abstract: " ) inputs = tok(prompt, return_tensors="pt").to(model.device) out = model.generate( **inputs, max_new_tokens=256, temperature=0.6, top_p=0.9 ) print(tok.decode(out[0], skip_special_tokens=True)) Notes β€’ device_map="auto" spreads layers across available devices. β€’ Prefer BF16 if supported; otherwise FP16. For very small GPUs/CPUs, see the 4-bit example. --- ## 🧩 Python (PEFT) β€” Adapters on Top of the Base Install pip install "transformers>=4.44.0" peft accelerate torch --upgrade Load base + LoRA/QLoRA from transformers import AutoModelForCausalLM, AutoTokenizer from peft import PeftModel import torch base_id = "google/gemma-3-1b-it" # base model you fine-tuned from lora_id = "your-namespace/Bro-gemma-3-1b-adapter" # your adapter repo/path tok = AutoTokenizer.from_pretrained(base_id, use_fast=True) base = AutoModelForCausalLM.from_pretrained( base_id, torch_dtype=torch.bfloat16, device_map="auto" ) model = PeftModel.from_pretrained(base, lora_id) prompt = ( "You are an enterprise compliance assistant.\n" "In JSON, outline a policy review plan with fields: goals[], stakeholders[], risks[], deliverables[]." ) inputs = tok(prompt, return_tensors="pt").to(model.device) out = model.generate(**inputs, max_new_tokens=200, temperature=0.5, top_p=0.9) print(tok.decode(out[0], skip_special_tokens=True)) --- ## πŸ’Ύ 4-bit (bitsandbytes) β€” Memory-Efficient Loading Install pip install "transformers>=4.44.0" accelerate bitsandbytes --upgrade Load from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig import torch bnb = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16 ) model_id = "your-namespace/Bro-gemma-3-1b-it-finetuned" tok = AutoTokenizer.from_pretrained(model_id, use_fast=True) model = AutoModelForCausalLM.from_pretrained( model_id, quantization_config=bnb, device_map="auto" ) prompt = "Explain, in 5 bullets, how to evaluate domain-specific reasoning abilities in LLMs." inputs = tok(prompt, return_tensors="pt").to(model.device) out = model.generate(**inputs, max_new_tokens=180, temperature=0.6, top_p=0.9) print(tok.decode(out[0], skip_special_tokens=True)) --- ## πŸš€ Serve with vLLM (OpenAI-Compatible API) Install & launch pip install vllm python -m vllm.entrypoints.openai.api_server \ --model your-namespace/Bro-gemma-3-1b-it-finetuned \ --dtype bfloat16 \ --max-model-len 4096 \ --port 8000 Call the endpoint (Python) import requests, json url = "http://localhost:8000/v1/chat/completions" headers = {"Content-Type": "application/json"} data = { "model": "your-namespace/Bro-gemma-3-1b-it-finetuned", "messages": [ {"role": "system", "content": "You are concise and evidence-focused."}, {"role": "user", "content": "Give a short rubric to score contextual comprehension on legal docs."} ], "temperature": 0.6, "max_tokens": 220, "stream": True } with requests.post(url, headers=headers, data=json.dumps(data), stream=True) as r: for line in r.iter_lines(): if line and line.startswith(b"data: "): chunk = line[len(b"data: "):].decode("utf-8") if chunk == "[DONE]": break print(chunk, flush=True) --- ## πŸ“¦ Serve with Text Generation Inference (TGI) Run the server (Docker) docker run --gpus all --shm-size 1g -p 8080:80 \ -e MODEL_ID=your-namespace/Bro-gemma-3-1b-it-finetuned \ ghcr.io/huggingface/text-generation-inference:latest Call the server (HTTP) curl http://localhost:8080/generate \ -X POST -d '{ "inputs": "Outline a domain-specific reasoning test plan for an insurance Q&A bot.", "parameters": {"max_new_tokens": 220, "temperature": 0.6, "top_p": 0.9} }' \ -H "Content-Type: application/json" --- ## πŸ–₯️ LM Studio (GGUF workflow) If you export **Bro** to **GGUF**, you can run it in LM Studio. One typical workflow is: 1) Convert HF β†’ GGUF with llama.cpp’s conversion script (example; confirm flags for Gemma 3): β€’ git clone https://github.com/ggerganov/llama.cpp β€’ cd llama.cpp β€’ python3 convert-hf-to-gguf.py /path/to/your/Bro-hf-dir --outfile Bro-f32.gguf 2) Quantize to Q4_K_M (or similar) for local inference: β€’ ./quantize Bro-f32.gguf Bro.Q4_K_M.gguf Q4_K_M 3) Open LM Studio β†’ Local Models β†’ Import β†’ select Bro.Q4_K_M.gguf 4) In the chat pane, set conservative parameters: β€’ Temperature: 0.5–0.7 β€’ Max new tokens: 128–384 β€’ (If available) repeat penalty ~1.05–1.15 5) Prompt example: "Summarize the attached clinical guidance in 6 bullets. Include contraindications and monitoring." Notes β€’ Exact conversion flags can differ by model family; verify Gemma-3 options in your llama.cpp version. β€’ If you distribute only HF weights, consider LM Studio’s server/backends that accept HF models. --- ## 🧠 Prompt Patterns (Contextual + Domain) Context-grounded Q&A System: You answer strictly using the provided context. If missing, say "I don't know." User: Use the context to answer. Keep to 5 bullets. Context: β€’ β€’ Question: Constrained JSON System: Output only valid JSON. No explanation. User: Return {"summary":"", "risks":[""], "actions":[""], "open_questions":[""]} for the content. Evaluation rubric (short) In 6 bullets, define a rubric to judge contextual comprehension on domain X. Use criteria: correctness, citation use, scope, clarity, uncertainty handling, follow-up. ## πŸ“ Prompting Engineering No special chat template is strictly required. Use clear instructions and keep prompts concise. For multi-turn workflows, persist conversation state externally or via your app’s memory/RAG layer. Example system style You are a concise, accurate assistant. Prefer step-by-step reasoning only when needed. Cite assumptions and ask for missing constraints. - [Guro](https://github.com/is-leeroy-jenkins/Guro?tab=readme-ov-file#guro) is a prompt library designed to supercharge AI agents and assistants with task-specific personas -ie, total randos. - From academic writing to financial analysis, technical support, SEO, and beyond - Guro provides precision-crafted prompt templates ready to drop into your LLM workflows. --- ## πŸ“š Basic RAG # Retrieve k chunks chunks = retriever.search("billing code coverage for outpatient procedures", k=5) # Build prompt context = "\n".join([f"β€’ {c.text} [{c.source}]" for c in chunks]) prompt = f""" You are a helpful domain assistant. Answer only from the context. Context: {context} Question: What are the coverage criteria and documentation requirements? """ # Generate (Transformers / vLLM / TGI) inputs = tok(prompt, return_tensors="pt").to(model.device) out = model.generate(**inputs, max_new_tokens=220, temperature=0.5, top_p=0.9) print(tok.decode(out[0], skip_special_tokens=True)) ### πŸ“ 1. Document Ingestion from langchain.document_loaders import TextLoader from langchain.text_splitter import RecursiveCharacterTextSplitter loader = TextLoader("reference_material.txt") documents = loader.load() splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100) docs = splitter.split_documents(documents) --- ### πŸ” 2. Embedding & Vector Indexing from langchain.embeddings import HuggingFaceEmbeddings from langchain.vectorstores import FAISS embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2") vectorstore = FAISS.from_documents(docs, embedding) --- ### πŸ”„ 3. Retrieval + Prompt Formatting retriever = vectorstore.as_retriever(search_kwargs={"k": 4}) retrieved_docs = retriever.get_relevant_documents("How does RAG improve factual accuracy?") context = "\n\n".join([doc.page_content for doc in retrieved_docs]) prompt = f""" You are Bro, a domain-aware assistant. Use the retrieved context below to answer accurately: {context} How does RAG improve factual accuracy? """ --- ### 🧠 4. LLM Inference with Bro ./main -m Bro.Q4_K_M.gguf -p "$prompt" -n 512 -t 8 -c 2048 --color > The output will be Bro's grounded and concise answer, using the embedded context to avoid hallucinations. --- ### πŸ“ Notes - **Bro** (gemma-3-1b-it variant) runs efficiently on CPU or with GPU offload via `llama.cpp`. - All context is explicitly retrieved; no external APIs are involved. - You can improve results by tuning chunk size, overlap, or using a domain-specific embedding model. --- ## βš™οΈ Parameter Tips β€’ Temperature: 0.5–0.8 (lower for deterministic policy/summary tasks) β€’ Top-p: 0.8–0.95 (tune one knob at a time) β€’ Max new tokens: 128–384 for chat; longer for drafts β€’ Repeat penalty: 1.05–1.2 if repetition occurs β€’ Context length: set to your Bro build; compress with selective retrieval --- ## πŸ›Ÿ Troubleshooting β€’ CUDA OOM: Lower max_new_tokens; use 4-bit; reduce context; shard across GPUs. β€’ Messy JSON: Use a JSON-only system prompt; set temperature ≀0.6; include a minimal schema. β€’ Weak domain grounding: Improve retrieval quality; add citations; constrain scope in the prompt. β€’ Inconsistent style: Provide one/two-shot examples; pin a style guide in the system message. ## πŸ“License - Bro is published under the [MIT General Public License v3](https://huggingface.co/leeroy-jankins/bro/blob/main/LICENSE.txt)