---
base_model: Qwen/Qwen3-1.7B
library_name: mlx
datasets:
- Skywork/Skywork-Reward-Preference-80K-v0.2
- allenai/reward-bench
language:
- en
license: apache-2.0
pipeline_tag: text-generation
tags:
- mlx
- reward-model
- judge-model
- grpo
- lora
- spct
- apple-silicon
model-index:
- name: j1-micro
  results:
  - task:
      type: reward-modeling
      name: Reward Modeling
    dataset:
      name: RewardBench
      type: allenai/reward-bench
    metrics:
    - type: accuracy
      value: 80.7
      name: RewardBench Accuracy (reported by Haize Labs)
  - task:
      type: reward-modeling
      name: Reward Modeling (MLX 4-bit)
    dataset:
      name: RewardBench (100-sample subset)
      type: allenai/reward-bench
    metrics:
    - type: accuracy
      value: 75.0
      name: RewardBench Accuracy (MLX 4-bit quantized)
---

# j1-micro-1.7B (MLX 4-bit Quantized)

MLX 4-bit quantized version of [Haize Labs' j1-micro](https://huggingface.co/haize-labs/j1-micro), a 1.7B judge/reward model that Haize Labs reports at 80.7% on RewardBench, slightly ahead of Claude-3-Opus and GPT-4o-mini (80.1% each) despite being far smaller.

This repo contains the **MLX 4-bit quantized weights** for fast inference on Apple Silicon Macs, plus the original **LoRA adapter** for GPU inference via vLLM.

## What This Model Does

j1-micro is a **pairwise preference judge**: given two responses, it generates a structured rubric, reasons through it, and scores each response. It was trained with GRPO (Group Relative Policy Optimization) + SPCT (Self-Principled Critique Tuning) on Skywork Preference 80K.

The model invents its own evaluation criteria per query, then scores both responses against them. This structured reasoning is what lets a 1.7B model compete with far larger ones.
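Since the judge ends its output with a score pair (the Output Format section of this README shows `\boxed{8, 3}`), turning a generation into a preference is mostly a parsing step. The sketch below is one way to do that, not code from Haize Labs or mlx-lm: the helper names `parse_scores` and `preferred` are hypothetical, and treating equal scores as a tie and a missing pair as a format error are assumptions.

```python
import re
from typing import Optional, Tuple

# Matches a final score pair such as "\boxed{8, 3}" (integers or decimals).
BOXED = re.compile(r"\\boxed\{\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\}")


def parse_scores(text: str) -> Optional[Tuple[float, float]]:
    """Return (score_a, score_b) from the last boxed pair in text, or None."""
    matches = BOXED.findall(text)
    if not matches:
        return None  # assumed to count as a format error
    a, b = matches[-1]
    return float(a), float(b)


def preferred(text: str) -> Optional[str]:
    """Map the score pair to "A", "B", or "tie"; None if nothing parses."""
    scores = parse_scores(text)
    if scores is None:
        return None
    a, b = scores
    if a == b:
        return "tie"  # assumption: equal scores are reported as a tie
    return "A" if a > b else "B"


sample = "Response A: ... Response B: ...\n\\boxed{8, 3}"
print(parse_scores(sample))  # (8.0, 3.0)
print(preferred(sample))     # A
```

Taking the last match rather than the first guards against a boxed pair appearing earlier in the model's reasoning text.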
## Performance

| Model | Params | RewardBench |
|-------|--------|:-----------:|
| Tulu-2-70b | 70B | 77.2% |
| Llama-3-70B-Instruct | 70B | 77.0% |
| Claude-3-Opus | 200B+ | 80.1% |
| GPT-4o-mini | ~8B | 80.1% |
| **j1-micro (LoRA, FP16)** | **1.7B** | **80.7%** |
| **j1-micro (MLX 4-bit)** | **1.7B** | **75.0%** |

The MLX 4-bit figure was measured on a 100-sample RewardBench subset, so it is not directly comparable to the full-benchmark numbers in the other rows. Results on that subset:

- **Accuracy:** 75.0% (0% format error rate)
- **Latency:** ~3.0s avg, 2.9s p50, 3.8s p95 (M-series Mac)
- **Memory:** 2.0 GB peak

## Files

```
mlx/                          # MLX 4-bit quantized (Apple Silicon)
  model.safetensors           # 968 MB
  config.json
  tokenizer.json
  tokenizer_config.json
  ...
lora/                         # LoRA adapter (GPU via vLLM/PEFT)
  adapter_model.safetensors   # 67 MB
  adapter_config.json
  tokenizer.json
  ...
```

## Quick Start (MLX on Mac)

```bash
pip install mlx-lm
```

```python
from mlx_lm import load, generate

# The quantized weights live in the repo's mlx/ subfolder.
# If load() cannot resolve the subfolder this way, download the repo
# and pass the local path to mlx/ instead.
model, tokenizer = load("rachittshah/j1-micro", model_config={"subfolder": "mlx"})

SYSTEM = """You are an expert XML wrangler. You must respond in the following format:
...
...
\\boxed{..., ...}

Please only respond in English."""

prompt = """You are a skilled little expert at scoring responses...

#### Conversation Context ####
What is the capital of France?

#### Responses to be Scored ####
[The Begin of Response A]
The capital of France is Paris, located in northern France along the Seine River.
[The End of Response A]

[The Begin of Response B]
France's capital is Lyon, a major city in southeastern France.
[The End of Response B]"""

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": prompt},
]

# Render the chat template to a string, then generate the rubric, analysis, and scores.
formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=formatted, max_tokens=2048)
print(response)
```

## Quick Start (vLLM with LoRA)

```bash
# Download and serve with vLLM
vllm serve Qwen/Qwen3-1.7B \
  --enable-lora \
  --lora-modules j1-micro=rachittshah/j1-micro/lora
```

Or load the adapter with PEFT:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model, then attach the LoRA adapter from the repo's lora/ subfolder.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")
model = PeftModel.from_pretrained(model, "rachittshah/j1-micro", subfolder="lora")
```

## Output Format

The model outputs structured XML:

```xml
1. Factual accuracy (weight: 0.35) - correctness of stated facts
2. Specificity (weight: 0.25) - concrete details vs vague claims
3. Completeness (weight: 0.2) - coverage of the topic
4. Clarity (weight: 0.2) - clear, well-organized explanation

Response A: Factual accuracy 9/10 - correctly identifies Paris...
Response B: Factual accuracy 2/10 - incorrectly states Lyon...

\boxed{8, 3}
```

## Training Details

- **Base model:** Qwen/Qwen3-1.7B (Apache 2.0)
- **Method:** GRPO + SPCT (Self-Principled Critique Tuning)
- **Data:** Skywork-Reward-Preference-80K-v0.2
- **LoRA:** rank=16, alpha=32, dropout=0.1, all attention + MLP projections
- **Hardware:** 1x A100 80GB, <24h training
- **Cost:** ~$25

## Citation

Original model by [Haize Labs](https://github.com/haizelabs/j1-micro):

```bibtex
@misc{j1micro2025,
  title  = {j1-micro and j1-nano: Tiny Generalist Reward Models via Inference-Time Rubric Proposal},
  author = {Haize Labs},
  url    = {https://github.com/haizelabs/j1-micro},
  month  = {May},
  year   = {2025}
}
```

## License

Apache 2.0 (both the base model Qwen3-1.7B and the LoRA adapter).