---
license: cc-by-4.0
datasets:
- allenai/c4
language:
- en
metrics:
- accuracy
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
pipeline_tag: text-generation
tags:
- gptq
- int4
- quantized
- qlora
- medical
- medqa
- biology
- chemistry
- finance
- legal
- climate
- reasoning
- 4-bit
model-index:
- name: Chaperone-Thinking-LQ-1.0
  results:
  - task:
      type: text-generation
      name: Medical QA
    dataset:
      name: MedQA
      type: medqa
    metrics:
    - type: accuracy
      value: 84.0
  - task:
      type: text-generation
      name: Math Reasoning
    dataset:
      name: MATH-500
      type: math-500
    metrics:
    - type: accuracy
      value: 91.9
  - task:
      type: text-generation
      name: Math Competition
    dataset:
      name: AIME 2024
      type: aime-2024
    metrics:
    - type: accuracy
      value: 66.7
  - task:
      type: text-generation
      name: Graduate-Level QA
    dataset:
      name: GPQA Diamond
      type: gpqa-diamond
    metrics:
    - type: accuracy
      value: 56.7
  - task:
      type: text-generation
      name: Knowledge Understanding
    dataset:
      name: MMLU
      type: mmlu
    metrics:
    - type: accuracy
      value: 85.9
  - task:
      type: text-generation
      name: Math Reasoning
    dataset:
      name: GSM8K-Platinum
      type: gsm8k
    metrics:
    - type: accuracy
      value: 84.04
  - task:
      type: text-generation
      name: Instruction Following
    dataset:
      name: IFEval
      type: ifeval
    metrics:
    - type: accuracy
      value: 83.34
  - task:
      type: text-generation
      name: Knowledge Understanding
    dataset:
      name: MMLU-PRO
      type: mmlu-pro
    metrics:
    - type: accuracy
      value: 65.76
---

# Chaperone-Thinking-LQ-1.0

A domain-optimized reasoning model built on **DeepSeek-R1-Distill-Qwen-32B**, refined through a multi-stage pipeline of GPTQ quantization-aware training and QLoRA fine-tuning. It achieves **84% on MedQA**, within 4 points of GPT-4o, in a ~20GB package that fits on a single L40/L40S GPU.
**Fully open-source under CC-BY-4.0.**

---

## Highlights

- **Base model:** DeepSeek-R1-Distill-Qwen-32B (32B parameters)
- **Size reduction:** ~60GB → ~20GB (4-bit GPTQ)
- **MedQA accuracy:** 84% (GPT-4o: ~88%)
- **Hardware target:** Runs on a single NVIDIA L40, L40S, or A100 GPU
- **License:** CC-BY-4.0

---

## How We Built It

This model is **not** a simple quantization. It was produced through a four-stage pipeline:

| Stage | Method | What it does |
|-------|--------|-------------|
| **1. Quantization** | 4-bit GPTQ | Compresses weights from ~60GB to ~20GB for efficient inference |
| **2. Quantization-Aware Training** | GPTQ-based QAT with calibration | Minimizes accuracy loss during quantization by optimizing scale/zero-point parameters against a calibration dataset |
| **3. Domain Fine-Tuning** | QLoRA | Adapts the quantized model on medical and scientific corpora, recovering and improving accuracy for domain-specific reasoning |
| **4. Transparency** | Adaptive layer removal | Removes the identity adaptive layer so the model correctly attributes its foundational architecture to its original creators |

---

## Benchmark Results

### MedQA

| Model | Accuracy |
|-------|----------|
| **Chaperone-Thinking-LQ-1.0** | **84%** |
| GPT-4o | 88% |

### Multi-Model Comparison

| Benchmark | DeepSeek-R1 | OpenAI-o1-1217 | DeepSeek-R1-32B | OpenAI-o1-mini | **Chaperone-Thinking-LQ-1.0** |
|-----------|:-----------:|:--------------:|:---------------:|:--------------:|:----------------------------:|
| **AIME 2024** | 79.8 | 79.2 | 72.6 | 63.6 | **66.7** |
| **GPQA Diamond** | 71.5 | 75.7 | 62.1 | 60.0 | **56.7** |
| **MATH-500** | 97.3 | 96.4 | 94.3 | 90.0 | **91.9** |
| **MMLU** | 90.8 | 91.8 | 87.4 | 85.2 | **85.9** |

> Chaperone-Thinking-LQ-1.0 delivers competitive performance against full-precision frontier models at roughly one third the size of its full-precision base model (~60GB → ~20GB).
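
The size figures reported above (~60GB at full precision, ~20GB after 4-bit GPTQ) can be sanity-checked with a back-of-the-envelope estimate. The sketch below is only an illustration: it assumes exactly 32 billion parameters, 16 bits per weight for the base model, and 4 bits per weight after quantization, and it ignores quantization metadata, any layers kept at higher precision, and runtime memory such as the KV cache. The helper name `weight_size_gb` is hypothetical and not part of any release.

```python
def weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Idealized weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: "32 billion" as stated in the card; the exact count may differ.
NUM_PARAMS = 32e9

fp16_gb = weight_size_gb(NUM_PARAMS, 16)  # 16-bit base weights
int4_gb = weight_size_gb(NUM_PARAMS, 4)   # 4-bit weights only, no scales or zero-points

# Ratio implied by the sizes reported in this card (~60GB -> ~20GB).
reported_ratio = 60 / 20

print(f"16-bit weights: {fp16_gb:.0f} GB")
print(f"4-bit weights:  {int4_gb:.0f} GB (idealized, weights only)")
print(f"idealized ratio: {fp16_gb / int4_gb:.1f}x, reported ratio: {reported_ratio:.1f}x")
```

The idealized estimate gives about 64 GB and 16 GB, a 4x ratio, while the reported sizes imply about 3x. The difference would be consistent with extra storage for per-group quantization parameters and components left unquantized, though the card does not break the ~20GB figure down.
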
### Speed & Latency

| Metric | Chaperone-Thinking-LQ-1.0 | DeepSeek-R1-Distill-Qwen-32B |
|--------|--------------------------|------------------------------|
| Throughput | **36.86 tok/s** | 22.84 tok/s |
| Latency p50 | **11.49s** | 20.10s |
| Latency p95 | **13.06s** | 20.11s |

> ~1.6x the throughput with ~43% lower median latency.
>
> Averages over 10 trials, concurrency=1, max_tokens=512, temperature=0.

---

## Model Details

| | |
|---|---|
| **Base model** | DeepSeek-R1-Distill-Qwen-32B |
| **Parameters** | 32 billion |
| **Quantization** | 4-bit GPTQ |
| **Fine-tuning** | QLoRA on medical/scientific corpora |
| **Model size** | ~20GB |
| **Precision** | torch.float16 |
| **Evaluation hardware** | NVIDIA A100 80GB PCIe |
| **CUDA** | 12.4 |
| **PyTorch** | 2.6.0+cu124 |

---

## Intended Use

- Medical and clinical reasoning tasks
- Scientific Q&A and research workflows
- Enterprise deployments requiring data sovereignty (on-premises, private cloud)
- Domain-specific text analysis and insight extraction

---

## Limitations

- 4-bit quantization introduces some accuracy trade-off on general benchmarks vs. the full-precision base model
- Domain fine-tuning is optimized for medical/scientific reasoning; general-purpose performance may differ
- Not intended as a replacement for professional medical judgment

---

## Citation

If you use this model, please cite:

```bibtex
@misc{chaperone-thinking-lq,
  title={Chaperone-Thinking-LQ-1.0: Domain-Optimized Reasoning via GPTQ-QAT and QLoRA},
  author={Empirisch Technologies},
  year={2025},
  url={https://huggingface.co/empirischtech}
}
```

---

## Links

- **Website:** [chaperoneai.net](https://chaperoneai.net/benchmark)
- **Hugging Face:** [empirischtech](https://huggingface.co/empirischtech) ([empirischtech.at](https://empirischtech.at/))