Qwen2.5-Coder-7B-Instruct — OctoBench-2.2k Fine-tune

Fine-tuned version of Qwen2.5-Coder-7B-Instruct trained on OctoBench-2.2k — synthetic coding dataset generated with Dataset Generator.

Benchmark Results

Benchmark Base This model (FT on OctoBench-2.2k) Δ
HumanEval (5 runs avg, n=164, t=0.2) 55.5% (±2.1) 72.3% (±2.0) +16.8pp
HumanEval+ (5 runs avg, n=164, t=0.2) 49.0% (±1.9) 65.1% (±1.6) +16.1pp
BigCodeBench full instruct (1 run, n=1140) 39.3% 39.7% +0.4pp
LiveCodeBench v6 (1 run, n=1055, t=0.0) 29.0% 26.9% -2.1pp

+16.8pp on HumanEval, +16.1pp on HumanEval+ vs base — error bars don't overlap, statistically significant improvement. BigCodeBench essentially flat and LiveCodeBench shows a small regression — see Limitations below.

Benchmark

Training

  • Method: QLoRA fine-tuning via Unsloth
  • Base model: Qwen2.5-Coder-7B-Instruct
  • Dataset: OctoBench-2.2k (2,248 multi-turn examples)
  • Hardware: RTX 4070 Ti 12GB
  • Quantization: Q4_K_M GGUF (quantized by Unsloth)
  • Chat template: ChatML (embedded in GGUF)
  • Context length: 32,768 tokens
  • Evaluation: HumanEval / HumanEval+ (5 runs avg @ temp 0.2), BigCodeBench full instruct (1 run, calibrated), LiveCodeBench v6 (1 run @ temp 0.0)

Training logs and exact hyperparameters were not preserved — this was an exploratory fine-tune.

Training Data

Trained on OctoBench-2.2k — 2,248 multi-turn conversations across 8 coding categories:

  • Function Implementation
  • Algorithmic Problems
  • Python Stdlib & Idioms
  • Data Libraries
  • Edge Cases & Input Validation
  • Refactor & Code Review
  • Testing & Debugging
  • File IO Subprocess Concurrency

Limitations

  • Strong on function-level coding — measurable +16pp gains on HumanEval / HumanEval+
  • Weak on multi-library API tasks — BigCodeBench essentially flat (+0.4pp). The "Data Libraries" category was too generic; for BCB-style benchmarks, train on an API-precise dataset seeded with concrete library taxonomy
  • Slight regression on contest-style problems — LiveCodeBench v6 -2.1pp. Root cause is logic, not format (614 wrong-answer / 117 runtime-error / 40 time-limit-exceeded out of 771 fails). The algorithmic category needed more drill on edge-case coverage and constraint handling
  • Multi-turn conversational style — produces explanations alongside code

Support

If this helped you:

  • Ko-fi: https://ko-fi.com/arondaron
  • ETH: 0xA6910bDa2a89ee38cA42883e365BB2DdFba3C2A1
  • BTC: bc1qamarkursch3x8399qaly4md32ck5xgthnr9jpl
  • SOL: 797jTzFRm9dd4joHPqvUjryeXi5rPbMwG6Rqj3wJrgMt

License

Apache-2.0 — inherited from base model Qwen2.5-Coder-7B-Instruct.

Built with Dataset Generator.

Downloads last month
108
GGUF
Model size
8B params
Architecture
qwen2
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for AronDaron/Qwen2.5-Coder-7B-Instruct-OctoBench-2.2k-Fine-tune

Base model

Qwen/Qwen2.5-7B
Quantized
(190)
this model

Dataset used to train AronDaron/Qwen2.5-Coder-7B-Instruct-OctoBench-2.2k-Fine-tune