---
language:
- en
license: apache-2.0
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
tags:
- text-generation
- qwen2
- unsloth
- lora
- gguf
- llama.cpp
- reasoning
- distillation
- conversational
pipeline_tag: text-generation
library_name: transformers
datasets:
- EphAsad/QWENMillenium-SF
- EphAsad/Phi4Millennium-SF
- EphAsad/MistralMillenium-SF
- Modotte/CodeX-2M-Thinking
- Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned
- WithinUsAI/MiniMax_M2.7_Distilled_5k
- tuanha1305/DeepSeek-R1-Distill
- open-r1/OpenThoughts-114k-math
- flytech/python-codes-25k
- FreedomIntelligence/medical-o1-reasoning-SFT
model-index:
- name: Atem v1
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: ARC-Challenge
      type: ai2_arc
      config: ARC-Challenge
      split: test
    metrics:
    - type: acc_norm
      value: 0.455
      name: Accuracy (normalised)
      verified: false
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8K
      type: gsm8k
      split: test
    metrics:
    - type: exact_match
      value: 0.530
      name: Exact Match (strict, zero-shot)
      verified: false
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag
      type: hellaswag
      split: validation
    metrics:
    - type: acc_norm
      value: 0.644
      name: Accuracy (normalised)
      verified: false
---


# Atem v1

Ancient logic. Modern intelligence.

A 1.5B reasoning model trained via multi-source knowledge distillation from frontier teacher models.


---

## Overview

Atem is a 1.5B parameter reasoning model built via supervised fine-tuning on a curated corpus of approximately 115,000 examples distilled from multiple frontier teacher models. Starting from Qwen2.5-1.5B-Instruct, Atem was trained using LoRA to preserve base model capabilities while improving performance on reasoning, mathematics, and coding tasks.

This is **Stage 1** of a planned multi-stage training series. Stage 1 focuses on establishing strong general reasoning across domains. Stage 2 is [Atem-Wisdom](https://huggingface.co/EphAsad/Atem-Wisdom-1.5B), which builds on this foundation by adding explicit chain-of-thought reasoning: the model works through problems inside `<think>` tags before producing its final answer.

---

## Model Details

| Property | Value |
|----------|-------|
| **Base model** | Qwen/Qwen2.5-1.5B-Instruct |
| **Training method** | LoRA Supervised Fine-Tuning (Stage 1) |
| **LoRA config** | r=32, alpha=64, dropout=0.05 |
| **Target modules** | q, k, v, o, gate, up, down projections |
| **Parameters** | ~1.54B |
| **Training records** | 114,932 |
| **Epochs** | 1 |
| **Effective batch size** | 64 (batch 8 × grad accum 8) |
| **Learning rate** | 2e-4, cosine schedule, 5% warmup |
| **Final train loss** | 0.940 |
| **Final val loss** | 0.890 |
| **Hardware** | NVIDIA A100-SXM4 80GB |
| **Max sequence length** | 4,096 tokens |
| **Precision** | bfloat16 |
| **License** | Apache 2.0 |

---

## Intended Use

Atem is designed for open-ended reasoning tasks where structured, accurate thinking adds value:

- Code explanation, implementation, and debugging
- Mathematical problem solving with working shown
- Analytical reasoning and hypothesis evaluation
- Concept explanation and comparative analysis
- Logic, argument, and fallacy identification

Atem is **not** designed for retrieval-heavy factual lookup, real-time information, or tasks requiring broad knowledge beyond
its training domains.

---

## Training Data

Atem was trained on a corpus assembled from eleven sources, combining domain-specific generated datasets and publicly available distillation datasets from frontier models. Outputs containing `<think>` reasoning traces were stripped down to their clean final responses for Stage 1 training.

| Dataset | Records | Source / Teacher |
|---------|---------|-----------------|
| EphAsad/QWENMillenium-SF | 5,000 | Qwen2.5-14B (Analytical & Scientific) |
| EphAsad/Phi4Millennium-SF | 2,932 | Phi-4 14B (Mathematical Reasoning) |
| EphAsad/MistralMillenium-SF | 5,000 | Mistral-Nemo-12B (Language & Comprehension) |
| Modotte/CodeX-2M-Thinking | 30,000 | Mixed (Coding) |
| Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned | 23,000 | Kimi K2.5 (General Distillation, English filtered) |
| WithinUsAI/MiniMax_M2.7_Distilled_5k | 5,000 | MiniMax M2.7 |
| tuanha1305/DeepSeek-R1-Distill | 9,000 | DeepSeek-R1 |
| open-r1/OpenThoughts-114k-math | 10,000 | Mixed (Mathematics, correct answers only) |
| flytech/python-codes-25k | 10,000 | Python coding |
| FreedomIntelligence/medical-o1-reasoning-SFT | 10,000 | Medical reasoning (English config) |
| Private dataset | 5,000 | Undisclosed |
| **Total** | **114,932** | |

The QWENMillenium-SF, Phi4Millennium-SF, and MistralMillenium-SF datasets were generated specifically for this project via batched inference on a Colab A100. OpenThoughts-114k-math was filtered to verified correct solutions only before sampling.

---

## Training Configuration

```python
import torch

# Key hyperparameters
lora_r = 32
lora_alpha = 64
lora_dropout = 0.05
max_seq_length = 4096
learning_rate = 2e-4
lr_scheduler = 'cosine'
warmup_ratio = 0.05
batch_size = 8
grad_accumulation = 8  # effective batch size: 64
num_epochs = 1
dtype = torch.bfloat16
load_in_4bit = True  # during training
```

Training used Unsloth with `train_on_responses_only` masking, ensuring loss was computed exclusively on assistant response tokens.
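As a rough illustration of what response-only masking means, the sketch below builds labels for a toy conversation. It is not Unsloth's implementation: `mask_non_assistant`, the role-tagged segments, and the token ids are all hypothetical. The only assumption carried over from real training code is that the label value `-100` is skipped by the loss, which is the default ignore index of PyTorch's cross-entropy.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_non_assistant(segments):
    """Build (input_ids, labels) from a list of (role, token_ids) segments.

    Hypothetical helper for illustration only: tokens from non-assistant
    segments receive IGNORE_INDEX so they contribute nothing to the loss.
    """
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)  # supervised tokens keep their ids
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))  # masked out
    return input_ids, labels


# Toy token ids, not real tokenizer output.
segments = [
    ("system", [1, 2]),
    ("user", [3, 4, 5]),
    ("assistant", [6, 7, 8, 9]),
]

input_ids, labels = mask_non_assistant(segments)
supervised = sum(label != IGNORE_INDEX for label in labels)

print(labels)
print(f"{supervised} of {len(labels)} tokens contribute to the loss")
```

In this toy case only the four assistant tokens are supervised; the system and user tokens still appear in `input_ids` as context but carry no gradient signal.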
A three-part validation was run before training: chat template replacement verification, think tag strip confirmation, and mask sanity check. After training, LoRA adapters were merged into the base weights and exported as a full merged model.

**Loss curve:**

| Step | Train Loss | Val Loss |
|------|-----------|----------|
| 500 | 0.990 | 0.920 |
| 1000 | 1.020 | 0.900 |
| 1500 | 0.960 | 0.890 |
| Final | **0.940** | **0.890** |

Validation loss converged at 0.890. The final training loss (0.940) sits 0.050 above it, so there is no sign of overfitting over the single epoch.

---

## Evaluation

### Benchmark Results

Evaluated against Qwen2.5-1.5B-Instruct (base model) using lm-evaluation-harness with identical conditions: 4-bit inference, batch size 16, zero-shot strict evaluation.

| Task | Base (1.5B) | Atem v1 (1.5B) | Delta (pp) |
|------|------------|----------------|-------|
| ARC-Challenge | 43.7% | 45.5% | +1.8 ✓ |
| GSM8K | 23.0% | **53.0%** | **+30.0** ✓ |
| HellaSwag | 66.8% | 64.4% | -2.4 |

The GSM8K result is the primary finding. A +30 percentage point improvement on grade school mathematics reflects the targeted training on verified correct mathematical reasoning examples from multiple frontier teacher models.

The HellaSwag regression of 2.4 percentage points is within normal benchmark variance and is far smaller than the 16.2% regression produced on the same benchmark by a prior exploratory training run that used a full fine-tune. LoRA preserved base model commonsense capabilities as intended.

### Comparison vs Qwen2.5-7B-Instruct

To contextualise the GSM8K result, Atem was benchmarked against Qwen2.5-7B-Instruct under the same zero-shot strict evaluation conditions.

| Model | Parameters | GSM8K (zero-shot strict) |
|-------|-----------|--------------------------|
| Qwen2.5-1.5B-Instruct | 1.5B | 23.0% |
| **Atem v1** | **1.5B** | **53.0%** |
| Qwen2.5-7B-Instruct | 7B | 74.9% |

At baseline, the 1.5B model sits 51.9 points below the 7B.
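The gap figures for this comparison can be reproduced from the table above. A minimal sketch, with the three GSM8K scores copied from the table and variable names chosen purely for illustration:

```python
# GSM8K zero-shot strict scores in percent, copied from the comparison table.
base_1_5b = 23.0  # Qwen2.5-1.5B-Instruct
atem_v1 = 53.0    # Atem v1
qwen_7b = 74.9    # Qwen2.5-7B-Instruct

gap_before = qwen_7b - base_1_5b  # distance to the 7B model before training
gap_after = qwen_7b - atem_v1     # distance to the 7B model after training

# Share of the original distance to the 7B score recovered by training.
gap_closed = (gap_before - gap_after) / gap_before

# Atem's score expressed as a fraction of the 7B score.
relative_score = atem_v1 / qwen_7b

print(f"gap before training: {gap_before:.1f} points")
print(f"gap after training:  {gap_after:.1f} points")
print(f"gap closed:          {gap_closed:.0%}")
print(f"relative to 7B:      {relative_score:.0%}")
```

Here "gap closed" is measured against the 1.5B-to-7B distance on this one benchmark, so it says nothing about other tasks.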
After training, Atem sits 21.9 points below, closing approximately **58% of the capability gap** between 1.5B and 7B on mathematical reasoning. Atem achieves **71% of Qwen2.5-7B's GSM8K performance at roughly one fifth of its parameter count**.

Note: Official Qwen2.5-7B-Instruct scores (91.6% GSM8K) use 4-shot chain-of-thought prompting. The 74.9% figure above reflects the same zero-shot strict evaluation format used for Atem, ensuring a fair direct comparison.

### Qualitative Evaluation

Atem was evaluated against Qwen2.5-1.5B-Instruct across 30 domain-representative questions using matched system prompts, so that differences in output reflect trained capability rather than prompt engineering.

| Domain | Questions | Outcome |
|--------|-----------|---------|
| Coding | 8 | Atem stronger: more thorough, better structured, catches edge cases |
| Mathematics | 6 | Comparable: both accurate on standard problems |
| Analytical Reasoning | 6 | Atem stronger: better structured arguments |
| General Knowledge | 5 | Comparable |
| Language & Logic | 5 | Atem stronger: correct fallacy identification, greater depth |

---

## Usage

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "EphAsad/Atem-v1-1.5B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": "Write a Python function that checks whether a number is prime."
    }
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(
        input_ids=inputs,
        max_new_tokens=1000,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        do_sample=True,
    )

response = tokenizer.decode(
    output[0][inputs.shape[1]:],
    skip_special_tokens=True
)
print(response)
```

### Unsloth (faster inference)

```python
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="EphAsad/Atem-v1-1.5B",
    max_seq_length=4096,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [
    {
        "role": "user",
        "content": "Explain the difference between a stack and a queue, with examples."
    }
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

with torch.no_grad():
    output = model.generate(
        input_ids=inputs,
        max_new_tokens=1000,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

print(tokenizer.decode(
    output[0][inputs.shape[1]:],
    skip_special_tokens=True
))
```

### Ollama

```bash
# Recommended: best speed/quality balance
ollama run hf.co/EphAsad/Atem-v1-1.5B:Q4_K_M

# Higher quality
ollama run hf.co/EphAsad/Atem-v1-1.5B:Q5_K_M

# Near-lossless
ollama run hf.co/EphAsad/Atem-v1-1.5B:Q8_0
```

### llama.cpp

```bash
llama-server -hf EphAsad/Atem-v1-1.5B:Q4_K_M
```

### System Prompt

Atem's identity is baked into the chat template and activates automatically when no system message is provided. For manual override:

```
You are Atem, a precise and analytical reasoning assistant. You approach every problem methodically — identifying core concepts, reasoning step by step, and arriving at well-supported conclusions. You show your thinking clearly and are thorough, direct, and intellectually honest.
```

### Available Files

| File | Size | Description |
|------|------|-------------|
| `model.safetensors` | ~3.1 GB | Full bfloat16 merged weights |
| `Atem-1.5b.Q4_K_M.gguf` | ~986 MB | 4-bit quantised (recommended) |
| `Atem-1.5b.Q5_K_M.gguf` | ~1.1 GB | 5-bit quantised |
| `Atem-1.5b.Q8_0.gguf` | ~1.6 GB | 8-bit quantised (near-lossless) |

---

## Known Limitations

**No thinking traces (Stage 1 by design).** Think tags were stripped from all training data for Stage 1. The model does not produce extended `<think>` reasoning traces. Stage 2 training will layer this capability on top of the Stage 1 foundation.

**Mathematical precision on complex problems.** On multi-step calculations, the model may make arithmetic slips in intermediate steps even when its overall approach is structurally correct. Answers to high-stakes mathematical problems should be independently verified.

**HellaSwag regression.** A regression of 2.4 percentage points on HellaSwag commonsense completion is observed. This is minor and substantially better than the 16.2% regression produced by the earlier exploratory full fine-tune run, indicating that LoRA preserved base commonsense capability effectively.

---

## Roadmap

Atem v1 establishes the Stage 1 foundation.
Planned next steps:

- **Stage 2:** LoRA SFT on curated chain-of-thought data to add thinking trace capability, using `Complex_CoT`, `inverted_reasoning`, and reasoning trace columns held out from Stage 1 training
- **Extended benchmarks:** MMLU, BBH, IFEval, WinoGrande, MBPP post-Stage 2
- **Atem v2:** Expanded corpus, further domain coverage

---

## Citation

```bibtex
@misc{atem_v1_2026,
  author       = {Asad, Zain},
  title        = {Atem v1: A 1.5B Reasoning Model via Multi-Source Knowledge Distillation},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/EphAsad/Atem-v1-1.5B}},
}
```

---

## Support

If you find this model useful for your research or projects, you can support further development of my datasets and models here:

☕ [ko-fi.com/ephraim123](https://ko-fi.com/ephraim123)

---

## License

Released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0), consistent with the base model Qwen2.5-1.5B-Instruct.

---

Built independently by EphAsad