---
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb-edu
- EleutherAI/the_pile_deduplicated
- HuggingFaceTB/dclm-edu
- HuggingFaceTB/finemath
- HuggingFaceTB/smollm-corpus
- wikimedia/wikipedia
- Harley-ml/lesswrong
- Harley-ml/HFMC
- AxiomicLabs/NPset-2-Python-Edu
language:
- en
tags:
- er
- fromziro
- fromzero
- harley-ml
- lyjonathan
- small
- slm
- orez
- sfz
---

**Note**: This model belongs to the **Er** SLM family. All models in the Er family are trained using the same tokenizer, dataset, and token count.

# Er-Medium

## Summary

```
Task: Text-Generation
Total training time: 5 days
Inputs: text
Outputs: text
Params: 12,497,520
Final Loss: 2.404
Important Benchmark Scores:
  1. ARC Easy - 34.89%
  2. BLiMP - 75.94%
  3. HellaSwag - 28.39%
  4. ArithMark-2.0 - 30.88%
Framework: PyTorch, transformers
Author: Paul Courneya, Jonathan Ly
```

## Description

‘Er-Medium’ is a 12.5M-parameter Small Language Model trained on 34.8B tokens from a nine-source dataset. Its name, “Er,” is the reverse of “Re,” the prefix of Re:Zero – Starting Life in Another World, the light novel series that inspired the organization’s name.

## Model Details

- Architecture: Qwen3.5
- Hidden Size: 280
- Number of Layers: 12
- Intermediate Size: 840 (a 3x expansion)
- Number of Attention Heads: 8
- Number of KV Heads: 2
- Head Dim: 35
- Vocab Size: 2564
- Max Position Embeddings: 384
- Total Parameters: 12,497,520

## Training

### Dataset

| Source           | Bytes (GB) | Share (%) | What it is                                      |
| ---------------- | ---------: | --------: | ----------------------------------------------- |
| FineWeb-edu      |       35.0 |     28.2% | Educational-filtered Common Crawl               |
| DCLM-Edu         |       20.0 |     16.1% | Educational-filtered webtext                    |
| The Pile Deduped |       20.0 |     16.1% | Broad, diverse 23-source dataset                |
| FineWeb-HQ       |       20.0 |     16.1% | Knowledge-filtered webtext                      |
| FineMath         |       13.0 |     10.5% | Math-filtered Common Crawl                      |
| Cosmopedia-v2    |        7.0 |      5.6% | Synthetic textbooks                             |
| Wikipedia        |        5.0 |      4.0% | Wikipedia articles                              |
| NpSetPython-Edu  |        3.5 |      2.8% | Normalized Python code                          |
| Misc             |        0.6 |      0.5% | LessWrong + HF configs + HF dataset/model cards |

### Training Details

- Maximum Learning Rate: 3e-3
- Minimum Learning Rate: 0
- Number of Epochs: 1
- Sequence Length: 384
- Batch Size: 150
- Eval Split Ratio: 0.0025
- Gradient Accumulation Steps: 2
- Gradient Checkpointing: True
- Gradient Clipping: 1.0
- Torch Compile: True
- Torch Compile Mode: `max-autotune-no-cudagraphs`
- AdamW Betas: `(0.9, 0.95)`
- WSD Warmup Ratio: 0.015
- WSD Stable Ratio: 0.685
- WSD Decay Ratio: 0.30
- DType: `float16`

### Final Eval and Train Loss

- Train: 2.404
- Val: 2.403

### Hardware

- GPU: NVIDIA RTX 5060 (used for training)
- CPU: AMD Ryzen 5 2600 (used for tokenization)

## Benchmark scores

| Task          |  Value |
| ------------- | -----: |
| BLiMP         | 75.94% |
| ARC Challenge | 20.65% |
| ARC Easy      | 34.89% |
| BoolQ         | 51.80% |
| HellaSwag     | 28.39% |
| PiQA          | 57.78% |
| SciQ          | 59.10% |
| SWAG          | 41.60% |
| Winogrande    | 49.01% |

ArithMark-2.0:

| Category | Accuracy |
| -------- | -------: |
| ops = 1  |   30.08% |
| ops = 2  |   35.47% |
| ops = 3  |   26.60% |
| Avg      |   30.88% |

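The WSD (warmup-stable-decay) ratios listed under Training Details can be read as a three-phase learning-rate schedule. The sketch below is one plausible reading, not the training code used for this model: it assumes a linear warmup and a linear decay (the card does not state either shape), and `wsd_lr` and `total_steps` are hypothetical names introduced only for illustration.

```python
def wsd_lr(
    step: int,
    total_steps: int,
    max_lr: float = 3e-3,         # "Maximum Learning Rate" from the card
    min_lr: float = 0.0,          # "Minimum Learning Rate" from the card
    warmup_ratio: float = 0.015,  # "WSD Warmup Ratio"
    stable_ratio: float = 0.685,  # "WSD Stable Ratio"; decay takes the rest
) -> float:
    """Return the learning rate at `step` for a warmup-stable-decay schedule.

    ASSUMPTION: warmup and decay are both linear. The model card gives
    only the phase ratios, not the curve shapes.
    """
    warmup_steps = round(total_steps * warmup_ratio)
    stable_steps = round(total_steps * stable_ratio)
    decay_start = warmup_steps + stable_steps
    decay_steps = max(total_steps - decay_start, 1)

    if step < warmup_steps:
        # Phase 1: ramp linearly up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Phase 2: hold max_lr constant.
        return max_lr
    # Phase 3: interpolate linearly from max_lr down to min_lr.
    progress = min((step - decay_start) / decay_steps, 1.0)
    return max_lr + (min_lr - max_lr) * progress


if __name__ == "__main__":
    total = 1000  # hypothetical step count, chosen only for the demo
    for s in (0, 14, 15, 500, 700, 850, 1000):
        print(f"step {s:>4}: lr = {wsd_lr(s, total):.6f}")
```

With the card's ratios, 1.5% of the steps are spent warming up, 68.5% at the peak rate, and the final 30% decaying toward zero. To drive a PyTorch optimizer with this function, one option is to wrap `wsd_lr(step, total_steps) / max_lr` in `torch.optim.lr_scheduler.LambdaLR`, which expects a multiplicative factor rather than an absolute rate.
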
For a comparison with other small language models like this one, go [here](https://huggingface.co/spaces/AxiomicLabs/Open_SLM_Leaderboard).

## Generation Sample

```text
Prompt : 'Artificial intelligence is'
------------------------------------------------------------
Generated:
a form of biomedical research that has been fundamentally and intellectually revolutionary in the past decade. The first major advancement in artificial intelligence was the invention of computers, which were based on digital computer science and computational software, and nowadays we’re still working with machines as well as other languages. This is what’s happening in medicine today: this new technology enables us to get more information about how we can better understand human-like behaviour through our own imagination. Currently, computer scientists have been studying the future of artificial intelligence for nearly 20 years. They are investigating how the world’s people actually look at their bodies and their environment and why they see them and how it works. As a result, they have become increasingly interested in the way we think about the future of the mind and the world around us. Most of these artificial intelligences are not physically active, but are seen in their own right. So,
```

## Use Cases

1. Educational work and research
2. Fine-tuning for downstream use
3. Deployment on edge devices
4. Or just for fun.

## Limitations

1. Cannot chat, reason, code, or answer questions
2. Almost always unfactual
3. No long-context handling

## License

Before using, distributing, selling, or modifying this software, you must read the license [here](https://huggingface.co/fromziro/Er-Medium-12.5M/blob/main/LICENSE.txt).

## Inference

```python
#!/usr/bin/env python3

# Generation settings
MODEL_DIR = "fromziro/Er-Medium-12.5M"
TOKENIZER_PATH = MODEL_DIR
PROMPT = "Artificial intelligence is"
MAX_NEW_TOKENS = 256
TEMPERATURE = 0.7
TOP_P = 0.95
TOP_K = 30
REPETITION_PENALTY = 1.2
DO_SAMPLE = True

import torch
from pathlib import Path
from transformers import AutoModelForCausalLM, AutoTokenizer, PreTrainedTokenizerFast

# Prefer CUDA, then Apple MPS, then CPU.
device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Device : {device}")


def load_tokenizer(path_or_repo: str):
    """Load a tokenizer from a local tokenizer JSON file or a Hub repo / directory."""
    p = Path(path_or_repo)
    if p.exists() and p.is_file() and p.suffix.lower() == ".json":
        tok = PreTrainedTokenizerFast(tokenizer_file=str(p.resolve()))
    else:
        tok = AutoTokenizer.from_pretrained(path_or_repo, use_fast=True)

    # Fill in any special tokens the tokenizer does not define.
    if tok.bos_token is None:
        tok.add_special_tokens({"bos_token": "<|bos|>"})
    if tok.eos_token is None:
        tok.add_special_tokens({"eos_token": "<|eos|>"})
    if tok.unk_token is None:
        tok.add_special_tokens({"unk_token": "<|unk|>"})
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token if tok.eos_token is not None else "<|pad|>"

    tok.padding_side = "left"
    return tok


print("Loading tokenizer...")
tokenizer = load_tokenizer(TOKENIZER_PATH)
print(f" Vocab size : {len(tokenizer)}")
print(f" BOS : {tokenizer.bos_token!r}")
print(f" EOS : {tokenizer.eos_token!r}")
print(f" PAD : {tokenizer.pad_token!r} (id={tokenizer.pad_token_id})")

print(f"\nLoading model from {MODEL_DIR} ...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    # Half precision on CUDA, full precision on MPS and CPU.
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    low_cpu_mem_usage=True,
)
model.eval()
model.to(device)

# The KV cache is disabled both in the config and in the generate() call below.
model.config.use_cache = False
if hasattr(model, "generation_config") and model.generation_config is not None:
    model.generation_config.use_cache = False

total_params = sum(p.numel() for p in model.parameters())
print(f" Parameters : {total_params:,}")


def generate(
    prompt: str = PROMPT,
    max_new_tokens: int = MAX_NEW_TOKENS,
    temperature: float = TEMPERATURE,
    top_p: float = TOP_P,
    top_k: int = TOP_K,
    repetition_penalty: float = REPETITION_PENALTY,
    do_sample: bool = DO_SAMPLE,
) -> str:
    """Generate a continuation of `prompt` and return only the new text."""
    # Prepend BOS manually, so the tokenizer must not add special tokens itself.
    bos = tokenizer.bos_token or ""
    full_prompt = bos + prompt

    inputs = tokenizer(
        full_prompt,
        return_tensors="pt",
        add_special_tokens=False,
    ).to(device)
    inputs.pop("token_type_ids", None)

    gen_kwargs = dict(
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        repetition_penalty=repetition_penalty,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
        use_cache=False,
    )
    # Sampling parameters are passed only when sampling is enabled.
    if do_sample:
        gen_kwargs["temperature"] = temperature
        gen_kwargs["top_p"] = top_p
        gen_kwargs["top_k"] = top_k

    with torch.inference_mode():
        output_ids = model.generate(**inputs, **gen_kwargs)

    # Drop the prompt tokens so that only the continuation is decoded.
    prompt_len = inputs["input_ids"].shape[-1]
    new_ids = output_ids[0][prompt_len:]
    return tokenizer.decode(new_ids, skip_special_tokens=True)


if __name__ == "__main__":
    print(f"\nPrompt : {PROMPT!r}")
    print("-" * 60)
    output = generate(PROMPT)
    print("Generated:")
    print(output)
```

## Copyright

```
Copyright (c) 2026 FromZero
Copyright (c) 2026 Paul Courneya
Copyright (c) 2026 Jonathan LY
```

## Citation

```bibtex
@misc{er-medium-12.5m,
  title        = {Er-Medium-12.5M},
  organization = {FromZero},
  author       = {Paul Courneya and Jonathan LY},
  year         = {2026},
  url          = {https://huggingface.co/fromziro/Er-Medium-12.5}
}
```