---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- text-generation
- llama
- small-language-model
- efficient
- edge-deployment
- speculative-decoding
- 150m-parameters
- tpu-trained
- research
- low-resource
- portimbria
- gqa
- fineweb
pipeline_tag: text-generation
datasets:
- epfml/FineWeb-HQ
- HuggingFaceTB/finemath
- bigcode/starcoderdata
- StentorLabs/Portimbria-150M-Vs.-SmolLM2-135M
thumbnail: https://huggingface.co/StentorLabs/Portimbria-150M/resolve/main/thumbnail.png
widget:
- text: The history of artificial intelligence began
  example_title: History Continuation
- text: 'def quicksort(arr):'
  example_title: Code Continuation
- text: Once upon a time in a distant kingdom
  example_title: Story Generation
- text: The laws of thermodynamics describe
  example_title: Science Continuation
- text: Neural networks are computational models that
  example_title: Technical Explanation
model_card_authors:
- StentorLabs
model-index:
- name: Portimbria-150M
  results:
  - task:
      type: text-generation
    dataset:
      name: FineWeb-HQ (validation split)
      type: epfml/FineWeb-HQ
    metrics:
    - name: Best Validation Loss
      type: loss
      value: 2.8906
    - name: Best Perplexity
      type: perplexity
      value: 18
---
# Portimbria-150M

![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)
![Model Size](https://img.shields.io/badge/parameters-151M-green.svg)
![Training Time](https://img.shields.io/badge/training-~8h-orange.svg)
![Hardware](https://img.shields.io/badge/hardware-TPU%20v5e--8-red.svg)
![Context Length](https://img.shields.io/badge/context-4096%20tokens-purple.svg)
![Vocab Size](https://img.shields.io/badge/vocab-32768%20tokens-blue.svg)
![Perplexity](https://img.shields.io/badge/PPL-18.00-brightgreen.svg)
[![Hugging Face](https://img.shields.io/badge/🤗-Hugging%20Face-yellow.svg)](https://huggingface.co/StentorLabs)

> 🔬 **Research Artifact & Base Language Model.** Portimbria-150M is a next-token predictor, not a chat assistant. It has no safety tuning and should not be deployed in user-facing applications without fine-tuning first. It is, however, a high-quality open foundation: **fine-tune it, quantize it, convert it, distill from it, run LoRA on it, adapt it to your domain, or build anything else you can imagine**, and please publish your results! See [Intended Uses](#use-cases--intended-uses) for details.

> 💡 **Built by a solo 14-year-old developer, on a laptop, for $0.** Every model StentorLabs has released, including this one, was conceived, designed, coded, and trained without a budget, a team, a GPU cluster, or institutional support. The total cost of producing Portimbria-150M was zero dollars, using free Kaggle TPU quota and publicly available datasets. This is what democratized AI research looks like.

---

## What Is This?

**Portimbria-150M** is the first 150M-parameter model from StentorLabs and the inaugural entry in the **Portimbria** model family, a new scaling tier above the Stentor2 line. The name is a deliberate rearrangement of *Portia fimbriata*, a jumping spider famous for being extraordinarily intelligent relative to its tiny body size. That tension (compact but capable) is the design philosophy of this model family.

At ~151M parameters, Portimbria-150M is a base causal language model trained entirely from scratch on free-tier Kaggle compute using a Google Cloud TPU v5e-8 (eight chips). It was trained on approximately **6 billion tokens** drawn from a web/code/math curriculum, with a 4096-token context window, the longest in the StentorLabs model lineup to date.

Like all StentorLabs models, this is a **base next-token predictor**, not a chat assistant. It will not reliably follow instructions, has no safety tuning, and is best suited for research, prototyping, speculative decoding, and infrastructure experiments.

The key architectural differentiators from Stentor2-12M are: a **~12× parameter scale-up** (12.3M → 151M), a **4× longer context window** (4096 vs 1024 tokens), **Grouped Query Attention** (6 query heads, 2 KV heads), and a standard **Mistral BPE vocabulary** (32,768 tokens) rather than a compact custom tokenizer. The standard vocabulary makes the model fully compatible with the `AutoTokenizer` ecosystem.

GQA training stability is worth noting: Stentor2-12M-Preview experienced minor training instability when GQA was first introduced, largely because at 12M parameters the model simply wasn't large enough to absorb the optimization pressure smoothly. At 151M parameters, more than 12 times larger, Portimbria-150M handled GQA training without issue. The benefits (smaller KV cache, faster inference, no observed quality loss) clearly outweigh the minor challenge that existed only at the 12M scale.

---

## 📄 Comparative Analysis

A full comparative analysis of Portimbria-150M vs. SmolLM2-135M, covering architecture, training data, scaling law positioning, benchmarks, and deployment characteristics, is available here:

> 🔗 **[Compact Ambitions: Portimbria-150M vs. SmolLM2-135M](https://huggingface.co/datasets/StentorLabs/Portimbria-150M-Vs.-SmolLM2-135M)**

---

## The Portimbria Name

Why "Portimbria"? *Portia fimbriata* is a species of jumping spider native to Queensland, Australia. It is considered one of the most cognitively sophisticated spiders ever studied, capable of problem-solving, planning, and learned behavior, yet it fits comfortably on a fingertip. The word "Portimbria" is a scrambled encoding of the species name, chosen to reflect the same principle: a model small enough to train for free, yet ambitious enough to compete meaningfully with models trained at far greater cost.

---

## 📋 Table of Contents

1. [What Is This?](#what-is-this)
2. [The Portimbria Name](#the-portimbria-name)
3. [Model Architecture](#model-architecture)
4. [Head-to-Head: StentorLabs Model Family](#head-to-head-stentorlabs-model-family)
5. [Memory Requirements](#memory-requirements)
6. [Quick Start](#-quick-start)
7. [Important Limitations](#️-important-limitations)
8. [Honest Notices](#-honest-notices)
9. [Training Infrastructure](#training-infrastructure)
10. [Training Hyperparameters – Complete Reference](#training-hyperparameters--complete-reference)
11. [Precision Stability Recipe](#precision-stability-recipe)
12. [Data Pipeline](#data-pipeline)
13. [Weight Initialization](#weight-initialization)
14. [Evaluation & Results](#evaluation--results)
15. [Benchmark Results](#benchmark-results)
16. [Model Outputs](#model-outputs)
17. [Training Dynamics](#training-dynamics)
18. [Use Cases & Intended Uses](#use-cases--intended-uses)
19. [Out-of-Scope Uses](#out-of-scope-uses)
20. [Ethical Considerations & Societal Impact](#ethical-considerations--societal-impact)
21. [Inference Guide](#inference-guide)
22. [Free Inference – Try It Now](#-free-inference--try-it-now)
23. [Quantization](#quantization)
24. [Community Contributions](#-community-contributions--build-on-this-model)
25. [Format Conversion](#format-conversion)
26. [Speculative Decoding](#speculative-decoding)
27. [Bias, Risks & Limitations](#bias-risks--limitations)
28. [Related Work](#related-work)
29. [Environmental Impact](#environmental-impact)
30. [Citation](#citation)

---

## Model Architecture

Portimbria-150M is a `LlamaForCausalLM` model with Grouped Query Attention (GQA), a 32,768-token Mistral BPE vocabulary, and a 4096-token context window.

| Component | Value | Notes |
|---|---|---|
| **Architecture** | `LlamaForCausalLM` | Standard transformer decoder |
| **Hidden Size** | 768 | |
| **Intermediate Size (FFN)** | 2,048 | SwiGLU activation |
| **Num Hidden Layers** | 20 | |
| **Num Attention Heads** | 6 | |
| **Num Key/Value Heads** | 2 | GQA, 3:1 query-to-KV ratio |
| **Context Length** | 4,096 tokens | |
| **Vocab Size** | 32,768 | Mistral BPE |
| **Total Parameters** | 151,026,432 | |
| **Positional Encoding** | RoPE | `rope_theta = 50,000.0` |

Full architecture spec, GQA explanation & parameter count breakdown

### Full Core Configuration

| Component | Value | Notes |
|---|---|---|
| **Architecture** | `LlamaForCausalLM` | Standard transformer decoder |
| **Hidden Size** | 768 | |
| **Intermediate Size (FFN)** | 2,048 | Hidden × 2.67 (SwiGLU with 3 matrices) |
| **Num Hidden Layers** | 20 | |
| **Num Attention Heads** | 6 | |
| **Num Key/Value Heads** | 2 | GQA, 3:1 query-to-KV ratio |
| **Head Dimension** | 128 | 768 ÷ 6, TPU v5e optimal |
| **KV Dimension** | 256 | 768 × (2/6) |
| **Vocab Size** | 32,768 | Mistral BPE, padded to multiple of 128 |
| **Max Position Embeddings** | 4,096 | `block_size` in training script |
| **Hidden Activation** | SiLU | LlamaForCausalLM default |
| **Positional Encoding** | RoPE | `rope_theta = 50,000.0` |
| **RMS Norm Epsilon** | 1e-5 | |
| **Tie Word Embeddings** | True | Shared embedding / LM head |
| **Attention Bias** | False | |
| **MLP Bias** | False | |
| **Attention Implementation** | SDPA | PyTorch Scaled Dot Product Attention |

### Why GQA?

Grouped Query Attention (6Q, 2KV) reduces the KV cache memory footprint by 67% at inference time compared to standard Multi-Head Attention at the same hidden size. At a 4096-token context window this matters substantially: the KV cache for a single sequence is proportional to `2 × num_kv_heads × head_dim × num_layers × seq_len`. With 2 KV heads instead of 6, the cache shrinks to one-third of its full-MHA equivalent, enabling longer generation on memory-constrained hardware.

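
The proportionality above can be checked with a few lines of arithmetic. This is a minimal sketch, not part of the training or inference code: the helper name `kv_cache_bytes` is hypothetical, the 6-KV-head configuration is an imagined full-MHA variant used only for comparison, and a 2-byte (FP16/BF16) cache element is assumed.

```python
def kv_cache_bytes(num_kv_heads, head_dim, num_layers, seq_len, bytes_per_element):
    """Bytes needed to cache keys and values for a single sequence."""
    # The leading 2 accounts for storing both K and V.
    return 2 * num_kv_heads * head_dim * num_layers * seq_len * bytes_per_element


# Values from the configuration table; 2 bytes per element assumes an FP16/BF16 cache.
gqa = kv_cache_bytes(num_kv_heads=2, head_dim=128, num_layers=20,
                     seq_len=4096, bytes_per_element=2)

# Hypothetical full-MHA variant (6 KV heads), for comparison only.
mha = kv_cache_bytes(num_kv_heads=6, head_dim=128, num_layers=20,
                     seq_len=4096, bytes_per_element=2)

print(f"GQA cache: {gqa / 2**20:.0f} MiB")
print(f"MHA cache: {mha / 2**20:.0f} MiB")
print(f"Reduction: {1 - gqa / mha:.0%}")
```

Because the cache size is linear in `num_kv_heads`, the one-third ratio holds at any sequence length and precision.
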
### Parameter Count Breakdown

```python
def estimate_llama_params_gqa(vocab_size, hidden_size, intermediate_size,
                              num_hidden_layers, num_attention_heads,
                              num_key_value_heads):
    # K and V projections shrink with the number of KV heads under GQA.
    kv_dim = int(hidden_size * num_key_value_heads / num_attention_heads)
    q_proj = hidden_size * hidden_size
    k_proj = hidden_size * kv_dim
    v_proj = hidden_size * kv_dim
    o_proj = hidden_size * hidden_size
    attn = q_proj + k_proj + v_proj + o_proj
    mlp = 3 * hidden_size * intermediate_size  # gate, up, down
    norm = 2 * hidden_size                     # input + post-attention RMSNorm
    # Tied embedding (counted once) + all layers + final RMSNorm.
    total = vocab_size * hidden_size + num_hidden_layers * (attn + mlp + norm) + hidden_size
    return total
```

Plugging in Portimbria-150M values:

```
kv_dim     = 768 × (2/6)        = 256
q_proj     = 768 × 768          = 589,824
k_proj     = 768 × 256          = 196,608
v_proj     = 768 × 256          = 196,608
o_proj     = 768 × 768          = 589,824
attn/layer                      = 1,572,864
mlp/layer  = 3 × 768 × 2,048    = 4,718,592
norm/layer = 2 × 768            = 1,536
per_layer                       = 6,292,992
embedding  = 32,768 × 768       = 25,165,824
layers     = 20 × 6,292,992     = 125,859,840
final_norm                      = 768
total      = 25,165,824 + 125,859,840 + 768 = 151,026,432 ✓
```

| Component | Parameters | % of Total |
|---|---|---|
| Embedding Table (tied with LM Head) | 25,165,824 | 16.7% |
| Transformer Layers × 20 | 125,859,840 | 83.3% |
| – Attention (per layer × 20) | 31,457,280 | 20.8% |
| – FFN/MLP (per layer × 20) | 94,371,840 | 62.5% |
| – Layer Norms (per layer × 20) | 30,720 | 0.02% |
| Final RMS Norm | 768 | 0.001% |
| **Total** | **151,026,432** | **100%** |

With a standard 32K vocabulary, the embedding takes only 16.7% of the parameter budget, leaving 83.3% for the transformer stack that actually learns language patterns. This is a healthy allocation at this scale, especially since GQA shrinks the key/value projections (2 KV heads instead of 6) without reducing the hidden dimension.

---

## Head-to-Head: StentorLabs Model Family

Comparison table vs Stentor2-12M and Stentor2-30M

| Property | Stentor2-12M | Stentor2-30M | **Portimbria-150M** |
|---|---|---|---|
| **Vocabulary** | 8,064 (TokenMonster) | 8,064 (TokenMonster) | **32,768 (Mistral BPE)** |
| **Hidden Size** | 256 | 512 | **768** |
| **Intermediate Size** | 512 | 1,024 | **2,048** |
| **Num Layers** | 12 | 10 | **20** |
| **Attention Heads** | 4 | 8 | **6** |
| **KV Heads** | 4 (MHA) | 8 (MHA) | **2 (GQA)** |
| **Head Dimension** | 64 | 64 | **128** |
| **Context Length** | 1,024 | 1,024 | **4,096** |
| **Total Parameters** | 12.3M | 30.4M | **151.0M** |
| **Embedding Share** | 16.8% | 13.6% | **16.7%** |
| **Training Tokens** | 480M | 800M | **~6B** |
| **Training Hardware** | 2× T4 | 2× T4 | **TPU v5e-8** |
| **Training Time** | ~5h | ~6.75h | **~8h** |
| **Best Perplexity** | 26.61 | 18.07 | **18.00** |
| **Tokenizer** | TokenMonster | TokenMonster | **Mistral BPE** |

> **Cross-family comparison caveat:** PPL values are not directly comparable across families for two compounding reasons. First, Stentor2 models use TokenMonster (8K vocab) while Portimbria-150M uses Mistral BPE (32K vocab); different tokenizers produce different token spaces and therefore different raw perplexity scales. Second, and more importantly, the Stentor1 family was trained exclusively on **Cosmopedia + FineWeb-Edu**, and the Stentor2 family on **StenCore-PDF + FineWeb-HQ**, both purely web/document text with **zero code or math**. Portimbria-150M is the first StentorLabs model trained on a **web + code + math curriculum** (FineWeb-HQ 75%, StarCoderData 15%, FineMath-4+ 10%). The harder, more structured distributions of code and math raise the effective loss target, meaning a direct PPL comparison against any prior StentorLabs model significantly understates Portimbria-150M's real capability improvement.

---

## Memory Requirements

How much VRAM you need depends on precision and whether you're generating (which activates the KV cache). The table below covers a single sequence at the full 4096-token context. The KV cache scales linearly with sequence length, so at 1024 tokens it is roughly ¼ of the values shown.

| Precision | Weights | KV Cache (4096 ctx) | Total VRAM |
|---|---|---|---|
| FP32 | ~604 MB | ~160 MB | ~764 MB |
| FP16 / BF16 | ~302 MB | ~80 MB | ~382 MB |
| INT8 | ~151 MB | ~80 MB | ~231 MB |
| INT4 | ~76 MB | ~80 MB | ~156 MB |

> **KV cache note:** GQA (2 KV heads) already reduces the KV cache by 67% vs standard MHA at the same hidden size; the figures above reflect this. Formula: `2 (K+V) × 2 (KV heads) × 128 (head_dim) × 20 (layers) × seq_len × bytes_per_element`.

> **Weights note:** Weights are saved as FP32 in safetensors. Cast on load with `torch_dtype=torch.float16` or `torch_dtype=torch.bfloat16` to halve weight memory. INT8/INT4 figures require bitsandbytes quantization as shown in the [Quantization](#quantization) section.

---

## 🚀 Quick Start

### 1. Install Dependencies

```bash
pip install transformers torch safetensors
```

### 2. Load the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Portimbria-150M",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Portimbria-150M")
model = model.eval()
```

### 3. Generate Text

```python
prompt = "The history of computing began"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(next(model.parameters()).device)
attention_mask = torch.ones_like(input_ids)

with torch.inference_mode():
    output = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )

# Drop the prompt tokens and decode only the continuation.
generated = output[0][input_ids.shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```

> ℹ️ **No custom tokenizer required.** Portimbria-150M uses the Mistral BPE tokenizer via `AutoTokenizer`. No additional packages needed beyond `transformers`.

Pipeline usage & recommended generation settings

### 4. Using the Pipeline

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="StentorLabs/Portimbria-150M",
    torch_dtype=torch.float16,
    device_map="auto",  # requires the `accelerate` package
)

result = pipe(
    "Neural networks are computational models",
    max_new_tokens=100,
    temperature=0.8,
    top_p=0.9,
    repetition_penalty=1.1,
    do_sample=True,
)
print(result[0]["generated_text"])
```

### 5. Recommended Generation Settings

| Parameter | Recommended Range | Notes |
|---|---|---|
| `temperature` | 0.5 – 0.8 | Lower values (0.5–0.6) give more coherent, on-topic output; higher values (0.7–0.8) give more variety. Stay below 1.0. |
| `top_p` | 0.85 – 0.90 | This range prevents gibberish and completely random tokens without over-restricting word choice. |
| `repetition_penalty` | 1.05 – 1.2 | Stops looping and over-repetition while keeping outputs high quality. The sweet spot is 1.1. |
| `max_new_tokens` | 40 – 4096 | Depends entirely on your goal. For a quick definition or fact, 40–60 is enough. For a story or long document, use 2000–4096, keeping prompt plus generated tokens within the 4,096-token context. |

> **Temperature guidance:** Lower temperature concentrates probability on the model's most likely tokens, so output is more likely to stay on topic. Higher temperature increases creativity and diversity at the cost of some coherence.

> **max_new_tokens guidance:** Don't set this too low for creative tasks. The model often generates an EOS token and stops on its own before hitting the ceiling anyway, so a generous ceiling (e.g. 2000) for open-ended generation costs nothing if the model stops early.

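
Since the context window is 4,096 tokens, a large `max_new_tokens` only makes sense if the prompt leaves room for it. The following is a small sketch of one way to cap the value, assuming the limit applies to prompt and generated tokens together; `clamp_max_new_tokens` is a hypothetical helper, not part of `transformers`.

```python
def clamp_max_new_tokens(prompt_tokens, requested_new_tokens, context_length=4096):
    """Cap a generation budget so prompt + new tokens fit in the context window.

    Hypothetical helper for illustration; context_length defaults to the
    4,096-token window stated in this card.
    """
    remaining = context_length - prompt_tokens
    if remaining <= 0:
        raise ValueError("Prompt already fills the context window; shorten it first.")
    return min(requested_new_tokens, remaining)


# A 100-token prompt leaves 3,996 tokens of room, so a 4,096 request is capped.
print(clamp_max_new_tokens(prompt_tokens=100, requested_new_tokens=4096))
# Short requests pass through unchanged.
print(clamp_max_new_tokens(prompt_tokens=10, requested_new_tokens=60))
```

In practice `prompt_tokens` would be `input_ids.shape[1]` from the Quick Start snippet, and the result would be passed as `max_new_tokens` to `model.generate`.
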

---

## ⚠️ Important Limitations

- **Not Instruction-Tuned:** This is a base model. It will continue text, not follow instructions.
- **No Safety Tuning:** No RLHF, no DPO, no content filtering.
- **Limited Factual Reliability:** 151M parameters cannot store reliable world knowledge.
- **Context Window:** Hard limit of 4,096 tokens.
- **English Only:** Mistral BPE is heavily English-biased; other languages will tokenize poorly.
- **Repetition Without Penalty:** Always use `repetition_penalty ≥ 1.05`.
- **Shared Tensor Warning:** You may see `Removed shared tensor {'lm_head.weight'}` on save. This is expected from tied word embeddings and is safe to ignore.

---

## 📋 Honest Notices

10 candid first-hand observations about this model

These are candid, first-hand observations about this model.

1. **Dramatically more fluent than Stentor2-30M; the gap is very large.** The difference in output quality is not subtle or marginal. Portimbria-150M reads like coherent, natural text at a level that makes Stentor2-30M look like a completely different tier of model. The jump between them is less like "a little better" and more like comparing a toddler learning words to a child speaking full, structured sentences: they're both small, but they're at fundamentally different stages of capability. The 4096-token context window also allows coherent extended passages that the 1024-token Stentor2 models simply cannot sustain.
2. **Standard Mistral BPE means plug-and-play compatibility.** No custom tokenizer packages. `AutoTokenizer` just works.
3. **Drifts much less than smaller models, and when it does drift it stays in the neighborhood.** Topic coherence is meaningfully better than prior StentorLabs models. When drift does occur, it tends to pull toward semantically adjacent territory rather than going completely off the rails: if you prompt about hiking and the model drifts, it will likely end up somewhere like swimming or biking, not something totally unrelated like space or finance. Cushion-topic drift (to closely related subjects) happens occasionally; completely random topic jumps are rare.
4. **Practically no gibberish under normal conditions.** Incoherent token sequences are extremely rare at any reasonable temperature setting. You would need to deliberately run the model thousands of times on confusing or adversarial prompts to reliably reproduce gibberish output. In ordinary use, real English words come out consistently.
5. **Code generation does not work; the model responds to code prompts in English instead.** Despite being trained on Python, JavaScript, and TypeScript from StarCoderData, the code corpus (~15% of the training mix, or roughly 900M tokens at the nominal ratio) was far too small relative to the ~4.5B web-text tokens for code generation behavior to emerge. When prompted to write code, the model does not produce code. It produces English text instead, typically on a loosely related topic. Code prompts are not a supported use case for this model.
6. **Math reasoning is present but very weak; reliable arithmetic is absent.** The model cannot perform simple addition reliably. However, there is a meaningful difference from code: the model does recognize that math belongs to the domain of numbers, graphs, symbols, and equations. If you prompt it with `1 + 1 =`, it understands that a number should follow. It won't reliably get the right answer, but it knows it's doing math and responds accordingly, which is more than can be said for code (see above). Math-adjacent outputs (graphs, symbols, equation-like structure) appear appropriately in math contexts. Reliable symbolic computation is absent at this scale without instruction tuning or a much larger math token budget.
7. **GQA makes inference meaningfully faster.** Two KV heads vs six results in a significantly smaller KV cache, which matters most during long-context generation on memory-limited hardware.
8. **TPU training produces slightly different gradient dynamics than GPU.** BF16 on TPU has different rounding behavior than FP16 on GPU. The model was trained natively in BF16 and is provided in FP32 weights (as is standard practice for safetensors saves).
9. **The 4096-token context is real but untested at scale.** RoPE with `theta=50,000` was used throughout training at full block size. Position embeddings were exercised continuously, but very long-context generation quality has not been formally benchmarked.
10. **Strong topic grasp: it understands what you're asking about.** Even without instruction following, the model has a noticeably good sense of the domain of a prompt. Ask about a dog and a cat, and it will generate something about pets or a closely related subject, not something random like the universe or geopolitics. Earlier StentorLabs models (especially the Stentor2 line) were poor at this; Portimbria-150M handles it well in the vast majority of cases. Short prompts with very little context are the main exception: with nothing to anchor on, outputs will be more random. But with a moderately sized prompt, the model reliably stays in the right conceptual neighborhood.

---

## Training Infrastructure

Hardware, software stack & throughput details

### Hardware

| Component | Specification |
|---|---|
| **Accelerator** | Google Cloud TPU v5e |
| **Chip Configuration** | 8-chip pod slice (v5e-8) |
| **Active Training Processes** | 8 (one per chip via torchrun + PJRT) |
| **Global Batch Tokens/Step** | 262,144 (8 × 4,096 × 8 processes) |
| **Platform** | Kaggle Notebooks (free tier) |
| **Orchestration** | HuggingFace Accelerate + torchrun |
| **Process Group Init** | `env://` (XLA backend) |

### Software Stack

| Package | Role |
|---|---|
| PyTorch 2.6 | Core tensor operations |
| torch_xla 2.6 | XLA/TPU backend |
| HuggingFace Transformers | Model architecture (LlamaForCausalLM) |
| HuggingFace Accelerate | Distributed training orchestration |
| HuggingFace Datasets | Data loading and streaming |
| safetensors | Model serialization |

### Throughput

| Metric | Value |
|---|---|
| Average global tokens/sec | ~253,000 |
| Per-chip tokens/sec | ~31,600 |
| Total training tokens | ~6,000,000,000 |
| Total wall-clock time (epoch) | 28,871s (~8.02h) |

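
The batch and budget figures in these tables are related by simple arithmetic. Here is a short sketch that derives tokens per step, the number of optimizer steps the 6B-token budget implies, and per-chip throughput. The variable names are illustrative, and rounding the step count up (rather than down) is an assumption.

```python
import math

# Figures taken from the hardware and throughput tables above.
chips = 8
per_device_batch = 8           # sequences per chip per step
block_size = 4096              # tokens per sequence
token_budget = 6_000_000_000
global_tokens_per_sec = 253_000

tokens_per_step = per_device_batch * block_size * chips
# Assumption: a final partial step is counted, so round up.
steps_for_budget = math.ceil(token_budget / tokens_per_step)
per_chip_tokens_per_sec = global_tokens_per_sec / chips

print(f"tokens/step:        {tokens_per_step:,}")
print(f"steps for budget:   {steps_for_budget:,}")
print(f"per-chip tokens/s:  {per_chip_tokens_per_sec:,.0f}")
```
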

---

## Training Hyperparameters – Complete Reference

Full hyperparameter tables (optimizer, batch, schedule, checkpointing)

### Core Training Parameters

| Hyperparameter | Value | Notes |
|---|---|---|
| `learning_rate` | 8e-4 | Peak AdamW LR |
| `weight_decay` | 0.01 | Applied to Linear weights only |
| `max_grad_norm` | 1.0 | Gradient clipping |
| `optimizer` | AdamW | `betas=(0.9, 0.95)`, `eps=1e-8` |
| `scheduler` | Cosine | With linear warmup |
| `warmup_steps` | 1,144 | 5% of max_train_steps |
| `stable_steps` | 18,311 | 80% of max_train_steps |
| `max_train_steps` | 22,889 | Token budget reached first |
| `token_budget` | 6,000,000,000 | Total training tokens |
| `source_token_budget` | 6,000,000,000 | Source data token cap |
| `seed` | 42 | |
| `mixed_precision` | bf16 | Native TPU BF16 |

### Batch & Sequence Parameters

| Hyperparameter | Value | Notes |
|---|---|---|
| `per_device_train_batch_size` | 8 | Per TPU chip |
| `num_processes` | 8 | One per chip |
| `total_batch_size` | 64 | 8 × 8 |
| `block_size` | 4,096 | Sequence / context length |
| `tokens_per_optimizer_step` | 262,144 | `total_batch_size × block_size` |
| `gradient_accumulation_steps` | 1 | No accumulation |
| `num_train_epochs` | 1 | Token budget exhausted within epoch 0 |
| `pack` | True | Required for TPU static shapes |

### Evaluation & Checkpointing

| Hyperparameter | Value |
|---|---|
| `eval_steps` | 1,000 |
| `best_eval_steps` | 1,000 |
| `best_eval_start_step` | 1,000 |
| `max_eval_samples` | 5,000 |

### AdamW Optimizer – Detailed

- **Decay group:** All `nn.Linear` weight matrices → `weight_decay = 0.01`
- **No-decay group:** Bias terms, normalization parameters, embedding parameters → `weight_decay = 0.0`
- **Betas:** `(0.9, 0.95)`
- **Epsilon:** `1e-8`
- **Fused kernel:** Enabled when CUDA available (not applicable on TPU)

### Learning Rate Schedule

```
Phase 1: Warmup (steps 0–1,144):
  LR ramps linearly from 0 → 8e-4

Phase 2: Cosine Decay (steps 1,144–22,889):
  LR decays from 8e-4 → 0 following a cosine curve
```

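
The two phases described above can be written as a single function of the step index. This is a minimal sketch that follows only the warmup and cosine-decay description; the function name `learning_rate` is hypothetical, and the `stable_steps` value listed in the hyperparameter table is deliberately not modeled because the schedule description does not say how it is used.

```python
import math


def learning_rate(step, peak_lr=8e-4, warmup_steps=1144, max_train_steps=22889):
    """Linear warmup to peak_lr, then cosine decay to zero (illustrative only)."""
    if step < warmup_steps:
        # Phase 1: ramp linearly from 0 up to the peak.
        return peak_lr * step / warmup_steps
    # Phase 2: cosine decay over the remaining steps.
    decay_steps = max(1, max_train_steps - warmup_steps)
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


for step in (0, 572, 1144, 12000, 22889):
    print(f"step {step:>6}: lr = {learning_rate(step):.6f}")
```

The learning rate reaches its peak exactly at the end of warmup and falls to zero at the last scheduled step.
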

---

## Precision Stability Recipe

FP32 norm patching, critical layer wrapping & recipe summary

Training on TPU v5e in BF16 requires deliberate precision management to avoid gradient instabilities at 150M scale.

### 1. FP32 Normalization Layers (41 modules)

All RMSNorm modules are monkey-patched to compute in FP32:

```python
import torch


def patch_norm_to_fp32(module):
    # Wrapper name is illustrative; the original snippet showed only the inner function.
    original_forward = module.forward

    def _fp32_norm_forward(hidden_states, *args, _orig=original_forward, **kwargs):
        input_dtype = hidden_states.dtype
        # Run the norm in FP32, then cast back to the caller's dtype.
        output = _orig(hidden_states.float().contiguous(), *args, **kwargs)
        if torch.is_floating_point(output):
            output = output.to(input_dtype)
        return output

    module.forward = _fp32_norm_forward
```

**Count:** 20 layers × 2 norms each + 1 final norm = **41 modules total**.

### 2. FP32 Critical Layers (2 layers)

The **first and last transformer layers** run their entire forward pass in FP32:

- Weights remain in their training dtype; inputs are cast to `.float()` on entry
- `torch.amp.autocast("cuda", enabled=False)` prevents re-downcasting

**Rationale:** Boundary layers, where embeddings project in and logits project out, are most sensitive to numerical precision. Wrapping them in FP32 provides a stable floor at minimal compute cost.

### 3. FP32 Attention Softmax – Skipped

Not applied. PyTorch SDPA handles softmax numerical stability internally and requires FP16/BF16 inputs for its optimized code paths on both CUDA and XLA.

### Recipe Summary

| Technique | Count | Scope |
|---|---|---|
| FP32 norm modules | **41** | All RMSNorm layers |
| FP32 critical layers | **2** | First + last transformer layers |
| FP32 softmax modules | **0** | Skipped, SDPA incompatible |


---

## Data Pipeline

Training data sources, curriculum design & preprocessing details

Training used a **web/code/math curriculum** with the following source mix:

| Source | Dataset | Ratio |
|---|---|---|
| Web | `epfml/FineWeb-HQ` (CC-MAIN-2024-51) | 75% |
| Code | `bigcode/starcoderdata` (Python, JS, TypeScript) | 15% |
| Math | `HuggingFaceTB/finemath` (finemath-4plus) | 10% |

**Total tokens processed:** ~6,000,000,000 (single epoch over source data)

### Curriculum Design

Training used a **curriculum anneal** over the final 15% of the token budget, upweighting code and math relative to web text. This front-loads web generalization while ensuring the model sees a higher concentration of structured/formal content near the end of training.

### Text Preprocessing

```python
import unicodedata


def clean_text(text: str, preserve_linebreaks: bool = False) -> str:
    text = unicodedata.normalize("NFKC", text)
    # Normalize Windows (\r\n) and old-Mac (\r) line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    if preserve_linebreaks:
        # Code: keep the line structure, strip only trailing whitespace.
        lines = [line.rstrip() for line in text.splitlines()]
        text = "\n".join(lines).strip()
    else:
        # Web/math: drop empty lines and collapse all whitespace runs.
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        text = " ".join(lines)
        text = " ".join(text.split())
    return text
```

- **NFKC normalization** maps compatibility-equivalent Unicode characters to a single canonical form
- **Linebreak preservation** for code samples (not applicable to web/math)
- **Whitespace collapse** for web/math text

### Sequence Packing

Samples are packed into fixed 4,096-token blocks. Labels are identical to `input_ids` (causal LM objective). No cross-document attention masking is applied between packed samples, which is common practice for web-text pretraining.

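
To make the packing step concrete, here is a minimal sketch that concatenates tokenized documents into one stream and cuts it into fixed-size blocks with `labels` equal to `input_ids`. It is not the training script's implementation: the function name `pack_sequences` is hypothetical, and both the optional EOS separator between documents and the dropping of the trailing partial block are assumptions.

```python
def pack_sequences(tokenized_docs, block_size=4096, eos_token_id=None):
    """Pack token-id lists into fixed-length blocks for causal LM training.

    Illustrative sketch: the EOS separator and dropping the trailing
    partial block are assumptions, not confirmed details of the pipeline.
    """
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        if eos_token_id is not None:
            stream.append(eos_token_id)  # mark the document boundary

    blocks = []
    for start in range(0, len(stream) - block_size + 1, block_size):
        input_ids = stream[start:start + block_size]
        # Labels equal input_ids; the model shifts them internally.
        blocks.append({"input_ids": input_ids, "labels": list(input_ids)})
    return blocks


# Tiny example with block_size=4 so the result is easy to inspect.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
for block in pack_sequences(docs, block_size=4, eos_token_id=0):
    print(block["input_ids"])
```

Every block has exactly `block_size` tokens, which is what keeps tensor shapes static on TPU. Documents can straddle block boundaries, and without cross-document masking a token may attend to the tail of the previous document in the same block.
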

---

## Weight Initialization

Initialization scheme & residual scaling code

```python
import math

import torch.nn as nn


def initialize_weights(model, std=0.02, num_hidden_layers=20):
    # Scaled-down std for residual-path projections.
    residual_std = std / math.sqrt(2.0 * num_hidden_layers)  # ≈ 0.00316 for 20 layers
    for name, module in model.named_modules():
        if isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=std)
        elif isinstance(module, nn.Linear):
            # Output projections feed the residual stream, so they get the smaller std.
            proj_std = residual_std if name.endswith(("o_proj", "down_proj")) else std
            module.weight.data.normal_(mean=0.0, std=proj_std)
            if module.bias is not None:
                module.bias.data.zero_()
        elif "rmsnorm" in type(module).__name__.lower():
            if module.weight is not None:
                module.weight.data.fill_(1.0)
```

- Residual projections (`o_proj`, `down_proj`) use scaled-down std (`0.02 / sqrt(2 × 20) ≈ 0.00316`) to prevent residual stream explosion at initialization, following the GPT-2 convention.
- All other Linear layers use `std=0.02`.
- RMSNorm scales start at 1.0 (identity).

---

## Evaluation & Results

Training loss & perplexity curves, family comparison, full checkpoint history

### Training Loss Curve

![Training Loss Curve](training_loss_curve.png)

### Validation Perplexity Curve

![Perplexity Curve](perplexity_curve.png)

**Final result: best validation loss 2.8906, perplexity 18.00.**

### Comparison Across the StentorLabs Family

| Model | Params | Best PPL | Training Tokens | Compute | Notes |
|---|---|---|---|---|---|
| Stentor-12M (v1) | 12.0M | 89.01 | 200M | 2× T4 | v1 baseline |
| Stentor-30M (v1) | 30.4M | 33.02 | 600M | 2× T4 | |
| Stentor2-12M | 12.3M | 26.61 | 480M | 2× T4 | 8K TokenMonster vocab |
| Stentor2-30M | 30.4M | 18.07 | 800M | 2× T4 | 8K TokenMonster vocab |
| **Portimbria-150M** | **151.0M** | **18.00** | **~6B** | **TPU v5e-8** | **32K Mistral BPE, 4K ctx, GQA** |

> **Comparison note:** PPL values are not directly comparable across this family for two reasons: different tokenizers (TokenMonster 8K vs Mistral BPE 32K produce different token spaces) and different training data mixes (all prior StentorLabs models trained on web text only; Portimbria-150M is the first to include code and math). Both factors make a raw PPL number-to-number comparison misleading: Portimbria-150M's real improvement over Stentor2 is larger than the headline numbers suggest.

### Full Checkpoint History

| Step | Eval Loss | Perplexity | Notes |
|---|---|---|---|
| 1,000 | 5.3438 | ~209 | First best checkpoint |
| 2,000 | 4.1250 | ~62 | |
| 3,000 | 3.5625 | ~35 | |
| 8,000 | 3.4531 | ~31.6 | |
| 9,000 | 3.3125 | ~27.4 | |
| 10,000 | 3.1875 | ~24.3 | |
| 11,000 | 3.1406 | ~23.1 | |
| 12,000 | 3.0625 | ~21.4 | |
| 13,000 | 3.0312 | ~20.7 | |
| 14,000 | 2.9844 | ~19.8 | |
| 15,000 | 2.9375 | ~18.9 | |
| 17,000 | 2.9062 | ~18.3 | |
| **18,000** | **2.8906** | **18.03** | **Best checkpoint saved** |
| Final (epoch end) | 2.8906 | **18.00** | Final model |

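
The perplexity column is related to the eval loss by exponentiation. The sketch below assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for causal LM evaluation but is not stated explicitly in this card. Small differences from the table are possible if the logged perplexities were computed from unrounded losses.

```python
import math


def perplexity(mean_token_loss):
    """Perplexity from mean per-token cross-entropy (assumed to be in nats)."""
    return math.exp(mean_token_loss)


# A few (step, eval loss) pairs copied from the checkpoint table above.
checkpoints = [(1000, 5.3438), (15000, 2.9375), (18000, 2.8906)]
for step, loss in checkpoints:
    print(f"step {step:>6}: loss {loss:.4f} -> perplexity {perplexity(loss):.2f}")
```
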

---

## Benchmark Results

All benchmarks run zero-shot unless otherwise noted.

### Portimbria-150M Benchmarks

| Benchmark | Task | Score | Notes |
|---|---|---|---|
| PIQA | Physical commonsense reasoning | 57.62% | 0-shot, acc_norm |
| Winogrande | Pronoun resolution | 52.72% | 0-shot, acc |
| TruthfulQA MC2 | Truthfulness (multiple choice) | 46.94% | 0-shot, acc |
| ARC-Easy | Science QA | 33.80% | 0-shot, acc_norm |
| HellaSwag | Commonsense NLI (completion) | 27.46% | 0-shot, acc_norm |
| OpenBookQA | Elementary science | 24.60% | 0-shot, acc_norm |
| ARC-Challenge | Science QA | 22.53% | 0-shot, acc_norm |
| **ARC Average** | | **28.17%** | avg of Easy + Challenge |
| CommonsenseQA | Commonsense reasoning | 19.90% | 0-shot, acc |

Comparison against peer models, analysis & evaluation script ### Comparison Against Peer Models The table below compares Portimbria-150M against models of similar scale using publicly available, official or community-verified benchmark numbers. All Portimbria-150M scores are **0-shot**. Peer model scores use the shot count shown in parentheses, which varies by source β€” comparisons are directional, not exact. Scores shown as β€” were not found in any official or sufficiently authoritative source and are intentionally omitted. | Model | Params | Tokens | HellaSwag | ARC-Easy | ARC-Challenge | ARC Avg | PIQA | Winogrande | TruthfulQA | OpenBookQA | Source | |---|---|---|---|---|---|---|---|---|---|---|---| | **Portimbria-150M** | 151M | ~6B | 27.46 (0-sh) | 33.80 (0-sh) | 22.53 (0-sh) | 28.17 (0-sh) | 57.62 (0-sh) | 52.72 (0-sh) | 46.94 (0-sh) | 24.60 (0-sh) | lm-eval, this card | | SmolLM2-135M | 135M | 2T | **42.1** (0-sh) | 48.99 (0-sh) | 38.81 (calcΒ²) | **43.9** (0-sh) | **68.4** (0-sh) | 51.3 (0-sh) | β€” | **34.6** (0-sh) | lighteval, official HF card; ARC-Easy from public comparison table; ARC-Challenge back-calculatedΒ² | | SmolLM-135M | 135M | 600B | 41.2 (0-sh) | 58.84 (0-sh) | 25.96 (calcΒ²) | 42.4 (0-sh) | 68.4 (0-sh) | 51.3 (0-sh) | β€” | 34.0 (0-sh) | lighteval, official HF card; ARC-Easy from public comparison table; ARC-Challenge back-calculatedΒ² | | Pythia-160M | 160M | ~300B | 29.9 (0-sh) | 40.0 (0-sh) | 25.3 (0-sh) | 32.65 (0-sh) | 62.0 (0-sh) | 50.9 (0-sh) | 44.3 (0-sh) | 31.2 (0-sh) | 0-sh scores from public comparison tableΒ³; TruthfulQA from HF Open LLM Leaderboard | | OPT-125M | 125M | 180B | 31.5 (10-sh) | 41.3 (0-sh) | 22.10 (β€”sh) | 31.70 (calc) | 62.08 (0-sh) | 51.6 (5-sh) | 42.9 (0-sh) | 28.00 (0-sh) | ARC-Easy 0-sh from public comparison table; HellaSwag 10-sh & WinoGrande 5-sh from HF LeaderboardΒ³; ARC-Challenge shot count unconfirmed | | GPT-Neo 125M | 125M | 300B | 28.67 (0-sh) | 40.7 (0-sh) | 22.87 (β€”sh) | 31.79 (calc) | 63.06 
(0-sh) | 50.43 (0-sh) | 35.70 (β€”sh) | 26.20 (β€”sh) | HellaSwag/ARC-Easy/PIQA/WinoGrande: lm-eval, EleutherAI README (0-sh); ARC-Challenge/TruthfulQA/OpenBookQA: public comparison table, shots not statedβ‘£ | | GPT-2 (117M) | 117M | ~40B | 31.64 (β€”sh) | β€” | 22.95 (β€”sh) | β€” | 62.51 (β€”sh) | 50.04 (β€”sh) | 31.73 (β€”sh) | 27.20 (β€”sh) | GPT-2 124M public lm-eval rowΒΉ; shot counts not stated; no ARC-Easy/Avg available for exact 117M | | **Random Chance** | β€” | β€” | 25.0 | 25.0 | 25.0 | 25.0 | 50.0 | 50.0 | β€” | 25.0 | Uniform random over answer choices | **Table notes:** ΒΉ GPT-2 (117M) was released in 2019 before these benchmarks became standard. The scores listed are sourced from the closest available public lm-eval row (GPT-2 124M); the lm-eval harness `gpt2` shortname defaults to the 117M model, but no complete public 117M row with ARC-Easy was found. Shot counts are not stated in that source. ARC-Easy and ARC Avg are therefore unavailable and left as β€”. Β² SmolLM2 and SmolLM official model cards report only ARC Average (via lighteval); no per-split breakdown is published. ARC-Easy scores come from separate public comparison tables. ARC-Challenge is back-calculated as `2 Γ— ARC-Avg βˆ’ ARC-Easy` and marked `(calcΒ²)`. These derived values are estimates and have not been independently verified against a direct lm-eval run. Β³ Pythia-160M scores (HellaSwag, ARC-Easy, ARC-Challenge, PIQA, WinoGrande, OpenBookQA) have been updated to explicitly 0-shot values sourced from a public zero-shot comparison table, superseding the previously listed mixed-shot HF Open LLM Leaderboard entries. TruthfulQA remains from the HF Leaderboard (0-shot). For OPT-125M, ARC-Easy is from the same 0-shot comparison table; HellaSwag (10-shot) and WinoGrande (5-shot) are retained from the HF Leaderboard as no 0-shot replacement was found; ARC-Challenge shot count is unconfirmed in that source. 
④ GPT-Neo 125M HellaSwag, ARC-Easy, PIQA, and WinoGrande are 0-shot from the official EleutherAI gpt-neo GitHub README. ARC-Challenge, TruthfulQA, and OpenBookQA are sourced from a public comparison table; shot counts for those three tasks are not stated.

⑤ TruthfulQA MC2 does not have a meaningful random-chance baseline: it measures the normalized probability mass assigned to all correct completions rather than a standard n-way classification, so no uniform random reference applies. Note that **Portimbria's ARC-Challenge (22.53) and OpenBookQA (24.60) both fall below the 25.0 random baseline**, possibly because acc_norm length normalization penalizes the model's output distributions at this scale.

A dash cell means the score was not found in a reliable source, or the shot count makes direct comparison inappropriate; such scores are omitted rather than estimated.

### Analysis

Portimbria-150M was trained on **~6 billion tokens**: **2% of the data used for Pythia-160M** (~300B tokens), **2% of GPT-Neo 125M** (~300B tokens), **3.3% of OPT-125M** (~180B tokens), and a mere **0.3% of SmolLM2-135M** (2T tokens).

**TruthfulQA is Portimbria's most consistent standout.** Across every peer with a TruthfulQA entry, Portimbria leads: 46.94 vs GPT-2's 31.73, GPT-Neo's 35.70, OPT-125M's 42.9, and Pythia-160M's 44.3. The 15-point gap over GPT-2 and the 11-point gap over GPT-Neo look too large to be noise, although the shot counts for those two entries are not stated. They suggest that the web+code+math curriculum and the longer 4096-token training context are doing real work at the quality level that TruthfulQA targets. **Portimbria wins TruthfulQA against every peer model with a reported score on a small fraction of their training data (about 2% for Pythia-160M and GPT-Neo).**

**Winogrande is the other consistent win.** Portimbria (52.72) beats every model in the table: GPT-2 (50.04), GPT-Neo (50.43), OPT-125M (51.6, 5-shot), Pythia-160M (50.9), and both SmolLM models (51.3), despite all of them having seen vastly more training data.
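
The token-budget percentages quoted in the analysis, and the back-calculated ARC-Challenge values described in the table notes, are simple arithmetic. The sketch below reproduces them from the figures given in this card; the function name `back_calculated_challenge` is illustrative.

```python
PORTIMBRIA_TOKENS = 6e9  # approximate training budget stated in this card

# Peer training budgets as listed in the comparison table.
peer_tokens = {
    "Pythia-160M": 300e9,
    "GPT-Neo 125M": 300e9,
    "OPT-125M": 180e9,
    "SmolLM2-135M": 2e12,
}

for name, tokens in peer_tokens.items():
    share = PORTIMBRIA_TOKENS / tokens
    multiple = tokens / PORTIMBRIA_TOKENS
    print(f"{name:<13} Portimbria saw {share:.1%} of this budget ({multiple:.0f}x less data)")


def back_calculated_challenge(arc_avg: float, arc_easy: float) -> float:
    """Recover ARC-Challenge from a two-split average: 2 * avg - easy."""
    return 2 * arc_avg - arc_easy


print(f"SmolLM2-135M ARC-Challenge (derived): {back_calculated_challenge(43.9, 48.99):.2f}")
print(f"SmolLM-135M  ARC-Challenge (derived): {back_calculated_challenge(42.4, 58.84):.2f}")
```

These derived values inherit any mismatch between the sources of the average and the ARC-Easy score, which is why the table marks them as estimates.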
**The honest gaps are real.** On HellaSwag, ARC-Easy, PIQA, and OpenBookQA, every older peer with a reported score (Pythia-160M, GPT-Neo, OPT-125M, and GPT-2) scores higher. On ARC-Challenge, Pythia-160M, GPT-Neo, and GPT-2 lead, while OPT-125M (22.10) sits just below Portimbria's 22.53. Those gaps are genuine: Portimbria trails Pythia-160M by ~2.5 points on HellaSwag, ~6.2 points on ARC-Easy, and ~6.6 points on OpenBookQA. All are explainable by Pythia's 50× token advantage, but they are still real differences. These are the benchmarks with room to close through fine-tuning or extended pretraining.

**Against GPT-2 (124M proxy at unconfirmed shot count)**, Portimbria competes respectably given the token budget gap: it trails on HellaSwag (27.46 vs 31.64), PIQA (57.62 vs 62.51), and OpenBookQA (24.60 vs 27.20), but wins on TruthfulQA (46.94 vs 31.73) and WinoGrande (52.72 vs 50.04). ARC-Challenge is a near-tie (22.53 vs 22.95).

**SmolLM2-135M leads Portimbria on every filled benchmark cell except WinoGrande** (51.3 vs Portimbria's 52.72). With 333× the training data, its margins are consistent and expected; this is not a comparison Portimbria can win at current training scale. SmolLM-135M (600B tokens) leads on HellaSwag, PIQA, and ARC-Easy as well, with a notable ARC-Easy of 58.84, though its back-calculated ARC-Challenge (25.96) is only about 3.4 points above Portimbria's 22.53, and Portimbria leads on WinoGrande (52.72 vs 51.3). Neither SmolLM model has a TruthfulQA entry to compare against.

What this model is, beyond the numbers, is a **data-efficient foundation**. Leading on WinoGrande across the full peer group and on TruthfulQA against every peer with a reported score, on 6B tokens, while trailing meaningfully only on commonsense-heavy tasks that reward scale, is what you would hope to see from a model trained on a high-quality, mixed-domain curriculum. Fine-tuned on a domain-specific corpus or targeted at reasoning tasks, Portimbria-150M has a plausible path to closing the remaining gaps. All of this was built from scratch, for free, on a TPU available to anyone with a Kaggle account.

### Evaluation Setup (for Portimbria-150M)

Benchmarks were run on Kaggle with 2× Tesla T4 GPUs using the script below.
No API token is required; the model is public. Each benchmark block runs independently, so a single failure never stops the rest.

```python
import os, sys, subprocess, json, time, re, threading
from pathlib import Path
from datetime import datetime

os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["CUDA_LAUNCH_BLOCKING"] = "0"
os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_IB_DISABLE"] = "1"
os.environ["NCCL_SHM_DISABLE"] = "1"
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"

# ── Install deps ──────────────────────────────────────────────────────────────
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "-U",
                "accelerate", "transformers"], check=True)
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "-U",
                "git+https://github.com/EleutherAI/lm-evaluation-harness.git"], check=True)

# ── Config ────────────────────────────────────────────────────────────────────
MODEL = "StentorLabs/Portimbria-150M"
DTYPE = "float16"
BATCH = "32"
SEED = 42
OUT = "./results"
MODEL_ARGS = f"pretrained={MODEL},dtype={DTYPE},trust_remote_code=True"

# (output dir, title, lm-eval task list, num_fewshot, extra CLI args)
BLOCKS = [
    ("block1", "PIQA · OpenBookQA · TruthfulQA", "piqa,openbookqa,truthfulqa_mc2", 0, None),
    ("block2", "Winogrande · CommonsenseQA", "winogrande,commonsense_qa", 0, None),
    ("block3", "HellaSwag", "hellaswag", 0, None),
    ("block4", "ARC-Easy · ARC-Challenge", "arc_easy,arc_challenge", 0, None),
]

LAUNCH_BASE = [
    "accelerate", "launch",
    "--multi_gpu", "--num_processes=2", "--mixed_precision=fp16",
    "-m", "lm_eval",
    "--model", "hf",
    "--model_args", MODEL_ARGS,
    "--batch_size", BATCH,
    "--seed", str(SEED),
]

# ── Helpers ───────────────────────────────────────────────────────────────────
DEBUGGER_NOISE = re.compile(
    r"(Debugger warning|frozen modules|PYDEVD|make the debugger|pass -X|Note: Debugging)"
)


def ts():
    return datetime.now().strftime("%H:%M:%S")


def stream(proc):
    """Echo stdout and stderr of `proc` line by line, dropping debugger noise."""
    def _read(pipe):
        for raw in iter(pipe.readline, ""):
            line = raw.rstrip()
            if line and not DEBUGGER_NOISE.search(line):
                print(f"  [{ts()}] {line}", flush=True)

    t_out = threading.Thread(target=_read, args=(proc.stdout,), daemon=True)
    t_err = threading.Thread(target=_read, args=(proc.stderr,), daemon=True)
    t_out.start()
    t_err.start()
    proc.wait()
    t_out.join()
    t_err.join()


# ── Run ───────────────────────────────────────────────────────────────────────
Path(OUT).mkdir(parents=True, exist_ok=True)
summary = {}

for i, (name, title, tasks, fewshot, extra) in enumerate(BLOCKS, 1):
    print(f"\n{'='*60}", flush=True)
    print(f"  [{ts()}] BLOCK {i}/{len(BLOCKS)} - {title}", flush=True)
    print(f"{'='*60}\n", flush=True)

    cmd = LAUNCH_BASE + [
        "--tasks", tasks,
        "--num_fewshot", str(fewshot),
        "--output_path", f"{OUT}/{name}",
    ]
    if extra:
        cmd += extra

    t0 = time.time()
    try:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
            bufsize=1,
        )
        stream(proc)
        elapsed = round((time.time() - t0) / 60, 1)
        if proc.returncode == 0:
            print(f"\n  ✅ [{ts()}] {title} - done in {elapsed} min\n", flush=True)
            summary[name] = {"status": "ok", "elapsed_min": elapsed}
        else:
            print(f"\n  ❌ [{ts()}] {title} - exit {proc.returncode} ({elapsed} min)\n", flush=True)
            summary[name] = {"status": "failed", "exit_code": proc.returncode,
                             "elapsed_min": elapsed}
    except Exception as exc:
        elapsed = round((time.time() - t0) / 60, 1)
        print(f"\n  ❌ [{ts()}] {title} - {exc}\n", flush=True)
        summary[name] = {"status": "failed", "error": str(exc), "elapsed_min": elapsed}

# ── Final summary ─────────────────────────────────────────────────────────────
passed = sum(1 for v in summary.values() if v["status"] == "ok")
print(f"\n{'='*60}", flush=True)
print(f"  DONE - {passed}/{len(BLOCKS)} succeeded", flush=True)
print(f"{'='*60}", flush=True)
for name, info in summary.items():
    icon = "✅" if info["status"] == "ok" else "❌"
    mins = info.get("elapsed_min", "-")
    print(f"  {icon} {name:<10} {mins} min", flush=True)

summary_path = f"{OUT}/run_summary.json"
with open(summary_path, "w") as fh:
    json.dump(summary, fh, indent=2)
print(f"\n  Summary → {summary_path}\n", flush=True)

if any(v["status"] == "failed" for v in summary.values()):
    sys.exit(1)
```

**Metrics to report per task:**

| Task | Metric |
|---|---|
| PIQA | `acc_norm` |
| OpenBookQA | `acc_norm` |
| TruthfulQA | `mc2` |
| Winogrande | `acc` |
| CommonsenseQA | `acc` |
| HellaSwag | `acc_norm` |
| ARC-Easy | `acc_norm` |
| ARC-Challenge | `acc_norm` |
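
To turn the harness output into the per-task metrics listed above, something like the following can help. It is a sketch under stated assumptions, not part of the original evaluation: it assumes each results file is JSON with a top-level `"results"` dict keyed by task name and metric keys of the form `"acc_norm,none"` (the layout recent lm-eval releases appear to use), and that TruthfulQA MC2 is exposed under the key `acc`. The names `REPORT_METRIC`, `pick_metric`, `summarize`, and `load_all` are hypothetical.

```python
import json
from pathlib import Path

# Metric to report per lm-eval task name (mirrors the table above).
REPORT_METRIC = {
    "piqa": "acc_norm",
    "openbookqa": "acc_norm",
    "truthfulqa_mc2": "acc",  # assumption: MC2 is stored under the key "acc"
    "winogrande": "acc",
    "commonsense_qa": "acc",
    "hellaswag": "acc_norm",
    "arc_easy": "acc_norm",
    "arc_challenge": "acc_norm",
}


def pick_metric(task_results: dict, metric: str):
    """Return the value for `metric`, tolerating a ',<filter>' suffix in the key."""
    for key, value in task_results.items():
        if key.split(",")[0] == metric:
            return value
    return None


def summarize(results: dict) -> dict:
    """Map each known task in one parsed results file to its reported metric."""
    per_task = results.get("results", {})
    return {
        task: pick_metric(per_task[task], metric)
        for task, metric in REPORT_METRIC.items()
        if task in per_task
    }


def load_all(out_dir: str = "./results") -> dict:
    """Merge the reported metrics from every results JSON found under `out_dir`."""
    merged = {}
    for path in Path(out_dir).rglob("*.json"):
        if path.name == "run_summary.json":
            continue
        with open(path) as fh:
            data = json.load(fh)
        if isinstance(data, dict):
            merged.update(summarize(data))
    return merged


# Tiny in-memory example with made-up structure matching the assumed layout.
example = {
    "results": {
        "piqa": {"acc,none": 0.55, "acc_norm,none": 0.5762, "acc_norm_stderr,none": 0.01},
        "winogrande": {"acc,none": 0.5272, "acc_stderr,none": 0.014},
    }
}
print(summarize(example))
```

Calling `load_all("./results")` after a run would collect one value per task; if the key layout in your harness version differs, only `pick_metric` should need adjusting.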
--- ## Model Outputs
5 real generation samples with honest good/bad analysis The following samples illustrate real generation quality. All responses use **temp = 0.6, top_p = 0.85, rep_penalty = 1.15**. The first three are the first three responses obtained with no cherry-picking β€” what you can expect on average. The fourth is a good cherry-picked response (what you can expect when you're lucky). The fifth is a bad cherry-picked response (what you can expect when you're unlucky). --- **1. Average β€” "The definition of biology is"** *(max_tokens = 40, cut off)* > "a complex and multifaceted field that encompasses the study, treatment, and management of various biological processes. From the development of genetic engineering to the advancements in medicine and medicine, understanding" **βœ… Good:** The opening is grammatically smooth and the framing of biology as "complex and multifaceted" is appropriate. Correctly associates it with genetic engineering and medicine β€” the model clearly has relevant domain associations at the topical level. Given more tokens, this would very likely have continued coherently. **❌ Bad:** "treatment, and management of biological processes" frames biology more like a clinical or engineering field than a scientific one β€” a subtle but real semantic drift. More glaring is "medicine and medicine," a direct word repetition in a single phrase, which is a hallucination/repetition failure. --- **2. Average β€” "Some of the most popular universities are"** *(max_tokens = 40, cut off)* > "the University of California, a university in the United States. In addition to being the world's largest university, there are several other universities that offer more than one-day job opportunities. The university" **βœ… Good:** Names a real, recognizable institution. Stays on-topic throughout, remains grammatically coherent, and correctly anchors UC in the United States. **❌ Bad:** UC is not the world's largest university β€” that's a confident confabulation. 
"More than one-day job opportunities" is semantically broken; it reads like a garbled web text phrase and doesn't mean anything coherent as a university selling point. The prompt implied a list but the model produced a paragraph about a single university. --- **3. Average β€” "When making a business, here are some things to look out for:"** *(max_tokens = 40, cut off)* > "- Your Business is more likely to be successful. It is important to have a strong business plan in place and make sure that you are getting the best possible experience. - You want to know what your" **βœ… Good:** This is arguably the best of the three average samples. The model correctly interprets the colon as a list-format signal and produces bullet points. "Strong business plan" is genuinely relevant advice. The two bullets are grammatically clean and stay on topic. **❌ Bad:** "More likely to be successful" is vacuous without a comparison β€” successful than what? "Getting the best possible experience" is generic web-text filler that doesn't mean anything actionable. The second bullet ("You want to know what your") is mid-thought, but that's purely the 40-token ceiling β€” not the model. What is a model issue: the two bullets are redundant in sentiment β€” both are vague encouragements rather than distinct concrete tips. --- **4. Good (cherry-picked) β€” "Tips for running a marathon:"** *(max_tokens = 350, did not hit limit)* > "- Track your progress regularly. This helps you stay on track and ensures that you are not getting any immediate results. - Set realistic goals for yourself. Aim to set specific, measurable goals for each race. - Stay motivated by creating a roadmap for yourself. Aim to keep track of your progress and make necessary adjustments to your plan. By following these tips, you can successfully run a marathon in the United States, ensuring that you have a successful and healthy marathon. 
Remember to focus on your core competencies, such as speed, agility, and mental toughness, to maximize your chances of success. So go ahead, embrace the challenges, and watch your body run wild with confidence!" **βœ… Good:** The clearest demonstration of Portimbria's ceiling. The response is well-structured across ~350 tokens, self-contained, and completes naturally without hitting the limit β€” a sign of genuine coherence rather than just stringing tokens. "Speed, agility, and mental toughness" are legitimately relevant marathon concepts. The motivational closing is tonally appropriate. "Specific, measurable goals" is real coaching language. This is what the model looks like on a good day. **❌ Bad:** "ensures that you are not getting any immediate results" is the opposite of what the tip is trying to say β€” a factual inversion buried in otherwise fluent prose, which is arguably more dangerous than obvious nonsense because it reads naturally. "Run a marathon in the United States" injects unprompted geographic specificity with no reason. "Watch your body run wild with confidence" is awkward phrasing that slightly undercuts the otherwise decent closing. "Core competencies" is corporate jargon that reads oddly in a fitness context. The structural coherence is genuinely impressive; the semantic errors embedded within it are the honest limitation. --- **5. Bad (cherry-picked) β€” "What is 1 + 1?"** *(max_tokens ~350, hit limit)* > "A. 20 B. 30 C. 40 D. 50 D. 60 E. 
70 Explanation: 1) 1/2 = 1/2 (2) 2) 3/4 = 3/4 4) 4/4 = 4/4 5) 5/4 = 4/4 6) 6/4 = 4/4 7) 7/4 = 4/4 8) 9/4 = 5/4 9) 10/4 = 6/4 11) 10/4 = 6/4 12) 11/4 = 6/4 13) 13/4 = 6/4 14) 13/4 = 6/4 15) 13/4 = 6/4 16) 13/4 = 6/4 17) 14/4 = 6/4 18) 15/4 = 6/4 19) 16/4 = 6/4 20) 21/4 = 6/4 22) 23/4 = 6/4 24) 24/4 = 6/4 25) 26/4 = 6/4 26) 27/4 = 6/4 28) 29/4 = 6/4 29) 21/4 = 6/4 21) 22/4 = 6/4 23) 23/4 = 6/4 24) 24/4 = 6/4 25) 25/4 = 6/4 26) 26/4 = 6/4 27) 27/4 = 6/4 28) 29/4 = 6/4 29) 29/4 = 6/4 20) 29/4 = 6/4 21) 29/4 = 6/4 22) 29/4 = 6" **βœ… Good:** The model correctly recognizes "What is X?" as potentially a multiple-choice exam format and attempts to produce structured output with labeled options and an "Explanation:" section. That's a real and interesting structural pattern recognition. It also associates the prompt with fractions and arithmetic notation β€” showing it has some sense of mathematical register. **❌ Bad:** Almost everything else. The correct answer is 2, but the lowest option offered is 20. The explanation is a runaway repetition loop β€” the fraction sequence degenerates into `6/4 = 6/4` repeated indefinitely, which is the clearest example of what happens without adequate repetition penalty on a structurally-patterned output. Letter "D" appears twice in the options list. None of the fractions have any logical connection to 1+1. This is a base model with no instruction tuning and no arithmetic capability β€” asking it a direct math question with a short, definitive answer is exactly the kind of prompt that exposes those limits. This output also illustrates why `repetition_penalty β‰₯ 1.05` is non-negotiable; without it, pattern-heavy outputs like numbered lists collapse into loops almost immediately.
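
For readers unfamiliar with how a repetition penalty acts on the sampler, here is a minimal pure-Python sketch of the common logit-rescaling form (positive logits of already-generated tokens are divided by the penalty, negative ones multiplied). Whether this matches the exact implementation behind the samples above is an assumption; the function name is illustrative.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.15):
    """Return a copy of `logits` with previously generated token ids made less likely.

    A positive logit is divided by `penalty` and a negative logit is multiplied
    by it, so in both cases the token's score moves down when penalty > 1.
    """
    penalized = list(logits)
    for token_id in set(generated_ids):
        score = penalized[token_id]
        penalized[token_id] = score / penalty if score > 0 else score * penalty
    return penalized


# Toy vocabulary of three tokens; ids 0 and 1 were already generated.
before = [2.0, -1.0, 0.5]
after = apply_repetition_penalty(before, generated_ids=[0, 1, 1], penalty=1.15)
print(before, "->", [round(x, 4) for x in after])
```

Note that the penalty is applied once per distinct token, not once per occurrence, so a strongly patterned continuation can still win after rescaling. That is consistent with sample 5 above, which looped even at `rep_penalty = 1.15`.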
--- ## Training Dynamics
Step-by-step training phase breakdown & throughput details

The training run processed approximately **6 billion tokens** in a single epoch (epoch 0), running for **22,889 optimizer steps** before the token budget was exhausted.

**Early training (steps 0–1,144, warmup phase):** LR ramped linearly from 0 to peak. Loss dropped quickly from above 5.0. The first best checkpoint was recorded at step 1,000 (eval loss 5.3438).

**Mid training (steps 1,144–18,311, stable cosine phase):** Smooth and consistent loss reduction. Gradient norms were well-behaved in the 0.3–0.6 range for most steps, with occasional spikes (notably 3.7 at step 1,800 and 8.5 at step 13,200; both recovered cleanly). After the step-1,000 checkpoint, new bests were recorded at steps 2,000 / 3,000 / 8,000 / 9,000 / 10,000 / 11,000 / 12,000 / 13,000 / 14,000 / 15,000 / 17,000 / 18,000.

**Late training (steps 18,311–22,889, cosine decay tail):** LR decayed toward zero. Eval loss stopped improving after step 18,000, confirming that the best model was saved at that checkpoint.

**Throughput:** ~253,000 global tokens/sec on average (~31,600 per chip), with a brief XLA warmup window reset at step 300.

**Total wall-clock time:** ~8.02 hours (epoch training) + ~8 minutes (final eval and save).
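
The phase boundaries above are consistent with a linear warmup followed by cosine decay. The sketch below illustrates that shape only: the peak learning rate is a hypothetical placeholder (it is not stated in this section), the floor of zero is assumed, and the function name `lr_at` is illustrative rather than taken from the training code.

```python
import math

TOTAL_STEPS = 22_889   # optimizer steps reported above
WARMUP_STEPS = 1_144   # end of the warmup phase reported above
PEAK_LR = 1e-3         # hypothetical value, for illustration only


def lr_at(step, peak_lr=PEAK_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS, min_lr=0.0):
    """Linear warmup from 0 to peak_lr, then cosine decay to min_lr at `total`."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 572, 1_144, 10_000, 18_311, 22_889):
    print(f"step {step:>6,}: lr = {lr_at(step):.6f}")

# Rough tokens per optimizer step implied by the approximate 6B-token budget.
print(f"tokens per step: about {6e9 / TOTAL_STEPS:,.0f}")
```

With this shape the learning rate at step 18,311 is already a small fraction of the peak, which fits the observation that eval loss stopped improving during the decay tail.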
---

## Use Cases & Intended Uses

| Use Case | Suitability | Notes |
|---|---|---|
| Studying transformer training dynamics at 150M scale | ✅ High | Full architecture, hyperparameters, and training curves published |
| Speculative decoding draft model | ✅ High | Fast enough to draft for larger Llama-family targets |
| Benchmarking 4K-context inference latency | ✅ High | Realistic long-context workload |
| Quantization / conversion pipeline testing | ✅ High | Standard architecture, no custom ops |
| Teaching material for LLM courses | ✅ High | Fully documented, reproducible from scratch |
| Edge deployment experiments | ✅ High | ~300 MB of weights in FP16 (~600 MB in FP32); larger than Stentor2 but highly feasible on modern edge hardware |
| Domain-specific fine-tuning research | ✅ High | Standard transformers; fine-tune like any LLaMA model |
| Code completion prototyping | ❌ Not suitable | Code prompts produce English text, not code (see Honest Notices) |
| Text continuation / creative writing | ✅ Medium | Good fluency; limited thematic fidelity |
| Factual Q&A | ❌ Not suitable | Unreliable world knowledge at this scale |
| Production deployment | ❌ Not suitable | No safety tuning |
| Non-English text | ❌ Not suitable | Training data is English-heavy |
| Instruction following | ❌ Not suitable | Base model only |

---

## Out-of-Scope Uses

- **Any user-facing application:** No safety filtering, no alignment, no factual reliability.
- **Medical, legal, or financial advice:** Cannot reason reliably over specialized knowledge.
- **Generating content about real people:** Will fabricate.
- **Automated content pipelines:** Output quality is insufficient for unreviewed publication.
- **Instruction following:** This is a base next-token predictor.

---

## Ethical Considerations & Societal Impact
Data biases, safety considerations & societal impact

### Inherited Data Biases

Trained on FineWeb-HQ, StarCoderData, and FineMath-4+, all derived from web-scraped data. The model inherits:

- **Western-centric perspective:** English-language web text skews toward Western viewpoints and cultural contexts.
- **English monolingualism:** Mistral BPE is optimized for English. Text in other languages is split into many more tokens per word (high tokenizer fertility) and yields poor output quality.
- **Demographic underrepresentation:** Groups underrepresented in English web text will be underrepresented in outputs.
- **Code ecosystem bias:** StarCoderData covers many programming languages, but this model was deliberately trained only on the Python, JavaScript, and TypeScript subsets. These three were chosen because they are among the most widely used languages in 2026 and are generally more accessible to the majority of developers.

### No Safety Tuning

No RLHF, DPO, constitutional AI, or content filtering of any kind has been applied.

### Positive Aspects

- **Democratizing AI research:** Trained entirely on free Kaggle TPU compute.
- **Full transparency:** Complete training hyperparameters, architecture, and logs published.
- **Minimal environmental footprint:** ~8 hours of TPU compute is negligible versus large-scale pretraining runs.
--- ## Inference Guide
CPU inference (INT8) & GPU inference (FP16) code

### CPU Inference (INT8 Dynamic Quantization)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("StentorLabs/Portimbria-150M")
tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Portimbria-150M")

# Dynamically quantize the Linear layers to INT8 for CPU inference
model_int8 = torch.quantization.quantize_dynamic(
    model.cpu(),
    {torch.nn.Linear},
    dtype=torch.qint8,
)

inputs = tokenizer("The laws of physics state that", return_tensors="pt")
with torch.inference_mode():
    output = model_int8.generate(**inputs, max_new_tokens=80, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

### GPU Inference (FP16)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Portimbria-150M",
    torch_dtype=torch.float16,
    device_map="cuda",
).eval()
tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Portimbria-150M")


def generate(prompt, max_new_tokens=100, temperature=0.8, top_p=0.9):
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        output = model.generate(
            input_ids,
            attention_mask=torch.ones_like(input_ids),
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Return only the newly generated tokens, not the prompt
    return tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)


print(generate("Once upon a time in a distant kingdom"))
```
---

## 🚀 Free Inference: Try It Now

**No GPU, no setup, no API key required.** StentorLabs hosts a free demo space for all Stentor models:

> 🔗 **[https://huggingface.co/spaces/StentorLabs/StentorLabs-demo_space](https://huggingface.co/spaces/StentorLabs/StentorLabs-demo_space)**

---

## Quantization
FP16, BF16 & 4-bit (bitsandbytes) quantization code

### FP16 (GPU)

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Portimbria-150M",
    torch_dtype=torch.float16,
)
```

### BF16

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Portimbria-150M",
    torch_dtype=torch.bfloat16,
)
```

### 4-bit (bitsandbytes)

```bash
pip install bitsandbytes accelerate
```

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Portimbria-150M",
    quantization_config=bnb_config,
    device_map="auto",
)
```
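
To put the precision options above in context, the weight footprint can be estimated from the parameter count. This is a back-of-the-envelope sketch assuming ~151M parameters and ignoring activations, KV cache, and quantization metadata (scales, zero points, and any layers a quantizer leaves in higher precision), so real quantized checkpoints will be somewhat larger. The names below are illustrative.

```python
N_PARAMS = 151e6  # approximate parameter count from this card

# Nominal storage per parameter for each format, in bytes.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_footprint_mb(n_params: float, fmt: str) -> float:
    """Nominal weight size in megabytes (1 MB = 1e6 bytes) for a storage format."""
    return n_params * BYTES_PER_PARAM[fmt] / 1e6


for fmt in BYTES_PER_PARAM:
    print(f"{fmt:>5}: ~{weight_footprint_mb(N_PARAMS, fmt):.0f} MB")
```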
---

## 🌍 Community Contributions: Build on This Model

Portimbria-150M is built by an independent solo researcher, not a large corporate AI lab. That means it doesn't have teams of engineers running downstream experiments. **That's where you come in.**

This model is Apache 2.0 licensed and is explicitly intended to be modified, extended, and redistributed. Here are things StentorLabs actively encourages the community to try:

- **Fine-tune it** on your domain: instruction tuning, domain adaptation, RLHF, DPO, anything goes
- **Quantize it:** 4-bit, 8-bit, GGUF, GPTQ, AWQ, ONNX, all highly encouraged
- **Convert it** to other formats: GGUF for llama.cpp, ONNX for deployment, CoreML for Apple Silicon
- **Run LoRA or QLoRA** to adapt it cheaply on consumer hardware
- **Use it for speculative decoding** with a larger Llama-family target
- **Benchmark it** formally and share results
- **Publish your work:** fine-tunes, quantized versions, adapters, research findings, derivative models, anything

If you build something with Portimbria-150M, please share it on HuggingFace and tag or link back to the base model. Every community result makes this model more useful for everyone.

### LoRA / QLoRA Starter Configuration
Starter config, recommended hyperparameters & QLoRA note

If you haven't fine-tuned a Llama-family model before, here's a reasonable starting point for Portimbria-150M:

```python
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Portimbria-150M",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Portimbria-150M")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,            # LoRA rank; try 32 if underfitting
    lora_alpha=32,   # alpha = 2× rank is a reliable default
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# → trainable params: ~3.1M || all params: ~154M || trainable%: ~2.0%
```

**Recommended fine-tuning hyperparameters:**

| Hyperparameter | Value | Notes |
|---|---|---|
| Learning rate | 2e-4 | Scale down to 1e-4 for very small datasets |
| Optimizer | AdamW | `betas=(0.9, 0.999)`, `eps=1e-8` |
| LR scheduler | Cosine with warmup | ~5% warmup steps |
| Batch size | 4–16 | Per device; use gradient accumulation if memory-limited |
| Epochs | 2–5 | Watch for overfitting after epoch 2 |
| Max sequence length | 512–2048 | Up to 4096 is supported |

For **QLoRA** (4-bit quantized base + LoRA adapters on top), pass `quantization_config=BitsAndBytesConfig(load_in_4bit=True)` when loading the base model. The LoRA config and training hyperparameters above apply unchanged. This lets you fine-tune on a single consumer GPU with ~4–6 GB VRAM.
--- ## Format Conversion
Convert to GGUF (llama.cpp) & ONNX

### Convert to GGUF (llama.cpp)

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && pip install -r requirements.txt

huggingface-cli download StentorLabs/Portimbria-150M --local-dir portimbria-150m

python convert_hf_to_gguf.py portimbria-150m/ \
  --outfile portimbria-150m.gguf \
  --outtype f16

# llama-quantize and llama-cli must be built first (see the llama.cpp README)
./llama-quantize portimbria-150m.gguf portimbria-150m-q4_k_m.gguf q4_k_m
./llama-cli -m portimbria-150m-q4_k_m.gguf -p "The history of computing" -n 100
```

### Convert to ONNX

```bash
pip install optimum[exporters]

optimum-cli export onnx \
  --model StentorLabs/Portimbria-150M \
  --task text-generation-with-past \
  portimbria-150m-onnx/
```
---

## Speculative Decoding

Portimbria-150M can serve as a fast **draft model** to accelerate inference from larger target models. Because it uses the 32K Mistral BPE vocabulary, it can be paired directly with targets that share that tokenizer, and its acceptance rate should be substantially higher than that of Stentor2 models (which use a different 8K tokenizer). Targets with a different vocabulary can still be used, but expect lower acceptance rates.
Speculative decoding code & vocabulary compatibility notes

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

draft_model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Portimbria-150M",
    torch_dtype=torch.float16,
).to("cuda")
draft_tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Portimbria-150M")

target_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.float16,
    device_map="auto",
)
target_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

inputs = target_tokenizer("Explain the concept of recursion:", return_tensors="pt").to("cuda")

# The draft and target use different tokenizers, so both are passed to generate();
# this cross-tokenizer path needs a recent transformers release.
outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,
    tokenizer=target_tokenizer,
    assistant_tokenizer=draft_tokenizer,
    do_sample=True,
    max_new_tokens=200,
)
print(target_tokenizer.decode(outputs[0], skip_special_tokens=True))
```

> **Vocabulary compatibility:** Portimbria-150M uses the **Mistral-7B tokenizer** (32K BPE), which is *not* the same as the LLaMA-3 tokenizer (a much larger vocabulary of roughly 128K tokens). It is compatible with models that use the same Mistral BPE vocabulary (e.g. `mistralai/Mistral-7B-v0.1` and derivatives). Vocabulary-compatible speculative decoding will yield higher acceptance rates; vocabulary-mismatched pairs, such as the Llama-3.2 target in the example above, rely on HuggingFace's cross-tokenizer assisted generation and typically see lower acceptance rates.
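
To make "acceptance rate" concrete, here is a minimal sketch of the acceptance rule from the speculative decoding paper listed under Related Work: a drafted token is kept with probability `min(1, p_target / p_draft)`, and the expected acceptance for one position is the overlap between the two distributions. It assumes both models score the same vocabulary, and it is an illustration rather than the code path any particular library uses; the function names are hypothetical.

```python
import random


def accept_draft_token(p_target: float, p_draft: float, rng=random.random) -> bool:
    """Keep a drafted token with probability min(1, p_target / p_draft)."""
    if p_draft <= 0.0:
        raise ValueError("the draft model must assign the token non-zero probability")
    return rng() < min(1.0, p_target / p_draft)


def expected_acceptance(target_dist, draft_dist) -> float:
    """Expected per-token acceptance: sum over the vocabulary of min(p, q)."""
    return sum(min(p, q) for p, q in zip(target_dist, draft_dist))


# Toy 3-token vocabulary: the closer the draft is to the target, the higher the rate.
target = [0.7, 0.2, 0.1]
close_draft = [0.6, 0.3, 0.1]
poor_draft = [0.1, 0.1, 0.8]
print(f"close draft: {expected_acceptance(target, close_draft):.2f}")
print(f"poor draft:  {expected_acceptance(target, poor_draft):.2f}")
```

A draft model that agrees with the target on most of the probability mass gets most of its tokens accepted, which is where the speed-up comes from.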
--- ## Bias, Risks & Limitations - **Factual Accuracy:** All factual outputs should be treated as unreliable without verification. - **Context Boundary:** Hard limit of 4,096 tokens. - **English Bias:** Training data is English-dominant. - **Training Data Bias:** Inherits biases in FineWeb-HQ, StarCoderData, and FineMath-4+. - **Hallucination:** Will produce confident but fabricated content. - **No Alignment:** No RLHF, DPO, or constitutional training. - **Code Generation:** Code prompts produce English text output rather than functional code. The model does not generate syntactically or logically valid code in response to code-related prompts. - **Shared Tensor Warning:** `Removed shared tensor {'lm_head.weight'}` is expected. Safe to ignore. - **Gradient Spikes:** Two isolated gradient norm spikes occurred during training (step 1,800: 3.72, step 13,200: 8.56). Both recovered cleanly in subsequent steps with no apparent impact on the loss trajectory. --- ## Related Work
Comparable sub-200M models & related research papers

### Comparable Sub-200M Base Models

| Model | Parameters | Vocab | Context | Notes |
|---|---|---|---|---|
| **Portimbria-150M** (this model) | 151M | 32K BPE | 4,096 | Trained on 6B tokens, TPU v5e-8 |
| Stentor2-30M | 30.4M | 8K TokenMonster | 1,024 | StentorLabs family |
| Pythia-160M | 160M | 50K BPE | 2,048 | EleutherAI; 300B Pile tokens |
| GPT-2 (117M) | 117M | 50K BPE | 1,024 | OpenAI; 40GB WebText |
| OPT-125M | 125M | 50K BPE | 2,048 | Meta; 180B tokens |
| TinyLlama-1.1B | 1,100M | 32K BPE | 2,048 | 3T tokens; different scale tier |

### Related Research Papers

| Paper | Relevance |
|---|---|
| [Scaling Laws](https://arxiv.org/abs/2001.08361), Kaplan et al., 2020 | Informs token budget decisions |
| [Chinchilla](https://arxiv.org/abs/2203.15556), Hoffmann et al., 2022 | 6B tokens for ~151M params is ~40 tokens per parameter, roughly 2× the ~20 tokens/param Chinchilla-optimal ratio |
| [GQA](https://arxiv.org/abs/2305.13245), Ainslie et al., 2023 | Grouped Query Attention used in this model |
| [RoPE](https://arxiv.org/abs/2104.09864), Su et al., 2021 | Positional encoding |
| [LLaMA](https://arxiv.org/abs/2302.13971), Touvron et al., 2023 | Architecture basis |
| [Pythia](https://arxiv.org/abs/2304.01373), Biderman et al., 2023 | Comparable small-model scaling study |
| [Speculative Decoding](https://arxiv.org/abs/2211.17192), Leviathan et al., 2023 | Primary deployment use case |
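The ~40× figure in the Chinchilla row is a tokens-per-parameter ratio. The arithmetic can be sketched as follows, using the token and parameter counts stated in this card. The value of 20 tokens per parameter is an assumption: it is a commonly quoted rule-of-thumb reading of the Chinchilla result, not a number taken from this model's training configuration.

```python
# Figures stated in this model card.
TRAINING_TOKENS = 6e9   # ~6 billion training tokens
PARAMETERS = 151e6      # ~151M parameters

# Assumption: ~20 tokens per parameter as a rule-of-thumb for
# compute-optimal training (approximate reading of Chinchilla).
CHINCHILLA_TOKENS_PER_PARAM = 20.0


def tokens_per_parameter(tokens: float, params: float) -> float:
    """Return how many training tokens were seen per model parameter."""
    return tokens / params


ratio = tokens_per_parameter(TRAINING_TOKENS, PARAMETERS)
multiple_of_chinchilla = ratio / CHINCHILLA_TOKENS_PER_PARAM

print(f"tokens per parameter: {ratio:.1f}")
print(f"multiple of the ~20 tokens/param rule of thumb: {multiple_of_chinchilla:.2f}x")
```

Under these assumptions the model sits at about twice the rule-of-thumb ratio, which is the sense in which it is trained "above Chinchilla optimal".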
---

## Environmental Impact
Hardware, duration & estimated carbon

| Factor | Value |
|---|---|
| Hardware | Google Cloud TPU v5e-8 |
| Active Training Duration | ~8.02 hours |
| Cloud Provider | Google (via Kaggle free tier) |
| Compute Region | United States |
| Estimated Carbon | Minimal (< 1.0 kg CO₂e estimated) |

The TPU v5e is substantially more energy-efficient per FLOP than comparable GPU hardware. Running on Kaggle's free tier also means no dedicated data center allocation beyond what Kaggle already operates.
---

## Citation
BibTeX

```bibtex
@misc{izumoto2026portimbria150m,
  title        = {Portimbria-150M},
  author       = {Kai Izumoto},
  year         = {2026},
  publisher    = {StentorLabs},
  howpublished = {\url{https://huggingface.co/StentorLabs/Portimbria-150M}},
  note         = {151M parameter LlamaForCausalLM base model with GQA trained
                  from scratch on ~6B tokens (FineWeb-HQ, StarCoderData,
                  FineMath-4+) using a Google Cloud TPU v5e-8 on Kaggle free
                  compute. 4096-token context, 32K Mistral BPE vocabulary.
                  Apache 2.0 license.}
}
```
---

## Model Card Contact

Questions, benchmarks, or feedback: [StentorLabs@gmail.com](mailto:StentorLabs@gmail.com) or open a [discussion](https://huggingface.co/StentorLabs/Portimbria-150M/discussions).
Made with ❤️ by [StentorLabs](https://huggingface.co/StentorLabs)

_Democratizing AI through accessible, efficient models: trained on free compute, shared with everyone._