Bonsai-4B `bonsai_q1_f32` for MLC/WebLLM

This repository contains an experimental MLC/WebLLM conversion of `prism-ml/Bonsai-4B-unpacked`. It is a browser-runtime artifact. It is not a new model or fine-tune, and it is not a GGUF, MLX, or ONNX mirror.

The weights use the local `bonsai_q1_f32` format: binary signs packed into uint32 words, with one FP32 scale per 128-wide group. Linear layers, embeddings, and the final LM head are all stored in this format.
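To make the layout concrete, here is a minimal sketch of packing and unpacking one 128-wide group. It is an illustration only, not the converter's actual code: the bit order, the convention that a set bit means a positive weight, the mean-absolute-value scale, and the names `quantizeGroup` and `dequantizeGroup` are all assumptions.

```javascript
// Hypothetical sketch of a 1-bit sign format with one FP32 scale per group.
// Bit order, sign convention, and scale choice are assumptions, not taken
// from the actual bonsai_q1_f32 implementation.
const GROUP_SIZE = 128;
const WORDS_PER_GROUP = GROUP_SIZE / 32; // four uint32 words hold 128 sign bits

// One sign bit per weight plus one 32-bit scale shared by 128 weights.
const BITS_PER_WEIGHT = 1 + 32 / GROUP_SIZE;

function quantizeGroup(weights) {
  if (weights.length !== GROUP_SIZE) {
    throw new Error(`expected ${GROUP_SIZE} weights, got ${weights.length}`);
  }
  // Assumed scale: mean absolute value of the group, rounded to FP32.
  let sumAbs = 0;
  for (const w of weights) sumAbs += Math.abs(w);
  const scale = Math.fround(sumAbs / GROUP_SIZE);

  const words = new Uint32Array(WORDS_PER_GROUP);
  for (let i = 0; i < GROUP_SIZE; i++) {
    // Assumed convention: bit set means +scale, bit clear means -scale.
    if (weights[i] >= 0) words[i >>> 5] |= 1 << (i & 31);
  }
  return { words, scale };
}

function dequantizeGroup({ words, scale }) {
  const out = new Float32Array(GROUP_SIZE);
  for (let i = 0; i < GROUP_SIZE; i++) {
    const bit = (words[i >>> 5] >>> (i & 31)) & 1;
    out[i] = bit ? scale : -scale;
  }
  return out;
}
```

Under this layout the storage cost of a quantized tensor is 1 + 32/128 = 1.25 bits per weight. The 1.251 figure reported for the whole artifact is slightly higher, which would be consistent with a small number of tensors being kept at higher precision, although this repository does not say so.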

Artifact Summary

| Field | Value |
| --- | --- |
| Source checkpoint | `prism-ml/Bonsai-4B-unpacked` |
| Architecture | Qwen3-shaped decoder |
| MLC model type | `qwen3` |
| Quantization | `bonsai_q1_f32` |
| Conversation template | `qwen3_nothink` |
| Context window in config | `32768` |
| Prefill chunk in config | `2048` |
| Total parameters | 4,021,784,576 |
| Quantized parameter size | 0.586 GB |
| Bits per parameter | 1.251 |
| Parameter shards | 19 |
| Artifact size | about 564 MB |
| WebGPU library | `libs/bonsai-4b-bonsai_q1_f32-webgpu.wasm` |
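As a quick consistency check, the quantized parameter size can be recomputed from the parameter count and the bits-per-parameter figure in the table. The sketch below uses only those two numbers. The result suggests that the reported 0.586 GB is in binary units (GiB), which is an inference from the arithmetic rather than something the conversion log states.

```javascript
// Recompute the quantized parameter size from the table values.
const totalParams = 4021784576; // "Total parameters"
const bitsPerParam = 1.251;     // "Bits per parameter" (rounded in the table)

const sizeBytes = (totalParams * bitsPerParam) / 8;
const sizeGiB = sizeBytes / 2 ** 30; // binary gigabytes
const sizeGB = sizeBytes / 1e9;      // decimal gigabytes

console.log(`${sizeGiB.toFixed(3)} GiB, ${sizeGB.toFixed(3)} GB`);
// prints "0.586 GiB, 0.629 GB"
```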

Runtime Requirement

This artifact requires an MLC/WebLLM runtime with Bonsai q1 support. It is not expected to load in an unmodified upstream WebLLM build until the Bonsai q1 runtime path is upstreamed.

Use this repository when you control the WebLLM runtime and want to test browser-local Bonsai inference through WebGPU.

WebLLM Configuration

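// Register the converted model with WebLLM through a custom app config.
const appConfig = {
  model_list: [
    {
      // Base URL of the converted model artifacts (weight shards and config).
      model: "https://huggingface.co/welcoma/Bonsai-4B-bonsai_q1_f32-MLC/resolve/main/",
      // Identifier used when asking the engine to load this model.
      model_id: "Bonsai-4B-q1-MLC",
      // Compiled WebGPU model library.
      model_lib:
        "https://huggingface.co/welcoma/Bonsai-4B-bonsai_q1_f32-MLC/resolve/main/libs/bonsai-4b-bonsai_q1_f32-webgpu.wasm",
      // Reduced limits for smoke tests; the shipped config uses 32768 and 2048.
      overrides: {
        context_window_size: 4096,
        prefill_chunk_size: 512,
      },
    },
  ],
};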

The smaller override values above are intended for local browser smoke tests. Increase them only after measuring browser memory and cache behavior on the target device.
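A smoke test with this configuration might look like the sketch below. It assumes the patched runtime keeps upstream WebLLM's `CreateMLCEngine` entry point and OpenAI-style `chat.completions.create` call, which this repository does not confirm. The `runSmokeTest` helper and the prompt are hypothetical, and the WebLLM module is passed in as a parameter so the same code works with whichever build you control.

```javascript
// Hypothetical smoke test. `webllm` is the module object of your patched
// WebLLM build (for upstream, the "@mlc-ai/web-llm" package); `appConfig`
// is the configuration object shown above.
async function runSmokeTest(webllm, appConfig) {
  // Assumes the patched build exposes the upstream CreateMLCEngine API.
  const engine = await webllm.CreateMLCEngine("Bonsai-4B-q1-MLC", {
    appConfig,
    // Log download and compile progress while the model loads.
    initProgressCallback: (report) => console.log(report.text),
  });

  // Keep the request small: the goal is only to confirm that the model
  // loads and produces some text.
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Reply with one short sentence." }],
    max_tokens: 32,
  });
  return reply.choices[0].message.content;
}
```

In a bundled browser app this could be called as `runSmokeTest(webllm, appConfig).then(console.log)` after importing the runtime module as `webllm`.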

Validation

The 4B artifact was converted and WebGPU-compiled on the GCP MLC/WebLLM builder VM, not on a local laptop.

Source: prism-ml/Bonsai-4B-unpacked
Quantization: bonsai_q1_f32
Conversion peak RAM: 7.491 GB on CPU
WebGPU compile completed successfully
Compile estimate without KV cache: 2152.93 MB
Compile estimate with 4K KV cache: 3304.93 MB
A Hugging Face round-trip check confirmed that the README and WebGPU wasm are present and that no accidental resolve/ mirror path was uploaded.
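The two compile estimates above give a rough per-token KV cache cost, which helps when deciding how far to raise `context_window_size`. The sketch below assumes that "4K" means 4096 tokens and that KV cache memory grows linearly with context length. Both are assumptions, the `estimateTotalMb` helper is hypothetical, and the results are estimates rather than measurements.

```javascript
// Figures reported in the Validation section, in MB.
const withoutKvMb = 2152.93;
const with4kKvMb = 3304.93;
const kvTokens = 4096; // assumption: "4K" means 4096 tokens

// About 1152 MB of KV cache for 4096 tokens, so about 0.28 MB per token.
const kvCacheMb = with4kKvMb - withoutKvMb;
const kvMbPerToken = kvCacheMb / kvTokens;

// Assumption: KV cache memory scales linearly with the context window.
function estimateTotalMb(contextTokens) {
  return withoutKvMb + kvMbPerToken * contextTokens;
}

for (const ctx of [4096, 8192, 32768]) {
  console.log(`${ctx} tokens: about ${estimateTotalMb(ctx).toFixed(0)} MB`);
}
```

Under these assumptions the full 32768-token window from the shipped config would need roughly 11.4 GB, which is why the smaller overrides are the safer starting point for browser tests.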

Limitations

This is an experimental runtime artifact, not a general transformers model checkpoint.
Quality evaluation is limited to conversion/runtime smoke checks; no benchmark score is claimed by this repository.
Browser success depends on WebGPU support, available GPU memory, cache quota, and a compatible patched WebLLM runtime.
The ternary Bonsai family is not represented by this q1 format. Ternary models need a separate 2-bit/ternary MLC path.
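Some of the browser-side prerequisites listed above can be probed before attempting a download. The sketch below uses the standard WebGPU and Storage APIs (`navigator.gpu.requestAdapter()`, `adapter.limits`, `navigator.storage.estimate()`). The `checkBrowserPrereqs` helper is hypothetical, it takes the `navigator` object as a parameter, and it cannot tell whether the WebLLM runtime in use has Bonsai q1 support.

```javascript
// Hypothetical pre-flight check for WebGPU availability and storage quota.
// Pass the browser's `navigator` object as `nav`.
async function checkBrowserPrereqs(nav) {
  const result = {
    webgpu: false,
    maxBufferSize: null,
    maxStorageBufferBindingSize: null,
    storageQuotaBytes: null,
    storageUsageBytes: null,
  };

  // WebGPU: requestAdapter() resolves to null when no suitable GPU is found.
  if (nav.gpu) {
    const adapter = await nav.gpu.requestAdapter();
    if (adapter) {
      result.webgpu = true;
      result.maxBufferSize = adapter.limits.maxBufferSize;
      result.maxStorageBufferBindingSize =
        adapter.limits.maxStorageBufferBindingSize;
    }
  }

  // Storage quota: an estimate of how much the origin may cache.
  if (nav.storage && typeof nav.storage.estimate === "function") {
    const { quota, usage } = await nav.storage.estimate();
    result.storageQuotaBytes = quota ?? null;
    result.storageUsageBytes = usage ?? null;
  }

  return result;
}
```

In a page this could be run as `checkBrowserPrereqs(navigator).then(console.log)`. The reported limits are only a first filter: actual GPU memory headroom still has to be measured on the target device.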

Provenance

Original model by Prism ML (source checkpoint: `prism-ml/Bonsai-4B-unpacked`).

MLC/WebLLM conversion by welcoma.
