Bonsai-4B bonsai_q1_f32 for MLC/WebLLM

This repository contains an experimental MLC/WebLLM conversion of prism-ml/Bonsai-4B-unpacked. It is a browser-runtime artifact: not a new model or fine-tune, and not a GGUF, MLX, or ONNX mirror.

The weights use the local bonsai_q1_f32 format: binary signs packed into uint32 words (32 signs per word), with one FP32 scale per 128-wide group. Linear layers, embeddings, and the final LM head are all stored in this format.

Artifact Summary

| Field | Value |
| --- | --- |
| Source checkpoint | prism-ml/Bonsai-4B-unpacked |
| Architecture | Qwen3-shaped decoder |
| MLC model type | qwen3 |
| Quantization | bonsai_q1_f32 |
| Conversation template | qwen3_nothink |
| Context window in config | 32768 |
| Prefill chunk in config | 2048 |
| Total parameters | 4,021,784,576 |
| Quantized parameter size | 0.586 GB |
| Bits per parameter | 1.251 |
| Parameter shards | 19 |
| Artifact size | about 564 MB |
| WebGPU library | libs/bonsai-4b-bonsai_q1_f32-webgpu.wasm |
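
As a rough cross-check of the size figures, the sketch below computes the idealized storage cost if every parameter used one sign bit plus a share of one 32-bit scale per 128-weight group. `idealQ1Storage` is a hypothetical helper written for this illustration only.

```javascript
// Hypothetical helper: storage if every parameter were held as
// 1 sign bit plus a share of one scale per group.
function idealQ1Storage(paramCount, groupSize = 128, scaleBits = 32) {
  const bitsPerWeight = 1 + scaleBits / groupSize; // sign bit + amortized scale
  const bytes = (paramCount * bitsPerWeight) / 8;
  return { bitsPerWeight, bytes, gib: bytes / 2 ** 30 };
}

const ideal = idealQ1Storage(4021784576); // total parameters from the summary
console.log(ideal.bitsPerWeight);  // 1.25
console.log(ideal.bytes);          // 628403840
console.log(ideal.gib.toFixed(3)); // "0.585"
```

The reported 1.251 bits per parameter and 0.586 GB are slightly above this ideal. That would be consistent with a small number of tensors staying unquantized and with GB being counted in binary units (GiB), but both readings are assumptions rather than statements from the conversion logs.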

Runtime Requirement

This artifact requires an MLC/WebLLM runtime with Bonsai q1 support. It is not expected to load in an unmodified upstream WebLLM build until the Bonsai q1 runtime path is upstreamed.

Use this repository when you control the WebLLM runtime and want to test browser-local Bonsai inference through WebGPU.

WebLLM Configuration

const appConfig = {
  model_list: [
    {
      model: "https://huggingface.co/welcoma/Bonsai-4B-bonsai_q1_f32-MLC/resolve/main/",
      model_id: "Bonsai-4B-q1-MLC",
      model_lib:
        "https://huggingface.co/welcoma/Bonsai-4B-bonsai_q1_f32-MLC/resolve/main/libs/bonsai-4b-bonsai_q1_f32-webgpu.wasm",
      overrides: {
        context_window_size: 4096,
        prefill_chunk_size: 512,
      },
    },
  ],
};

The smaller override values above are intended for local browser smoke tests. Increase them only after measuring browser memory and cache behavior on the target device.
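
A minimal smoke-test sketch using this configuration might look like the following. It assumes your patched build exposes the same `CreateMLCEngine(modelId, engineConfig)` entry point and OpenAI-style `engine.chat.completions.create` call as upstream WebLLM. The factory is passed in as `createEngine` so the sketch makes no assumption about how your runtime is bundled or imported; `runSmokeTest` itself is a hypothetical helper.

```javascript
// `createEngine` is expected to be CreateMLCEngine from the patched
// WebLLM build you control (ASSUMPTION: same signature as upstream).
async function runSmokeTest(createEngine, appConfig, prompt = "Say hello in one sentence.") {
  const engine = await createEngine("Bonsai-4B-q1-MLC", {
    appConfig, // the object with model_list shown above
    initProgressCallback: (report) => console.log(report.text), // download/compile progress
  });

  // Keep the generation short: this only checks that the model loads and decodes.
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
    max_tokens: 32,
  });
  return reply.choices[0].message.content;
}
```

In a page that bundles the patched runtime, this would be called as `runSmokeTest(CreateMLCEngine, appConfig)`. The first argument to the factory must match the `model_id` in the configuration.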

Validation

The 4B artifact was converted and WebGPU-compiled on the GCP MLC/WebLLM builder VM, not on a local laptop.

  • Source: prism-ml/Bonsai-4B-unpacked
  • Quantization: bonsai_q1_f32
  • Conversion peak RAM: 7.491 GB on CPU
  • WebGPU compile completed successfully
  • Compile estimate without KV cache: 2152.93 MB
  • Compile estimate with 4K KV cache: 3304.93 MB
  • Hugging Face round-trip check: confirmed that the README and WebGPU wasm are present and that no accidental resolve/ mirror path was uploaded.
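
The two compile estimates above can be turned into a rough planning figure for other context sizes. The sketch below assumes the KV cache grows linearly with the context window and that nothing else in the estimate changes. That is a simplification, not something the compile logs state, and `estimateMemoryMb` is a hypothetical helper.

```javascript
const BASE_MB = 2152.93;       // compile estimate without KV cache
const WITH_4K_KV_MB = 3304.93; // compile estimate with a 4096-token KV cache

// ASSUMPTION: KV cache cost is linear in the number of context tokens.
const KV_MB_PER_TOKEN = (WITH_4K_KV_MB - BASE_MB) / 4096; // about 0.28 MB per token

function estimateMemoryMb(contextTokens) {
  return BASE_MB + KV_MB_PER_TOKEN * contextTokens;
}

console.log(estimateMemoryMb(4096).toFixed(2));  // "3304.93"
console.log(estimateMemoryMb(32768).toFixed(2)); // "11368.93"
```

By this estimate the full 32768-token window from the config would need several times the memory of the 4K override, so measure on the target device before raising context_window_size.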

Limitations

  • This is an experimental runtime artifact, not a general-purpose Hugging Face Transformers checkpoint.
  • Quality evaluation is limited to conversion/runtime smoke checks; no benchmark score is claimed by this repository.
  • Browser success depends on WebGPU support, available GPU memory, cache quota, and a compatible patched WebLLM runtime.
  • The ternary Bonsai family is not represented by this q1 format. Ternary models need a separate 2-bit/ternary MLC path.
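
Some of the browser-side conditions above can be checked before any download starts. The following is a minimal preflight sketch using the standard `navigator.gpu.requestAdapter()` and `navigator.storage.estimate()` browser APIs. The `preflight` helper and the 564e6-byte threshold (taken from the approximate artifact size) are assumptions for illustration. It cannot detect whether the WebLLM runtime is patched, and it does not measure free GPU memory.

```javascript
const ARTIFACT_BYTES = 564e6; // "about 564 MB" from the artifact summary

// Hypothetical helper: returns a list of human-readable problems (empty if none found).
// `nav` is injectable so the check can be exercised outside a browser.
async function preflight(nav = globalThis.navigator) {
  const problems = [];
  if (!nav || !nav.gpu) {
    problems.push("WebGPU is not exposed by this browser");
    return problems;
  }

  const adapter = await nav.gpu.requestAdapter();
  if (!adapter) problems.push("No WebGPU adapter is available");

  if (nav.storage && typeof nav.storage.estimate === "function") {
    const { quota = 0, usage = 0 } = await nav.storage.estimate();
    if (quota - usage < ARTIFACT_BYTES) {
      problems.push("Storage quota may be too small to cache the artifact");
    }
  }
  return problems;
}
```

An empty result only means these basic checks passed; loading can still fail on GPU memory or on an unpatched runtime.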

Provenance

Original model by Prism ML (source checkpoint: prism-ml/Bonsai-4B-unpacked).

MLC/WebLLM conversion by welcoma.
