Lance-3B NVFP4 (image checkpoint)

A 4-bit floating-point quantization of the image-focused Lance_3B checkpoint from bytedance-research/Lance, stored in NVIDIA's NVFP4 format (E2M1 weights with FP8 E4M3 per-block scales).

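Targets Blackwell tensor cores (RTX 50-series, B100/B200), where NVFP4 gets hardware-accelerated dequantization. Paired with TensorRT-LLM or vLLM ≥ 0.8, the claimed throughput is 5–10× that of INT4; Lance is not yet wired into either runtime, so treat that figure as an expectation rather than a measurement on this model.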

File size: 24.7 GB → ~6 GB (about 4× smaller)

Companion to the AWQ INT4 image variant: Reza2kn/Lance-3B-AWQ-INT4. Video-flavoured sibling: Reza2kn/Lance-3B-Video-NVFP4.

Format

4-bit E2M1 codes per weight (LUT {±0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6})
FP8 E4M3 scale per 16-element block (1 byte per 16 weights)
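Average 4.5 bits per weight (4-bit code + 8-bit scale shared by 16 weights), counting the FP8 scales only
Both scales_fp8 (uint8 bytes carrying float8_e4m3fn) and scales_bf16 (a redundant copy, costing one extra bit per weight) are stored; for slimmer storage, drop whichever one your runtime does not need.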
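
The format above can be illustrated with a small, dependency-free decoder. This is a sketch under stated assumptions, not the reference implementation: the nibble order (two codes per byte, low nibble first) and the function names are hypothetical, and a real runtime would vectorize this in PyTorch or a CUDA kernel. The E2M1 code layout (top bit is the sign, low three bits index the magnitude) and the float8_e4m3fn bit layout follow the standard definitions.

```python
from typing import List

# E2M1 magnitudes indexed by the low 3 bits of a code; bit 3 is the sign.
FP4_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # number of weights that share one FP8 scale


def decode_e4m3fn(byte: int) -> float:
    """Decode one float8_e4m3fn byte: 1 sign, 4 exponent (bias 7), 3 mantissa bits."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    mant = byte & 0x7
    if exp == 0xF and mant == 0x7:
        return float("nan")  # the only NaN pattern; e4m3fn has no infinities
    if exp == 0:
        return sign * (mant / 8.0) * 2.0 ** -6  # subnormal range
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)


def decode_fp4(code: int) -> float:
    """Decode one 4-bit E2M1 code via the lookup table."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * FP4_MAGNITUDES[code & 0x7]


def dequantize(packed: bytes, scales_fp8: bytes) -> List[float]:
    """Expand packed FP4 codes into floats, one FP8 scale per 16 weights.

    ASSUMPTION: each byte holds two codes, low nibble first. Check the
    actual packing in nvfp4_meta.json before relying on this order.
    """
    codes = []
    for b in packed:
        codes.append(b & 0xF)
        codes.append(b >> 4)
    if len(codes) != BLOCK_SIZE * len(scales_fp8):
        raise ValueError("expected exactly one scale per 16 weights")
    return [
        decode_fp4(code) * decode_e4m3fn(scales_fp8[i // BLOCK_SIZE])
        for i, code in enumerate(codes)
    ]
```

For example, eight bytes of 0x21 with a single scale byte 0x40 (which decodes to 2.0) expand to sixteen weights alternating between 1.0 and 2.0 under the assumed nibble order.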

The full FP4 LUT and storage layout are identical to the video sibling's; see the Reza2kn/Lance-3B-Video-NVFP4 README.
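
To make the layout concrete in the encoding direction, here is a minimal sketch of quantizing one 16-weight block. It assumes a plain absmax rule (scale ≈ max |w| / 6, rounded to the nearest float8_e4m3fn value); that rule and the names `encode_scale` and `quantize_block` are illustrative assumptions, and the AWQ-calibrated procedure used for this checkpoint is not modeled here.

```python
# E2M1 magnitudes indexed by the low 3 bits of a code; bit 3 is the sign.
FP4_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

# Every positive, finite float8_e4m3fn value keyed by its byte encoding
# (exponent bias 7; exp == 0 is the subnormal range; 0x7F is NaN and skipped).
E4M3_POSITIVE = {
    (exp << 3) | mant: (
        mant / 8 * 2.0 ** -6 if exp == 0 else (1 + mant / 8) * 2.0 ** (exp - 7)
    )
    for exp in range(16)
    for mant in range(8)
    if (exp, mant) not in {(0, 0), (15, 7)}
}


def encode_scale(x: float) -> int:
    """Return the byte of the positive e4m3fn value nearest to x."""
    return min(E4M3_POSITIVE, key=lambda b: abs(E4M3_POSITIVE[b] - x))


def quantize_block(block):
    """Quantize one block to (list of 4-bit codes, FP8 scale byte).

    ASSUMPTION: absmax scaling, so the largest weight maps near 6.0,
    the top of the E2M1 range. Values that land above 6.0 after the
    scale is rounded are clipped by the nearest-entry search below.
    """
    amax = max(abs(w) for w in block)
    scale_byte = encode_scale(amax / 6.0)
    scale = E4M3_POSITIVE[scale_byte]
    codes = []
    for w in block:
        target = abs(w) / scale
        idx = min(range(8), key=lambda i: abs(FP4_MAGNITUDES[i] - target))
        codes.append(idx | (0x8 if w < 0 else 0))
    return codes, scale_byte
```

Because a block's scale is chosen from that block's own largest weight, a single outlier only coarsens the 16 weights around it, which is the main reason for using small blocks.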

Calibration

Uses the same AWQ activation statistics as the AWQ-INT4 image variant: 252 understanding-path (und) and 252 generation-path (gen) Linear layers, all with activation data, calibrated on Lance's bundled x2t_image and t2i example sets (108.5 M tokens in total).

File layout

Lance_3B-NVFP4/
├── nvfp4_state_dict.safetensors   # ~6 GB: packed FP4 + FP8 + bf16 scales + pass-through
├── nvfp4_meta.json                # per-weight scheme + block_size + shape + FP4 LUT
└── README.md
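
Before loading roughly 6 GB of tensors, it can help to list what the state dict contains. The sketch below reads only the JSON header of a .safetensors file using the standard library, relying on the documented file layout (an 8-byte little-endian header length followed by a JSON header). The helper names `read_safetensors_header` and `summarize` are hypothetical, and the tensor key names in this checkpoint are not assumed.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form string metadata
    return header


def summarize(header):
    """Sum payload bytes per dtype from each tensor's data_offsets."""
    totals = {}
    for info in header.values():
        start, end = info["data_offsets"]
        totals[info["dtype"]] = totals.get(info["dtype"], 0) + (end - start)
    return totals
```

Calling `summarize(read_safetensors_header("Lance_3B-NVFP4/nvfp4_state_dict.safetensors"))` on a local copy would show how the file size splits between packed codes, scales, and pass-through tensors; the per-weight scheme itself is described in nvfp4_meta.json.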

How to use

Production: vLLM ≥ 0.8 or TensorRT-LLM on Blackwell. Lance is not yet wired into either runtime, but the format is compatible.
Verification: a reference WQLinearNVFP4 swap-in module is available at https://github.com/Reza2kn/lance-quant.

License

Apache 2.0, inherited from the base model.


Model tree for Reza2kn/Lance-3B-NVFP4

Base model: Qwen/Qwen2.5-VL-3B-Instruct
Finetuned: bytedance-research/Lance
Quantized: this model