Qwen3-4B-Instruct-2507-heretic OpenVINO NF4 for Intel NPU

This repository contains a ready-to-run OpenVINO IR export of p-e-w/Qwen3-4B-Instruct-2507-heretic, prepared for local inference on Intel NPU through OpenVINO Model Server.

This is not a fine-tune. It is an OpenVINO NF4 runtime export of the source model above. In the local benchmark below, this was the best balance of speed and memory among the tested INT4, NF4, and INT8 exports.

Source model

Source model: p-e-w/Qwen3-4B-Instruct-2507-heretic
Original base model: Qwen/Qwen3-4B-Instruct-2507
Architecture: Qwen3ForCausalLM
Task: text generation
License: Apache-2.0, inherited from the source model metadata

OpenVINO export

Compression metadata from this export:

{
  "mode": "nf4",
  "nncf_mode": "NF4",
  "group_size": -1,
  "ratio": 1.0,
  "all_layers": true
}

Known local artifact size:

openvino_model.bin: about 1.88 GiB

Tested Intel NPU runtime

Tested locally on Windows with OpenVINO Model Server / OpenVINO GenAI:

Target device: NPU
OVMS task: text_generation
Runtime prompt limit: 16384
Max concurrent sequences: 1
Cache interval multiplier: 64

Example OVMS command:

ovms.exe `
  --model_path Q:/llm/models/OpenVINO/p-e-w--Qwen3-4B-Instruct-2507-heretic-text-fp16-nf4-cw-ov `
  --model_name p-e-w--Qwen3-4B-Instruct-2507-heretic-nf4-npu `
  --rest_port 8000 `
  --rest_bind_address 0.0.0.0 `
  --task text_generation `
  --target_device NPU `
  --max_prompt_len 16384 `
  --max_num_seqs 1 `
  --cache_interval_multiplier 64 `
  --tool_parser hermes3

Local benchmark

Measured on the local Intel NPU setup above with the OVMS OpenAI-compatible chat completions endpoint.

Quantization	Load time	Avg output speed	Working set after runs	Private memory after runs
NF4	366.8 s	12.34 tok/s	7.42 GiB	2.57 GiB

The benchmark prompt was a short three-sentence OpenVINO explanation request with max_tokens=128.

Downloads last month: 23

Model tree for machine-made-Fibre/Qwen3-4B-Instruct-2507-heretic-OpenVINO-NF4-NPU

Base model

Qwen/Qwen3-4B-Instruct-2507

Finetuned

(1769)

this model