Qwen3-4B-Instruct-2507-heretic OpenVINO NF4 for Intel NPU

This repository contains a ready-to-run OpenVINO IR export of p-e-w/Qwen3-4B-Instruct-2507-heretic, prepared for local inference on Intel NPU through OpenVINO Model Server.

This is not a fine-tune. It is an OpenVINO NF4 runtime export of the source model above. In the local benchmark below, this was the best balance of speed and memory among the tested INT4, NF4, and INT8 exports.

Source model

  • Source model: p-e-w/Qwen3-4B-Instruct-2507-heretic
  • Original base model: Qwen/Qwen3-4B-Instruct-2507
  • Architecture: Qwen3ForCausalLM
  • Task: text generation
  • License: Apache-2.0, inherited from the source model metadata

OpenVINO export

Compression metadata from this export:

{
  "mode": "nf4",
  "nncf_mode": "NF4",
  "group_size": -1,
  "ratio": 1.0,
  "all_layers": true
}

Known local artifact size:

  • openvino_model.bin: about 1.88 GiB

Tested Intel NPU runtime

Tested locally on Windows with OpenVINO Model Server / OpenVINO GenAI:

  • Target device: NPU
  • OVMS task: text_generation
  • Runtime prompt limit: 16384
  • Max concurrent sequences: 1
  • Cache interval multiplier: 64

Example OVMS command:

ovms.exe `
  --model_path Q:/llm/models/OpenVINO/p-e-w--Qwen3-4B-Instruct-2507-heretic-text-fp16-nf4-cw-ov `
  --model_name p-e-w--Qwen3-4B-Instruct-2507-heretic-nf4-npu `
  --rest_port 8000 `
  --rest_bind_address 0.0.0.0 `
  --task text_generation `
  --target_device NPU `
  --max_prompt_len 16384 `
  --max_num_seqs 1 `
  --cache_interval_multiplier 64 `
  --tool_parser hermes3

Local benchmark

Measured on the local Intel NPU setup above with the OVMS OpenAI-compatible chat completions endpoint.

Quantization Load time Avg output speed Working set after runs Private memory after runs
NF4 366.8 s 12.34 tok/s 7.42 GiB 2.57 GiB

The benchmark prompt was a short three-sentence OpenVINO explanation request with max_tokens=128.

Downloads last month
23
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for machine-made-Fibre/Qwen3-4B-Instruct-2507-heretic-OpenVINO-NF4-NPU

Finetuned
(1769)
this model