Qwen3-4B-Thinking-2507-heretic OpenVINO INT4 for Intel NPU

This repository contains a ready-to-run OpenVINO IR export of heretic-org/Qwen3-4B-Thinking-2507-heretic, prepared for local inference on Intel NPU through OpenVINO Model Server.

It is intended for users who want to skip the local OpenVINO conversion and weight compression steps.

Source model

Source model: heretic-org/Qwen3-4B-Thinking-2507-heretic
Original base model: Qwen/Qwen3-4B-Thinking-2507
Architecture: Qwen3ForCausalLM
Task: text generation
License: Apache-2.0, inherited from the source model metadata

This is not a fine-tune. It is an OpenVINO INT4 runtime export of the source model above.

OpenVINO export

The exported directory includes OpenVINO model IR files, OpenVINO tokenizer and detokenizer files, tokenizer files, chat template, and generation config.

Compression metadata from this export:

{
  "mode": "INT4_SYM",
  "group_size": -1,
  "ratio": 1.0,
  "all_layers": true
}

The model config reports max_position_embeddings: 262144. The local NPU setup used a much smaller runtime prompt limit because long prompt limits consume substantial shared memory.

Tested Intel NPU runtime

Tested locally on Windows through OVMS / OpenVINO with:

Target device: NPU
OVMS task: text_generation
Runtime prompt limit used: 16384
Max concurrent sequences: 1
Cache interval multiplier: 64

Example OVMS command:

ovms.exe `
  --model_path Q:/llm/models/OpenVINO/heretic-org--Qwen3-4B-Thinking-2507-heretic-text-fp16-true-int4-sym-cw-ov `
  --model_name heretic-org--Qwen3-4B-Thinking-2507-heretic-true-int4-npu `
  --rest_port 8000 `
  --rest_bind_address 0.0.0.0 `
  --task text_generation `
  --target_device NPU `
  --max_prompt_len 16384 `
  --max_num_seqs 1 `
  --cache_interval_multiplier 64 `
  --reasoning_parser qwen3 `
  --tool_parser hermes3

Local benchmark

Observed after local OVMS load at --max_prompt_len 16384:

OVMS process working set: about 6.22 GiB
OVMS private memory: about 1.45 GiB

Token speed was not separately benchmarked for this upload card. This model is a Thinking variant and may spend a large number of tokens in reasoning before emitting final content.

Known local artifact size:

openvino_model.bin: about 1.88 GiB

Use an Instruct variant if you need lower latency and less overthinking.

Downloads last month: 19

Model tree for machine-made-Fibre/Qwen3-4B-Thinking-2507-heretic-OpenVINO-INT4-NPU

Base model

Qwen/Qwen3-4B-Thinking-2507

Finetuned

(250)

this model