Qwen3-4B-Thinking-2507-heretic OpenVINO INT4 for Intel NPU

This repository contains a ready-to-run OpenVINO IR export of heretic-org/Qwen3-4B-Thinking-2507-heretic, prepared for local inference on Intel NPU through OpenVINO Model Server.

It is intended for users who want to skip the local OpenVINO conversion and weight compression steps.

Source model

  • Source model: heretic-org/Qwen3-4B-Thinking-2507-heretic
  • Original base model: Qwen/Qwen3-4B-Thinking-2507
  • Architecture: Qwen3ForCausalLM
  • Task: text generation
  • License: Apache-2.0, inherited from the source model metadata

This is not a fine-tune. It is an OpenVINO INT4 runtime export of the source model above.

OpenVINO export

The exported directory includes OpenVINO model IR files, OpenVINO tokenizer and detokenizer files, tokenizer files, chat template, and generation config.

Compression metadata from this export:

{
  "mode": "INT4_SYM",
  "group_size": -1,
  "ratio": 1.0,
  "all_layers": true
}

The model config reports max_position_embeddings: 262144. The local NPU setup used a much smaller runtime prompt limit because long prompt limits consume substantial shared memory.

Tested Intel NPU runtime

Tested locally on Windows through OVMS / OpenVINO with:

  • Target device: NPU
  • OVMS task: text_generation
  • Runtime prompt limit used: 16384
  • Max concurrent sequences: 1
  • Cache interval multiplier: 64

Example OVMS command:

ovms.exe `
  --model_path Q:/llm/models/OpenVINO/heretic-org--Qwen3-4B-Thinking-2507-heretic-text-fp16-true-int4-sym-cw-ov `
  --model_name heretic-org--Qwen3-4B-Thinking-2507-heretic-true-int4-npu `
  --rest_port 8000 `
  --rest_bind_address 0.0.0.0 `
  --task text_generation `
  --target_device NPU `
  --max_prompt_len 16384 `
  --max_num_seqs 1 `
  --cache_interval_multiplier 64 `
  --reasoning_parser qwen3 `
  --tool_parser hermes3

Local benchmark

Observed after local OVMS load at --max_prompt_len 16384:

  • OVMS process working set: about 6.22 GiB
  • OVMS private memory: about 1.45 GiB

Token speed was not separately benchmarked for this upload card. This model is a Thinking variant and may spend a large number of tokens in reasoning before emitting final content.

Known local artifact size:

  • openvino_model.bin: about 1.88 GiB

Use an Instruct variant if you need lower latency and less overthinking.

Downloads last month
19
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for machine-made-Fibre/Qwen3-4B-Thinking-2507-heretic-OpenVINO-INT4-NPU

Finetuned
(250)
this model