Ministral-3-14B-Instruct-2512 · GGUF F16

Converted and evaluated by PBH Applied Systems, LLC — Applied AI/ML Consulting · LLM Optimization & Deployment · Quantized AI Infrastructure

🔬 This repository is part of a production-oriented evaluation series. Every model published under pbhappliedsystems has been independently evaluated using quant_eval v7.21 — a proprietary behavioral evaluation harness developed by PBH Applied Systems. Scores measure real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning families — not perplexity or benchmark leaderboard proxies.

📌 This is the full-precision F16 baseline repository. The evaluated and deployment-ready Q4_K_M variant is published at pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-Q4-K-M. That card documents the full F16 vs. Q4_K_M comparison, including a complete quantization degradation finding on toolcall_only (1.000 → 0.000).


Try the Live AI Agent Demo

Launch the PBH Applied Systems AI Agent Demo →

This model is part of the PBH Applied Systems evaluated model series that supports the live AI Agent Demo. The demo lets visitors interact with production-style agent workflows powered by open-weight language models evaluated through PBH Applied Systems' quant_eval framework.

The F16 model serves a different role than the Q4_K_M deployment variant. F16 is the full-precision baseline used to measure what the model can do before quantization. quant_eval then compares the quantized model against this baseline to identify which capabilities are preserved, which degrade, and which tasks require guardrails or a higher-precision deployment.

This comparison is central to the demo. It helps determine which model belongs in which agent role:

  • Reasoning models are selected for planning, analysis, and auditable decision workflows.
  • Document models are selected for long-context extraction, summarization, and structured Q&A.
  • Code models are selected for task completion, structured output, API scaffolding, and automation workflows.
  • Quantized variants are selected when they preserve enough behavior to reduce cost, latency, and GPU requirements.
  • F16 variants remain important when maximum fidelity, cleaner tool execution, or reduced quantization risk matters more than speed or cost.

The live demo shows the deployment side of that process. The F16 card documents the reference behavior. The Q4_K_M card shows what changes after compression. Together, they explain how PBH Applied Systems uses quant_eval to choose the correct LLM for the correct agent type instead of guessing from model size or leaderboard reputation.


Model Description

This repository contains the full-precision F16 GGUF of mistralai/Ministral-3-14B-Instruct-2512, a 14-billion parameter instruction-tuned model from Mistral AI (December 2025 release).

The F16 format preserves the original float16 weights without quantization. It serves two purposes in the PBH Applied Systems evaluation pipeline: as the reference baseline against which Q4_K_M capability retention is measured, and as a high-fidelity inference option for deployments where VRAM is not a constraint and maximum output quality is required.

For most production deployments, the Q4_K_M variant is the appropriate choice. The F16 is the ground truth from which quantization tradeoffs are measured.

Key Characteristics

  • Parameters: 14B
  • Format: GGUF F16 (full precision)
  • File size: 27.0 GB
  • SHA256: 74ea113134173d29f8daba097457500e831eace3741de002846b3ab89781fd52
  • Minimum VRAM (GPU inference): ~30 GB
  • Recommended GPU tier: A100 40 GB · RTX 4090 (24 GB, with offload) · 2× A10G
  • Context window: 32,768 tokens (per base model specification)
  • Inference speed (eval hardware): avg 6.37 sec/case on RTX 4090

Inference note: F16 runs at 6.37 sec/case avg vs. 3.77 sec/case for Q4_K_M — approximately 69% slower. The json_multistep family in particular averaged 17.79 sec/case at full precision due to multi-step generation length.
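
The 69% figure can be reproduced from the two per-case latencies quoted above. The short sketch below is only an arithmetic check; the variable names are illustrative and not part of quant_eval.

```python
# Per-case latencies reported in this card (seconds per case).
f16_secs = 6.37
q4_secs = 3.77

# Relative slowdown of F16 versus Q4_K_M, as a percentage of the Q4_K_M time.
slowdown_pct = (f16_secs / q4_secs - 1.0) * 100.0

# Equivalent view: share of per-case time saved by running Q4_K_M instead.
time_saved_pct = (1.0 - q4_secs / f16_secs) * 100.0

print(f"F16 is {slowdown_pct:.0f}% slower; Q4_K_M saves {time_saved_pct:.0f}% of per-case time")
```

The two percentages differ because they use different denominators, which is worth keeping in mind when comparing latency claims across cards.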


PBH Applied Systems Evaluation — quant_eval v7.21

Evaluation conducted by PBH Applied Systems, LLC using quant_eval v7.21
Run ID: 20260206_213615 · Fixtures: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c...) · Seed: 42
Hardware: NVIDIA RTX 4090 · Runner: full_weight_transformers · Total F16 rows: 42

Note on aggregate scores: The normalized aggregate dimensions (task completion, reasoning, coherence, instruction following) are reported for the Q4_K_M variant only, as they are computed from the combined comparison run. F16 evaluation is reported at the per-family pass rate level, which is the authoritative signal for deployment decisions.

Per-Family Pass Rates — F16 (full_weight_transformers)

| Family | N | Pass Rate | Avg Secs | Notes |
| --- | --- | --- | --- | --- |
| json_multistep | 5 | 0.600 | 17.79 | 3 pass, 2 fail; see case breakdown below |
| stateful_followup | 2 | 1.000 | 3.08 | Both turns parse and match expected state |
| toolcall_only | 2 | 1.000* | 3.81 | Gating passed; schema wrapper non-compliance (see note) |
| mixed_brief_json | 2 | 1.000 | 2.42 | Answer line + JSON schema correct |
| toolcall | 2 | 1.000 | 3.54 | Tool parse + schema valid |
| json | 4 | n/a | 8.01 | bucket_score avg = 10.000 |
| fuzz | 20 | n/a | 5.97 | bucket_score avg = 10.000 |
| mcq | 5 | n/a | 0.32 | bucket_score avg = 1.000 |

json_multistep — Case-Level Breakdown

| Case | Difficulty | Result | Failure Signal |
| --- | --- | --- | --- |
| ms_easy_01 | Easy | ✅ PASS | |
| ms_easy_02 | Easy | ❌ FAIL | oracle_equiv_ok=0 |
| ms_med_01 | Medium | ✅ PASS | |
| ms_med_02 | Medium | ✅ PASS | |
| ms_hard_01 | Hard | ❌ FAIL | checks_consistent_ok=0 + oracle_equiv_ok=0 |

The model passes both medium cases and one of the two easy cases. The ms_easy_02 failure is a plan divergence (the model chose [A, B] where the oracle expected [A, A]). The ms_hard_01 failure combines an inconsistent intermediate check with an incorrect final placement. With only five cases the difficulty trend is suggestive rather than conclusive, but both failure modes point the same way: multi-step plans should pass through an external validator before they are acted on.

⚠️ toolcall_only — Schema Wrapper Non-Compliance (F16)

Pass rate: 1.000 (gating) — but schema_ok=0 on both cases.

The F16 model correctly identifies the tool and extracts valid arguments (tool_name_ok=1, args_ok=1), satisfying the gating condition for a pass. However, both cases emit a non-standard outer wrapper key ("tool") instead of the expected "tool_name", triggering schema_ok=0 and detail=schema_error.

This is a schema discipline issue, not a capability failure. The model knows what tool to call and what arguments to supply — it simply wraps them in a non-conformant key. This is qualitatively distinct from the Q4_K_M variant, which fails completely on tool_name_ok=0 and args_ok=0 simultaneously.

| Signal | F16 Rate | Q4_K_M Rate | Interpretation |
| --- | --- | --- | --- |
| tool_name_ok | 1.000 | 0.000 | F16 identifies tool correctly |
| args_ok | 1.000 | 0.000 | F16 extracts valid args |
| schema_ok | 0.000 | 0.000 | Both fail outer wrapper schema |
| Gating pass rate | 1.000 | 0.000 | F16 passes; Q4_K_M fails entirely |

Practical implication for F16 deployment: A strict schema validation layer will catch the wrapper key mismatch. Add a normalization step in the response parser that maps "tool" → "tool_name", or state the exact key format in the system prompt. The underlying capability is intact; the schema discipline requires enforcement.
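
To make the distinction between gating signals and the tracked schema signal concrete, here is a minimal sketch of how a single bare tool-call response could be scored. It is not the quant_eval implementation, which is proprietary; the function name, the "add" tool, and the "a"/"b" argument names are all hypothetical.

```python
import json

def score_toolcall(raw: str, expected_tool: str, expected_args: dict) -> dict:
    """Score one bare tool-call response into three independent signals."""
    obj = json.loads(raw)
    # Accept either wrapper key when locating the tool name.
    name = obj.get("tool_name", obj.get("tool"))
    signals = {
        "tool_name_ok": int(name == expected_tool),
        "args_ok": int(obj.get("args") == expected_args),
        # Strict schema (assumed): exactly the keys "tool_name" and "args".
        "schema_ok": int(set(obj) == {"tool_name", "args"}),
    }
    # Gating uses only tool_name_ok and args_ok; schema_ok is tracked separately.
    signals["passed"] = int(signals["tool_name_ok"] and signals["args_ok"])
    return signals

# A response shaped like the F16 output described above: right tool, right args, wrong wrapper key.
f16_style = '{"tool": "add", "args": {"a": 5, "b": 10}}'
print(score_toolcall(f16_style, "add", {"a": 5, "b": 10}))
```

Under these assumptions the F16-style response passes gating while still reporting schema_ok=0, which mirrors the pattern in the table above.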

Signal-Level Diagnostics (F16)

json_multistep

| Signal | Rate | Tier |
| --- | --- | --- |
| schema_ok | 1.000 | Tier-1 (gating) |
| checks_consistent_ok | 0.800 | Tier-1 (gating) |
| stop_semantics_ok | 1.000 | Tier-1 (gating) |
| oracle_equiv_ok | 0.600 | Tier-1 (gating) |
| final_consistent_ok | 0.000 | Tier-2 (tracked, non-gating) |
| final_match_reported | 0.000 | Tier-2 (tracked, non-gating) |

Note on Tier-2: final_consistent_ok and final_match_reported are not gating signals. Most deployed agentic systems compute and validate state externally. Tier-1 oracle equivalence (0.600) is the production-relevant signal.

stateful_followup

| Signal | Rate | Tier |
| --- | --- | --- |
| turn1_parse_ok | 1.000 | Tier-1 |
| turn2_parse_ok | 1.000 | Tier-1 |
| turn1_exact_match | 1.000 | Tier-1 |
| turn2_exact_match | 1.000 | Tier-1 |

toolcall_only

| Signal | Rate | Tier |
| --- | --- | --- |
| tool_name_ok | 1.000 | Tier-1 (gating) |
| args_ok | 1.000 | Tier-1 (gating) |
| schema_ok | 0.000 | Non-gating (tracked) |

mixed_brief_json

| Signal | Rate | Tier |
| --- | --- | --- |
| answer_line_ok | 1.000 | Tier-1 |
| json_parse_ok | 1.000 | Tier-1 |
| schema_ok | 1.000 | Tier-1 |

Recommended Use Cases — F16

✅ Deploy with Confidence (F16)

  • Stateful multi-turn agents: Perfect two-turn state retention (1.000). Both turns parse and match expected state exactly.
  • Structured JSON outputs (single-step): bucket_score avg of 10.000 on both json and fuzz; consistently valid structured outputs.
  • Hybrid brief + JSON responses: mixed_brief_json passes at 1.000.
  • Tool-calling with response scaffolding: toolcall passes at 1.000. Tool call embedded in a broader response is fully reliable.
  • Tool-only dispatch with schema normalization: toolcall_only passes gating at 1.000. Add a wrapper key normalization step for strict schema compliance.
  • JSON multi-step with external validation loop: 0.600 pass rate; workable with an external planner or repair loop.

⚠️ Use with Guardrails (F16)

  • Strict bare tool-call dispatch: Schema wrapper non-compliance (schema_ok=0) requires a normalization layer for systems enforcing exact JSON key format.
  • Hard multi-step planning without validation: ms_hard_01 fails at both check consistency and oracle equivalence.

❌ Not Recommended (F16)

  • Unassisted multi-step planning: Where planning correctness must hold without external verification or oracle validation. In this run the two failures fell on one easy and one hard case, so difficulty alone does not identify which plans need checking.
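
The external validation or repair loop recommended for json_multistep can be sketched as a generate, validate, retry cycle. This is a minimal illustration under stated assumptions: `generate` stands in for any call into the model, `validate` stands in for your own external checker, and the "steps" key and feedback wording are hypothetical.

```python
import json
from typing import Callable, Optional

def generate_with_repair(
    generate: Callable[[str], str],
    validate: Callable[[dict], Optional[str]],
    prompt: str,
    max_attempts: int = 3,
) -> dict:
    """Call generate(), validate the parsed plan, and retry with feedback on failure."""
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = generate(current_prompt)
        try:
            plan = json.loads(raw)
            error = validate(plan)  # None means the plan is accepted
        except json.JSONDecodeError as exc:
            error = f"invalid JSON: {exc}"
        if error is None:
            return plan
        # Feed the validator's complaint back to the model on the next attempt.
        current_prompt = (
            f"{prompt}\n\nYour previous answer was rejected: {error}\n"
            "Return corrected JSON only."
        )
    raise ValueError(f"no valid plan after {max_attempts} attempts")

# Stub generator standing in for the model: wrong on the first call, right on the second.
answers = iter(['{"steps": ["A", "B"]}', '{"steps": ["A", "A"]}'])

def fake_generate(_prompt: str) -> str:
    return next(answers)

def check(plan: dict) -> Optional[str]:
    return None if plan.get("steps") == ["A", "A"] else "steps do not match the external oracle"

print(generate_with_repair(fake_generate, check, "Plan the placement."))
```

Capping the number of attempts keeps latency bounded, which matters here because json_multistep is already the slowest family at full precision.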

Hardware Requirements

| Configuration | VRAM Required | Recommended GPU |
| --- | --- | --- |
| F16 (this repo) · full GPU offload | ~30 GB | A100 40 GB · 2× A10G · RTX 4090 (partial offload) |
| F16 · mixed CPU/GPU offload | 16–24 GB VRAM + 16 GB RAM | RTX 3090/4090 with n_gpu_layers tuning |
| Q4_K_M (companion repo) | ~10–12 GB | T4 16 GB · RTX 3080/4080 · A10G |

For most production use cases, the Q4_K_M variant at ~10–12 GB VRAM and 3.77 sec/case is the appropriate deployment target. The F16 is recommended when maximum output fidelity is required and hardware constraints allow, or when using the model as the reference baseline in a quantization evaluation pipeline.
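
Most of the VRAM gap between the two variants comes from weight storage. A rough way to see this is to compare the file sizes listed under Artifact Provenance and convert the ratio into effective bits per weight. The sketch assumes the F16 file stores about 16 bits per weight and that GGUF metadata is negligible; it ignores KV cache and runtime overhead.

```python
# File sizes from the Artifact Provenance table in this card (GB).
f16_gb = 27.0
q4_gb = 8.24

# Assumption: the F16 file stores ~16 bits per weight and metadata is negligible.
size_ratio = q4_gb / f16_gb
effective_bits = 16 * size_ratio

print(f"Q4_K_M is {size_ratio:.1%} of F16 size, about {effective_bits:.2f} bits per weight")
```

The result is above 4 bits because Q4_K_M keeps some tensors at higher precision and stores per-block scales alongside the quantized values.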


Usage

Installation

pip install llama-cpp-python huggingface_hub

For GPU acceleration (CUDA):

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

Python — llama-cpp-python

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download F16 GGUF directly from HuggingFace Hub
# Note: 27 GB download — ensure sufficient disk space and ~30 GB VRAM
model_path = hf_hub_download(
    repo_id="pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16",
    filename="ministral-3-14b-instruct-2512-gguf-F16.gguf"
)

llm = Llama(
    model_path=model_path,
    n_ctx=8192,          # context window; increase up to 32768 per model spec
    n_gpu_layers=-1,     # -1 offloads all layers to GPU; reduce if VRAM < 30 GB
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful, concise assistant. Respond in structured JSON when asked."
        },
        {
            "role": "user",
            "content": "Summarize the following contract clause and flag any obligations: ..."
        }
    ],
    temperature=0.15,
    max_tokens=1024,
)

print(response["choices"][0]["message"]["content"])

For partial GPU offload when VRAM is between 16–24 GB:

llm = Llama(
    model_path=model_path,
    n_ctx=4096,
    n_gpu_layers=20,   # Tune based on available VRAM; remainder runs on CPU
    verbose=True,      # Enable to monitor layer offload and memory usage
)
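
A starting value for n_gpu_layers can be estimated from file size and available VRAM before tuning by hand. The helper below is a rough heuristic, not a llama.cpp API: it assumes all layers are the same size, ignores embeddings, and reserves a fixed amount of VRAM for KV cache and buffers. The layer count of 40 is a placeholder and should be replaced with the value reported in the model's GGUF metadata.

```python
import math

def estimate_gpu_layers(vram_gb: float, file_gb: float, n_layers: int, reserve_gb: float = 3.0) -> int:
    """Rough count of layers that fit in VRAM, assuming equal-sized layers."""
    per_layer_gb = file_gb / n_layers
    # Keep some VRAM back for KV cache, scratch buffers, and the display.
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))

# n_layers=40 is a placeholder, not a confirmed figure for this model.
print(estimate_gpu_layers(vram_gb=24.0, file_gb=27.0, n_layers=40))
```

Treat the result as an upper bound to start from, then lower it if the verbose load log reports out-of-memory errors or if a larger n_ctx is needed.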

For tool-calling with schema normalization (addressing the wrapper non-compliance noted above):

import json, re
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16",
    filename="ministral-3-14b-instruct-2512-gguf-F16.gguf"
)

llm = Llama(model_path=model_path, n_ctx=4096, n_gpu_layers=-1, verbose=False)

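def normalize_tool_wrapper(raw: str) -> dict:
    """
    Normalize F16 schema wrapper non-compliance.
    Maps non-standard 'tool' key -> 'tool_name' before validation.
    See quant_eval v7.21 toolcall_only finding: schema_ok=0 on both F16 cases.
    Raises ValueError if the payload is not a JSON object with 'tool_name' and 'args'.
    """
    # Extract JSON block from markdown fences if present
    match = re.search(r'```(?:json)?\s*([\s\S]*?)```', raw)
    payload = match.group(1).strip() if match else raw.strip()
    parsed = json.loads(payload)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(parsed, dict):
        raise ValueError("tool call must be a JSON object")
    # Normalize wrapper key
    if "tool" in parsed and "tool_name" not in parsed:
        parsed["tool_name"] = parsed.pop("tool")
    # Validate explicitly; assert statements are skipped under `python -O`
    if "tool_name" not in parsed or "args" not in parsed:
        raise ValueError("tool call must contain 'tool_name' and 'args'")
    return parsed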

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "Respond only with a valid JSON tool call."},
        {"role": "user", "content": "Add 5 and 10."}
    ],
    temperature=0.0,
    max_tokens=256,
)
raw = response["choices"][0]["message"]["content"]
result = normalize_tool_wrapper(raw)
print(result)

CLI — llama-cli

# One-shot prompt (ensure sufficient VRAM before running)
llama-cli \
  --model ministral-3-14b-instruct-2512-gguf-F16.gguf \
  --chat-template mistral \
  --system-prompt "You are a helpful assistant." \
  --prompt "Summarize the following and return a JSON object with keys: summary, risk_level, action_items." \
  --n-predict 512 \
  --ctx-size 8192 \
  --n-gpu-layers -1 \
  --temp 0.15

For server deployment (OpenAI-compatible endpoint):

llama-server \
  --model ministral-3-14b-instruct-2512-gguf-F16.gguf \
  --chat-template mistral \
  --ctx-size 8192 \
  --n-gpu-layers -1 \
  --port 8080 \
  --host 0.0.0.0

Query via the OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-required")

response = client.chat.completions.create(
    model="ministral-3-14b-instruct-2512-gguf-F16",
    messages=[{"role": "user", "content": "Your prompt here"}],
    temperature=0.15,
)
print(response.choices[0].message.content)

Artifact Provenance

| Artifact | Format | Size | SHA256 |
| --- | --- | --- | --- |
| ministral-3-14b-instruct-2512-gguf-F16.gguf | GGUF F16 | 27.0 GB | 74ea113134173d29f8daba097457500e831eace3741de002846b3ab89781fd52 |
| Q4_K_M (companion repo) | GGUF Q4_K_M | 8.24 GB | a23910514ee512aa28db8dddd390c26a73b9c318dcdec374ae02d722d9658749 |

Both artifacts were produced from mistralai/Ministral-3-14B-Instruct-2512 using a custom-built llama.cpp conversion pipeline developed by PBH Applied Systems. Conversion was performed on the full HuggingFace snapshot without modification to model weights prior to conversion.
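
The published SHA256 can be checked against a local download before loading the model. The sketch below uses only the Python standard library; the helper names are illustrative, and the expected digest is the F16 value from the table above.

```python
import hashlib
from pathlib import Path

# F16 digest as published in this card.
EXPECTED_SHA256 = "74ea113134173d29f8daba097457500e831eace3741de002846b3ab89781fd52"

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so a 27 GB GGUF never has to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's digest matches the expected hex string."""
    return sha256_of(path) == expected.lower()

# Usage, with model_path as returned by hf_hub_download:
#   if not verify(Path(model_path)):
#       raise RuntimeError("GGUF checksum mismatch; re-download the file")
```

Hashing a file of this size takes a noticeable amount of time, so it is usually done once after download rather than on every load.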


Evaluation Methodology

quant_eval v7.21 is a proprietary behavioral evaluation harness developed by PBH Applied Systems. The F16 evaluation run (20260206_213615) produces the full_weight_cache.json used as the reference baseline in the subsequent Q4_K_M comparison run (20260209_170235). This two-run architecture — F16 first, Q4_K_M second against the cached F16 results — enables exact apples-to-apples comparison of capability retention across quantization levels on an identical fixture set.
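
The comparison step of the two-run design reduces to subtracting cached baseline pass rates from the quantized run's pass rates, family by family. The sketch below illustrates that idea only; the dictionary layout is an assumption and does not reflect the actual full_weight_cache.json format, which is not published.

```python
def capability_deltas(baseline: dict, candidate: dict) -> dict:
    """Per-family change in pass rate (candidate minus baseline) for shared families."""
    return {
        family: round(candidate[family] - baseline[family], 3)
        for family in baseline
        if family in candidate
    }

# F16 pass rates from this card; the dict layout is illustrative only.
f16_baseline = {
    "json_multistep": 0.600,
    "stateful_followup": 1.000,
    "toolcall_only": 1.000,
    "mixed_brief_json": 1.000,
    "toolcall": 1.000,
}

# Only the toolcall_only Q4_K_M rate is stated in this card.
q4_candidate = {"toolcall_only": 0.000}

print(capability_deltas(f16_baseline, q4_candidate))
```

Restricting the comparison to families present in both runs avoids reporting a spurious drop when a family was skipped in one of them.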

Fixture set: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c079371b02a94f3c149eb78a6adc03dc16ff6833b964fbf4174f0)

Task families evaluated:

| Family | Description | Pass Signals |
| --- | --- | --- |
| fuzz | Property-based regression; structured placement correctness | schema_ok, constraints_ok |
| json | Single-step structured JSON with constraint rules | schema_ok, constraints_ok |
| json_multistep | Multi-step planning with self-check and oracle verification | schema_ok, checks_consistent_ok, stop_semantics_ok, oracle_equiv_ok |
| mcq | Multiple-choice extraction | choice_ok |
| stateful_followup | Two-turn state tracking; turn-2 correct given turn-1 | turn1_parse_ok, turn2_parse_ok, turn1_exact_match, turn2_exact_match |
| mixed_brief_json | Hybrid: natural language answer + valid JSON block | answer_line_ok, json_parse_ok, schema_ok |
| toolcall | Tool call embedded in response; parse + schema validation | stage1_tool_parse_ok, stage1_tool_schema_ok |
| toolcall_only | Bare schema-only tool call; strict tool name + args check | tool_name_ok, args_ok |

Scores are conservative conjunctions — a case passes only when all gating signals succeed.
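
The conjunction rule can be illustrated with the json_multistep results reported earlier in this card. The sketch below rebuilds the 0.600 pass rate and the per-signal rates from case-level signals; the per-case values are inferred from the case breakdown and signal rates, and the code is not the quant_eval scorer.

```python
GATING = ("schema_ok", "checks_consistent_ok", "stop_semantics_ok", "oracle_equiv_ok")

def all_ok(**overrides):
    """Start with every gating signal at 1, then apply the listed failures."""
    signals = {name: 1 for name in GATING}
    signals.update(overrides)
    return signals

# Case-level signals inferred from the breakdown table and signal rates in this card.
cases = {
    "ms_easy_01": all_ok(),
    "ms_easy_02": all_ok(oracle_equiv_ok=0),
    "ms_med_01": all_ok(),
    "ms_med_02": all_ok(),
    "ms_hard_01": all_ok(checks_consistent_ok=0, oracle_equiv_ok=0),
}

def passed(signals: dict) -> bool:
    # A case passes only when every gating signal succeeds.
    return all(signals[name] == 1 for name in GATING)

pass_rate = sum(passed(s) for s in cases.values()) / len(cases)
signal_rates = {name: sum(s[name] for s in cases.values()) / len(cases) for name in GATING}

print(pass_rate, signal_rates)
```

Because a pass requires every gating signal, the family pass rate can never exceed the lowest individual signal rate, here oracle_equiv_ok.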

Evaluation hardware: NVIDIA RTX 4090 (24 GB VRAM) · F16 evaluation date: February 6, 2026 · quant_eval seed: 42


🔬 About quant_eval & This Evaluation Series

quant_eval is a proprietary behavioral evaluation harness developed by PBH Applied Systems, LLC. It measures real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning — not perplexity or leaderboard proxies. Every model published under pbhappliedsystems has been independently evaluated using quant_eval before being recommended for any production role.

See it in action: Live AI Agent Demo →
The demo runs production-style agent workflows powered by open-weight models selected through the quant_eval evaluation pipeline.

Need a deployment recommendation? Not sure which quantization level is right for your hardware, latency target, or agent type?
→ pbhappliedsystems.com


Evaluated and published by PBH Applied Systems, LLC · patrick@pbhappliedsystems.com


About PBH Applied Systems

PBH Applied Systems, LLC is an Oklahoma City–based applied machine learning and AI systems company specializing in production-grade model evaluation, quantization pipelines, agentic AI infrastructure, and scalable AI-driven application development. The organization operates with a strong emphasis on engineering rigor, reproducibility, and real-world deployment constraints — particularly in environments where performance, cost efficiency, and reliability must be balanced against available hardware and budget.

Founder — Patrick Hill, M.S.

PBH Applied Systems was founded by Patrick Hill, a Data Scientist and AI/ML Engineer with 10+ years of experience delivering advanced analytics, predictive modeling, and decision-support solutions across high-stakes operational environments. Patrick holds a Master of Science in Software Engineering with concentrations in Artificial Intelligence and Machine Learning (GPA: 4.0) and a B.S. in Business Finance.

Technical expertise spans:

  • Languages & Data: Python, SQL, Linux, Pandas, NumPy, scikit-learn
  • ML & Modeling: Supervised and unsupervised learning, neural networks, NLP, transformers, regression, classification, forecasting, and feature engineering
  • AI/ML Frameworks: PyTorch, TensorFlow/Keras, HuggingFace Transformers, GGUF, llama.cpp, BitsAndBytes, PEFT, QLoRA
  • Deployment & MLOps: Flask APIs, Docker, CI/CD pipelines, REST endpoints, streaming inference, version control
  • Data Platforms: Jupyter, Databricks, Power BI, Matplotlib
  • Quantization: GGUF conversion, Q4_K_M / Q5_K_M / Q8_0 strategies, adapter-per-model evaluation architecture

Published Author

Patrick is the author of Applied Machine Learning: Concepts, Tools, and Case Studies — a 1,200+ page practitioner-oriented textbook covering statistical modeling, supervised and unsupervised learning, neural networks, NLP, and real-world decision-support case studies. The text has been adopted as required reading for CSC 373 – Machine Learning at the University of Advancing Technology, and reflects the same philosophy applied across all PBH systems: prioritize practical correctness over theoretical novelty, favor interpretable and reliable solutions, and introduce complexity only when justified by data and deployment constraints.

Core Service Areas

1. LLM Optimization & Deployment: End-to-end conversion of full-weight HuggingFace models to production-ready GGUF format, with quantization strategies matched to target hardware and latency requirements. Custom-built llama.cpp pipelines with adapter-per-model architecture ensuring strict separation of concerns and universal cross-model compatibility.

2. AI Evaluation Frameworks: Proprietary behavioral evaluation via quant_eval, with multi-run, timestamped pipelines producing structured artifacts, SHA256-verified manifests, per-family pass rates, F16 vs. quantized delta analysis, and deployment-ready recommendations. Evaluation batteries cover structured JSON output, multi-step reasoning, tool-calling fidelity, MCQ benchmarking, and fuzz/regression testing.

3. Agentic AI Infrastructure: Design and deployment of agent-oriented architectures using LlamaIndex ReAct agents, Flask orchestration layers, and serverless GPU inference. Full pipeline from model selection through quantization, evaluation, and production serving, including lead capture flows, budget controls, and API gateway integration.

4. Scalable AI Application Development: Production-grade multimodal AI applications integrating quantized LLMs, Whisper (speech-to-text), and BLIP (vision) via modular Flask APIs with Dockerized deployment and streaming-style responses. Advanced time-series forecasting systems featuring custom lightweight attention mechanisms, ensemble meta-learning, Bayesian hyperparameter optimization with resource-aware OOM backoff, and FinBERT sentiment fusion for hybrid structured/unstructured data pipelines.

5. ML Pipeline Design & Analytics: End-to-end data and model pipelines engineered for decision-support and operational forecasting. Encompasses feature engineering, leak-free forward-chaining cross-validation, KPI dashboard development, and analytical governance procedures designed for reproducibility at scale. Proven track record of translating complex model outputs into actionable insights for senior stakeholders across large-scale operational datasets.

6. Model & Agent Cataloging: Structured model catalog publishing with reproducible artifacts, standardized reporting, and clear performance tradeoff documentation, enabling engineering teams to make informed deployment decisions without re-running evaluations from scratch.

Engineering Principles

  • Reproducibility first — Every run produces structured artifacts, versioned manifests, and comparable outputs
  • Universality as a requirement — Systems work across models without custom rewrites per deployment
  • No silent behavior changes — Evaluation logic, prompts, and workflows are locked and versioned
  • GPU utilization is non-negotiable — All pipelines are designed to fully leverage available hardware
  • Separation of IP and operations — Core intellectual property is maintained independently of client deliverables

📞 Work With PBH Applied Systems

This F16 card documents what the model can do at full precision. The Q4_K_M companion card documents what degrades when you quantize — including a complete toolcall_only failure (1.000 → 0.000) that is invisible without running both evaluations. Without evaluating both formats against the same fixture set before deployment, you are making a deployment decision without the data to support it.

👉 Book a Scoping Call — Discuss your model selection, quantization strategy, or deployment architecture directly with Patrick.

👉 Request an Evaluation Report — A full quant_eval behavioral audit for your target model(s): per-family pass rates, F16 vs. quantized delta analysis, failure cluster diagnostics, and a deployment recommendation. Engagements from $2,500.

Connect

🌐 Website: pbhappliedsystems.com
📧 Email: patrick@pbhappliedsystems.com
💼 LinkedIn: PBH Applied Systems, LLC
▶️ YouTube: @pbhappliedsystems
📸 Instagram: @pbhappliedsystems
👍 Facebook: pbhappliedsystems

License

This GGUF repository inherits the license of the base model: Apache 2.0 (mistralai/Ministral-3-14B-Instruct-2512)

The quant_eval evaluation methodology, fixture set, and scoring framework are proprietary to PBH Applied Systems, LLC and are not included in this repository.


GGUF conversion and behavioral evaluation performed by PBH Applied Systems, LLC · quant_eval v7.21 · F16 Run ID: 20260206_213615

Format: GGUF · Model size: 14B params · Architecture: mistral3 · Precision: 16-bit