Instructions to use pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16 with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16",
	filename="ministral-3-14b-instruct-2512-gguf-F16.gguf",
)

llm.create_chat_completion(
	messages=[
		{"role": "user", "content": "Your prompt here"}
	]
)


llama.cpp

How to use pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16 with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16:F16
# Run inference directly in the terminal:
llama-cli -hf pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16:F16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16:F16
# Run inference directly in the terminal:
llama-cli -hf pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16:F16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16:F16
# Run inference directly in the terminal:
./llama-cli -hf pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16:F16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16:F16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16:F16

Use Docker

docker model run hf.co/pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16:F16

Ollama
How to use pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16 with Ollama:
```
ollama run hf.co/pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16:F16
```

Unsloth Studio

How to use pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16 with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16 to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16 to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16 to start chatting

Pi

How to use pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16 with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16:F16

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16:F16"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16 with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16:F16

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16:F16

Run Hermes

hermes

Docker Model Runner
How to use pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16 with Docker Model Runner:
```
docker model run hf.co/pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16:F16
```

Lemonade

How to use pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16 with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16:F16

Run and chat with the model

lemonade run user.ministral-3-14b-instruct-2512-gguf-F16-F16

List all available models

lemonade list

Ministral-3-14B-Instruct-2512 · GGUF F16

Converted and evaluated by PBH Applied Systems, LLC — Applied AI/ML Consulting · LLM Optimization & Deployment · Quantized AI Infrastructure

🔬 This repository is part of a production-oriented evaluation series. Every model published under pbhappliedsystems has been independently evaluated using quant_eval v7.21 — a proprietary behavioral evaluation harness developed by PBH Applied Systems. Scores measure real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning families — not perplexity or benchmark leaderboard proxies.

📌 This is the full-precision F16 baseline repository. The evaluated and deployment-ready Q4_K_M variant is published at pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-Q4-K-M. That card documents the full F16 vs. Q4_K_M comparison, including a complete quantization degradation finding on toolcall_only (1.000 → 0.000).

Try the Live AI Agent Demo

Launch the PBH Applied Systems AI Agent Demo →

This model is part of the PBH Applied Systems evaluated model series that supports the live AI Agent Demo. The demo lets visitors interact with production-style agent workflows powered by open-weight language models evaluated through PBH Applied Systems' quant_eval framework.

The F16 model serves a different role than the Q4_K_M deployment variant. F16 is the full-precision baseline used to measure what the model can do before quantization. quant_eval then compares the quantized model against this baseline to identify which capabilities are preserved, which degrade, and which tasks require guardrails or a higher-precision deployment.

This comparison is central to the demo. It helps determine which model belongs in which agent role:

Reasoning models are selected for planning, analysis, and auditable decision workflows.
Document models are selected for long-context extraction, summarization, and structured Q&A.
Code models are selected for task completion, structured output, API scaffolding, and automation workflows.
Quantized variants are selected when they preserve enough behavior to reduce cost, latency, and GPU requirements.
F16 variants remain important when maximum fidelity, cleaner tool execution, or reduced quantization risk matters more than speed or cost.

The live demo shows the deployment side of that process. The F16 card documents the reference behavior. The Q4_K_M card shows what changes after compression. Together, they explain how PBH Applied Systems uses quant_eval to choose the correct LLM for the correct agent type instead of guessing from model size or leaderboard reputation.

Model Description

This repository contains the full-precision F16 GGUF of mistralai/Ministral-3-14B-Instruct-2512, a 14-billion-parameter instruction-tuned model from Mistral AI (December 2025 release).

The F16 format preserves the original float16 weights without quantization. It serves two purposes in the PBH Applied Systems evaluation pipeline: as the reference baseline against which Q4_K_M capability retention is measured, and as a high-fidelity inference option for deployments where VRAM is not a constraint and maximum output quality is required.

For most production deployments, the Q4_K_M variant is the appropriate choice. The F16 is the ground truth from which quantization tradeoffs are measured.

Key Characteristics

Parameters: 14B
Format: GGUF F16 (full precision)
File size: 27.0 GB
SHA256: 74ea113134173d29f8daba097457500e831eace3741de002846b3ab89781fd52
Minimum VRAM (GPU inference): ~30 GB
Recommended GPU tier: A100 40 GB · RTX 4090 (24 GB, with offload) · 2× A10G
Context window: 32,768 tokens (per base model specification)
Inference speed (eval hardware): avg 6.37 sec/case on RTX 4090

Inference note: F16 averages 6.37 sec/case vs. 3.77 sec/case for Q4_K_M, which is about 69% more time per case (6.37 / 3.77 ≈ 1.69×). The json_multistep family in particular averaged 17.79 sec/case at full precision because its multi-step outputs are longer.
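The slowdown figure follows from the two per-case averages quoted in this card. A quick check, using only those two numbers:

```python
# Per-case average latencies quoted in this card (seconds per case).
f16_secs = 6.37
q4_secs = 3.77

ratio = f16_secs / q4_secs                        # how many times longer F16 takes per case
extra_time_pct = (ratio - 1.0) * 100              # additional time relative to Q4_K_M
throughput_drop_pct = (1.0 - q4_secs / f16_secs) * 100  # fewer cases per second vs. Q4_K_M

print(f"F16 / Q4_K_M latency ratio: {ratio:.2f}x")
print(f"Extra time per case: {extra_time_pct:.0f}%")
print(f"Throughput reduction: {throughput_drop_pct:.0f}%")
```

The same gap reads as roughly 69% more time per case or roughly 41% lower throughput, depending on which side is taken as the reference.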

PBH Applied Systems Evaluation — quant_eval v7.21

Evaluation conducted by PBH Applied Systems, LLC using quant_eval v7.21
Run ID: 20260206_213615 · Fixtures: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c...) · Seed: 42
Hardware: NVIDIA RTX 4090 · Runner: full_weight_transformers · Total F16 rows: 42

Note on aggregate scores: The normalized aggregate dimensions (task completion, reasoning, coherence, instruction following) are reported for the Q4_K_M variant only, as they are computed from the combined comparison run. F16 evaluation is reported at the per-family pass rate level, which is the authoritative signal for deployment decisions.

Per-Family Pass Rates — F16 (`full_weight_transformers`)

Family	N	Pass Rate	Avg Secs	Notes
json_multistep	5	0.600	17.79	3 pass, 2 fail — see case breakdown below
stateful_followup	2	1.000	3.08	Both turns parse and match expected state
toolcall_only	2	1.000*	3.81	Gating passed; schema wrapper non-compliance — see note
mixed_brief_json	2	1.000	2.42	Answer line + JSON schema correct
toolcall	2	1.000	3.54	Tool parse + schema valid
json	4	n/a	8.01	bucket_score avg = 10.000
fuzz	20	n/a	5.97	bucket_score avg = 10.000
mcq	5	n/a	0.32	bucket_score avg = 1.000
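As a consistency check, the per-family case counts and average times in the table can be combined into a case-weighted mean. The sketch below uses only the table values and should land on the 42 total rows and the 6.37 sec/case average reported elsewhere in this card.

```python
# (family, number of cases, average seconds per case), copied from the table.
families = [
    ("json_multistep", 5, 17.79),
    ("stateful_followup", 2, 3.08),
    ("toolcall_only", 2, 3.81),
    ("mixed_brief_json", 2, 2.42),
    ("toolcall", 2, 3.54),
    ("json", 4, 8.01),
    ("fuzz", 20, 5.97),
    ("mcq", 5, 0.32),
]

total_cases = sum(n for _, n, _ in families)
total_secs = sum(n * secs for _, n, secs in families)
weighted_avg = total_secs / total_cases  # case-weighted mean, not a mean of family means

print(f"Total cases: {total_cases}")
print(f"Weighted average: {weighted_avg:.2f} sec/case")
```

Weighting by case count matters here: fuzz (20 cases) and json_multistep (5 long cases) together account for most of the total run time.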

json_multistep — Case-Level Breakdown

Case	Difficulty	Result	Failure Signal
ms_easy_01	Easy	✅ PASS	—
ms_easy_02	Easy	❌ FAIL	oracle_equiv_ok=0
ms_med_01	Medium	✅ PASS	—
ms_med_02	Medium	✅ PASS	—
ms_hard_01	Hard	❌ FAIL	checks_consistent_ok=0 + oracle_equiv_ok=0

The model passes both medium cases and one of the two easy cases. The ms_easy_02 failure is a plan divergence (the model chose [A, B] where the oracle expected [A, A]). The ms_hard_01 failure combines an inconsistent intermediate check with an incorrect final placement: on the hardest planning case, the output is not reliable without an external validator.

⚠️ toolcall_only — Schema Wrapper Non-Compliance (F16)

Pass rate: 1.000 (gating) — but schema_ok=0 on both cases.

The F16 model correctly identifies the tool and extracts valid arguments (tool_name_ok=1, args_ok=1), satisfying the gating condition for a pass. However, both cases emit a non-standard outer wrapper key ("tool") instead of the expected "tool_name", triggering schema_ok=0 and detail=schema_error.

This is a schema discipline issue, not a capability failure. The model knows what tool to call and what arguments to supply — it simply wraps them in a non-conformant key. This is qualitatively distinct from the Q4_K_M variant, which fails completely on tool_name_ok=0 and args_ok=0 simultaneously.

Signal	F16 Rate	Q4_K_M Rate	Interpretation
tool_name_ok	1.000	0.000	F16 identifies tool correctly
args_ok	1.000	0.000	F16 extracts valid args
schema_ok	0.000	0.000	Both fail outer wrapper schema
Gating pass rate	1.000	0.000	F16 passes; Q4_K_M fails entirely
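The difference between the gating signals and schema_ok can be illustrated with a small scorer. This is a sketch only: quant_eval is proprietary and its checks are not published, so the function `score_toolcall`, the expected tool name `add`, and the argument names `a` and `b` are all hypothetical.

```python
import json

# Hypothetical expected call for illustration; not taken from the fixture set.
EXPECTED = {"tool_name": "add", "args": {"a": 5, "b": 10}}

def score_toolcall(raw: str, expected: dict) -> dict:
    """Return illustrative tool_name_ok / args_ok / schema_ok flags (0 or 1)."""
    failed = {"tool_name_ok": 0, "args_ok": 0, "schema_ok": 0, "passed": 0}
    try:
        out = json.loads(raw)
    except json.JSONDecodeError:
        return failed
    if not isinstance(out, dict):
        return failed
    # Lenient read: accept either wrapper key when checking the tool itself.
    name = out.get("tool_name", out.get("tool"))
    tool_name_ok = int(name == expected["tool_name"])
    args_ok = int(out.get("args") == expected["args"])
    # Strict read: the outer object must use exactly the expected keys.
    schema_ok = int(set(out) == {"tool_name", "args"})
    return {
        "tool_name_ok": tool_name_ok,
        "args_ok": args_ok,
        "schema_ok": schema_ok,
        "passed": int(tool_name_ok and args_ok),  # gating ignores schema_ok
    }

# Output shaped like the F16 behavior described above: right tool, wrong wrapper key.
f16_like = score_toolcall('{"tool": "add", "args": {"a": 5, "b": 10}}', EXPECTED)
strict_ok = score_toolcall('{"tool_name": "add", "args": {"a": 5, "b": 10}}', EXPECTED)
print(f16_like)
print(strict_ok)
```

Under these assumptions, the first call passes gating with schema_ok=0, which mirrors the F16 column of the table.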

Practical implication for F16 deployment: A strict schema validation layer will catch the wrapper key mismatch. Add a normalization step that maps "tool" → "tool_name" in the response parser, or state the exact key names in the system prompt and validate the result. The underlying capability is intact; the schema discipline requires enforcement.

Signal-Level Diagnostics (F16)

json_multistep

Signal	Rate	Tier
schema_ok	1.000	Tier-1 (gating)
checks_consistent_ok	0.800	Tier-1 (gating)
stop_semantics_ok	1.000	Tier-1 (gating)
oracle_equiv_ok	0.600	Tier-1 (gating)
final_consistent_ok	0.000	Tier-2 (tracked, non-gating)
final_match_reported	0.000	Tier-2 (tracked, non-gating)

Note on Tier-2: final_consistent_ok and final_match_reported are not gating signals. Most deployed agentic systems compute and validate state externally. Tier-1 oracle equivalence (0.600) is the production-relevant signal.

stateful_followup

Signal	Rate	Tier
turn1_parse_ok	1.000	Tier-1
turn2_parse_ok	1.000	Tier-1
turn1_exact_match	1.000	Tier-1
turn2_exact_match	1.000	Tier-1

toolcall_only

Signal	Rate	Tier
tool_name_ok	1.000	Tier-1 (gating)
args_ok	1.000	Tier-1 (gating)
schema_ok	0.000	Non-gating (tracked)

mixed_brief_json

Signal	Rate	Tier
answer_line_ok	1.000	Tier-1
json_parse_ok	1.000	Tier-1
schema_ok	1.000	Tier-1

Recommended Use Cases — F16

✅ Deploy with Confidence (F16)

Stateful multi-turn agents: two-turn state retention passed on both cases (1.000, N=2). Both turns parse and match expected state exactly.
Structured JSON outputs (single-step): bucket_score avg of 10.000 on both json (N=4) and fuzz (N=20); consistently valid structured outputs.
Hybrid brief + JSON responses: mixed_brief_json passes at 1.000 (N=2).
Tool-calling with response scaffolding: toolcall passes at 1.000 (N=2). Tool calls embedded in a broader response parsed and validated in both cases.
Tool-only dispatch with schema normalization: toolcall_only passes gating at 1.000 (N=2). Add a wrapper key normalization step for strict schema compliance.
JSON multi-step with external validation loop: 0.600 pass rate (3 of 5 cases); workable with an external planner or repair loop.

⚠️ Use with Guardrails (F16)

Strict bare tool-call dispatch — Schema wrapper non-compliance (schema_ok=0) requires a normalization layer for systems enforcing exact JSON key format.
Hard multi-step planning without validation — ms_hard_01 fails at both check consistency and oracle equivalence.

❌ Not Recommended (F16)

Unassisted multi-step planning — Where planning correctness must hold without external verification or oracle validation, particularly at medium-to-hard difficulty.
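The external validation or repair loop mentioned for multi-step planning could look roughly like the sketch below. It assumes a generic `generate` callable (for example, a thin wrapper around a chat-completion call) and a caller-supplied `validate` function. The names `generate_with_repair`, `fake_generate`, and `check_plan` are hypothetical, and the stub stands in for the model so the example runs without one.

```python
import json
from typing import Callable, Optional

def generate_with_repair(
    generate: Callable[[str], str],
    validate: Callable[[dict], Optional[str]],
    prompt: str,
    max_attempts: int = 3,
) -> dict:
    """Call the model, validate the parsed plan externally, and retry with feedback."""
    feedback = ""
    for _ in range(max_attempts):
        raw = generate(prompt + feedback)
        try:
            plan = json.loads(raw)
        except json.JSONDecodeError as exc:
            feedback = f"\n\nYour previous output was not valid JSON ({exc.msg}). Return only JSON."
            continue
        error = validate(plan)
        if error is None:
            return plan
        feedback = f"\n\nYour previous plan was rejected: {error}. Return a corrected JSON plan."
    raise ValueError(f"No valid plan after {max_attempts} attempts")

# Stub standing in for a model call: a divergent plan first, a corrected plan second.
_outputs = iter(['{"plan": ["A", "B"]}', '{"plan": ["A", "A"]}'])

def fake_generate(prompt: str) -> str:
    return next(_outputs)

def check_plan(plan: dict) -> Optional[str]:
    # Hypothetical external rule standing in for an oracle or constraint checker.
    return None if plan.get("plan") == ["A", "A"] else "placement violates constraints"

result = generate_with_repair(fake_generate, check_plan, "Place two items.")
print(result)
```

The point of the design is that acceptance is decided by the external check, never by the model's own self-reported consistency.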

Hardware Requirements

Configuration	VRAM Required	Recommended GPU
F16 (this repo) · full GPU offload	~30 GB	A100 40 GB · 2× A10G
F16 · mixed CPU/GPU offload	16–24 GB VRAM + 16 GB RAM	RTX 3090/4090 with `n_gpu_layers` tuning
Q4_K_M (companion repo)	~10–12 GB	T4 16 GB · RTX 3080/4080 · A10G

For most production use cases, the Q4_K_M variant at ~10–12 GB VRAM and 3.77 sec/case is the appropriate deployment target. The F16 is recommended when maximum output fidelity is required and hardware constraints allow, or when using the model as the reference baseline in a quantization evaluation pipeline.

Usage

Installation

pip install llama-cpp-python huggingface_hub

For GPU acceleration (CUDA):

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

Python — llama-cpp-python

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download F16 GGUF directly from HuggingFace Hub
# Note: 27 GB download — ensure sufficient disk space and ~30 GB VRAM
model_path = hf_hub_download(
    repo_id="pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16",
    filename="ministral-3-14b-instruct-2512-gguf-F16.gguf"
)

llm = Llama(
    model_path=model_path,
    n_ctx=8192,          # context window; increase up to 32768 per model spec
    n_gpu_layers=-1,     # -1 offloads all layers to GPU; reduce if VRAM < 30 GB
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful, concise assistant. Respond in structured JSON when asked."
        },
        {
            "role": "user",
            "content": "Summarize the following contract clause and flag any obligations: ..."
        }
    ],
    temperature=0.15,
    max_tokens=1024,
)

print(response["choices"][0]["message"]["content"])

For partial GPU offload when VRAM is between 16–24 GB:

llm = Llama(
    model_path=model_path,
    n_ctx=4096,
    n_gpu_layers=20,   # Tune based on available VRAM; remainder runs on CPU
    verbose=True,      # Enable to monitor layer offload and memory usage
)
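A starting value for `n_gpu_layers` can be estimated from the file size and the available VRAM. The heuristic below is a rough sketch, not a llama.cpp feature: it assumes all layers are the same size, reserves a fixed amount of VRAM for the KV cache and buffers (the 4 GB default is an assumption), and uses a placeholder layer count that should be replaced with the block count reported in the GGUF metadata or in the verbose load log.

```python
def estimate_gpu_layers(file_size_gb: float, n_layers: int, vram_gb: float,
                        reserve_gb: float = 4.0) -> int:
    """Rough count of layers that fit in VRAM, assuming equal-sized layers."""
    per_layer_gb = file_size_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)  # keep headroom for KV cache and buffers
    return min(n_layers, int(usable_gb // per_layer_gb))

FILE_SIZE_GB = 27.0  # F16 GGUF size from this card
N_LAYERS = 40        # PLACEHOLDER: replace with the model's real block count

for vram in (16, 24, 40):
    layers = estimate_gpu_layers(FILE_SIZE_GB, N_LAYERS, vram)
    print(f"{vram} GB VRAM -> try n_gpu_layers={layers}")
```

Treat the result as a first guess, then adjust it while watching the memory figures printed with `verbose=True`.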

For tool-calling with schema normalization (addressing the wrapper non-compliance noted above):

import json, re
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16",
    filename="ministral-3-14b-instruct-2512-gguf-F16.gguf"
)

llm = Llama(model_path=model_path, n_ctx=4096, n_gpu_layers=-1, verbose=False)

def normalize_tool_wrapper(raw: str) -> dict:
    """
    Normalize F16 schema wrapper non-compliance.
    Maps non-standard 'tool' key -> 'tool_name' before validation.
    See quant_eval v7.21 toolcall_only finding: schema_ok=0 on both F16 cases.
    """
    # Extract JSON block from markdown fences if present
    match = re.search(r'```(?:json)?\s*([\s\S]*?)```', raw)
    payload = match.group(1).strip() if match else raw.strip()
    parsed = json.loads(payload)
    # Normalize wrapper key
    if "tool" in parsed and "tool_name" not in parsed:
        parsed["tool_name"] = parsed.pop("tool")
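    # Raise explicitly: assert statements are skipped when Python runs with -O
    if "tool_name" not in parsed or "args" not in parsed:
        raise ValueError(f"Unexpected tool-call shape: {sorted(parsed)}")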
    return parsed

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "Respond only with a valid JSON tool call."},
        {"role": "user", "content": "Add 5 and 10."}
    ],
    temperature=0.0,
    max_tokens=256,
)
raw = response["choices"][0]["message"]["content"]
result = normalize_tool_wrapper(raw)
print(result)

CLI — llama-cli

# One-shot prompt (ensure sufficient VRAM before running)
llama-cli \
  --model ministral-3-14b-instruct-2512-gguf-F16.gguf \
  --chat-template mistral \
  --system-prompt "You are a helpful assistant." \
  --prompt "Summarize the following and return a JSON object with keys: summary, risk_level, action_items." \
  --n-predict 512 \
  --ctx-size 8192 \
  --n-gpu-layers -1 \
  --temp 0.15

For server deployment (OpenAI-compatible endpoint):

llama-server \
  --model ministral-3-14b-instruct-2512-gguf-F16.gguf \
  --chat-template mistral \
  --ctx-size 8192 \
  --n-gpu-layers -1 \
  --port 8080 \
  --host 0.0.0.0

Query via the OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-required")

response = client.chat.completions.create(
    model="ministral-3-14b-instruct-2512-gguf-F16",
    messages=[{"role": "user", "content": "Your prompt here"}],
    temperature=0.15,
)
print(response.choices[0].message.content)

Artifact Provenance

Artifact	Format	Size	SHA256
`ministral-3-14b-instruct-2512-gguf-F16.gguf`	GGUF F16	27.0 GB	`74ea113134173d29f8daba097457500e831eace3741de002846b3ab89781fd52`
Q4_K_M (companion repo)	GGUF Q4_K_M	8.24 GB	`a23910514ee512aa28db8dddd390c26a73b9c318dcdec374ae02d722d9658749`

Both artifacts were produced from mistralai/Ministral-3-14B-Instruct-2512 using a custom-built llama.cpp conversion pipeline developed by PBH Applied Systems. Conversion was performed on the full HuggingFace snapshot without modification to model weights prior to conversion.
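After downloading, the file can be checked against the SHA256 listed in the table. A minimal sketch using only the standard library; the helper names `sha256_of_file` and `verify_artifact` are illustrative.

```python
import hashlib

# SHA256 of the F16 GGUF as published in this card.
EXPECTED_SHA256 = "74ea113134173d29f8daba097457500e831eace3741de002846b3ab89781fd52"

def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so a 27 GB artifact never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_artifact(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's SHA256 matches the expected hex digest."""
    return sha256_of_file(path) == expected.lower()
```

For example, `verify_artifact(model_path)` can be called on the path returned by `hf_hub_download` before loading the model.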

Evaluation Methodology

quant_eval v7.21 is a proprietary behavioral evaluation harness developed by PBH Applied Systems. The F16 evaluation run (20260206_213615) produces the full_weight_cache.json used as the reference baseline in the subsequent Q4_K_M comparison run (20260209_170235). This two-run architecture — F16 first, Q4_K_M second against the cached F16 results — enables exact apples-to-apples comparison of capability retention across quantization levels on an identical fixture set.
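The comparison against a cached baseline reduces to a per-signal delta. The sketch below uses the toolcall_only rates reported in this card. The dictionary layout is an assumption for illustration and does not reflect the actual format of full_weight_cache.json, which is not published.

```python
# toolcall_only rates reported in this card (F16 baseline vs. Q4_K_M).
f16_baseline = {
    "tool_name_ok": 1.0,
    "args_ok": 1.0,
    "schema_ok": 0.0,
    "gating_pass_rate": 1.0,
}
q4_k_m = {
    "tool_name_ok": 0.0,
    "args_ok": 0.0,
    "schema_ok": 0.0,
    "gating_pass_rate": 0.0,
}

def retention_deltas(baseline: dict, candidate: dict) -> dict:
    """Candidate minus baseline for every signal present in the baseline."""
    return {name: round(candidate[name] - baseline[name], 3) for name in baseline}

deltas = retention_deltas(f16_baseline, q4_k_m)
for name, delta in deltas.items():
    status = "unchanged" if delta == 0 else ("degraded" if delta < 0 else "improved")
    print(f"{name}: {delta:+.3f} ({status})")
```

Because both runs use the same fixture set and seed, a nonzero delta can be attributed to the quantization step and not to a change in the test cases.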

Fixture set: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c079371b02a94f3c149eb78a6adc03dc16ff6833b964fbf4174f0)

Task families evaluated:

Family	Description	Pass Signals
`fuzz`	Property-based regression; structured placement correctness	schema_ok, constraints_ok
`json`	Single-step structured JSON with constraint rules	schema_ok, constraints_ok
`json_multistep`	Multi-step planning with self-check and oracle verification	schema_ok, checks_consistent_ok, stop_semantics_ok, oracle_equiv_ok
`mcq`	Multiple-choice extraction	choice_ok
`stateful_followup`	Two-turn state tracking; turn-2 correct given turn-1	turn1/2_parse_ok, turn1/2_exact_match
`mixed_brief_json`	Hybrid: natural language answer + valid JSON block	answer_line_ok, json_parse_ok, schema_ok
`toolcall`	Tool call embedded in response; parse + schema validation	stage1_tool_parse_ok, stage1_tool_schema_ok
`toolcall_only`	Bare schema-only tool call; strict tool name + args check	tool_name_ok, args_ok

Scores are conservative conjunctions — a case passes only when all gating signals succeed.
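The conjunction rule can be reproduced from the json_multistep case breakdown in this card. The per-case values below are reconstructed from the reported failure signals and signal rates. The code is an illustration of the rule, not the quant_eval implementation.

```python
GATING = ("schema_ok", "checks_consistent_ok", "stop_semantics_ok", "oracle_equiv_ok")

def all_ok(**overrides: int) -> dict:
    """Start from all gating signals passing, then apply the reported failures."""
    signals = {name: 1 for name in GATING}
    signals.update(overrides)
    return signals

cases = {
    "ms_easy_01": all_ok(),
    "ms_easy_02": all_ok(oracle_equiv_ok=0),
    "ms_med_01": all_ok(),
    "ms_med_02": all_ok(),
    "ms_hard_01": all_ok(checks_consistent_ok=0, oracle_equiv_ok=0),
}

# A case passes only when every gating signal is 1.
passed = {case: all(signals[name] == 1 for name in GATING) for case, signals in cases.items()}
pass_rate = sum(passed.values()) / len(cases)

# Per-signal rates are averaged independently of the pass/fail conjunction.
signal_rates = {name: sum(s[name] for s in cases.values()) / len(cases) for name in GATING}

print(f"pass rate: {pass_rate:.3f}")
for name, rate in signal_rates.items():
    print(f"{name}: {rate:.3f}")
```

The pass rate can never exceed the lowest individual signal rate, which is why oracle_equiv_ok at 0.600 caps the family at 0.600.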

Evaluation hardware: NVIDIA RTX 4090 (24 GB VRAM)
F16 evaluation date: February 6, 2026
quant_eval seed: 42

🔬 About quant_eval & This Evaluation Series

quant_eval is a proprietary behavioral evaluation harness developed by PBH Applied Systems, LLC. It measures real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning — not perplexity or leaderboard proxies. Every model published under pbhappliedsystems has been independently evaluated using quant_eval before being recommended for any production role.

See it in action: Live AI Agent Demo → The demo runs production-style agent workflows powered by open-weight models selected through the quant_eval evaluation pipeline.

Need a deployment recommendation? Not sure which quantization level is right for your hardware, latency target, or agent type? → pbhappliedsystems.com

Evaluated and published by PBH Applied Systems, LLC · patrick@pbhappliedsystems.com

About PBH Applied Systems

PBH Applied Systems, LLC is an Oklahoma City–based applied machine learning and AI systems company specializing in production-grade model evaluation, quantization pipelines, agentic AI infrastructure, and scalable AI-driven application development. The organization operates with a strong emphasis on engineering rigor, reproducibility, and real-world deployment constraints — particularly in environments where performance, cost efficiency, and reliability must be balanced against available hardware and budget.

Founder — Patrick Hill, M.S.

PBH Applied Systems was founded by Patrick Hill, a Data Scientist and AI/ML Engineer with 10+ years of experience delivering advanced analytics, predictive modeling, and decision-support solutions across high-stakes operational environments. Patrick holds a Master of Science in Software Engineering with concentrations in Artificial Intelligence and Machine Learning (GPA: 4.0) and a B.S. in Business Finance.

Technical expertise spans:

Languages & Data: Python, SQL, Linux, Pandas, NumPy, scikit-learn
ML & Modeling: Supervised and unsupervised learning, neural networks, NLP, transformers, regression, classification, forecasting, and feature engineering
AI/ML Frameworks: PyTorch, TensorFlow/Keras, HuggingFace Transformers, GGUF, llama.cpp, BitsAndBytes, PEFT, QLoRA
Deployment & MLOps: Flask APIs, Docker, CI/CD pipelines, REST endpoints, streaming inference, version control
Data Platforms: Jupyter, Databricks, Power BI, Matplotlib
Quantization: GGUF conversion, Q4_K_M / Q5_K_M / Q8_0 strategies, adapter-per-model evaluation architecture

Published Author

Patrick is the author of Applied Machine Learning: Concepts, Tools, and Case Studies — a 1,200+ page practitioner-oriented textbook covering statistical modeling, supervised and unsupervised learning, neural networks, NLP, and real-world decision-support case studies. The text has been adopted as required reading for CSC 373 – Machine Learning at the University of Advancing Technology, and reflects the same philosophy applied across all PBH systems: prioritize practical correctness over theoretical novelty, favor interpretable and reliable solutions, and introduce complexity only when justified by data and deployment constraints.

Core Service Areas

1. LLM Optimization & Deployment: End-to-end conversion of full-weight HuggingFace models to production-ready GGUF format, with quantization strategies matched to target hardware and latency requirements. Custom-built llama.cpp pipelines with adapter-per-model architecture ensuring strict separation of concerns and universal cross-model compatibility.

2. AI Evaluation Frameworks: Proprietary behavioral evaluation via quant_eval, with multi-run, timestamped pipelines producing structured artifacts, SHA256-verified manifests, per-family pass rates, F16 vs. quantized delta analysis, and deployment-ready recommendations. Evaluation batteries cover structured JSON output, multi-step reasoning, tool-calling fidelity, MCQ benchmarking, and fuzz/regression testing.

3. Agentic AI Infrastructure: Design and deployment of agent-oriented architectures using LlamaIndex ReAct agents, Flask orchestration layers, and serverless GPU inference. Full pipeline from model selection through quantization, evaluation, and production serving, including lead capture flows, budget controls, and API gateway integration.

4. Scalable AI Application Development: Production-grade multimodal AI applications integrating quantized LLMs, Whisper (speech-to-text), and BLIP (vision) via modular Flask APIs with Dockerized deployment and streaming-style responses. Advanced time-series forecasting systems featuring custom lightweight attention mechanisms, ensemble meta-learning, Bayesian hyperparameter optimization with resource-aware OOM backoff, and FinBERT sentiment fusion for hybrid structured/unstructured data pipelines.

5. ML Pipeline Design & Analytics: End-to-end data and model pipelines engineered for decision-support and operational forecasting. Encompasses feature engineering, leak-free forward-chaining cross-validation, KPI dashboard development, and analytical governance procedures designed for reproducibility at scale. Proven track record of translating complex model outputs into actionable insights for senior stakeholders across large-scale operational datasets.

6. Model & Agent Cataloging: Structured model catalog publishing with reproducible artifacts, standardized reporting, and clear performance tradeoff documentation, enabling engineering teams to make informed deployment decisions without re-running evaluations from scratch.

Engineering Principles

Reproducibility first — Every run produces structured artifacts, versioned manifests, and comparable outputs
Universality as a requirement — Systems work across models without custom rewrites per deployment
No silent behavior changes — Evaluation logic, prompts, and workflows are locked and versioned
GPU utilization is non-negotiable — All pipelines are designed to fully leverage available hardware
Separation of IP and operations — Core intellectual property is maintained independently of client deliverables

📞 Work With PBH Applied Systems

This F16 card documents what the model can do at full precision. The Q4_K_M companion card documents what degrades when you quantize — including a complete toolcall_only failure (1.000 → 0.000) that is invisible without running both evaluations. Without evaluating both formats against the same fixture set before deployment, you are making a deployment decision without the data to support it.

👉 Book a Scoping Call — Discuss your model selection, quantization strategy, or deployment architecture directly with Patrick.

👉 Request an Evaluation Report — A full quant_eval behavioral audit for your target model(s): per-family pass rates, F16 vs. quantized delta analysis, failure cluster diagnostics, and a deployment recommendation. Engagements from $2,500.

Connect


🌐 Website	pbhappliedsystems.com
📧 Email	patrick@pbhappliedsystems.com
💼 LinkedIn	PBH Applied Systems, LLC
▶️ YouTube	@pbhappliedsystems
📸 Instagram	@pbhappliedsystems
👍 Facebook	pbhappliedsystems

License

This GGUF repository inherits the license of the base model: Apache 2.0 — mistralai/Ministral-3-14B-Instruct-2512

The quant_eval evaluation methodology, fixture set, and scoring framework are proprietary to PBH Applied Systems, LLC and are not included in this repository.

GGUF conversion and behavioral evaluation performed by PBH Applied Systems, LLC · quant_eval v7.21 · F16 Run ID: 20260206_213615


GGUF

Model size: 14B params
Architecture: mistral3
Hardware compatibility: 16-bit


Model tree for pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16

Base model

mistralai/Ministral-3-14B-Base-2512

Quantized

mistralai/Ministral-3-14B-Instruct-2512