- Ministral-3-14B-Instruct-2512 · GGUF F16
- Try the Live AI Agent Demo
- Model Description
- PBH Applied Systems Evaluation — quant_eval v7.21
- Recommended Use Cases — F16
- Hardware Requirements
- Usage
- Artifact Provenance
- Evaluation Methodology
- 🔬 About quant_eval & This Evaluation Series
- About PBH Applied Systems
- 📞 Work With PBH Applied Systems
- License
Ministral-3-14B-Instruct-2512 · GGUF F16
Converted and evaluated by PBH Applied Systems, LLC — Applied AI/ML Consulting · LLM Optimization & Deployment · Quantized AI Infrastructure
🔬 This repository is part of a production-oriented evaluation series. Every model published under pbhappliedsystems has been independently evaluated using quant_eval v7.21, a proprietary behavioral evaluation harness developed by PBH Applied Systems. Scores measure real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning families, not perplexity or benchmark leaderboard proxies.
📌 This is the full-precision F16 baseline repository. The evaluated and deployment-ready Q4_K_M variant is published at pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-Q4-K-M. That card documents the full F16 vs. Q4_K_M comparison, including a complete quantization degradation finding on toolcall_only (1.000 → 0.000).
Try the Live AI Agent Demo
Launch the PBH Applied Systems AI Agent Demo →
This model is part of the PBH Applied Systems evaluated model series that supports the live AI Agent Demo. The demo lets visitors interact with production-style agent workflows powered by open-weight language models evaluated through PBH Applied Systems' quant_eval framework.
The F16 model serves a different role than the Q4_K_M deployment variant. F16 is the full-precision baseline used to measure what the model can do before quantization. quant_eval then compares the quantized model against this baseline to identify which capabilities are preserved, which degrade, and which tasks require guardrails or a higher-precision deployment.
This comparison is central to the demo. It helps determine which model belongs in which agent role:
- Reasoning models are selected for planning, analysis, and auditable decision workflows.
- Document models are selected for long-context extraction, summarization, and structured Q&A.
- Code models are selected for task completion, structured output, API scaffolding, and automation workflows.
- Quantized variants are selected when they preserve enough behavior to reduce cost, latency, and GPU requirements.
- F16 variants remain important when maximum fidelity, cleaner tool execution, or reduced quantization risk matters more than speed or cost.
The live demo shows the deployment side of that process. The F16 card documents the reference behavior. The Q4_K_M card shows what changes after compression. Together, they explain how PBH Applied Systems uses quant_eval to choose the correct LLM for the correct agent type instead of guessing from model size or leaderboard reputation.
Model Description
This repository contains the full-precision F16 GGUF of mistralai/Ministral-3-14B-Instruct-2512, a 14-billion-parameter instruction-tuned model from Mistral AI (December 2025 release).
The F16 format preserves the original float16 weights without quantization. It serves two purposes in the PBH Applied Systems evaluation pipeline: as the reference baseline against which Q4_K_M capability retention is measured, and as a high-fidelity inference option for deployments where VRAM is not a constraint and maximum output quality is required.
For most production deployments, the Q4_K_M variant is the appropriate choice. The F16 is the ground truth from which quantization tradeoffs are measured.
Key Characteristics
- Parameters: 14B
- Format: GGUF F16 (full precision)
- File size: 27.0 GB
- SHA256: 74ea113134173d29f8daba097457500e831eace3741de002846b3ab89781fd52
- Minimum VRAM (GPU inference): ~30 GB
- Recommended GPU tier: A100 40 GB · RTX 4090 (24 GB, with offload) · 2× A10G
- Context window: 32,768 tokens (per base model specification)
- Inference speed (eval hardware): avg 6.37 sec/case on RTX 4090
Inference note: F16 averages 6.37 sec/case vs. 3.77 sec/case for Q4_K_M, so each case takes approximately 69% longer at full precision. The json_multistep family in particular averaged 17.79 sec/case at full precision due to multi-step generation length.
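The slowdown figure can be reproduced from the two averages quoted above. The following is a small arithmetic sketch, not part of quant_eval; it also expresses the same gap as a throughput reduction, which is a different number and is easy to confuse with the per-case figure.

```python
# Average seconds per case, taken from the figures quoted on this card.
f16_secs = 6.37
q4_secs = 3.77

# Extra wall-clock time per case for F16 relative to Q4_K_M.
extra_time = (f16_secs - q4_secs) / q4_secs

# The same gap expressed as lost throughput (cases per second).
throughput_drop = 1 - (1 / f16_secs) / (1 / q4_secs)

print(f"F16 takes {extra_time:.0%} more time per case")
print(f"F16 throughput is {throughput_drop:.0%} lower")
```

When sizing a deployment by latency budget, the per-case figure is the relevant one; when sizing by requests per GPU-hour, the throughput figure is.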
PBH Applied Systems Evaluation — quant_eval v7.21
Evaluation conducted by PBH Applied Systems, LLC using quant_eval v7.21
Run ID: 20260206_213615 · Fixtures: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c...) · Seed: 42
Hardware: NVIDIA RTX 4090 · Runner: full_weight_transformers · Total F16 rows: 42
Note on aggregate scores: The normalized aggregate dimensions (task completion, reasoning, coherence, instruction following) are reported for the Q4_K_M variant only, as they are computed from the combined comparison run. F16 evaluation is reported at the per-family pass rate level, which is the authoritative signal for deployment decisions.
Per-Family Pass Rates — F16 (full_weight_transformers)
| Family | N | Pass Rate | Avg Secs | Notes |
|---|---|---|---|---|
| json_multistep | 5 | 0.600 | 17.79 | 3 pass, 2 fail — see case breakdown below |
| stateful_followup | 2 | 1.000 | 3.08 | Both turns parse and match expected state |
| toolcall_only | 2 | 1.000* | 3.81 | Gating passed; schema wrapper non-compliance — see note |
| mixed_brief_json | 2 | 1.000 | 2.42 | Answer line + JSON schema correct |
| toolcall | 2 | 1.000 | 3.54 | Tool parse + schema valid |
| json | 4 | n/a | 8.01 | bucket_score avg = 10.000 |
| fuzz | 20 | n/a | 5.97 | bucket_score avg = 10.000 |
| mcq | 5 | n/a | 0.32 | bucket_score avg = 1.000 |
json_multistep — Case-Level Breakdown
| Case | Difficulty | Result | Failure Signal |
|---|---|---|---|
| ms_easy_01 | Easy | ✅ PASS | — |
| ms_easy_02 | Easy | ❌ FAIL | oracle_equiv_ok=0 |
| ms_med_01 | Medium | ✅ PASS | — |
| ms_med_02 | Medium | ✅ PASS | — |
| ms_hard_01 | Hard | ❌ FAIL | checks_consistent_ok=0 + oracle_equiv_ok=0 |
The model passes both medium cases and one of the two easy cases. The ms_easy_02 failure is a plan divergence (the model chose [A, B] where the oracle expected [A, A]). The ms_hard_01 failure combines an inconsistent intermediate check with an incorrect final placement. With 2 of 5 cases failing, multi-step planning output should not be relied on without an external validator, particularly at longer planning horizons.
⚠️ toolcall_only — Schema Wrapper Non-Compliance (F16)
Pass rate: 1.000 (gating) — but schema_ok=0 on both cases.
The F16 model correctly identifies the tool and extracts valid arguments (tool_name_ok=1, args_ok=1), satisfying the gating condition for a pass. However, both cases emit a non-standard outer wrapper key ("tool") instead of the expected "tool_name", triggering schema_ok=0 and detail=schema_error.
This is a schema discipline issue, not a capability failure. The model knows what tool to call and what arguments to supply — it simply wraps them in a non-conformant key. This is qualitatively distinct from the Q4_K_M variant, which fails completely on tool_name_ok=0 and args_ok=0 simultaneously.
| Signal | F16 Rate | Q4_K_M Rate | Interpretation |
|---|---|---|---|
| tool_name_ok | 1.000 | 0.000 | F16 identifies tool correctly |
| args_ok | 1.000 | 0.000 | F16 extracts valid args |
| schema_ok | 0.000 | 0.000 | Both fail outer wrapper schema |
| Gating pass rate | 1.000 | 0.000 | F16 passes; Q4_K_M fails entirely |
Practical implication for F16 deployment: A strict schema validation layer will catch the wrapper key mismatch. Add a normalization step that maps "tool" → "tool_name" in the response parser, or tighten the system prompt so that it states the exact key format. The underlying capability is intact; the schema discipline requires enforcement.
Signal-Level Diagnostics (F16)
json_multistep
| Signal | Rate | Tier |
|---|---|---|
| schema_ok | 1.000 | Tier-1 (gating) |
| checks_consistent_ok | 0.800 | Tier-1 (gating) |
| stop_semantics_ok | 1.000 | Tier-1 (gating) |
| oracle_equiv_ok | 0.600 | Tier-1 (gating) |
| final_consistent_ok | 0.000 | Tier-2 (tracked, non-gating) |
| final_match_reported | 0.000 | Tier-2 (tracked, non-gating) |
Note on Tier-2: final_consistent_ok and final_match_reported are not gating signals. Most deployed agentic systems compute and validate state externally. Tier-1 oracle equivalence (0.600) is the production-relevant signal.
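The Tier-1 rates in the json_multistep table are consistent with the case-level breakdown earlier on this card. The sketch below rebuilds them from that breakdown. It assumes that the "Failure Signal" column lists every failing gating signal for a case and that cases marked PASS had all four gating signals succeed; the data structure is illustrative and is not the quant_eval output format.

```python
GATING = ("schema_ok", "checks_consistent_ok", "stop_semantics_ok", "oracle_equiv_ok")

# Failed gating signals per case, transcribed from the case-level breakdown.
failed_signals = {
    "ms_easy_01": set(),
    "ms_easy_02": {"oracle_equiv_ok"},
    "ms_med_01": set(),
    "ms_med_02": set(),
    "ms_hard_01": {"checks_consistent_ok", "oracle_equiv_ok"},
}

n_cases = len(failed_signals)

# Rate for a signal = share of cases in which that signal did not fail.
signal_rates = {
    sig: sum(sig not in failed for failed in failed_signals.values()) / n_cases
    for sig in GATING
}

# A case passes only if no gating signal failed.
pass_rate = sum(not failed for failed in failed_signals.values()) / n_cases

for sig, rate in signal_rates.items():
    print(f"{sig}: {rate:.3f}")
print(f"family pass rate: {pass_rate:.3f}")
```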
stateful_followup
| Signal | Rate | Tier |
|---|---|---|
| turn1_parse_ok | 1.000 | Tier-1 |
| turn2_parse_ok | 1.000 | Tier-1 |
| turn1_exact_match | 1.000 | Tier-1 |
| turn2_exact_match | 1.000 | Tier-1 |
toolcall_only
| Signal | Rate | Tier |
|---|---|---|
| tool_name_ok | 1.000 | Tier-1 (gating) |
| args_ok | 1.000 | Tier-1 (gating) |
| schema_ok | 0.000 | Non-gating (tracked) |
mixed_brief_json
| Signal | Rate | Tier |
|---|---|---|
| answer_line_ok | 1.000 | Tier-1 |
| json_parse_ok | 1.000 | Tier-1 |
| schema_ok | 1.000 | Tier-1 |
Recommended Use Cases — F16
✅ Deploy with Confidence (F16)
- Stateful multi-turn agents: Perfect two-turn state retention (1.000). Both turns parse and match expected state exactly.
- Structured JSON outputs (single-step): bucket_score avg of 10.000 on both json and fuzz; consistently valid structured outputs.
- Hybrid brief + JSON responses: mixed_brief_json passes at 1.000.
- Tool-calling with response scaffolding: toolcall passes at 1.000. A tool call embedded in a broader response passed in both evaluated cases.
- Tool-only dispatch with schema normalization: toolcall_only passes gating at 1.000. Add a wrapper key normalization step for strict schema compliance.
- JSON multi-step with external validation loop: 0.600 pass rate; workable with an external planner or repair loop.
⚠️ Use with Guardrails (F16)
- Strict bare tool-call dispatch: Schema wrapper non-compliance (schema_ok=0) requires a normalization layer for systems enforcing exact JSON key format.
- Hard multi-step planning without validation: ms_hard_01 fails at both check consistency and oracle equivalence.
❌ Not Recommended (F16)
- Unassisted multi-step planning — Where planning correctness must hold without external verification or oracle validation, particularly at medium-to-hard difficulty.
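The "external validation loop" mentioned in the recommendations above can take many forms. The following is a minimal sketch of one such loop, assuming the caller supplies a `generate` function (for example a thin wrapper around `llm.create_chat_completion`) and a `validate` function that returns an error message or `None`. All function names here are hypothetical and are not part of quant_eval or llama-cpp-python.

```python
import json
from typing import Callable, Optional


def generate_with_repair(
    generate: Callable[[str], str],
    validate: Callable[[dict], Optional[str]],
    prompt: str,
    max_attempts: int = 3,
) -> dict:
    """Generate JSON, validate it externally, and retry with feedback on failure."""
    feedback = ""
    error = "no attempts made"
    for _ in range(max_attempts):
        raw = generate(prompt + feedback)
        try:
            plan = json.loads(raw)
        except json.JSONDecodeError as exc:
            error = f"invalid JSON: {exc}"
        else:
            error = validate(plan)
            if error is None:
                return plan
        # Feed the validator's complaint back to the model on the next attempt.
        feedback = f"\n\nYour previous answer was rejected: {error}. Return corrected JSON only."
    raise ValueError(f"no valid plan after {max_attempts} attempts: {error}")


# Stand-in generator for demonstration: fails validation once, then succeeds.
_responses = iter(['{"steps": "A, B"}', '{"steps": ["A", "A"]}'])


def demo_generate(prompt: str) -> str:
    return next(_responses)


def demo_validate(plan: dict) -> Optional[str]:
    if not isinstance(plan.get("steps"), list):
        return "'steps' must be a list"
    return None


result = generate_with_repair(demo_generate, demo_validate, "Plan the placement.")
print(result)
```

In a real deployment the validator would encode the task's own constraints; the loop only guarantees that nothing unvalidated is returned.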
Hardware Requirements
| Configuration | VRAM Required | Recommended GPU |
|---|---|---|
| F16 (this repo) · full GPU offload | ~30 GB | A100 40 GB · 2× A10G · RTX 4090 (partial offload) |
| F16 · mixed CPU/GPU offload | 16–24 GB VRAM + 16 GB RAM | RTX 3090/4090 with n_gpu_layers tuning |
| Q4_K_M (companion repo) | ~10–12 GB | T4 16 GB · RTX 3080/4080 · A10G |
For most production use cases, the Q4_K_M variant at ~10–12 GB VRAM and 3.77 sec/case is the appropriate deployment target. The F16 is recommended when maximum output fidelity is required and hardware constraints allow, or when using the model as the reference baseline in a quantization evaluation pipeline.
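For mixed CPU/GPU offload, a starting value for n_gpu_layers can be estimated from the file size and available VRAM. The sketch below is a rough heuristic only: it assumes all layers are the same size, ignores embeddings and context-dependent KV cache growth beyond a fixed reserve, and uses a placeholder layer count of 40 that is not taken from the model specification. Read the real layer count from the llama.cpp load log (verbose=True) before relying on the result.

```python
def estimate_gpu_layers(file_size_gb: float, n_layers: int,
                        vram_gb: float, reserve_gb: float = 4.0) -> int:
    """Rough count of layers that fit in VRAM, assuming equal-sized layers."""
    if n_layers <= 0 or file_size_gb <= 0:
        raise ValueError("file_size_gb and n_layers must be positive")
    per_layer_gb = file_size_gb / n_layers
    # Keep some VRAM back for the KV cache, activations, and driver overhead.
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# 27.0 GB is the file size from this card; n_layers=40 is a placeholder.
print(estimate_gpu_layers(27.0, n_layers=40, vram_gb=24.0))
```

Treat the output as a first guess and lower it if the load log reports out-of-memory errors.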
Usage
Installation
pip install llama-cpp-python huggingface_hub
For GPU acceleration (CUDA):
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
Python — llama-cpp-python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama
# Download F16 GGUF directly from HuggingFace Hub
# Note: 27 GB download — ensure sufficient disk space and ~30 GB VRAM
model_path = hf_hub_download(
    repo_id="pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16",
    filename="ministral-3-14b-instruct-2512-gguf-F16.gguf"
)
llm = Llama(
    model_path=model_path,
    n_ctx=8192,        # context window; increase up to 32768 per model spec
    n_gpu_layers=-1,   # -1 offloads all layers to GPU; reduce if VRAM < 30 GB
    verbose=False,
)
response = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful, concise assistant. Respond in structured JSON when asked."
        },
        {
            "role": "user",
            "content": "Summarize the following contract clause and flag any obligations: ..."
        }
    ],
    temperature=0.15,
    max_tokens=1024,
)
print(response["choices"][0]["message"]["content"])
For partial GPU offload when VRAM is between 16–24 GB:
llm = Llama(
    model_path=model_path,
    n_ctx=4096,
    n_gpu_layers=20,   # Tune based on available VRAM; remainder runs on CPU
    verbose=True,      # Enable to monitor layer offload and memory usage
)
For tool-calling with schema normalization (addressing the wrapper non-compliance noted above):
import json, re
from huggingface_hub import hf_hub_download
from llama_cpp import Llama
model_path = hf_hub_download(
repo_id="pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16",
filename="ministral-3-14b-instruct-2512-gguf-F16.gguf"
)
llm = Llama(model_path=model_path, n_ctx=4096, n_gpu_layers=-1, verbose=False)
def normalize_tool_wrapper(raw: str) -> dict:
    """
    Normalize F16 schema wrapper non-compliance.
    Maps non-standard 'tool' key -> 'tool_name' before validation.
    See quant_eval v7.21 toolcall_only finding: schema_ok=0 on both F16 cases.
    """
    # Extract JSON block from markdown fences if present
    match = re.search(r'```(?:json)?\s*([\s\S]*?)```', raw)
    payload = match.group(1).strip() if match else raw.strip()
    parsed = json.loads(payload)
    if not isinstance(parsed, dict):
        raise ValueError("tool call must be a JSON object")
    # Normalize wrapper key
    if "tool" in parsed and "tool_name" not in parsed:
        parsed["tool_name"] = parsed.pop("tool")
    # Raise instead of assert: assertions are skipped under `python -O`
    if "tool_name" not in parsed or "args" not in parsed:
        raise ValueError("tool call must contain 'tool_name' and 'args'")
    return parsed
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "Respond only with a valid JSON tool call."},
        {"role": "user", "content": "Add 5 and 10."}
    ],
    temperature=0.0,
    max_tokens=256,
)
raw = response["choices"][0]["message"]["content"]
result = normalize_tool_wrapper(raw)
print(result)
CLI — llama-cli
# One-shot prompt (ensure sufficient VRAM before running)
llama-cli \
  --model ministral-3-14b-instruct-2512-gguf-F16.gguf \
  --chat-template mistral \
  --system-prompt "You are a helpful assistant." \
  --prompt "Summarize the following and return a JSON object with keys: summary, risk_level, action_items." \
  --n-predict 512 \
  --ctx-size 8192 \
  --n-gpu-layers -1 \
  --temp 0.15
For server deployment (OpenAI-compatible endpoint):
llama-server \
  --model ministral-3-14b-instruct-2512-gguf-F16.gguf \
  --chat-template mistral \
  --ctx-size 8192 \
  --n-gpu-layers -1 \
  --port 8080 \
  --host 0.0.0.0
Query via the OpenAI-compatible API:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-required")
response = client.chat.completions.create(
    model="ministral-3-14b-instruct-2512-gguf-F16",
    messages=[{"role": "user", "content": "Your prompt here"}],
    temperature=0.15,
)
print(response.choices[0].message.content)
Artifact Provenance
| Artifact | Format | Size | SHA256 |
|---|---|---|---|
| ministral-3-14b-instruct-2512-gguf-F16.gguf | GGUF F16 | 27.0 GB | 74ea113134173d29f8daba097457500e831eace3741de002846b3ab89781fd52 |
| Q4_K_M (companion repo) | GGUF Q4_K_M | 8.24 GB | a23910514ee512aa28db8dddd390c26a73b9c318dcdec374ae02d722d9658749 |
Both artifacts were produced from mistralai/Ministral-3-14B-Instruct-2512 using a custom-built llama.cpp conversion pipeline developed by PBH Applied Systems. Conversion was performed on the full HuggingFace snapshot without modification to model weights prior to conversion.
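A downloaded artifact can be checked against the SHA256 values in the table above. The following is a minimal sketch using only the Python standard library; the helper names are illustrative, and the file is read in chunks so that a 27 GB file does not have to fit in memory.

```python
import hashlib

# F16 digest as published in the provenance table on this card.
EXPECTED_SHA256 = "74ea113134173d29f8daba097457500e831eace3741de002846b3ab89781fd52"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA256 in 1 MiB chunks and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Return True if the file's digest matches the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

Typical use would be `verify_artifact(model_path)` with the path returned by `hf_hub_download`, run once after download and before loading the model.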
Evaluation Methodology
quant_eval v7.21 is a proprietary behavioral evaluation harness developed by PBH Applied Systems. The F16 evaluation run (20260206_213615) produces the full_weight_cache.json used as the reference baseline in the subsequent Q4_K_M comparison run (20260209_170235). This two-run architecture — F16 first, Q4_K_M second against the cached F16 results — enables exact apples-to-apples comparison of capability retention across quantization levels on an identical fixture set.
Fixture set: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c079371b02a94f3c149eb78a6adc03dc16ff6833b964fbf4174f0)
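The baseline-then-comparison design described above amounts to subtracting cached F16 rates from the quantized run's rates, signal by signal. The sketch below illustrates that idea using the toolcall_only rates reported on this card. The dictionary layout is hypothetical; the actual structure of full_weight_cache.json is not documented here.

```python
# Per-signal toolcall_only rates, transcribed from the F16 vs. Q4_K_M table on this card.
f16_rates = {"tool_name_ok": 1.0, "args_ok": 1.0, "schema_ok": 0.0}
q4_rates = {"tool_name_ok": 0.0, "args_ok": 0.0, "schema_ok": 0.0}


def retention_deltas(baseline: dict, candidate: dict) -> dict:
    """Return candidate minus baseline for every signal present in both runs."""
    shared = baseline.keys() & candidate.keys()
    return {name: candidate[name] - baseline[name] for name in sorted(shared)}


deltas = retention_deltas(f16_rates, q4_rates)

# Signals whose rate dropped after quantization.
degraded = [name for name, delta in deltas.items() if delta < 0]

print(deltas)
print("degraded:", degraded)
```

Note that schema_ok shows no delta because it fails in both runs, which is why a delta alone is not enough and the absolute baseline rate needs to be reported alongside it.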
Task families evaluated:
| Family | Description | Pass Signals |
|---|---|---|
| fuzz | Property-based regression; structured placement correctness | schema_ok, constraints_ok |
| json | Single-step structured JSON with constraint rules | schema_ok, constraints_ok |
| json_multistep | Multi-step planning with self-check and oracle verification | schema_ok, checks_consistent_ok, stop_semantics_ok, oracle_equiv_ok |
| mcq | Multiple-choice extraction | choice_ok |
| stateful_followup | Two-turn state tracking; turn-2 correct given turn-1 | turn1/2_parse_ok, turn1/2_exact_match |
| mixed_brief_json | Hybrid: natural language answer + valid JSON block | answer_line_ok, json_parse_ok, schema_ok |
| toolcall | Tool call embedded in response; parse + schema validation | stage1_tool_parse_ok, stage1_tool_schema_ok |
| toolcall_only | Bare schema-only tool call; strict tool name + args check | tool_name_ok, args_ok |
Scores are conservative conjunctions — a case passes only when all gating signals succeed.
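The conjunction rule can be illustrated with the toolcall_only family, where tool_name_ok and args_ok gate the result and schema_ok is tracked but does not gate. This is a sketch of the rule as stated on this card, with a hypothetical helper name; it is not the quant_eval implementation.

```python
def case_passes(signals: dict[str, int], gating: tuple[str, ...]) -> bool:
    """Return True only if every gating signal is 1; non-gating signals are ignored."""
    return all(signals.get(name, 0) == 1 for name in gating)


TOOLCALL_ONLY_GATING = ("tool_name_ok", "args_ok")

# Signal values as reported for toolcall_only on this card.
f16_case = {"tool_name_ok": 1, "args_ok": 1, "schema_ok": 0}
q4_case = {"tool_name_ok": 0, "args_ok": 0, "schema_ok": 0}

f16_pass = case_passes(f16_case, TOOLCALL_ONLY_GATING)
q4_pass = case_passes(q4_case, TOOLCALL_ONLY_GATING)

print("F16 passes:", f16_pass)
print("Q4_K_M passes:", q4_pass)
```

A missing gating signal is treated as a failure here, which keeps the check conservative when a run produces incomplete output.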
Evaluation hardware: NVIDIA RTX 4090 (24 GB VRAM) · F16 evaluation date: February 6, 2026 · quant_eval seed: 42
🔬 About quant_eval & This Evaluation Series
quant_eval is a proprietary behavioral evaluation harness developed by PBH Applied Systems, LLC. It measures real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning — not perplexity or leaderboard proxies. Every model published under pbhappliedsystems has been independently evaluated using quant_eval before being recommended for any production role.
See it in action: Live AI Agent Demo → The demo runs production-style agent workflows powered by open-weight models selected through the quant_eval evaluation pipeline.
Need a deployment recommendation? Not sure which quantization level is right for your hardware, latency target, or agent type? → pbhappliedsystems.com
Evaluated and published by PBH Applied Systems, LLC · patrick@pbhappliedsystems.com
About PBH Applied Systems
PBH Applied Systems, LLC is an Oklahoma City–based applied machine learning and AI systems company specializing in production-grade model evaluation, quantization pipelines, agentic AI infrastructure, and scalable AI-driven application development. The organization operates with a strong emphasis on engineering rigor, reproducibility, and real-world deployment constraints — particularly in environments where performance, cost efficiency, and reliability must be balanced against available hardware and budget.
Founder — Patrick Hill, M.S.
PBH Applied Systems was founded by Patrick Hill, a Data Scientist and AI/ML Engineer with 10+ years of experience delivering advanced analytics, predictive modeling, and decision-support solutions across high-stakes operational environments. Patrick holds a Master of Science in Software Engineering with concentrations in Artificial Intelligence and Machine Learning (GPA: 4.0) and a B.S. in Business Finance.
Technical expertise spans:
- Languages & Data: Python, SQL, Linux, Pandas, NumPy, scikit-learn
- ML & Modeling: Supervised and unsupervised learning, neural networks, NLP, transformers, regression, classification, forecasting, and feature engineering
- AI/ML Frameworks: PyTorch, TensorFlow/Keras, HuggingFace Transformers, GGUF, llama.cpp, BitsAndBytes, PEFT, QLoRA
- Deployment & MLOps: Flask APIs, Docker, CI/CD pipelines, REST endpoints, streaming inference, version control
- Data Platforms: Jupyter, Databricks, Power BI, Matplotlib
- Quantization: GGUF conversion, Q4_K_M / Q5_K_M / Q8_0 strategies, adapter-per-model evaluation architecture
Published Author
Patrick is the author of Applied Machine Learning: Concepts, Tools, and Case Studies — a 1,200+ page practitioner-oriented textbook covering statistical modeling, supervised and unsupervised learning, neural networks, NLP, and real-world decision-support case studies. The text has been adopted as required reading for CSC 373 – Machine Learning at the University of Advancing Technology, and reflects the same philosophy applied across all PBH systems: prioritize practical correctness over theoretical novelty, favor interpretable and reliable solutions, and introduce complexity only when justified by data and deployment constraints.
Core Service Areas
1. LLM Optimization & Deployment End-to-end conversion of full-weight HuggingFace models to production-ready GGUF format, with quantization strategies matched to target hardware and latency requirements. Custom-built llama.cpp pipelines with adapter-per-model architecture ensuring strict separation of concerns and universal cross-model compatibility.
2. AI Evaluation Frameworks Proprietary behavioral evaluation via quant_eval — multi-run, timestamped pipelines producing structured artifacts, SHA256-verified manifests, per-family pass rates, F16 vs. quantized delta analysis, and deployment-ready recommendations. Evaluation batteries cover structured JSON output, multi-step reasoning, tool-calling fidelity, MCQ benchmarking, and fuzz/regression testing.
3. Agentic AI Infrastructure Design and deployment of agent-oriented architectures using LlamaIndex ReAct agents, Flask orchestration layers, and serverless GPU inference. Full pipeline from model selection through quantization, evaluation, and production serving — including lead capture flows, budget controls, and API gateway integration.
4. Scalable AI Application Development Production-grade multimodal AI applications integrating quantized LLMs, Whisper (speech-to-text), and BLIP (vision) via modular Flask APIs with Dockerized deployment and streaming-style responses. Advanced time-series forecasting systems featuring custom lightweight attention mechanisms, ensemble meta-learning, Bayesian hyperparameter optimization with resource-aware OOM backoff, and FinBERT sentiment fusion for hybrid structured/unstructured data pipelines.
5. ML Pipeline Design & Analytics End-to-end data and model pipelines engineered for decision-support and operational forecasting. Encompasses feature engineering, leak-free forward-chaining cross-validation, KPI dashboard development, and analytical governance procedures designed for reproducibility at scale. Proven track record of translating complex model outputs into actionable insights for senior stakeholders across large-scale operational datasets.
6. Model & Agent Cataloging Structured model catalog publishing with reproducible artifacts, standardized reporting, and clear performance tradeoff documentation — enabling engineering teams to make informed deployment decisions without re-running evaluations from scratch.
Engineering Principles
- Reproducibility first — Every run produces structured artifacts, versioned manifests, and comparable outputs
- Universality as a requirement — Systems work across models without custom rewrites per deployment
- No silent behavior changes — Evaluation logic, prompts, and workflows are locked and versioned
- GPU utilization is non-negotiable — All pipelines are designed to fully leverage available hardware
- Separation of IP and operations — Core intellectual property is maintained independently of client deliverables
📞 Work With PBH Applied Systems
This F16 card documents what the model can do at full precision. The Q4_K_M companion card documents what degrades when you quantize — including a complete toolcall_only failure (1.000 → 0.000) that is invisible without running both evaluations. Without evaluating both formats against the same fixture set before deployment, you are making a deployment decision without the data to support it.
👉 Book a Scoping Call — Discuss your model selection, quantization strategy, or deployment architecture directly with Patrick.
👉 Request an Evaluation Report — A full quant_eval behavioral audit for your target model(s): per-family pass rates, F16 vs. quantized delta analysis, failure cluster diagnostics, and a deployment recommendation. Engagements from $2,500.
Connect
| 🌐 Website | pbhappliedsystems.com |
| patrick@pbhappliedsystems.com | |
| PBH Applied Systems, LLC | |
| ▶️ YouTube | @pbhappliedsystems |
| @pbhappliedsystems | |
| pbhappliedsystems |
License
This GGUF repository inherits the license of the base model:
Apache 2.0 — mistralai/Ministral-3-14B-Instruct-2512
The quant_eval evaluation methodology, fixture set, and scoring framework are proprietary to PBH Applied Systems, LLC and are not included in this repository.
GGUF conversion and behavioral evaluation performed by PBH Applied Systems, LLC · quant_eval v7.21 · F16 Run ID: 20260206_213615
Model tree for pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-F16
Base model
mistralai/Ministral-3-14B-Base-2512