ERNIE-4.5-21B-A3B-Thinking-GGUF Adapter: External Logit Corrector

This repository contains a lightweight PyTorch adapter trained with the ggufForge toolkit. It is designed to improve the generation quality of the ultra-low-bit quantized GGUF model unsloth/ERNIE-4.5-21B-A3B-Thinking-GGUF without modifying the base model itself.

The adapter implements External Logit Correction – a small transformer that refines the logits produced by llama.cpp during inference. It is not a standalone GGUF file and cannot be used directly with pure llama.cpp. Instead, you need the ggufForge inference wrapper, which loads both the base GGUF model (via llama-cpp-python) and the adapter (in PyTorch), then combines them on the fly.
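
A minimal sketch of the idea, assuming the correction is additive (corrected logits = base logits + adapter output). The actual ggufForge combination rule is not documented here, and `correct_logits`, `delta`, and the toy values below are hypothetical:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def correct_logits(base_logits, delta, scale=1.0):
    """Assumed residual form: corrected = base + scale * delta."""
    if len(base_logits) != len(delta):
        # The adapter output must line up with the base vocabulary.
        raise ValueError("adapter output must match the base vocabulary size")
    return [b + scale * d for b, d in zip(base_logits, delta)]

# Toy 4-token vocabulary; all values are made up for illustration.
base_logits = [2.0, 1.9, 0.1, -1.0]   # what the quantized model would emit
delta = [-0.5, 0.7, 0.0, 0.0]         # hypothetical adapter output

corrected = correct_logits(base_logits, delta)
base_top = max(range(len(base_logits)), key=base_logits.__getitem__)
corrected_top = max(range(len(corrected)), key=corrected.__getitem__)
probs = softmax(corrected)

print(base_top, corrected_top)
```

In this toy case the correction changes the top-ranked token from index 0 to index 1. The length check reflects why the adapter is tied to one vocabulary: its output has to match the base model's token IDs one-to-one.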


Model Details

  • Base Model: unsloth/ERNIE-4.5-21B-A3B-Thinking-GGUF
    Specific file used: ERNIE-4.5-21B-A3B-Thinking-UD-Q2_K_XL.gguf
  • Adapter Architecture: 256‑dim single‑block causal transformer with 8 attention heads (~1M parameters)
  • Training Dataset: prithivMLmods/Atlas-Think-Cot-12M
    (chain‑of‑thought reasoning examples)
  • Training Steps: 14,000
  • Validation Perplexity Improvement: 21.2% (base 4.39 → adapted 3.46)
  • License: Apache 2.0
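
As a rough sanity check on the "~1M parameters" figure, the sketch below counts the weights of one standard transformer block at width 256. It assumes biased linear layers, a 4x feed-forward expansion, and two LayerNorms, none of which is confirmed by the card, and it ignores any projection between the vocabulary and the 256-dim space:

```python
def transformer_block_params(d_model, ffn_mult=4):
    """Parameter count of one standard block (assumed layout, see lead-in)."""
    # Q, K, V and output projections, each d_model x d_model plus a bias.
    attn = 4 * (d_model * d_model + d_model)
    # Two-layer feed-forward network with hidden width ffn_mult * d_model.
    d_ff = ffn_mult * d_model
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms, each with a scale and a shift vector.
    norms = 2 * (2 * d_model)
    return attn + ffn + norms

block_params = transformer_block_params(256)
head_dim = 256 // 8  # 8 heads split the width; head count does not change the total

print(block_params)  # 789760
print(head_dim)      # 32
```

Under these assumptions the block alone comes to about 0.79M parameters, which is the same order of magnitude as the stated size.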

Intended Use

This adapter is intended for research and experimentation with ultra-low‑bit quantized LLMs. It demonstrates that a tiny external network can recover much of the quality lost during extreme quantization, especially on MoE‑based models like ERNIE.

Do not expect this adapter to work with other base models (e.g., Llama, Mistral) – it was trained specifically for ERNIE‑4.5‑21B with the exact tokenizer and vocabulary of that GGUF file.


How to Use

You need the ggufForge Python package to run inference with this adapter. The library handles downloading the base GGUF model (if not already cached), loading the adapter weights, and performing the combined forward pass.

1. Install ggufForge

git clone https://github.com/ShotokanOSS/ggufForge.git
cd ggufForge
python -m venv .venv
source .venv/bin/activate   # or .venv\Scripts\activate on Windows

# Basic install
pip install -U pip
pip install .

# For GPU acceleration (recommended)
# Recent llama-cpp-python builds expect -DGGML_CUDA=on; older releases used -DLLAMA_CUDA=on
CMAKE_ARGS="-DGGML_CUDA=on" pip install .

2. Run Inference with the Adapter

The adapter is automatically downloaded from this Hugging Face repo when you specify --adapter-repo ShotokanJ/ERNIE-4.5-21B-A3B-Thinking-GGUF-finetune-Atlas-Think-Cot.

Single question

run-inference \
  --mode single \
  --question "Explain the concept of quantum entanglement in simple terms." \
  --adapter-repo ShotokanJ/ERNIE-4.5-21B-A3B-Thinking-GGUF-finetune-Atlas-Think-Cot \
  --base-repo unsloth/ERNIE-4.5-21B-A3B-Thinking-GGUF \
  --gguf-filename ERNIE-4.5-21B-A3B-Thinking-UD-Q2_K_XL.gguf

Interactive chat

run-inference \
  --mode chat \
  --question "Explain the concept of quantum entanglement in simple terms." \
  --adapter-repo ShotokanJ/ERNIE-4.5-21B-A3B-Thinking-GGUF-finetune-Atlas-Think-Cot \
  --base-repo unsloth/ERNIE-4.5-21B-A3B-Thinking-GGUF \
  --gguf-filename ERNIE-4.5-21B-A3B-Thinking-UD-Q2_K_XL.gguf

Compare base vs. adapter performance

run-inference \
  --mode compare \
  --question "Explain the concept of quantum entanglement in simple terms." \
  --adapter-repo ShotokanJ/ERNIE-4.5-21B-A3B-Thinking-GGUF-finetune-Atlas-Think-Cot \
  --base-repo unsloth/ERNIE-4.5-21B-A3B-Thinking-GGUF \
  --gguf-filename ERNIE-4.5-21B-A3B-Thinking-UD-Q2_K_XL.gguf

All generation parameters (temperature, min‑p, repetition penalty, etc.) can be adjusted as described in the ggufForge documentation.
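
The three invocations above differ only in `--mode`. When scripting them from Python, the command line can be assembled once. This sketch only builds the argument list from the flags shown above; the helper name `build_command` is hypothetical and not part of ggufForge:

```python
import shlex

ADAPTER_REPO = "ShotokanJ/ERNIE-4.5-21B-A3B-Thinking-GGUF-finetune-Atlas-Think-Cot"
BASE_REPO = "unsloth/ERNIE-4.5-21B-A3B-Thinking-GGUF"
GGUF_FILENAME = "ERNIE-4.5-21B-A3B-Thinking-UD-Q2_K_XL.gguf"

def build_command(mode, question):
    """Return the run-inference argument list for one of the modes shown above."""
    if mode not in {"single", "chat", "compare"}:
        raise ValueError(f"unknown mode: {mode!r}")
    return [
        "run-inference",
        "--mode", mode,
        "--question", question,
        "--adapter-repo", ADAPTER_REPO,
        "--base-repo", BASE_REPO,
        "--gguf-filename", GGUF_FILENAME,
    ]

cmd = build_command(
    "compare",
    "Explain the concept of quantum entanglement in simple terms.",
)
print(shlex.join(cmd))  # shell-quoted command line, ready to paste
# To execute it with ggufForge installed: subprocess.run(cmd, check=True)
```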

Training Details

The adapter was trained using the train-adapter command from ggufForge:

train-adapter \
  --model unsloth/ERNIE-4.5-21B-A3B-Thinking-GGUF \
  --filename ERNIE-4.5-21B-A3B-Thinking-UD-Q2_K_XL.gguf \
  --dataset prithivMLmods/Atlas-Think-Cot-12M \
  --adapter-dim 256 \
  --steps 14000 \
  --learning-rate 5e-5 \
  --batch-size 1 \
  --accumulation-steps 32 \
  --output-dir ernie-adapter

  • Hardware: Single NVIDIA A100 (40GB)
  • Training time: ~6 hours
  • Loss function: Cross‑entropy on corrected logits
  • Optimizer: AdamW (weight decay 0.01)
  • Context size: 1024 tokens (streaming from dataset)
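
The throughput implied by these settings can be worked out directly. The sketch below assumes `--steps` counts optimizer updates (after accumulation) and that every sequence fills the 1024-token context, so the token figures are upper bounds. Neither assumption is confirmed by the source:

```python
batch_size = 1
accumulation_steps = 32
steps = 14_000
context_tokens = 1024
training_hours = 6  # approximate, as stated above

# Sequences contributing to each optimizer update.
effective_batch = batch_size * accumulation_steps
# Upper bound on tokens seen per update and over the whole run.
tokens_per_step = effective_batch * context_tokens
total_tokens = steps * tokens_per_step
# Average wall-clock time per update.
seconds_per_step = training_hours * 3600 / steps

print(effective_batch)             # 32
print(tokens_per_step)             # 32768
print(total_tokens)                # 458752000
print(round(seconds_per_step, 2))  # 1.54
```

Under these assumptions the adapter saw at most roughly 459M tokens, a small fraction of the dataset its name suggests.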

Performance

Metric                  Base Model   + Adapter   Improvement
Validation Perplexity   4.39         3.46        21.2%
Human‑rated coherence   -            -           Noticeably better reasoning chains

Detailed evaluation logs will be published with the ggufForge study (to be added).
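
The reported improvement follows from the two perplexity values. The sketch below reproduces the percentage and, assuming perplexity is the exponential of the mean per-token cross-entropy in nats (the usual definition, not stated explicitly in this card), converts it to a loss difference:

```python
import math

base_ppl = 4.39
adapted_ppl = 3.46

# Relative perplexity reduction, as reported in the table.
relative_improvement = (base_ppl - adapted_ppl) / base_ppl
print(f"{relative_improvement:.1%}")  # 21.2%

# Equivalent mean cross-entropy per token (nats), assuming ppl = exp(loss).
base_loss = math.log(base_ppl)
adapted_loss = math.log(adapted_ppl)
loss_reduction = base_loss - adapted_loss
print(round(loss_reduction, 3))
```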


Limitations

  • Vocabulary lock: The adapter only works with the exact tokenizer of the ERNIE GGUF file. It cannot be transferred to other models, even within the same family, without retraining.
  • Inference speed: Because the adapter runs in PyTorch while the base model runs in llama.cpp, there is a small overhead (typically 10–20% slower than pure llama.cpp).
  • Context window: The adapter was trained with a context size of 1024 tokens. It can be run with contexts of up to 8192 tokens, but correction quality may degrade beyond the training length.
  • Not a standalone solution: You must use the ggufForge wrapper; pure llama.cpp or other UIs will not recognize the adapter.
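
The context-window limits above can be checked before running a long prompt. This is a small hypothetical helper, not part of ggufForge; the two thresholds come from the card, and the status strings are illustrative:

```python
TRAINED_CONTEXT = 1024  # context size used during adapter training
MAX_CONTEXT = 8192      # largest context the card says is usable

def context_status(n_tokens):
    """Classify a prompt length against the limits stated in this card."""
    if n_tokens < 0:
        raise ValueError("token count cannot be negative")
    if n_tokens <= TRAINED_CONTEXT:
        return "within training context"
    if n_tokens <= MAX_CONTEXT:
        return "usable, expect some quality loss"
    return "beyond stated limit"

for n in (512, 4096, 16000):
    print(n, context_status(n))
```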

For questions or issues, please open a ticket in the ggufForge issue tracker.
