---
license: apache-2.0
library_name: gguf
pipeline_tag: image-to-text
tags:
- latex-ocr
- math
- formula-recognition
- gguf
- crispembed
- ppformulanet
- sam-vit
- mbart
base_model: PaddlePaddle/PP-FormulaNet-L_safetensors
---

# PP-FormulaNet-L GGUF: Printed Math OCR

GGUF-quantized versions of [PaddlePaddle's PP-FormulaNet-L](https://huggingface.co/PaddlePaddle/PP-FormulaNet-L_safetensors) for on-device printed formula recognition (image to LaTeX).

## Model Description

| Property | Value |
|---|---|
| Architecture | SAM-ViT encoder + MBart Transformer decoder |
| Parameters | 181M |
| Input | 768x768 RGB image |
| Output | LaTeX token sequence |
| Vocab | 50,000 tokens (NougatTokenizer / BPE) |
| License | **Apache-2.0** |

### Encoder: SAM-style Vision Transformer

| Property | Value |
|---|---|
| Type | SAM ViT-B (from PaddleOCR Vary_VIT_B_Formula) |
| Layers | 12 |
| Hidden dim | 768 |
| Heads | 12 |
| MLP dim | 3072 |
| Patch size | 16x16 (48x48 patches) |
| Attention | Windowed (ws=14) on layers 0,1,3,4,6,7,9,10; Global on layers 2,5,8,11 |
| Position bias | Decomposed relative position (per-axis, interpolated) |
| Neck | Conv1x1 + LayerNorm2d + Conv3x3 + LayerNorm2d (768 -> 256) |
| Projector | 2x Conv3x3(stride=2) + 2x Linear (256 -> 512). Output: 144 tokens x 512d |

### Decoder: MBart (Pre-LayerNorm)

| Property | Value |
|---|---|
| Layers | 8 |
| Heads | 16 |
| d_model | 512 |
| FFN dim | 2048 |
| Activation | GELU |
| Embedding | scale_embedding = sqrt(512) |
| Max length | 1024 tokens |

## Available Variants

| File | Quant | Size | Encoder cos vs F32 | Notes |
|---|---|---|---|---|
| `ppformulanet-l-f16.gguf` | FP16 | 347 MB | baseline | Unquantized reference |
| `ppformulanet-l-q8_0.gguf` | Q8_0 | 241 MB | 0.999940 | Critical tensors in F16 |
| `ppformulanet-l-q4_k.gguf` | Q4_K | 122 MB | 0.997595 | Desktop/mobile target |

All three produce **identical decoded LaTeX** on test formulas.
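
The "Encoder cos vs F32" column reports a cosine similarity between encoder outputs. Below is a minimal sketch of that metric in plain Python, assuming both outputs have been flattened to equal-length lists of floats. The function name and the toy vectors are illustrative and not part of CrispEmbed; how the published figures were actually measured is not specified here.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length flat float sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy stand-ins for a reference encoder output and a quantized one (hypothetical values).
reference = [0.12, -0.53, 0.88, 0.07]
quantized = [0.11, -0.54, 0.87, 0.08]
print(f"{cosine_similarity(reference, quantized):.6f}")
```

A value close to 1.0 means the quantized encoder output points in nearly the same direction as the reference, which is why small drops such as 0.9976 can still decode to the same LaTeX.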

**Recommended: `ppformulanet-l-q8_0.gguf`** (241 MB): near-lossless quality at 1.4x compression vs F16.

**For mobile/desktop: `ppformulanet-l-q4_k.gguf`** (122 MB): good quality at 2.8x compression.

### Quantization Strategy

**Q8_0 mode** keeps critical tensors in FP16 for quality:

- Embeddings (token, position, patch)
- LayerNorm weights/biases
- Relative position bias tables (tiny, critical for attention geometry)
- LM head (determines output tokens)
- Neck and projector weights (encoder-decoder bottleneck)

Large attention/MLP weight matrices are quantized to Q8_0.

**Q4_K mode** uses CrispEmbed's `crispembed-quantize` tool (K-quant with importance-weighted groups). LayerNorm and biases stay in F16; large matrices go to Q4_K.

## Usage with CrispEmbed

```bash
# CLI (architecture is auto-detected from GGUF metadata)
crispembed -m ppformulanet-l-q8_0.gguf --ocr formula.png

# Output: LaTeX string
# \zeta_{0}(\nu) = -\frac{\nu\varrho^{-2\nu}}{\pi} ...
```

### C API

```c
#include <stdio.h>
#include "crispembed.h"

/* pixels, w, h, channels: caller-supplied decoded image buffer and its dimensions */
void *ctx = crispembed_math_ocr_init("ppformulanet-l-q8_0.gguf", 4);
int len;
const char *latex = crispembed_math_ocr_recognize(ctx, pixels, w, h, channels, &len);
printf("%s\n", latex);
crispembed_math_ocr_free(ctx);
```

Architecture auto-detection reads `general.architecture = "ppformulanet_l"` from GGUF metadata.

## Image Preprocessing

The model expects UniMERNet-style preprocessing:

1. Convert to grayscale, replicate to 3 channels
2. Resize maintaining aspect ratio to fit 768x768
3. Center-pad with black (0) to fill 768x768
4. Normalize: mean=0.7931, std=0.1738

CrispEmbed handles this automatically when you pass raw image bytes.
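
The four preprocessing steps above can be sketched in plain Python for anyone reproducing them outside CrispEmbed. This is an illustration under stated assumptions, not CrispEmbed's implementation: the input is assumed to be already converted to grayscale (rows of 0..255 values), the resize uses nearest-neighbour sampling for brevity, and pixel values are assumed to be scaled to [0, 1] before normalization. The names `fit_size` and `preprocess` are hypothetical.

```python
MEAN, STD = 0.7931, 0.1738  # normalization constants from the list above
TARGET = 768                # model input side length

def fit_size(w, h, target=TARGET):
    """Largest (new_w, new_h) that fits in target x target, keeping aspect ratio."""
    scale = min(target / w, target / h)
    new_w = max(1, min(target, round(w * scale)))
    new_h = max(1, min(target, round(h * scale)))
    return new_w, new_h

def preprocess(gray, target=TARGET):
    """gray: list of rows of 0..255 grayscale values. Returns 3 x target x target floats."""
    h, w = len(gray), len(gray[0])
    new_w, new_h = fit_size(w, h, target)
    # Nearest-neighbour resize (assumption: a real pipeline likely uses a smoother filter).
    resized = [[gray[y * h // new_h][x * w // new_w] for x in range(new_w)]
               for y in range(new_h)]
    # Center-pad with black (0) to target x target.
    top, left = (target - new_h) // 2, (target - new_w) // 2
    canvas = [[0] * target for _ in range(target)]
    for y, row in enumerate(resized):
        canvas[top + y][left:left + new_w] = row
    # Scale to [0, 1] (assumption), then normalize with the documented mean/std.
    plane = [[(v / 255.0 - MEAN) / STD for v in row] for row in canvas]
    # Replicate the single grayscale plane to 3 channels.
    return [[row[:] for row in plane] for _ in range(3)]

# A white 200x100 image is resized to 768x384 and padded top and bottom.
out = preprocess([[255] * 200 for _ in range(100)])
print(len(out), len(out[0]), len(out[0][0]))
```

Because padding happens before normalization in this ordering, the black border ends up at a strongly negative value, (0 - 0.7931) / 0.1738, rather than at zero.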

## Parity Verification

Tested against HuggingFace `PPFormulaNetForConditionalGeneration` reference:

| Metric | F32 | Q8_0 | Q4_K |
|---|---|---|---|
| Encoder cosine similarity | 0.999962 | 0.999940 | 0.997595 |
| Top-1 token match | Yes | Yes | Yes |
| Full decode match | Yes | Yes | Yes |

## Credits

- [PaddlePaddle/PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR): PP-FormulaNet-L architecture and weights (Apache-2.0)
- [HuggingFace Transformers](https://huggingface.co/PaddlePaddle/PP-FormulaNet-L_safetensors): safetensors conversion and reference implementation
- [CrispEmbed](https://github.com/CrispStrobe/CrispEmbed): GGUF conversion, C++ inference engine, quantization

## Conversion

```bash
# F16
python models/convert-ppformulanet-l-to-gguf.py \
    --model-dir PP-FormulaNet-L --output ppformulanet-l-f16.gguf --fp16

# Q8_0 (critical tensors in F16)
python models/convert-ppformulanet-l-to-gguf.py \
    --model-dir PP-FormulaNet-L --output ppformulanet-l-q8_0.gguf --q8_0

# Q4_K (from F16 via C quantizer)
crispembed-quantize ppformulanet-l-f16.gguf ppformulanet-l-q4_k.gguf q4_k
```
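
After conversion, a quick sanity check is to confirm that the output starts with a valid GGUF header. The sketch below parses only the fixed 24-byte header (magic, version, tensor count, metadata key/value count), assuming the little-endian GGUF v2+ layout. The helper name `read_gguf_header` is hypothetical, and the demo uses a synthetic header rather than a real model file.

```python
import struct

def read_gguf_header(data: bytes):
    """Parse the fixed 24-byte GGUF header (little-endian, GGUF v2+ layout)."""
    if len(data) < 24:
        raise ValueError("too short to be a GGUF file")
    magic, version, n_tensors, n_kv = struct.unpack("<4sIQQ", data[:24])
    if magic != b"GGUF":
        raise ValueError(f"bad magic: {magic!r}")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Synthetic header for demonstration (hypothetical counts).
# For a real file, pass the first 24 bytes: open(path, "rb").read(24)
demo = struct.pack("<4sIQQ", b"GGUF", 3, 2, 5)
print(read_gguf_header(demo))
```

This does not read `general.architecture` or any tensor data; it only catches truncated or mis-written files early.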