PosFormer Handwritten Math OCR (GGUF)

PosFormer (Position Forest Transformer) for handwritten mathematical expression recognition, converted to GGUF format for use with CrispEmbed.

License (IMPORTANT)

These weights are for academic/research use only.

Two license restrictions apply:

  1. Model code: The original SJTU-DeepVisionLab/PosFormer states: "This code is only free for academic research purposes and licensed under the 2-clause BSD License." These GGUF weights are derived from their published checkpoint.

  2. Training data: The model was trained on CROHME 2014, which is licensed CC BY-NC-SA 3.0 (non-commercial).

If you need weights for commercial use, you must retrain the PosFormer architecture on a permissibly-licensed dataset. The C++ inference engine (CrispEmbed) and GGUF converter are original implementations and carry no such restriction.

Model details

| Property | Value |
| --- | --- |
| Architecture | DenseNet encoder + 3-layer Transformer decoder + ARM |
| Parameters | 6.5M |
| Training data | CROHME 2014 (CC BY-NC-SA 3.0) |
| Vocabulary | 113 LaTeX tokens |
| Input | Grayscale handwritten math image |
| Output | LaTeX token sequence |

Attention Refinement Module (ARM)

PosFormer extends BTTR with an Attention Refinement Module that provides coverage-aware decoding. ARM uses accumulated cross-attention weights from previous decoder layers to prevent the model from repeatedly attending to the same spatial positions, reducing repetition errors in long expressions.
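
The coverage idea can be illustrated with a toy sketch. This is not the PosFormer ARM formulation: it only shows how subtracting accumulated attention from the raw attention logits pushes later attention away from positions that were already attended to. The function names and the scalar `penalty` are hypothetical and chosen for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def refine_attention(logits, coverage, penalty=1.0):
    """Subtract a coverage penalty from the raw logits, then normalise.

    `coverage` holds the attention mass each spatial position has already
    received. The scalar `penalty` is an assumption made for this sketch.
    """
    refined = [l - penalty * c for l, c in zip(logits, coverage)]
    return softmax(refined)

# Toy run: three spatial positions, the same raw logits at every pass.
logits = [2.0, 1.0, 0.5]
coverage = [0.0, 0.0, 0.0]
history = []
for _ in range(3):
    attn = refine_attention(logits, coverage, penalty=4.0)
    history.append(attn)
    # Accumulate the attention spent so far on each position.
    coverage = [c + a for c, a in zip(coverage, attn)]

for i, attn in enumerate(history):
    print(i, [round(a, 3) for a in attn])
```

Without the penalty, every pass would peak at position 0. With it, the second pass already shifts its peak to a different position, which is the behaviour that reduces repeated tokens.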

Files

| File | Quant | Size | Notes |
| --- | --- | --- | --- |
| posformer-hw-f32.gguf | F32 | 24.9 MB | Full precision, reference quality |
| posformer-hw-q8_0.gguf | Q8_0 | 12 MB | Lossless on test images |
| posformer-hw-q4_k.gguf | Q4_K | 10 MB | Lossless on test images |
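
As a rough sanity check on these sizes, the sketch below estimates what each file would weigh if every one of the 6.51M parameters were stored in the named tensor type. The bits-per-weight figures come from the standard ggml block layouts (Q8_0: 34 bytes per 32 weights, Q4_K: 144 bytes per 256 weights). Treating the listed sizes as MiB and ignoring metadata are assumptions of this sketch.

```python
PARAMS = 6_510_000  # parameter count reported for this model (6.51M)

# Bits per weight for each GGUF tensor type.
# Q8_0: blocks of 32 int8 weights plus one fp16 scale -> 34 bytes / 32 weights.
# Q4_K: super-blocks of 256 weights stored in 144 bytes.
BITS_PER_WEIGHT = {
    "F32": 32.0,
    "Q8_0": 34 * 8 / 32,    # 8.5 bits per weight
    "Q4_K": 144 * 8 / 256,  # 4.5 bits per weight
}

def size_mib(params, bits_per_weight):
    """Size in MiB if all parameters used the given bits per weight."""
    return params * bits_per_weight / 8 / 2**20

estimates = {name: size_mib(PARAMS, bpw) for name, bpw in BITS_PER_WEIGHT.items()}
for name, mib in estimates.items():
    print(f"{name}: {mib:.1f} MiB")
```

The F32 estimate (about 24.8 MiB) is close to the listed 24.9 MB. The quantized estimates (about 6.6 and 3.5 MiB) are well below the listed 12 MB and 10 MB, which would be consistent with some tensors being kept at higher precision, although this model card does not say which tensors are quantized.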

Accuracy (CROHME 2014 test set, 986 images)

Greedy left-to-right decoding (no beam search):

| Model | Raw match | Parsed match |
| --- | --- | --- |
| PosFormer F32 | 56.0% | 61.4% |
| PosFormer Q8_0 | ~56% | ~61% |
| PosFormer Q4_K | ~56% | ~61% |
| BTTR (baseline) | 49.2% | 49.8% |
| HMER | 36.1% | 36.3% |

Note: the published PosFormer ExpRate of 62.7% uses bi-directional beam search (L2R + R2L, cross-scored). Our C++ port uses greedy L2R decoding only, so the gap of roughly 6.7 pp between our raw match (56.0%) and the published figure is expected from the lack of bi-directional scoring.
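
For readers unfamiliar with the term, greedy left-to-right decoding can be sketched as follows. This is a generic illustration, not the CrispEmbed C++ code: `step_fn`, the token ids, and the toy scoring function are all hypothetical stand-ins for the real decoder. Only the vocabulary size of 113 is taken from the model details.

```python
def greedy_decode(step_fn, sos_id, eos_id, max_len=200):
    """Pick the highest-scoring token at each step until EOS or max_len.

    `step_fn(prefix)` is a hypothetical callable that returns one score
    per vocabulary entry for the next token, given the tokens so far.
    """
    tokens = [sos_id]
    for _ in range(max_len):
        scores = step_fn(tokens)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the start-of-sequence token

# Toy stand-in for the decoder: emits a fixed sequence, then EOS.
SOS, EOS, VOCAB_SIZE = 0, 1, 113   # ids are made up; 113 is the vocabulary size
TARGET = [5, 7, 9]

def toy_step(prefix):
    pos = len(prefix) - 1
    want = TARGET[pos] if pos < len(TARGET) else EOS
    return [1.0 if i == want else 0.0 for i in range(VOCAB_SIZE)]

decoded = greedy_decode(toy_step, SOS, EOS)
print(decoded)
```

Greedy decoding commits to a single token per step and never revisits it. Beam search keeps several candidate sequences alive, and the bi-directional variant additionally scores each candidate in the opposite direction, which is where the published result gains its extra accuracy.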

Usage with CrispEmbed

# Build
cd CrispEmbed-build
cmake /path/to/CrispEmbed
make -j$(nproc) test-posformer

# Run
export LD_LIBRARY_PATH=$PWD/ggml/src
./test-posformer posformer-hw-q8_0.gguf image.bmp

Parity verification

The C++ inference matches PyTorch reference to >99.999% (cosine similarity = 1.000000 at every decoder step, max absolute difference < 0.00001). Verified using per-layer intermediate dumps; see tests/parity/posformer_*.py in the CrispEmbed repo.
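
The two metrics quoted above can be computed as in the following sketch. It is not the code in tests/parity; the function names and the sample vectors are hypothetical, and the thresholds simply mirror the figures stated above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def max_abs_diff(a, b):
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))

def check_parity(reference, candidate, min_cos=0.99999, max_diff=1e-5):
    """True if the candidate dump matches the reference within both thresholds."""
    if len(reference) != len(candidate):
        raise ValueError("dumps have different lengths")
    return (cosine_similarity(reference, candidate) >= min_cos
            and max_abs_diff(reference, candidate) < max_diff)

# Made-up activations standing in for one layer's intermediate dump.
ref = [0.12, -1.5, 3.25, 0.0]
cpp = [0.120001, -1.500002, 3.249999, 0.000001]
ok = check_parity(ref, cpp)
print(ok)
```

Checking both metrics is useful because cosine similarity ignores overall scale, while the maximum absolute difference catches a single badly wrong element that an averaged metric could hide.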

Citation

@inproceedings{guan2024posformer,
  title={PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer},
  author={Guan, Tongkun and others},
  booktitle={ECCV},
  year={2024}
}
