PosFormer Handwritten Math OCR — GGUF

PosFormer (Position Forest Transformer) for handwritten mathematical expression recognition, converted to GGUF format for use with CrispEmbed.

License — IMPORTANT

These weights are for academic/research use only.

Two license restrictions apply:

Model code: The original SJTU-DeepVisionLab/PosFormer states: "This code is only free for academic research purposes and licensed under the 2-clause BSD License." These GGUF weights are derived from their published checkpoint.
Training data: The model was trained on CROHME 2014, which is licensed CC BY-NC-SA 3.0 (non-commercial).

If you need weights for commercial use, you must retrain the PosFormer architecture on a permissively licensed dataset. The C++ inference engine (CrispEmbed) and GGUF converter are original implementations and carry no such restriction.

Model details

Property	Value
Architecture	DenseNet encoder + 3-layer Transformer decoder + ARM
Parameters	6.5M
Training data	CROHME 2014 (CC BY-NC-SA 3.0)
Vocabulary	113 LaTeX tokens
Input	Grayscale handwritten math image
Output	LaTeX token sequence

Attention Refinement Module (ARM)

PosFormer extends BTTR with an Attention Refinement Module that provides coverage-aware decoding. ARM uses accumulated cross-attention weights from previous decoder layers to prevent the model from repeatedly attending to the same spatial positions, reducing repetition errors in long expressions.
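
The coverage idea can be illustrated with a small, simplified sketch. This is not the model's actual ARM: the fixed scalar `penalty` and the helper names are assumptions made for illustration, whereas the real module learns its refinement term. The sketch only shows how subtracting a term derived from accumulated attention pushes weight away from positions that were already attended.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def refine_attention(logits, coverage, penalty=2.0):
    """Subtract a coverage-based term from raw attention logits.

    `coverage` holds the attention weight already spent on each spatial
    position. The fixed `penalty` is an assumption for this sketch; the
    actual module learns the refinement term.
    """
    refined = [l - penalty * c for l, c in zip(logits, coverage)]
    return softmax(refined)

# Toy example: 4 spatial positions, position 0 has the highest raw score.
raw_logits = [2.0, 1.5, 1.0, 0.5]
coverage = [0.0] * 4
history = []  # attention weight on position 0 at each step
for step in range(3):
    weights = refine_attention(raw_logits, coverage)
    history.append(weights[0])
    # Accumulate what was attended so later steps are penalised for it.
    coverage = [c + w for c, w in zip(coverage, weights)]

print([round(w, 3) for w in history])
```

In this toy run the share of attention on position 0 shrinks at every step, even though its raw score never changes.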

Files

File	Quant	Size	Notes
`posformer-hw-f32.gguf`	F32	24.9 MB	Full precision, reference quality
`posformer-hw-q8_0.gguf`	Q8_0	12 MB	No observed quality loss on test images
`posformer-hw-q4_k.gguf`	Q4_K	10 MB	No observed quality loss on test images
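
The sizes above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes roughly 6.5M parameters (from the model details) and the standard ggml block layouts: Q8_0 stores 32 weights in 34 bytes, and Q4_K stores 256 weights in 144 bytes. It estimates the size each file would have if every tensor used the named type and the file carried no metadata, which is a simplifying assumption. Reporting in MiB is also an assumption about how the listed sizes were measured.

```python
PARAMS = 6_500_000  # approximate parameter count from the model details

# (bytes per block, weights per block) for each ggml tensor type.
BLOCK_LAYOUT = {
    "F32": (4, 1),       # one 4-byte float per weight
    "Q8_0": (34, 32),    # 2-byte fp16 scale + 32 int8 values
    "Q4_K": (144, 256),  # 256-weight super-block
}

def estimated_mib(n_params, quant):
    """Size in MiB if every tensor were stored in the given type."""
    bytes_per_block, weights_per_block = BLOCK_LAYOUT[quant]
    return n_params * bytes_per_block / weights_per_block / (1024 * 1024)

for quant in BLOCK_LAYOUT:
    print(f"{quant}: {estimated_mib(PARAMS, quant):.1f} MiB")
```

The F32 estimate lands close to the listed 24.9 MB. The quantized files are larger than their all-quantized estimates, presumably because some tensors are kept at higher precision; that explanation is an assumption, not something the converter documents here.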

Accuracy (CROHME 2014 test set, 986 images)

Greedy left-to-right decoding (no beam search):

Model	Raw match	Parsed match
PosFormer F32	56.0%	61.4%
PosFormer Q8_0	~56%	~61%
PosFormer Q4_K	~56%	~61%
BTTR (baseline)	49.2%	49.8%
HMER	36.1%	36.3%

Note: the published PosFormer ExpRate of 62.7% uses bi-directional beam search (L2R + R2L, cross-scored). Our C++ port uses greedy L2R decoding only. The gap of about 6 to 7 pp between the F32 raw match (56.0%) and the published figure is expected from the lack of bi-directional scoring.
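
For readers unfamiliar with the term, greedy L2R decoding can be sketched as follows. This is a generic illustration, not the CrispEmbed code: `step_fn`, the token ids, and the toy decoder are all hypothetical stand-ins. At each step the highest-scoring token is taken and decoding stops at the end-of-sequence token, with no alternative hypotheses kept.

```python
def greedy_decode(step_fn, sos_id, eos_id, max_len=200):
    """Left-to-right greedy decoding.

    `step_fn(prefix)` is a hypothetical callback returning one score per
    vocabulary entry for the next token, given the tokens decoded so far.
    """
    tokens = [sos_id]
    for _ in range(max_len):
        scores = step_fn(tokens)
        next_id = max(range(len(scores)), key=lambda i: scores[i])
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the start token

# Toy stand-in for the decoder: emits tokens 3, 4, 5 and then EOS.
VOCAB_SIZE, SOS, EOS = 6, 0, 1
SCRIPT = [3, 4, 5, EOS]

def toy_step(prefix):
    target = SCRIPT[min(len(prefix) - 1, len(SCRIPT) - 1)]
    return [1.0 if i == target else 0.0 for i in range(VOCAB_SIZE)]

decoded = greedy_decode(toy_step, SOS, EOS)
print(decoded)
```

Beam search would instead keep several candidate prefixes per step, and the bi-directional variant additionally scores each candidate with a right-to-left pass.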

Usage with CrispEmbed

# Build
mkdir -p CrispEmbed-build && cd CrispEmbed-build
cmake /path/to/CrispEmbed
make -j$(nproc) test-posformer

# Run
export LD_LIBRARY_PATH=$PWD/ggml/src
./test-posformer posformer-hw-q8_0.gguf image.bmp
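
Before running the binary, it can be useful to confirm that a downloaded file is really a GGUF container. The helper below is a minimal sketch, not part of CrispEmbed; its name is made up for illustration. It assumes a little-endian file in GGUF version 2 or later, where the 4-byte magic is followed by a uint32 version and two uint64 counts.

```python
import struct

def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file: magic={magic!r}")
        # Little-endian: uint32 version, uint64 tensor count, uint64 KV count.
        header = f.read(20)
        if len(header) != 20:
            raise ValueError("truncated GGUF header")
        version, n_tensors, n_kv = struct.unpack("<IQQ", header)
    return version, n_tensors, n_kv
```

For example, `read_gguf_header("posformer-hw-q8_0.gguf")` returns the three header fields, or raises `ValueError` if the file is truncated or not GGUF.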

Parity verification

The C++ inference matches the PyTorch reference: cosine similarity is above 0.99999 (1.000000 when rounded to six decimals) at every decoder step, and the maximum absolute difference is below 0.00001. This was verified using per-layer intermediate dumps; see tests/parity/posformer_*.py in the CrispEmbed repo.
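
The two metrics quoted above are simple to compute. The sketch below is not taken from the parity scripts; the function names and sample values are invented for illustration, and only the thresholds come from the figures stated above.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def check_parity(reference, candidate, min_cosine=0.99999, max_diff=1e-5):
    """True when both parity criteria hold for one pair of vectors."""
    if len(reference) != len(candidate):
        raise ValueError("length mismatch")
    return (cosine_similarity(reference, candidate) >= min_cosine
            and max_abs_diff(reference, candidate) < max_diff)

# Made-up values standing in for one decoder step's outputs.
reference = [0.12, -1.30, 2.45, 0.07]
candidate = [x + 2e-6 for x in reference]
print(check_parity(reference, candidate))
```

Checking both metrics is a deliberate choice: cosine similarity ignores a uniform scaling error, while the maximum absolute difference catches it.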

Citation

@inproceedings{guan2024posformer,
  title={PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer},
  author={Guan, Tongkun and Lin, Chengyu and Shen, Wei and Yang, Xiaokang},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}
