PosFormer Handwritten Math OCR (GGUF)
PosFormer (Position-aware Transformer) for handwritten mathematical expression recognition, converted to GGUF format for use with CrispEmbed.
License: IMPORTANT
These weights are for academic/research use only.
Two license restrictions apply:
Model code: The original SJTU-DeepVisionLab/PosFormer states: "This code is only free for academic research purposes and licensed under the 2-clause BSD License." These GGUF weights are derived from their published checkpoint.
Training data: The model was trained on CROHME 2014, which is licensed CC BY-NC-SA 3.0 (non-commercial).
If you need weights for commercial use, you must retrain the PosFormer architecture on a dataset whose license permits commercial use. The C++ inference engine (CrispEmbed) and the GGUF converter are original implementations and carry no such restriction.
Model details
| Property | Value |
|---|---|
| Architecture | DenseNet encoder + 3-layer Transformer decoder + ARM |
| Parameters | 6.5M |
| Training data | CROHME 2014 (CC BY-NC-SA 3.0) |
| Vocabulary | 113 LaTeX tokens |
| Input | Grayscale handwritten math image |
| Output | LaTeX token sequence |
Attention Refinement Module (ARM)
PosFormer extends BTTR with an Attention Refinement Module that provides coverage-aware decoding. ARM uses accumulated cross-attention weights from previous decoder layers to prevent the model from repeatedly attending to the same spatial positions, reducing repetition errors in long expressions.
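The coverage idea can be illustrated with a minimal pure-Python sketch. This is not the PosFormer or CrispEmbed implementation: the function name `refine_attention`, the fixed scalar `penalty`, and the plain "subtract accumulated attention from the logits" rule are assumptions chosen for readability, whereas the real module learns its refinement term. Each loop iteration stands in for one earlier attention pass.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def refine_attention(logits, coverage, penalty=2.0):
    """Subtract a coverage penalty from raw attention logits, then normalize.

    logits   -- raw cross-attention scores, one per spatial position
    coverage -- attention mass each position has already received
    penalty  -- hypothetical fixed scale; a real module would learn this term
    """
    refined = [l - penalty * c for l, c in zip(logits, coverage)]
    return softmax(refined)

# Toy example: 4 spatial positions with identical raw scores on every pass.
logits = [2.0, 1.0, 0.5, 0.0]
coverage = [0.0] * len(logits)
first_pos = []  # attention given to position 0 on each pass
for _ in range(3):
    attn = refine_attention(logits, coverage)
    first_pos.append(attn[0])
    # Accumulate attention so later passes see what has been covered.
    coverage = [c + a for c, a in zip(coverage, attn)]
```

In this toy run the weight on position 0 shrinks from pass to pass even though its raw score never changes, which is the behavior that discourages repeated attention to the same region.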
Files
| File | Quant | Size | Notes |
|---|---|---|---|
| posformer-hw-f32.gguf | F32 | 24.9 MB | Full precision, reference quality |
| posformer-hw-q8_0.gguf | Q8_0 | 12 MB | Lossless on test images |
| posformer-hw-q4_k.gguf | Q4_K | 10 MB | Lossless on test images |
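As a rough sanity check on these sizes, the sketch below estimates the storage needed if every one of the 6.5M parameters were stored in a single format. The bits-per-weight figures follow from the ggml block layouts (Q8_0: 34 bytes per 32 weights; Q4_K: 144 bytes per 256 weights). The helper name `estimate_mib` and the assumption of uniform quantization with no metadata overhead are illustrative only.

```python
PARAMS = 6.5e6  # rounded parameter count from the model details table

# Bits per weight for each format, derived from the block layouts:
#   F32  -> 4 bytes per weight
#   Q8_0 -> 32 int8 weights + one fp16 scale = 34 bytes per 32 weights
#   Q4_K -> one 144-byte super-block per 256 weights
BITS_PER_WEIGHT = {
    "F32": 32.0,
    "Q8_0": 34 * 8 / 32,
    "Q4_K": 144 * 8 / 256,
}

def estimate_mib(params, bits_per_weight):
    """Size in MiB if all parameters used the same storage format."""
    return params * bits_per_weight / 8 / 2**20

estimates = {name: estimate_mib(PARAMS, bpw)
             for name, bpw in BITS_PER_WEIGHT.items()}

for name, mib in estimates.items():
    print(f"{name}: {mib:.1f} MiB")
```

The F32 estimate lands close to the listed 24.9 MB. The quantized files are larger than the uniform estimates, which would be consistent with some tensors being kept at higher precision, although this card does not state which tensors are quantized.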
Accuracy (CROHME 2014 test set, 986 images)
Greedy left-to-right decoding (no beam search):
| Model | Raw match | Parsed match |
|---|---|---|
| PosFormer F32 | 56.0% | 61.4% |
| PosFormer Q8_0 | ~56% | ~61% |
| PosFormer Q4_K | ~56% | ~61% |
| BTTR (baseline) | 49.2% | 49.8% |
| HMER | 36.1% | 36.3% |
Note: the published PosFormer ExpRate of 62.7% uses bi-directional beam search (L2R + R2L, cross-scored). Our C++ port uses greedy L2R decoding only. The gap of roughly 6 to 7 pp between our raw match rate (56.0%) and the published figure is expected from the lack of beam search and bi-directional scoring.
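For readers unfamiliar with the term, greedy left-to-right decoding can be sketched as follows. This is an illustrative loop, not the CrispEmbed code: `step_fn`, the token ids, and `max_len` are hypothetical stand-ins for the real decoder call and vocabulary.

```python
def greedy_decode(step_fn, sos_id, eos_id, max_len=200):
    """Left-to-right greedy decoding: keep only the best token at each step.

    step_fn -- hypothetical callable that takes the token ids produced so
               far and returns a list of scores, one per vocabulary entry
    """
    tokens = [sos_id]
    for _ in range(max_len):
        scores = step_fn(tokens)
        best = max(range(len(scores)), key=scores.__getitem__)
        if best == eos_id:
            break
        tokens.append(best)
    return tokens[1:]  # drop the start token

# Toy stand-in for the decoder: emits tokens 2, 3, 4 and then EOS (id 1).
def toy_step(tokens, vocab_size=5):
    script = [2, 3, 4, 1]
    target = script[min(len(tokens) - 1, len(script) - 1)]
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]

decoded = greedy_decode(toy_step, sos_id=0, eos_id=1)
```

Beam search differs by keeping several candidate sequences alive at each step instead of one, and the bi-directional variant additionally scores candidates with a right-to-left pass, so a single early mistake is less likely to be final.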
Usage with CrispEmbed
# Build
cd CrispEmbed-build
cmake /path/to/CrispEmbed
make -j$(nproc) test-posformer
# Run
export LD_LIBRARY_PATH=$PWD/ggml/src
./test-posformer posformer-hw-q8_0.gguf image.bmp
Parity verification
The C++ inference matches the PyTorch reference to >99.999% (cosine
similarity = 1.000000 at every decoder step, max absolute difference
< 0.00001). Verified using per-layer intermediate dumps; see
tests/parity/posformer_*.py in the CrispEmbed repo.
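The two metrics quoted above can be computed with a few lines of standard-library Python. The sketch below is not taken from the parity scripts: the function names, the thresholds passed as defaults, and the toy data are assumptions that simply mirror the figures in this section.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def max_abs_diff(a, b):
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))

def check_parity(reference, candidate, min_cos=0.99999, max_diff=1e-5):
    """Return True if two flat activation lists agree within both thresholds."""
    if len(reference) != len(candidate):
        raise ValueError("length mismatch")
    return (cosine_similarity(reference, candidate) >= min_cos
            and max_abs_diff(reference, candidate) < max_diff)

# Toy data standing in for one decoder step's outputs from each backend.
ref = [0.12, -1.50, 3.25, 0.00]
cpp = [x + 1e-7 for x in ref]
ok = check_parity(ref, cpp)
```

Checking both metrics is useful because cosine similarity alone ignores a uniform scale error, while the maximum absolute difference alone says little about overall direction for large activations.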
Citation
@inproceedings{guan2024posformer,
title={PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer},
author={Guan, Tongkun and others},
booktitle={ECCV},
year={2024}
}