SurweeshSP committed on
Commit 79c94e9 · verified · 1 Parent(s): d56d262

Update README.md

Files changed (1)
  1. README.md +48 -0
README.md CHANGED
@@ -68,6 +68,28 @@ thumbnail: assets/mathtok_architecture_improvements.svg
68
  # MathTok
69
 
70
  **A Hybrid Canonicalized AST-Based Tokenization Framework for Mathematical Language Modeling**
71
 
72
  ## Overview
73
 
@@ -94,6 +116,13 @@ Compressed Token Stream
94
  ```
95
 
96
  ---
97
  ## Installation
98
 
99
  Clone the repository and install the package in editable mode:
@@ -329,6 +358,18 @@ math_token/
329
 
330
  ---
331
 
332
  ## Citation
333
 
334
  ```bibtex
@@ -342,3 +383,10 @@ math_token/
342
  url = {https://huggingface.co/Surweesh/MathTok}
343
  }
344
  ```
 
68
  # MathTok
69
 
70
  **A Hybrid Canonicalized AST-Based Tokenization Framework for Mathematical Language Modeling**
71
+ ![Python](https://img.shields.io/badge/Python-3.10-blue)
72
+ ![License](https://img.shields.io/badge/License-MIT-green)
73
+ ![HuggingFace](https://img.shields.io/badge/HuggingFace-Live-yellow)
74
+ ![Tests](https://img.shields.io/badge/Tests-110%2B-success)
75
+ ![Research](https://img.shields.io/badge/Focus-Symbolic%20AI-purple)
76
+
77
+ ---
78
+
79
+ ## Why MathTok?
80
+
81
+ Traditional tokenizers such as BPE and SentencePiece treat mathematical
82
+ expressions as plain text sequences, fragmenting semantic structure and
83
+ discarding operator hierarchy.
84
+
85
+ MathTok introduces a structure-aware tokenization pipeline that:
86
+ - canonicalizes equivalent mathematical expressions,
87
+ - preserves AST hierarchy,
88
+ - encodes operator semantics explicitly,
89
+ - improves symbolic compression efficiency,
90
+ - and enables future tree-aware transformer architectures.
91
+
92
+ ---
93
 
94
  ## Overview
95
 
 
116
  ```
117
 
118
  ---
119
+
120
+ ## Architecture
121
+
122
+ ![MathTok Architecture](assets/mathtok_architecture_improvements.svg)
123
+
124
+ ---
125
+
126
  ## Installation
127
 
128
  Clone the repository and install the package in editable mode:
 
358
 
359
  ---
360
 
361
+ ## Future Work
362
+
363
+ - Tree-aware transformer attention integration
364
+ - Native mathematical pretraining corpus
365
+ - Symbolic reasoning benchmarks
366
+ - Neural theorem proving interfaces
367
+ - Equation graph embeddings
368
+ - Mathematical multimodal tokenization
369
+ - Integration with Lean/Coq theorem systems
370
+
371
+ ---
372
+
373
  ## Citation
374
 
375
  ```bibtex
 
383
  url = {https://huggingface.co/Surweesh/MathTok}
384
  }
385
  ```
386
+
387
+ ---
388
+
389
+ ## Links
390
+
391
+ - GitHub: https://github.com/SurweeshSP/mathtok
392
+ - Hugging Face: https://huggingface.co/Surweesh/MathTok
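
The "Why MathTok?" section added in this commit says the pipeline canonicalizes equivalent expressions, preserves AST hierarchy, and encodes operator semantics explicitly. The sketch below illustrates that idea in miniature. It is not MathTok's API: the function names (`to_tree`, `canonicalize`, `flatten`, `tokenize`) and the token labels (`OP:`, `VAR:`, `NUM:`) are hypothetical, and Python's standard `ast` module stands in for whatever parser the project actually uses.

```python
import ast

# Hypothetical operator labels; MathTok's real vocabulary may differ.
COMMUTATIVE = {ast.Add: "ADD", ast.Mult: "MUL"}
NON_COMMUTATIVE = {ast.Sub: "SUB", ast.Div: "DIV", ast.Pow: "POW"}


def to_tree(node):
    """Convert a Python AST node into a nested tuple (op, left, right) or a leaf string."""
    if isinstance(node, ast.Expression):
        return to_tree(node.body)
    if isinstance(node, ast.BinOp):
        op_type = type(node.op)
        name = COMMUTATIVE.get(op_type) or NON_COMMUTATIVE.get(op_type)
        if name is None:
            raise ValueError(f"unsupported operator: {op_type.__name__}")
        return (name, to_tree(node.left), to_tree(node.right))
    if isinstance(node, ast.Name):
        return f"VAR:{node.id}"
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return f"NUM:{node.value}"
    raise ValueError(f"unsupported node: {type(node).__name__}")


def canonicalize(tree):
    """Sort the operands of commutative operators so equivalent orderings coincide."""
    if isinstance(tree, str):
        return tree
    op, *children = tree
    children = [canonicalize(child) for child in children]
    if op in COMMUTATIVE.values():
        children.sort(key=repr)  # any fixed total order works; repr is the simplest
    return (op, *children)


def flatten(tree):
    """Emit tokens in prefix (pre-order) form, which keeps the hierarchy recoverable."""
    if isinstance(tree, str):
        return [tree]
    op, *children = tree
    tokens = [f"OP:{op}"]
    for child in children:
        tokens.extend(flatten(child))
    return tokens


def tokenize(expression):
    """Parse, canonicalize, and flatten an arithmetic expression string."""
    return flatten(canonicalize(to_tree(ast.parse(expression, mode="eval"))))


print(tokenize("x + 2*y"))
print(tokenize("y*2 + x"))
```

Both calls print the same token list, because the operands of `+` and `*` are reordered before flattening, while `x - y` and `y - x` stay distinct. This sketch only handles operand order: it does not flatten nested sums or products, so `(a + b) + c` and `a + (b + c)` still tokenize differently.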