---
language:
- en
license: mit
library_name: custom
tags:
- tokenizer
- symbolic-ai
- mathematics
- llm
- reasoning
- ast
- compiler
- nlp
- deep-learning
- machine-learning
- mathematical-reasoning
- symbolic-reasoning
- tokenization
- parser
- transformers
- artificial-intelligence
pipeline_tag: text-generation
datasets:
- custom-mathematical-dataset
metrics:
- semantic-density
- structural-efficiency
- symbolic-compression-ratio
model-index:
- name: MathTok
  results:
  - task:
      type: tokenization
      name: Mathematical Tokenization
    dataset:
      name: Custom Mathematical Benchmark
      type: symbolic-math
    metrics:
    - type: semantic-density
      value: Improved
      name: Semantic Density
    - type: structural-efficiency
      value: Optimized
      name: Structural Efficiency
    - type: symbolic-compression-ratio
      value: Enhanced
      name: SCR
co2_eq_emissions:
  emissions: 0
license_name: mit
pretty_name: MathTok
thumbnail: assets/mathtok_architecture_improvements.svg
---

# MathTok

**A Hybrid Canonicalized AST-Based Tokenization Framework for Mathematical Language Modeling**

![Python](https://img.shields.io/badge/Python-3.10-blue)
![License](https://img.shields.io/badge/License-MIT-green)
![HuggingFace](https://img.shields.io/badge/HuggingFace-Live-yellow)
![Tests](https://img.shields.io/badge/Tests-110%2B-success)
![Research](https://img.shields.io/badge/Focus-Symbolic%20AI-purple)

---

## Why MathTok?

Traditional tokenizers such as BPE and SentencePiece treat mathematical expressions as plain text sequences, fragmenting semantic structure and discarding operator hierarchy.

MathTok introduces a structure-aware tokenization pipeline that:

- canonicalizes equivalent mathematical expressions,
- preserves AST hierarchy,
- encodes operator semantics explicitly,
- improves symbolic compression efficiency,
- and enables future tree-aware transformer architectures.
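
The first two points can be made concrete with a toy sketch. It is not MathTok's implementation: it uses Python's standard `ast` module as a stand-in for the SymPy-based pipeline, and the names `tokenize`, `preorder`, `OP_NAMES`, and `COMMUTATIVE` are hypothetical. It orders the operands of commutative operators (a very reduced form of canonicalization) and then serializes the tree in DFS preorder, so two spellings of the same expression yield the same token stream.

```python
import ast

# Hypothetical token names, modeled on the OP_/VAR_/CONST_ style used in this README.
OP_NAMES = {
    ast.Add: "OP_ADD",
    ast.Sub: "OP_SUB",
    ast.Mult: "OP_MUL",
    ast.Div: "OP_DIV",
    ast.Pow: "OP_POW",
}
COMMUTATIVE = (ast.Add, ast.Mult)


def preorder(node):
    """Return the DFS-preorder token list for a parsed arithmetic expression."""
    if isinstance(node, ast.BinOp):
        left, right = preorder(node.left), preorder(node.right)
        if isinstance(node.op, COMMUTATIVE):
            # Toy canonicalization: put commutative operands in a fixed order.
            left, right = sorted([left, right])
        return [OP_NAMES[type(node.op)], *left, *right]
    if isinstance(node, ast.Name):
        return [f"VAR_{node.id.upper()}"]
    if isinstance(node, ast.Constant):
        return [f"CONST_{node.value}"]
    raise ValueError(f"unsupported node: {ast.dump(node)}")


def tokenize(source):
    """Parse Python-style math (`**` for powers) and serialize its tree."""
    return preorder(ast.parse(source, mode="eval").body)


print(tokenize("(x + 1)**2"))  # ['OP_POW', 'OP_ADD', 'CONST_1', 'VAR_X', 'CONST_2']
print(tokenize("(1 + x)**2"))  # same token stream as the line above
```

The operand order chosen here (plain lexicographic sorting of the child token lists) is arbitrary and need not match the order MathTok emits; the point is only that the operator appears before its operands and that equivalent inputs collapse to one sequence.
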

---

## Overview

MathTok is a research-grade tokenizer pipeline that converts raw mathematical expressions (LaTeX or ASCII) into a structured, semantically rich token stream. Unlike standard BPE or SentencePiece tokenizers, MathTok is *structure-aware*: it builds an Abstract Syntax Tree (AST) from each expression and serializes it via DFS preorder traversal, preserving full mathematical structure.

```
Raw Mathematical Expression
        ↓
Canonicalization Layer              (sympy: simplify, expand, normalize)
        ↓
Hybrid Mathematical Lexer           (split TEXT / MATH spans)
        ↓
AST Generator                       (SymPy tree → typed ASTNode tree)
        ↓
Operator-Aware Semantic Encoder     (rich metadata per operator)
        ↓
Structural Serialization            (DFS preorder → flat token stream)
        ↓
Structural Attention Metadata       (per-token tree context)
        ↓
Vocabulary Mapping + BPE            (fixed math vocab + HF BPE for text)
        ↓
Compressed Token Stream
```

---

## Architecture

![MathTok Architecture](assets/mathtok_architecture_improvements.svg)

---

## Installation

Clone the repository and install the package in editable mode:

```bash
git clone https://github.com/SurweeshSP/mathtok.git
cd mathtok
pip install -e ".[eval,dev]"
```

---

## Quick Start

### Tokenize a Mathematical Expression

Run the tokenizer pipeline directly from the command line:

```bash
python -m mathtok.pipeline "The derivative of sin(x^2) + 3x"
```

Example output:

```text
[
  FUNCTION_SIN,
  VARIABLE_x,
  POWER,
  NUMBER_2,
  OP_ADD,
  NUMBER_3,
  VARIABLE_x
]
```

---

## Running the Test Suite

Execute the comprehensive unit and integration test suite:

```bash
pytest tests/ -v
```

Current coverage includes:

- AST generation
- Canonicalization
- Lexer validation
- Pipeline integration
- Serialization consistency
- Structural comparison metrics

---

## Comparative Tokenizer Evaluation

Run the full benchmark evaluation pipeline:

```bash
python -m evaluation.comparison
```

This benchmark compares:

- MathTok (Hybrid AST Tokenizer)
- GPT-2 BPE
- SentencePiece Unigram
- Character-Level Tokenization

Evaluation metrics
include:

- Symbolic Compression Ratio (SCR)
- Semantic Density
- Structural Efficiency
- Token Fragmentation
- Sequence Compactness

---

## Visualization Dashboard

Generate benchmark plots and the unified evaluation dashboard:

```bash
python -m evaluation.visualize
```

Generated outputs include:

- Semantic Density Comparison
- SCR Comparison
- Structural Efficiency Comparison
- Token Count Analysis
- Unified Metrics Dashboard

All generated figures are stored in:

```text
evaluation/results/
```

---

## Repository Structure

```text
mathtok/
├── mathtok/            # Core tokenizer framework
├── evaluation/         # Benchmarking and evaluation
├── tests/              # Comprehensive test suite
├── assets/             # Architecture diagrams
├── README.md
├── setup.py
└── pyproject.toml
```

---

## Python API

```python
from mathtok import MathTokPipeline

pipeline = MathTokPipeline()

# Encode mixed text + math (supporting LaTeX or ASCII syntax)
out = pipeline.encode("The derivative of $\\sin(x^2)$ is $2x\\cos(x^2)$.")
print(out.tokens)     # ['[MATH_START]', 'FUNC_SIN', 'OP_POW', 'VAR_X', 'CONST_2', '[MATH_END]', ...]
print(out.sexp)       # (FUNC_SIN (OP_POW VAR_X CONST_2))
print(out.input_ids)  # [4, 27, 10, 45, 12, 5, ...]

# Access structural metadata (for tree-aware attention masking)
for meta in out.metadata:
    print(meta.token, meta.depth, meta.tree_position_key)

# Pure math expression serialization
out = pipeline.encode_math_only("(x+1)^2")
print(out.sexp)       # (OP_POW (OP_ADD VAR_X CONST_1) CONST_2)

# HuggingFace-compatible tokenizer export
hf_tok = pipeline.get_hf_tokenizer()
hf_tok.save_pretrained("./mathtok-tokenizer")
result = hf_tok("x^2 + 2*x + 1", return_tensors="pt")
```

---

## Research Contributions

### 1. Hybrid Lexer

Separates natural language from mathematical content using LaTeX delimiter detection (`$...$`, `\(...\)`, `\[...\]`) and ASCII math heuristics.

### 2. Canonicalization Engine

Normalizes mathematically equivalent expressions via SymPy's `simplify()`, `expand()`, and internal representation (subtraction → addition + negation, division → multiplication + reciprocal).

### 3. AST-Based Structural Serialization

Maps SymPy's expression tree to a typed token vocabulary with semantic metadata per operator. Serializes via DFS preorder traversal.

### 4. Operator Semantic Registry

Every operator and function carries an explicit metadata record: `arity`, `precedence`, `associativity`, `semantic_role`. This is the primary novelty over standard tokenization.

### 5. Structural Attention Metadata

Per-token records encode `depth`, `parent_id`, `children_ids`, `tree_position_key`, and `sibling_count`, which enables future structure-aware attention.

### 6. Two-Tier Vocabulary

- **Fixed math vocabulary**: deterministic IDs for all operators, functions, variables, constants.
- **BPE text vocabulary**: HuggingFace `tokenizers` BPE for natural language spans.

---

## Evaluation Metrics & Benchmarks

### Core Metrics

| Metric | Symbol | Meaning |
|--------|--------|---------|
| **Symbolic Compression Ratio** | SCR | `structural_score / token_count` (higher is better; measures the density of parsed semantic content) |
| **Semantic Density** | SD | `math_tokens / total_tokens` (share of math tokens in the stream; measures information density) |
| **Structural Efficiency** | SE | `parent_child_relations / token_count` (hierarchy relationships encoded per token) |
| **Token Stability** | TS | `1 - CoV(token count across rewritings)` (stability of the token count across equivalent representations) |

### Empirical Benchmarks (4-Way Comparison)

Below are the empirical averages computed over our comprehensive suite of 70 mathematical test expressions:

| Tokenizer | Mean SCR (↑ Better) | Semantic Density (↑ Better) | Structural Efficiency (↑ Better) |
|:---|:---:|:---:|:---:|
| **MathTok (Ours)** | **0.8501** | **0.5285** | **0.2339** |
| **GPT-2 BPE** | 0.4251 | 0.1838 | 0.1491 |
| **SentencePiece Unigram** | 0.3696 | 0.1499 | 0.1403 |
| **Character-Level** | 0.3708 | 0.1518 | 0.1518 |

> [!NOTE]
> * MathTok achieves a **2.30x higher mean SCR** than SentencePiece Unigram (**0.8501** vs **0.3696**).
> * MathTok's Semantic Density is about **3.5x** that of SentencePiece Unigram (**0.5285** vs **0.1499**).
> * MathTok is **1.67x more efficient** at encoding hierarchical AST relationships directly into token structures (**0.2339** vs **0.1403**).

### High-Impact Visualizations

The visualization system runs via `python -m evaluation.visualize` and writes its figures to `evaluation/results/`:

- **Unified Evaluation Dashboard** (`metrics_dashboard.png`): 3-panel side-by-side display of SCR, Semantic Density, and Structural Efficiency.
- **Overall SCR Comparison** (`scr_comparison.png`): Comparative summary bar chart.
- **Category-Level Breakdowns** (`scr_by_category.png`): SCR analyzed by nested/standard categories.
- **Semantic Density Summary** (`semantic_density_comparison.png`): Ratio of math structure to total tokens.
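
The formulas in the Core Metrics table can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in `evaluation/metrics.py`: the function names, the prefix-based test for math tokens, the convention of one parent-child relation per non-root math token, and the use of the population standard deviation for CoV are all hypothetical choices made for the example.

```python
from statistics import mean, pstdev

# Assumption: math tokens are recognized by the prefixes seen in the Python API example.
MATH_PREFIXES = ("FUNC_", "OP_", "VAR_", "CONST_")


def semantic_density(tokens):
    """SD = math_tokens / total_tokens."""
    math_tokens = sum(1 for tok in tokens if tok.startswith(MATH_PREFIXES))
    return math_tokens / len(tokens)


def structural_efficiency(parent_ids):
    """SE = parent_child_relations / token_count.

    parent_ids[i] is the index of token i's parent, or None for roots,
    delimiters, and text tokens (assumed convention).
    """
    relations = sum(1 for parent in parent_ids if parent is not None)
    return relations / len(parent_ids)


def token_stability(token_counts):
    """TS = 1 - CoV, with CoV = population std / mean (assumed definition)."""
    return 1 - pstdev(token_counts) / mean(token_counts)


# Hypothetical token stream for "The derivative of $\sin(x^2)$".
tokens = ["The", "derivative", "of", "[MATH_START]",
          "FUNC_SIN", "OP_POW", "VAR_X", "CONST_2", "[MATH_END]"]
# FUNC_SIN (index 4) is the root; OP_POW hangs off it, VAR_X and CONST_2 off OP_POW.
parent_ids = [None, None, None, None, None, 4, 5, 5, None]

print(round(semantic_density(tokens), 3))           # 4 math tokens / 9 tokens -> 0.444
print(round(structural_efficiency(parent_ids), 3))  # 3 relations / 9 tokens -> 0.333
print(token_stability([9, 9, 9]))                   # identical counts -> 1.0
```

SCR is left out because `structural_score` is not defined in this README.
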

---

## Project Structure

```
math_token/
├── mathtok/
│   ├── canonicalizer.py        # Layer 1: Canonicalization Engine
│   ├── lexer.py                # Layer 2: Hybrid Mathematical Lexer
│   ├── ast_generator.py        # Layer 3: AST Generator
│   ├── operator_registry.py    # Layer 4: Operator Semantic Registry
│   ├── serializer.py           # Layer 5: Structural Traversal & Serialization
│   ├── metadata.py             # Layer 6: Structural Attention Metadata
│   ├── vocabulary.py           # Layer 7: Two-Tier Vocabulary
│   └── pipeline.py             # Orchestrator Pipeline
├── evaluation/
│   ├── metrics.py              # Definition of core evaluation metrics
│   ├── benchmark.py            # Quick benchmarking scripts
│   ├── comparison.py           # Full 4-way comparative framework (SentencePiece integrated)
│   ├── visualize.py            # Custom dashboard visualization engine
│   └── results/                # JSON/JSONL reports & visual plots
└── tests/                      # 110+ passing unit tests
```

---

## Future Work

- Tree-aware transformer attention integration
- Native mathematical pretraining corpus
- Symbolic reasoning benchmarks
- Neural theorem proving interfaces
- Equation graph embeddings
- Mathematical multimodal tokenization
- Integration with Lean/Coq theorem systems

---

## Citation

```bibtex
@article{surweesh2026mathtok,
  title    = {MathTok: A Hybrid Canonicalized AST-Based Tokenization Framework for Mathematical Language Modeling and Symbolic Reasoning},
  author   = {Surweesh SP},
  year     = {2026},
  journal  = {Preprint},
  note     = {Open-source research framework available on GitHub and Hugging Face},
  keywords = {Mathematical Tokenization, Symbolic AI, Abstract Syntax Trees, LLMs, NLP, Mathematical Reasoning, Canonicalization},
  url      = {https://huggingface.co/Surweesh/MathTok}
}
```

---

## Links

- GitHub: https://github.com/SurweeshSP/mathtok
- Hugging Face: https://huggingface.co/Surweesh/MathTok