# MathTok

**A Hybrid Canonicalized AST-Based Tokenization Framework for Mathematical Language Modeling**

---

## Overview

MathTok is a research-grade tokenizer pipeline that converts raw mathematical expressions (LaTeX or ASCII) into a structured, semantically-rich token stream. Unlike standard BPE or SentencePiece tokenizers, MathTok is *structure-aware*: it builds an Abstract Syntax Tree (AST) from each expression and serializes it via DFS preorder traversal, preserving full mathematical structure.

```
Raw Mathematical Expression
        ↓
Canonicalization Layer             (sympy: simplify, expand, normalize)
        ↓
Hybrid Mathematical Lexer          (split TEXT / MATH spans)
        ↓
AST Generator                      (SymPy tree → typed ASTNode tree)
        ↓
Operator-Aware Semantic Encoder    (rich metadata per operator)
        ↓
Structural Serialization           (DFS preorder → flat token stream)
        ↓
Structural Attention Metadata      (per-token tree context)
        ↓
Vocabulary Mapping + BPE           (fixed math vocab + HF BPE for text)
        ↓
Compressed Token Stream
```

---

## Quick Start

```bash
# Install dependencies and package in editable mode
pip install -e ".[eval,dev]"

# Tokenize an expression using the CLI pipeline
python -m mathtok.pipeline "The derivative of sin(x^2) + 3x"

# Run the comprehensive 110+ test suite
pytest tests/ -v

# Run the 4-way comparative tokenizer evaluation benchmark
# (MathTok vs GPT-2 BPE vs SentencePiece Unigram vs Char-level)
python -m evaluation.comparison

# Generate visual plots and the unified metrics dashboard
python -m evaluation.visualize
```

---

## Python API

```python
from mathtok import MathTokPipeline

pipeline = MathTokPipeline()

# Encode mixed text + math (supporting LaTeX or ASCII syntax)
out = pipeline.encode("The derivative of $\\sin(x^2)$ is $2x\\cos(x^2)$.")
print(out.tokens)
# ['[MATH_START]', 'FUNC_SIN', 'OP_POW', 'VAR_X', 'CONST_2', '[MATH_END]', ...]
print(out.sexp)
# (FUNC_SIN (OP_POW VAR_X CONST_2))
print(out.input_ids)
# [4, 27, 10, 45, 12, 5, ...]

# Access structural metadata (for tree-aware attention masking)
for meta in out.metadata:
    print(meta.token, meta.depth, meta.tree_position_key)

# Pure math expression serialization
out = pipeline.encode_math_only("(x+1)^2")
print(out.sexp)
# (OP_POW (OP_ADD VAR_X CONST_1) CONST_2)

# HuggingFace-compatible tokenizer export
hf_tok = pipeline.get_hf_tokenizer()
hf_tok.save_pretrained("./mathtok-tokenizer")
result = hf_tok("x^2 + 2*x + 1", return_tensors="pt")
```

---

## Research Contributions

### 1. Hybrid Lexer
Separates natural language from mathematical content using LaTeX delimiter detection (`$...$`, `\(...\)`, `\[...\]`) and ASCII math heuristics.

### 2. Canonicalization Engine
Normalizes mathematically equivalent expressions via SymPy's `simplify()`, `expand()`, and internal representation (subtraction → addition + negation, division → multiplication + reciprocal).

### 3. AST-Based Structural Serialization
Maps SymPy's expression tree to a typed token vocabulary with semantic metadata per operator. Serializes via DFS preorder traversal.

### 4. Operator Semantic Registry
Every operator and function carries an explicit metadata record: `arity`, `precedence`, `associativity`, `semantic_role`. This is the primary novelty over standard tokenization.

### 5. Structural Attention Metadata
Per-token records encoding `depth`, `parent_id`, `children_ids`, `tree_position_key`, and `sibling_count`, enabling future structure-aware attention.

### 6. Two-Tier Vocabulary
- **Fixed math vocabulary**: deterministic IDs for all operators, functions, variables, constants.
- **BPE text vocabulary**: HuggingFace `tokenizers` BPE for natural language spans.
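
As a rough illustration of contributions 3 to 5, the sketch below serializes a small hand-built tree in DFS preorder and records per-token tree context. It is not MathTok's implementation: `OperatorInfo`, `Node`, `TokenMeta`, `serialize`, and `to_sexp` are hypothetical names, the registry values are invented for the example, and both the dotted child-index format of `tree_position_key` and the "number of other children of the same parent" reading of `sibling_count` are assumptions. Only the metadata field names and the token names come from this README.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class OperatorInfo:
    """Hypothetical registry record; field names follow contribution 4."""
    arity: int
    precedence: int      # illustrative values, larger binds tighter
    associativity: str   # "left", "right", or "none"
    semantic_role: str


# Illustrative entries only, not MathTok's real registry.
REGISTRY = {
    "OP_ADD": OperatorInfo(2, 10, "left", "additive"),
    "OP_POW": OperatorInfo(2, 30, "right", "exponentiation"),
    "FUNC_SIN": OperatorInfo(1, 40, "none", "trigonometric"),
}


@dataclass
class Node:
    token: str
    children: list = field(default_factory=list)


@dataclass
class TokenMeta:
    token: str
    depth: int
    parent_id: Optional[int]
    children_ids: list
    sibling_count: int       # assumed: other children of the same parent
    tree_position_key: str   # assumed format: dotted path of child indices


def serialize(root):
    """DFS preorder: emit a node first, then its children left to right."""
    out = []

    def visit(node, depth, parent_id, siblings, path):
        info = REGISTRY.get(node.token)
        if info is not None and info.arity != len(node.children):
            raise ValueError(f"{node.token} expects {info.arity} operands")
        my_id = len(out)  # token index doubles as node id
        out.append(TokenMeta(node.token, depth, parent_id, [], siblings, path))
        if parent_id is not None:
            out[parent_id].children_ids.append(my_id)
        for i, child in enumerate(node.children):
            visit(child, depth + 1, my_id, len(node.children) - 1, f"{path}.{i}")

    visit(root, 0, None, 0, "0")
    return out


def to_sexp(node):
    """Render the same preorder walk as a parenthesized S-expression."""
    if not node.children:
        return node.token
    parts = [node.token] + [to_sexp(c) for c in node.children]
    return "(" + " ".join(parts) + ")"


# Hand-built tree for (x+1)^2, the expression used in the Python API section.
tree = Node("OP_POW", [
    Node("OP_ADD", [Node("VAR_X"), Node("CONST_1")]),
    Node("CONST_2"),
])
meta = serialize(tree)
print([m.token for m in meta])
# ['OP_POW', 'OP_ADD', 'VAR_X', 'CONST_1', 'CONST_2']
print(to_sexp(tree))
# (OP_POW (OP_ADD VAR_X CONST_1) CONST_2)
```

Because each node is emitted before its children, a parent's index is always smaller than its children's indices, which is what lets `children_ids` be filled in during a single pass.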

---

## Evaluation Metrics & Benchmarks

### Core Metrics

| Metric | Symbol | Meaning |
|--------|--------|---------|
| **Semantic Compression Ratio** | SCR | `structural_score / token_count` (higher is better; measures the density of parsed semantic content) |
| **Semantic Density** | SD | `math_tokens / total_tokens` (share of tokens that carry math content) |
| **Structural Efficiency** | SE | `parent_child_relations / token_count` (hierarchy relationships encoded per token) |
| **Token Stability** | TS | `1 - CoV(token count across rewritings)` (stability of token count across equivalent rewritings of an expression) |

### Empirical Benchmarks (4-Way Comparison)

The averages below are computed over a suite of 70 mathematical test expressions:

| Tokenizer | Mean SCR (↑ Better) | Semantic Density (↑ Better) | Structural Efficiency (↑ Better) |
|:---|:---:|:---:|:---:|
| **MathTok (Ours)** | **0.8501** | **0.5285** | **0.2339** |
| **GPT-2 BPE** | 0.4251 | 0.1838 | 0.1491 |
| **SentencePiece Unigram** | 0.3696 | 0.1499 | 0.1403 |
| **Character-Level** | 0.3708 | 0.1518 | 0.1518 |

> [!NOTE]
> * MathTok achieves a **2.30x structural compression improvement** over SentencePiece Unigram in mean SCR (**0.8501** vs **0.3696**).
> * MathTok's Semantic Density is **3.52x** that of SentencePiece Unigram (**0.5285** vs **0.1499**), meaning a larger share of its token stream carries math content.
> * MathTok is **1.67x more efficient** than SentencePiece Unigram at encoding hierarchical AST relationships directly into token structures (**0.2339** vs **0.1403**).

### High-Impact Visualizations

Running `python -m evaluation.visualize` exports the following plots under [`evaluation/results/`](evaluation/results/):

- **Unified Evaluation Dashboard** (`metrics_dashboard.png`): 3-panel side-by-side display of SCR, Semantic Density, and Structural Efficiency.
- **Overall SCR Comparison** (`scr_comparison.png`): Comparative summary bar chart.
- **Category-Level Breakdowns** (`scr_by_category.png`): SCR analyzed by nested/standard categories.
- **Semantic Density Summary** (`semantic_density_comparison.png`): Ratio of math structure to total tokens.

---

## Project Structure

```
math_token/
├── mathtok/
│   ├── canonicalizer.py        # Layer 1: Canonicalization Engine
│   ├── lexer.py                # Layer 2: Hybrid Mathematical Lexer
│   ├── ast_generator.py        # Layer 3: AST Generator
│   ├── operator_registry.py    # Layer 4: Operator Semantic Registry
│   ├── serializer.py           # Layer 5: Structural Traversal & Serialization
│   ├── metadata.py             # Layer 6: Structural Attention Metadata
│   ├── vocabulary.py           # Layer 7: Two-Tier Vocabulary
│   └── pipeline.py             # Orchestrator Pipeline
├── evaluation/
│   ├── metrics.py              # Definition of core evaluation metrics
│   ├── benchmark.py            # Quick benchmarking scripts
│   ├── comparison.py           # Full 4-way comparative framework (SentencePiece integrated)
│   ├── visualize.py            # Custom dashboard visualization engine
│   └── results/                # JSON/JSONL reports & visual plots
└── tests/                      # 110+ passing unit tests
```

---

## Citation

```bibtex
@article{mathtok2024,
  title  = {MathTok: A Hybrid Canonicalized AST-Based Tokenization Framework for Mathematical Language Modeling},
  author = {Anonymous},
  year   = {2024},
  note   = {Under review}
}
```