---
license: mit
library_name: transformers
language:
- en
tags:
- file-type-detection
- mime-classification
- binary-content
- binary-analysis
- position-agnostic
- libmagic
- forensics
- packet-inspection
- bpe
- byte-pair-encoding
- mimelens
base_model: mjbommar/binary-tokenizer-001-16k
pipeline_tag: text-classification
model-index:
- name: mimelens-001-medium-bpe-16k-s1
  results:
  - task:
      type: feature-extraction
      name: MIME-125 classification (libmagic 125-class taxonomy)
    dataset:
      name: magic-frags (4 KB head of 64 KB random chunks, n=4,096)
      type: custom
    metrics:
    - name: top-1 accuracy
      type: accuracy
      value: 0.7988
    - name: macro-F1
      type: f1
      value: 0.6375
    - name: kNN R@1
      type: recall@1
      value: 0.6986
    source:
      name: "MimeLens paper (Bommarito 2026), Appendix A"
      url: https://github.com/mjbommar/mimelens-training
---

# mimelens-001-medium-bpe-16k-s1

A 37.76M-backbone-parameter BERT-style encoder for position-agnostic file-content-type detection from binary data. It reads a byte window taken from *any* offset in a file (the first ~1,022 tokens of whatever you pass) and produces a 512-dimensional embedding that classifiers map to one of [libmagic](https://github.com/file/file)'s 125 MIME labels. It is designed for inputs where you only have a chunk: a forensic-carved fragment, a random disk-block read, a streaming HTTP upload, a single network packet payload.

- **🔗 Model**: [`mjbommar/mimelens-001-medium-bpe-16k-s1`](https://huggingface.co/mjbommar/mimelens-001-medium-bpe-16k-s1)
- **👥 Family**: [`mjbommar/mimelens-001`](https://huggingface.co/mjbommar/mimelens-001) (36 released cells: 28 parent + 8 short-sequence)
- **🔤 Tokenizer**: [`mjbommar/binary-tokenizer-001-16k`](https://huggingface.co/mjbommar/binary-tokenizer-001-16k)
- **📄 Paper**: *MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments* (Bommarito 2026)
- **💻 Training code**: [`mjbommar/mimelens-training`](https://github.com/mjbommar/mimelens-training)
- **📊 Pretraining corpus**: [`mjbommar/binary-30k-tokenized`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized) plus magic-corpus extracts, packed binaries, a [`glaurung`](https://github.com/mjbommar/glaurung)-sourced binary corpus, and Windows drivers (33 GB stratified; the full corpus is not redistributable)

---

## What MimeLens does

MimeLens classifies file content type from a byte window taken at any offset, not just the header of a complete file.

Existing tools assume whole-file access at a known offset:

- [`libmagic`](https://github.com/file/file) and [Apache Tika](https://tika.apache.org/) match handcrafted magic-byte signatures, almost always anchored at the file head.
- [Magika](https://github.com/google/magika) (Google) is a small (~1M-parameter) feedforward network over three 512-byte windows (head, middle, tail) of a known-bounded file.
- TrID and PRONOM/Siegfried/DROID similarly require a complete file.

These assumptions break down on a fragment. MimeLens is pretrained MLM-only on 1024-token windows sampled *uniformly at random* across files and 64 KB fragments, with no privileged head-of-file position. One checkpoint handles streaming, partial-arrival, mid-file, packet-payload, and forensic-carved inputs uniformly. The trade-off is CPU latency (roughly two orders of magnitude slower than Magika at the medium size; hardware-dependent) in exchange for libmagic's 125-class taxonomy and support for arbitrary positions.

The family ships 28 parent cells (3 sizes × 4 vocabs × 2-3 seeds at `seq_len=1024`) plus an 8-cell short-sequence extension (medium tier × 4 vocabs × 2 seeds at `seq_len=256`). This README documents one of them.

> **Short-sequence sibling available.** If your inputs are sub-KB (DNS payloads, sub-MTU packets, small forensic fragments), use `mjbommar/mimelens-001-medium-bpe-16k-s1-seq256` instead. It has the same architecture, 4× shorter context, ~5× lower CPU latency, and BPE-cell accuracy that ties or beats this cell on the magic-files probe fit. See paper Appendix B.5.

> **ONNX bundled.** This cell ships `onnx/model_fp32.onnx` + `onnx/model_int8.onnx` (dynamic int8 of MatMul/Gemm) for direct ONNX Runtime inference. See `onnx/README.md` in this repo for input/output shapes and the latency profile.

---

## Overview

- **This cell**: `medium` tier, `bpe-16k` input pipeline, seed `1`
- **Backbone**: 37.76M parameters (12 layers, hidden 512, 8 attention heads, head dim 64, RoPE, RMSNorm, no biases, no dropout)
- **Input vocabulary**: `bpe-16k`, a 16,384-entry binary BPE tokenizer (`binary-tokenizer-001-16k`) averaging ~1.73 bytes/token. The model reads ~1,765 bytes of a 4 KB buffer.
- **Output**: 512-dim mean-pooled body-token embedding
- **Label space**: [libmagic](https://github.com/file/file) 125-class MIME taxonomy (full list in paper Appendix)
- **Pretraining**: MLM-only, 30% mask ratio, 33 GB stratified multi-source binary corpus, 22,888 gradient updates, single RTX 4060 Ti, ~18.0 h wall-clock
- **License**: MIT
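
As a rough sanity check on the input figures above, the byte span the model sees is approximately the token budget times the mean bytes per token. The sketch below assumes the ~1,022-token budget and the ~1.73 bytes/token average quoted in this card; `approx_bytes_read` is a hypothetical helper, not part of the released code.

```python
def approx_bytes_read(token_budget: int, bytes_per_token: float, buffer_size: int) -> int:
    """Estimate how many bytes of a buffer fit in a fixed token budget."""
    estimate = round(token_budget * bytes_per_token)
    # The model can never read past the end of the buffer it was given.
    return min(estimate, buffer_size)


# Figures quoted in this card: ~1,022 tokens, ~1.73 bytes/token, 4 KB buffer.
print(approx_bytes_read(1022, 1.73, 4096))
# A 1 KB buffer fits inside the budget, so it is consumed in full.
print(approx_bytes_read(1022, 1.73, 1024))
```

With the quoted averages this lands close to the ~1,765-byte figure above. Actual coverage varies with how well the tokenizer compresses the content at hand.
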

## Headline benchmarks (this cell)

| Benchmark | Value |
|---|---|
| MIME-125 top-1 (magic-frags, 4 KB head, n=4,096) | **0.799** |
| MIME-125 macro-F1 (magic-frags, 4 KB head) | 0.637 |
| kNN R@1 (magic-frags, 3,147-file gallery / 949 queries) | 0.699 |
| Δ top-1 under zero-first-16-byte header perturbation | −0.102 |
| Δ top-1 under zero-first-64-byte header perturbation | −0.130 |
| **Magika v1.1 calibration: strict top-1** (n=1,024) | **0.828** (vs Magika 0.653, +17.5 pp) |
| Magika v1.1 calibration: aligned top-1 (21-class equiv map) | 0.829 (vs Magika 0.722, +10.7 pp) |
| Magika v1.1 calibration: top-level top-1 | 0.927 (vs Magika 0.840, +8.7 pp) |
| Real captured UDP traffic: top-1 from one 1.4 KB packet | 0.809 |
| Real captured UDP traffic: top-1 from the entire stream | 0.821 |
| CPU latency (single sample, p50, Intel i9-12900K): PyTorch fp32 | 202 ms |
| CPU latency (single sample, p50, Intel i9-12900K): ONNX int8 | 382 ms |
| CPU latency (single sample, p50, Intel i9-12900K): Magika v1.1 | 1.3 ms (~155× faster than PyTorch fp32; hardware-dependent) |

Full evaluation (within-cube bootstrap CIs, adversarial sweep, calibration, real-network curves, disk-block matrix, baselines against libmagic 5.46 and TrID 2.24) is in the [paper](https://github.com/mjbommar/mimelens-training).
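
The header-perturbation rows above can be approximated with a small helper that zeroes the first n bytes of each window and re-scores it. This is a minimal sketch, not the paper's evaluation harness: `zero_header`, `top1_delta`, and the `predict` callable (bytes in, label out) are hypothetical names introduced for illustration.

```python
from typing import Callable, Sequence


def zero_header(window: bytes, n: int) -> bytes:
    """Return a copy of `window` with its first `n` bytes replaced by zeros."""
    n = max(0, min(n, len(window)))
    return bytes(n) + window[n:]


def top1_delta(
    predict: Callable[[bytes], str],
    windows: Sequence[bytes],
    labels: Sequence[str],
    n: int,
) -> float:
    """Top-1 accuracy on zeroed-header windows minus accuracy on clean windows."""
    if not windows:
        raise ValueError("need at least one window")
    clean = sum(predict(w) == y for w, y in zip(windows, labels))
    perturbed = sum(predict(zero_header(w, n)) == y for w, y in zip(windows, labels))
    return (perturbed - clean) / len(windows)
```

A negative return value means accuracy dropped once the header was zeroed, which is the sign convention used in the Δ rows of the table.
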

---

## Quick start

This cell ships a 125-class libmagic-MIME classifier head (the paper's LR probe, re-fit on the full magic-files corpus), so `pipeline("text-classification", ...)` works out of the box:

```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="mjbommar/mimelens-001-medium-bpe-16k-s1",
    trust_remote_code=True,
    top_k=5,
)

# The model reads the first ~1,022 tokens of whatever you pass (a prefix of the
# buffer, not the whole window). For whole-file triage, a short head window
# classifies magic-byte / compressed types better than a long one -- see
# "Choosing a window" below.
with open("path/to/file", "rb") as f:
    window = f.read(4096)

preds = clf(window.decode("latin-1"))  # latin-1 is a bijection over bytes
# preds[0] is the list of {label, score} sorted by score:
# [{"label": "image/png", "score": 0.97}, {"label": "image/jpeg", "score": 0.01}, ...]
```
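
The `latin-1` decode in the snippet above is lossless because that codec maps each of the 256 byte values to the code point with the same number. A quick check, independent of the model:

```python
# Every possible byte value, in order.
all_bytes = bytes(range(256))

# Decoding with latin-1 yields one character per byte, with matching code points.
text = all_bytes.decode("latin-1")
assert len(text) == 256
assert [ord(ch) for ch in text] == list(range(256))

# Encoding back recovers the original bytes exactly.
assert text.encode("latin-1") == all_bytes

# UTF-8, by contrast, rejects arbitrary binary data.
try:
    all_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```

This is why the examples here never decode binary input as UTF-8: doing so would either fail or require lossy error handling.
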

To work with embeddings directly (fit a probe, kNN over a gallery, fine-tune a head):

```python
import torch
from transformers import AutoModel, AutoTokenizer

repo = "mjbommar/mimelens-001-medium-bpe-16k-s1"
model = AutoModel.from_pretrained(repo, trust_remote_code=True).eval()
tok = AutoTokenizer.from_pretrained(repo)

with open("path/to/file", "rb") as f:
    window = f.read(4096)

inputs = tok(window.decode("latin-1"), max_length=1024, truncation=True,
             padding="max_length", return_tensors="pt")
with torch.no_grad():
    embedding = model(**inputs).pooler_output  # (1, 512)
```
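
For the kNN use case, one option is cosine-similarity retrieval over a gallery of stored embeddings. The sketch below uses plain Python lists so it runs without the model; in practice the vectors would be 512-dim `pooler_output` rows. `cosine` and `nearest_label` are hypothetical helpers, and cosine similarity is an assumption, since this card does not state which metric the paper's kNN R@1 uses.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_label(query: Vector, gallery: Sequence[Vector], labels: Sequence[str]) -> str:
    """Return the label of the gallery vector most similar to `query` (1-NN)."""
    if not gallery or len(gallery) != len(labels):
        raise ValueError("gallery and labels must be non-empty and the same length")
    best = max(range(len(gallery)), key=lambda i: cosine(query, gallery[i]))
    return labels[best]


# Toy 3-dim vectors standing in for real 512-dim embeddings.
gallery = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
labels = ["image/png", "application/zip"]
print(nearest_label([0.9, 0.1, 0.0], gallery, labels))
```

For a gallery of thousands of files, a vectorised matrix product over normalised embeddings would be the more practical route; the loop above only illustrates the lookup.
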

---

## Choosing a window

The model reads the first ~1,022 tokens of whatever you pass: a prefix of the buffer (for this BPE cell, whatever tokenizes to ~1,022 tokens, typically the first ~1.5-2.5 KB), not the whole window.

- **Magic-byte / compressed types** (PNG, ZIP, GZIP, JPEG): a **short head window (256 B to 1 KB) classifies better than 4 KB**. A long high-entropy body dilutes the header signal within the fixed token budget, and the model returns `application/octet-stream` on a mostly-opaque window. That is correct behaviour for genuinely high-entropy input, not a bug.
- **Fragments / packets**: you cannot choose the offset, so pass what you have. This is the regime MimeLens is built for.
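
When you do control the offset (whole-file triage), one way to act on the guidance above is to try a few head-window sizes and keep the most confident prediction that is not the opaque fallback. This is a hypothetical heuristic, not something the paper evaluates: `best_head_prediction`, the window sizes, and the `classify` callable (bytes in, list of `{"label", "score"}` dicts sorted by score out) are all assumptions for illustration.

```python
from typing import Callable, Dict, List, Sequence

Prediction = Dict[str, object]  # {"label": str, "score": float}
FALLBACK = "application/octet-stream"


def best_head_prediction(
    data: bytes,
    classify: Callable[[bytes], List[Prediction]],
    sizes: Sequence[int] = (256, 1024, 4096),
) -> Prediction:
    """Classify several head windows and keep the most confident specific label."""
    candidates = []
    for size in sizes:
        ranked = classify(data[:size])
        if ranked:
            candidates.append(ranked[0])  # top-1 for this window size
    if not candidates:
        raise ValueError("classifier returned no predictions")
    # Prefer any specific MIME type over the opaque fallback, then higher score.
    return max(candidates, key=lambda p: (p["label"] != FALLBACK, p["score"]))
```

With the pipeline from Quick start, `classify` could be `lambda b: clf(b.decode("latin-1"))[0]`, assuming the output shape described there. Each extra window size costs one more forward pass, so this trades latency for robustness on compressed formats.
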

---

## Recommended deployment regimes

- **libmagic-taxonomy (125-class) classification from a clean 4 KB chunk**: headline cell of the paper.
- General-purpose deployment when one cell must serve mixed content (image + text + binary).

---

## Training

This cell is one point of the 3 × 4 × {2,3} factorial cube described in the paper.

- **Corpus** (33 GB, stratified multi-source): [`binary-30k`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized) (assorted ELF/PE/Mach-O), magic-frags (random 64 KB chunks across libmagic's full corpus), assorted packed/raw binaries, a [`glaurung`](https://github.com/mjbommar/glaurung)-sourced binary corpus, Windows drivers.
- **Position-arbitrary windowing**: 1024-token windows sampled uniformly at random across files and 64 KB fragments. **No privileged "head of file" position.** This is the design choice that makes MimeLens work on streaming / partial / random-offset inputs.
- **Objective**: MLM with 30% mask ratio (BERT replacement schedule: 80% `[MASK]`, 10% random, 10% original); tied input/output embeddings.
- **Pooling**: mean-pool over body tokens for downstream tasks. The BERT-style `cls_pool` linear projection is *not* used: under MLM-only training it receives no gradient and remains byte-identical to its random initialisation across all 28 cube cells (paper §3.4 verifies this; left in the saved weights for architectural completeness only).
- **Optimisation**: AdamW + cosine LR (peak 5e-4, 2,000-step warmup, 10% floor), bf16 mixed precision, gradient clipping at $\|g\|_2 \leq 1$, effective batch 128 at sequence length 1024, 22,888 gradient updates.
- **Hardware**: single RTX 4060 Ti (16 GB), ~18.0 h wall-clock for this cell.
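
The windowing and masking steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the training code: `sample_window` and `mask_tokens` are hypothetical names, the mask id and vocabulary size are placeholders, each position is selected independently with probability 0.3, and special tokens are ignored.

```python
import random
from typing import List, Sequence, Tuple


def sample_window(tokens: Sequence[int], length: int, rng: random.Random) -> List[int]:
    """Pick a window of up to `length` tokens at a uniformly random start offset."""
    if len(tokens) <= length:
        return list(tokens)
    start = rng.randrange(len(tokens) - length + 1)
    return list(tokens[start:start + length])


def mask_tokens(
    tokens: Sequence[int],
    mask_id: int,
    vocab_size: int,
    rng: random.Random,
    mask_ratio: float = 0.30,
) -> Tuple[List[int], List[int]]:
    """Apply the 80/10/10 BERT replacement schedule to randomly selected positions.

    Returns (inputs, targets); targets hold the original id at selected
    positions and -100 elsewhere (a common ignore index for the loss).
    """
    inputs, targets = list(tokens), [-100] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_ratio:
            continue  # position not selected for prediction
        targets[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, targets


rng = random.Random(0)
window = sample_window(list(range(5000)), 1024, rng)  # placeholder token stream
inputs, targets = mask_tokens(window, mask_id=16383, vocab_size=16384, rng=rng)
```

Because the start offset is uniform over the token stream, the head of a file is no more likely to appear in a window than any other region, which is the property the position-arbitrary bullet describes.
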

---

## Caveats

- This is one cell of a 28-cell parent cube (36 released cells including the 8-cell short-sequence extension). Within-cube comparisons in the paper carry bootstrap CIs at n=3 seeds; some marginal orderings (byte vs bpe-16k at the largest size) are within seed noise and should be read as ties.
- The training corpus is one 33 GB stratified multi-source binary sample. Results may not transfer to substantially different corpora.
- All numbers are computed on data labelled by a single pipeline (libmagic-pinned). The lack of cross-validation against PRONOM, Siegfried, DROID, or IANA reference files is a documented limitation.
- CPU latency at the `medium` size is ~155× slower than Magika v1.1 on a desktop CPU (hardware-dependent). For sub-millisecond whole-file triage on broad categories, Magika is purpose-built and is the right tool. MimeLens occupies a different point on the deployment surface (position-arbitrary inputs + libmagic's 125-class taxonomy) and is not a drop-in replacement.
- End-to-end fine-tuning on the production label distribution may shift these numbers and should be evaluated before deployment. The frozen-probe numbers above are not claimed as a lower bound on fine-tuned performance.

---

## Citation

```bibtex
@misc{bommarito2026mimelens,
  title  = {MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments},
  author = {Bommarito II, Michael J.},
  year   = {2026},
  note   = {https://github.com/mjbommar/mimelens-training},
}
```