---
license: mit
library_name: transformers
language:
- en
tags:
- file-type-detection
- mime-classification
- binary-content
- binary-analysis
- position-agnostic
- libmagic
- forensics
- packet-inspection
- bpe
- byte-pair-encoding
- mimelens
base_model: mjbommar/binary-tokenizer-001-16k
pipeline_tag: text-classification
model-index:
- name: mimelens-001-medium-bpe-16k-s1
  results:
  - task:
      type: feature-extraction
      name: MIME-125 classification (libmagic 125-class taxonomy)
    dataset:
      name: magic-frags (4 KB head of 64 KB random chunks, n=4,096)
      type: custom
    metrics:
    - name: top-1 accuracy
      type: accuracy
      value: 0.7988
    - name: macro-F1
      type: f1
      value: 0.6375
    - name: kNN R@1
      type: recall@1
      value: 0.6986
    source:
      name: "MimeLens paper (Bommarito 2026), Appendix A"
      url: https://github.com/mjbommar/mimelens-training
---
# mimelens-001-medium-bpe-16k-s1
A 37.76M-backbone-parameter BERT-style encoder for position-agnostic file-content-type detection from binary data. It reads a byte window taken from *any* offset in a file (the first ~1,022 tokens of whatever you pass) and produces a 512-dimensional embedding that classifiers map to one of [libmagic](https://github.com/file/file)'s 125 MIME labels. Designed for inputs where you only have a chunk: a forensic-carved fragment, a random disk-block read, a streaming HTTP upload, a single network packet payload.
- **Model**: [`mjbommar/mimelens-001-medium-bpe-16k-s1`](https://huggingface.co/mjbommar/mimelens-001-medium-bpe-16k-s1)
- **Family**: [`mjbommar/mimelens-001`](https://huggingface.co/mjbommar/mimelens-001) (36 released cells: 28 parent + 8 short-sequence)
- **Tokenizer**: [`mjbommar/binary-tokenizer-001-16k`](https://huggingface.co/mjbommar/binary-tokenizer-001-16k)
- **Paper**: *MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments* (Bommarito 2026)
- **Training code**: [`mjbommar/mimelens-training`](https://github.com/mjbommar/mimelens-training)
- **Pretraining corpus**: [`mjbommar/binary-30k-tokenized`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized) plus magic-corpus extracts, packed binaries, a [`glaurung`](https://github.com/mjbommar/glaurung)-sourced binary corpus, and Windows drivers (33 GB stratified; the full corpus is not redistributable)
---
## What MimeLens does
MimeLens classifies file content type from a byte window taken at any offset, not just the header of a complete file.
Existing tools assume whole-file access at a known offset:
- [`libmagic`](https://github.com/file/file) and [Apache Tika](https://tika.apache.org/) match handcrafted magic-byte signatures, almost always anchored at the file head.
- [Magika](https://github.com/google/magika) (Google) is a small (~1 M-parameter) feedforward network over three 512-byte windows (head, middle, tail) of a known-bounded file.
- TrID and PRONOM/Siegfried/DROID similarly require a complete file.
These break down on a fragment. MimeLens is pretrained MLM-only on 1024-token windows sampled *uniformly at random* across files and 64 KB fragments, with no privileged head-of-file position. One checkpoint handles streaming, partial-arrival, mid-file, packet-payload, and forensic-carved inputs uniformly. The trade-off is CPU latency (roughly two orders of magnitude slower than Magika at the medium size; hardware-dependent) in exchange for libmagic's 125-class taxonomy plus position arbitrariness.
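To make the position-arbitrary regime concrete, here is a minimal sketch of drawing a window from a uniformly random offset. It works in bytes for simplicity, whereas the pretraining described above samples 1024-token windows; the function name `sample_window` and its parameters are hypothetical and not taken from the training code.

```python
import random


def sample_window(data: bytes, window_bytes: int, rng: random.Random) -> tuple[int, bytes]:
    """Return (offset, window) with the offset drawn uniformly at random.

    Hypothetical helper: byte-based, unlike the token-based windows used
    in pretraining. Inputs shorter than the window are returned whole.
    """
    if window_bytes <= 0:
        raise ValueError("window_bytes must be positive")
    if len(data) <= window_bytes:
        return 0, data
    # Every start position that leaves room for a full window is equally likely,
    # so the file head gets no special treatment.
    offset = rng.randrange(len(data) - window_bytes + 1)
    return offset, data[offset:offset + window_bytes]


# Synthetic 16 KB buffer standing in for a file or a 64 KB fragment.
data = bytes(range(256)) * 64
offset, window = sample_window(data, 4096, random.Random(0))
print(offset, len(window))
```

The resulting window can be passed to the classifier exactly like a head-of-file read.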
The family ships 28 parent cells (3 sizes × 4 vocabs × 2-3 seeds at seq\_len=1024) plus an 8-cell short-sequence extension (medium tier × 4 vocabs × 2 seeds at seq\_len=256). This README documents one of them.
> **Short-sequence sibling available.** If your inputs are sub-KB (DNS payloads, sub-MTU packets, small forensic fragments), use `mjbommar/mimelens-001-medium-bpe-16k-s1-seq256` instead. Same architecture, 4× shorter context, ~5× lower CPU latency, BPE-cell accuracy ties or beats this cell on the magic-files probe-fit. See paper Appendix B.5.
> **ONNX bundled.** This cell ships `onnx/model_fp32.onnx` + `onnx/model_int8.onnx` (dynamic int8 of MatMul/Gemm) for direct ONNX Runtime inference. See `onnx/README.md` in this repo for input/output shapes and the latency profile.
---
## Overview
- **This cell**: `medium` tier, `bpe-16k` input pipeline, seed `1`
- **Backbone**: 37.76M parameters (12 layers, hidden 512, 8 attention heads, head dim 64, RoPE, RMSNorm, no biases, no dropout)
- **Input vocabulary**: `bpe-16k`. 16,384-entry binary BPE tokenizer (binary-tokenizer-001-16k), ~1.73 bytes/token. Reads ~1,765 bytes of the 4 KB buffer.
- **Output**: 512-dim mean-pooled body-token embedding
- **Label space**: [libmagic](https://github.com/file/file) 125-class MIME taxonomy (full list in paper Appendix)
- **Pretraining**: MLM-only, 30% mask ratio, 33 GB stratified multi-source binary corpus, 22,888 gradient updates, single RTX 4060 Ti, ~18.0 h wall-clock
- **License**: MIT
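The mean-pooled output can be illustrated with a small pure-Python sketch. It assumes that "body tokens" means positions that are neither padding nor special tokens; that reading, and the helper name `mean_pool`, are assumptions rather than the model's actual code.

```python
def mean_pool(hidden: list[list[float]], body_mask: list[int]) -> list[float]:
    """Average the hidden vectors at positions where body_mask is 1.

    Hypothetical helper: `hidden` is one sequence of per-token vectors,
    `body_mask` marks the tokens assumed to count as body tokens.
    """
    dim = len(hidden[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden, body_mask):
        if keep:
            count += 1
            for j, value in enumerate(vec):
                total[j] += value
    if count == 0:
        raise ValueError("no body tokens to pool over")
    return [t / count for t in total]


# Toy example: 4 positions, 2-dim hidden states; first and last are masked out.
hidden = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
body_mask = [0, 1, 1, 0]
pooled = mean_pool(hidden, body_mask)
print(pooled)  # [2.0, 3.0]
```

In the real model the vectors are 512-dimensional, so the pooled result has the same shape as the embedding described above.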
## Headline benchmarks (this cell)
| Benchmark | Value |
|---|---|
| MIME-125 top-1 (magic-frags, 4 KB head, n=4,096) | **0.799** |
| MIME-125 macro-F1 (magic-frags, 4 KB head) | 0.637 |
| kNN R@1 (magic-frags, 3,147-file gallery / 949 queries) | 0.699 |
| Δ top-1 under zero-first-16-byte header perturbation | −0.102 |
| Δ top-1 under zero-first-64-byte header perturbation | −0.130 |
| **Magika v1.1 calibration: strict top-1** (n=1,024) | **0.828** (vs Magika 0.653, +17.5 pp) |
| Magika v1.1 calibration: aligned top-1 (21-class equiv map) | 0.829 (vs Magika 0.722, +10.7 pp) |
| Magika v1.1 calibration: top-level top-1 | 0.927 (vs Magika 0.840, +8.7 pp) |
| Real captured UDP traffic: top-1 from one 1.4 KB packet | 0.809 |
| Real captured UDP traffic: top-1 from the entire stream | 0.821 |
| CPU latency (single sample, p50, Intel i9-12900K): PyTorch fp32 | 202 ms |
| CPU latency (single sample, p50, Intel i9-12900K): ONNX int8 | 382 ms |
| CPU latency (single sample, p50, Intel i9-12900K): Magika v1.1 | 1.3 ms (~155× faster than PyTorch fp32; hardware-dependent) |
Full evaluation (within-cube bootstrap CIs, adversarial sweep, calibration, real-network curves, disk-block matrix, baselines against libmagic 5.46 and TrID 2.24) is in the [paper](https://github.com/mjbommar/mimelens-training).
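The derived figures in the table (percentage-point gaps against Magika and the latency ratio) follow directly from the raw values. A short check, using only numbers copied from the table:

```python
# Values copied from the benchmark table above.
mimelens = {"strict": 0.828, "aligned": 0.829, "top_level": 0.927}
magika = {"strict": 0.653, "aligned": 0.722, "top_level": 0.840}

# Accuracy gap in percentage points for each Magika calibration setting.
gaps_pp = {k: round((mimelens[k] - magika[k]) * 100, 1) for k in mimelens}

# Latency ratio: PyTorch fp32 p50 (202 ms) over Magika v1.1 p50 (1.3 ms).
latency_ratio = 202 / 1.3

print(gaps_pp)
print(round(latency_ratio))
```

The ~155× figure therefore compares Magika against the PyTorch fp32 path, not the ONNX int8 path.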
---
## Quick start
This cell ships a 125-class libmagic-MIME classifier head (the paper's LR probe, re-fit on the full magic-files corpus), so `pipeline("text-classification", ...)` works out of the box:
```python
from transformers import pipeline
clf = pipeline("text-classification",
               model="mjbommar/mimelens-001-medium-bpe-16k-s1",
               trust_remote_code=True,
               top_k=5)
# The model reads the first ~1,022 tokens of whatever you pass (a prefix of the
# buffer, not the whole window). For whole-file triage, a short head window
# classifies magic-byte / compressed types better than a long one -- see
# "Choosing a window" below.
window = open("path/to/file", "rb").read(4096)
preds = clf(window.decode("latin-1")) # latin-1 is a bijection over bytes
# preds[0] is the list of {label, score} sorted by score:
# [{"label": "image/png", "score": 0.97}, {"label": "image/jpeg", "score": 0.01}, ...]
```
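The `latin-1` decode in the snippet above is lossless because the codec maps each byte value 0-255 to the Unicode code point with the same number. A quick standard-library check of that round trip:

```python
# Every possible byte value, in order.
all_bytes = bytes(range(256))

# Decoding never fails and yields exactly one character per byte.
text = all_bytes.decode("latin-1")

# Each character's code point equals the original byte value.
code_points = [ord(ch) for ch in text]

# Encoding back recovers the original bytes exactly.
round_trip = text.encode("latin-1")

print(len(text), round_trip == all_bytes)
```

A codec such as UTF-8 would not work here: arbitrary binary data is usually not valid UTF-8, so decoding would fail or replace bytes.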
To work with embeddings directly (fit a probe, kNN over a gallery, fine-tune a head):
```python
import torch
from transformers import AutoModel, AutoTokenizer
repo = "mjbommar/mimelens-001-medium-bpe-16k-s1"
model = AutoModel.from_pretrained(repo, trust_remote_code=True).eval()
tok = AutoTokenizer.from_pretrained(repo)
window = open("path/to/file", "rb").read(4096)
inputs = tok(window.decode("latin-1"), max_length=1024, truncation=True,
             padding="max_length", return_tensors="pt")
with torch.no_grad():
    embedding = model(**inputs).pooler_output  # (1, 512)
```
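As one illustration of the kNN use case, the sketch below classifies a query embedding by majority vote over its nearest gallery neighbours. Cosine similarity, the helper names, and the toy 3-dimensional vectors are assumptions made for illustration; the paper's exact retrieval setup may differ.

```python
import math
from collections import Counter


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(query, gallery, labels, k=1):
    """Label the query by majority vote among its k most similar gallery items."""
    ranked = sorted(range(len(gallery)),
                    key=lambda i: cosine(query, gallery[i]),
                    reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy gallery; real embeddings would be the 512-dim vectors from the model.
gallery = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
labels = ["image/png", "image/png", "application/zip"]
prediction = knn_predict([0.8, 0.2, 0.0], gallery, labels, k=1)
print(prediction)  # image/png
```

For a gallery of realistic size, a vectorised or indexed nearest-neighbour search would be the practical choice; this version only shows the logic.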
---
## Choosing a window
The model reads the first ~1,022 tokens of whatever you pass: a prefix of the buffer (for this BPE cell, whatever tokenizes to ~1,022 tokens, typically the first ~1.5 to 2.5 KB), not the whole window.
- **Magic-byte / compressed types** (PNG, ZIP, GZIP, JPEG): a **short head window (256 B to 1 KB) classifies better than 4 KB**. A long high-entropy body dilutes the header signal within the fixed token budget, and the model returns `application/octet-stream` on a mostly-opaque window, which is correct behaviour for genuinely high-entropy input, not a bug.
- **Fragments / packets**: you cannot choose the offset, so pass what you have. This is the regime MimeLens is built for.
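The guidance above can be sketched as a small helper. The token budget (~1,022) and the average of ~1.73 bytes/token come from this README; the function names and the 1 KB default head size are hypothetical choices within the recommended 256 B to 1 KB range.

```python
TOKEN_BUDGET = 1022      # tokens the model reads, per this README
BYTES_PER_TOKEN = 1.73   # corpus-average figure for bpe-16k, per this README


def approx_bytes_covered(buffer_len: int) -> int:
    """Rough estimate of how many bytes of a buffer fit in the token budget.

    Uses an average; real coverage varies with content (the README quotes
    a typical range of ~1.5 to 2.5 KB).
    """
    return min(buffer_len, round(TOKEN_BUDGET * BYTES_PER_TOKEN))


def choose_window(data: bytes, is_whole_file: bool, head_bytes: int = 1024) -> bytes:
    """Hypothetical policy: short head for whole files, everything for fragments."""
    if is_whole_file:
        return data[:head_bytes]
    return data  # offset is not under our control: pass what we have


print(approx_bytes_covered(4096))                   # 1768
print(len(choose_window(b"\x00" * 4096, True)))     # 1024
print(len(choose_window(b"\x00" * 1400, False)))    # 1400
```

The estimate lands close to the ~1,765 bytes quoted in the Overview; the small difference comes from rounding the bytes-per-token average.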
---
## Recommended deployment regimes
- **libmagic-taxonomy (125-class) classification from a clean 4 KB chunk**: headline cell of the paper.
- General-purpose deployment when one cell must serve mixed content (image + text + binary).
---
## Training
This cell is one point of the 3 × 4 × {2,3} factorial cube described in the paper.
- **Corpus** (33 GB, stratified multi-source): [`binary-30k`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized) (assorted ELF/PE/Mach-O), magic-frags (random 64 KB chunks across libmagic's full corpus), assorted packed/raw binaries, a [`glaurung`](https://github.com/mjbommar/glaurung)-sourced binary corpus, Windows drivers.
- **Position-arbitrary windowing**: 1024-token windows sampled uniformly at random across files and 64 KB fragments. **No privileged "head of file" position.** This is the design choice that makes MimeLens work on streaming / partial / random-offset inputs.
- **Objective**: MLM with 30% mask ratio (BERT replacement schedule: 80% `[MASK]`, 10% random, 10% original); tied input/output embeddings.
- **Pooling**: mean-pool over body tokens for downstream tasks. The BERT-style `cls_pool` linear projection is *not* used: under MLM-only training it receives no gradient and remains byte-identical to its random initialisation across all 28 cube cells (paper §3.4 verifies this; left in the saved weights for architectural completeness only).
- **Optimisation**: AdamW + cosine LR (peak 5e-4, 2,000-step warmup, 10% floor), bf16 mixed precision, gradient clipping at $\|g\|_2 \leq 1$, effective batch 128 at sequence length 1024, 22,888 gradient updates.
- **Hardware**: single RTX 4060 Ti (16 GB), ~18.0 h wall-clock for this cell.
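The objective and optimisation bullets above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the training code: token selection here is an independent 30% draw per position (so the masked fraction is only approximately 30%), special tokens are not treated separately, the warmup is assumed to be linear, and the names `mlm_mask` and `lr_at` are hypothetical.

```python
import math
import random

IGNORE = -100  # conventional "no loss at this position" label


def mlm_mask(token_ids, vocab_size, mask_id, mask_ratio=0.30, rng=None):
    """Apply the 80/10/10 BERT replacement schedule to selected positions."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [IGNORE] * len(inputs)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_ratio:
            continue                                 # position not selected
        labels[i] = tok                              # predict the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                      # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)    # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels


PEAK_LR = 5e-4
WARMUP_STEPS = 2_000
TOTAL_STEPS = 22_888
FLOOR = 0.10 * PEAK_LR  # cosine decays to 10% of the peak


def lr_at(step: int) -> float:
    """Warmup (assumed linear) to the peak, then cosine decay to the floor."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    return FLOOR + (PEAK_LR - FLOOR) * 0.5 * (1.0 + math.cos(math.pi * progress))


tokens = list(range(100, 10_100))
masked, labels = mlm_mask(tokens, vocab_size=16_384, mask_id=4, rng=random.Random(0))
selected = sum(label != IGNORE for label in labels)
print(selected / len(tokens))
print(lr_at(0), lr_at(WARMUP_STEPS), lr_at(TOTAL_STEPS))
```

The `mask_id=4` value is a placeholder; the real `[MASK]` id depends on the tokenizer.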
---
## Caveats
- This is one cell of a 28-cell parent cube (36 released cells including the 8-cell short-sequence extension). Within-cube comparisons in the paper carry bootstrap CIs at n=3 seeds; some marginal orderings (byte vs bpe-16k at the largest size) are within seed noise and should be read as ties.
- The training corpus is one 33 GB stratified multi-source binary sample. Results may not transfer to substantially different corpora.
- All numbers are computed on data labelled by a single pipeline (libmagic-pinned). The absence of cross-validation against PRONOM, Siegfried, DROID, or IANA reference files is a documented limitation.
- CPU latency at the `medium` size is ~155× slower than Magika v1.1 on a desktop CPU (hardware-dependent). For sub-millisecond whole-file triage on broad categories, Magika is purpose-built and is the right tool. MimeLens occupies a different point on the deployment surface (position-arbitrary inputs + libmagic's 125-class taxonomy) and is not a drop-in replacement.
- End-to-end fine-tuning on the production label distribution may shift these numbers and should be evaluated before deployment. The frozen-probe numbers above are not claimed as a lower bound on fine-tuned performance.
---
## Citation
```bibtex
@misc{bommarito2026mimelens,
title = {MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments},
author = {Bommarito II, Michael J.},
year = {2026},
note = {https://github.com/mjbommar/mimelens-training},
}
```