---
license: mit
library_name: transformers
language:
  - en
tags:
  - file-type-detection
  - mime-classification
  - binary-content
  - binary-analysis
  - position-agnostic
  - libmagic
  - forensics
  - packet-inspection
  - bpe
  - byte-pair-encoding
  - mimelens
base_model: mjbommar/binary-tokenizer-001-16k
pipeline_tag: text-classification
model-index:
  - name: mimelens-001-medium-bpe-16k-s1
    results:
      - task:
          type: feature-extraction
          name: MIME-125 classification (libmagic 125-class taxonomy)
        dataset:
          name: magic-frags (4 KB head of 64 KB random chunks, n=4,096)
          type: custom
        metrics:
          - name: top-1 accuracy
            type: accuracy
            value: 0.7988
          - name: macro-F1
            type: f1
            value: 0.6375
          - name: kNN R@1
            type: recall@1
            value: 0.6986
        source:
          name: "MimeLens paper (Bommarito 2026), Appendix A"
          url: https://github.com/mjbommar/mimelens-training
---

# mimelens-001-medium-bpe-16k-s1

A BERT-style encoder with a 37.76M-parameter backbone for position-agnostic file-content-type detection from binary data. It reads a byte window taken from *any* offset in a file (the first ~1,022 tokens of whatever you pass) and produces a 512-dimensional embedding that classifiers map to one of [libmagic](https://github.com/file/file)'s 125 MIME labels. It is designed for inputs where you only have a chunk: a forensic-carved fragment, a random disk-block read, a streaming HTTP upload, or a single network packet payload.

- **🔗 Model**: [`mjbommar/mimelens-001-medium-bpe-16k-s1`](https://huggingface.co/mjbommar/mimelens-001-medium-bpe-16k-s1)
- **👥 Family**: [`mjbommar/mimelens-001`](https://huggingface.co/mjbommar/mimelens-001) (36 released cells: 28 parent + 8 short-sequence)
- **🔤 Tokenizer**: [`mjbommar/binary-tokenizer-001-16k`](https://huggingface.co/mjbommar/binary-tokenizer-001-16k)
- **📄 Paper**: *MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments* (Bommarito 2026)
- **💻 Training code**: [`mjbommar/mimelens-training`](https://github.com/mjbommar/mimelens-training)
- **📊 Pretraining corpus**: [`mjbommar/binary-30k-tokenized`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized) plus magic-corpus extracts, packed binaries, a [`glaurung`](https://github.com/mjbommar/glaurung)-sourced binary corpus, and Windows drivers (33 GB stratified; the full corpus is not redistributable)

---

## What MimeLens does

MimeLens classifies file content type from a byte window taken at any offset, not just the header of a complete file.

Existing tools assume whole-file access at a known offset:

- [`libmagic`](https://github.com/file/file) and [Apache Tika](https://tika.apache.org/) match handcrafted magic-byte signatures, almost always anchored at the file head.
- [Magika](https://github.com/google/magika) (Google) is a small (~1 M-parameter) feedforward network over three 512-byte windows (head, middle, tail) of a known-bounded file.
- TrID and PRONOM/Siegfried/DROID similarly require a complete file.

These approaches break down on a fragment. MimeLens is pretrained with MLM only, on 1024-token windows sampled *uniformly at random* across files and 64 KB fragments, with no privileged head-of-file position. One checkpoint handles streaming, partial-arrival, mid-file, packet-payload, and forensic-carved inputs uniformly. The trade-off is CPU latency (roughly two orders of magnitude slower than Magika at the medium size; hardware-dependent) in exchange for libmagic's 125-class taxonomy and support for inputs taken from arbitrary positions.
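
The uniform-offset windowing described above can be sketched as follows. This is a minimal illustration, not the sampler from the training code: the function name `sample_window` and the handling of inputs shorter than the window are assumptions.

```python
import random

def sample_window(tokens, window=1024, rng=random):
    """Return a contiguous window starting at a uniformly random offset.

    Hypothetical helper: every valid start offset is equally likely, so the
    head of the file gets no special treatment.
    """
    if len(tokens) <= window:
        return list(tokens)                      # assumed: short inputs are used whole
    start = rng.randrange(len(tokens) - window + 1)
    return list(tokens[start:start + window])

# Example: draw one 1024-token window from a 5,000-token sequence.
window = sample_window(list(range(5000)), 1024, random.Random(0))
print(len(window))   # 1024
```

Because the start offset is uniform, a file header appears in only a small share of training windows, which is what removes the dependence on head-of-file signatures.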

The family ships 28 parent cells (3 sizes × 4 vocabs × 2–3 seeds at seq\_len=1024) plus an 8-cell short-sequence extension (medium tier × 4 vocabs × 2 seeds at seq\_len=256). This README documents one of them.

> **Short-sequence sibling available.** If your inputs are sub-KB (DNS payloads, sub-MTU packets, small forensic fragments), use `mjbommar/mimelens-001-medium-bpe-16k-s1-seq256` instead. Same architecture, 4× shorter context, ~5× lower CPU latency, BPE-cell accuracy ties or beats this cell on the magic-files probe-fit. See paper Appendix B.5.


> **ONNX bundled.** This cell ships `onnx/model_fp32.onnx` + `onnx/model_int8.onnx` (dynamic int8 of MatMul/Gemm) for direct ONNX Runtime inference. See `onnx/README.md` in this repo for input/output shapes and the latency profile.


---

## Overview

- **This cell**: `medium` tier, `bpe-16k` input pipeline, seed `1`
- **Backbone**: 37.76M parameters (12 layers, hidden 512, 8 attention heads, head dim 64, RoPE, RMSNorm, no biases, no dropout)
- **Input vocabulary**: `bpe-16k`. 16,384-entry binary BPE tokenizer (binary-tokenizer-001-16k), ~1.73 bytes/token. Reads ~1,765 bytes of the 4 KB buffer.
- **Output**: 512-dim mean-pooled body-token embedding
- **Label space**: [libmagic](https://github.com/file/file) 125-class MIME taxonomy (full list in paper Appendix)
- **Pretraining**: MLM-only, 30% mask ratio, 33 GB stratified multi-source binary corpus, 22,888 gradient updates, single RTX 4060 Ti, ~18.0 h wall-clock
- **License**: MIT

## Headline benchmarks (this cell)

| Benchmark | Value |
|---|---|
| MIME-125 top-1 (magic-frags, 4 KB head, n=4,096)            | **0.799** |
| MIME-125 macro-F1 (magic-frags, 4 KB head)                  | 0.637 |
| kNN R@1 (magic-frags, 3,147-file gallery / 949 queries)     | 0.699 |
| Δ top-1 under zero-first-16-byte header perturbation        | −0.102 |
| Δ top-1 under zero-first-64-byte header perturbation        | −0.130 |
| **Magika v1.1 calibration: strict top-1** (n=1,024)         | **0.828** (vs Magika 0.653, +17.5 pp) |
| Magika v1.1 calibration: aligned top-1 (21-class equiv map) | 0.829 (vs Magika 0.722, +10.7 pp)     |
| Magika v1.1 calibration: top-level top-1                    | 0.927 (vs Magika 0.840, +8.7 pp)      |
| Real captured UDP traffic: top-1 from one 1.4 KB packet     | 0.809 |
| Real captured UDP traffic: top-1 from the entire stream     | 0.821 |
| CPU latency (single sample, p50, Intel i9-12900K): PyTorch fp32 | 202 ms |
| CPU latency (single sample, p50, Intel i9-12900K): ONNX int8    | 382 ms |
| CPU latency (single sample, p50, Intel i9-12900K): Magika v1.1  | 1.3 ms (~155× faster than PyTorch fp32; hardware-dependent) |

Full evaluation (within-cube bootstrap CIs, adversarial sweep, calibration, real-network curves, disk-block matrix, baselines against libmagic 5.46 and TrID 2.24) is in the [paper](https://github.com/mjbommar/mimelens-training).
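
To make the header-perturbation rows concrete, here is a minimal sketch of that kind of perturbation. It assumes "zero-first-N-byte" means overwriting the first N bytes of the window with `0x00` while keeping the length unchanged; the paper's exact procedure may differ, and `zero_header` is a hypothetical helper name.

```python
def zero_header(window: bytes, n: int) -> bytes:
    """Overwrite the first n bytes with zeros; the window length is preserved."""
    n = max(0, min(n, len(window)))      # clamp so short windows are handled
    return bytes(n) + window[n:]

# Example: the 8-byte PNG signature disappears, the body is untouched.
png_like = b"\x89PNG\r\n\x1a\n" + b"rest-of-file"
perturbed = zero_header(png_like, 16)
print(len(perturbed) == len(png_like))   # True
```

Classifying `perturbed` instead of the original window and comparing top-1 accuracy over an evaluation set gives a delta of the kind reported in the table.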

---

## Quick start

This cell ships a 125-class libmagic-MIME classifier head (the paper's LR probe, re-fit on the full magic-files corpus), so `pipeline("text-classification", ...)` works out of the box:

```python
from transformers import pipeline

clf = pipeline("text-classification",
               model="mjbommar/mimelens-001-medium-bpe-16k-s1",
               trust_remote_code=True,
               top_k=5)

# The model reads the first ~1,022 tokens of whatever you pass (a prefix of the
# buffer, not the whole window). For whole-file triage, a short head window
# classifies magic-byte / compressed types better than a long one -- see
# "Choosing a window" below.
window = open("path/to/file", "rb").read(4096)
preds  = clf(window.decode("latin-1"))                 # latin-1 is a bijection over bytes
# preds[0] is the list of {label, score} sorted by score:
# [{"label": "image/png", "score": 0.97}, {"label": "image/jpeg", "score": 0.01}, ...]
```

To work with embeddings directly (fit a probe, kNN over a gallery, fine-tune a head):

```python
import torch
from transformers import AutoModel, AutoTokenizer

repo  = "mjbommar/mimelens-001-medium-bpe-16k-s1"
model = AutoModel.from_pretrained(repo, trust_remote_code=True).eval()
tok   = AutoTokenizer.from_pretrained(repo)

window = open("path/to/file", "rb").read(4096)
inputs = tok(window.decode("latin-1"), max_length=1024, truncation=True,
             padding="max_length", return_tensors="pt")
with torch.no_grad():
    embedding = model(**inputs).pooler_output         # (1, 512)
```
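
As one illustration of the kNN use case, the sketch below does a 1-nearest-neighbour lookup over a small labelled gallery. Cosine similarity is an assumption here (the README does not state the metric behind the kNN R@1 number), the function names are hypothetical, and the toy 3-dimensional vectors stand in for real 512-dimensional embeddings. It is pure Python to stay dependency-free; with real embeddings a vectorised torch or NumPy version would be the natural choice.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def nearest_label(query, gallery, labels):
    """Return the label of the gallery vector most similar to the query (1-NN)."""
    best = max(range(len(gallery)), key=lambda i: cosine(query, gallery[i]))
    return labels[best]

# Toy gallery: in practice each row would be embedding[0].tolist() for a known file.
gallery = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
labels = ["image/png", "application/zip"]
print(nearest_label([0.9, 0.1, 0.0], gallery, labels))   # image/png
```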


---

## Choosing a window

The model reads the first ~1,022 tokens of whatever you pass: a prefix of the buffer (for this BPE cell, whatever tokenizes to ~1,022 tokens, typically the first ~1.5–2.5 KB), not the whole window.

- **Magic-byte / compressed types** (PNG, ZIP, GZIP, JPEG): a **short head window (256 B–1 KB) classifies better than 4 KB**. A long high-entropy body dilutes the header signal within the fixed token budget, and the model returns `application/octet-stream` on a mostly-opaque window. That is correct behaviour for genuinely high-entropy input, not a bug.
- **Fragments / packets**: you cannot choose the offset, so pass what you have. This is the regime MimeLens is built for.
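
The token budget translates into a byte budget with simple arithmetic. The sketch below uses the two figures quoted in this README (~1,022 tokens, ~1.73 bytes/token); both helpers are hypothetical, and the real byte coverage depends on how compressible each input is under the tokenizer.

```python
TOKEN_BUDGET = 1022       # tokens the model reads, per this README
BYTES_PER_TOKEN = 1.73    # approximate bpe-16k average, per this README

def bytes_read(n_bytes: int) -> int:
    """Estimate how many bytes of an n_bytes buffer fit in the token budget."""
    return min(n_bytes, int(TOKEN_BUDGET * BYTES_PER_TOKEN))

def head_window(data: bytes, n_bytes: int = 1024) -> bytes:
    """Take a short head window for whole-file triage (assumed default: 1 KB)."""
    return data[:n_bytes]

print(bytes_read(4096))   # 1768
print(bytes_read(512))    # 512
```

The estimate for a 4 KB buffer (1,768 bytes) is close to the ~1,765 bytes quoted in the overview; the small gap comes from the rounded bytes-per-token figure.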

---

## Recommended deployment regimes

- **libmagic-taxonomy (125-class) classification from a clean 4 KB chunk**: this is the headline cell of the paper.
- **General-purpose deployment** when one cell must serve mixed content (image + text + binary).

---

## Training

This cell is one point of the 3 × 4 × {2,3} factorial cube described in the paper.

- **Corpus** (33 GB, stratified multi-source): [`binary-30k`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized) (assorted ELF/PE/Mach-O), magic-frags (random 64 KB chunks across libmagic's full corpus), assorted packed/raw binaries, a [`glaurung`](https://github.com/mjbommar/glaurung)-sourced binary corpus, Windows drivers.
- **Position-arbitrary windowing**: 1024-token windows sampled uniformly at random across files and 64 KB fragments. **No privileged "head of file" position.** This is the design choice that makes MimeLens work on streaming / partial / random-offset inputs.
- **Objective**: MLM with 30% mask ratio (BERT replacement schedule: 80% `[MASK]`, 10% random, 10% original); tied input/output embeddings.
- **Pooling**: mean-pool over body tokens for downstream tasks. The BERT-style `cls_pool` linear projection is *not* used: under MLM-only training it receives no gradient and remains byte-identical to its random initialisation across all 28 cube cells (paper §3.4 verifies this; left in the saved weights for architectural completeness only).
- **Optimisation**: AdamW + cosine LR (peak 5e-4, 2,000-step warmup, 10% floor), bf16 mixed precision, gradient clipping at $\|g\|_2 \leq 1$, effective batch 128 at sequence length 1024, 22,888 gradient updates.
- **Hardware**: single RTX 4060 Ti (16 GB), ~18.0 h wall-clock for this cell.
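
Two of the settings above, the 80/10/10 masking rule and the warmup-plus-cosine learning-rate schedule, can be sketched in a few lines. This is an illustration under stated assumptions, not the training code: the warmup is assumed to be linear, the "10% floor" is read as 10% of the peak rate, the `[MASK]` id is passed in rather than known, and random replacements are drawn from the whole vocabulary without excluding special tokens.

```python
import math
import random

PEAK_LR, WARMUP, TOTAL, FLOOR = 5e-4, 2_000, 22_888, 0.10

def lr_at(step: int) -> float:
    """Learning rate at a given update: assumed linear warmup, then cosine decay."""
    if step < WARMUP:
        return PEAK_LR * (step + 1) / WARMUP
    progress = min(1.0, (step - WARMUP) / max(1, TOTAL - WARMUP))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return PEAK_LR * (FLOOR + (1.0 - FLOOR) * cosine)   # decays to FLOOR * PEAK_LR

def mask_tokens(ids, mask_id, vocab_size, ratio=0.30, rng=random):
    """Select ~ratio of positions; 80% -> mask_id, 10% -> random id, 10% unchanged."""
    inputs = list(ids)
    targets = [None] * len(ids)          # None marks positions that are not scored
    for i, tok in enumerate(ids):
        if rng.random() >= ratio:
            continue
        targets[i] = tok                 # the model must predict the original token
        r = rng.random()
        if r < 0.8:
            inputs[i] = mask_id
        elif r < 0.9:
            inputs[i] = rng.randrange(vocab_size)
        # remaining 10%: input token left as-is
    return inputs, targets
```

With these constants the schedule peaks at 5e-4 when warmup ends and settles at 5e-5 by update 22,888.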

---

## Caveats

- This is one cell of a 28-cell parent cube (36 released cells including the 8-cell short-sequence extension). Within-cube comparisons in the paper carry bootstrap CIs at n=3 seeds; some marginal orderings (byte vs bpe-16k at the largest size) are within seed noise and should be read as ties.
- The training corpus is one 33 GB stratified multi-source binary sample. Results may not transfer to substantially different corpora.
- All numbers are computed on data labelled by a single pipeline (libmagic-pinned). The absence of cross-validation against PRONOM, Siegfried, DROID, or IANA reference files is a documented limitation.
- CPU latency at the `medium` size is ~155× slower than Magika v1.1 on a desktop CPU (hardware-dependent). For sub-millisecond whole-file triage on broad categories, Magika is purpose-built and is the right tool. MimeLens occupies a different point on the deployment surface (position-arbitrary inputs + libmagic's 125-class taxonomy) and is not a drop-in replacement.
- End-to-end fine-tuning on the production label distribution may shift these numbers and should be evaluated before deployment. The frozen-probe numbers above are not claimed as a lower bound on fine-tuned performance.

---

## Citation

```bibtex
@misc{bommarito2026mimelens,
  title  = {MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments},
  author = {Bommarito II, Michael J.},
  year   = {2026},
  note   = {https://github.com/mjbommar/mimelens-training},
}
```