	---
	license: mit
	library_name: transformers
	language:
	- en
	tags:
	- file-type-detection
	- mime-classification
	- binary-content
	- binary-analysis
	- position-agnostic
	- libmagic
	- forensics
	- packet-inspection
	- bpe
	- byte-pair-encoding
	- mimelens
	base_model: mjbommar/binary-tokenizer-001-64k
	pipeline_tag: text-classification
	model-index:
	- name: mimelens-001-tiny-bpe-64k-s2
	  results:
	  - task:
	      type: feature-extraction
	      name: MIME-125 classification (libmagic 125-class taxonomy)
	    dataset:
	      name: magic-frags (4 KB head of 64 KB random chunks, n=4,096)
	      type: custom
	    metrics:
	    - name: top-1 accuracy
	      type: accuracy
	      value: 0.7324
	    - name: macro-F1
	      type: f1
	      value: 0.6086
	    - name: kNN R@1
	      type: recall@1
	      value: 0.6754
	    source:
	      name: "MimeLens paper (Bommarito 2026), Appendix A"
	      url: https://github.com/mjbommar/mimelens-training
	---

	# mimelens-001-tiny-bpe-64k-s2

	A 3.15M-backbone-parameter BERT-style encoder for position-agnostic file-content-type detection from binary data. It reads a byte window taken from any offset in a file (the first ~1,022 tokens of whatever you pass) and produces a 256-dimensional embedding that classifiers map to one of [libmagic](https://github.com/file/file)'s 125 MIME labels. It is designed for inputs where you only have a chunk: a forensic-carved fragment, a random disk-block read, a streaming HTTP upload, or a single network packet payload.

	- 🔗 Model: [`mjbommar/mimelens-001-tiny-bpe-64k-s2`](https://huggingface.co/mjbommar/mimelens-001-tiny-bpe-64k-s2)
	- 👥 Family: [`mjbommar/mimelens-001`](https://huggingface.co/mjbommar/mimelens-001) (36 released cells: 28 parent + 8 short-sequence)
	- 🔤 Tokenizer: [`mjbommar/binary-tokenizer-001-64k`](https://huggingface.co/mjbommar/binary-tokenizer-001-64k)
	- 📄 Paper: MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments (Bommarito 2026)
	- 💻 Training code: [`mjbommar/mimelens-training`](https://github.com/mjbommar/mimelens-training)
	- 📊 Pretraining corpus: [`mjbommar/binary-30k-tokenized`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized) plus magic-corpus extracts, packed binaries, a [`glaurung`](https://github.com/mjbommar/glaurung)-sourced binary corpus, and Windows drivers (33 GB stratified; the full corpus is not redistributable)

	---

	## What MimeLens does

	MimeLens classifies file content type from a byte window taken at any offset, not just the header of a complete file.

	Existing tools assume whole-file access at a known offset:

	- [`libmagic`](https://github.com/file/file) and [Apache Tika](https://tika.apache.org/) match handcrafted magic-byte signatures, almost always anchored at the file head.
	- [Magika](https://github.com/google/magika) (Google) is a small (~1 M-parameter) feedforward network over three 512-byte windows (head, middle, tail) of a known-bounded file.
	- TrID, PRONOM/Siegfried/DROID similarly require a complete file.

	These approaches break down on a fragment, where neither the file head nor the file bounds are available. MimeLens is pretrained MLM-only on 1024-token windows sampled uniformly at random across files and 64 KB fragments, with no privileged head-of-file position. One checkpoint handles streaming, partial-arrival, mid-file, packet-payload, and forensic-carved inputs uniformly. The trade-off is CPU latency: MimeLens is roughly two orders of magnitude slower than Magika at the medium size (hardware-dependent), in exchange for libmagic's 125-class taxonomy and support for windows at arbitrary offsets.

	The family ships 28 parent cells (3 sizes × 4 vocabs × 2-3 seeds at `seq_len=1024`) plus an 8-cell short-sequence extension (medium tier × 4 vocabs × 2 seeds at `seq_len=256`). This README documents one of them.

	> Short-sequence sibling available. If your inputs are sub-KB (DNS payloads, sub-MTU packets, small forensic fragments), use `mjbommar/mimelens-001-tiny-bpe-64k-s2-seq256` instead. Same architecture, 4× shorter context, ~5× lower CPU latency, BPE-cell accuracy ties or beats this cell on the magic-files probe-fit. See paper Appendix B.5.



	---

	## Overview

	- This cell: `tiny` tier, `bpe-64k` input pipeline, seed `2`
	- Backbone: 3.15M parameters (4 layers, hidden 256, 4 attention heads, head dim 64, RoPE, RMSNorm, no biases, no dropout)
	- Input vocabulary: `bpe-64k`. 65,536-entry binary BPE tokenizer (binary-tokenizer-001-64k), ~2.09 bytes/token. Reads ~2,134 bytes of the 4 KB buffer.
	- Output: 256-dim mean-pooled body-token embedding
	- Label space: [libmagic](https://github.com/file/file) 125-class MIME taxonomy (full list in paper Appendix)
	- Pretraining: MLM-only, 30% mask ratio, 33 GB stratified multi-source binary corpus, 22,888 gradient updates, single RTX 4060 Ti, ~2.7 h wall-clock
	- License: MIT
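
The byte coverage quoted for the input vocabulary follows from the token budget. As a rough sketch, assuming a budget of 1,022 body tokens and the ~2.09 bytes/token average listed above (the function name `bytes_covered` is hypothetical and not part of the released code):

```python
def bytes_covered(body_tokens: int, bytes_per_token: float, buffer_bytes: int) -> int:
    """Estimate how many bytes of a buffer fit into a fixed token budget."""
    estimate = round(body_tokens * bytes_per_token)
    # The model can never read more bytes than the buffer actually holds.
    return min(estimate, buffer_bytes)


covered = bytes_covered(1022, 2.09, 4096)
print(covered, f"{covered / 4096:.0%}")
```

With these rounded inputs the estimate is about 2,136 bytes, roughly half of a 4 KB buffer and in line with the ~2,134 bytes quoted above. The small gap comes from rounding in the bytes/token figure.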

	## Headline benchmarks (this cell)

	| Benchmark | Value |
	|---|---|
	| MIME-125 top-1 (magic-frags, 4 KB head, n=4,096) | 0.732 |
	| MIME-125 macro-F1 (magic-frags, 4 KB head) | 0.609 |
	| kNN R@1 (magic-frags, 3,147-file gallery / 949 queries) | 0.675 |

	Full evaluation (within-cube bootstrap CIs, adversarial sweep, calibration, real-network curves, disk-block matrix, baselines against libmagic 5.46 and TrID 2.24) is in the [paper](https://github.com/mjbommar/mimelens-training).
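
Top-1 accuracy and macro-F1 can diverge because macro-F1 gives every class equal weight, so rare classes that are classified poorly pull it down. The sketch below shows the two metrics on toy labels (not paper data). It averages F1 over every label that appears in either list, which is an assumption; the paper's exact averaging convention is not stated here.

```python
from collections import Counter


def top1_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: a frequent class predicted well, a rare class always missed.
y_true = ["application/zip"] * 8 + ["image/png"] * 2
y_pred = ["application/zip"] * 10
print(f"top-1    = {top1_accuracy(y_true, y_pred):.3f}")
print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")
```

In this toy case top-1 is 0.8 while macro-F1 is about 0.44, because the missed rare class contributes an F1 of zero to the average.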

	---

	## Quick start

	This cell publishes the encoder only (no classifier head baked in). Use it to extract embeddings, then fit a probe, run kNN over a labelled gallery, or fine-tune a head:

	```python
	import torch
	from transformers import AutoModel, AutoTokenizer

	repo = "mjbommar/mimelens-001-tiny-bpe-64k-s2"
	# trust_remote_code=True executes the model code shipped in the repo.
	model = AutoModel.from_pretrained(repo, trust_remote_code=True).eval()
	tok = AutoTokenizer.from_pretrained(repo)

	# Read up to 4 KB of raw bytes; latin-1 maps each byte to one code point.
	with open("path/to/file", "rb") as f:
	    window = f.read(4096)

	inputs = tok(window.decode("latin-1"), max_length=1024, truncation=True,
	             padding="max_length", return_tensors="pt")
	with torch.no_grad():
	    embedding = model(**inputs).pooler_output  # (1, 256)
	```

	The pre-fit LR probe weights for this cell are not bundled here. The deployed cells and per-size winners (e.g. `mimelens-001-medium-bpe-16k-s1`) ship a baked classifier head for a one-line `pipeline()` path.
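
As one way to use the embeddings without a classifier head, the sketch below runs 1-nearest-neighbour lookup over a labelled gallery and computes recall@1. It is plain Python over lists of floats, with 3-dimensional toy vectors standing in for the 256-dimensional embeddings. Cosine similarity and the function names are assumptions for illustration; the distance used in the paper's kNN evaluation is not stated here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_label(query, gallery):
    """Return the label of the gallery embedding most similar to the query.

    gallery is a list of (embedding, label) pairs.
    """
    _, label = max(gallery, key=lambda item: cosine(query, item[0]))
    return label


def recall_at_1(queries, gallery):
    """Fraction of (embedding, label) queries whose nearest neighbour shares their label."""
    hits = sum(nearest_label(emb, gallery) == label for emb, label in queries)
    return hits / len(queries)


# Toy gallery and query (hypothetical values, not real embeddings).
gallery = [([1.0, 0.0, 0.0], "image/png"), ([0.0, 1.0, 0.0], "application/zip")]
print(nearest_label([0.9, 0.1, 0.0], gallery))
```

For a gallery of a few thousand files a linear scan like this is usually adequate; a vectorised or indexed search would be the natural replacement at larger scale.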


	---

	## Choosing a window

	The model reads the first ~1,022 tokens of whatever you pass, which is a prefix of the buffer rather than the whole window. For this BPE cell, that is whatever tokenizes to ~1,022 tokens, typically the first ~1.5 to 2.5 KB.

	- Magic-byte / compressed types (PNG, ZIP, GZIP, JPEG): a short head window (256 B to 1 KB) classifies better than 4 KB. A long high-entropy body dilutes the header signal within the fixed token budget, and the model returns `application/octet-stream` on a mostly-opaque window. That is correct behaviour for genuinely high-entropy input, not a bug.
	- Fragments / packets: you cannot choose the offset, so pass what you have. This is the regime MimeLens is built for.
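
One way to apply this guidance in calling code is sketched below. The helper `choose_window`, its `at_file_head` flag, and the 1 KB / 4 KB defaults are hypothetical choices for illustration, not part of the released package; the right head size depends on the file types you expect.

```python
def choose_window(data: bytes, at_file_head: bool,
                  head_bytes: int = 1024, max_bytes: int = 4096) -> bytes:
    """Pick the bytes to hand to the tokenizer.

    at_file_head=True means the caller knows `data` starts at offset 0 of a
    file, so a short head window keeps signature bytes from being diluted by
    a long high-entropy body. Otherwise the input is a fragment at an unknown
    offset and we simply pass what we have, capped at max_bytes.
    """
    limit = head_bytes if at_file_head else max_bytes
    return data[:limit]


# Whole file available: keep only a short head window.
head = choose_window(b"\x89PNG\r\n\x1a\n" + bytes(8000), at_file_head=True)
# Carved fragment or packet payload: pass it through (up to the cap).
fragment = choose_window(bytes(1400), at_file_head=False)
print(len(head), len(fragment))
```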

	---

	## Recommended deployment regimes

	See the family hub README ([`mjbommar/mimelens-001`](https://huggingface.co/mjbommar/mimelens-001)) for the regime decision tree.

	---

	## Training

	This cell is one point of the 28-cell parent cube (3 sizes × 4 vocabularies × 2-3 seeds) described in the paper.

	- Corpus (33 GB, stratified multi-source): [`binary-30k`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized) (assorted ELF/PE/Mach-O), magic-frags (random 64 KB chunks across libmagic's full corpus), assorted packed/raw binaries, a [`glaurung`](https://github.com/mjbommar/glaurung)-sourced binary corpus, Windows drivers.
	- Position-arbitrary windowing: 1024-token windows sampled uniformly at random across files and 64 KB fragments. No privileged "head of file" position. This is the design choice that makes MimeLens work on streaming / partial / random-offset inputs.
	- Objective: MLM with 30% mask ratio (BERT replacement schedule: 80% `[MASK]`, 10% random, 10% original); tied input/output embeddings.
	- Pooling: mean-pool over body tokens for downstream tasks. The BERT-style `cls_pool` linear projection is not used: under MLM-only training it receives no gradient and remains byte-identical to its random initialisation across all 28 cube cells (paper §3.4 verifies this; left in the saved weights for architectural completeness only).
	- Optimisation: AdamW + cosine LR (peak 5e-4, 2,000-step warmup, 10% floor), bf16 mixed precision, gradient clipping at $\|g\|_2 \leq 1$, effective batch 128 at sequence length 1024, 22,888 gradient updates.
	- Hardware: single RTX 4060 Ti (16 GB), ~2.7 h wall-clock for this cell.
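
The windowing, masking, and learning-rate settings listed above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the training code: the warmup is assumed to be linear, the "10% floor" is read as 10% of the peak rate, random replacements are drawn from the whole vocabulary, and the function names and the `mask_id` value are hypothetical.

```python
import math
import random

IGNORE = -100  # label for positions excluded from the loss (PyTorch convention)


def sample_window(tokens, window, rng):
    """Take `window` consecutive tokens starting at a uniformly random offset."""
    if len(tokens) <= window:
        return list(tokens)
    start = rng.randrange(len(tokens) - window + 1)
    return list(tokens[start:start + window])


def mask_tokens(tokens, mask_id, vocab_size, rng, mask_ratio=0.30):
    """Select ~mask_ratio of positions; 80% -> [MASK], 10% -> random, 10% -> unchanged."""
    inputs, labels = list(tokens), [IGNORE] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_ratio:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)
        # else: keep the original token but still predict it
    return inputs, labels


def lr_at_step(step, peak=5e-4, warmup=2000, total=22888, floor=0.10):
    """Linear warmup (assumed) to `peak`, then cosine decay to floor * peak."""
    if step < warmup:
        return peak * step / warmup
    progress = min(1.0, (step - warmup) / (total - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak * (floor + (1.0 - floor) * cosine)


rng = random.Random(0)
window = sample_window(list(range(5000)), 1024, rng)  # stand-in token ids
inputs, labels = mask_tokens(window, mask_id=4, vocab_size=65536, rng=rng)
print(len(inputs), sum(label != IGNORE for label in labels))
```

Because the start offset is drawn uniformly, the first token of a training window is as likely to fall mid-file as at a header, which is what removes any privileged head-of-file position.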

	---

	## Caveats

	- This is one cell of a 28-cell parent cube (36 released cells including the 8-cell short-sequence extension). Within-cube comparisons in the paper carry bootstrap CIs at n=2 seeds; some marginal orderings (byte vs bpe-16k at the largest size) are within seed noise and should be read as ties.
	- The training corpus is one 33 GB stratified multi-source binary sample. Results may not transfer to substantially different corpora.
	- All numbers are computed on data labelled by a single pipeline (libmagic-pinned). The absence of cross-validation against PRONOM, Siegfried, DROID, or IANA reference files is a documented limitation.
	- CPU inference at the `medium` size is ~155× slower than Magika v1.1 on a desktop CPU (hardware-dependent). For sub-millisecond whole-file triage on broad categories, Magika is purpose-built and is the right tool. MimeLens occupies a different point on the deployment surface (position-arbitrary inputs + libmagic's 125-class taxonomy) and is not a drop-in replacement.
	- End-to-end fine-tuning on the production label distribution may shift these numbers and should be evaluated before deployment. The frozen-probe numbers above are not claimed as a lower bound on fine-tuned performance.

	---

	## Citation

	```bibtex
	@misc{bommarito2026mimelens,
	title = {MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments},
	author = {Bommarito II, Michael J.},
	year = {2026},
	note = {https://github.com/mjbommar/mimelens-training},
	}
	```