@@ -43,27 +43,29 @@ model-index:
 # mimelens-001-medium-bpe-16k-s1
-A small (37.76M-parameter) BERT-style encoder for **fine-grained file-content-type detection from binary data**. Give it any 4 KB byte buffer, regardless of where in a file it came from, and it produces a 512-dimensional embedding that downstream classifiers map to one of [libmagic](https://github.com/file/file)'s 125 MIME labels. Useful when you only have a chunk: a forensic-carved fragment, a random disk-block read, a streaming HTTP upload, a single network packet payload.
-🔗 **Model**: [`mjbommar/mimelens-001-medium-bpe-16k-s1`](https://huggingface.co/mjbommar/mimelens-001-medium-bpe-16k-s1)
-👥 **Family**: [`mjbommar/mimelens-001`](https://huggingface.co/mjbommar/mimelens-001) (28 pretrained cells; this is one)
-🔤 **Tokenizer**: [`mjbommar/binary-tokenizer-001-16k`](https://huggingface.co/mjbommar/binary-tokenizer-001-16k)
-📄 **Paper**: *MimeLens: Pretrained Encoders for Fine-Grained Content-Type Detection* (Bommarito 2026) — [GitHub](https://github.com/mjbommar/binary-embedding-paper)
 ---
-## What is MimeLens?
-**MimeLens classifies file content type from any 4 KB byte window, not just the first 4 KB of a complete file.**
 Existing tools assume whole-file access at a known offset:
 - [`libmagic`](https://github.com/file/file) and [Apache Tika](https://tika.apache.org/) match handcrafted magic-byte signatures, almost always anchored at the file head.
 - [Magika](https://github.com/google/magika) (Google) is a small CNN trained on three 512-byte windows (head, middle, tail) of a known-bounded file.
 - TrID, PRONOM/Siegfried/DROID similarly require a complete file.
-These break down when you only have a fragment. MimeLens is pretrained MLM-only on 1024-token windows sampled *uniformly at random* across files and 64 KB fragments, with no privileged head-of-file position. One model handles streaming, partial-arrival, mid-file, packet-payload, and forensic-carved inputs uniformly. The trade-off is CPU latency (~348× slower than Magika at the medium size) in exchange for libmagic's 125-class taxonomy *plus* position arbitrariness.
-The family ships 28 cells (3 model sizes × 4 input vocabularies × 2–3 random seeds). This README documents one of them.
 ---
@@ -71,7 +73,7 @@ The family ships 28 cells (3 model sizes × 4 input vocabularies × 2–3 random
 - **This cell**: `medium` tier, `bpe-16k` input pipeline, seed `1`
 - **Backbone**: 37.76M parameters (12 layers, hidden 512, 8 attention heads, head dim 64, RoPE, RMSNorm, no biases, no dropout)
-- **Input vocabulary**: `bpe-16k` — 16,384-entry binary BPE tokenizer (binary-tokenizer-001-16k), ~1.73 bytes/token. Reads ~1,765 bytes of the 4 KB buffer.
 - **Output**: 512-dim mean-pooled body-token embedding
 - **Label space**: [libmagic](https://github.com/file/file) 125-class MIME taxonomy (full list in paper Appendix)
 - **Pretraining**: MLM-only, 30% mask ratio, 33 GB stratified multi-source binary corpus, 22,888 gradient updates, single RTX 4060 Ti, ~18.0 h wall-clock
@@ -101,7 +103,7 @@ Full evaluation (within-cube bootstrap CIs, adversarial sweep, calibration, real
 ## Quick start
-The model ships with a 125-class libmagic-MIME classifier head baked in (the paper's LR probe, re-fit on the full magic-files corpus). The one-liner path:
 ```python
 from transformers import pipeline
@@ -117,7 +119,7 @@ preds  = clf(window.decode("latin-1"))                 # latin-1 is a bijection
 # [{"label": "image/png", "score": 0.97}, {"label": "image/jpeg", "score": 0.01}, ...]
 ```
-For users who want embeddings instead of a classifier (to fit their own probe or fine-tune a head), the encoder-only path:
 ```python
 import torch
@@ -134,11 +136,12 @@ with torch.no_grad():
     embedding = model(**inputs).pooler_output         # (1, 512)
 ```
 ---
 ## Recommended deployment regimes
-- **Fine-grained libmagic-taxonomy classification from a clean 4 KB chunk** — headline cell of the paper.
 - General-purpose deployment when one cell must serve mixed content (image + text + binary).
 ---
@@ -161,7 +164,7 @@ This cell is one point of the pre-registered 3 × 4 × {2,3} factorial cube desc
 - This is one cell of a 28-cell cube. Within-cube comparisons in the paper carry bootstrap CIs at n=3 seeds; some marginal orderings (byte vs bpe-16k at the top of medium) are within seed noise and should be read as ties.
 - The training corpus is one 33 GB stratified multi-source binary sample. Results may not transfer to substantially different corpora.
 - All numbers are computed on data labelled by a single pipeline (libmagic-pinned). The lack of cross-validation against PRONOM, Siegfried, DROID, or IANA reference files is a documented limitation.
-- CPU latency at the `medium` size is ~348× slower than Magika v1.1; for sub-millisecond whole-file triage on broad categories, Magika is purpose-built and is the right tool. MimeLens occupies a different point on the deployment surface (position-arbitrary inputs + libmagic's 125-class taxonomy), not a drop-in replacement.
 - End-to-end fine-tuning on the production label distribution may shift these numbers and should be evaluated before deployment. The frozen-probe numbers above are not claimed as a lower bound on fine-tuned performance.
 ---

 # mimelens-001-medium-bpe-16k-s1
+A BERT-style encoder with a 37.76M-parameter backbone for fine-grained file-content-type detection from binary data. It takes any 4 KB byte buffer, regardless of where in the file it came from, and produces a 512-dimensional embedding that a classifier maps to one of [libmagic](https://github.com/file/file)'s 125 MIME labels. It is designed for inputs where you only have a chunk: a forensic-carved fragment, a random disk-block read, a streaming HTTP upload, a single network packet payload.
+**🔗 Model**: [`mjbommar/mimelens-001-medium-bpe-16k-s1`](https://huggingface.co/mjbommar/mimelens-001-medium-bpe-16k-s1)
+**👥 Family**: [`mjbommar/mimelens-001`](https://huggingface.co/mjbommar/mimelens-001) (28 pretrained cells; family hub forthcoming)
+**🔤 Tokenizer**: [`mjbommar/binary-tokenizer-001-16k`](https://huggingface.co/mjbommar/binary-tokenizer-001-16k)
+**📄 Paper**: *MimeLens: Pretrained Encoders for Fine-Grained Content-Type Detection* (Bommarito 2026). [GitHub](https://github.com/mjbommar/binary-embedding-paper) (source release forthcoming)
+**📊 Pretraining corpus**: [`mjbommar/binary-30k-tokenized`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized) plus magic-frags, glaurung, Windows drivers (33 GB stratified)
 ---
+## What MimeLens does
+MimeLens classifies file content type from any 4 KB byte window, not just the first 4 KB of a complete file.
 Existing tools assume whole-file access at a known offset:
 - [`libmagic`](https://github.com/file/file) and [Apache Tika](https://tika.apache.org/) match handcrafted magic-byte signatures, almost always anchored at the file head.
 - [Magika](https://github.com/google/magika) (Google) is a small CNN trained on three 512-byte windows (head, middle, tail) of a known-bounded file.
 - TrID, PRONOM/Siegfried/DROID similarly require a complete file.
+These break down on a fragment. MimeLens is pretrained MLM-only on 1024-token windows sampled *uniformly at random* across files and 64 KB fragments, with no privileged head-of-file position. One checkpoint handles streaming, partial-arrival, mid-file, packet-payload, and forensic-carved inputs uniformly. The trade-off is CPU latency (~348× slower than Magika at the medium size) in exchange for libmagic's 125-class taxonomy plus position arbitrariness.
+The family ships 28 cells: 3 model sizes × 4 input vocabularies × 2 or 3 random seeds. This README documents one of them.
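
Position arbitrariness is easy to exercise at evaluation time. The sketch below draws one 4 KB window at a uniformly random offset from a byte string. The helper name `sample_window` and the choice to return short inputs unchanged are assumptions made for illustration, not the paper's sampling code.

```python
import random
from typing import Optional

WINDOW_BYTES = 4096  # 4 KB buffer size used throughout this README


def sample_window(data: bytes, size: int = WINDOW_BYTES,
                  rng: Optional[random.Random] = None) -> bytes:
    """Return `size` contiguous bytes starting at a uniformly random offset.

    Hypothetical helper: inputs no longer than `size` are returned unchanged.
    """
    if rng is None:
        rng = random.Random()
    if len(data) <= size:
        return data
    # Every valid start offset (0 .. len(data) - size) is equally likely.
    start = rng.randrange(len(data) - size + 1)
    return data[start:start + size]


blob = bytes(range(256)) * 64                       # 16 KB of synthetic data
window = sample_window(blob, rng=random.Random(0))  # seeded for repeatability
print(len(window))                                  # 4096
```

Passing a seeded `random.Random` keeps a fragment-level evaluation reproducible across runs.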
 ---
 - **This cell**: `medium` tier, `bpe-16k` input pipeline, seed `1`
 - **Backbone**: 37.76M parameters (12 layers, hidden 512, 8 attention heads, head dim 64, RoPE, RMSNorm, no biases, no dropout)
+- **Input vocabulary**: `bpe-16k`. 16,384-entry binary BPE tokenizer (binary-tokenizer-001-16k), ~1.73 bytes/token. Reads ~1,765 bytes of the 4 KB buffer.
 - **Output**: 512-dim mean-pooled body-token embedding
 - **Label space**: [libmagic](https://github.com/file/file) 125-class MIME taxonomy (full list in paper Appendix)
 - **Pretraining**: MLM-only, 30% mask ratio, 33 GB stratified multi-source binary corpus, 22,888 gradient updates, single RTX 4060 Ti, ~18.0 h wall-clock
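
The byte-coverage figure in the input-vocabulary bullet can be sanity-checked with simple arithmetic. This sketch assumes the encoder's context matches the 1024-token pretraining window and uses the rounded 1.73 bytes/token figure, so it only approximates the quoted ~1,765 bytes.

```python
CONTEXT_TOKENS = 1024      # assumed context length (pretraining window size)
BYTES_PER_TOKEN = 1.73     # rounded compression ratio quoted for bpe-16k
BUFFER_BYTES = 4096        # 4 KB input buffer

# Bytes consumed before the token budget runs out.
bytes_read = CONTEXT_TOKENS * BYTES_PER_TOKEN
# Fraction of the 4 KB buffer the model actually sees.
coverage = bytes_read / BUFFER_BYTES

print(f"approx. bytes read: {bytes_read:.0f}")
print(f"approx. share of buffer: {coverage:.0%}")
```

The estimate lands a little under half of the buffer, so with this tokenizer the tail of a 4 KB window is not read.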
 ## Quick start
+This cell ships with a 125-class libmagic-MIME classifier head baked in (the paper's LR probe, re-fit on the full magic-files corpus), so `pipeline("text-classification", ...)` works out of the box:
 ```python
 from transformers import pipeline
 # [{"label": "image/png", "score": 0.97}, {"label": "image/jpeg", "score": 0.01}, ...]
 ```
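
The pipeline takes a string, so the byte window is decoded with `latin-1` before it is passed in. A quick self-contained check of why that is lossless: latin-1 maps each of the 256 byte values to exactly one code point, so decoding and re-encoding returns the original bytes.

```python
# Every possible byte value, in order.
all_bytes = bytes(range(256))

# Decoding never fails and yields one character per byte.
text = all_bytes.decode("latin-1")
assert len(text) == 256

# Re-encoding recovers the original buffer exactly.
assert text.encode("latin-1") == all_bytes

# By contrast, UTF-8 rejects byte sequences that are common in binary data.
try:
    all_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)  # False
```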
+To work with embeddings directly (fit a probe, kNN over a gallery, fine-tune a head):
 ```python
 import torch
     embedding = model(**inputs).pooler_output         # (1, 512)
 ```
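
As one illustration of the kNN-over-a-gallery option, here is a minimal sketch using cosine similarity on plain Python lists. The function names, the tiny 3-dimensional vectors, and the majority-vote rule are all assumptions for illustration. Real embeddings would be the 512-dimensional `pooler_output` vectors, and a vectorized library would be the practical choice at scale.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_label(query, gallery, k=3):
    """Majority label among the k gallery items most similar to `query`.

    `gallery` is a list of (embedding, label) pairs (hypothetical layout).
    """
    ranked = sorted(gallery, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy gallery with made-up vectors standing in for real embeddings.
gallery = [
    ([1.0, 0.1, 0.0], "image/png"),
    ([0.9, 0.2, 0.1], "image/png"),
    ([0.0, 1.0, 0.1], "application/pdf"),
    ([0.1, 0.9, 0.0], "application/pdf"),
]
print(knn_label([0.95, 0.15, 0.05], gallery))  # image/png
```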
 ---
 ## Recommended deployment regimes
+- **Fine-grained libmagic-taxonomy classification from a clean 4 KB chunk**: headline cell of the paper.
 - General-purpose deployment when one cell must serve mixed content (image + text + binary).
 ---
 - This is one cell of a 28-cell cube. Within-cube comparisons in the paper carry bootstrap CIs at n=3 seeds; some marginal orderings (byte vs bpe-16k at the top of medium) are within seed noise and should be read as ties.
 - The training corpus is one 33 GB stratified multi-source binary sample. Results may not transfer to substantially different corpora.
 - All numbers are computed on data labelled by a single pipeline (libmagic-pinned). The lack of cross-validation against PRONOM, Siegfried, DROID, or IANA reference files is a documented limitation.
+- CPU latency at the `medium` size is ~348× slower than Magika v1.1. For sub-millisecond whole-file triage on broad categories, Magika is purpose-built and is the right tool. MimeLens occupies a different point on the deployment surface (position-arbitrary inputs + libmagic's 125-class taxonomy) and is not a drop-in replacement.
 - End-to-end fine-tuning on the production label distribution may shift these numbers and should be evaluated before deployment. The frozen-probe numbers above are not claimed as a lower bound on fine-tuned performance.
 ---