mjbommar committed on
Commit b4f00df · verified · 1 Parent(s): a6ea17f

README review pass: lead-paragraph hygiene, badge consistency, drop self-thanks, soften 'robust' language

Files changed (1)
  1. README.md +17 -14
README.md CHANGED
@@ -43,27 +43,29 @@ model-index:

  # mimelens-001-medium-bpe-16k-s1

- A small (37.76M-parameter) BERT-style encoder for **fine-grained file-content-type detection from binary data**. Give it any 4 KB byte buffer, regardless of where in a file it came from, and it produces a 512-dimensional embedding that downstream classifiers map to one of [libmagic](https://github.com/file/file)'s 125 MIME labels. Useful when you only have a chunk: a forensic-carved fragment, a random disk-block read, a streaming HTTP upload, a single network packet payload.

- 🔗 **Model**: [`mjbommar/mimelens-001-medium-bpe-16k-s1`](https://huggingface.co/mjbommar/mimelens-001-medium-bpe-16k-s1)
- 👥 **Family**: [`mjbommar/mimelens-001`](https://huggingface.co/mjbommar/mimelens-001) (28 pretrained cells; this is one)
- 🔤 **Tokenizer**: [`mjbommar/binary-tokenizer-001-16k`](https://huggingface.co/mjbommar/binary-tokenizer-001-16k)
- 📄 **Paper**: *MimeLens: Pretrained Encoders for Fine-Grained Content-Type Detection* (Bommarito 2026) -- [GitHub](https://github.com/mjbommar/binary-embedding-paper)

  ---

- ## What is MimeLens?

- **MimeLens classifies file content type from any 4 KB byte window, not just the first 4 KB of a complete file.**

  Existing tools assume whole-file access at a known offset:
  - [`libmagic`](https://github.com/file/file) and [Apache Tika](https://tika.apache.org/) match handcrafted magic-byte signatures, almost always anchored at the file head.
  - [Magika](https://github.com/google/magika) (Google) is a small CNN trained on three 512-byte windows (head, middle, tail) of a known-bounded file.
  - TrID, PRONOM/Siegfried/DROID similarly require a complete file.

- These break down when you only have a fragment. MimeLens is pretrained MLM-only on 1024-token windows sampled *uniformly at random* across files and 64 KB fragments, with no privileged head-of-file position. One model handles streaming, partial-arrival, mid-file, packet-payload, and forensic-carved inputs uniformly. The trade-off is CPU latency (~348× slower than Magika at the medium size) in exchange for libmagic's 125-class taxonomy *plus* position arbitrariness.

- The family ships 28 cells (3 model sizes × 4 input vocabularies × 2–3 random seeds). This README documents one of them.

  ---

@@ -71,7 +73,7 @@ The family ships 28 cells (3 model sizes × 4 input vocabularies × 2–3 random

  - **This cell**: `medium` tier, `bpe-16k` input pipeline, seed `1`
  - **Backbone**: 37.76M parameters (12 layers, hidden 512, 8 attention heads, head dim 64, RoPE, RMSNorm, no biases, no dropout)
- - **Input vocabulary**: `bpe-16k` -- 16,384-entry binary BPE tokenizer (binary-tokenizer-001-16k), ~1.73 bytes/token. Reads ~1,765 bytes of the 4 KB buffer.
  - **Output**: 512-dim mean-pooled body-token embedding
  - **Label space**: [libmagic](https://github.com/file/file) 125-class MIME taxonomy (full list in paper Appendix)
  - **Pretraining**: MLM-only, 30% mask ratio, 33 GB stratified multi-source binary corpus, 22,888 gradient updates, single RTX 4060 Ti, ~18.0 h wall-clock
@@ -101,7 +103,7 @@ Full evaluation (within-cube bootstrap CIs, adversarial sweep, calibration, real

  ## Quick start

- The model ships with a 125-class libmagic-MIME classifier head baked in (the paper's LR probe, re-fit on the full magic-files corpus). The one-liner path:

  ```python
  from transformers import pipeline
@@ -117,7 +119,7 @@ preds = clf(window.decode("latin-1")) # latin-1 is a bijection
  # [{"label": "image/png", "score": 0.97}, {"label": "image/jpeg", "score": 0.01}, ...]
  ```

- For users who want embeddings instead of a classifier (to fit their own probe or fine-tune a head), the encoder-only path:

  ```python
  import torch
@@ -134,11 +136,12 @@ with torch.no_grad():
  embedding = model(**inputs).pooler_output # (1, 512)
  ```

  ---

  ## Recommended deployment regimes

- - **Fine-grained libmagic-taxonomy classification from a clean 4 KB chunk** -- headline cell of the paper.
  - General-purpose deployment when one cell must serve mixed content (image + text + binary).

  ---
@@ -161,7 +164,7 @@ This cell is one point of the pre-registered 3 × 4 × {2,3} factorial cube desc
  - This is one cell of a 28-cell cube. Within-cube comparisons in the paper carry bootstrap CIs at n=3 seeds; some marginal orderings (byte vs bpe-16k at the top of medium) are within seed noise and should be read as ties.
  - The training corpus is one 33 GB stratified multi-source binary sample. Results may not transfer to substantially different corpora.
  - All numbers are computed on data labelled by a single pipeline (libmagic-pinned). Cross-validation against PRONOM, Siegfried, DROID, or IANA reference files is a documented limitation.
- - CPU latency at the `medium` size is ~348× slower than Magika v1.1; for sub-millisecond whole-file triage on broad categories, Magika is purpose-built and is the right tool. MimeLens occupies a different point on the deployment surface (position-arbitrary inputs + libmagic's 125-class taxonomy), not a drop-in replacement.
  - End-to-end fine-tuning on the production label distribution may shift these numbers and should be evaluated before deployment. The frozen-probe numbers above are not claimed as a lower bound on fine-tuned performance.

  ---
 

  # mimelens-001-medium-bpe-16k-s1

+ A 37.76M-backbone-parameter BERT-style encoder for fine-grained file-content-type detection from binary data. Takes any 4 KB byte buffer (regardless of source offset) and produces a 512-dimensional embedding that classifiers map to one of [libmagic](https://github.com/file/file)'s 125 MIME labels. Designed for inputs where you only have a chunk: a forensic-carved fragment, a random disk-block read, a streaming HTTP upload, a single network packet payload.

+ **🔗 Model**: [`mjbommar/mimelens-001-medium-bpe-16k-s1`](https://huggingface.co/mjbommar/mimelens-001-medium-bpe-16k-s1)
+ **👥 Family**: [`mjbommar/mimelens-001`](https://huggingface.co/mjbommar/mimelens-001) (28 pretrained cells; family hub forthcoming)
+ **🔤 Tokenizer**: [`mjbommar/binary-tokenizer-001-16k`](https://huggingface.co/mjbommar/binary-tokenizer-001-16k)
+ **📄 Paper**: *MimeLens: Pretrained Encoders for Fine-Grained Content-Type Detection* (Bommarito 2026). [GitHub](https://github.com/mjbommar/binary-embedding-paper) (source release forthcoming)
+ **📊 Pretraining corpus**: [`mjbommar/binary-30k-tokenized`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized) plus magic-frags, glaurung, Windows drivers (33 GB stratified)

  ---

+ ## What MimeLens does

+ MimeLens classifies file content type from any 4 KB byte window, not just the first 4 KB of a complete file.

  Existing tools assume whole-file access at a known offset:
+
  - [`libmagic`](https://github.com/file/file) and [Apache Tika](https://tika.apache.org/) match handcrafted magic-byte signatures, almost always anchored at the file head.
  - [Magika](https://github.com/google/magika) (Google) is a small CNN trained on three 512-byte windows (head, middle, tail) of a known-bounded file.
  - TrID, PRONOM/Siegfried/DROID similarly require a complete file.

+ These break down on a fragment. MimeLens is pretrained MLM-only on 1024-token windows sampled *uniformly at random* across files and 64 KB fragments, with no privileged head-of-file position. One checkpoint handles streaming, partial-arrival, mid-file, packet-payload, and forensic-carved inputs uniformly. The trade-off is CPU latency (~348× slower than Magika at the medium size) in exchange for libmagic's 125-class taxonomy plus position arbitrariness.

+ The family ships 28 cells: 3 model sizes × 4 input vocabularies × 2 or 3 random seeds. This README documents one of them.
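
To make "any 4 KB byte window" concrete, here is a minimal sketch of cutting a window at a uniformly random offset and turning it into the string form that the quick-start example passes to the pipeline (`window.decode("latin-1")`). The helper names `sample_window` and `to_model_text` are hypothetical and the synthetic `blob` stands in for a real file. This is not the sampler used for pretraining, only an illustration of position-arbitrary input.

```python
import random

WINDOW_BYTES = 4096  # 4 KB window size used throughout the README


def sample_window(data: bytes, rng: random.Random, size: int = WINDOW_BYTES) -> bytes:
    """Return up to `size` bytes starting at a uniformly random offset (hypothetical helper)."""
    if len(data) <= size:
        return data  # inputs shorter than one window are used whole
    start = rng.randrange(len(data) - size + 1)  # every valid offset is equally likely
    return data[start:start + size]


def to_model_text(window: bytes) -> str:
    """Map each byte to the code point with the same value; latin-1 makes this lossless."""
    return window.decode("latin-1")


rng = random.Random(0)                 # seeded only to make the sketch reproducible
blob = bytes(range(256)) * 64          # 16 KB of synthetic bytes standing in for a file
window = sample_window(blob, rng)
text = to_model_text(window)

# Round trip: encoding the string back to latin-1 recovers the original bytes.
assert text.encode("latin-1") == window
```

Because latin-1 assigns one code point to each of the 256 byte values, no byte is dropped or merged on the way into the tokenizer.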

  ---

 

  - **This cell**: `medium` tier, `bpe-16k` input pipeline, seed `1`
  - **Backbone**: 37.76M parameters (12 layers, hidden 512, 8 attention heads, head dim 64, RoPE, RMSNorm, no biases, no dropout)
+ - **Input vocabulary**: `bpe-16k`. 16,384-entry binary BPE tokenizer (binary-tokenizer-001-16k), ~1.73 bytes/token. Reads ~1,765 bytes of the 4 KB buffer.
  - **Output**: 512-dim mean-pooled body-token embedding
  - **Label space**: [libmagic](https://github.com/file/file) 125-class MIME taxonomy (full list in paper Appendix)
  - **Pretraining**: MLM-only, 30% mask ratio, 33 GB stratified multi-source binary corpus, 22,888 gradient updates, single RTX 4060 Ti, ~18.0 h wall-clock
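
The "reads ~1,765 bytes" figure follows roughly from the token budget. The sketch below assumes the inference context matches the 1024-token pretraining window, which the README does not state outright. The README's slightly lower number may reflect a few positions reserved for special tokens, which is also an assumption.

```python
TOKENS_PER_WINDOW = 1024   # assumed equal to the 1024-token pretraining window
BYTES_PER_TOKEN = 1.73     # approximate compression rate quoted for bpe-16k
BUFFER_BYTES = 4096        # the 4 KB input buffer


def bytes_covered(tokens: int, bytes_per_token: float) -> float:
    """Approximate number of input bytes consumed by a token budget."""
    return tokens * bytes_per_token


covered = bytes_covered(TOKENS_PER_WINDOW, BYTES_PER_TOKEN)
fraction = covered / BUFFER_BYTES
print(f"~{covered:.0f} bytes, {fraction:.0%} of the buffer")
```

Under these assumptions the encoder sees under half of a 4 KB buffer, so bytes past that point do not influence the embedding.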
 

  ## Quick start

+ This cell ships with a 125-class libmagic-MIME classifier head baked in (the paper's LR probe, re-fit on the full magic-files corpus), so `pipeline("text-classification", ...)` works out of the box:

  ```python
  from transformers import pipeline
 
  # [{"label": "image/png", "score": 0.97}, {"label": "image/jpeg", "score": 0.01}, ...]
  ```

+ To work with embeddings directly (fit a probe, kNN over a gallery, fine-tune a head):

  ```python
  import torch
 
  embedding = model(**inputs).pooler_output # (1, 512)
  ```
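
As one illustration of the "kNN over a gallery" option, the sketch below does a cosine-similarity majority vote in plain Python. The function names, the 3-dimensional toy vectors, and the gallery labels are all made up for the example. Real embeddings would be 512-dimensional and could be converted from the tensor with something like `embedding.squeeze(0).tolist()`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_label(query, gallery, k=3):
    """Majority vote over the k gallery items most similar to `query`.

    `gallery` is a list of (embedding, label) pairs (hypothetical layout).
    """
    ranked = sorted(gallery, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = {}
    for _, label in ranked[:k]:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


# Toy gallery: 3-dim vectors stand in for 512-dim embeddings.
gallery = [
    ([1.0, 0.0, 0.1], "image/png"),
    ([0.9, 0.1, 0.0], "image/png"),
    ([0.0, 1.0, 0.0], "application/pdf"),
    ([0.1, 0.9, 0.1], "application/pdf"),
]
query = [0.95, 0.05, 0.05]
print(knn_label(query, gallery, k=3))
```

For a large gallery a vectorized or approximate nearest-neighbour index would be the more practical choice. The pure-Python version is only meant to show the logic.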

+
  ---

  ## Recommended deployment regimes

+ - **Fine-grained libmagic-taxonomy classification from a clean 4 KB chunk**: headline cell of the paper.
  - General-purpose deployment when one cell must serve mixed content (image + text + binary).

  ---
 
  - This is one cell of a 28-cell cube. Within-cube comparisons in the paper carry bootstrap CIs at n=3 seeds; some marginal orderings (byte vs bpe-16k at the top of medium) are within seed noise and should be read as ties.
  - The training corpus is one 33 GB stratified multi-source binary sample. Results may not transfer to substantially different corpora.
  - All numbers are computed on data labelled by a single pipeline (libmagic-pinned). Cross-validation against PRONOM, Siegfried, DROID, or IANA reference files is a documented limitation.
+ - CPU latency at the `medium` size is ~348× slower than Magika v1.1. For sub-millisecond whole-file triage on broad categories, Magika is purpose-built and is the right tool. MimeLens occupies a different point on the deployment surface (position-arbitrary inputs + libmagic's 125-class taxonomy), not a drop-in replacement.
  - End-to-end fine-tuning on the production label distribution may shift these numbers and should be evaluated before deployment. The frozen-probe numbers above are not claimed as a lower bound on fine-tuned performance.

  ---