README review pass: lead-paragraph hygiene, badge consistency, drop self-thanks, soften 'robust' language
README.md
CHANGED
|
@@ -43,27 +43,29 @@ model-index:
|
|
| 43 |
|
| 44 |
# mimelens-001-medium-bpe-16k-s1
|
| 45 |
|
| 46 |
-
A
|
| 47 |
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
|
|
|
| 52 |
|
| 53 |
---
|
| 54 |
|
| 55 |
-
## What
|
| 56 |
|
| 57 |
-
|
| 58 |
|
| 59 |
Existing tools assume whole-file access at a known offset:
|
|
|
|
| 60 |
- [`libmagic`](https://github.com/file/file) and [Apache Tika](https://tika.apache.org/) match handcrafted magic-byte signatures, almost always anchored at the file head.
|
| 61 |
- [Magika](https://github.com/google/magika) (Google) is a small CNN trained on three 512-byte windows (head, middle, tail) of a known-bounded file.
|
| 62 |
- TrID, PRONOM/Siegfried/DROID similarly require a complete file.
|
| 63 |
|
| 64 |
-
These break down
|
| 65 |
|
| 66 |
-
The family ships 28 cells
|
| 67 |
|
| 68 |
---
|
| 69 |
|
|
@@ -71,7 +73,7 @@ The family ships 28 cells (3 model sizes × 4 input vocabularies × 2–3 random
|
|
| 71 |
|
| 72 |
- **This cell**: `medium` tier, `bpe-16k` input pipeline, seed `1`
|
| 73 |
- **Backbone**: 37.76M parameters (12 layers, hidden 512, 8 attention heads, head dim 64, RoPE, RMSNorm, no biases, no dropout)
|
| 74 |
-
- **Input vocabulary**: `bpe-16k`
|
| 75 |
- **Output**: 512-dim mean-pooled body-token embedding
|
| 76 |
- **Label space**: [libmagic](https://github.com/file/file) 125-class MIME taxonomy (full list in paper Appendix)
|
| 77 |
- **Pretraining**: MLM-only, 30% mask ratio, 33 GB stratified multi-source binary corpus, 22,888 gradient updates, single RTX 4060 Ti, ~18.0 h wall-clock
|
|
@@ -101,7 +103,7 @@ Full evaluation (within-cube bootstrap CIs, adversarial sweep, calibration, real
|
|
| 101 |
|
| 102 |
## Quick start
|
| 103 |
|
| 104 |
-
|
| 105 |
|
| 106 |
```python
|
| 107 |
from transformers import pipeline
|
|
@@ -117,7 +119,7 @@ preds = clf(window.decode("latin-1")) # latin-1 is a bijection
|
|
| 117 |
# [{"label": "image/png", "score": 0.97}, {"label": "image/jpeg", "score": 0.01}, ...]
|
| 118 |
```
|
| 119 |
|
| 120 |
-
|
| 121 |
|
| 122 |
```python
|
| 123 |
import torch
|
|
@@ -134,11 +136,12 @@ with torch.no_grad():
|
|
| 134 |
embedding = model(**inputs).pooler_output # (1, 512)
|
| 135 |
```
|
| 136 |
|
|
|
|
| 137 |
---
|
| 138 |
|
| 139 |
## Recommended deployment regimes
|
| 140 |
|
| 141 |
-
- **Fine-grained libmagic-taxonomy classification from a clean 4 KB chunk**
|
| 142 |
- General-purpose deployment when one cell must serve mixed content (image + text + binary).
|
| 143 |
|
| 144 |
---
|
|
@@ -161,7 +164,7 @@ This cell is one point of the pre-registered 3 × 4 × {2,3} factorial cube desc
|
|
| 161 |
- This is one cell of a 28-cell cube. Within-cube comparisons in the paper carry bootstrap CIs at n=3 seeds; some marginal orderings (byte vs bpe-16k at the top of medium) are within seed noise and should be read as ties.
|
| 162 |
- The training corpus is one 33 GB stratified multi-source binary sample. Results may not transfer to substantially different corpora.
|
| 163 |
- All numbers are computed on data labelled by a single pipeline (libmagic-pinned). Cross-validation against PRONOM, Siegfried, DROID, or IANA reference files is a documented limitation.
|
| 164 |
-
- CPU latency at the `medium` size is ~348× slower than Magika v1.1
|
| 165 |
- End-to-end fine-tuning on the production label distribution may shift these numbers and should be evaluated before deployment. The frozen-probe numbers above are not claimed as a lower bound on fine-tuned performance.
|
| 166 |
|
| 167 |
---
|
|
|
|
| 43 |
|
| 44 |
# mimelens-001-medium-bpe-16k-s1
|
| 45 |
|
| 46 |
+
A 37.76M-backbone-parameter BERT-style encoder for fine-grained file-content-type detection from binary data. Takes any 4 KB byte buffer (regardless of source offset) and produces a 512-dimensional embedding that classifiers map to one of [libmagic](https://github.com/file/file)'s 125 MIME labels. Designed for inputs where you only have a chunk: a forensic-carved fragment, a random disk-block read, a streaming HTTP upload, a single network packet payload.
|
| 47 |
|
| 48 |
+
**Model**: [`mjbommar/mimelens-001-medium-bpe-16k-s1`](https://huggingface.co/mjbommar/mimelens-001-medium-bpe-16k-s1)
|
| 49 |
+
**Family**: [`mjbommar/mimelens-001`](https://huggingface.co/mjbommar/mimelens-001) (28 pretrained cells; family hub forthcoming)
|
| 50 |
+
**Tokenizer**: [`mjbommar/binary-tokenizer-001-16k`](https://huggingface.co/mjbommar/binary-tokenizer-001-16k)
|
| 51 |
+
**Paper**: *MimeLens: Pretrained Encoders for Fine-Grained Content-Type Detection* (Bommarito 2026). [GitHub](https://github.com/mjbommar/binary-embedding-paper) (source release forthcoming)
|
| 52 |
+
**Pretraining corpus**: [`mjbommar/binary-30k-tokenized`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized) plus magic-frags, glaurung, Windows drivers (33 GB stratified)
|
| 53 |
|
| 54 |
---
|
| 55 |
|
| 56 |
+
## What MimeLens does
|
| 57 |
|
| 58 |
+
MimeLens classifies file content type from any 4 KB byte window, not just the first 4 KB of a complete file.
|
| 59 |
|
| 60 |
Existing tools assume whole-file access at a known offset:
|
| 61 |
+
|
| 62 |
- [`libmagic`](https://github.com/file/file) and [Apache Tika](https://tika.apache.org/) match handcrafted magic-byte signatures, almost always anchored at the file head.
|
| 63 |
- [Magika](https://github.com/google/magika) (Google) is a small CNN trained on three 512-byte windows (head, middle, tail) of a known-bounded file.
|
| 64 |
- TrID, PRONOM/Siegfried/DROID similarly require a complete file.
|
| 65 |
|
| 66 |
+
These break down on a fragment. MimeLens is pretrained MLM-only on 1024-token windows sampled *uniformly at random* across files and 64 KB fragments, with no privileged head-of-file position. One checkpoint handles streaming, partial-arrival, mid-file, packet-payload, and forensic-carved inputs uniformly. The trade-off is CPU latency (~348× slower than Magika at the medium size) in exchange for libmagic's 125-class taxonomy plus position arbitrariness.
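
To make "position-arbitrary" concrete, the sketch below cuts a 4 KB window at a uniformly random byte offset, which is one way to simulate a mid-file fragment when evaluating the model. It is illustrative only and is not the pretraining sampler from the paper; `sample_window` and `WINDOW_BYTES` are hypothetical names introduced here.

```python
import random

WINDOW_BYTES = 4096  # the 4 KB buffer size quoted in this README


def sample_window(data: bytes, rng: random.Random, size: int = WINDOW_BYTES) -> bytes:
    """Return `size` bytes starting at a uniformly random offset (hypothetical helper)."""
    if len(data) <= size:
        return data  # inputs shorter than one window are returned whole
    # Every valid start offset is equally likely, so the file head is not privileged.
    start = rng.randrange(len(data) - size + 1)
    return data[start:start + size]


rng = random.Random(0)               # seeded for reproducibility
blob = bytes(range(256)) * 64        # 16 KB of synthetic data standing in for a file
window = sample_window(blob, rng)
```

The resulting `window` can be passed to the classifier in place of a head-of-file read.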
|
| 67 |
|
| 68 |
+
The family ships 28 cells: 3 model sizes × 4 input vocabularies × 2 or 3 random seeds. This README documents one of them.
|
| 69 |
|
| 70 |
---
|
| 71 |
|
|
|
|
| 73 |
|
| 74 |
- **This cell**: `medium` tier, `bpe-16k` input pipeline, seed `1`
|
| 75 |
- **Backbone**: 37.76M parameters (12 layers, hidden 512, 8 attention heads, head dim 64, RoPE, RMSNorm, no biases, no dropout)
|
| 76 |
+
- **Input vocabulary**: `bpe-16k`. 16,384-entry binary BPE tokenizer (binary-tokenizer-001-16k), ~1.73 bytes/token. Reads ~1,765 bytes of the 4 KB buffer.
|
| 77 |
- **Output**: 512-dim mean-pooled body-token embedding
|
| 78 |
- **Label space**: [libmagic](https://github.com/file/file) 125-class MIME taxonomy (full list in paper Appendix)
|
| 79 |
- **Pretraining**: MLM-only, 30% mask ratio, 33 GB stratified multi-source binary corpus, 22,888 gradient updates, single RTX 4060 Ti, ~18.0 h wall-clock
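
As a rough check on the input-vocabulary figures above, the arithmetic below works out how many tokens and what fraction of the 4 KB buffer the quoted numbers imply. All three constants are the approximate values stated in this README; the variable names are illustrative, and the sketch makes no claim about how the tokenizer spends its context budget internally.

```python
BUFFER_BYTES = 4096       # "4 KB buffer" from this README
BYTES_PER_TOKEN = 1.73    # approximate bytes/token quoted for bpe-16k
BYTES_READ = 1765         # approximate bytes consumed, as quoted

# Tokens implied by the quoted figures, and the share of the buffer they cover.
implied_tokens = BYTES_READ / BYTES_PER_TOKEN
coverage = BYTES_READ / BUFFER_BYTES

print(f"implied tokens:  {implied_tokens:.0f}")   # 1020
print(f"buffer coverage: {coverage:.1%}")         # 43.1%
```

In other words, with this vocabulary the model sees a little under half of each 4 KB buffer, consistent with a context of roughly 1024 tokens.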
|
|
|
|
| 103 |
|
| 104 |
## Quick start
|
| 105 |
|
| 106 |
+
This cell ships with a 125-class libmagic-MIME classifier head baked in (the paper's LR probe, re-fit on the full magic-files corpus), so `pipeline("text-classification", ...)` works out of the box:
|
| 107 |
|
| 108 |
```python
|
| 109 |
from transformers import pipeline
|
|
|
|
| 119 |
# [{"label": "image/png", "score": 0.97}, {"label": "image/jpeg", "score": 0.01}, ...]
|
| 120 |
```
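
The quick-start snippet passes bytes to the text pipeline via `window.decode("latin-1")`. The short check below illustrates why that is lossless: latin-1 maps each of the 256 byte values to exactly one code point, so the round trip recovers the original bytes. It uses only the Python standard library and does not load the model.

```python
window = bytes(range(256))          # every possible byte value, once

text = window.decode("latin-1")     # never raises: all 256 values are defined
round_trip = text.encode("latin-1")

# One character per byte, and the code point equals the byte value.
same_length = len(text) == len(window)
same_values = [ord(ch) for ch in text] == list(window)
lossless = round_trip == window
```

By contrast, decoding arbitrary binary data as UTF-8 can fail or substitute replacement characters, which would corrupt the input.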
|
| 121 |
|
| 122 |
+
To work with embeddings directly (fit a probe, kNN over a gallery, fine-tune a head):
|
| 123 |
|
| 124 |
```python
|
| 125 |
import torch
|
|
|
|
| 136 |
embedding = model(**inputs).pooler_output # (1, 512)
|
| 137 |
```
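
For the "kNN over a gallery" use mentioned above, a minimal sketch follows. It assumes you have already collected labelled 512-dim embeddings into a NumPy array; synthetic vectors stand in for real embeddings here, and `knn_predict` is a hypothetical helper, not part of the released code.

```python
import numpy as np


def knn_predict(query, gallery, labels, k=3):
    """Majority vote over the k gallery embeddings most cosine-similar to `query`."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q                          # cosine similarity to each gallery row
    top = np.argsort(-sims)[:k]           # indices of the k nearest neighbours
    votes = {}
    for i in top:
        votes[labels[i]] = votes.get(labels[i], 0) + 1
    return max(votes, key=votes.get)


# Synthetic stand-ins for (N, 512) embeddings: two tight clusters.
rng = np.random.default_rng(0)
centers = rng.normal(size=(2, 512))
gallery = np.vstack([
    centers[0] + 0.1 * rng.normal(size=(5, 512)),
    centers[1] + 0.1 * rng.normal(size=(5, 512)),
])
labels = ["image/png"] * 5 + ["application/pdf"] * 5

query = centers[1] + 0.1 * rng.normal(size=512)
pred = knn_predict(query, gallery, labels)
```

With real data, `gallery` would hold embeddings computed as in the snippet above (converted with `.numpy()`), and `labels` the corresponding MIME types.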
|
| 138 |
|
| 139 |
+
|
| 140 |
---
|
| 141 |
|
| 142 |
## Recommended deployment regimes
|
| 143 |
|
| 144 |
+
- **Fine-grained libmagic-taxonomy classification from a clean 4 KB chunk**: headline cell of the paper.
|
| 145 |
- General-purpose deployment when one cell must serve mixed content (image + text + binary).
|
| 146 |
|
| 147 |
---
|
|
|
|
| 164 |
- This is one cell of a 28-cell cube. Within-cube comparisons in the paper carry bootstrap CIs at n=3 seeds; some marginal orderings (byte vs bpe-16k at the top of medium) are within seed noise and should be read as ties.
|
| 165 |
- The training corpus is one 33 GB stratified multi-source binary sample. Results may not transfer to substantially different corpora.
|
| 166 |
- All numbers are computed on data labelled by a single pipeline (libmagic-pinned). Cross-validation against PRONOM, Siegfried, DROID, or IANA reference files is a documented limitation.
|
| 167 |
+
- CPU latency at the `medium` size is ~348× slower than Magika v1.1. For sub-millisecond whole-file triage on broad categories, Magika is purpose-built and is the right tool. MimeLens occupies a different point on the deployment surface (position-arbitrary inputs + libmagic's 125-class taxonomy), not a drop-in replacement.
|
| 168 |
- End-to-end fine-tuning on the production label distribution may shift these numbers and should be evaluated before deployment. The frozen-probe numbers above are not claimed as a lower bound on fine-tuned performance.
|
| 169 |
|
| 170 |
---
|