Text Classification
Transformers
ONNX
Safetensors
English
mimelens
file-type-detection
mime-classification
binary-content
binary-analysis
position-agnostic
libmagic
forensics
packet-inspection
bpe
byte-pair-encoding
custom_code
Eval Results (legacy)
Instructions to use mjbommar/mimelens-001-medium-bpe-16k-s1 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use mjbommar/mimelens-001-medium-bpe-16k-s1 with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-classification", model="mjbommar/mimelens-001-medium-bpe-16k-s1", trust_remote_code=True)# Load model directly from transformers import AutoModelForSequenceClassification model = AutoModelForSequenceClassification.from_pretrained("mjbommar/mimelens-001-medium-bpe-16k-s1", trust_remote_code=True, dtype="auto") - Notebooks
- Google Colab
- Kaggle
README: badge lines as bullets so each renders on its own line
Browse files
README.md
CHANGED
|
@@ -45,11 +45,11 @@ model-index:
|
|
| 45 |
|
| 46 |
A 37.76M-backbone-parameter BERT-style encoder for fine-grained file-content-type detection from binary data. Takes any 4 KB byte buffer (regardless of source offset) and produces a 512-dimensional embedding that classifiers map to one of [libmagic](https://github.com/file/file)'s 125 MIME labels. Designed for inputs where you only have a chunk: a forensic-carved fragment, a random disk-block read, a streaming HTTP upload, a single network packet payload.
|
| 47 |
|
| 48 |
-
**π Model**: [`mjbommar/mimelens-001-medium-bpe-16k-s1`](https://huggingface.co/mjbommar/mimelens-001-medium-bpe-16k-s1)
|
| 49 |
-
**π₯ Family**: [`mjbommar/mimelens-001`](https://huggingface.co/mjbommar/mimelens-001) (28 pretrained cells; family hub forthcoming)
|
| 50 |
-
**π€ Tokenizer**: [`mjbommar/binary-tokenizer-001-16k`](https://huggingface.co/mjbommar/binary-tokenizer-001-16k)
|
| 51 |
-
**π Paper**: *MimeLens: Pretrained Encoders for Fine-Grained Content-Type Detection* (Bommarito 2026). [GitHub](https://github.com/mjbommar/binary-embedding-paper) (source release forthcoming)
|
| 52 |
-
**π Pretraining corpus**: [`mjbommar/binary-30k-tokenized`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized) plus magic-frags, glaurung, Windows drivers (33 GB stratified)
|
| 53 |
|
| 54 |
---
|
| 55 |
|
|
|
|
| 45 |
|
| 46 |
A 37.76M-backbone-parameter BERT-style encoder for fine-grained file-content-type detection from binary data. Takes any 4 KB byte buffer (regardless of source offset) and produces a 512-dimensional embedding that classifiers map to one of [libmagic](https://github.com/file/file)'s 125 MIME labels. Designed for inputs where you only have a chunk: a forensic-carved fragment, a random disk-block read, a streaming HTTP upload, a single network packet payload.
|
| 47 |
|
| 48 |
+
- **π Model**: [`mjbommar/mimelens-001-medium-bpe-16k-s1`](https://huggingface.co/mjbommar/mimelens-001-medium-bpe-16k-s1)
|
| 49 |
+
- **π₯ Family**: [`mjbommar/mimelens-001`](https://huggingface.co/mjbommar/mimelens-001) (28 pretrained cells; family hub forthcoming)
|
| 50 |
+
- **π€ Tokenizer**: [`mjbommar/binary-tokenizer-001-16k`](https://huggingface.co/mjbommar/binary-tokenizer-001-16k)
|
| 51 |
+
- **π Paper**: *MimeLens: Pretrained Encoders for Fine-Grained Content-Type Detection* (Bommarito 2026). [GitHub](https://github.com/mjbommar/binary-embedding-paper) (source release forthcoming)
|
| 52 |
+
- **π Pretraining corpus**: [`mjbommar/binary-30k-tokenized`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized) plus magic-frags, glaurung, Windows drivers (33 GB stratified)
|
| 53 |
|
| 54 |
---
|
| 55 |
|