---
language: en
license: apache-2.0
---

# BERT Hash Nano Models

This is a set of three Nano [BERT](https://arxiv.org/abs/1810.04805) models with a modified embeddings layer. Each of the 30,522 tokens in the standard BERT vocabulary is mapped to a low-dimensional vector, which is then projected up to the hidden size. This method is inspired by [MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings](https://arxiv.org/abs/2405.19504).

The low-dimensional vector acts like a hash, and the number of projections sets its size. Setting the projections parameter to 5 is like generating a 160-bit hash (5 x float32) for each token. That hash is then projected to the hidden size.
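
As a rough illustration of the idea, the sketch below builds such a layer in PyTorch. The class name `HashTokenEmbeddings`, its attribute names and the bias-free projection are assumptions made for this example; the actual model code may differ.

```python
import torch
from torch import nn


class HashTokenEmbeddings(nn.Module):
    """Illustrative stand-in for the modified embeddings layer (hypothetical name)."""

    def __init__(self, vocab_size=30522, projections=5, hidden_size=768):
        super().__init__()
        # One small vector per token: the "hash" (projections x float32)
        self.hashes = nn.Embedding(vocab_size, projections)
        # Shared matrix that projects each hash up to the hidden size.
        # bias=False is an assumption for this sketch, not a confirmed detail.
        self.project = nn.Linear(projections, hidden_size, bias=False)

    def forward(self, input_ids):
        # (batch, sequence) -> (batch, sequence, projections) -> (batch, sequence, hidden_size)
        return self.project(self.hashes(input_ids))


layer = HashTokenEmbeddings()
input_ids = torch.tensor([[101, 7592, 2088, 102]])  # arbitrary token ids for illustration
output = layer(input_ids)

print(tuple(output.shape))
print(sum(p.numel() for p in layer.parameters()))
```

The per-token lookup table stays small because only the shared projection matrix scales with the hidden size.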

This significantly reduces the number of parameters needed for token embeddings.

For example:

Standard token embeddings:
- 30,522 (vocab size) x 768 (hidden size) = 23,440,896 parameters
- 23,440,896 x 4 (float32) = 93,763,584 bytes

Hash token embeddings:
- 30,522 (vocab size) x 5 (hash buckets) + 5 x 768 (projection matrix) = 156,450 parameters
- 156,450 x 4 (float32) = 625,800 bytes
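
The arithmetic above can be reproduced with a few lines of plain Python. This is only a sanity check of the numbers; the helper name `embedding_parameters` is made up for this example.

```python
BYTES_PER_FLOAT32 = 4


def embedding_parameters(vocab_size, hidden_size, projections=None):
    """Count token embedding parameters, with or without the hash projection."""
    if projections is None:
        # Standard embeddings: one hidden_size vector per token
        return vocab_size * hidden_size
    # Hash embeddings: one small vector per token plus a shared projection matrix
    return vocab_size * projections + projections * hidden_size


standard = embedding_parameters(30522, 768)
hashed = embedding_parameters(30522, 768, projections=5)

print(f"standard: {standard:,} parameters, {standard * BYTES_PER_FLOAT32:,} bytes")
print(f"hash:     {hashed:,} parameters, {hashed * BYTES_PER_FLOAT32:,} bytes")
print(f"reduction: {standard / hashed:.1f}x")
```

Changing `projections` shows how the size of the hash trades off against the parameter count.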

These models are pre-trained on the same training corpus as BERT (with a copy of Wikipedia from 2025), as recommended in the paper [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962).

The table below shows a subset of GLUE scores on the dev set, produced with the [script provided by Hugging Face Transformers](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_glue.py) and the following parameters.

```bash
python run_glue.py \
  --model_name_or_path <model path> \
  --task_name <task name> \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 1e-4 \
  --num_train_epochs 4 \
  --output_dir outputs \
  --trust-remote-code True
```

| Model | Parameters | MNLI (acc m/mm) | MRPC (f1/acc) | SST-2 (acc) |
| ----- | ---------- | --------------- | ------------- | ----------- |
| [baseline (bert-tiny)](https://hf.co/google/bert_uncased_L-2_H-128_A-2) | 4.4M | 0.7114 / 0.7161 | 0.8318 / 0.7353 | 0.8222 |
| [bert-hash-femto](https://hf.co/neuml/bert-hash-femto) | 0.243M | 0.5697 / 0.5750 | 0.8122 / 0.6838 | 0.7821 |
| [bert-hash-pico](https://hf.co/neuml/bert-hash-pico) | 0.448M | 0.6228 / 0.6363 | 0.8205 / 0.7083 | 0.7878 |
| [**bert-hash-nano**](https://hf.co/neuml/bert-hash-nano) | **0.969M** | **0.6565 / 0.6670** | **0.8172 / 0.7083** | **0.8131** |

## Usage

These models can be loaded with Hugging Face Transformers as follows. Because this is a custom architecture, `trust_remote_code=True` must be set.

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("neuml/bert-hash-nano", trust_remote_code=True)
```

## Training

Training your own Nano model is simple. All you need is a Hugging Face dataset and the code below using [txtai](https://github.com/neuml/txtai).

```python
from datasets import load_dataset
from transformers import AutoTokenizer

from txtai.pipeline import HFTrainer

# Assumes configuration_bert_hash.py and modeling_bert_hash.py are on the local import path
from configuration_bert_hash import *
from modeling_bert_hash import *

# Replace with the target Hugging Face dataset
dataset = load_dataset("path to target HF dataset")

# Reuse the standard BERT vocabulary
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Model dimensions, including the number of hash projections
config = BertHashConfig(
    hidden_size=128,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=512,
    projections=16
)
model = BertHashForMaskedLM(config)

print(config)
print("Total parameters:", sum(p.numel() for p in model.bert.parameters()))

train = HFTrainer()

# Train using MLM
train((model, tokenizer), dataset, task="language-modeling", output_dir="model",
    fp16=True, learning_rate=1e-3, per_device_train_batch_size=64, num_train_epochs=3,
    warmup_steps=2500, weight_decay=0.01, adam_epsilon=1e-6,
    tokenizers=True, dataloader_num_workers=20,
    save_strategy="steps", save_steps=5000, logging_steps=500,
)
```

## Future Work

These models demonstrate that even very small models can still be productive.

The hope is that this work opens the door for many to build small encoder models that pack a punch. Models can be trained in a matter of hours on consumer GPUs.

Imagine more specialized models like this for medical, legal, science and more.