	---
	language: en
	license: apache-2.0
	---

	# BERT Hash Nano Models

	This is a set of three Nano [BERT](https://arxiv.org/abs/1810.04805) models with a modified embeddings layer. The layer keeps the standard BERT vocabulary (30,522 tokens) but first maps each token to a low-dimensional vector, then projects that vector up to the hidden size. This method is inspired by [MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings](https://arxiv.org/abs/2405.19504).

	The number of projections works like a hash size. Setting the `projections` parameter to 5 is like generating a 160-bit hash (5 x 32-bit floats) for each token. That hash is then projected to the hidden size.
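The two-stage lookup described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the repository's actual code: the class name `HashEmbedding`, its attributes, and the random initialization are all hypothetical. In PyTorch terms the structure would roughly correspond to an `nn.Embedding` followed by a bias-free `nn.Linear`, which is also an assumption about the design.

```python
import random


class HashEmbedding:
    """Toy two-stage token embedding: id -> short "hash" vector -> hidden vector."""

    def __init__(self, vocab_size, projections, hidden_size, seed=0):
        rng = random.Random(seed)
        # Stage 1: one short vector (the "hash") of `projections` floats per token
        self.hashes = [
            [rng.gauss(0.0, 0.02) for _ in range(projections)]
            for _ in range(vocab_size)
        ]
        # Stage 2: a single projections x hidden_size matrix shared by all tokens
        self.projection = [
            [rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
            for _ in range(projections)
        ]

    def embed(self, token_id):
        """Look up the token's hash, then project it up to the hidden size."""
        token_hash = self.hashes[token_id]
        return [
            sum(value * row[column] for value, row in zip(token_hash, self.projection))
            for column in range(len(self.projection[0]))
        ]


embedding = HashEmbedding(vocab_size=30522, projections=5, hidden_size=768)
token_id = 2023  # arbitrary id, chosen only for illustration

hash_bits = len(embedding.hashes[token_id]) * 32  # 5 float32 values -> 160 bits
vector = embedding.embed(token_id)

print(hash_bits)    # 160
print(len(vector))  # 768
```

Only the short per-token vectors grow with the vocabulary; the projection matrix is shared by every token.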

	This significantly reduces the number of parameters necessary for token embeddings.

	For example:

	Standard token embeddings:
	- 30,522 (vocab size) x 768 (hidden size) = 23,440,896 parameters
	- 23,440,896 x 4 (float32) = 93,763,584 bytes

	Hash token embeddings:
	- 30,522 (vocab size) x 5 (hash buckets) + 5 x 768 (projection matrix) = 156,450 parameters
	- 156,450 x 4 (float32) = 625,800 bytes
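The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function names are hypothetical helpers, not part of the model code, and the count assumes a bias-free projection matrix, which is what the figures above imply.

```python
VOCAB_SIZE = 30522       # BERT vocabulary size
BYTES_PER_FLOAT32 = 4


def standard_embedding_params(hidden_size, vocab_size=VOCAB_SIZE):
    """One full hidden-size vector per token."""
    return vocab_size * hidden_size


def hash_embedding_params(hidden_size, projections, vocab_size=VOCAB_SIZE):
    """One short vector per token plus a shared, bias-free projection matrix."""
    return vocab_size * projections + projections * hidden_size


# (hidden size, projections): the example above, then the Training section config
for hidden_size, projections in [(768, 5), (128, 16)]:
    standard = standard_embedding_params(hidden_size)
    hashed = hash_embedding_params(hidden_size, projections)
    print(f"hidden={hidden_size} projections={projections}")
    print(f"  standard: {standard:,} parameters, {standard * BYTES_PER_FLOAT32:,} bytes")
    print(f"  hash:     {hashed:,} parameters, {hashed * BYTES_PER_FLOAT32:,} bytes")
```

The savings come from replacing the `vocab_size x hidden_size` term with `vocab_size x projections`, so they grow as the hidden size grows relative to the number of projections.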

	These models are pre-trained on the same training corpus as BERT (with a copy of Wikipedia from 2025) as recommended in the paper [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962).

	Here is a subset of GLUE scores on the dev set using the [script provided by Hugging Face Transformers](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_glue.py) with the following parameters.

	```bash
	python run_glue.py \
	  --model_name_or_path <model path> \
	  --task_name <task name> \
	  --do_train \
	  --do_eval \
	  --max_seq_length 128 \
	  --per_device_train_batch_size 32 \
	  --learning_rate 1e-4 \
	  --num_train_epochs 4 \
	  --output_dir outputs \
	  --trust_remote_code True
	```

	| Model | Parameters | MNLI (acc m/mm) | MRPC (f1/acc) | SST-2 (acc) |
	| ----- | ---------- | --------------- | ------------- | ----------- |
	| [baseline (bert-tiny)](https://hf.co/google/bert_uncased_L-2_H-128_A-2) | 4.4M | 0.7114 / 0.7161 | 0.8318 / 0.7353 | 0.8222 |
	| [bert-hash-femto](https://hf.co/neuml/bert-hash-femto) | 0.243M | 0.5697 / 0.5750 | 0.8122 / 0.6838 | 0.7821 |
	| [bert-hash-pico](https://hf.co/neuml/bert-hash-pico) | 0.448M | 0.6228 / 0.6363 | 0.8205 / 0.7083 | 0.7878 |
	| [bert-hash-nano](https://hf.co/neuml/bert-hash-nano) | 0.969M | 0.6565 / 0.6670 | 0.8172 / 0.7083 | 0.8131 |
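To put the table in perspective, the sketch below expresses each model's size and SST-2 accuracy relative to the bert-tiny baseline. The numbers are copied from the table; the variable and function names are illustrative only.

```python
# Values copied from the table: (parameters in millions, SST-2 dev accuracy)
BASELINE = (4.4, 0.8222)  # bert-tiny
MODELS = {
    "bert-hash-femto": (0.243, 0.7821),
    "bert-hash-pico": (0.448, 0.7878),
    "bert-hash-nano": (0.969, 0.8131),
}


def relative_to_baseline(params, score, baseline=BASELINE):
    """Return (fraction of baseline parameters, fraction of baseline score)."""
    return params / baseline[0], score / baseline[1]


for name, (params, score) in MODELS.items():
    size_ratio, score_ratio = relative_to_baseline(params, score)
    print(f"{name}: {size_ratio:.1%} of baseline parameters, "
          f"{score_ratio:.1%} of baseline SST-2 accuracy")
```

The same comparison can be repeated for the MNLI and MRPC columns, where the gap to the baseline is larger.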

	## Usage

	These models can be loaded using Hugging Face Transformers as follows. Because this is a custom architecture, `trust_remote_code=True` must be set.

	```python
	from transformers import AutoModel

	model = AutoModel.from_pretrained("neuml/bert-hash-nano", trust_remote_code=True)
	```

	## Training

	Training your own Nano model is simple. All you need is a Hugging Face dataset and the code below using [txtai](https://github.com/neuml/txtai).

	```python
	from datasets import concatenate_datasets, load_dataset
	from transformers import AutoTokenizer

	from txtai.pipeline import HFTrainer

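	# Custom model code: these modules must be importable from the working directory
	from configuration_bert_hash import BertHashConfig
	from modeling_bert_hash import BertHashForMaskedLM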

	dataset = load_dataset("path to target HF dataset")

	tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

	config = BertHashConfig(
	    hidden_size=128,
	    num_hidden_layers=2,
	    num_attention_heads=2,
	    intermediate_size=512,
	    projections=16
	)
	model = BertHashForMaskedLM(config)

	print(config)
	print("Total parameters:", sum(p.numel() for p in model.bert.parameters()))

	train = HFTrainer()

	# Train using MLM
	train((model, tokenizer), dataset, task="language-modeling", output_dir="model",
	      fp16=True, learning_rate=1e-3, per_device_train_batch_size=64, num_train_epochs=3,
	      warmup_steps=2500, weight_decay=0.01, adam_epsilon=1e-6,
	      tokenizers=True, dataloader_num_workers=20,
	      save_strategy="steps", save_steps=5000, logging_steps=500,
	)
	```

	## Future Work

	These models demonstrate that much smaller models can still be productive.
	
	The hope is that this work opens the door for many to build small encoder models that pack a punch. Models can be trained in a matter of hours using consumer GPUs.
	
	Imagine more specialized models like these for medical, legal, scientific and other domains.