	---
	language: [en, es]
	license: cc-by-nc-4.0
	base_model: CohereForAI/aya-expanse-8b
	tags:
	- translation
	- machine-translation
	- aya-expanse
	- layer-pruning
	- interpretability
	pipeline_tag: translation
	---

	# aya-enes-I2-8

	English-to-Spanish translation model derived from
	[CohereForAI/aya-expanse-8b](https://huggingface.co/CohereForAI/aya-expanse-8b)
	(base model: 32 layers, 8B parameters).

	## Recipe

	IFR-guided layer pruning (8 middle layers removed), combined with LoRA fine-tuning and knowledge distillation from Aya-Expanse 32B.

	- Number of transformer layers: 24 (of the original 32)
	- Layers removed: `[8, 10, 11, 12, 13, 14, 15, 16]`
	- Pruning method: IFR (Information Flow Routes)
	- Fine-tuning: LoRA (r=16, alpha=32), 3 epochs on News Commentary v18 en-es
	- Distillation: synthetic translations from Aya-Expanse 32B, filtered to COMET >= 0.7
	- Precision: fp16
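
The layer arithmetic in the list above can be checked with a short sketch. It assumes the removed indices are 0-based positions in the base model's 32-layer stack, which this card does not state explicitly; `kept_layer_indices` is a hypothetical helper and not part of the linked pipeline.

```python
# Sketch only: assumes 0-based layer indices into the 32-layer base model.
NUM_BASE_LAYERS = 32
REMOVED_LAYERS = [8, 10, 11, 12, 13, 14, 15, 16]  # copied from the recipe


def kept_layer_indices(num_layers, removed):
    """Return the indices of the layers that survive pruning, in order."""
    removed_set = set(removed)
    if not removed_set <= set(range(num_layers)):
        raise ValueError("removed index outside the layer range")
    return [i for i in range(num_layers) if i not in removed_set]


kept = kept_layer_indices(NUM_BASE_LAYERS, REMOVED_LAYERS)
print(len(kept))  # 24
print(kept[:10])  # [0, 1, 2, 3, 4, 5, 6, 7, 9, 17]
```

Under that indexing assumption, layer 9 stays in the model even though its neighbours 8 and 10 are removed, so the removed set is not one contiguous block.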

	## Evaluation

	Evaluated on 500 held-out News Commentary v18 en-es sentences.

	| Metric | Value |
	|--------|------:|
	| COMET (wmt22-comet-da) | 0.8880 |
	| chrF++ | 67.13 |
	| BLEU | 46.02 |

	## Usage

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	name = "adrianMT56/aya-enes-I2-8"
	tokenizer = AutoTokenizer.from_pretrained(name)
	# Load in fp16, matching the precision listed in the recipe.
	model = AutoModelForCausalLM.from_pretrained(name, dtype=torch.float16)

	# Plain-text prompt: instruction, source sentence, then the "Spanish:" cue.
	prompt = (
	    "Translate the following English text to Spanish.\n\n"
	    "English: The quick brown fox jumps over the lazy dog.\n"
	    "Spanish:"
	)
	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
	# Greedy decoding, capped at 128 new tokens.
	out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
	# Decode only the newly generated tokens, skipping the prompt.
	print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
	```

	CPU users can omit `dtype=torch.float16` (the model then loads in float32) or keep
	fp16 at the cost of some throughput. For GPTQ 4-bit conversion, see the project's
	`scripts/quantize_to_gptq.py`.
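
To translate several sentences with the same prompt format as the snippet above, the prompt construction can be factored into a helper. This is a minimal sketch: `build_prompt` and `first_line` are hypothetical names, and trimming the output at the first non-empty line is an assumption about how the model ends a translation, not documented behaviour.

```python
# Hypothetical helpers; the prompt text mirrors the Usage snippet.
PROMPT_TEMPLATE = (
    "Translate the following English text to Spanish.\n\n"
    "English: {source}\n"
    "Spanish:"
)


def build_prompt(source):
    """Wrap one English sentence in the translation prompt."""
    return PROMPT_TEMPLATE.format(source=source.strip())


def first_line(generated):
    """Return the first non-empty line of the decoded continuation.

    Assumption: anything after that line is unwanted extra generation.
    """
    for line in generated.splitlines():
        if line.strip():
            return line.strip()
    return ""


sentences = ["The quick brown fox jumps over the lazy dog.", "Good morning."]
prompts = [build_prompt(s) for s in sentences]
print(prompts[0])
```

Each prompt can then be passed through `tokenizer` and `model.generate` as in the Usage snippet, with `first_line` applied to the decoded text.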

	## Reproducibility

	This checkpoint was produced by the pipeline at
	<https://github.com/adrianMT56/attention_lp>.
	See `README.md` in that repo for the full training recipe and evaluation scripts.