---
library_name: transformers
pipeline_tag: text-generation
tags:
- alienlm
- alien-adaptation-training
- tokenizer-bijection
- instruction-tuned
datasets:
- Magpie-Align/Magpie-Pro-300K-Filtered
- Magpie-Align/Magpie-Reasoning-V1-150K
model-index:
- name: Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama
  results: []
license: apache-2.0
base_model: Qwen/Qwen2.5-14B-Instruct
---

# Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama

This repository contains the `Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama` weights used in the AlienLM experiments.
It is based on `Qwen/Qwen2.5-14B-Instruct` and was adapted with Alien Adaptation Training (AAT) on [Magpie-Align/Magpie-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-300K-Filtered) and [Magpie-Align/Magpie-Reasoning-V1-150K](https://huggingface.co/datasets/Magpie-Align/Magpie-Reasoning-V1-150K).

AlienLM is a research method for reducing human-readable plaintext exposure at the black-box API boundary.
It transforms text through a reversible vocabulary-level bijection before server-side processing, then relies on a
client-side inverse mapping to recover plaintext. These weights are intended for reproducing and analyzing the
paper's experiments, not as a production privacy or safety mechanism.
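
The round trip described above can be illustrated with a toy sketch. It treats the bijection as a seeded permutation over integer token IDs and reads the alienization ratio as the fraction of IDs that get permuted; both are assumptions made for illustration, and the paper's actual construction may differ. The names `make_bijection` and `apply_map`, the vocabulary size, and the sample IDs are hypothetical.

```python
import random

def make_bijection(vocab_size, ratio, seed):
    """Return (forward, inverse) ID maps over range(vocab_size).

    Assumption: `ratio` is the fraction of IDs that are permuted among
    themselves; every other ID maps to itself.
    """
    rng = random.Random(seed)
    ids = list(range(vocab_size))
    chosen = rng.sample(ids, int(vocab_size * ratio))
    shuffled = chosen[:]
    rng.shuffle(shuffled)
    forward = {i: i for i in ids}          # identity by default
    forward.update(zip(chosen, shuffled))  # permute the chosen subset
    inverse = {v: k for k, v in forward.items()}
    return forward, inverse

def apply_map(mapping, token_ids):
    """Map each token ID through `mapping`."""
    return [mapping[t] for t in token_ids]

forward, inverse = make_bijection(vocab_size=1000, ratio=0.5, seed=0)
plain_ids = [3, 141, 592, 653, 589, 141]    # hypothetical plaintext IDs
alien_ids = apply_map(forward, plain_ids)   # what the server would see
restored = apply_map(inverse, alien_ids)    # client-side inverse mapping
assert restored == plain_ids
```

Because the mapping is a fixed permutation, the same input ID always yields the same output ID, which is the determinism noted under Important Limitations.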

## Variant

- Variant: AlienLM full tokenizer-bijection adaptation
- Base model: `Qwen/Qwen2.5-14B-Instruct`
- Local source path and weight source used for upload: `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama`
- Tokenizer check: the local tokenizer produced different token IDs from the base tokenizer for the test sentence.
  Base tokenizer token IDs for the test sentence: `[2403, 6247, 8521, 525, 25992, 26, 1817, 42151, 2997, 374, 42151, 304, 1181, 1828, 1616, 13]`.

## Important Limitations

- AlienLM does not provide cryptographic security or formal privacy guarantees.
- The method is deterministic and should be evaluated under the relevant leakage and observer assumptions.
- Safety behavior can differ from the original instruction-tuned model; use this model for research evaluation only.
- Downstream quality depends on task, domain, alienization ratio, and adaptation data.

## Tokenization Example

Test sentence:

```text
All happy families are alike; each unhappy family is unhappy in its own way.
```

For this repository, the local tokenizer produces these visible token pieces:

```text
[All, Ġhappy, Ġfamilies, Ġare, Ġalike, ;, Ġeach, Ġunhappy, Ġfamily, Ġis, Ġunhappy, Ġin, Ġits, Ġown, Ġway, .]
```

The table below records how the same sentence maps to token IDs across the uploaded tokenizers. The visible token
pieces may look familiar because AlienLM changes the vocabulary-to-ID mapping; the ID sequence is the important
model-facing representation.

| Tokenizer | Source | Count | Token IDs |
|---|---|---:|---|
| Base Qwen/Qwen2.5-7B-Instruct | `Qwen/Qwen2.5-7B-Instruct` | 16 | `[2403, 6247, 8521, 525, 25992, 26, 1817, 42151, 2997, 374, 42151, 304, 1181, 1828, 1616, 13]` |
| Base Qwen/Qwen2.5-14B-Instruct | `Qwen/Qwen2.5-14B-Instruct` | 16 | `[2403, 6247, 8521, 525, 25992, 26, 1817, 42151, 2997, 374, 42151, 304, 1181, 1828, 1616, 13]` |
| Gemma2-9b-it-AlienLM-50-all-tokenizer-v3-32-qwen | `/data2/AlienLM/outputs/Gemma2-9b-it-AlienLM-50-all-tokenizer-v3-32-qwen` | 16 | `[207114, 211985, 23904, 164425, 201838, 244780, 104844, 11896, 124750, 78043, 11896, 40818, 112321, 155972, 188431, 235269]` |
| Gemma2-9b-it-random42 | `/data2/AlienLM/outputs/Gemma2-9b-it-random42` | 16 | `[118082, 85241, 174135, 184646, 114599, 58746, 48064, 71689, 147487, 81724, 71689, 163116, 23867, 77693, 75944, 217666]` |
| Llama3-8B-Instruct-AlienLM-50-all-tokenizer-v3-32-qwenv2 | `/data2/AlienLM/outputs/Llama3-8B-Instruct-AlienLM-50-all-tokenizer-v3-32-qwenv2/checkpoint-9306` | 16 | `[4054, 43251, 60004, 66417, 35331, 114100, 27381, 6380, 39185, 23136, 6380, 109132, 8299, 21649, 82386, 11]` |
| Llama3-8B-Instruct-AlienLM-ratio-20 | `/data2/AlienLM/outputs/Llama3-8B-Instruct-AlienLM-ratio-20` | 16 | `[2460, 6380, 8689, 527, 27083, 26, 1855, 24241, 30235, 374, 24241, 23136, 1202, 1866, 1648, 13]` |
| Llama3-8B-Instruct-AlienLM-ratio-40 | `/data2/AlienLM/outputs/Llama3-8B-Instruct-AlienLM-ratio-40` | 16 | `[8140, 43251, 50556, 527, 27083, 114100, 27381, 6380, 15547, 18115, 6380, 304, 996, 1866, 1648, 13]` |
| Llama3-8B-Instruct-AlienLM-ratio-60 | `/data2/AlienLM/outputs/Llama3-8B-Instruct-AlienLM-ratio-60` | 16 | `[4054, 43251, 8689, 527, 27083, 114100, 27381, 6380, 3070, 40584, 6380, 304, 82321, 16244, 52224, 11]` |
| Llama3-8B-Instruct-AlienLM-ratio-80 | `/data2/AlienLM/outputs/Llama3-8B-Instruct-AlienLM-ratio-80` | 16 | `[4054, 43251, 60004, 66417, 35331, 26, 27381, 6380, 39185, 48649, 6380, 304, 1202, 1961, 1648, 11]` |
| Llama3-8B-Instruct-random-42 | `/data2/AlienLM/outputs/Llama3-8B-Instruct-random-42/checkpoint-9306` | 16 | `[109112, 64630, 115549, 88947, 56261, 123661, 98632, 89092, 51180, 49115, 89092, 76847, 27799, 22779, 121871, 33744]` |
| Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama | `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama` | 16 | `[90633, 42151, 58904, 2804, 90614, 25, 272, 6247, 29135, 282, 6247, 293, 386, 94648, 28766, 11]` |
| Qwen25-14b-Instruct-random-42 | `/data2/AlienLM/outputs/Qwen25-14b-Instruct-random-42` | 16 | `[26430, 9244, 81484, 117800, 1086, 89842, 70268, 27147, 15693, 31326, 27147, 21062, 67902, 77163, 56354, 63835]` |
| Qwen25-7b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama | `/data2/AlienLM/outputs/Qwen25-7b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama` | 16 | `[90633, 42151, 58904, 2804, 90614, 25, 272, 6247, 29135, 282, 6247, 293, 386, 94648, 28766, 11]` |
| Qwen25-7b-Instruct-random-42 | `/data2/AlienLM/outputs/Qwen25-7b-Instruct-random-42` | 16 | `[26430, 9244, 81484, 117800, 1086, 89842, 70268, 27147, 15693, 31326, 27147, 21062, 67902, 77163, 56354, 63835]` |
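
As a quick sanity check on the table, the sketch below pairs the base Qwen2.5-14B-Instruct IDs with this repository's IDs position by position and checks that the observed pairs are consistent with a one-to-one mapping. Pairing by position assumes both tokenizers segment the sentence identically, which the matching counts suggest but do not prove. The helper `observed_mapping` is a hypothetical name, and a single sentence can only reveal inconsistencies, not establish a bijection over the full vocabulary.

```python
# Token IDs copied from the table rows for the base model and this repository.
base_ids = [2403, 6247, 8521, 525, 25992, 26, 1817, 42151,
            2997, 374, 42151, 304, 1181, 1828, 1616, 13]
alien_ids = [90633, 42151, 58904, 2804, 90614, 25, 272, 6247,
             29135, 282, 6247, 293, 386, 94648, 28766, 11]

def observed_mapping(src, dst):
    """Collect src->dst ID pairs; raise if one src ID maps to two dst IDs."""
    if len(src) != len(dst):
        raise ValueError("sequences must have equal length")
    mapping = {}
    for s, d in zip(src, dst):
        if mapping.setdefault(s, d) != d:
            raise ValueError(f"ID {s} maps to both {mapping[s]} and {d}")
    return mapping

mapping = observed_mapping(base_ids, alien_ids)
is_injective = len(set(mapping.values())) == len(mapping)
changed = sum(s != d for s, d in zip(base_ids, alien_ids))
print(len(mapping), is_injective, changed)  # prints: 15 True 16
```

The repeated base ID `42151` maps to `6247` at both positions, so the repetition pattern of the sentence is still visible in the alien ID sequence.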

## Uploaded Files

Only serving-time artifacts were staged for upload:

- `added_tokens.json` from `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama/added_tokens.json`
- `config.json` from `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama/config.json`
- `generation_config.json` from `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama/generation_config.json`
- `merges.txt` from `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama/merges.txt`
- `model-00001-of-00002.safetensors` from `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama/model-00001-of-00002.safetensors`
- `model-00002-of-00002.safetensors` from `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama/model-00002-of-00002.safetensors`
- `model.safetensors.index.json` from `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama/model.safetensors.index.json`
- `special_tokens_map.json` from `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama/special_tokens_map.json`
- `tokenizer.json` from `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama/tokenizer.json`
- `tokenizer_config.json` from `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama/tokenizer_config.json`
- `vocab.json` from `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama/vocab.json`

Training-only artifacts such as `checkpoint-*` directories, `trainer_state.json`, optimizer states, scheduler states,
RNG states, logs, caches, and W&B files were intentionally excluded.
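
A filter along these lines could separate serving-time files from training-only ones before staging. This is a minimal sketch, not the script used for this upload: the glob patterns in `EXCLUDE_PATTERNS` and the function name `serving_files` are assumptions chosen to match the categories listed above.

```python
from fnmatch import fnmatch

# Hypothetical patterns for training-only artifacts; adjust to the actual
# contents of the output directory.
EXCLUDE_PATTERNS = [
    "checkpoint-*",
    "trainer_state.json",
    "optimizer*",
    "scheduler*",
    "rng_state*",
    "*.log",
    "wandb*",
]

def serving_files(names):
    """Keep only names that match none of the exclusion patterns."""
    return [
        name for name in names
        if not any(fnmatch(name, pattern) for pattern in EXCLUDE_PATTERNS)
    ]

candidates = ["config.json", "checkpoint-9306", "optimizer.pt",
              "tokenizer.json", "trainer_state.json", "vocab.json"]
staged = serving_files(candidates)
print(staged)
```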

## Training Data

The model was adapted on the Magpie instruction and reasoning mixture used in the AlienLM experiments:

- `Magpie-Align/Magpie-Pro-300K-Filtered`
- `Magpie-Align/Magpie-Reasoning-V1-150K`

## Citation

If you use these weights, please cite the AlienLM paper.