---
library_name: transformers
pipeline_tag: text-generation
tags:
- alienlm
- alien-adaptation-training
- tokenizer-bijection
- instruction-tuned
datasets:
- Magpie-Align/Magpie-Pro-300K-Filtered
- Magpie-Align/Magpie-Reasoning-V1-150K
model-index:
- name: Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama
  results: []
license: apache-2.0
base_model: Qwen/Qwen2.5-14B-Instruct
---

# Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama

This repository contains the `Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama` weights used in the AlienLM experiments. It is based on `Qwen/Qwen2.5-14B-Instruct` and was adapted with Alien Adaptation Training (AAT) on [Magpie-Align/Magpie-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-300K-Filtered) and [Magpie-Align/Magpie-Reasoning-V1-150K](https://huggingface.co/datasets/Magpie-Align/Magpie-Reasoning-V1-150K).

AlienLM is a research method for reducing human-readable plaintext exposure at the black-box API boundary. It transforms text through a reversible vocabulary-level bijection before server-side processing, then relies on a client-side inverse mapping to recover plaintext. These weights are intended for reproducing and analyzing the paper's experiments, not as a production privacy or safety mechanism.

## Variant

- Variant: AlienLM full tokenizer-bijection adaptation
- Base model: `Qwen/Qwen2.5-14B-Instruct`
- Local source path used for upload: `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama`
- Weight source used for upload: `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama`
- Tokenizer check: The local tokenizer produced different token IDs from the base tokenizer for the test sentence. Base tokenizer token IDs for the test sentence: `[2403, 6247, 8521, 525, 25992, 26, 1817, 42151, 2997, 374, 42151, 304, 1181, 1828, 1616, 13]`.

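
The tokenizer check and the reversible mapping described above can be illustrated offline with the two ID sequences this card records for the test sentence (the base Qwen2.5 IDs and the IDs listed for this repository in the Tokenization Example table). The following is a minimal sketch, not the AlienLM implementation: the helper `build_partial_mapping` is a hypothetical name, and the position-by-position alignment assumes both tokenizers split the sentence into the same 16 pieces, as the card indicates. It only checks that the recorded sample is consistent with a one-to-one ID mapping; it says nothing about the rest of the vocabulary.

```python
# Token IDs copied from this card for the test sentence:
# "All happy families are alike; each unhappy family is unhappy in its own way."
base_ids = [2403, 6247, 8521, 525, 25992, 26, 1817, 42151,
            2997, 374, 42151, 304, 1181, 1828, 1616, 13]
alien_ids = [90633, 42151, 58904, 2804, 90614, 25, 272, 6247,
             29135, 282, 6247, 293, 386, 94648, 28766, 11]


def build_partial_mapping(src, dst):
    """Build a src -> dst ID map from aligned sequences (hypothetical helper).

    Raises ValueError if one source ID is paired with two different targets,
    which would contradict a deterministic vocabulary-level mapping.
    """
    if len(src) != len(dst):
        raise ValueError("sequences must have the same length")
    mapping = {}
    for s, d in zip(src, dst):
        if mapping.setdefault(s, d) != d:
            raise ValueError(f"ID {s} maps to both {mapping[s]} and {d}")
    return mapping


# Forward direction: base ID -> alien ID, as observed on this one sentence.
forward = build_partial_mapping(base_ids, alien_ids)
# Inverse direction: alien ID -> base ID; building it fails if two base IDs
# share an alien ID, so success means the sample mapping is injective.
inverse = build_partial_mapping(alien_ids, base_ids)

differs = alien_ids != base_ids                 # the "tokenizer check"
recovered = [inverse[i] for i in alien_ids]     # client-side inverse on the sample

print(differs, recovered == base_ids, len(forward))
```

On this sample the two sequences differ, the inverse lookup recovers the base IDs, and the 15 distinct base IDs map to 15 distinct alien IDs. One visible detail is that the IDs for `Ġhappy` (6247) and `Ġunhappy` (42151) are exchanged between the two tokenizers.
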
## Important Limitations

- AlienLM does not provide cryptographic security or formal privacy guarantees.
- The method is deterministic and should be evaluated under the relevant leakage and observer assumptions.
- Safety behavior can differ from the original instruction-tuned model; use this model for research evaluation only.
- Downstream quality depends on task, domain, alienization ratio, and adaptation data.

## Tokenization Example

Test sentence:

```text
All happy families are alike; each unhappy family is unhappy in its own way.
```

For this repository, the local tokenizer produces these visible token pieces:

```text
[All, Ġhappy, Ġfamilies, Ġare, Ġalike, ;, Ġeach, Ġunhappy, Ġfamily, Ġis, Ġunhappy, Ġin, Ġits, Ġown, Ġway, .]
```

The table below records how the same sentence maps to token IDs across the uploaded tokenizers. The visible token pieces may look familiar because AlienLM changes the vocabulary-to-ID mapping; the ID sequence is the important model-facing representation.

| Tokenizer | Source | Count | Token IDs |
|---|---:|---:|---|
| Base Qwen/Qwen2.5-7B-Instruct | `Qwen/Qwen2.5-7B-Instruct` | 16 | `[2403, 6247, 8521, 525, 25992, 26, 1817, 42151, 2997, 374, 42151, 304, 1181, 1828, 1616, 13]` |
| Base Qwen/Qwen2.5-14B-Instruct | `Qwen/Qwen2.5-14B-Instruct` | 16 | `[2403, 6247, 8521, 525, 25992, 26, 1817, 42151, 2997, 374, 42151, 304, 1181, 1828, 1616, 13]` |
| Gemma2-9b-it-AlienLM-50-all-tokenizer-v3-32-qwen | `/data2/AlienLM/outputs/Gemma2-9b-it-AlienLM-50-all-tokenizer-v3-32-qwen` | 16 | `[207114, 211985, 23904, 164425, 201838, 244780, 104844, 11896, 124750, 78043, 11896, 40818, 112321, 155972, 188431, 235269]` |
| Gemma2-9b-it-random42 | `/data2/AlienLM/outputs/Gemma2-9b-it-random42` | 16 | `[118082, 85241, 174135, 184646, 114599, 58746, 48064, 71689, 147487, 81724, 71689, 163116, 23867, 77693, 75944, 217666]` |
| Llama3-8B-Instruct-AlienLM-50-all-tokenizer-v3-32-qwenv2 | `/data2/AlienLM/outputs/Llama3-8B-Instruct-AlienLM-50-all-tokenizer-v3-32-qwenv2/checkpoint-9306` | 16 | `[4054, 43251, 60004, 66417, 35331, 114100, 27381, 6380, 39185, 23136, 6380, 109132, 8299, 21649, 82386, 11]` |
| Llama3-8B-Instruct-AlienLM-ratio-20 | `/data2/AlienLM/outputs/Llama3-8B-Instruct-AlienLM-ratio-20` | 16 | `[2460, 6380, 8689, 527, 27083, 26, 1855, 24241, 30235, 374, 24241, 23136, 1202, 1866, 1648, 13]` |
| Llama3-8B-Instruct-AlienLM-ratio-40 | `/data2/AlienLM/outputs/Llama3-8B-Instruct-AlienLM-ratio-40` | 16 | `[8140, 43251, 50556, 527, 27083, 114100, 27381, 6380, 15547, 18115, 6380, 304, 996, 1866, 1648, 13]` |
| Llama3-8B-Instruct-AlienLM-ratio-60 | `/data2/AlienLM/outputs/Llama3-8B-Instruct-AlienLM-ratio-60` | 16 | `[4054, 43251, 8689, 527, 27083, 114100, 27381, 6380, 3070, 40584, 6380, 304, 82321, 16244, 52224, 11]` |
| Llama3-8B-Instruct-AlienLM-ratio-80 | `/data2/AlienLM/outputs/Llama3-8B-Instruct-AlienLM-ratio-80` | 16 | `[4054, 43251, 60004, 66417, 35331, 26, 27381, 6380, 39185, 48649, 6380, 304, 1202, 1961, 1648, 11]` |
| Llama3-8B-Instruct-random-42 | `/data2/AlienLM/outputs/Llama3-8B-Instruct-random-42/checkpoint-9306` | 16 | `[109112, 64630, 115549, 88947, 56261, 123661, 98632, 89092, 51180, 49115, 89092, 76847, 27799, 22779, 121871, 33744]` |
| Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama | `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama` | 16 | `[90633, 42151, 58904, 2804, 90614, 25, 272, 6247, 29135, 282, 6247, 293, 386, 94648, 28766, 11]` |
| Qwen25-14b-Instruct-random-42 | `/data2/AlienLM/outputs/Qwen25-14b-Instruct-random-42` | 16 | `[26430, 9244, 81484, 117800, 1086, 89842, 70268, 27147, 15693, 31326, 27147, 21062, 67902, 77163, 56354, 63835]` |
| Qwen25-7b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama | `/data2/AlienLM/outputs/Qwen25-7b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama` | 16 | `[90633, 42151, 58904, 2804, 90614, 25, 272, 6247, 29135, 282, 6247, 293, 386, 94648, 28766, 11]` |
| Qwen25-7b-Instruct-random-42 | `/data2/AlienLM/outputs/Qwen25-7b-Instruct-random-42` | 16 | `[26430, 9244, 81484, 117800, 1086, 89842, 70268, 27147, 15693, 31326, 27147, 21062, 67902, 77163, 56354, 63835]` |

## Uploaded Files

Only serving-time artifacts were staged for upload:

- `added_tokens.json` from `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama/added_tokens.json`
- `config.json` from `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama/config.json`
- `generation_config.json` from `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama/generation_config.json`
- `merges.txt` from `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama/merges.txt`
- `model-00001-of-00002.safetensors` from `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama/model-00001-of-00002.safetensors`
- `model-00002-of-00002.safetensors` from `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama/model-00002-of-00002.safetensors`
- `model.safetensors.index.json` from `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama/model.safetensors.index.json`
- `special_tokens_map.json` from `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama/special_tokens_map.json`
- `tokenizer.json` from `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama/tokenizer.json`
- `tokenizer_config.json` from `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama/tokenizer_config.json`
- `vocab.json` from `/data2/AlienLM/outputs/Qwen25-14b-Instruct-AlienLM-50-all-tokenizer-v3-32-llama/vocab.json`

Training-only artifacts such as `checkpoint-*` directories, `trainer_state.json`, optimizer states, scheduler states, RNG states, logs, caches, and W&B files were intentionally excluded.

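
The staging rule above (keep serving-time files, drop training-only artifacts) can be sketched as an allowlist filter over a directory listing. This is an illustrative sketch, not the script used for the upload: `SERVING_PATTERNS` and `select_serving_files` are hypothetical names, the shard pattern `model-*-of-*.safetensors` is a generalization of the two shard names listed above, and the training-side entries in the sample listing (`checkpoint-1000`, `optimizer.pt`) are made-up examples of excluded artifacts.

```python
from fnmatch import fnmatchcase

# Allowlist derived from the file names listed in this card (hypothetical constant).
SERVING_PATTERNS = [
    "added_tokens.json",
    "config.json",
    "generation_config.json",
    "merges.txt",
    "model-*-of-*.safetensors",      # assumption: covers any shard count
    "model.safetensors.index.json",
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "vocab.json",
]


def select_serving_files(names):
    """Return the sorted names that match at least one allowlisted pattern."""
    return sorted(
        name for name in names
        if any(fnmatchcase(name, pattern) for pattern in SERVING_PATTERNS)
    )


# Sample listing: the eleven uploaded files plus illustrative training artifacts.
listing = [
    "added_tokens.json", "config.json", "generation_config.json", "merges.txt",
    "model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors",
    "model.safetensors.index.json", "special_tokens_map.json",
    "tokenizer.json", "tokenizer_config.json", "vocab.json",
    "trainer_state.json",   # named in this card as excluded
    "checkpoint-1000",      # hypothetical checkpoint directory name
    "optimizer.pt",         # hypothetical optimizer state file name
]

selected = select_serving_files(listing)
print(len(selected))
```

An allowlist is used here rather than an exclusion list because the card names the uploaded files exactly, whereas the file names of the excluded optimizer, scheduler, and RNG states are not given.
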
## Training Data

The model was adapted on the Magpie instruction and reasoning mixture used in the AlienLM experiments:

- `Magpie-Align/Magpie-Pro-300K-Filtered`
- `Magpie-Align/Magpie-Reasoning-V1-150K`

## Citation

If you use these weights, please cite the AlienLM paper.