| license: apache-2.0 | |
| base_model: Qwen/Qwen3-8B-Base | |
| tags: | |
| - pruning | |
| - layer-pruning | |
| - laco | |
| - compressed | |
| - qwen3 | |
| - llm | |
| - efficient | |
| library_name: transformers | |
| pipeline_tag: text-generation | |
| language: | |
| - en | |
| - zh | |
| - multilingual | |
| datasets: | |
| - wikipedia | |
| model-index: | |
| - name: Qwen3-8B-LaCo-Pruned | |
| results: | |
| - task: | |
| type: text-generation | |
| name: Text Generation | |
| dataset: | |
| name: PIQA | |
| type: piqa | |
| metrics: | |
| - type: accuracy_norm | |
| value: 71.38 | |
| name: Accuracy (Normalized) | |
| - task: | |
| type: text-generation | |
| name: Text Generation | |
| dataset: | |
| name: HellaSwag | |
| type: hellaswag | |
| metrics: | |
| - type: accuracy_norm | |
| value: 61.98 | |
| name: Accuracy (Normalized) | |
| - task: | |
| type: text-generation | |
| name: Text Generation | |
| dataset: | |
| name: BoolQ | |
| type: boolq | |
| metrics: | |
| - type: accuracy | |
| value: 64.95 | |
| name: Accuracy | |
| - task: | |
| type: text-generation | |
| name: Text Generation | |
| dataset: | |
| name: WinoGrande | |
| type: winogrande | |
| metrics: | |
| - type: accuracy | |
| value: 62.83 | |
| name: Accuracy | |
| - task: | |
| type: text-generation | |
| name: Text Generation | |
| dataset: | |
| name: ARC-Challenge | |
| type: arc_challenge | |
| metrics: | |
| - type: accuracy_norm | |
| value: 36.09 | |
| name: Accuracy (Normalized) | |
| - task: | |
| type: text-generation | |
| name: Text Generation | |
| dataset: | |
| name: ARC-Easy | |
| type: arc_easy | |
| metrics: | |
| - type: accuracy_norm | |
| value: 58.04 | |
| name: Accuracy (Normalized) | |
| - task: | |
| type: text-generation | |
| name: Text Generation | |
| dataset: | |
| name: MMLU | |
| type: mmlu | |
| metrics: | |
| - type: accuracy | |
| value: 31.30 | |
| name: Accuracy (5-shot) | |
| # Qwen3-8B-LaCo-Pruned | |
| This model is a **layer-pruned** version of [Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base), produced with the [LaCo (Layer Collapse)](https://arxiv.org/abs/2402.11187) structured pruning method. | |
| ## Model Summary | |
| | Attribute | Value | | |
| |-----------|-------| | |
| | **Base Model** | [Qwen/Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base) | | |
| | **Pruning Method** | LaCo (Layer Collapse) | | |
| | **Original Layers** | 36 | |
| | **Layers After Pruning** | 30 | |
| | **Layers Removed** | 6 | |
| | **Compression (by layer count)** | 16.7% | |
| | **Parameters** | ~6.7B (reduced from ~8B) | |
| ## Key Results | |
| This model removes **16.7% of the layers** while retaining, relative to the base model's benchmark scores: | |
| - **~90% of physical reasoning** (PIQA) | |
| - **~94% of commonsense reasoning** (WinoGrande) | |
| - **~79% of commonsense completion** (HellaSwag) | |
| - **~41% of factual knowledge** (MMLU) | |
| This is a **raw pruned model without post-training**. Fine-tuning can further recover lost capabilities. | |
| --- | |
| ## Benchmark Results (Before Post-Training) | |
| **Note:** All benchmarks below were evaluated on the pruned model **without any post-training or fine-tuning**, so they reflect raw performance after pruning only. Post-training is expected to improve these scores, particularly on knowledge-intensive tasks such as MMLU. | |
| ### Comparison with Original Qwen3-8B-Base | |
| | Benchmark | Original | Pruned | Retention | | |
| |-----------|----------|--------|-----------| | |
| | **PIQA** (acc_norm) | 79.54% | 71.38% | 89.7% | | |
| | **WinoGrande** | 67.0% | 62.83% | 93.8% | | |
| | **ARC-Challenge** (acc_norm) | 42.0% | 36.09% | 85.9% | | |
| | **ARC-Easy** (acc_norm) | 72.0% | 58.04% | 80.6% | | |
| | **HellaSwag** (acc_norm) | 78.55% | 61.98% | 78.9% | | |
| | **BoolQ** | 83.09% | 64.95% | 78.2% | | |
| | **MMLU** (5-shot) | 76.89% | 31.30% | 40.7% | | |
| *Original scores from [Qwen3 Technical Report](https://arxiv.org/abs/2505.09388)* | |
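The Retention column is the pruned score divided by the original score. A minimal sketch that recomputes it from the table values (the helper name `retention` is illustrative and not part of any library):

```python
# Scores in percent, copied from the comparison table: (original, pruned).
SCORES = {
    "PIQA": (79.54, 71.38),
    "WinoGrande": (67.0, 62.83),
    "ARC-Challenge": (42.0, 36.09),
    "ARC-Easy": (72.0, 58.04),
    "HellaSwag": (78.55, 61.98),
    "BoolQ": (83.09, 64.95),
    "MMLU": (76.89, 31.30),
}


def retention(original: float, pruned: float) -> float:
    """Return the pruned score as a percentage of the original score."""
    return round(100.0 * pruned / original, 1)


for name, (original, pruned) in SCORES.items():
    print(f"{name:<14} {retention(original, pruned):5.1f}%")
```

Retention is a relative measure: it does not account for the chance-level floor of a benchmark, which matters when reading the MMLU row.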
| ### Benchmark Interpretation | |
| | Capability | Benchmarks | Retention | Status | | |
| |------------|------------|-----------|--------| | |
| | Physical Reasoning | PIQA | 89.7% | Excellent | | |
| | Commonsense Reasoning | WinoGrande | 93.8% | Excellent | | |
| | Basic Reasoning | ARC-Challenge | 85.9% | Good | | |
| | Reading Comprehension | BoolQ | 78.2% | Good | | |
| | Common Sense | HellaSwag | 78.9% | Good | | |
| | Factual Knowledge | MMLU | 40.7% | Degraded | | |
| --- | |
| ## The "Knowledge Cliff" | |
| Our experiments show that **MMLU accuracy falls to chance level between 16.7% and 22.2% compression** (30 vs. 28 remaining layers). | |
| | Compression | Layers | MMLU | Status | | |
| |-------------|--------|------|--------| | |
| | **16.7%** | **30** | **31.30%** | Partial retention | | |
| | 22.2% | 28 | 25.89% | Near random chance (25%) | | |
| | 27.8% | 26 | 25.12% | Near random chance (25%) | | |
| Reasoning capabilities degrade gradually with compression, whereas MMLU accuracy drops abruptly once two more layers are removed, which suggests that factual knowledge encoded in specific layers is lost when those layers are removed. | |
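The compression figures in this table follow from the layer counts alone, and MMLU scores are easiest to read against the 25% expected from guessing among four answer options. A small sketch of both calculations (function and constant names are illustrative):

```python
ORIGINAL_LAYERS = 36            # layer count of the base model (Model Summary)
MMLU_CHANCE = 100.0 / 4         # MMLU questions have four answer options


def compression_pct(remaining_layers: int, original_layers: int = ORIGINAL_LAYERS) -> float:
    """Percentage of transformer layers removed by pruning."""
    removed = original_layers - remaining_layers
    return round(100.0 * removed / original_layers, 1)


# (remaining layers, MMLU accuracy in percent) from the table
for layers, mmlu in [(30, 31.30), (28, 25.89), (26, 25.12)]:
    margin = mmlu - MMLU_CHANCE
    print(f"{layers} layers: {compression_pct(layers)}% removed, "
          f"MMLU {margin:+.2f} points vs. chance")
```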
| --- | |
| ## Intended Use | |
| This model is suitable for: | |
| - **Research** on model compression and efficiency | |
| - **Fine-tuning base** for domain-specific applications | |
| - **Inference optimization** where speed/memory matters | |
| - **Applications prioritizing reasoning over factual recall** | |
| ## Limitations | |
| **Important:** This is a raw pruned model without post-training. | |
| | Use Case | Recommendation | | |
| |----------|----------------| | |
| | Physical/commonsense reasoning | Recommended | | |
| | Reading comprehension | Recommended | | |
| | General text understanding | Recommended | | |
| | Factual question answering | Fine-tune first | | |
| | Knowledge-intensive tasks | Fine-tune first | | |
| --- | |
| ## Pruning Details | |
| ### LaCo Hyperparameters | |
| | Parameter | Value | Description | | |
| |-----------|-------|-------------| | |
| | MERGE_LAYERS (C) | 3 | Layers merged per operation | | |
| | LOWEST_LAY (L) | 4 | Minimum layer index for merging | | |
| | HIGHEST_LAY (H) | 28 | Maximum layer index for merging | | |
| | INTERVAL (I) | 2 | Minimum gap between merge points | | |
| | THRESHOLD (T) | 0.85 | Cosine similarity threshold | | |
| | MAX_COMPRESSION | 20% | Maximum allowed compression | | |
| ### Pruning Statistics | |
| | Metric | Value | | |
| |--------|-------| | |
| | Successful Merges | 3 | | |
| | Rejected Merges | 0 | | |
| | Total Iterations | 4 | | |
| | Final Compression | 16.7% | | |
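To make the hyperparameters concrete, here is a toy sketch of a LaCo-style merge on plain Python lists. It follows the parameter-difference formulation from the LaCo paper (merged = base + sum of differences between each following layer and the base), and assumes that merging C layers keeps one layer and drops C - 1, which is consistent with the statistics above (3 merges, 6 layers removed). The merge points, toy weights, and function names are hypothetical; this is not the code used to produce this model.

```python
import math


def merge_into_base(layers, base, num_following):
    """Collapse the `num_following` layers after `base` into layers[base].

    Each layer is a flat list of floats standing in for its weights.
    merged = base + sum_k (following_k - base); the following layers are dropped.
    """
    base_weights = layers[base]
    following = layers[base + 1 : base + 1 + num_following]
    merged = [
        w + sum(f[i] - w for f in following)
        for i, w in enumerate(base_weights)
    ]
    return layers[:base] + [merged] + layers[base + 1 + num_following :]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def accept_merge(original_repr, merged_repr, threshold=0.85):
    """Keep a merge only if output representations stay similar (THRESHOLD)."""
    return cosine_similarity(original_repr, merged_repr) >= threshold


MERGE_LAYERS = 3  # C from the hyperparameter table
toy_model = [[float(i), float(i) + 1.0] for i in range(36)]  # 36 toy layers

# Hypothetical merge points, processed from the top down so indices stay valid.
for base in (20, 12, 4):
    toy_model = merge_into_base(toy_model, base, MERGE_LAYERS - 1)

print(len(toy_model))  # number of layers left after three merges
```

In the paper, the similarity check compares output representations of the original and merged models on a few calibration sentences; `accept_merge` only illustrates the thresholding step.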
| --- | |
| ## Usage | |
| ### Basic Inference | |
| ```python | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| # Repository ID of this model on the Hugging Face Hub | |
| model_name = "Mercity/Qwen3-8B-LaCo-30L" | |
| tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) | |
| model = AutoModelForCausalLM.from_pretrained( | |
|     model_name, | |
|     torch_dtype="auto",   # use the dtype stored in the checkpoint (bfloat16) | |
|     device_map="auto",    # place weights on available devices (requires accelerate) | |
|     trust_remote_code=True, | |
| ) | |
| # Text generation: plain completion, since this is a pruned base model | |
| prompt = "The process of photosynthesis" | |
| inputs = tokenizer(prompt, return_tensors="pt").to(model.device) | |
| outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7) | |
| print(tokenizer.decode(outputs[0], skip_special_tokens=True)) | |
| ``` | |
| ### With 4-bit Quantization (Further Compression) | |
| ```python | |
| from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig | |
| # 4-bit NF4 quantization (requires the bitsandbytes package) | |
| quantization_config = BitsAndBytesConfig( | |
|     load_in_4bit=True, | |
|     bnb_4bit_compute_dtype="float16", | |
|     bnb_4bit_quant_type="nf4", | |
| ) | |
| model = AutoModelForCausalLM.from_pretrained( | |
|     "Mercity/Qwen3-8B-LaCo-30L", | |
|     quantization_config=quantization_config, | |
|     device_map="auto", | |
|     trust_remote_code=True, | |
| ) | |
| ``` | |
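As a rough guide to what quantization saves, weight memory can be estimated as parameter count times bytes per parameter. The sketch below uses the approximate ~6.7B figure from the Model Summary and counts weights only; it ignores activations, the KV cache, quantization metadata, and any modules that stay in higher precision, so the 4-bit number is a lower bound. The names are illustrative.

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"bfloat16": 2.0, "float16": 2.0, "int8": 1.0, "nf4": 0.5}

PRUNED_PARAMS = 6.7e9  # approximate parameter count from the Model Summary


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Estimated weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("bfloat16", "nf4"):
    print(f"{dtype:>8}: ~{weight_memory_gb(PRUNED_PARAMS, dtype):.2f} GB of weights")
```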
| --- | |
| ## Recovery Recommendations | |
| To improve factual knowledge after pruning: | |
| ### LoRA Fine-tuning (Recommended) | |
| ```python | |
| from peft import LoraConfig, get_peft_model | |
| # `model` is the pruned model loaded as in the Basic Inference example | |
| lora_config = LoraConfig( | |
|     r=32, | |
|     lora_alpha=64, | |
|     target_modules=["q_proj", "k_proj", "v_proj", "o_proj", | |
|                     "gate_proj", "up_proj", "down_proj"], | |
|     lora_dropout=0.05, | |
|     task_type="CAUSAL_LM", | |
| ) | |
| model = get_peft_model(model, lora_config) | |
| model.print_trainable_parameters()  # report how many weights LoRA will train | |
| # Fine-tune on OpenOrca, Alpaca, or domain-specific data | |
| ``` | |
| **Expected recovery (an estimate, not a measured result):** MMLU could reach 45-55% with fine-tuning. | |
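For a sense of scale, the number of trainable LoRA parameters can be estimated from the rank and the projection shapes: each adapted linear layer adds `r * (in_features + out_features)` weights. The sketch below uses the rank from the configuration above and the dimensions from the Technical Specifications table; the head dimension of 128 is an assumption (hidden size divided by the number of query heads), and the helper name is illustrative.

```python
HIDDEN = 4096
INTERMEDIATE = 12288
Q_HEADS, KV_HEADS = 32, 8
HEAD_DIM = 128   # assumption: HIDDEN / Q_HEADS, not listed in the model card
LAYERS = 30
RANK = 32        # r in the LoRA configuration

# (in_features, out_features) for each targeted projection in one layer
SHAPES = {
    "q_proj": (HIDDEN, Q_HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "o_proj": (Q_HEADS * HEAD_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(shapes, rank, num_layers):
    """LoRA adds A (rank x in) and B (out x rank) per adapted projection."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


total = lora_trainable_params(SHAPES, RANK, LAYERS)
print(f"~{total / 1e6:.1f}M trainable LoRA parameters")
```

Under these assumptions the adapters are on the order of 1% of the pruned model's parameters, which is why LoRA is a comparatively cheap way to attempt recovery.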
| --- | |
| ## Technical Specifications | |
| | Attribute | Value | | |
| |-----------|-------| | |
| | Architecture | Transformer decoder-only | | |
| | Parameters | ~6.7B | | |
| | Layers | 30 | | |
| | Hidden Size | 4096 | | |
| | Attention Heads (Q) | 32 | | |
| | Attention Heads (KV) | 8 (GQA) | | |
| | Intermediate Size | 12288 | | |
| | Vocabulary Size | 151,669 | | |
| | Max Context Length | 32,768 tokens | | |
| | Precision | bfloat16 | | |
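Removing layers also shrinks the KV cache, since every layer stores its own keys and values. A back-of-the-envelope sketch using the values in this table, again assuming a head dimension of 128 (not listed in the table) and 2 bytes per value for bfloat16:

```python
def kv_cache_bytes(num_layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the KV cache for one sequence.

    The leading factor 2 accounts for one key and one value tensor per layer.
    """
    return 2 * num_layers * kv_heads * head_dim * seq_len * bytes_per_value


KV_HEADS = 8
HEAD_DIM = 128        # assumption: hidden size 4096 / 32 query heads
MAX_CONTEXT = 32768

for label, layers in (("pruned (30 layers)", 30), ("original (36 layers)", 36)):
    gib = kv_cache_bytes(layers, KV_HEADS, HEAD_DIM, MAX_CONTEXT) / 2**30
    print(f"{label}: {gib:.2f} GiB at full context")
```

The cache scales linearly with layer count, so the saving at any context length matches the 16.7% layer reduction.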
| --- | |
| ## Citation | |
| If you use this model, please cite the original LaCo paper and Qwen3: | |
| ```bibtex | |
| @article{yang2024laco, | |
| title={LaCo: Large Language Model Pruning via Layer Collapse}, | |
| author={Yang, Yifei and Cao, Zouying and Zhao, Hai}, | |
| journal={arXiv preprint arXiv:2402.11187}, | |
| year={2024} | |
| } | |
| @misc{qwen3technicalreport, | |
| title={Qwen3 Technical Report}, | |
| author={Qwen Team}, | |
| year={2025}, | |
| eprint={2505.09388}, | |
| archivePrefix={arXiv}, | |
| primaryClass={cs.CL}, | |
| url={https://arxiv.org/abs/2505.09388} | |
| } | |
| ``` | |
| ## References | |
| - [LaCo Paper](https://arxiv.org/abs/2402.11187) | |
| - [LaCo Official Implementation](https://github.com/yangyifei729/LaCo) | |
| - [Qwen3 Technical Report](https://arxiv.org/abs/2505.09388) | |
| - [Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base) | |
| ## License | |
| Apache 2.0 (same as base Qwen3 model) | |
| ## Acknowledgments | |
| - Qwen Team for the excellent Qwen3-8B-Base model | |
| - LaCo authors for the pruning methodology | |
| - Hugging Face for model hosting |