---
license: apache-2.0
base_model: Qwen/Qwen3-8B-Base
arxiv: arxiv:2507.02279
tags:
- pruning
- layer-pruning
- laco
- compressed
- qwen3
- llm
- efficient
library_name: transformers
pipeline_tag: text-generation
language:
- en
- zh
- multilingual
datasets:
- wikipedia
model-index:
- name: Qwen3-8B-LaCo-Pruned
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag
      type: hellaswag
    metrics:
    - type: accuracy_norm
      value: 48.52
      name: Accuracy (Normalized)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: PIQA
      type: piqa
    metrics:
    - type: accuracy_norm
      value: 65.67
      name: Accuracy (Normalized)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BoolQ
      type: boolq
    metrics:
    - type: accuracy
      value: 61.77
      name: Accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU
      type: mmlu
    metrics:
    - type: accuracy
      value: 25.12
      name: Accuracy (5-shot)
---

# Qwen3-8B-LaCo-Pruned

This model is a **layer-pruned** version of [Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base), produced with the [LaCo (Layer Collapse)](https://arxiv.org/abs/2402.11187) structured pruning method.

## Model Summary

| Attribute | Value |
|-----------|-------|
| **Base Model** | [Qwen/Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base) |
| **Pruning Method** | LaCo (Layer Collapse) |
| **Original Layers** | 36 |
| **Layers After Pruning** | 26 |
| **Layers Removed** | 10 |
| **Compression** | 27.8% (10 of 36 layers) |
| **Parameters** | ~5.8B (reduced from ~8B) |

## Intended Use

- **Research** on model compression and efficiency
- **Fine-tuning base** for domain-specific applications
- **Inference optimization** where speed and memory matter more than factual accuracy
- **Edge deployment** scenarios with limited computational resources

## ⚠️ Important Limitations

This pruned model has **significantly reduced factual knowledge**. It performs at near-random levels on knowledge-intensive benchmarks such as MMLU (25.12% on 5-shot MMLU, where four-way multiple choice gives 25% by chance).

| Use Case | Status |
|----------|--------|
| Physical reasoning tasks | ✅ Good (82.6% retained) |
| Reading comprehension | ⚠️ Acceptable (74.3% retained) |
| Common sense reasoning | ⚠️ Degraded (61.8% retained) |
| Factual question answering | ❌ Not recommended |
| Knowledge-intensive tasks | ❌ Not recommended |

**Recommendation:** Fine-tune this model on your target domain before deployment.

---

## Pruning Details

### LaCo Hyperparameters

| Parameter | Value | Description |
|-----------|-------|-------------|
| MERGE_LAYERS (C) | 3 | Layers merged per operation |
| LOWEST_LAY (L) | 4 | Minimum layer index for merging |
| HIGHEST_LAY (H) | 28 | Maximum layer index for merging |
| INTERVAL (I) | 2 | Minimum gap between merge points |
| THRESHOLD (T) | 0.85 | Cosine similarity threshold for accepting a merge |
| MAX_COMPRESSION | 30% | Maximum allowed compression (share of layers removed) |

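To make the roles of these hyperparameters concrete, here is a minimal sketch of a greedy top-down merge search, loosely following the loop described in the LaCo paper. It is not the code used to produce this model. The function `plan_merges`, the `similarity` callback, and the way the `MAX_COMPRESSION` cap stops the loop are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def plan_merges(
    num_layers: int,
    merge_layers: int,       # C
    lowest_lay: int,         # L
    highest_lay: int,        # H
    interval: int,           # I
    threshold: float,        # T
    max_compression: float,  # assumed cap on the share of layers removed
    similarity: Callable[[int, int], float],  # hypothetical scoring callback
) -> Tuple[List[int], int]:
    """Toy control flow: return accepted merge bases and the final layer count."""
    remaining = num_layers
    accepted: List[int] = []
    base = highest_lay - merge_layers  # start near the top of the allowed range
    while base >= lowest_lay:
        # One merge collapses `base` and the layers after it: C layers become 1.
        absorbed = min(merge_layers - 1, remaining - base - 1)
        if absorbed < 1:
            break
        removed_after = num_layers - (remaining - absorbed)
        if removed_after / num_layers > max_compression:
            break  # assumed stop rule for the compression cap
        if similarity(base, absorbed) > threshold:
            accepted.append(base)
            remaining -= absorbed
            base -= interval  # keep a gap before the next merge point
        else:
            base -= 1  # rejected: try the next lower layer
        # Never point past the layers that still exist.
        base = min(base, remaining - merge_layers)
    return accepted, remaining


# Hypothetical callback: pretend every candidate merge scores 0.97.
bases, layers_left = plan_merges(36, 3, 4, 28, 2, 0.85, 0.30, lambda b, k: 0.97)
print(bases, layers_left)
```

With the values from the table and a callback that always clears the threshold, this toy loop accepts five merges (bases 25, 23, 21, 19, 17) and ends at 26 layers. A sixth merge would push compression past 30%, so the loop stops there.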
### Pruning Statistics

| Metric | Value |
|--------|-------|
| Successful Merges | 5 |
| Rejected Merges | 0 |
| Total Iterations | 6 |
| Final Compression | 27.8% |

### Hidden State Similarity (Calibration Set)

| Metric | Value |
|--------|-------|
| Average | 0.9680 |
| Min | 0.9492 |
| Max | 0.9766 |

Individual similarities: `[0.9492, 0.9727, 0.9609, 0.9766, 0.9688, 0.9648, 0.9648, 0.9766, 0.9727, 0.9727]`

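As a quick check on the summary statistics, the sketch below defines cosine similarity for two plain vectors and recomputes the average, minimum, and maximum from the listed values. The helper `cosine_similarity` is illustrative only. The actual comparison was run on model hidden states, which this card does not include.

```python
import math
from statistics import mean


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Individual similarities as listed on this card.
similarities = [0.9492, 0.9727, 0.9609, 0.9766, 0.9688,
                0.9648, 0.9648, 0.9766, 0.9727, 0.9727]

summary = {
    "average": round(mean(similarities), 4),
    "min": min(similarities),
    "max": max(similarities),
}
print(summary)
```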
### Perplexity Results

| Model | Perplexity | Ratio |
|-------|------------|-------|
| Original (Qwen3-8B-Base) | 26.19 | 1.00× |
| Pruned (this model) | 71.48 | **2.73×** |

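For reference, perplexity is the exponential of the mean negative log-likelihood per token, and the ratio column divides the pruned perplexity by the original. The sketch below shows both. The `perplexity` helper and its toy input are illustrative and are not the evaluation script behind the table.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


# Toy example: every token predicted with probability 0.5 gives perplexity 2.
toy = perplexity([math.log(0.5)] * 4)

# Ratio between the two perplexities reported in the table.
ratio = 71.48 / 26.19
print(round(toy, 6), round(ratio, 2))
```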

---

## Benchmark Results

### Comparison with Original Qwen3-8B-Base

| Benchmark | Original | Pruned | Retention | Status |
|-----------|----------|--------|-----------|--------|
| **PIQA** | 79.54% | 65.67% | 82.6% | ✅ Good |
| **BoolQ** | 83.09% | 61.77% | 74.3% | ⚠️ Acceptable |
| **HellaSwag** | 78.55% | 48.52% | 61.8% | ⚠️ Degraded |
| **MMLU (5-shot)** | 76.89% | 25.12% | 32.7% | ❌ Near random |

*Original scores from [Qwen3 Technical Report](https://arxiv.org/abs/2505.09388)*

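The Retention column is the pruned score divided by the original score, as a percentage. A minimal sketch that reproduces it from the table values (the `retention` helper is named here for illustration):

```python
# (original, pruned) accuracy in percent, copied from the table above.
scores = {
    "PIQA": (79.54, 65.67),
    "BoolQ": (83.09, 61.77),
    "HellaSwag": (78.55, 48.52),
    "MMLU (5-shot)": (76.89, 25.12),
}


def retention(original, pruned):
    """Share of the original score that the pruned model keeps, in percent."""
    return 100.0 * pruned / original


for name, (original, pruned) in scores.items():
    print(f"{name}: {retention(original, pruned):.1f}%")
```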
### Key Findings

1. **Physical reasoning preserved:** PIQA retained 82.6% of the original score.
2. **Factual knowledge destroyed:** MMLU collapsed to random chance (25% for four-way multiple choice).
3. **Perplexity underestimates damage:** a 2.73× perplexity ratio does not predict the collapse on MMLU.
4. **Layer-specific knowledge:** the MMLU collapse suggests that factual knowledge depends on the layers that were merged away.

---

## Usage

### Basic Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Mercity/Qwen3-8B-LaCo-Pruned"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",  # use the dtype stored in the checkpoint
    device_map="auto",   # place weights automatically; requires `accelerate`
    trust_remote_code=True,
)

# Text generation: this is a base model, so give it plain text to continue
prompt = "The process of photosynthesis"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### With 4-bit Quantization (Further Compression)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit weights with float16 compute; requires the `bitsandbytes` package
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained("Mercity/Qwen3-8B-LaCo-Pruned")
model = AutoModelForCausalLM.from_pretrained(
    "Mercity/Qwen3-8B-LaCo-Pruned",
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True,
)
```

---

## Recovery Recommendations

To restore performance after pruning:

### Option 1: LoRA Fine-tuning (Recommended)

```python
from peft import LoraConfig, get_peft_model

# `model` is the pruned model loaded as shown in the Usage section
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    # Adapt every attention and MLP projection in each decoder layer
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Fine-tune on OpenOrca, Alpaca, or domain-specific data
```

### Option 2: Knowledge Distillation

Use the original Qwen3-8B-Base as the teacher and this pruned model as the student to transfer knowledge back.

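One common way to set this up is a temperature-scaled KL divergence between the teacher's and the student's next-token distributions. The framework-free sketch below computes that loss for a single token position. The function names and the temperature value are illustrative assumptions, not a recipe taken from this card. In practice the same computation would run on logit tensors from both models.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """T^2 * KL(teacher || student) for one token position."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return temperature ** 2 * kl


# Identical logits give zero loss; diverging logits give a positive loss.
same = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
print(same, different > 0.0)
```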
### Expected Recovery

- With fine-tuning: +15-25% on MMLU
- With knowledge distillation: +25-35% on MMLU

---

## Technical Specifications

| Attribute | Value |
|-----------|-------|
| Architecture | Transformer decoder-only |
| Parameters | ~5.8B |
| Layers | 26 |
| Hidden Size | 4096 |
| Attention Heads (Q) | 32 |
| Attention Heads (KV) | 8 (GQA) |
| Intermediate Size | 12288 |
| Vocabulary Size | 151,669 |
| Max Context Length | 32,768 tokens |
| Precision | bfloat16 |

---

## Citation

If you use this model, please cite the original LaCo paper and Qwen3:

```bibtex
@article{yang2024laco,
  title={LaCo: Large Language Model Pruning via Layer Collapse},
  author={Yang, Yifei and Cao, Zouying and Zhao, Hai},
  journal={arXiv preprint arXiv:2402.11187},
  year={2024}
}

@misc{qwen3technicalreport,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025},
  eprint={2505.09388},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.09388}
}
```

## References

- [LaCo Paper](https://arxiv.org/abs/2402.11187)
- [LaCo Official Implementation](https://github.com/yangyifei729/LaCo)
- [Qwen3 Technical Report](https://arxiv.org/abs/2505.09388)
- [Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base)

## License

Apache 2.0 (same as base Qwen3 model)

## Acknowledgments

- Qwen Team for the excellent Qwen3-8B-Base model
- LaCo authors for the pruning methodology
- Hugging Face for model hosting