Libraries

How to use nota-ai/st-vicuna-v1.3-5.5b-ppl with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nota-ai/st-vicuna-v1.3-5.5b-ppl")

# Load model directly (the checkpoint is a LLaMA-style causal language model)
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nota-ai/st-vicuna-v1.3-5.5b-ppl")
model = AutoModelForCausalLM.from_pretrained("nota-ai/st-vicuna-v1.3-5.5b-ppl")
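
Once the tokenizer and model are loaded, generation might look like the sketch below. It assumes the checkpoint loads as a causal language model through `AutoModelForCausalLM`; the helper name `generate_text`, the default of 64 new tokens, and the greedy decoding setting are illustrative choices, not settings taken from the model card.

```python
def generate_text(prompt, model_id="nota-ai/st-vicuna-v1.3-5.5b-ppl", max_new_tokens=64):
    """Hypothetical helper: load the model and greedily continue `prompt`."""
    # Imports are deferred so that merely defining the helper stays cheap.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Use half precision on a GPU when one is available, full precision on CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype).to(device)
    model.eval()

    # Tokenize the prompt and move the tensors to the same device as the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

    # Decode the full sequence (prompt plus continuation) back to text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_text("Once upon a time,")` downloads the weights on first use, so expect a multi-gigabyte download.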

Local Apps

vLLM

How to use nota-ai/st-vicuna-v1.3-5.5b-ppl with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nota-ai/st-vicuna-v1.3-5.5b-ppl"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nota-ai/st-vicuna-v1.3-5.5b-ppl",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
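
The same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_completion_request` and `complete` are hypothetical, and it assumes the server answers in the usual OpenAI-compatible completions shape (`choices[0].text`).

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000",
                             model="nota-ai/st-vicuna-v1.3-5.5b-ppl",
                             max_tokens=512, temperature=0.5):
    """Build (but do not send) a POST request mirroring the curl call above."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request to a running server and return the generated text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read().decode("utf-8"))
    # Assumption: OpenAI-style response with the text under choices[0].text.
    return body["choices"][0]["text"]
```

With the server running, `complete("Once upon a time,")` returns the continuation; pass a different `base_url` to target a server listening on another host or port.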

Use Docker

docker model run hf.co/nota-ai/st-vicuna-v1.3-5.5b-ppl

SGLang

How to use nota-ai/st-vicuna-v1.3-5.5b-ppl with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nota-ai/st-vicuna-v1.3-5.5b-ppl" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nota-ai/st-vicuna-v1.3-5.5b-ppl",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nota-ai/st-vicuna-v1.3-5.5b-ppl" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nota-ai/st-vicuna-v1.3-5.5b-ppl",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use nota-ai/st-vicuna-v1.3-5.5b-ppl with Docker Model Runner:
```
docker model run hf.co/nota-ai/st-vicuna-v1.3-5.5b-ppl
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

Shortened LLaMA Model Card

Shortened LLaMA is a depth-pruned version of LLaMA models & variants for efficient text generation.

Developed by: Nota AI
License: Non-commercial license
Repository: https://github.com/Nota-NetsPresso/shortened-llm
Paper: https://arxiv.org/abs/2402.02834

Compression Method

After identifying unimportant Transformer blocks, we perform one-shot pruning and light LoRA-based retraining.
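
To make the block-selection step concrete, here is a toy sketch of one-shot depth pruning. It assumes one plausible reading of the PPL criterion: each block is scored by the perplexity measured after removing it, and the blocks whose removal hurts perplexity least are dropped. The function names and the scores are made up for illustration, and the LoRA retraining step is not shown.

```python
def select_blocks_to_prune(ppl_without_block, num_to_prune):
    """Return the indices of the blocks whose removal leaves perplexity lowest.

    ppl_without_block maps a block index to the perplexity measured on a
    calibration set with that single block removed (illustrative assumption).
    """
    ranked = sorted(ppl_without_block, key=lambda idx: ppl_without_block[idx])
    return sorted(ranked[:num_to_prune])


def prune_blocks(blocks, indices_to_prune):
    """Drop the selected blocks in one shot, keeping the rest in order."""
    drop = set(indices_to_prune)
    return [block for idx, block in enumerate(blocks) if idx not in drop]


# Made-up scores for a 4-block toy model: removing block 1 or 2 barely
# changes perplexity, so those two are treated as unimportant.
ppl_without_block = {0: 50.0, 1: 12.0, 2: 11.5, 3: 30.0}
to_prune = select_blocks_to_prune(ppl_without_block, num_to_prune=2)
remaining = prune_blocks(["block0", "block1", "block2", "block3"], to_prune)
print(to_prune)   # [1, 2]
print(remaining)  # ['block0', 'block3']
```

In a real model the list of blocks would be the stack of Transformer layers, and the pruned model would then be retrained briefly to recover quality.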

Model Links

| Source Model | Pruning Ratio | Pruning Criterion | HF Models Link |
| --- | --- | --- | --- |
| LLaMA-1-7B | 20% | PPL | nota-ai/st-llama-1-5.5b-ppl |
| LLaMA-1-7B | 20% | Taylor+ | nota-ai/st-llama-1-5.5b-taylor |
| Vicuna-v1.3-7B | 20% | PPL | nota-ai/st-vicuna-v1.3-5.5b-ppl |
| Vicuna-v1.3-7B | 20% | Taylor+ | nota-ai/st-vicuna-v1.3-5.5b-taylor |
| Vicuna-v1.3-13B | 21% | PPL | nota-ai/st-vicuna-v1.3-10.5b-ppl |
| Vicuna-v1.3-13B | 21% | Taylor+ | nota-ai/st-vicuna-v1.3-10.5b-taylor |

Zero-shot Performance & Efficiency Results

Evaluated with EleutherAI/lm-evaluation-harness, version 3326c54.

License

All rights related to this repository and the compressed models are reserved by Nota Inc.
The intended use is strictly limited to research and non-commercial projects.

Acknowledgments

LLM-Pruner, which utilizes LM Evaluation Harness, PEFT, and Alpaca-LoRA. Thanks for the pioneering work on structured pruning of LLMs!
Meta AI's LLaMA and LMSYS Org's Vicuna. Thanks for the open-source LLMs!

Citation

@article{kim2024shortened,
  title={Shortened LLaMA: A Simple Depth Pruning for Large Language Models},
  author={Kim, Bo-Kyeong and Kim, Geonmin and Kim, Tae-Ho and Castells, Thibault and Choi, Shinkook and Shin, Junho and Song, Hyoung-Kyu},
  journal={arXiv preprint arXiv:2402.02834},
  year={2024},
  url={https://arxiv.org/abs/2402.02834}
}

@article{kim2024mefomo,
  title={Shortened LLaMA: A Simple Depth Pruning for Large Language Models},
  author={Kim, Bo-Kyeong and Kim, Geonmin and Kim, Tae-Ho and Castells, Thibault and Choi, Shinkook and Shin, Junho and Song, Hyoung-Kyu},
  journal={ICLR Workshop on Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)},
  year={2024},
  url={https://openreview.net/forum?id=18VGxuOdpu}
}

Model size: 6B params (Safetensors)
Tensor type: F16

Paper: Shortened LLaMA: A Simple Depth Pruning for Large Language Models (arXiv:2402.02834, published Feb 5, 2024)