Libraries

How to use tripathyShaswata/sarvam-1-v0.5-GGUF with llama-cpp-python:

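# !pip install llama-cpp-python huggingface-hub
# (Llama.from_pretrained needs huggingface-hub to download the file)

from llama_cpp import Llama

# Download the Q8_0 file from the Hub (cached after the first run) and load it.
llm = Llama.from_pretrained(
    repo_id="tripathyShaswata/sarvam-1-v0.5-GGUF",
    filename="sarvam-1-v0.5-Q8_0.gguf",
)

# Generate up to 512 new tokens; echo=True includes the prompt in the returned text.
output = llm(
    "Once upon a time,",
    max_tokens=512,
    echo=True,
)
print(output)  # OpenAI-style completion dict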
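
The call above returns an OpenAI-style completion dictionary rather than a plain string. A minimal sketch of pulling out only the generated text, assuming the usual `choices[0]["text"]` layout; `completion_text` is a hypothetical helper name, and the sample dictionary is a stand-in rather than real model output:

```python
from typing import Any, Mapping


def completion_text(output: Mapping[str, Any]) -> str:
    """Return the text of the first choice in a completion-style dict."""
    choices = output.get("choices") or []
    if not choices:
        raise ValueError("completion contains no choices")
    return choices[0]["text"]


# Stand-in for the dict returned by llm(...); the values are illustrative only.
sample = {"choices": [{"text": "Once upon a time, ...", "finish_reason": "length"}]}
print(completion_text(sample))
```

In the snippet above, `completion_text(output)` would replace `print(output)` when only the text is wanted.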

llama.cpp

How to use tripathyShaswata/sarvam-1-v0.5-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf tripathyShaswata/sarvam-1-v0.5-GGUF:Q8_0
# Run inference directly in the terminal:
llama-cli -hf tripathyShaswata/sarvam-1-v0.5-GGUF:Q8_0

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf tripathyShaswata/sarvam-1-v0.5-GGUF:Q8_0
# Run inference directly in the terminal:
llama-cli -hf tripathyShaswata/sarvam-1-v0.5-GGUF:Q8_0

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf tripathyShaswata/sarvam-1-v0.5-GGUF:Q8_0
# Run inference directly in the terminal:
./llama-cli -hf tripathyShaswata/sarvam-1-v0.5-GGUF:Q8_0

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf tripathyShaswata/sarvam-1-v0.5-GGUF:Q8_0
# Run inference directly in the terminal:
./build/bin/llama-cli -hf tripathyShaswata/sarvam-1-v0.5-GGUF:Q8_0
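
Once `llama-server` is running, it can be called from Python with the standard library alone. The sketch below assumes the server is listening on `localhost:8080` (the default when no host or port is passed) and uses the OpenAI-compatible `/v1/completions` route; `build_completion_request` and `complete` are hypothetical helper names:

```python
import json
import urllib.request

# Assumption: llama-server was started without --host/--port overrides.
BASE_URL = "http://localhost:8080"


def build_completion_request(prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-compatible completions route."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, max_tokens: int = 128) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(build_completion_request(prompt, max_tokens)) as resp:
        return json.load(resp)["choices"][0]["text"]
```

With the server up, `complete("Once upon a time,")` returns the generated continuation. Building the request separately from sending it keeps the payload easy to inspect without a running server.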

Use Docker

docker model run hf.co/tripathyShaswata/sarvam-1-v0.5-GGUF:Q8_0

vLLM

How to use tripathyShaswata/sarvam-1-v0.5-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "tripathyShaswata/sarvam-1-v0.5-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tripathyShaswata/sarvam-1-v0.5-GGUF",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Ollama

How to use tripathyShaswata/sarvam-1-v0.5-GGUF with Ollama:
```
ollama run hf.co/tripathyShaswata/sarvam-1-v0.5-GGUF:Q8_0
```

Unsloth Studio

How to use tripathyShaswata/sarvam-1-v0.5-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for tripathyShaswata/sarvam-1-v0.5-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for tripathyShaswata/sarvam-1-v0.5-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for tripathyShaswata/sarvam-1-v0.5-GGUF to start chatting

Docker Model Runner

How to use tripathyShaswata/sarvam-1-v0.5-GGUF with Docker Model Runner:
```
docker model run hf.co/tripathyShaswata/sarvam-1-v0.5-GGUF:Q8_0
```

Lemonade

How to use tripathyShaswata/sarvam-1-v0.5-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull tripathyShaswata/sarvam-1-v0.5-GGUF:Q8_0

Run and chat with the model

lemonade run user.sarvam-1-v0.5-GGUF-Q8_0

List all available models

lemonade list

Sarvam-1-v0.5 GGUF

GGUF quantized versions of sarvamai/sarvam-1-v0.5 for local inference with llama.cpp, Ollama, LM Studio, and GPT4All.

Sarvam-1 is an Indian multilingual LLM built by Sarvam AI. It supports 22 Indian languages, including Hindi, Bengali, Tamil, Telugu, Kannada, Malayalam, Marathi, Punjabi, Gujarati, and Odia, and is based on the Llama architecture with 3.1B parameters.

Available Quantizations

| File | Quant | Size | RAM Needed | Use Case |
|---|---|---|---|---|
| `sarvam-1-v0.5-Q8_0.gguf` | Q8_0 | 2.5 GB | ~4 GB | Best quality, near-lossless |
| `sarvam-1-v0.5-f16.gguf` | F16 | 4.7 GB | ~6 GB | Full precision, maximum quality |
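
The two file sizes are consistent with the storage cost per weight of each format. As a rough sketch: Q8_0 stores blocks of 32 int8 weights plus one 16-bit scale, which works out to 8.5 bits per weight against 16 for F16. The estimate below ignores file metadata and any tensors kept at a different precision, and `estimate_size_gb` is a hypothetical helper:

```python
# Bits stored per weight. Q8_0 packs 32 int8 weights plus one fp16 scale
# into a 34-byte block, i.e. 34 * 8 / 32 = 8.5 bits per weight.
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5}


def estimate_size_gb(f16_size_gb: float, quant: str) -> float:
    """Scale an F16 file size to another quantization by bits per weight."""
    return f16_size_gb * BITS_PER_WEIGHT[quant] / BITS_PER_WEIGHT["F16"]


q8_size = estimate_size_gb(4.7, "Q8_0")
print(f"Estimated Q8_0 size: {q8_size:.1f} GB")  # prints "Estimated Q8_0 size: 2.5 GB"
```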

How to Use

With llama.cpp

./llama-cli -m sarvam-1-v0.5-Q8_0.gguf -p "भारत की राजधानी क्या है?" -n 256
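
The same command can be driven from a script. A minimal sketch, assuming `llama-cli` sits in the current directory as in the command above; `llama_cli_args` and `run_llama_cli` are hypothetical helper names. Passing the arguments as a list avoids shell quoting problems with non-ASCII prompts:

```python
import subprocess
from typing import List


def llama_cli_args(model_path: str, prompt: str, n_predict: int = 256,
                   binary: str = "./llama-cli") -> List[str]:
    """Build the argument list for a one-shot llama-cli generation."""
    return [binary, "-m", model_path, "-p", prompt, "-n", str(n_predict)]


def run_llama_cli(model_path: str, prompt: str, n_predict: int = 256) -> str:
    """Run llama-cli and return whatever it wrote to stdout."""
    result = subprocess.run(
        llama_cli_args(model_path, prompt, n_predict),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

For example, `run_llama_cli("sarvam-1-v0.5-Q8_0.gguf", "भारत की राजधानी क्या है?")` mirrors the command line shown above. Depending on the llama.cpp build, `llama-cli` may start in an interactive mode and wait for input, so treat this as a starting point rather than a finished wrapper.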

With Ollama

# Create a Modelfile
echo 'FROM ./sarvam-1-v0.5-Q8_0.gguf' > Modelfile
ollama create sarvam -f Modelfile
ollama run sarvam
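
A Modelfile can also carry sampling and context settings through `PARAMETER` lines. The sketch below generates such a file as text; `build_modelfile` is a hypothetical helper, and the `temperature` and `num_ctx` values are illustrative assumptions, not recommendations from this model card:

```python
from typing import Mapping, Optional, Union


def build_modelfile(
    gguf_path: str,
    parameters: Optional[Mapping[str, Union[int, float, str]]] = None,
) -> str:
    """Return Modelfile text: a FROM line plus one PARAMETER line per setting."""
    lines = [f"FROM {gguf_path}"]
    for name, value in (parameters or {}).items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


# Illustrative values only; tune them for your own use.
modelfile = build_modelfile(
    "./sarvam-1-v0.5-Q8_0.gguf",
    {"temperature": 0.7, "num_ctx": 2048},
)
print(modelfile, end="")
```

Writing the returned text to a file named `Modelfile` and then running `ollama create sarvam -f Modelfile` proceeds as in the commands above.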

With LM Studio

1. Download the Q8_0 file
2. Open LM Studio → Load Model → Select the file
3. Start chatting in English or any supported Indian language

Model Details

Architecture: Llama
Parameters: 3.1B
Hidden Size: 2048
Layers: 28
Attention Heads: 16
Context Length: Check original model card
Languages: English + 22 Indian languages (Hindi, Bengali, Tamil, Telugu, Kannada, Malayalam, Marathi, Punjabi, Gujarati, Odia, and more)
License: Apache 2.0

Original Model

Built by Sarvam AI — India's leading AI research company. See the original model at sarvamai/sarvam-1-v0.5.

Quantized by

Shaswata Tripathy | GitHub | Medium | LinkedIn | Hugging Face
