
DeepSeek Coder 7B Instruct v1.5 — GGUF (full quant collection)

GGUF conversions of deepseek-ai/deepseek-coder-7b-instruct-v1.5 for llama.cpp, LM Studio, KoboldCPP, llamafile, and other GGUF runtimes.

This repo publishes 14 quantization files (K-quants, IQ-quants, and Edmon02’s Q8_0). Pick a file by quality vs VRAM; see the table below.

Upstream: DeepSeek Coder 7B Instruct v1.5
Context: 4096 tokens
Architecture: LLaMA-family (llama in GGUF)
Chat format: ### Instruction / ### Response

Available GGUF files

| File | Quant | ~Size | VRAM (guide) | Quality / speed |
|---|---|---|---|---|
| deepseek-coder-7b-instruct-v1.5.IQ3_XS.gguf | IQ3_XS | ~2.5 GB | ~4 GB | Smallest IQ; fastest |
| deepseek-coder-7b-instruct-v1.5.IQ3_S.gguf | IQ3_S | ~2.6 GB | ~4 GB | IQ low-bit |
| deepseek-coder-7b-instruct-v1.5.IQ3_M.gguf | IQ3_M | ~2.7 GB | ~4 GB | IQ balanced-low |
| deepseek-coder-7b-instruct-v1.5.Q2_K.gguf | Q2_K | ~2.7 GB | ~4 GB | Minimum K-quant |
| deepseek-coder-7b-instruct-v1.5.IQ4_XS.gguf | IQ4_XS | ~3.0 GB | ~5 GB | IQ 4-bit extreme |
| deepseek-coder-7b-instruct-v1.5.Q3_K_S.gguf | Q3_K_S | ~3.1 GB | ~5 GB | Small, low quality |
| deepseek-coder-7b-instruct-v1.5.Q3_K_M.gguf | Q3_K_M | ~3.3 GB | ~5 GB | Low VRAM default |
| deepseek-coder-7b-instruct-v1.5.Q3_K_L.gguf | Q3_K_L | ~3.6 GB | ~5 GB | Q3 large |
| deepseek-coder-7b-instruct-v1.5.Q4_K_S.gguf | Q4_K_S | ~3.9 GB | ~6 GB | Q4 small |
| deepseek-coder-7b-instruct-v1.5.Q4_K_M.gguf | Q4_K_M | ~4.2 GB | ~6 GB | Best quality / size |
| deepseek-coder-7b-instruct-v1.5.Q5_K_S.gguf | Q5_K_S | ~4.6 GB | ~7 GB | Q5 small |
| deepseek-coder-7b-instruct-v1.5.Q5_K_M.gguf | Q5_K_M | ~4.9 GB | ~7 GB | High quality |
| deepseek-coder-7b-instruct-v1.5.Q6_K.gguf | Q6_K | ~5.7 GB | ~8 GB | Near-full quality |
| deepseek-coder-7b-instruct-v1.5.Q8_0.gguf | Q8_0 | ~7.35 GB | ~10 GB | Edmon02 conversion; highest |

Exact byte sizes: see gguf-manifest.json on this repo.

BF16 / F16: Not stored here (~14 GB). To build custom quants (e.g. Q4_0, Q5_0), convert the upstream safetensors to GGUF with llama.cpp's convert_hf_to_gguf.py script, then quantize the result with llama-quantize.
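
The ~14 GB figure follows from simple arithmetic: roughly 7 billion weights at 16 bits each. Below is a minimal sketch of that estimate. It assumes a nominal parameter count of 7e9 (the exact count differs somewhat) and ignores GGUF metadata. `estimate_size_gb` is a hypothetical helper written for this illustration, not part of llama.cpp or any library.

```python
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough file size in decimal GB: parameters x bits per weight / 8 bits per byte."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9  # assumption: nominal "7B"; the real parameter count is somewhat different

# F16 / BF16 store 16 bits per weight.
print(f"F16:  ~{estimate_size_gb(N_PARAMS, 16):.1f} GB")

# Q8_0 stores 8-bit values plus one 16-bit scale per block of 32 weights (8.5 bits per weight).
print(f"Q8_0: ~{estimate_size_gb(N_PARAMS, 8.5):.1f} GB")
```

The same formula gives only a ballpark for K-quants and IQ-quants, since those mix several bit widths across tensors.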

Provenance

| Quants | Source |
|---|---|
| Q8_0 | Original Edmon02 conversion (April 2024) |
| Q2_K … Q6_K, IQ3/IQ4 | Synced from mradermacher/deepseek-coder-7b-instruct-v1.5-GGUF (same base model; community quant) |

Repository layout

deepseek-coder-7b-instruct-v1.5-GGUF/
├── README.md
├── gguf-manifest.json
├── .gitattributes
├── deepseek-coder-7b-instruct-v1.5.Q2_K.gguf
├── deepseek-coder-7b-instruct-v1.5.Q3_K_S.gguf
├── … (all quant variants)
└── deepseek-coder-7b-instruct-v1.5.Q8_0.gguf

Download one quant

pip install -U huggingface_hub

# Example: best balance (Q4_K_M)
huggingface-cli download Edmon02/deepseek-coder-7b-instruct-v1.5-GGUF \
  deepseek-coder-7b-instruct-v1.5.Q4_K_M.gguf \
  --local-dir ./models/deepseek-7b-q4km

# Example: maximum quality (Q8_0, Edmon02)
huggingface-cli download Edmon02/deepseek-coder-7b-instruct-v1.5-GGUF \
  deepseek-coder-7b-instruct-v1.5.Q8_0.gguf \
  --local-dir ./models/deepseek-7b-q8

Download everything (≈50 GB total):

huggingface-cli download Edmon02/deepseek-coder-7b-instruct-v1.5-GGUF \
  --local-dir ./models/deepseek-7b-all-quants
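
As a rough cross-check of the total, the approximate per-file sizes from the table of available files can be summed. They add up to about 54 GB, so it is safer to treat ≈50 GB as a lower bound when planning disk space. The values below are the rounded sizes from that table, not exact byte counts (those are in gguf-manifest.json).

```python
# Approximate sizes in GB, copied from the "Available GGUF files" table.
APPROX_SIZES_GB = {
    "IQ3_XS": 2.5,
    "IQ3_S": 2.6,
    "IQ3_M": 2.7,
    "Q2_K": 2.7,
    "IQ4_XS": 3.0,
    "Q3_K_S": 3.1,
    "Q3_K_M": 3.3,
    "Q3_K_L": 3.6,
    "Q4_K_S": 3.9,
    "Q4_K_M": 4.2,
    "Q5_K_S": 4.6,
    "Q5_K_M": 4.9,
    "Q6_K": 5.7,
    "Q8_0": 7.35,
}

total_gb = sum(APPROX_SIZES_GB.values())
print(f"{len(APPROX_SIZES_GB)} files, about {total_gb:.0f} GB in total")
```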

Python:

from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Edmon02/deepseek-coder-7b-instruct-v1.5-GGUF",
    filename="deepseek-coder-7b-instruct-v1.5.Q4_K_M.gguf",
)
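
To fetch a few quants in one call rather than one file at a time, `snapshot_download` from `huggingface_hub` accepts glob patterns through `allow_patterns`. The sketch below assumes the file naming shown in this README. The helper names `quant_patterns` and `download_quants` and the target directory are made up for illustration.

```python
from typing import List

REPO_ID = "Edmon02/deepseek-coder-7b-instruct-v1.5-GGUF"


def quant_patterns(quants: List[str]) -> List[str]:
    """Turn quant names such as "Q4_K_M" into glob patterns matching this repo's file names."""
    return [f"*.{quant}.gguf" for quant in quants]


def download_quants(quants: List[str], local_dir: str) -> str:
    """Download only the selected quant files and return the local folder path."""
    # Imported lazily so the pattern helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=quant_patterns(quants),
        local_dir=local_dir,
    )


# Show the patterns that would be requested; nothing is downloaded here.
print(quant_patterns(["Q4_K_M", "Q8_0"]))
```

Calling `download_quants(["Q4_K_M", "Q8_0"], "./models/deepseek-7b-selected")` would then fetch only those two files (roughly 11.5 GB by the table's estimates).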

Quick start (llama.cpp)

MODEL=deepseek-coder-7b-instruct-v1.5.Q4_K_M.gguf   # or any quant above

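# One-shot generation. -e makes llama-cli expand the \n escapes in the prompt
# (older builds need it; recent builds do this by default).
./llama-cli -m "$MODEL" -e \
  -p "### Instruction:\nWrite a binary search in Python.\n\n### Response:\n" \
  -n 512 -c 4096 --temp 0.1

# Or serve the model over HTTP on port 8080.
./llama-server -m "$MODEL" -c 4096 --port 8080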
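
Once llama-server is running, it can be queried over HTTP. The sketch below posts to the server's native `/completion` endpoint using only the Python standard library. It assumes the server was started on port 8080 as in the command above. The helper names are hypothetical, and the sampling values simply mirror the llama-cli example.

```python
import json
import urllib.request


def build_completion_payload(prompt: str, n_predict: int = 512, temperature: float = 0.1) -> dict:
    """Build the JSON body for llama-server's /completion endpoint."""
    return {"prompt": prompt, "n_predict": n_predict, "temperature": temperature}


def complete(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """Send the prompt to a running llama-server and return the generated text."""
    body = json.dumps(build_completion_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["content"]


# Same prompt as the llama-cli example; Python expands \n itself.
PROMPT = "### Instruction:\nWrite a binary search in Python.\n\n### Response:\n"
payload = build_completion_payload(PROMPT)
```

With the server up, `print(complete(PROMPT))` would print the model's answer. Nothing is sent until `complete` is called.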

Chat template (instruct v1.5)

### Instruction:
{user message}

### Response:

See upstream chat_template for the full Jinja definition.
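
For scripts that build prompts by hand, the layout above can be wrapped in a small helper. This is a minimal single-turn sketch that follows the simplified template shown in this README, not the full upstream Jinja definition. `format_prompt` is a hypothetical name.

```python
def format_prompt(user_message: str) -> str:
    """Wrap one user message in the Instruction / Response layout shown above."""
    return f"### Instruction:\n{user_message.strip()}\n\n### Response:\n"


print(format_prompt("Write a binary search in Python."))
```

The model's answer is expected to continue directly after the final `### Response:` line.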

Choosing a quant

| Your constraint | Suggested file |
|---|---|
| ≤ 4 GB VRAM | IQ3_XS, Q2_K, or IQ3_M |
| ~6 GB VRAM | Q4_K_M (recommended) |
| ~8 GB VRAM | Q6_K |
| Best quality | Q8_0 (Edmon02) |
| Apple Silicon / CPU-only | Q4_K_M or Q5_K_M |

Lower-bit quants are smaller and usually run faster, but they tend to lose syntax fidelity on long code; benchmark on your own prompts.
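
The selection rule can also be expressed in code. The sketch below picks the largest file whose VRAM guide fits a given budget, using the approximate numbers from the table of available files. "Largest that fits" is an assumed heuristic chosen for illustration, and ties are broken by table order. `pick_quant` is a hypothetical helper.

```python
from typing import Optional

# (quant, approx. file size in GB, VRAM guide in GB), copied from the table of available files.
QUANTS = [
    ("IQ3_XS", 2.5, 4),
    ("IQ3_S", 2.6, 4),
    ("IQ3_M", 2.7, 4),
    ("Q2_K", 2.7, 4),
    ("IQ4_XS", 3.0, 5),
    ("Q3_K_S", 3.1, 5),
    ("Q3_K_M", 3.3, 5),
    ("Q3_K_L", 3.6, 5),
    ("Q4_K_S", 3.9, 6),
    ("Q4_K_M", 4.2, 6),
    ("Q5_K_S", 4.6, 7),
    ("Q5_K_M", 4.9, 7),
    ("Q6_K", 5.7, 8),
    ("Q8_0", 7.35, 10),
]


def pick_quant(vram_gb: float) -> Optional[str]:
    """Return the largest quant whose VRAM guide fits the budget, or None if nothing fits."""
    fitting = [entry for entry in QUANTS if entry[2] <= vram_gb]
    if not fitting:
        return None
    # max() keeps the first entry among equal sizes, so ties follow table order.
    return max(fitting, key=lambda entry: entry[1])[0]


for budget in (4, 6, 8, 10):
    print(f"{budget} GB -> {pick_quant(budget)}")
```

For 6 GB and 8 GB this reproduces the suggestions above (Q4_K_M and Q6_K). Real headroom also depends on context length and runtime overhead, so the result is only a starting point.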

Intended uses

  • Local code assistants (IDE, CLI, agents)
  • Offline development without API keys
  • Comparing quant trade-offs on the same Armenian/English coding stack

Out of scope

  • Non-code prompts (model often refuses by design)
  • Fine-tuning from GGUF (use upstream safetensors)
  • Guaranteed parity with DeepSeek cloud APIs

Limitations

  • IQ/K quants from mradermacher may differ slightly from Edmon02 Q8_0
  • 4K context only
  • ~50 GB if you download all files — use one quant in production

Maintainer tooling

# Upload missing quants + refresh manifest (from workspace)
python scripts/sync_deepseek_gguf_quants.py
python scripts/sync_deepseek_gguf_quants.py --remove-legacy   # drop old unnamed .gguf
python scripts/push_model_cards.py --only gguf

Citation

@misc{deepseek_coder_7b_v15_gguf,
  author = {Avetisyan, Edmon},
  title = {DeepSeek Coder 7B Instruct v1.5 (GGUF full quant collection)},
  year = {2024},
  howpublished = {\url{https://huggingface.co/Edmon02/deepseek-coder-7b-instruct-v1.5-GGUF}}
}

License

DeepSeek Model License. Third-party quants retain the same upstream terms; verify before commercial redistribution.
