Thank you very much

#13
by LaskarisAdrian - opened

Hi, hope you guys doing well.
First and foremost, I'm a poor-man in the sense that top-tier hardware doesn't actually fit my budget at the moment. But I needed a reliable "brain-only" workstation to serve my needs, fine-tuned the best I could, with these old specs, and my modest knowledge:

  • Intel Xeon E5-2697A v4
  • 64 GB DDR4
  • 1 x RTX 4080 Super
  • Headless Ubuntu Server, llama.cpp only (plus the models and a few launching scripts, nothing else)

Tested and tried different llama.cpp builds, the best approach I tried with best results so far:
Driver Version: 570.211.01 + CUDA Version: 12.8 - Cuda compilation tools, release 12.8, V12.8.93 - Build cuda_12.8.r12.8/compiler.35583870_0

Built llama like so:
cmake .. -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.8/bin/nvcc -DGGML_CUDA=ON -DGGML_BLAS=ON -DLLAMA_OPENSSL=ON -DGGML_BLAS_VENDOR=OpenBLAS -DCMAKE_C_FLAGS="-march=native" -DCMAKE_CXX_FLAGS="-march=native"
make llama-server -j28

launch llama like so:
exec numactl --interleave=all ./llama.cpp-mtp/build/bin/llama-server
-m /home/laskaris/ai/models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
--host 0.0.0.0
--port 8080
--parallel 1
--timeout 600
--gpu-layers 99
--override-tensor "blk..*ffn_.*exps=CPU"
--threads 16
--mlock
--no-mmap
--ctx-size 262144
--cache-type-k q8_0
--cache-type-v q8_0
--flash-attn on
--grp-attn-n 4
--grp-attn-w 1024
--temp 0.6
--top-p 0.8
--top-k 20
--min-p 0.05
--repeat-penalty 1.5
--spec-type draft-mtp
--spec-draft-n-max 3
--spec-draft-p-min 0.75
--jinja

So, this is just to say, thank you! It works for me.

Screenshot from 2026-05-17 04-12-28

Screenshot from 2026-05-17 04-14-07

Sign up or log in to comment