vLLM support is...

#4
by mrs83 - opened

...work in progress!


Inference Benchmark — echo-hybrid

Inference benchmark on single AMD Radeon AI Pro R9700

Endpoint: http://localhost:8001/v1
Prompt: Explain the concept of recursion briefly.
Max tokens: 64 · Warmup: 2 req · Date: 2026-05-16 13:15 UTC

| Concurrency | Throughput (req/s) | TTFT p50 | TTFT p95 | Latency p50 | Latency p95 | Errors |
|---|---|---|---|---|---|---|
| 1 | 0.88 | 48 ms | 57 ms | 1137 ms | 1152 ms | 0 |
| 2 | 1.69 | 58 ms | 83 ms | 1187 ms | 1199 ms | 0 |
| 4 | 3.35 | 74 ms | 86 ms | 1193 ms | 1203 ms | 0 |
| 8 | 5.19 | 82 ms | 91 ms | 1540 ms | 1548 ms | 0 |
| 16 | 10.21 | 74 ms | 97 ms | 1565 ms | 1584 ms | 0 |
| 32 | 20.34 | 94 ms | 109 ms | 1570 ms | 1584 ms | 0 |
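
As a quick sanity check on the R9700 numbers, here is a minimal sketch that compares the measured throughput with what Little's law would predict (throughput ≈ concurrency / latency). It assumes p50 latency is a fair stand-in for mean latency, which the benchmark output does not confirm; the function name is just for illustration.

```python
# (concurrency, measured req/s, latency p50 in ms), copied from the R9700 table
rows = [
    (1, 0.88, 1137),
    (2, 1.69, 1187),
    (4, 3.35, 1193),
    (8, 5.19, 1540),
    (16, 10.21, 1565),
    (32, 20.34, 1570),
]


def littles_law_throughput(concurrency, latency_ms):
    """Estimated req/s if `concurrency` requests are always in flight.

    Assumption: latency_ms approximates the mean request latency.
    """
    return concurrency / (latency_ms / 1000.0)


# Pair each measured value with its estimate and the measured/estimated ratio.
estimates = [
    (c, measured, littles_law_throughput(c, latency_ms))
    for c, measured, latency_ms in rows
]

for c, measured, estimated in estimates:
    ratio = measured / estimated
    print(f"c={c:>2}  measured={measured:6.2f}  estimated={estimated:6.2f}  ratio={ratio:.3f}")
```

A ratio close to 1 at every concurrency level suggests the load generator kept all slots busy and the reported throughput and latency columns are consistent with each other.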

Inference benchmark on single AMD Instinct MI300X VF - vLLM on ROCm 7.2.0

Inference Benchmark — echo-hybrid

Endpoint: http://127.0.0.1:8001/v1
Prompt: Explain the concept of recursion briefly.
Max tokens: 256 · Warmup: 2 req · Date: 2026-05-16 15:58 UTC

| Concurrency | Throughput (req/s) | TTFT p50 | TTFT p95 | Latency p50 | Latency p95 | Errors |
|---|---|---|---|---|---|---|
| 1 | 0.24 | 73 ms | 83 ms | 3826 ms | 7277 ms | 0 |
| 2 | 0.45 | 98 ms | 115 ms | 3832 ms | 7711 ms | 0 |
| 4 | 0.76 | 101 ms | 114 ms | 4439 ms | 7897 ms | 0 |
| 8 | 1.15 | 108 ms | 124 ms | 5527 ms | 9148 ms | 0 |
| 16 | 2.10 | 113 ms | 133 ms | 6180 ms | 9307 ms | 0 |
| 32 | 4.20 | 113 ms | 131 ms | 5889 ms | 9339 ms | 0 |
| 64 | 7.87 | 120 ms | 140 ms | 6598 ms | 10047 ms | 0 |
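
Note that the two runs use different token limits (64 vs 256), so the req/s columns are not directly comparable. A rough sketch of one way to put them on a common scale is to convert req/s into an upper bound on output tokens per second. This assumes every request generates the full `Max tokens`, which the reports do not state, so treat the result as a ceiling rather than a measurement; the helper name is hypothetical.

```python
def max_output_tokens_per_second(throughput_req_s, max_tokens):
    """Upper bound on generated tokens/s.

    Assumption: every request produces exactly `max_tokens` output tokens.
    Requests that stop early would make the true rate lower.
    """
    return throughput_req_s * max_tokens


# Throughput values copied from the two tables in this post.
r9700_c32 = max_output_tokens_per_second(20.34, 64)    # R9700, concurrency 32
mi300x_c32 = max_output_tokens_per_second(4.20, 256)   # MI300X, concurrency 32
mi300x_c64 = max_output_tokens_per_second(7.87, 256)   # MI300X, concurrency 64

print(f"R9700  c=32: at most {r9700_c32:.0f} output tok/s")
print(f"MI300X c=32: at most {mi300x_c32:.0f} output tok/s")
print(f"MI300X c=64: at most {mi300x_c64:.0f} output tok/s")
```

The wide gap between p50 and p95 latency in the MI300X run hints that generation lengths varied between requests, so the real token rate there may sit well below this ceiling.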
