llama-bench with Qwen2 32B (QwQ-32B) on an AMD EPYC 9554 CPU

This article shows CPU-only inference with a modern server processor – the AMD EPYC 9554. The LLM is Qwen2 32B (QwQ-32B) in several different quantizations, to show the difference in token generation speed and memory consumption. QwQ-32B is a pretty solid model, which makes it a reasonable candidate for offloading some LLM work locally for free. The article focuses only on benchmarking token generation per second; there are other papers on the output quality of the different quantized versions.
The testing bench is:

  • Single socket AMD EPYC 9554 CPU – 64 core CPU / 128 threads
  • 192GB RAM in 12 channels – all 12 memory channels are populated with 16GB DDR5-5600 Samsung modules (12 × 16GB = 192GB).
  • ASUS K14PA-U24-T Series motherboard
  • Testing with LLAMA.CPP – llama-bench
  • theoretical memory bandwidth 460.8 GB/s (according to the official documents from AMD) – see the sanity-check sketch after this list
  • the context window is the default 4K of the llama-bench tool. Memory consumption can vary greatly if the context window is increased.
  • More information for the setup and benchmarks – LLM inference benchmarks with llamacpp and AMD EPYC 9554 CPU
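
As a rough sanity check of the bandwidth figure: 460.8 GB/s corresponds to 12 channels at DDR5-4800 (the maximum officially supported memory speed of this CPU, even though the installed modules are rated 5600MHz), and CPU token generation is largely memory-bandwidth bound, so tokens per second are capped at roughly bandwidth divided by model size. A minimal sketch of this back-of-the-envelope math – the formula is a common rule of thumb, not an exact model:

#!/usr/bin/env bash
# Back-of-the-envelope check: CPU token generation is roughly
# memory-bandwidth bound, so t/s <= effective_bandwidth / model_size.
CHANNELS=12
MTS=4800              # DDR5-4800 - the officially supported speed
BYTES_PER_TRANSFER=8  # 64 bits per channel per transfer

# Theoretical peak: 12 * 4800 * 8 / 1000 = 460.8 GB/s
PEAK_GBS=$(echo "scale=1; $CHANNELS * $MTS * $BYTES_PER_TRANSFER / 1000" | bc)
echo "theoretical peak: ${PEAK_GBS} GB/s"

# Ceiling for the Q4_K_M model: 18.48 GiB ~= 19.84 GB read per token
echo "scale=1; $PEAK_GBS / 19.84" | bc   # ~23 t/s ceiling vs ~14 t/s measured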

Here are the results. The Q4 benchmark comes first because Q4 quantizations are really popular: they offer good quality with a really small footprint compared to the full-sized model. The tokens/s column is the average of the five tg tests shown below for each model; the diff column is the percentage slowdown relative to the previous row (a small script reproducing it follows the table).

| N | model        | parameters | quantization |    memory | diff t/s % | tokens/s |
| - | ------------ | ---------: | ------------ | --------: | ---------: | -------: |
| 1 | Qwen QwQ 32b |        32B | Q4_K_M       | 18.48 GiB |          0 |   13.936 |
| 2 | Qwen QwQ 32b |        32B | Q5_0         | 21.08 GiB |       9.52 |   12.608 |
| 3 | Qwen QwQ 32b |        32B | Q6_K         | 25.03 GiB |      12.92 |   10.978 |
| 4 | Qwen QwQ 32b |        32B | Q8_0         | 32.42 GiB |      18.20 |     8.98 |
| 5 | Qwen QwQ 32b |        32B | F16          | 61.03 GiB |      43.22 |    5.098 |
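
To make the diff column easy to verify, here is a quick sketch of the arithmetic in plain shell – nothing llama.cpp-specific:

#!/usr/bin/env bash
# Recompute the diff column: percentage slowdown of each quantization
# relative to the previous row (Q4_K_M is the starting point).
tps=(13.936 12.608 10.978 8.98 5.098)
labels=(Q4_K_M Q5_0 Q6_K Q8_0 F16)

for i in 1 2 3 4; do
  prev=${tps[$((i-1))]}
  cur=${tps[$i]}
  diff=$(echo "scale=2; ($prev - $cur) * 100 / $prev" | bc)
  echo "${labels[$i]}: ${diff}% slower than ${labels[$((i-1))]}"
done
# Expected: 9.52, 12.92, ~18.2, 43.22 - matching the table above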

The difference between Q4 and F16 is a 63.41% drop in tokens per second, but the bigger problem is that around 5 tokens per second is on the brink of being unusable – simply too slow for interactive use.

Here are all the tests output:

1. QWQ 32b Q4_K_M

Using the Qwen qwq-32b-q4_k_m.gguf file.

/root/llama.cpp/build/bin/llama-bench --numa distribute -t 64 -p 0 -n 128,256,512,1024,2048  -m /root/models/tests/qwq-32b-q4_k_m.gguf
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      64 |         tg128 |         13.96 ± 0.00 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      64 |         tg256 |         13.94 ± 0.00 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      64 |         tg512 |         14.06 ± 0.00 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      64 |        tg1024 |         13.99 ± 0.00 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      64 |        tg2048 |         13.73 ± 0.00 |

build: 51f311e0 (4753)

2. QWQ 32b Q5_0

Using the Qwen qwq-32b-q5_0.gguf file.

/root/llama.cpp/build/bin/llama-bench --numa distribute -t 64 -p 0 -n 128,256,512,1024,2048  -m /root/models/tests/qwq-32b-q5_0.gguf
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 32B Q5_0                 |  21.08 GiB |    32.76 B | BLAS,RPC   |      64 |         tg128 |         12.66 ± 0.00 |
| qwen2 32B Q5_0                 |  21.08 GiB |    32.76 B | BLAS,RPC   |      64 |         tg256 |         12.60 ± 0.01 |
| qwen2 32B Q5_0                 |  21.08 GiB |    32.76 B | BLAS,RPC   |      64 |         tg512 |         12.75 ± 0.00 |
| qwen2 32B Q5_0                 |  21.08 GiB |    32.76 B | BLAS,RPC   |      64 |        tg1024 |         12.60 ± 0.02 |
| qwen2 32B Q5_0                 |  21.08 GiB |    32.76 B | BLAS,RPC   |      64 |        tg2048 |         12.43 ± 0.01 |

build: 51f311e0 (4753)

3. QWQ 32b Q6_K

Using the Qwen qwq-32b-q6_k.gguf file.

/root/llama.cpp/build/bin/llama-bench --numa distribute -t 64 -p 0 -n 128,256,512,1024,2048  -m /root/models/tests/qwq-32b-q6_k.gguf
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 32B Q6_K                 |  25.03 GiB |    32.76 B | BLAS,RPC   |      64 |         tg128 |         11.00 ± 0.00 |
| qwen2 32B Q6_K                 |  25.03 GiB |    32.76 B | BLAS,RPC   |      64 |         tg256 |         10.99 ± 0.00 |
| qwen2 32B Q6_K                 |  25.03 GiB |    32.76 B | BLAS,RPC   |      64 |         tg512 |         11.04 ± 0.00 |
| qwen2 32B Q6_K                 |  25.03 GiB |    32.76 B | BLAS,RPC   |      64 |        tg1024 |         11.01 ± 0.00 |
| qwen2 32B Q6_K                 |  25.03 GiB |    32.76 B | BLAS,RPC   |      64 |        tg2048 |         10.85 ± 0.00 |

build: 51f311e0 (4753)

4. QWQ 32b Q8_0

Using the Qwen qwq-32b-q8_0.gguf file.

/root/llama.cpp/build/bin/llama-bench --numa distribute -t 64 -p 0 -n 128,256,512,1024,2048  -m /root/models/tests/qwq-32b-q8_0.gguf
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 32B Q8_0                 |  32.42 GiB |    32.76 B | BLAS,RPC   |      64 |         tg128 |          9.02 ± 0.00 |
| qwen2 32B Q8_0                 |  32.42 GiB |    32.76 B | BLAS,RPC   |      64 |         tg256 |          8.98 ± 0.00 |
| qwen2 32B Q8_0                 |  32.42 GiB |    32.76 B | BLAS,RPC   |      64 |         tg512 |          9.03 ± 0.00 |
| qwen2 32B Q8_0                 |  32.42 GiB |    32.76 B | BLAS,RPC   |      64 |        tg1024 |          9.00 ± 0.00 |
| qwen2 32B Q8_0                 |  32.42 GiB |    32.76 B | BLAS,RPC   |      64 |        tg2048 |          8.87 ± 0.00 |

build: 51f311e0 (4753)

5. QWQ 32b F16

Using the Qwen F16 model split across 17 files, from qwq-32b-fp16-00001-of-00017.gguf to qwq-32b-fp16-00017-of-00017.gguf. Pointing llama-bench at the first shard is enough – llama.cpp loads the remaining split GGUF files automatically.

/root/llama.cpp/build/bin/llama-bench --numa distribute -t 64 -p 0 -n 128,256,512,1024,2048  -m /root/models/tests/qwq-32b-fp16-00001-of-00017.gguf
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 32B F16                  |  61.03 GiB |    32.76 B | BLAS,RPC   |      64 |         tg128 |          5.11 ± 0.00 |
| qwen2 32B F16                  |  61.03 GiB |    32.76 B | BLAS,RPC   |      64 |         tg256 |          5.11 ± 0.00 |
| qwen2 32B F16                  |  61.03 GiB |    32.76 B | BLAS,RPC   |      64 |         tg512 |          5.11 ± 0.00 |
| qwen2 32B F16                  |  61.03 GiB |    32.76 B | BLAS,RPC   |      64 |        tg1024 |          5.10 ± 0.00 |
| qwen2 32B F16                  |  61.03 GiB |    32.76 B | BLAS,RPC   |      64 |        tg2048 |          5.06 ± 0.00 |

build: 51f311e0 (4753)

Before every test, the following commands were executed to disable automatic NUMA balancing and drop the filesystem caches (page cache, dentries and inodes):

echo 0 > /proc/sys/kernel/numa_balancing
echo 3 > /proc/sys/vm/drop_caches
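
To run the whole series unattended, the cache cleanup and the benchmark invocation can be wrapped in a small loop. This is only a sketch: it reuses the paths and flags from the runs above and assumes it is executed as root; adjust the model list to your own layout.

#!/usr/bin/env bash
# Run llama-bench over all quantizations, cleaning caches before each run.
# Must be run as root (writes to /proc and reads the /root paths).
set -euo pipefail

LLAMA_BENCH=/root/llama.cpp/build/bin/llama-bench
MODELS=(
  /root/models/tests/qwq-32b-q4_k_m.gguf
  /root/models/tests/qwq-32b-q5_0.gguf
  /root/models/tests/qwq-32b-q6_k.gguf
  /root/models/tests/qwq-32b-q8_0.gguf
  /root/models/tests/qwq-32b-fp16-00001-of-00017.gguf
)

for m in "${MODELS[@]}"; do
  echo 0 > /proc/sys/kernel/numa_balancing   # disable automatic NUMA balancing
  sync
  echo 3 > /proc/sys/vm/drop_caches          # drop page cache, dentries and inodes
  "$LLAMA_BENCH" --numa distribute -t 64 -p 0 -n 128,256,512,1024,2048 -m "$m"
done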
