llama-bench with Gemma 3 27B Instruct and AMD EPYC 9554

This article shows CPU-only inference on a modern server processor, the AMD EPYC 9554. The model is Google's Gemma 3 27B Instruct in several quantizations, to show how token generation speed and memory consumption differ between them. Gemma 3 27B Instruct is a solid LLM that Google offers for free, so in many cases it is worth considering for moving some LLM work to a local machine at no cost. The article covers only the token generation speed (tokens per second); the output quality of the different quantized versions is covered in other papers.
The testing bench is:

Single socket AMD EPYC 9554 CPU – 64 core CPU / 128 threads
192GB RAM in 12 channels: all 12 CPU memory channels are populated with 16GB DDR5 5600MHz Samsung modules.
ASUS K14PA-U24-T Series motherboard
Testing with llama.cpp – llama-bench
Theoretical memory bandwidth is 460.8 GB/s (according to the official documents from AMD)
The context window is the default 4K of the llama-bench tool. Memory consumption can vary greatly if the context window is increased.
More information on the setup and benchmarks – LLM inference benchmarks with llamacpp and AMD EPYC 9554 CPU
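
The 460.8 GB/s figure can be reproduced with the usual peak-bandwidth formula (channels × transfer rate × bytes per transfer). The sketch below is only an illustration: it assumes the memory runs at 4800 MT/s, which is the rate that yields AMD's number, rather than at the 5600 MT/s the modules are rated for. The function and constant names are made up for this example.

```python
# Theoretical peak DRAM bandwidth = channels * transfers per second * bytes per transfer.
CHANNELS = 12               # populated memory channels (from the setup above)
TRANSFER_RATE_MT_S = 4800   # ASSUMPTION: effective DDR5-4800 operation, not the DIMM rating
BUS_WIDTH_BYTES = 8         # 64-bit data path per channel


def peak_bandwidth_gb_s(channels, mt_per_s, bytes_per_transfer):
    """Return the peak bandwidth in GB/s (decimal gigabytes)."""
    return channels * mt_per_s * 1e6 * bytes_per_transfer / 1e9


bandwidth = peak_bandwidth_gb_s(CHANNELS, TRANSFER_RATE_MT_S, BUS_WIDTH_BYTES)
print(f"{bandwidth:.1f} GB/s")
```

This is an upper bound on paper; the bandwidth a real workload reaches is lower.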

Here are the results. The first benchmark, Q4_K_M, is the baseline for the diff column below, because Q4 quantizations are really popular: they offer good quality with a much smaller footprint than the full-sized model. The tokens/s column is the mean of the five tg results listed for each model in the test output further down.

N	model	parameters	quantization	memory	diff t/s % (vs. Q4_K_M)	tokens/s
1	Gemma 3 27B Instruct	27B	Q4_K_M	15.40 GiB	0	14.272
2	Gemma 3 27B Instruct	27B	Q5_K_M	17.94 GiB	10.66	12.75
3	Gemma 3 27B Instruct	27B	Q6_K	20.64 GiB	19.24	11.526
4	Gemma 3 27B Instruct	27B	Q8_0	26.73 GiB	32.68	9.608
5	Gemma 3 27B Instruct	27B	BF16	50.31 GiB	60.85	5.588
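
For anyone who wants to recompute the summary, here is a small sketch that derives the mean tokens/s and the slowdown relative to Q4_K_M from the per-test results printed later in this article. The numbers are copied from the llama-bench output; the names `RESULTS` and `summarize` are illustrative and not part of any tool.

```python
from statistics import mean

# tg128, tg256, tg512, tg1024, tg2048 results (tokens/s) from the llama-bench runs
RESULTS = {
    "Q4_K_M": [14.62, 14.54, 14.42, 14.19, 13.59],
    "Q5_K_M": [13.07, 12.88, 12.96, 12.77, 12.07],
    "Q6_K": [11.76, 11.62, 11.66, 11.56, 11.03],
    "Q8_0": [9.73, 9.74, 9.70, 9.55, 9.32],
    "BF16": [5.65, 5.64, 5.60, 5.57, 5.48],
}


def summarize(results, baseline="Q4_K_M"):
    """Return {quantization: (mean tokens/s, % slowdown versus the baseline)}."""
    base = mean(results[baseline])
    return {
        name: (mean(runs), (base - mean(runs)) / base * 100)
        for name, runs in results.items()
    }


summary = summarize(RESULTS)
for name, (tps, slowdown) in summary.items():
    print(f"{name:8s} {tps:7.3f} t/s  {slowdown:6.2f} % slower than Q4_K_M")
```

Passing a different `baseline` key gives the slowdown relative to any other quantization.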

Going from Q4 to the 16-bit (BF16) version costs about 60.8% of the token generation speed. The problem is that around 5 tokens per second is on the brink of being unusable because it is too slow, while around 14 tokens per second is good and usable for daily work by a single user.
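
As a rough sanity check, the results can be related to the theoretical 460.8 GB/s memory bandwidth. The sketch assumes that generating one token reads every model weight about once, so model size times tokens/s approximates the memory traffic. That is a common simplification for dense models and was not measured in this article; the helper name is hypothetical.

```python
GIB = 2**30
PEAK_GB_S = 460.8  # theoretical bandwidth quoted in the setup

# (model size in GiB, mean tokens/s) taken from the results in this article
ROWS = {
    "Q4_K_M": (15.40, 14.272),
    "Q5_K_M": (17.94, 12.75),
    "Q6_K": (20.64, 11.526),
    "Q8_0": (26.73, 9.608),
    "BF16": (50.31, 5.588),
}


def effective_bandwidth_gb_s(size_gib, tokens_per_s):
    """ASSUMPTION: each generated token streams the whole model from RAM once."""
    return size_gib * GIB * tokens_per_s / 1e9


estimates = {name: effective_bandwidth_gb_s(*row) for name, row in ROWS.items()}
for name, gb_s in estimates.items():
    print(f"{name:8s} ~{gb_s:6.1f} GB/s  ({gb_s / PEAK_GB_S:5.1%} of theoretical peak)")
```

Under this assumption all five estimates stay below the theoretical peak, and the larger models come closer to it than the smaller quantizations.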

Here is the output of all the tests:

1. Gemma 3 27b instruct Q4_K_M

Using the unsloth gemma-3-27b-it-Q4_K_M.gguf file.

/root/llama.cpp/build/bin/llama-bench --numa distribute -t 64 -p 0 -n 128,256,512,1024,2048  -m /root/models/tests/gemma-3-27b-it-Q4_K_M.gguf
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| gemma3 27B Q4_K - Medium       |  15.40 GiB |    27.01 B | BLAS,RPC   |      64 |           tg128 |         14.62 ± 0.01 |
| gemma3 27B Q4_K - Medium       |  15.40 GiB |    27.01 B | BLAS,RPC   |      64 |           tg256 |         14.54 ± 0.01 |
| gemma3 27B Q4_K - Medium       |  15.40 GiB |    27.01 B | BLAS,RPC   |      64 |           tg512 |         14.42 ± 0.01 |
| gemma3 27B Q4_K - Medium       |  15.40 GiB |    27.01 B | BLAS,RPC   |      64 |          tg1024 |         14.19 ± 0.00 |
| gemma3 27B Q4_K - Medium       |  15.40 GiB |    27.01 B | BLAS,RPC   |      64 |          tg2048 |         13.59 ± 0.01 |

build: 66625a59 (6040)
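
The tables that llama-bench prints are plain Markdown, so the t/s values can be pulled out with a few lines of code. The sketch below parses two rows copied from the output above. It assumes the seven-column layout shown in this article; other builds or options may print different columns, and `parse_bench_table` is a hypothetical helper, not part of llama.cpp.

```python
SAMPLE = """\
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| gemma3 27B Q4_K - Medium       |  15.40 GiB |    27.01 B | BLAS,RPC   |      64 |           tg128 |         14.62 ± 0.01 |
| gemma3 27B Q4_K - Medium       |  15.40 GiB |    27.01 B | BLAS,RPC   |      64 |          tg2048 |         13.59 ± 0.01 |
"""


def parse_bench_table(text):
    """Return {test name: (mean t/s, stddev)} for every data row in the table."""
    rows = {}
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip the header, the separator row, and anything that is not a data row.
        if len(cells) != 7 or "±" not in cells[-1]:
            continue
        value, stddev = (float(x) for x in cells[-1].split("±"))
        rows[cells[5]] = (value, stddev)
    return rows


parsed = parse_bench_table(SAMPLE)
for test, (tps, stddev) in parsed.items():
    print(test, tps, stddev)
```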

2. Gemma 3 27b instruct Q5_K_M

Using the unsloth gemma-3-27b-it-Q5_K_M.gguf file.

/root/llama.cpp/build/bin/llama-bench --numa distribute -t 64 -p 0 -n 128,256,512,1024,2048  -m /root/models/tests/gemma-3-27b-it-Q5_K_M.gguf
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| gemma3 27B Q5_K - Medium       |  17.94 GiB |    27.01 B | BLAS,RPC   |      64 |           tg128 |         13.07 ± 0.00 |
| gemma3 27B Q5_K - Medium       |  17.94 GiB |    27.01 B | BLAS,RPC   |      64 |           tg256 |         12.88 ± 0.00 |
| gemma3 27B Q5_K - Medium       |  17.94 GiB |    27.01 B | BLAS,RPC   |      64 |           tg512 |         12.96 ± 0.00 |
| gemma3 27B Q5_K - Medium       |  17.94 GiB |    27.01 B | BLAS,RPC   |      64 |          tg1024 |         12.77 ± 0.00 |
| gemma3 27B Q5_K - Medium       |  17.94 GiB |    27.01 B | BLAS,RPC   |      64 |          tg2048 |         12.07 ± 0.00 |

build: 66625a59 (6040)

3. Gemma 3 27b instruct Q6_K

Using the unsloth gemma-3-27b-it-Q6_K.gguf file.

/root/llama.cpp/build/bin/llama-bench --numa distribute -t 64 -p 0 -n 128,256,512,1024,2048  -m /root/models/tests/gemma-3-27b-it-Q6_K.gguf
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| gemma3 27B Q6_K                |  20.64 GiB |    27.01 B | BLAS,RPC   |      64 |           tg128 |         11.76 ± 0.00 |
| gemma3 27B Q6_K                |  20.64 GiB |    27.01 B | BLAS,RPC   |      64 |           tg256 |         11.62 ± 0.00 |
| gemma3 27B Q6_K                |  20.64 GiB |    27.01 B | BLAS,RPC   |      64 |           tg512 |         11.66 ± 0.04 |
| gemma3 27B Q6_K                |  20.64 GiB |    27.01 B | BLAS,RPC   |      64 |          tg1024 |         11.56 ± 0.01 |
| gemma3 27B Q6_K                |  20.64 GiB |    27.01 B | BLAS,RPC   |      64 |          tg2048 |         11.03 ± 0.02 |

build: 66625a59 (6040)

4. Gemma 3 27b instruct Q8_0

Using the unsloth gemma-3-27b-it-Q8_0.gguf file.

/root/llama.cpp/build/bin/llama-bench --numa distribute -t 64 -p 0 -n 128,256,512,1024,2048  -m /root/models/tests/gemma-3-27b-it-Q8_0.gguf
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| gemma3 27B Q8_0                |  26.73 GiB |    27.01 B | BLAS,RPC   |      64 |           tg128 |          9.73 ± 0.01 |
| gemma3 27B Q8_0                |  26.73 GiB |    27.01 B | BLAS,RPC   |      64 |           tg256 |          9.74 ± 0.00 |
| gemma3 27B Q8_0                |  26.73 GiB |    27.01 B | BLAS,RPC   |      64 |           tg512 |          9.70 ± 0.00 |
| gemma3 27B Q8_0                |  26.73 GiB |    27.01 B | BLAS,RPC   |      64 |          tg1024 |          9.55 ± 0.00 |
| gemma3 27B Q8_0                |  26.73 GiB |    27.01 B | BLAS,RPC   |      64 |          tg2048 |          9.32 ± 0.01 |

build: 66625a59 (6040)

5. Gemma 3 27b instruct BF16

Using the unsloth gemma-3-27b-it-BF16-00001-of-00002.gguf and gemma-3-27b-it-BF16-00002-of-00002.gguf files.

/root/llama.cpp/build/bin/llama-bench --numa distribute -t 64 -p 0 -n 128,256,512,1024,2048  -m /root/models/tests/gemma-3-27b-it-BF16-00001-of-00002.gguf
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| gemma3 27B BF16                |  50.31 GiB |    27.01 B | BLAS,RPC   |      64 |           tg128 |          5.65 ± 0.00 |
| gemma3 27B BF16                |  50.31 GiB |    27.01 B | BLAS,RPC   |      64 |           tg256 |          5.64 ± 0.00 |
| gemma3 27B BF16                |  50.31 GiB |    27.01 B | BLAS,RPC   |      64 |           tg512 |          5.60 ± 0.00 |
| gemma3 27B BF16                |  50.31 GiB |    27.01 B | BLAS,RPC   |      64 |          tg1024 |          5.57 ± 0.00 |
| gemma3 27B BF16                |  50.31 GiB |    27.01 B | BLAS,RPC   |      64 |          tg2048 |          5.48 ± 0.00 |

build: 66625a59 (6040)
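
All five runs use the same llama-bench options and differ only in the model file, so the command lines can be generated instead of typed by hand. This is a sketch: the paths and flags are the ones used in this article, while `bench_command` and `MODELS` are names invented for the example.

```python
import shlex

LLAMA_BENCH = "/root/llama.cpp/build/bin/llama-bench"  # binary path used in this article
MODEL_DIR = "/root/models/tests"                       # model directory used in this article

MODELS = [
    "gemma-3-27b-it-Q4_K_M.gguf",
    "gemma-3-27b-it-Q5_K_M.gguf",
    "gemma-3-27b-it-Q6_K.gguf",
    "gemma-3-27b-it-Q8_0.gguf",
    "gemma-3-27b-it-BF16-00001-of-00002.gguf",  # first shard of the split model
]


def bench_command(model_file, threads=64, gen_lengths=(128, 256, 512, 1024, 2048)):
    """Build the llama-bench argument list with the options used in the tests above."""
    return [
        LLAMA_BENCH,
        "--numa", "distribute",
        "-t", str(threads),
        "-p", "0",  # no prompt-processing test, token generation only
        "-n", ",".join(str(n) for n in gen_lengths),
        "-m", f"{MODEL_DIR}/{model_file}",
    ]


commands = [bench_command(m) for m in MODELS]
for argv in commands:
    print(shlex.join(argv))
    # To actually run a benchmark: subprocess.run(argv, check=True)
```

Building an argument list rather than a single string keeps the script safe if a model path ever contains spaces.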

Before all tests, automatic NUMA balancing was disabled and the kernel caches (page cache, dentries and inodes) were dropped with these commands:

echo 0 > /proc/sys/kernel/numa_balancing
echo 3 > /proc/sys/vm/drop_caches
