This article shows CPU-only LLM inference on a modern server processor, the AMD EPYC 9554. The model is Microsoft's Phi-4 14B, tested at different quantization levels to show how quantization affects token generation speed and memory consumption. Phi-4 14B is a relatively small but capable LLM with good reasoning and logic, published by Microsoft, and in many cases it is a reasonable choice for offloading some LLM work locally for free (especially math and IT tasks). The main use case is probably memory-constrained environments. This article focuses only on benchmarking token generation speed; other papers cover the output quality of the different quantized versions.
The test bench is:
- Single-socket AMD EPYC 9554 – 64 cores / 128 threads
- 192GB RAM (12 x 16GB Samsung DDR5-5600), all 12 memory channels populated
- ASUS K14PA-U24-T Series motherboard
- Testing with llama.cpp – llama-bench
- Theoretical memory bandwidth of 460.8 GB/s (according to the official documents from AMD); see the calculation right after this list
- The context window is the llama-bench default of 4K; memory consumption can grow greatly if the context window is increased
- More information on the setup and benchmarks – LLM inference benchmarks with llamacpp and AMD EPYC 9554 CPU
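The 460.8 GB/s figure follows directly from the memory configuration: the EPYC 9554 supports DDR5 at up to 4800 MT/s (so the 5600 MT/s DIMMs run at 4800), each channel transfers 8 bytes, and all 12 channels are populated:

```bash
# 12 channels x 4800 MT/s x 8 bytes per transfer = 460800 MB/s
echo "scale=1; 12 * 4800 * 8 / 1000" | bc    # -> 460.8 GB/s
```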
Here are the results. The first test, Q4_K_M, is the reference point: Q4 quants are very popular because they offer good quality with a much smaller footprint than the full-size model. The diff column shows the percentage drop in tokens per second relative to the previous row, and the tokens/s column is the average of the five token-generation runs (tg128 to tg2048) shown in the detailed output further below. The last row tests the full-length float-32 version of the model's weights.
N | model | parameters | quantization | memory | t/s drop vs prev. row (%) | tokens/s (avg) |
---|---|---|---|---|---|---|
1 | Phi-4 14B | 14.66 B | Q4_K_M | 8.43 GiB | 0 | 28.74 |
2 | Phi-4 14B | 14.66 B | Q5_K_M | 9.87 GiB | 11.46 | 25.444 |
3 | Phi-4 14B | 14.66 B | Q6_K | 11.20 GiB | 8.63 | 23.246 |
4 | Phi-4 14B | 14.66 B | Q8_0 | 14.51 GiB | 17.43 | 19.194 |
5 | Phi-4 14B | 14.66 B | F16 | 27.31 GiB | 41.27 | 11.272 |
6 | Phi-4 14B | 14.66 B | F32 | 54.61 GiB | 47.05 | 5.968 |
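The memory column is explained almost entirely by the effective bits per weight of each format, which can be derived from the table itself. A minimal sketch (sizes in GiB from the table, parameter count 14.66B):

```bash
# Effective bits per weight = size in bytes * 8 / parameter count
awk 'BEGIN {
  params = 14.66e9
  n = split("Q4_K_M:8.43 Q5_K_M:9.87 Q6_K:11.20 Q8_0:14.51 F16:27.31 F32:54.61", rows, " ")
  for (i = 1; i <= n; i++) {
    split(rows[i], f, ":")
    printf "%-7s %5.2f bits/weight\n", f[1], f[2] * 1024^3 * 8 / params
  }
}'
# -> Q4_K_M ~4.9, Q5_K_M ~5.8, Q6_K ~6.6, Q8_0 ~8.5, F16 16.0, F32 32.0
```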
From Q4_K_M to F32 the token generation rate drops by 79.23%, yet even F32 (float 32-bit) remains usable at around 5-6 tokens per second. Around 6-11 tokens per second is good enough for daily use by a single user, which is what CPU inference offers here.
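The scaling also matches token generation being memory-bandwidth bound: every generated token streams roughly all of the weights from RAM once, so a naive ceiling is bandwidth divided by model size. A rough sketch with the figures above (it ignores KV-cache traffic and compute, so it is only an upper bound):

```bash
# Naive memory-bound ceiling: tokens/s <= memory bandwidth / model size
awk 'BEGIN {
  bw = 460.8                                  # GB/s, theoretical
  n = split("Q4_K_M:8.43:28.74 Q8_0:14.51:19.194 F16:27.31:11.272 F32:54.61:5.968", rows, " ")
  for (i = 1; i <= n; i++) {
    split(rows[i], f, ":")
    ceiling = bw / (f[2] * 1.073741824)       # GiB -> GB
    printf "%-7s ceiling ~%5.1f t/s, measured %6.3f t/s (%2.0f%%)\n",
           f[1], ceiling, f[3], 100 * f[3] / ceiling
  }
}'
# Measured throughput lands at ~56% (Q4_K_M) to ~76% (F32) of this ceiling.
```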
Here is the full output of all tests:
1. Phi-4 14B Q4_K_M
Using bartowski phi-4-Q4_K_M.gguf file.
/root/llama.cpp/build/bin/llama-bench --numa distribute -t 64 -p 0 -n 128,256,512,1024,2048 -m /root/models/tests/phi-4-Q4_K_M.gguf

| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| phi3 14B Q4_K - Medium | 8.43 GiB | 14.66 B | BLAS,RPC | 64 | tg128 | 29.54 ± 0.03 |
| phi3 14B Q4_K - Medium | 8.43 GiB | 14.66 B | BLAS,RPC | 64 | tg256 | 29.39 ± 0.01 |
| phi3 14B Q4_K - Medium | 8.43 GiB | 14.66 B | BLAS,RPC | 64 | tg512 | 29.14 ± 0.01 |
| phi3 14B Q4_K - Medium | 8.43 GiB | 14.66 B | BLAS,RPC | 64 | tg1024 | 28.61 ± 0.01 |
| phi3 14B Q4_K - Medium | 8.43 GiB | 14.66 B | BLAS,RPC | 64 | tg2048 | 27.02 ± 0.04 |

build: 66625a59 (6040)
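As a cross-check, the tokens/s values in the summary table are simply the means of the five tg runs; for Q4_K_M:

```bash
echo "scale=2; (29.54 + 29.39 + 29.14 + 28.61 + 27.02) / 5" | bc    # -> 28.74
```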
2. Phi-4 14B Q5_K_M
Using bartowski phi-4-Q5_K_M.gguf file.
/root/llama.cpp/build/bin/llama-bench --numa distribute -t 64 -p 0 -n 128,256,512,1024,2048 -m /root/models/tests/phi-4-Q5_K_M.gguf

| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| phi3 14B Q5_K - Medium | 9.87 GiB | 14.66 B | BLAS,RPC | 64 | tg128 | 26.08 ± 0.01 |
| phi3 14B Q5_K - Medium | 9.87 GiB | 14.66 B | BLAS,RPC | 64 | tg256 | 25.98 ± 0.01 |
| phi3 14B Q5_K - Medium | 9.87 GiB | 14.66 B | BLAS,RPC | 64 | tg512 | 25.77 ± 0.01 |
| phi3 14B Q5_K - Medium | 9.87 GiB | 14.66 B | BLAS,RPC | 64 | tg1024 | 25.36 ± 0.00 |
| phi3 14B Q5_K - Medium | 9.87 GiB | 14.66 B | BLAS,RPC | 64 | tg2048 | 24.03 ± 0.03 |

build: 66625a59 (6040)
3. Phi-4 14B Q6_K
Using bartowski phi-4-Q6_K.gguf file.
/root/llama.cpp/build/bin/llama-bench --numa distribute -t 64 -p 0 -n 128,256,512,1024,2048 -m /root/models/tests/phi-4-Q6_K.gguf

| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| phi3 14B Q6_K | 11.20 GiB | 14.66 B | BLAS,RPC | 64 | tg128 | 23.81 ± 0.02 |
| phi3 14B Q6_K | 11.20 GiB | 14.66 B | BLAS,RPC | 64 | tg256 | 23.69 ± 0.03 |
| phi3 14B Q6_K | 11.20 GiB | 14.66 B | BLAS,RPC | 64 | tg512 | 23.52 ± 0.01 |
| phi3 14B Q6_K | 11.20 GiB | 14.66 B | BLAS,RPC | 64 | tg1024 | 23.17 ± 0.00 |
| phi3 14B Q6_K | 11.20 GiB | 14.66 B | BLAS,RPC | 64 | tg2048 | 22.04 ± 0.02 |

build: 66625a59 (6040)
4. Phi-4 14B Q8_0
Using bartowski phi-4-Q8_0.gguf file.
/root/llama.cpp/build/bin/llama-bench --numa distribute -t 64 -p 0 -n 128,256,512,1024,2048 -m /root/models/tests/phi-4-Q8_0.gguf

| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| phi3 14B Q8_0 | 14.51 GiB | 14.66 B | BLAS,RPC | 64 | tg128 | 19.53 ± 0.00 |
| phi3 14B Q8_0 | 14.51 GiB | 14.66 B | BLAS,RPC | 64 | tg256 | 19.48 ± 0.00 |
| phi3 14B Q8_0 | 14.51 GiB | 14.66 B | BLAS,RPC | 64 | tg512 | 19.39 ± 0.00 |
| phi3 14B Q8_0 | 14.51 GiB | 14.66 B | BLAS,RPC | 64 | tg1024 | 19.14 ± 0.00 |
| phi3 14B Q8_0 | 14.51 GiB | 14.66 B | BLAS,RPC | 64 | tg2048 | 18.43 ± 0.01 |

build: 66625a59 (6040)
5. Phi-4 14B F16
Using bartowski phi-4-f16.gguf file.
/root/llama.cpp/build/bin/llama-bench --numa distribute -t 64 -p 0 -n 128,256,512,1024,2048 -m /root/models/tests/phi-4-f16.gguf

| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| phi3 14B F16 | 27.31 GiB | 14.66 B | BLAS,RPC | 64 | tg128 | 11.40 ± 0.00 |
| phi3 14B F16 | 27.31 GiB | 14.66 B | BLAS,RPC | 64 | tg256 | 11.39 ± 0.00 |
| phi3 14B F16 | 27.31 GiB | 14.66 B | BLAS,RPC | 64 | tg512 | 11.33 ± 0.00 |
| phi3 14B F16 | 27.31 GiB | 14.66 B | BLAS,RPC | 64 | tg1024 | 11.25 ± 0.00 |
| phi3 14B F16 | 27.31 GiB | 14.66 B | BLAS,RPC | 64 | tg2048 | 10.99 ± 0.00 |

build: 66625a59 (6040)
6. Phi-4 14B F32
Using bartowski phi-4-f32-00001-of-00002.gguf file (the F32 model is split into two GGUF parts; llama-bench is pointed at the first one).
/root/llama.cpp/build/bin/llama-bench --numa distribute -t 64 -p 0 -n 128,256,512,1024,2048 -m /root/models/tests/phi-4-f32-00001-of-00002.gguf

| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| phi3 14B all F32 | 54.61 GiB | 14.66 B | BLAS,RPC | 64 | tg128 | 6.00 ± 0.00 |
| phi3 14B all F32 | 54.61 GiB | 14.66 B | BLAS,RPC | 64 | tg256 | 6.00 ± 0.00 |
| phi3 14B all F32 | 54.61 GiB | 14.66 B | BLAS,RPC | 64 | tg512 | 5.99 ± 0.00 |
| phi3 14B all F32 | 54.61 GiB | 14.66 B | BLAS,RPC | 64 | tg1024 | 5.96 ± 0.00 |
| phi3 14B all F32 | 54.61 GiB | 14.66 B | BLAS,RPC | 64 | tg2048 | 5.89 ± 0.00 |

build: 66625a59 (6040)
Before the tests, the following commands were executed to disable automatic NUMA balancing and drop the caches:

echo 0 > /proc/sys/kernel/numa_balancing
echo 3 > /proc/sys/vm/drop_caches
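Putting it together, a minimal script to reproduce one run could look like the sketch below (the sync before dropping caches is added here as common practice, not something shown in the original runs):

```bash
#!/bin/bash
# Prep: disable automatic NUMA balancing and drop the caches.
echo 0 > /proc/sys/kernel/numa_balancing   # stop the kernel from migrating pages between nodes
sync                                       # flush dirty pages first (assumption: common practice)
echo 3 > /proc/sys/vm/drop_caches          # drop pagecache, dentries and inodes

# Token-generation benchmark, exactly as in the tests above:
# --numa distribute spreads execution across NUMA nodes, -t 64 uses one
# thread per physical core, -p 0 skips the prompt-processing tests.
/root/llama.cpp/build/bin/llama-bench --numa distribute -t 64 -p 0 \
    -n 128,256,512,1024,2048 -m /root/models/tests/phi-4-Q4_K_M.gguf
```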