LLM inference benchmarks with llamacpp and AMD EPYC 9554 CPU

LLMs, or large language models, are really popular these days, and many people and organizations have begun to rely on them. Of course, the easiest and fastest solution is to use them as chat bots, and the big IT companies offer such services for 20 to 200 USD per month. But as with the cloud hype of the past 10-15 years, there are security and privacy concerns, and the best way to mitigate them is the old “run it yourself” approach, i.e. the private cloud: run the LLM inference at home or within the business organization. That way most of the privacy and security concerns are addressed. There are many open source LLMs that hold their own on accuracy (intelligence?) and performance against the big ones from OpenAI, Google (Gemini) and others.
The purpose of this article is to show the performance of the 4th generation AMD processor AMD EPYC 9554 (Genoa) with 64 cores on a single socket board using 12 memory channels of DDR5-5600. The main testing software is llama.cpp with its llama-bench tool.

Benchmark Results

Here are the benchmark results, which are summarized from the tests below.

N model parameters quantization tokens per second
1 DeepSeek R1 Distill Llama 8B Q4_K_M 49.97
2 DeepSeek R1 Distill Llama 70B Q4_K_M 7.11
3 Qwen – QwQ-32B 32B Q4_K_M 13.94
4 Llama 3.1 8B Instruct 8B Q4_K_M 49.64
5 Llama 3.3 70B Instruct 70B Q4_K_M 7.12

Below, after the llama-bench output from the benchmarks, there are also pure memory tests of this setup with the mlc tool (Intel® Memory Latency Checker v3.11b) and sysbench memory.

Hardware – what to expect from the AMD EPYC 9554

  • AMD EPYC 9554 – AMD 64 cores / 128 threads CPU – a 12 memory channel processor with a theoretical memory bandwidth of 460.8 GB/s (according to the official documents from AMD)
  • 192 GB total RAM, all memory channels are utilized.
  • K14PA-U24-T – ASUS single socket board
  • 24 slots, populated with 12 x 16 GB DDR5 Samsung 5600 MHz (M321R2GA3PB0-CWMXJ)
  • AMD Family 19h (Zen 4) architecture
  • CPU dies: 8
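The 460.8 GB/s figure can be sanity-checked with simple arithmetic. It matches 12 channels of DDR5-4800 (the officially supported memory speed of the EPYC 9004 series) rather than the DDR5-5600 rating of the installed DIMMs, which is presumably why AMD quotes that number:

```python
# Theoretical bandwidth = channels * transfer rate * bus width
channels = 12
transfers_per_s = 4800 * 1_000_000  # DDR5-4800, MT/s -> transfers/s
bytes_per_transfer = 8              # 64-bit data bus per DDR5 channel

gb_per_s = channels * transfers_per_s * bytes_per_transfer / 1e9
print(gb_per_s)  # 460.8
```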

Software

All tests were made under Gentoo Linux.

  • Gentoo Linux, everything built with “-march=native”
  • Linux kernel – 6.13.7 (gentoo-kernel package)
  • GNU GCC – gcc version 14.2.1 20241221
  • Glibc – 2.41

Testing

Three main tests are presented here, using llama.cpp for LLM inference.

  1. Deepseek R1 – Distill Llama 70B and Distill Llama 8B – Q4
  2. Qwen – QwQ-32B – Q4
  3. meta-llama – Llama 3.3 70B Instruct and Llama 3.1 8B Instruct – Q4

Testing benchmark with Deepseek R1 Distill Llama-70B

1. Deepseek R1 Distill Llama 70B

The first test uses quantization 4 (Q4_K_M) with the 70B DeepSeek R1 Distill Llama model. The files were downloaded from https://huggingface.co/bartowski/DeepSeek-R1-Distill-Llama-70B-GGUF; this is the quantization ollama uses by default.

srv ~/llama.cpp/build/bin # ./llama-bench --numa distribute -m /root/models/bartowski/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf -t 64 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      64 |         tg128 |          7.14 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      64 |         tg256 |          7.13 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      64 |         tg512 |          7.16 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      64 |        tg1024 |          7.13 ± 0.01 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      64 |        tg2048 |          7.03 ± 0.00 |

build: 51f311e0 (4753)

The generation speed is 7 tokens per second, which is pretty fast for a CPU-only setup. It is absolutely usable for everyday use.
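Token generation on a CPU is typically memory-bandwidth bound: each generated token has to stream essentially all model weights from RAM. A rough upper bound can be sketched, assuming one full read of the 39.59 GiB of weights per token and the ~368 GB/s "ALL Reads" bandwidth that mlc measures later in this article:

```python
# Back-of-envelope: tokens/s <= memory bandwidth / bytes of weights read per token
weights_gib = 39.59                       # Q4_K_M 70B model size from llama-bench
weights_gb = weights_gib * 1024**3 / 1e9  # ~42.5 GB in decimal units
measured_bw_gb_s = 368.5                  # mlc "ALL Reads" result for this machine

upper_bound_tps = measured_bw_gb_s / weights_gb
print(round(upper_bound_tps, 1))  # ~8.7 tokens/s upper bound; measured: ~7.1
```

The measured 7.1 t/s lands at roughly 80% of this crude bound, which is consistent with a bandwidth-limited workload.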

2. Deepseek R1 Distill Llama 8B

When using the Q4_K_M quantization with the 8B DeepSeek R1 Distill Llama, the speed is 7 times higher than with the 70B model. So roughly 8.8 times fewer model parameters yield 7 times faster token generation. 50 tokens per second is ultra fast for generating text for daily routines. The file was downloaded from here – huggingface.co.
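The scaling claim can be checked directly against the llama-bench numbers in this section (parameter counts and tg128 speeds taken from the tables):

```python
# Parameter and throughput ratios between the 70B and 8B distills
params_70b, params_8b = 70.55, 8.03  # billions, from the llama-bench output
tps_8b, tps_70b = 49.96, 7.14        # tg128 tokens/s from the two runs

print(round(params_70b / params_8b, 2))  # 8.79x fewer parameters
print(round(tps_8b / tps_70b, 2))        # 7.0x faster generation
```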

srv ~/llama.cpp/build/bin # ./llama-bench --numa distribute -m /root/models/bartowski/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -t 64 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      64 |         tg128 |         49.96 ± 0.05 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      64 |         tg256 |         49.83 ± 0.04 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      64 |         tg512 |         50.17 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      64 |        tg1024 |         49.88 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      64 |        tg2048 |         48.82 ± 0.02 |

build: 51f311e0 (4753)

3. The Qwen model QwQ-32B developed by Alibaba Cloud

The Qwen models’ GGUFs can be downloaded from https://huggingface.co/Qwen.

srv ~/llama.cpp/build/bin # ./llama-bench --numa distribute -m /root/models/qwq-32b-q4_k_m.gguf -t 64 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      64 |         tg128 |         14.00 ± 0.00 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      64 |         tg256 |         13.97 ± 0.00 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      64 |         tg512 |         14.00 ± 0.00 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      64 |        tg1024 |         14.01 ± 0.00 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      64 |        tg2048 |         13.74 ± 0.00 |

build: 51f311e0 (4753)

14 tokens per second for the new QwQ-32B model. So the model is absolutely usable on this processor for daily routines.

4. Meta Llama 3.3 70B Instruct

The Meta open source model Llama is widely used, and here is the benchmark with Llama 3.3 70B Instruct Q4_K_M.

srv ~/llama.cpp/build/bin # ./llama-bench --numa distribute -m /root/models/MaziyarPanahi/Llama-3.3-70B-Instruct.Q4_K_M.gguf -t 64 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      64 |         tg128 |          7.14 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      64 |         tg256 |          7.13 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      64 |         tg512 |          7.16 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      64 |        tg1024 |          7.14 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      64 |        tg2048 |          7.04 ± 0.00 |

build: 51f311e0 (4753)

5. Meta Llama 3.1 8B Instruct

A smaller model of the Meta Llama family. The results are almost 50 tokens per second, which is ultra fast for code generation. The GGUF file was downloaded from here – huggingface.co.

srv ~/llama.cpp/build/bin # ./llama-bench --numa distribute -m /root/models/bartowski/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -t 64 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      64 |         tg128 |         49.81 ± 0.06 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      64 |         tg256 |         49.81 ± 0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      64 |         tg512 |         49.82 ± 0.11 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      64 |        tg1024 |         49.96 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      64 |        tg2048 |         48.83 ± 0.02 |

build: 51f311e0 (4753)

Memory bandwidth tests

According to the AMD documentation, the theoretical memory bandwidth is 460.8 GB/s.
The first tests are with the Intel® Memory Latency Checker v3.11b, and the per-region scan shows that mlc manages to read close to the theoretical 460 GB/s from most 1 GB regions:

srv ~/mlc_v3/Linux # ./mlc --memory_bandwidth_scan
Intel(R) Memory Latency Checker - v3.11b
Command line parameters: --memory_bandwidth_scan 
Running memory bandwidth scan using 128 threads on numa node 0 accessing memory on numa node 0
Reserved 155 1GB pages
Now allocating 155 1GB pages. This may take several minutes..
1GB page allocation completed
Allocating remaining 25983720 KB memory in 4KB pages
Totally 178 GB memory allocated in 4KB+1GB pages on NUMA node 0
Measuring memory bandwidth for each of those 1GB memory regions..

Histogram report of BW in MB/sec across each 1GB region on NUMA node 0
BW_range(MB/sec)        #_of_1GB_regions
----------------        ----------------
[370000-374999] 1
[425000-429999] 1
[430000-434999] 4
[435000-439999] 2
[440000-444999] 1
[445000-449999] 11
[450000-454999] 10
[455000-459999] 16
[460000-464999] 7
[465000-469999] 7
[470000-474999] 15
[475000-479999] 38
[480000-484999] 38
[485000-489999] 19
[490000-494999] 4
[495000-499999] 4

Detailed BW report for each 1GB region allocated as contiguous 1GB page on NUMA node 0
phys_addr       GBaligned_page# MB/sec
---------       --------------- ------
0x640000000     25      373791
0x7c0000000     31      461432
0x840000000     33      446217
0x8c0000000     35      478366
0x900000000     36      489081
0x940000000     37      453880
0x980000000     38      476921
0xa00000000     40      498040
0xa40000000     41      499657
0xa80000000     42      479237
0xac0000000     43      487608
0xb00000000     44      479878
0xb40000000     45      486457
0xb80000000     46      445040
0xbc0000000     47      479162
0xc00000000     48      485666
0xc40000000     49      470877
0xc80000000     50      455536
0xcc0000000     51      463855
0xd00000000     52      473473
0xd40000000     53      477287
0xd80000000     54      479823
0xdc0000000     55      481503
0xe00000000     56      480524
0xe40000000     57      478150
0xe80000000     58      475043
0xec0000000     59      474685
0xf00000000     60      482404
0xf40000000     61      488530
0xf80000000     62      478932
0xfc0000000     63      489319
0x1000000000    64      474534
0x1040000000    65      463918
0x1080000000    66      480829
0x10c0000000    67      481290
0x1100000000    68      474027
0x1140000000    69      475619
0x1180000000    70      482599
0x11c0000000    71      480976
0x1200000000    72      475019
0x1240000000    73      477061
0x1280000000    74      493337
0x12c0000000    75      486877
0x1300000000    76      447146
0x1340000000    77      481132
0x1380000000    78      463243
0x13c0000000    79      454192
0x1400000000    80      482629
0x1440000000    81      457516
0x1480000000    82      475830
0x14c0000000    83      449708
0x1500000000    84      484403
0x1540000000    85      478519
0x1580000000    86      481798
0x15c0000000    87      485734
0x1600000000    88      478087
0x1640000000    89      489162
0x1680000000    90      497395
0x16c0000000    91      481215
0x1700000000    92      482349
0x1740000000    93      477427
0x1780000000    94      478849
0x17c0000000    95      483097
0x1800000000    96      467301
0x1840000000    97      488061
0x1880000000    98      474439
0x18c0000000    99      478811
0x1900000000    100     460426
0x1940000000    101     482257
0x1980000000    102     486956
0x19c0000000    103     471714
0x1a00000000    104     473947
0x1a40000000    105     484092
0x1a80000000    106     448975
0x1ac0000000    107     478926
0x1b00000000    108     480885
0x1b40000000    109     480943
0x1b80000000    110     476671
0x1bc0000000    111     484579
0x1c00000000    112     467751
0x1c40000000    113     457963
0x1c80000000    114     483882
0x1cc0000000    115     484567
0x1d00000000    116     480063
0x1d40000000    117     452251
0x1d80000000    118     457415
0x1dc0000000    119     481068
0x1e00000000    120     476475
0x1e40000000    121     483933
0x1e80000000    122     482085
0x1ec0000000    123     477129
0x1f00000000    124     475866
0x1f40000000    125     486586
0x1f80000000    126     477686
0x1fc0000000    127     489390
0x2000000000    128     483425
0x2040000000    129     482382
0x2080000000    130     486545
0x20c0000000    131     472699
0x2100000000    132     483781
0x2140000000    133     479401
0x2180000000    134     472585
0x21c0000000    135     480816
0x2200000000    136     456237
0x2240000000    137     480836
0x2280000000    138     495778
0x22c0000000    139     455378
0x2300000000    140     476790
0x2340000000    141     468877
0x2380000000    142     473366
0x23c0000000    143     479744
0x2400000000    144     489291
0x2440000000    145     464966
0x2480000000    146     484969
0x24c0000000    147     474751
0x2500000000    148     485469
0x2540000000    149     478304
0x2580000000    150     482716
0x25c0000000    151     477659
0x2600000000    152     489338
0x2640000000    153     477151
0x2680000000    154     458495
0x26c0000000    155     482926
0x2700000000    156     472554
0x2740000000    157     478093
0x2780000000    158     453397
0x27c0000000    159     471649
0x2800000000    160     459358
0x2840000000    161     479623
0x2880000000    162     475661
0x28c0000000    163     479670
0x2900000000    164     482914
0x2940000000    165     448677
0x2980000000    166     483432
0x29c0000000    167     450590
0x2a00000000    168     472940
0x2a40000000    169     457471
0x2a80000000    170     491832
0x2ac0000000    171     479482
0x2b00000000    172     492103
0x2b40000000    173     468062
0x2b80000000    174     477106
0x2bc0000000    175     452641
0x2c00000000    176     464928
0x2c40000000    177     487902
0x2c80000000    178     439816
0x2cc0000000    179     489015
0x2d00000000    180     484509
0x2d80000000    182     468220
0x2dc0000000    183     481232
0x2e00000000    184     482367
0x2e40000000    185     466881
0x2e80000000    186     478762
0x2ec0000000    187     467793
0x2f00000000    188     491180

Detailed BW report for each 1GB region allocated as 4KB page on NUMA node 0
phys_addr       MB/sec
---------       ------
0x14be4c000     426440
0x7a1bde000     447258
0x5ed3e7000     456383
0x4fdbf0000     444288
0x397bfa000     449031
0x3fd003000     431872
0x4f300d000     453598
0x377416000     430625
0x629d21000     454936
0x41452b000     431036
0x3bd534000     455391
0x2d4093d000    458338
0x2f44547000    430591
0x4008f2000     447335
0x4d18fc000     459839
0x567505000     435754
0x5df10e000     450983
0x697d18000     459549
0x6de921000     457112
0x73652b000     447445
0x787934000     450644
0x8a413d000     447738
0xb5b2000       457635

And just for the record, here is mlc executed without arguments.

srv ~/Linux # ./mlc 
Intel(R) Memory Latency Checker - v3.11b
Measuring idle latencies for random access (in ns)...
                Numa node
Numa node            0  
       0         110.3  

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      368520.6
3:1 Reads-Writes :      346747.8
2:1 Reads-Writes :      334734.0
1:1 Reads-Writes :      315459.8
Stream-triad like:      350504.9

Measuring Memory Bandwidths between nodes within system 
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0  
       0        369043.8

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency Bandwidth
Delay   (ns)    MB/sec
==========================
 00000  580.61   368333.8
 00002  579.56   368315.8
 00008  582.32   368023.2
 00015  582.25   368172.5
 00050  579.51   368034.9
 00100  575.79   368566.9
 00200  161.32   294910.1
 00300  143.18   198433.5
 00400  138.68   145960.4
 00500  137.97   117855.2
 00700  128.54    85005.0
 01000  128.16    59983.3
 01300  127.60    46397.4
 01700  122.02    35703.2
 02500  121.44    24512.7
 03500  121.11    17691.2
 05000  120.88    12557.9
 09000  120.66     7221.3
 20000  120.54     3544.4

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency        24.9
Local Socket L2->L2 HITM latency        24.9
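The peak injection "ALL Reads" figure can be put next to the theoretical maximum; a quick sketch, with the numbers taken from the mlc output above:

```python
# Fraction of theoretical bandwidth achieved under full all-core load
all_reads_mb_s = 368520.6  # mlc "ALL Reads", MB/s (1 MB/sec = 1,000,000 bytes/sec)
theoretical_gb_s = 460.8   # 12 channels of DDR5-4800

efficiency = (all_reads_mb_s / 1000) / theoretical_gb_s
print(f"{efficiency:.0%}")  # 80%
```

Achieving ~80% of the theoretical peak under a loaded all-reads pattern is a typical result for a fully populated DDR5 platform.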

The second tests are with sysbench, and with 128 threads it reaches 423501.10 MiB/sec. The first run uses 128 threads and the second run uses half of them, i.e. the CPU‘s 64 physical cores.

srv ~ # sysbench memory --memory-oper=read --memory-block-size=1K --memory-total-size=2000G --threads=128 run
sysbench 1.0.20 (using system LuaJIT 2.1.1731601260)

Running the test with following options:
Number of threads: 128
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 1KiB
  total size: 2048000MiB
  operation: read
  scope: global

Initializing worker threads...

Threads started!

Total operations: 2097152000 (433665124.75 per second)

2048000.00 MiB transferred (423501.10 MiB/sec)


General statistics:
    total time:                          4.8353s
    total number of events:              2097152000

Latency (ms):
         min:                                    0.00
         avg:                                    0.00
         max:                                    6.68
         95th percentile:                        0.00
         sum:                                94542.00

Threads fairness:
    events (avg/stddev):           16384000.0000/0.00
    execution time (avg/stddev):   0.7386/0.05

srv ~ # sysbench memory --memory-oper=read --memory-block-size=1K --memory-total-size=2000G --threads=64 run
sysbench 1.0.20 (using system LuaJIT 2.1.1731601260)

Running the test with following options:
Number of threads: 64
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 1KiB
  total size: 2048000MiB
  operation: read
  scope: global

Initializing worker threads...

Threads started!

Total operations: 2097152000 (288411209.57 per second)

2048000.00 MiB transferred (281651.57 MiB/sec)


General statistics:
    total time:                          7.2708s
    total number of events:              2097152000

Latency (ms):
         min:                                    0.00
         avg:                                    0.00
         max:                                    0.09
         95th percentile:                        0.00
         sum:                                73021.10

Threads fairness:
    events (avg/stddev):           32768000.0000/0.00
    execution time (avg/stddev):   1.1410/0.06
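Note that sysbench reports MiB/s (binary mebibytes) while mlc and the AMD figure use decimal GB/s, so the numbers are not directly comparable. A quick conversion sketch (keeping in mind that with 1 KiB blocks part of the sysbench traffic may be served from cache, which can push the result above what mlc measures for RAM alone):

```python
# Convert sysbench MiB/s (1 MiB = 1,048,576 bytes) to decimal GB/s
def mib_s_to_gb_s(mib_s: float) -> float:
    return mib_s * 1024**2 / 1e9

print(round(mib_s_to_gb_s(423501.10), 1))  # 128 threads -> 444.1 GB/s
print(round(mib_s_to_gb_s(281651.57), 1))  # 64 threads  -> 295.3 GB/s
```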
