LLMs, or large language models, are hugely popular these days, and many people and organizations have begun to rely on them. The easiest and fastest option is to use them as hosted chat bots, which the big IT companies offer for roughly 20 to 200 USD per month. But as with the cloud hype of the past 10-15 years, there are security and privacy concerns, and the best way to mitigate them is the old “run it yourself” approach, i.e. the private cloud: in this case, run the LLM inference at home or within the business organization. That addresses most of the privacy and security concerns. There are many open-source LLMs that hold their own in accuracy (intelligence?) and performance against the big proprietary ones from OpenAI, Google (Gemini) and others.
The purpose of this article is to show the LLM inference performance of the 4th-generation AMD EPYC 9554 (Genoa) processor with 64 cores, on a single-socket board populated with all 12 memory channels of DDR5-5600. The main testing software is llama.cpp and its llama-bench tool.
Benchmark Results
Here are the benchmark results, summarized from the tests below.

| N | model | parameters | quantization | tokens per second |
|---|-------|------------|--------------|-------------------|
| 1 | DeepSeek R1 Distill Llama | 8B | Q4_K_M | 49.97 |
| 2 | DeepSeek R1 Distill Llama | 70B | Q4_K_M | 7.11 |
| 3 | Qwen – QwQ-32B | 32B | Q4_K_M | 13.94 |
| 4 | Llama 3.1 8B Instruct | 8B | Q4_K_M | 49.64 |
| 5 | Llama 3.3 70B Instruct | 70B | Q4_K_M | 7.12 |
Below the llama-bench outputs from the benchmarks, there are pure memory bandwidth tests of this setup with the mlc tool (Intel® Memory Latency Checker v3.11b) and sysbench memory.
Hardware – what to expect from the AMD EPYC 9554
- AMD EPYC 9554 – AMD 64-core / 128-thread CPU – a 12-memory-channel processor with a theoretical memory bandwidth of 460.8 GB/s (according to the official documents from AMD)
- 192 GB RAM in total, all memory channels utilized
- ASUS K14PA-U24-T – single-socket board
- 24 DIMM slots, 12 populated with 16 GB DDR5 Samsung 5600 MHz modules (M321R2GA3PB0-CWMXJ)
- AMD Family 19h (Zen 4) architecture
- CPU dies: 8
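The 460.8 GB/s figure is easy to sanity-check: it corresponds to 12 channels at 4800 MT/s (the officially rated DDR5 speed for the EPYC 9004 series, which suggests AMD's number assumes DDR5-4800 even though the installed DIMMs are 5600 MT/s parts) with 8 bytes transferred per channel per cycle:

```shell
# Back-of-the-envelope check of AMD's 460.8 GB/s figure:
# 12 channels x 4800 MT/s (rated DDR5 speed for EPYC 9004) x 8 bytes/transfer
awk 'BEGIN { printf "%.1f GB/s\n", 12 * 4800 * 8 / 1000 }'
# -> 460.8 GB/s
```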
Software
All tests were made under Linux – Gentoo Linux.
- Gentoo Linux, everything built with “-march=native”
- Linux kernel – 6.13.7 (gentoo-kernel package)
- GNU GCC – gcc version 14.2.1 20241221
- glibc – 2.41
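For reference, a typical way to obtain llama-bench is to build llama.cpp from source. A minimal sketch follows; the exact CMake flags are illustrative and may differ from this system's actual configuration (the benchmarked build evidently used a BLAS backend, as seen in the llama-bench output below):

```shell
# Minimal llama.cpp build sketch (flags are illustrative, not this exact setup)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release -j
# llama-bench ends up in build/bin/
```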
Testing
Three main groups of tests are presented here, using llama.cpp for LLM inference.
- DeepSeek R1 – Distill Llama 70B and Distill Llama 8B – Q4
- Qwen – QwQ-32B – Q4
- meta-llama – Llama 3.3 70B Instruct and Llama 3.1 8B Instruct – Q4
Testing benchmark with DeepSeek R1 Distill Llama-70B
1. DeepSeek R1 Distill Llama 70B
The first test uses quantization 4 (Q4_K_M) of the 70B DeepSeek R1 Distill Llama. The files were downloaded from https://huggingface.co/bartowski/DeepSeek-R1-Distill-Llama-70B-GGUF; this is the quantization ollama uses by default.
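For reproduction, the GGUF can be fetched with the huggingface-cli tool from the huggingface_hub package. This is a hypothetical invocation, not the article's documented download method; the filename must match the one actually published in the repository:

```shell
# Hypothetical download sketch via huggingface_hub's CLI;
# the filename must match the actual file in the bartowski repo
pip install -U "huggingface_hub[cli]"
huggingface-cli download bartowski/DeepSeek-R1-Distill-Llama-70B-GGUF \
    DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf \
    --local-dir /root/models/bartowski
```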
srv ~/llama.cpp/build/bin # ./llama-bench --numa distribute -m /root/models/bartowski/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf -t 64 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      64 |         tg128 |          7.14 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      64 |         tg256 |          7.13 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      64 |         tg512 |          7.16 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      64 |        tg1024 |          7.13 ± 0.01 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      64 |        tg2048 |          7.03 ± 0.00 |

build: 51f311e0 (4753)
The generation speed is about 7 tokens per second, which is pretty fast for a CPU-only setup and absolutely usable for everyday use.
2. DeepSeek R1 Distill Llama 8B
When using Q4_K_M with the 8B DeepSeek R1 Distill Llama, generation is about 7 times faster than with the 70B model. So roughly 8.75 times fewer parameters yield 7 times faster token generation. 50 tokens per second is ultra fast for generating text for daily routines. The file was downloaded from here – huggingface.co.
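Single-stream token generation is typically memory-bandwidth-bound: every generated token streams essentially the whole quantized model through the CPU, so tokens per second is roughly capped at sustained read bandwidth divided by model size. A rough check of that rule of thumb against the numbers in this article (the 368.5 GB/s "ALL Reads" figure comes from the mlc run in the memory tests below; model sizes are the llama-bench "size" column, GiB converted to GB):

```shell
# Rough upper bound: tokens/s <= sustained_read_GBps / model_size_GB
# 368.5 GB/s = mlc "ALL Reads" sustained bandwidth (measured below)
awk 'BEGIN {
  bw = 368.5                                  # GB/s, sustained reads
  printf "70B bound: %.1f t/s (measured ~7.1)\n", bw / (39.59 * 1.073741824)
  printf " 8B bound: %.1f t/s (measured ~50)\n",  bw / (4.58  * 1.073741824)
}'
```

The 70B run sits close to its bound, while the 8B model reaches only ~50 of a possible ~75 t/s, leaving the memory bus partly idle; that would explain why the speedup is 7x rather than the full ~8.6x size ratio.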
srv ~/llama.cpp/build/bin # ./llama-bench --numa distribute -m /root/models/bartowski/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -t 64 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      64 |         tg128 |         49.96 ± 0.05 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      64 |         tg256 |         49.83 ± 0.04 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      64 |         tg512 |         50.17 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      64 |        tg1024 |         49.88 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      64 |        tg2048 |         48.82 ± 0.02 |

build: 51f311e0 (4753)
3. The Qwen model QwQ-32B, developed by Alibaba Cloud
The Qwen models’ GGUFs can be downloaded from https://huggingface.co/Qwen.
srv ~/llama.cpp/build/bin # ./llama-bench --numa distribute -m /root/models/qwq-32b-q4_k_m.gguf -t 64 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      64 |         tg128 |         14.00 ± 0.00 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      64 |         tg256 |         13.97 ± 0.00 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      64 |         tg512 |         14.00 ± 0.00 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      64 |        tg1024 |         14.01 ± 0.00 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      64 |        tg2048 |         13.74 ± 0.00 |

build: 51f311e0 (4753)
Around 14 tokens per second for the new QwQ-32B model, so it is absolutely usable on this processor for daily routines.
4. Meta Llama 3.3 70B Instruct
The Meta open-source Llama models are widely used; here is the benchmark with Llama 3.3 70B Instruct Q4_K_M.
srv ~/llama.cpp/build/bin # ./llama-bench --numa distribute -m /root/models/MaziyarPanahi/Llama-3.3-70B-Instruct.Q4_K_M.gguf -t 64 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      64 |         tg128 |          7.14 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      64 |         tg256 |          7.13 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      64 |         tg512 |          7.16 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      64 |        tg1024 |          7.14 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      64 |        tg2048 |          7.04 ± 0.00 |

build: 51f311e0 (4753)
5. Meta Llama 3.1 8B Instruct
A smaller model of the Meta Llama family. The result is almost 50 tokens per second, which is ultra fast, e.g. for code generation. The GGUF file was downloaded from here – huggingface.co.
srv ~/llama.cpp/build/bin # ./llama-bench --numa distribute -m /root/models/bartowski/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -t 64 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      64 |         tg128 |         49.81 ± 0.06 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      64 |         tg256 |         49.81 ± 0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      64 |         tg512 |         49.82 ± 0.11 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      64 |        tg1024 |         49.96 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      64 |        tg2048 |         48.83 ± 0.02 |

build: 51f311e0 (4753)
Memory bandwidth tests
According to the AMD documentation the theoretical memory bandwidth is 460.8 GB/s.
The first tests are with the Intel® Memory Latency Checker v3.11b. In the per-region bandwidth scan, mlc reads roughly 450–490 GB/s from most 1 GB regions, i.e. right around the theoretical maximum:
srv ~/mlc_v3/Linux # ./mlc --memory_bandwidth_scan
Intel(R) Memory Latency Checker - v3.11b
Command line parameters: --memory_bandwidth_scan

Running memory bandwidth scan using 128 threads on numa node 0 accessing memory on numa node 0
Reserved 155 1GB pages
Now allocating 155 1GB pages. This may take several minutes..
1GB page allocation completed
Allocating remaining 25983720 KB memory in 4KB pages
Totally 178 GB memory allocated in 4KB+1GB pages on NUMA node 0
Measuring memory bandwidth for each of those 1GB memory regions..

Histogram report of BW in MB/sec across each 1GB region on NUMA node 0
BW_range(MB/sec)   #_of_1GB_regions
----------------   ----------------
[370000-374999]                   1
[425000-429999]                   1
[430000-434999]                   4
[435000-439999]                   2
[440000-444999]                   1
[445000-449999]                  11
[450000-454999]                  10
[455000-459999]                  16
[460000-464999]                   7
[465000-469999]                   7
[470000-474999]                  15
[475000-479999]                  38
[480000-484999]                  38
[485000-489999]                  19
[490000-494999]                   4
[495000-499999]                   4

Detailed BW report for each 1GB region allocated as contiguous 1GB page on NUMA node 0
phys_addr      GBaligned_page#   MB/sec
---------      ---------------   ------
0x640000000                 25   373791
0x7c0000000                 31   461432
0x840000000                 33   446217
0x8c0000000                 35   478366
0x900000000                 36   489081
0x940000000                 37   453880
0x980000000                 38   476921
0xa00000000                 40   498040
0xa40000000                 41   499657
0xa80000000                 42   479237
0xac0000000                 43   487608
0xb00000000                 44   479878
0xb40000000                 45   486457
0xb80000000                 46   445040
0xbc0000000                 47   479162
0xc00000000                 48   485666
0xc40000000                 49   470877
0xc80000000                 50   455536
0xcc0000000                 51   463855
0xd00000000                 52   473473
0xd40000000                 53   477287
0xd80000000                 54   479823
0xdc0000000                 55   481503
0xe00000000                 56   480524
0xe40000000                 57   478150
0xe80000000                 58   475043
0xec0000000                 59   474685
0xf00000000                 60   482404
0xf40000000                 61   488530
0xf80000000                 62   478932
0xfc0000000                 63   489319
0x1000000000                64   474534
0x1040000000                65   463918
0x1080000000                66   480829
0x10c0000000                67   481290
0x1100000000                68   474027
0x1140000000                69   475619
0x1180000000                70   482599
0x11c0000000                71   480976
0x1200000000                72   475019
0x1240000000                73   477061
0x1280000000                74   493337
0x12c0000000                75   486877
0x1300000000                76   447146
0x1340000000                77   481132
0x1380000000                78   463243
0x13c0000000                79   454192
0x1400000000                80   482629
0x1440000000                81   457516
0x1480000000                82   475830
0x14c0000000                83   449708
0x1500000000                84   484403
0x1540000000                85   478519
0x1580000000                86   481798
0x15c0000000                87   485734
0x1600000000                88   478087
0x1640000000                89   489162
0x1680000000                90   497395
0x16c0000000                91   481215
0x1700000000                92   482349
0x1740000000                93   477427
0x1780000000                94   478849
0x17c0000000                95   483097
0x1800000000                96   467301
0x1840000000                97   488061
0x1880000000                98   474439
0x18c0000000                99   478811
0x1900000000               100   460426
0x1940000000               101   482257
0x1980000000               102   486956
0x19c0000000               103   471714
0x1a00000000               104   473947
0x1a40000000               105   484092
0x1a80000000               106   448975
0x1ac0000000               107   478926
0x1b00000000               108   480885
0x1b40000000               109   480943
0x1b80000000               110   476671
0x1bc0000000               111   484579
0x1c00000000               112   467751
0x1c40000000               113   457963
0x1c80000000               114   483882
0x1cc0000000               115   484567
0x1d00000000               116   480063
0x1d40000000               117   452251
0x1d80000000               118   457415
0x1dc0000000               119   481068
0x1e00000000               120   476475
0x1e40000000               121   483933
0x1e80000000               122   482085
0x1ec0000000               123   477129
0x1f00000000               124   475866
0x1f40000000               125   486586
0x1f80000000               126   477686
0x1fc0000000               127   489390
0x2000000000               128   483425
0x2040000000               129   482382
0x2080000000               130   486545
0x20c0000000               131   472699
0x2100000000               132   483781
0x2140000000               133   479401
0x2180000000               134   472585
0x21c0000000               135   480816
0x2200000000               136   456237
0x2240000000               137   480836
0x2280000000               138   495778
0x22c0000000               139   455378
0x2300000000               140   476790
0x2340000000               141   468877
0x2380000000               142   473366
0x23c0000000               143   479744
0x2400000000               144   489291
0x2440000000               145   464966
0x2480000000               146   484969
0x24c0000000               147   474751
0x2500000000               148   485469
0x2540000000               149   478304
0x2580000000               150   482716
0x25c0000000               151   477659
0x2600000000               152   489338
0x2640000000               153   477151
0x2680000000               154   458495
0x26c0000000               155   482926
0x2700000000               156   472554
0x2740000000               157   478093
0x2780000000               158   453397
0x27c0000000               159   471649
0x2800000000               160   459358
0x2840000000               161   479623
0x2880000000               162   475661
0x28c0000000               163   479670
0x2900000000               164   482914
0x2940000000               165   448677
0x2980000000               166   483432
0x29c0000000               167   450590
0x2a00000000               168   472940
0x2a40000000               169   457471
0x2a80000000               170   491832
0x2ac0000000               171   479482
0x2b00000000               172   492103
0x2b40000000               173   468062
0x2b80000000               174   477106
0x2bc0000000               175   452641
0x2c00000000               176   464928
0x2c40000000               177   487902
0x2c80000000               178   439816
0x2cc0000000               179   489015
0x2d00000000               180   484509
0x2d80000000               182   468220
0x2dc0000000               183   481232
0x2e00000000               184   482367
0x2e40000000               185   466881
0x2e80000000               186   478762
0x2ec0000000               187   467793
0x2f00000000               188   491180

Detailed BW report for each 1GB region allocated as 4KB page on NUMA node 0
phys_addr      MB/sec
---------      ------
0x14be4c000    426440
0x7a1bde000    447258
0x5ed3e7000    456383
0x4fdbf0000    444288
0x397bfa000    449031
0x3fd003000    431872
0x4f300d000    453598
0x377416000    430625
0x629d21000    454936
0x41452b000    431036
0x3bd534000    455391
0x2d4093d000   458338
0x2f44547000   430591
0x4008f2000    447335
0x4d18fc000    459839
0x567505000    435754
0x5df10e000    450983
0x697d18000    459549
0x6de921000    457112
0x73652b000    447445
0x787934000    450644
0x8a413d000    447738
0xb5b2000      457635
And just for the record, here is mlc executed without arguments:
srv ~/Linux # ./mlc
Intel(R) Memory Latency Checker - v3.11b
Measuring idle latencies for random access (in ns)...
		Numa node
Numa node	     0
       0	 110.3

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :	368520.6
3:1 Reads-Writes :	346747.8
2:1 Reads-Writes :	334734.0
1:1 Reads-Writes :	315459.8
Stream-triad like:	350504.9

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
		Numa node
Numa node	     0
       0	369043.8

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject	Latency	Bandwidth
Delay	(ns)	MB/sec
==========================
 00000	580.61	 368333.8
 00002	579.56	 368315.8
 00008	582.32	 368023.2
 00015	582.25	 368172.5
 00050	579.51	 368034.9
 00100	575.79	 368566.9
 00200	161.32	 294910.1
 00300	143.18	 198433.5
 00400	138.68	 145960.4
 00500	137.97	 117855.2
 00700	128.54	  85005.0
 01000	128.16	  59983.3
 01300	127.60	  46397.4
 01700	122.02	  35703.2
 02500	121.44	  24512.7
 03500	121.11	  17691.2
 05000	120.88	  12557.9
 09000	120.66	   7221.3
 20000	120.54	   3544.4

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency	24.9
Local Socket L2->L2 HITM latency	24.9
The second tests are with sysbench memory; with 128 threads it reaches 423501.10 MiB/sec of read bandwidth. The first run uses 128 threads and the second run half of them, i.e. the CPU’s 64 physical cores.
srv ~ # sysbench memory --memory-oper=read --memory-block-size=1K --memory-total-size=2000G --threads=128 run
sysbench 1.0.20 (using system LuaJIT 2.1.1731601260)

Running the test with following options:
Number of threads: 128
Initializing random number generator from current time

Running memory speed test with the following options:
  block size: 1KiB
  total size: 2048000MiB
  operation: read
  scope: global

Initializing worker threads...

Threads started!

Total operations: 2097152000 (433665124.75 per second)

2048000.00 MiB transferred (423501.10 MiB/sec)

General statistics:
    total time:                          4.8353s
    total number of events:              2097152000

Latency (ms):
         min:                                    0.00
         avg:                                    0.00
         max:                                    6.68
         95th percentile:                        0.00
         sum:                                94542.00

Threads fairness:
    events (avg/stddev):           16384000.0000/0.00
    execution time (avg/stddev):   0.7386/0.05

srv ~ # sysbench memory --memory-oper=read --memory-block-size=1K --memory-total-size=2000G --threads=64 run
sysbench 1.0.20 (using system LuaJIT 2.1.1731601260)

Running the test with following options:
Number of threads: 64
Initializing random number generator from current time

Running memory speed test with the following options:
  block size: 1KiB
  total size: 2048000MiB
  operation: read
  scope: global

Initializing worker threads...

Threads started!

Total operations: 2097152000 (288411209.57 per second)

2048000.00 MiB transferred (281651.57 MiB/sec)

General statistics:
    total time:                          7.2708s
    total number of events:              2097152000

Latency (ms):
         min:                                    0.00
         avg:                                    0.00
         max:                                    0.09
         95th percentile:                        0.00
         sum:                                73021.10

Threads fairness:
    events (avg/stddev):           32768000.0000/0.00
    execution time (avg/stddev):   1.1410/0.06