LLMs, or large language models, are really popular these days, and many people and organizations have begun to rely on them. This article continues in the spirit of the CPU-only benchmarks in the realm of LLM inference. Check out the other article on the subject with a much better and more expensive processor – LLM inference benchmarks with llamacpp and AMD EPYC 9554 CPU. The setup presented in this article is several times cheaper.
“Run it yourself” in your home or within a business organization is always way more secure and privacy-safe than cloud-based AI chat bots/assistants. There are many open-source LLMs that hold their own in accuracy (intelligence?) and performance against the big proprietary ones such as OpenAI's models, Google Gemini and more.
The purpose of this article is to show the performance of the 2nd-generation AMD EPYC 7282 (Rome) processor with 16 cores, in a dual-socket board utilizing 2 x 8 memory channels of DDR4 3200 MHz. The main testing software is llama.cpp with llama-bench.
Benchmark Results
Here are the benchmark results, which are summarized from the tests below.
N | model | parameters | quantization | tokens per second |
---|---|---|---|---|
1 | DeepSeek R1 Distill Llama | 8B | Q4_K_M | 22.17 |
2 | DeepSeek R1 Distill Llama | 70B | Q4_K_M | 2.84 |
3 | Qwen – QwQ-32B | 32B | Q4_K_M | 5.81 |
4 | Llama 3.1 8B Instruct | 8B | Q4_K_M | 22.24 |
5 | Llama 3.3 70B Instruct | 70B | Q4_K_M | 2.83 |
Below the llama-bench outputs from the benchmarks, there are pure memory bandwidth tests for this setup with the mlc tool (Intel® Memory Latency Checker v3.11b) and sysbench memory.
Hardware – what to expect from the AMD EPYC 7282
- 2 x AMD EPYC 7282 – AMD 16-core / 32-thread CPU – an 8-memory-channel processor with a theoretical memory bandwidth of 80 GB/s per socket (according to the official documents from AMD)
- 128G total RAM; all memory channels are populated on both processors, i.e. 2 x 8 channels.
- Supermicro – H11DSU-iN dual socket board with 32 memory slots.
- 16 slots populated with 16 x 8G DDR4 Samsung 3200MHz (M393A1K43DB2-CWE)
- AMD K17 (Zen2) architecture
- CPU dies: 1 per CPU
- A similar dual-socket motherboard setup (with 128G RAM) costs around $1200 ~ $1500 USD on eBay as of Q1 2025.
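A quick back-of-the-envelope check of the hardware figures above. Note that the raw DIMM peak of 16 DDR4-3200 modules is far above the 80 GB/s per socket AMD quotes for the 7282; the CPU itself, not the DIMMs, is the limit here. The numbers below are simple arithmetic, not measurements:

```shell
# DDR4-3200 moves 3200 MT/s x 8 bytes per channel.
per_channel_gbs=$(awk 'BEGIN { printf "%.1f", 3200 * 8 / 1000 }')
# 8 channels per socket, if the CPU could drive them all at full rate.
eight_channel_gbs=$(awk 'BEGIN { printf "%.1f", 3200 * 8 * 8 / 1000 }')
echo "one DDR4-3200 channel:       ${per_channel_gbs} GB/s"
echo "8 channels (raw DIMM peak):  ${eight_channel_gbs} GB/s"
```

So the 80 GB/s per-socket figure from the AMD documents is well below the 204.8 GB/s the populated DIMMs could theoretically deliver.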
Software
All tests are made under Linux – Gentoo Linux.
- Gentoo Linux, everything built with “-march=native”
- Linux kernel – 6.13.3 (gentoo-kernel package)
- GNU GCC – gcc version 14.2.1 20241221
- Glibc – 2.41
Testing
Three main groups of tests are presented here, using llama.cpp for LLM inference.
- Deepseek R1 – Distill Llama 70B and Distill Llama 8B – Q4
- Qwen – QwQ-32B – Q4
- meta-llama – Llama 3.3 70B Instruct and Llama 3.1 8B Instruct – Q4
Testing benchmark with Deepseek R1 Distill Llama-70B
1. Deepseek R1 Distill Llama 70B
The first test uses quantization 4 (Q4_K_M) – the quantization used by ollama by default – with the 70B DeepSeek R1 Distill Llama; the files were downloaded from https://huggingface.co/bartowski/DeepSeek-R1-Distill-Llama-70B-GGUF.
srv ~/llama.cpp/build/bin $ ./llama-bench --numa distribute -m /root/models/bartowski/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf -t 32 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      32 |         tg128 |          2.88 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      32 |         tg256 |          2.87 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      32 |         tg512 |          2.86 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      32 |        tg1024 |          2.82 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      32 |        tg2048 |          2.78 ± 0.00 |

build: 51f311e0 (4753)
The generation speed is 2.8~2.9 tokens per second, which is not fast for a CPU-only setup. It is really questionable whether it is usable.
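These low numbers line up with a simple memory-bandwidth argument. Token generation streams the whole quantized model from RAM once per token, so the measured bandwidth divided by the model size gives a rough upper bound; this is a sketch of that estimate, not part of the benchmark itself, using the ~130 GB/s aggregate read bandwidth measured with mlc further below:

```shell
# Upper-bound estimate: t/s ~= memory read bandwidth / model size.
model_gib=39.59   # DeepSeek R1 Distill Llama 70B Q4_K_M size, from llama-bench
bw_gbs=130        # aggregate read bandwidth of both sockets, from mlc
est_tps=$(awk -v bw="$bw_gbs" -v sz="$model_gib" \
    'BEGIN { printf "%.2f", bw / (sz * 1.073741824) }')   # GiB -> GB
echo "bandwidth-bound upper bound: ${est_tps} tokens/s"
```

The estimate of about 3 tokens per second is close to the measured 2.8~2.9, which suggests the 70B model is memory-bandwidth bound on this machine.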
2. Deepseek R1 Distill Llama 8B
When using Q4_K_M with the 8B DeepSeek R1 Distill Llama, generation is almost 8 times faster than with the 70B model. So roughly 8.8 times fewer model parameters yield roughly 7.9 times faster token generation. 22 tokens per second is fast enough for generating text for daily routines. The file was downloaded from here – huggingface.co.
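The scaling claim can be cross-checked directly from the llama-bench tg128 numbers (70.55B parameters at ~2.88 t/s vs 8.03B parameters at ~22.65 t/s):

```shell
# Parameter-count ratio vs token-generation speed ratio between the two distills.
param_ratio=$(awk 'BEGIN { printf "%.1f", 70.55 / 8.03 }')
speed_ratio=$(awk 'BEGIN { printf "%.1f", 22.65 / 2.88 }')
echo "${param_ratio}x fewer parameters -> ${speed_ratio}x faster generation"
```

The speed-up is slightly below the parameter ratio, which is expected: per-token overhead that does not shrink with model size eats a little of the gain.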
srv ~/llama.cpp/build/bin $ ./llama-bench --numa distribute -m /root/models/bartowski/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -t 32 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      32 |         tg128 |         22.65 ± 0.03 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      32 |         tg256 |         22.58 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      32 |         tg512 |         22.58 ± 0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      32 |        tg1024 |         22.06 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      32 |        tg2048 |         21.00 ± 0.03 |

build: 51f311e0 (4753)
3. The Qwen model QwQ-32B developed by Alibaba Cloud
The Qwen models' GGUFs can be downloaded from https://huggingface.co/Qwen.
srv ~/llama.cpp/build/bin $ ./llama-bench --numa distribute -m /root/models/Qwen/qwq-32b-q4_k_m.gguf -t 32 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      32 |         tg128 |          5.85 ± 0.01 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      32 |         tg256 |          5.82 ± 0.00 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      32 |         tg512 |          5.85 ± 0.00 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      32 |        tg1024 |          5.83 ± 0.00 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      32 |        tg2048 |          5.70 ± 0.00 |

build: 51f311e0 (4753)
5-6 tokens per second for the new QwQ-32B model, so the model is probably usable on this processor for daily routines.
4. Meta Llama 3.3 70B Instruct
The Meta open-source Llama models are widely used; here is the benchmark with Llama 3.3 70B Instruct Q4_K_M.
srv ~/llama.cpp/build/bin $ ./llama-bench --numa distribute -m /root/models/MaziyarPanahi/Llama-3.3-70B-Instruct.Q4_K_M.gguf -t 32 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      32 |         tg128 |          2.87 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      32 |         tg256 |          2.86 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      32 |         tg512 |          2.85 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      32 |        tg1024 |          2.81 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      32 |        tg2048 |          2.77 ± 0.00 |

build: 51f311e0 (4753)
5. Meta Llama 3.1 8B Instruct
A smaller model of the Meta Llama family. The results are around 22 tokens per second, which is fast enough for code generation. The GGUF file was downloaded from here – huggingface.co.
srv ~/llama.cpp/build/bin $ ./llama-bench --numa distribute -m /root/models/bartowski/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -t 32 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      32 |         tg128 |         22.72 ± 0.06 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      32 |         tg256 |         22.67 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      32 |         tg512 |         22.61 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      32 |        tg1024 |         22.11 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      32 |        tg2048 |         21.10 ± 0.01 |

build: 51f311e0 (4753)
Memory bandwidth tests
According to the AMD documentation, the theoretical memory bandwidth is 80 GB/s per socket.
The first tests are with the Intel® Memory Latency Checker v3.11b: the mlc command-line tool manages to read almost 66 GB/s from a single NUMA node, and around 130 GB/s in total for the two CPUs.
srv ~/mlc_v3/Linux $ ./mlc --memory_bandwidth_scan
Intel(R) Memory Latency Checker - v3.11b
Command line parameters: --memory_bandwidth_scan

Running memory bandwidth scan using 32 threads on numa node 0 accessing memory on numa node 0
Reserved 33 1GB pages
Now allocating 33 1GB pages. This may take several minutes..
1GB page allocation completed
Allocating remaining 30809672 KB memory in 4KB pages
Totally 61 GB memory allocated in 4KB+1GB pages on NUMA node 0
Measuring memory bandwidth for each of those 1GB memory regions..

Histogram report of BW in MB/sec across each 1GB region on NUMA node 0
BW_range(MB/sec)    #_of_1GB_regions
----------------    ----------------
[65000-69999]       61

Detailed BW report for each 1GB region allocated as contiguous 1GB page on NUMA node 0
phys_addr     GBaligned_page#   MB/sec
---------     ---------------   ------
0x140000000   5                 66033
0x180000000   6                 66363
0x1c0000000   7                 66177
0x200000000   8                 65846
0x240000000   9                 65674
0x280000000   10                66303
0x2c0000000   11                65940
0x380000000   14                65692
0x4c0000000   19                65691
0x540000000   21                65739
0x6c0000000   27                65741
0x840000000   33                66066
0x880000000   34                66001
0x940000000   37                65762
0xb00000000   44                65871
0xb40000000   45                67632
0xb80000000   46                66186
0xbc0000000   47                65758
0xc00000000   48                65780
0xc40000000   49                65905
0xcc0000000   51                65830
0xd00000000   52                65779
0xd40000000   53                65721
0xd80000000   54                66023
0xdc0000000   55                65996
0xe00000000   56                65558
0xe40000000   57                65855
0xe80000000   58                66048
0xec0000000   59                66050
0xf00000000   60                65896
0xf40000000   61                66353
0xf80000000   62                66171
0xfc0000000   63                65889

Detailed BW report for each 1GB region allocated as 4KB page on NUMA node 0
phys_addr      MB/sec
---------      ------
0x6bdbd4000    65942
0x697ef7000    65964
0x622301000    65943
0x77170a000    65869
0x7ccf13000    65910
0x83e31d000    65958
0x90f326000    65936
0x9a032f000    66037
0xa15b39000    65814
0xa77742000    66043
0x4bdf4c000    65898
0x354355000    66206
0x3e0f5e000    65982
0x5bfb68000    65933
0x727f71000    65915
0x5db77b000    66004
0xc88784000    65942
0x100c38d000   66241
0x404b97000    65653
0x4817a0000    65965
0x51ffaa000    65673
0x5babb3000    65830
0x7013bc000    65843
0x81cbc6000    65786
0x9d0fcf000    65974
0xa413d8000    66212
0xae27e2000    65908
0x18c29000     66060
And just for the record, mlc executed without arguments.
srv ~/mlc_v3/Linux $ ./mlc
Intel(R) Memory Latency Checker - v3.11b
Measuring idle latencies for random access (in ns)...
                Numa node
Numa node            0       1
       0         119.9   261.7
       1         261.7   119.7

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :  131319.2
3:1 Reads-Writes :  132123.5
2:1 Reads-Writes :  134047.2
1:1 Reads-Writes :  143820.7
Stream-triad like:  143737.8

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0       1
       0         65452.7 30597.0
       1         30466.6 65363.4

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency Bandwidth
Delay   (ns)    MB/sec
==========================
 00000  550.75   131339.0
 00002  553.37   131099.9
 00008  550.86   131628.0
 00015  550.84   131525.3
 00050  547.06   131630.8
 00100  517.39   132043.8
 00200  156.93   111832.4
 00300  145.16    78346.4
 00400  140.68    60374.9
 00500  138.92    49102.7
 00700  131.48    35717.7
 01000  129.31    25379.9
 01300  125.52    19746.9
 01700  125.54    15262.5
 02500  126.38    10571.3
 03500  126.99     7708.2
 05000  127.94     5549.0
 09000  128.91     3307.1
 20000  130.10     1758.0

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency        23.1
Local Socket L2->L2 HITM latency        23.3
Remote Socket L2->L2 HITM latency (data address homed in writer socket)
                        Reader Numa Node
Writer Numa Node     0       1
            0        -   278.7
            1    277.6       -
Remote Socket L2->L2 HITM latency (data address homed in reader socket)
                        Reader Numa Node
Writer Numa Node     0       1
            0        -   283.1
            1    284.9       -
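The node-to-node matrices in the mlc output quantify the NUMA penalty directly. A quick calculation on those numbers (65452.7 vs 30597.0 MB/s bandwidth, 119.9 vs 261.7 ns latency):

```shell
# Remote (cross-socket) access compared to local access, from the mlc output.
bw_ratio=$(awk 'BEGIN { printf "%.2f", 30597.0 / 65452.7 }')
lat_ratio=$(awk 'BEGIN { printf "%.2f", 261.7 / 119.9 }')
echo "remote bandwidth is ${bw_ratio}x of local; remote latency is ${lat_ratio}x of local"
```

Remote reads run at less than half the local rate and latency roughly doubles, which is why `--numa distribute` and NUMA-aware placement matter so much on this board.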
The second set of tests uses sysbench: with 64 threads it reaches 162450.34 MiB/sec – the full potential of the dual-processor system. The first run uses all 64 threads and the second uses half of them, i.e. the 32 physical cores. With 32 threads the result is just 90666.83 MiB/sec, which suggests sysbench is not NUMA-aware and did not spread its work evenly across both nodes. Note also that sysbench reads 1KiB blocks, so part of the traffic is likely served from the CPU caches rather than RAM, which would explain why the 64-thread figure can exceed the mlc peak.
srv ~ $ sysbench memory --memory-oper=read --memory-block-size=1K --memory-total-size=2000G --threads=64 run
sysbench 1.0.20 (using system LuaJIT 2.1.1731601260)

Running the test with following options:
Number of threads: 64
Initializing random number generator from current time

Running memory speed test with the following options:
  block size: 1KiB
  total size: 2048000MiB
  operation: read
  scope: global

Initializing worker threads...

Threads started!

Total operations: 1663878927 (166349143.55 per second)

1624881.76 MiB transferred (162450.34 MiB/sec)

General statistics:
    total time:                          10.0011s
    total number of events:              1663878927

Latency (ms):
         min:                                    0.00
         avg:                                    0.00
         max:                                   10.01
         95th percentile:                         0.00
         sum:                                96569.21

Threads fairness:
    events (avg/stddev):           25998108.2344/710279.75
    execution time (avg/stddev):   1.5089/0.06

srv ~ $ sysbench memory --memory-oper=read --memory-block-size=1K --memory-total-size=2000G --threads=32 run
sysbench 1.0.20 (using system LuaJIT 2.1.1731601260)

Running the test with following options:
Number of threads: 32
Initializing random number generator from current time

Running memory speed test with the following options:
  block size: 1KiB
  total size: 2048000MiB
  operation: read
  scope: global

Initializing worker threads...

Threads started!

Total operations: 928597066 (92842833.78 per second)

906833.07 MiB transferred (90666.83 MiB/sec)

General statistics:
    total time:                          10.0006s
    total number of events:              928597066

Latency (ms):
         min:                                    0.00
         avg:                                    0.00
         max:                                    0.70
         95th percentile:                         0.00
         sum:                                42355.25

Threads fairness:
    events (avg/stddev):           29018658.3125/142227.06
    execution time (avg/stddev):   1.3236/0.01

Notes

System cache

Before all tests, the system cache was dropped and automatic NUMA balancing was disabled:

[code lang="bash" highlight="1,2"]
echo 0 > /proc/sys/kernel/numa_balancing
echo 3 > /proc/sys/vm/drop_caches
[/code]
Whether the cache is dropped or not may have a significant impact on the test results.
NUMA and hardware topology
The NUMA configuration is 2 nodes:
srv ~ $ numactl -s
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
cpubind: 0 1
nodebind: 0 1
membind: 0 1
preferred:
The hardware CPU, cores, threads, cache and NUMA topology, as reported by likwid-topology:
srv ~ $ /usr/local/likwid/bin/likwid-topology
--------------------------------------------------------------------------------
CPU name:       AMD EPYC 7282 16-Core Processor
CPU type:       AMD K17 (Zen2) architecture
CPU stepping:   0
********************************************************************************
Hardware Thread Topology
********************************************************************************
Sockets:                2
CPU dies:               2
Cores per socket:       16
Threads per core:       2
--------------------------------------------------------------------------------
HWThread        Thread        Core        Die        Socket        Available
0               0             0           0          0             *
1               0             1           0          0             *
2               0             2           0          0             *
3               0             3           0          0             *
4               0             4           0          0             *
5               0             5           0          0             *
6               0             6           0          0             *
7               0             7           0          0             *
8               0             8           0          0             *
9               0             9           0          0             *
10              0             10          0          0             *
11              0             11          0          0             *
12              0             12          0          0             *
13              0             13          0          0             *
14              0             14          0          0             *
15              0             15          0          0             *
16              0             16          0          1             *
17              0             17          0          1             *
18              0             18          0          1             *
19              0             19          0          1             *
20              0             20          0          1             *
21              0             21          0          1             *
22              0             22          0          1             *
23              0             23          0          1             *
24              0             24          0          1             *
25              0             25          0          1             *
26              0             26          0          1             *
27              0             27          0          1             *
28              0             28          0          1             *
29              0             29          0          1             *
30              0             30          0          1             *
31              0             31          0          1             *
32              1             0           0          0             *
33              1             1           0          0             *
34              1             2           0          0             *
35              1             3           0          0             *
36              1             4           0          0             *
37              1             5           0          0             *
38              1             6           0          0             *
39              1             7           0          0             *
40              1             8           0          0             *
41              1             9           0          0             *
42              1             10          0          0             *
43              1             11          0          0             *
44              1             12          0          0             *
45              1             13          0          0             *
46              1             14          0          0             *
47              1             15          0          0             *
48              1             16          0          1             *
49              1             17          0          1             *
50              1             18          0          1             *
51              1             19          0          1             *
52              1             20          0          1             *
53              1             21          0          1             *
54              1             22          0          1             *
55              1             23          0          1             *
56              1             24          0          1             *
57              1             25          0          1             *
58              1             26          0          1             *
59              1             27          0          1             *
60              1             28          0          1             *
61              1             29          0          1             *
62              1             30          0          1             *
63              1             31          0          1             *
--------------------------------------------------------------------------------
Socket 0:               ( 0 32 1 33 2 34 3 35 4 36 5 37 6 38 7 39 8 40 9 41 10 42 11 43 12 44 13 45 14 46 15 47 )
Socket 1:               ( 16 48 17 49 18 50 19 51 20 52 21 53 22 54 23 55 24 56 25 57 26 58 27 59 28 60 29 61 30 62 31 63 )
--------------------------------------------------------------------------------
********************************************************************************
Cache Topology
********************************************************************************
Level:                  1
Size:                   32 kB
Cache groups:           ( 0 32 ) ( 1 33 ) ( 2 34 ) ( 3 35 ) ( 4 36 ) ( 5 37 ) ( 6 38 ) ( 7 39 ) ( 8 40 ) ( 9 41 ) ( 10 42 ) ( 11 43 ) ( 12 44 ) ( 13 45 ) ( 14 46 ) ( 15 47 ) ( 16 48 ) ( 17 49 ) ( 18 50 ) ( 19 51 ) ( 20 52 ) ( 21 53 ) ( 22 54 ) ( 23 55 ) ( 24 56 ) ( 25 57 ) ( 26 58 ) ( 27 59 ) ( 28 60 ) ( 29 61 ) ( 30 62 ) ( 31 63 )
--------------------------------------------------------------------------------
Level:                  2
Size:                   512 kB
Cache groups:           ( 0 32 ) ( 1 33 ) ( 2 34 ) ( 3 35 ) ( 4 36 ) ( 5 37 ) ( 6 38 ) ( 7 39 ) ( 8 40 ) ( 9 41 ) ( 10 42 ) ( 11 43 ) ( 12 44 ) ( 13 45 ) ( 14 46 ) ( 15 47 ) ( 16 48 ) ( 17 49 ) ( 18 50 ) ( 19 51 ) ( 20 52 ) ( 21 53 ) ( 22 54 ) ( 23 55 ) ( 24 56 ) ( 25 57 ) ( 26 58 ) ( 27 59 ) ( 28 60 ) ( 29 61 ) ( 30 62 ) ( 31 63 )
--------------------------------------------------------------------------------
Level:                  3
Size:                   16 MB
Cache groups:           ( 0 32 1 33 2 34 3 35 ) ( 4 36 5 37 6 38 7 39 ) ( 8 40 9 41 10 42 11 43 ) ( 12 44 13 45 14 46 15 47 ) ( 16 48 17 49 18 50 19 51 ) ( 20 52 21 53 22 54 23 55 ) ( 24 56 25 57 26 58 27 59 ) ( 28 60 29 61 30 62 31 63 )
--------------------------------------------------------------------------------
********************************************************************************
NUMA Topology
********************************************************************************
NUMA domains:           2
--------------------------------------------------------------------------------
Domain:                 0
Processors:             ( 0 32 1 33 2 34 3 35 4 36 5 37 6 38 7 39 8 40 9 41 10 42 11 43 12 44 13 45 14 46 15 47 )
Distances:              10 32
Free memory:            39754.6 MB
Total memory:           64246.9 MB
--------------------------------------------------------------------------------
Domain:                 1
Processors:             ( 16 48 17 49 18 50 19 51 20 52 21 53 22 54 23 55 24 56 25 57 26 58 27 59 28 60 29 61 30 62 31 63 )
Distances:              32 10
Free memory:            39940 MB
Total memory:           64493.9 MB
--------------------------------------------------------------------------------