LLM inference benchmarks with llamacpp and dual Xeon Gold 5317 CPUs

LLMs (large language models) are really popular these days, and many people and organizations have begun to rely on them. This article continues in the spirit of the CPU-only benchmarks in the realm of LLM inference. Check out the other article on the subject with a much better and more expensive processor – LLM inference benchmarks with llamacpp and AMD EPYC 9554 CPU. Several times cheaper are the setup presented in this article and the AMD option here – LLM inference benchmarks with llamacpp and AMD EPYC 7282 CPU.
“Run it yourself”, at home or within a business organization, is always far more secure and privacy-safe than cloud-based AI chat bots/assistants. There are many open source LLMs which hold their own in accuracy (intelligence?) and performance against the big ones such as OpenAI, Google Gemini and more.
The purpose of the article is to show the performance of the 3rd generation Intel Xeon Gold 5317 processor with 12 cores (two of them in a dual-socket board), using 4 memory channels of DDR4 3200 MHz out of the 8 supported per CPU. Note the processor supports 8 channels, but the setup presented here has two processors each with 4 RAM sticks, so each processor is using half of the supported (and theoretical) memory bandwidth. The main testing software is llama.cpp with llama-bench. Using only half of the theoretical memory bandwidth significantly reduces LLM inference performance, probably by about 35-40%.
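To put the channel counts in perspective, here is an illustrative back-of-envelope calculation of the theoretical DDR4-3200 bandwidth for this configuration (my numbers, not measurements from the article):

```python
# Illustrative DDR4 bandwidth math. One DDR4-3200 channel transfers
# 3200 MT/s over a 64-bit (8-byte) bus.
per_channel_gb_s = 3200e6 * 8 / 1e9              # 25.6 GB/s per channel

# This setup: 2 sockets x 4 populated channels (half of the 8 supported per CPU).
half_populated_gb_s = per_channel_gb_s * 4 * 2   # 204.8 GB/s theoretical

# Fully populated: 2 sockets x 8 channels.
full_gb_s = per_channel_gb_s * 8 * 2             # 409.6 GB/s theoretical

print(per_channel_gb_s, half_populated_gb_s, full_gb_s)
```

Real-world sustained bandwidth is well below the theoretical figure; the mlc results later in the article reach about 158 GB/s of the theoretical 204.8 GB/s for this half-populated setup.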

Benchmark Results

Here are the benchmark results, which are summarized from the tests below.

| N | model                         | parameters | quantization | tokens per second |
| - | ----------------------------- | ---------- | ------------ | ----------------: |
| 1 | DeepSeek R1 Distill Llama 8B  | 8B         | Q4_K_M       |             21.69 |
| 2 | DeepSeek R1 Distill Llama 70B | 70B        | Q4_K_M       |              2.91 |
| 3 | Qwen QwQ-32B                  | 32B        | Q4_K_M       |              5.97 |
| 4 | Llama 3.1 8B Instruct         | 8B         | Q4_K_M       |             21.61 |
| 5 | Llama 3.3 70B Instruct        | 70B        | Q4_K_M       |              2.92 |

Note: the server has only half of its memory slots populated, so only half of the available CPU memory channels are used, which greatly impacts memory performance and LLM inference. Fully populating the memory slots may double the results, which remains to be tested.
Below the llama-bench outputs from the benchmarks there are pure memory tests for this setup with the mlc tool (Intel® Memory Latency Checker v3.11b) and sysbench memory.

Hardware – what to expect from the Intel Xeon Gold 5317

  • Xeon Gold 5317 CPU – 12 cores / 24 threads – an 8-memory-channel processor with memory bandwidth around 160 GB/s
  • 128G total RAM; half of the memory channels are utilized per CPU socket – 4 of 8 channels.
  • Asus Z12PG-D16 dual socket board with 16 memory slots.
  • 8 slots with 16G DDR4 Samsung 3200Mhz (M393A2K43EB3-CWE)
  • Intel Icelake SP processor
  • CPU dies: 1 per CPU, total 2
  • A similar dual socket motherboard setup (with 128G RAM) costs on eBay in Quarter 1 2025 – $3000 ~ $3400 USD.

Software

All tests are made under Gentoo Linux.

  • Gentoo Linux, everything built with “-march=native”
  • Linux kernel – 6.14.2 (gentoo-kernel package)
  • GNU GCC – gcc version 14.2.1 20241221
  • Glibc – 2.41

Testing

Three main tests are going to be presented here using llama.cpp for LLM inference.

  1. Deepseek R1 – Distill Llama 70B and Distill Llama 8B – Q4
  2. Qwen – QwQ-32B – Q4
  3. meta-llama – Llama 3.3 70B Instruct and Llama 3.1 8B Instruct – Q4

Testing benchmark with Deepseek R1 Distill Llama-70B

1. Deepseek R1 Distill Llama 70B

The first test uses quantization 4 (Q4_K_M) with the 70B DeepSeek R1 Distill Llama. The files are downloaded from https://huggingface.co/bartowski/DeepSeek-R1-Distill-Llama-70B-GGUF and this quantization is used by ollama by default.

srv ~ $ /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf -t 24 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |         tg128 |          2.94 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |         tg256 |          2.94 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |         tg512 |          2.92 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |        tg1024 |          2.90 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |        tg2048 |          2.86 ± 0.00 |

build: 7ad0779f (4764)

The generation speed is around 2.9 tokens per second, which is not very fast for a CPU-only setup. It is really questionable whether it is usable.
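The 2.9 t/s figure is roughly what a memory-bound model predicts. A crude upper bound, assuming each generated token streams the whole quantized model through memory (a simplification on my part), using the model size from the table above and the mlc peak bandwidth measured later in the article:

```python
# Rough upper bound for CPU token generation, assuming each token reads
# the entire quantized model from RAM once (a simplifying assumption).
model_size_gib = 39.59                        # from the llama-bench table
model_size_gb = model_size_gib * 1.073741824  # GiB -> GB
peak_bw_gb_s = 158.4                          # mlc "ALL Reads" peak on this box
upper_bound_tps = peak_bw_gb_s / model_size_gb
print(round(upper_bound_tps, 1))              # ~3.7 t/s bound vs ~2.9 t/s measured
```

The measured 2.9 t/s is about 80% of this bound, which is plausible given NUMA cross-traffic and compute overhead.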

2. Deepseek R1 Distill Llama 8B

When using Q4_K_M with the 8B DeepSeek R1 Distill Llama, the speed is about 7.5 times higher than with the 70B model. So about 8.8 times fewer model parameters yields roughly 7.5 times faster token generation. Around 21 tokens per second is fast enough for generating text for daily routines. The file was downloaded from here – huggingface.co.
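A quick sanity check of that scaling, using the numbers from the summary table and the llama-bench outputs (generation speed scaling close to, though not exactly with, model size is consistent with memory-bound inference):

```python
# Ratios between the 70B and 8B DeepSeek R1 Distill Llama runs.
params_ratio = 70.55 / 8.03    # ~8.8x fewer parameters in the 8B model
size_ratio = 39.59 / 4.58      # ~8.6x smaller Q4_K_M file
speed_ratio = 21.69 / 2.91     # ~7.5x faster generation (summary-table t/s)
print(round(params_ratio, 1), round(size_ratio, 1), round(speed_ratio, 1))
```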

/root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -t 24 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |         tg128 |         22.40 ± 0.06 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |         tg256 |         22.03 ± 0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |         tg512 |         21.71 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |        tg1024 |         21.46 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |        tg2048 |         20.83 ± 0.03 |

build: 7ad0779f (4764)

3. The Qwen model QwQ-32B developed by Alibaba Cloud

The Qwen models’ GGUFs can be downloaded from https://huggingface.co/Qwen.

srv ~ $ /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/Qwen/qwq-32b-q4_k_m.gguf -t 24 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      24 |         tg128 |          6.08 ± 0.01 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      24 |         tg256 |          6.05 ± 0.00 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      24 |         tg512 |          6.00 ± 0.00 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      24 |        tg1024 |          5.92 ± 0.00 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      24 |        tg2048 |          5.83 ± 0.00 |

build: 7ad0779f (4764)

6 tokens per second for the new QwQ-32B model, so the model is probably usable on this processor for daily routines.

4. Meta Llama 3.3 70B Instruct

The Meta open source model Llama is widely used, and here is the benchmark with Llama 3.3 70B Instruct Q4_K_M.

srv ~ $ /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/MaziyarPanahi/Llama-3.3-70B-Instruct.Q4_K_M.gguf -t 24 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |         tg128 |          2.95 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |         tg256 |          2.95 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |         tg512 |          2.93 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |        tg1024 |          2.91 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |        tg2048 |          2.87 ± 0.00 |

build: 7ad0779f (4764)

5. Meta Llama 3.1 8B Instruct

A smaller model of the Meta Llama family. The results are around 21 tokens per second, which is fast enough for code generation. The GGUF file was downloaded from here – huggingface.co.

srv ~ /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/Meta-Llama/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -t 24 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |         tg128 |         22.35 ± 0.28 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |         tg256 |         21.90 ± 0.04 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |         tg512 |         21.68 ± 0.03 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |        tg1024 |         21.36 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |        tg2048 |         20.76 ± 0.01 |

build: 7ad0779f (4764)

Memory bandwidth tests

The first tests are with Intel® Memory Latency Checker v3.11b. The mlc command-line tool manages to read almost 80 GB/s per NUMA node on this dual-CPU system, but with only half of the memory channels used. So theoretically, if all 16 memory slots are populated, the per-node bandwidth might be 160 GB/s or even more.

srv ~/mlc_v3.11b/Linux # ./mlc --memory_bandwidth_scan
Intel(R) Memory Latency Checker - v3.11b
Command line parameters: --memory_bandwidth_scan 
Running memory bandwidth scan using 24 threads on numa node 0 accessing memory on numa node 0
Reserved 21 1GB pages
Now allocating 21 1GB pages. This may take several minutes..
1GB page allocation completed
Allocating remaining 43237708 KB memory in 4KB pages
Totally 61 GB memory allocated in 4KB+1GB pages on NUMA node 0
Measuring memory bandwidth for each of those 1GB memory regions..

Histogram report of BW in MB/sec across each 1GB region on NUMA node 0
BW_range(MB/sec)        #_of_1GB_regions
----------------        ----------------
[75000-79999]   2
[80000-84999]   59

Detailed BW report for each 1GB region allocated as contiguous 1GB page on NUMA node 0
phys_addr       GBaligned_page# MB/sec
---------       --------------- ------
0x1c0000000     7       79327
0x200000000     8       80433
0x500000000     20      80464
0x540000000     21      80215
0x580000000     22      80361
0x5c0000000     23      80635
0x600000000     24      80686
0x680000000     26      80544
0x6c0000000     27      80375
0x700000000     28      80593
0x740000000     29      80668
0x780000000     30      80542
0xb40000000     45      80514
0xb80000000     46      80381
0xc00000000     48      80293
0xc40000000     49      80650
0xc80000000     50      80420
0xe40000000     57      80304
0xe80000000     58      80534
0xf00000000     60      80354
0xfc0000000     63      80403

Detailed BW report for each 1GB region allocated as 4KB page on NUMA node 0
phys_addr       MB/sec
---------       ------
0x299f16000     80389
0xafacf7000     80792
0x1a7d00000     80648
0x26250a000     80626
0x9fa913000     80797
0x8dd51d000     81028
0x436926000     80554
0x48112f000     80528
0x46c939000     80563
0x1028d42000    80617
0x2b754c000     80575
0xcf3155000     80648
0xded15e000     80552
0x8f0d68000     80638
0x8bf571000     80535
0xa7c97a000     80560
0x67f984000     80603
0x834d8d000     80784
0x934597000     80613
0x127da0000     80631
0x1b79a9000     80476
0x2e39b3000     80277
0x3641bc000     80525
0x3e05c6000     80537
0x7c09cf000     80617
0x816dd8000     80584
0x8c11e2000     80468
0x95c9eb000     80451
0xa57df4000     80546
0xac5dfe000     80285
0xb0f207000     80187
0xbcfa11000     80349
0xd0c21a000     80209
0xd4c623000     80203
0xd94e2d000     80439
0xe14636000     80664
0xed4a40000     80441
0xf60e49000     79915
0xfa0e52000     80563
0x415c3000      80079

And just for the record, mlc executed without arguments. Interestingly, the system-wide peak memory bandwidth is about double that of the per-region scan above – around 160 GB/s – which means the full potential with all 16 memory slots populated might be 320 GB/s, but this could not be tested at this time.

srv ~/mlc_v3/Linux $ ./mlc 
Intel(R) Memory Latency Checker - v3.11b
Measuring idle latencies for sequential access (in ns)...
                Numa node
Numa node            0       1  
       0          70.5   128.4  
       1         130.3    68.9  

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      158393.5
3:1 Reads-Writes :      143656.9
2:1 Reads-Writes :      139743.7
1:1 Reads-Writes :      129235.9
Stream-triad like:      144465.6

Measuring Memory Bandwidths between nodes within system 
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0       1  
       0        79240.5 55020.1 
       1        55056.9 79173.4 

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency Bandwidth
Delay   (ns)    MB/sec
==========================
 00000  175.68   157921.9
 00002  175.95   157958.1
 00008  174.81   157562.1
 00015  172.30   156796.1
 00050  159.36   154094.7
 00100  143.40   149554.2
 00200   96.69    86135.6
 00300   86.93    63026.2
 00400   84.13    48676.2
 00500   82.29    39413.5
 00700   81.87    28633.1
 01000   75.94    20474.2
 01300   74.81    16010.0
 01700   73.58    12492.7
 02500   72.43     8808.4
 03500   71.64     6566.8
 05000   70.76     4881.5
 09000   69.94     3127.6
 20000   69.40     1918.9

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency        49.0
Local Socket L2->L2 HITM latency        49.3
Remote Socket L2->L2 HITM latency (data address homed in writer socket)
                        Reader Numa Node
Writer Numa Node     0       1  
            0        -   114.0  
            1    114.9       -  
Remote Socket L2->L2 HITM latency (data address homed in reader socket)
                        Reader Numa Node
Writer Numa Node     0       1  
            0        -   117.3  
            1    117.7       -

The second set of tests is with sysbench: with 48 threads it reaches 165331.57 MiB/sec – probably half of the potential of this dual-processor setup. The first run uses 48 threads and the second uses half of them, i.e. the 24 CPU cores. The second run reaches just 110730.53 MiB/sec, which suggests sysbench did not use the full potential, probably because of NUMA placement.
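Comparing the tools needs a unit check, since sysbench reports MiB/s (2^20 bytes) while mlc reports MB/s (10^6 bytes). A small conversion of the figures just mentioned (my interpretation: with 1 KiB blocks some sysbench reads may be served from cache, which would explain it slightly exceeding the mlc peak):

```python
# Convert both tools' peak read figures to GB/s for an apples-to-apples look.
sysbench_mib_s = 165331.57                        # sysbench, 48 threads
sysbench_gb_s = sysbench_mib_s * 1024**2 / 1e9    # ~173.4 GB/s
mlc_peak_gb_s = 158393.5 / 1000                   # ~158.4 GB/s ("ALL Reads")
print(round(sysbench_gb_s, 1), round(mlc_peak_gb_s, 1))
```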

srv ~ $ sysbench memory --memory-oper=read --memory-block-size=1K --memory-total-size=2000G --threads=48 run
sysbench 1.0.20 (using system LuaJIT 2.1.1731601260)

Running the test with following options:
Number of threads: 48
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 1KiB
  total size: 2048000MiB
  operation: read
  scope: global

Initializing worker threads...

Threads started!

Total operations: 1693441719 (169299529.58 per second)

1653751.68 MiB transferred (165331.57 MiB/sec)


General statistics:
    total time:                          10.0010s
    total number of events:              1693441719

Latency (ms):
         min:                                    0.00
         avg:                                    0.00
         max:                                    6.68
         95th percentile:                        0.00
         sum:                               102259.60

Threads fairness:
    events (avg/stddev):           35280035.8125/548392.04
    execution time (avg/stddev):   2.1304/0.05

srv ~ $ sysbench memory --memory-oper=read --memory-block-size=1K --memory-total-size=2000G --threads=24 run
sysbench 1.0.20 (using system LuaJIT 2.1.1731601260)

Running the test with following options:
Number of threads: 24
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 1KiB
  total size: 2048000MiB
  operation: read
  scope: global

Initializing worker threads...

Threads started!

Total operations: 1134124285 (113388063.53 per second)

1107543.25 MiB transferred (110730.53 MiB/sec)


General statistics:
    total time:                          10.0005s
    total number of events:              1134124285

Latency (ms):
         min:                                    0.00
         avg:                                    0.00
         max:                                    0.06
         95th percentile:                        0.00
         sum:                                46845.02

Threads fairness:
    events (avg/stddev):           47255178.5417/323670.79
    execution time (avg/stddev):   1.9519/0.01

Notes

System cache

Before all tests the system cache was dropped and automatic NUMA balancing was disabled:

echo 0 > /proc/sys/kernel/numa_balancing 
echo 3 > /proc/sys/vm/drop_caches 

Dropping (or not dropping) the cache may have a significant impact on the test results.

NUMA and hardware topology

The NUMA configuration is 2 nodes:

srv ~ $ numactl -s
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 
cpubind: 0 1 
nodebind: 0 1 
membind: 0 1 
preferred:

The hardware CPU, cores, threads, cache and NUMA topology with likwid-topology

srv ~ $ /usr/local/likwid/bin/likwid-topology 
--------------------------------------------------------------------------------
CPU name:       Intel(R) Xeon(R) Gold 5317 CPU @ 3.00GHz
CPU type:       Intel Icelake SP processor
CPU stepping:   6
********************************************************************************
Hardware Thread Topology
********************************************************************************
Sockets:                2
CPU dies:               2
Cores per socket:       12
Threads per core:       2
--------------------------------------------------------------------------------
HWThread        Thread        Core        Die        Socket        Available
0               0             0           0          0             *                
1               0             1           0          0             *                
2               0             2           0          0             *                
3               0             3           0          0             *                
4               0             4           0          0             *                
5               0             5           0          0             *                
6               0             6           0          0             *                
7               0             7           0          0             *                
8               0             8           0          0             *                
9               0             9           0          0             *                
10              0             10          0          0             *                
11              0             11          0          0             *                
12              0             12          0          1             *                
13              0             13          0          1             *                
14              0             14          0          1             *                
15              0             15          0          1             *                
16              0             16          0          1             *                
17              0             17          0          1             *                
18              0             18          0          1             *                
19              0             19          0          1             *                
20              0             20          0          1             *                
21              0             21          0          1             *                
22              0             22          0          1             *                
23              0             23          0          1             *                
24              1             0           0          0             *                
25              1             1           0          0             *                
26              1             2           0          0             *                
27              1             3           0          0             *                
28              1             4           0          0             *                
29              1             5           0          0             *                
30              1             6           0          0             *                
31              1             7           0          0             *                
32              1             8           0          0             *                
33              1             9           0          0             *                
34              1             10          0          0             *                
35              1             11          0          0             *                
36              1             12          0          1             *                
37              1             13          0          1             *                
38              1             14          0          1             *                
39              1             15          0          1             *                
40              1             16          0          1             *                
41              1             17          0          1             *                
42              1             18          0          1             *                
43              1             19          0          1             *                
44              1             20          0          1             *                
45              1             21          0          1             *                
46              1             22          0          1             *                
47              1             23          0          1             *                
--------------------------------------------------------------------------------
Socket 0:               ( 0 24 1 25 2 26 3 27 4 28 5 29 6 30 7 31 8 32 9 33 10 34 11 35 )
Socket 1:               ( 12 36 13 37 14 38 15 39 16 40 17 41 18 42 19 43 20 44 21 45 22 46 23 47 )
--------------------------------------------------------------------------------
********************************************************************************
Cache Topology
********************************************************************************
Level:                  1
Size:                   48 kB
Cache groups:           ( 0 24 ) ( 1 25 ) ( 2 26 ) ( 3 27 ) ( 4 28 ) ( 5 29 ) ( 6 30 ) ( 7 31 ) ( 8 32 ) ( 9 33 ) ( 10 34 ) ( 11 35 ) ( 12 36 ) ( 13 37 ) ( 14 38 ) ( 15 39 ) ( 16 40 ) ( 17 41 ) ( 18 42 ) ( 19 43 ) ( 20 44 ) ( 21 45 ) ( 22 46 ) ( 23 47 )
--------------------------------------------------------------------------------
Level:                  2
Size:                   1.25 MB
Cache groups:           ( 0 24 ) ( 1 25 ) ( 2 26 ) ( 3 27 ) ( 4 28 ) ( 5 29 ) ( 6 30 ) ( 7 31 ) ( 8 32 ) ( 9 33 ) ( 10 34 ) ( 11 35 ) ( 12 36 ) ( 13 37 ) ( 14 38 ) ( 15 39 ) ( 16 40 ) ( 17 41 ) ( 18 42 ) ( 19 43 ) ( 20 44 ) ( 21 45 ) ( 22 46 ) ( 23 47 )
--------------------------------------------------------------------------------
Level:                  3
Size:                   18 MB
Cache groups:           ( 0 24 1 25 2 26 3 27 4 28 5 29 6 30 7 31 8 32 9 33 10 34 11 35 ) ( 12 36 13 37 14 38 15 39 16 40 17 41 18 42 19 43 20 44 21 45 22 46 23 47 )
--------------------------------------------------------------------------------
********************************************************************************
NUMA Topology
********************************************************************************
NUMA domains:           2
--------------------------------------------------------------------------------
Domain:                 0
Processors:             ( 0 24 1 25 2 26 3 27 4 28 5 29 6 30 7 31 8 32 9 33 10 34 11 35 )
Distances:              10 20
Free memory:            63490.4 MB
Total memory:           64032.1 MB
--------------------------------------------------------------------------------
Domain:                 1
Processors:             ( 12 36 13 37 14 38 15 39 16 40 17 41 18 42 19 43 20 44 21 45 22 46 23 47 )
Distances:              20 10
Free memory:            63864.1 MB
Total memory:           64495.8 MB
--------------------------------------------------------------------------------
