LLM inference benchmarks with llamacpp and Xeon Gold 6312U CPU

LLMs, or large language models, are very popular these days, and many people and organizations are beginning to rely on them. This article continues the series of CPU-only benchmarks in the realm of LLM inference. Check out the other article on the subject with a much faster and more expensive processor – LLM inference benchmarks with llamacpp and AMD EPYC 9554 CPU. The setup presented in this article is several times cheaper, as is the AMD option here – LLM inference benchmarks with llamacpp and AMD EPYC 7282 CPU.
“Run it yourself”, at home or within a business organization, is always a more secure and privacy-safe approach than cloud-based AI chat bots and assistants. There are many open source LLMs that hold their own on accuracy (intelligence?) and performance against the big ones such as OpenAI, Google Gemini and more.
The purpose of the article is to show the performance of the 3rd generation Intel Xeon Scalable processor Xeon Gold 6312U with 24 cores in a single socket board using 8 memory channels of DDR4 3200 MHz. The main testing software is llama.cpp with llama-bench.

Benchmark Results

Here are the benchmark results, which are summarized from the tests below.

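| N | model                  | parameters | quantization | tokens per second |
| - | ---------------------- | ---------: | ------------ | ----------------: |
| 1 | DeepSeek R1 Llama      |         8B | Q4_K_M       |             21.25 |
| 2 | DeepSeek R1 Llama      |        70B | Q4_K_M       |              2.74 |
| 3 | Qwen – QwQ-32B         |        32B | Q4_K_M       |              5.67 |
| 4 | Llama 3.1 8B Instruct  |         8B | Q4_K_M       |             21.20 |
| 5 | Llama 3.3 70B Instruct |        70B | Q4_K_M       |              2.74 |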
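
The summary figures appear to be the arithmetic mean of the five llama-bench text generation runs (tg128 to tg2048) of each model. A minimal Python sketch of that averaging follows. The t/s values are copied by hand from the llama-bench outputs in this article, and the names RESULTS and summarize are only illustrative.

```python
from statistics import mean

# t/s values copied from the llama-bench outputs in this article,
# in the order tg128, tg256, tg512, tg1024, tg2048.
RESULTS = {
    "DeepSeek R1 Distill Llama 8B": [21.84, 21.67, 21.38, 20.99, 20.39],
    "DeepSeek R1 Distill Llama 70B": [2.77, 2.76, 2.74, 2.72, 2.69],
    "Qwen QwQ-32B": [5.76, 5.74, 5.70, 5.62, 5.52],
    "Llama 3.1 8B Instruct": [21.80, 21.62, 21.32, 20.92, 20.35],
    "Llama 3.3 70B Instruct": [2.78, 2.77, 2.75, 2.73, 2.69],
}


def summarize(results):
    """Return the mean tokens per second per model, rounded to 2 decimals."""
    return {model: round(mean(values), 2) for model, values in results.items()}


if __name__ == "__main__":
    for model, tps in summarize(RESULTS).items():
        print(f"{model}: {tps:.2f} tokens per second")
```

Averaging over all generation lengths slightly understates the short-answer speed, because the t/s value drops a little as the generated sequence gets longer.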

After the llama-bench outputs below, there are pure memory tests for this setup with the mlc tool (Intel® Memory Latency Checker v3.11b) and sysbench memory.

Hardware – what to expect from the Intel Xeon Gold 6312U

  • Intel Xeon Gold 6312U – 24 cores / 48 threads CPU – an 8 memory channel processor with a measured memory bandwidth of around 170 GB/s
  • 128G RAM in total, all memory channels are utilized – 1 x 8 channels.
  • Supermicro – SPC621D8U-2T single socket board with 8 memory slots.
  • 8 slots with 8 x 16G DDR4 Samsung 3200 MHz (NT16GA72D8PFX3K-JR)
  • Intel Icelake SP architecture
  • CPU dies: 1 per CPU
  • With a similar single socket motherboard (and 128G RAM) the price on eBay in Q1 2025 was $1800 to $2000 USD.
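
To put the roughly 170 GB/s in context, the theoretical peak of the memory subsystem can be estimated from the channel count and the transfer rate. The sketch below assumes a 64-bit data bus per DDR4 channel (ECC bits carry no payload) and treats "3200 MHz" as 3200 mega-transfers per second. The function name is only illustrative, and the measured value is the "ALL Reads" figure from the mlc run in this article.

```python
def ddr_peak_bandwidth_gb_s(channels, mega_transfers_per_s, data_bus_bits=64):
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = data_bus_bits // 8  # 64 data bits -> 8 bytes
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


if __name__ == "__main__":
    peak = ddr_peak_bandwidth_gb_s(channels=8, mega_transfers_per_s=3200)
    measured = 173.4648  # GB/s, "ALL Reads" from the mlc output in this article
    print(f"theoretical peak: {peak:.1f} GB/s")
    print(f"measured / peak:  {measured / peak:.0%}")
```

Under these assumptions the peak is 204.8 GB/s, so the measured read bandwidth is roughly 85% of the theoretical maximum.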

Software

All tests were run under Gentoo Linux.

  • Gentoo Linux, everything built with “-march=native”
  • Linux kernel – 6.13.5 (gentoo-kernel package)
  • GNU GCC – gcc version 14.2.1 20241221
  • Glibc – 2.41

Testing

Five tests in three groups are presented here, all using llama.cpp for LLM inference.

  1. Deepseek R1 – Distill Llama 70B and Distill Llama 8B – Q4
  2. Qwen – QwQ-32B – Q4
  3. meta-llama – Llama 3.3 70B Instruct and Llama 3.1 8B Instruct – Q4

Testing benchmark with Deepseek R1 Distill Llama-70B

1. Deepseek R1 Distill Llama 70B

The first test uses 4-bit quantization (Q4_K_M) with the 70B DeepSeek R1 Distill Llama. The files were downloaded from https://huggingface.co/bartowski/DeepSeek-R1-Distill-Llama-70B-GGUF. Q4_K_M is also the quantization used in ollama by default.

srv ~ $ /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf -t 24 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |         tg128 |          2.77 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |         tg256 |          2.76 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |         tg512 |          2.74 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |        tg1024 |          2.72 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |        tg2048 |          2.69 ± 0.00 |

build: dfd6b2c0 (4818)

The generation speed is about 2.7 tokens per second, which is not very fast for a CPU-only setup. It is really questionable whether it is usable.
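
A rough way to judge this number: token generation on a CPU is usually limited by memory bandwidth, since producing each token requires reading approximately all model weights from RAM once. Under that simplifying assumption (it ignores the KV cache, compute time and CPU caches), an upper bound for the generation speed can be sketched as follows. The bandwidth is the "ALL Reads" value from the mlc run in this article, the model size is the size column of llama-bench, and the function name is only illustrative.

```python
GIB = 2**30  # llama-bench reports the model size in GiB


def max_tokens_per_second(bandwidth_gb_s, model_size_gib):
    """Rough upper bound: every token streams all weights from RAM once."""
    return bandwidth_gb_s * 1e9 / (model_size_gib * GIB)


if __name__ == "__main__":
    bound = max_tokens_per_second(bandwidth_gb_s=173.4648, model_size_gib=39.59)
    measured = 2.74  # averaged t/s of the 70B Q4_K_M model
    print(f"bandwidth bound: {bound:.2f} tokens per second")
    print(f"measured:        {measured:.2f} tokens per second "
          f"({measured / bound:.0%} of the bound)")
```

With these inputs the bound is about 4 tokens per second, so the measured 2.7 tokens per second is in the expected range for this memory subsystem and no tuning is likely to multiply it.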

2. Deepseek R1 Distill Llama 8B

With the same quantization (Q4_K_M), the 8B DeepSeek R1 Distill Llama is more than 7 times faster than the 70B. So a model with about 8.8 times fewer parameters generates tokens about 7.8 times faster. 21 tokens per second is fast enough for generating text in daily routines. The file was downloaded from here – huggingface.co.

srv ~ $ /root/llama.cpp/build/bin/llama-bench -m /root/models/bartowski/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -t 24 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |         tg128 |         21.84 ± 0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |         tg256 |         21.67 ± 0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |         tg512 |         21.38 ± 0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |        tg1024 |         20.99 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |        tg2048 |         20.39 ± 0.00 |

build: dfd6b2c0 (4818)
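
The comparison between the 8B and the 70B model can be checked directly against the llama-bench columns. A short sketch, using the params and size columns from the outputs above and the averaged t/s values from the summary table; the helper name is only illustrative.

```python
def times_larger(large, small):
    """Return how many times larger `large` is than `small`."""
    return large / small


if __name__ == "__main__":
    # Values copied from this article: params (B), size (GiB), averaged t/s.
    print(f"parameters: {times_larger(70.55, 8.03):.1f}x fewer in the 8B model")
    print(f"file size:  {times_larger(39.59, 4.58):.1f}x smaller")
    print(f"speed:      {times_larger(21.25, 2.74):.1f}x faster")
```

The speed-up stays a little below the size ratio, so the smaller model does not scale perfectly with the amount of data read per token.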

3. The Qwen model QwQ-32B developed by Alibaba Cloud

The Qwen models’ GGUF files can be downloaded from https://huggingface.co/Qwen.

srv ~ $ /root/llama.cpp/build/bin/llama-bench -m /root/models/Qwen/qwq-32b-q4_k_m.gguf -t 24 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      24 |         tg128 |          5.76 ± 0.00 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      24 |         tg256 |          5.74 ± 0.00 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      24 |         tg512 |          5.70 ± 0.00 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      24 |        tg1024 |          5.62 ± 0.00 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      24 |        tg2048 |          5.52 ± 0.00 |

build: dfd6b2c0 (4818)

The new QwQ-32B model generates 5-6 tokens per second, so the model is probably usable on this processor for daily routines.

4. Meta Llama 3.3 70B Instruct

Meta’s open source Llama model is widely used, and here is the benchmark with Llama 3.3 70B Instruct Q4_K_M.

srv ~ $ /root/llama.cpp/build/bin/llama-bench -m /root/models/MaziyarPanahi/Llama-3.3-70B-Instruct.Q4_K_M.gguf  -t 24 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |         tg128 |          2.78 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |         tg256 |          2.77 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |         tg512 |          2.75 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |        tg1024 |          2.73 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |        tg2048 |          2.69 ± 0.00 |

build: dfd6b2c0 (4818)

5. Meta Llama 3.1 8B Instruct

A smaller model of the Meta Llama family. The results are around 21 tokens per second, which is fast enough for code generation. The GGUF file was downloaded from here – huggingface.co.

srv ~ $ /root/llama.cpp/build/bin/llama-bench -m /root/models/Meta-Llama/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -t 24 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |         tg128 |         21.80 ± 0.03 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |         tg256 |         21.62 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |         tg512 |         21.32 ± 0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |        tg1024 |         20.92 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |        tg2048 |         20.35 ± 0.00 |

build: dfd6b2c0 (4818)
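
llama-bench prints its results as a pipe-separated table, which makes them easy to post-process. The following is a minimal sketch of a parser for that output, assuming the column layout shown in the runs above (a header row, a separator row, and a "t/s" column in the form "value ± stddev"). The function name parse_llama_bench is hypothetical and not part of llama.cpp.

```python
SAMPLE = """\
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |         tg128 |         21.80 ± 0.03 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |         tg256 |         21.62 ± 0.01 |

build: dfd6b2c0 (4818)
"""


def parse_llama_bench(text):
    """Return a list of (test, tokens_per_second, stddev) tuples."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the prompt, blank lines and the build line
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first table row holds the column names
            continue
        if set(cells[0]) <= set("-: "):
            continue  # separator row made of dashes and colons
        row = dict(zip(header, cells))
        value, _, stddev = row["t/s"].partition("±")
        rows.append((row["test"], float(value), float(stddev)))
    return rows


if __name__ == "__main__":
    for test, tps, stddev in parse_llama_bench(SAMPLE):
        print(f"{test}: {tps:.2f} t/s (stddev {stddev:.2f})")
```

Recent llama-bench builds can also emit machine-readable output formats, which may be a more robust choice than parsing the table when the option is available in the build at hand.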

Memory bandwidth tests

The first tests use the Intel® Memory Latency Checker v3.11b. The results show that the mlc command-line tool manages to read between 173 GB/s and 177 GB/s per 1GB region on this single CPU.

srv ~/mlc_v3/Linux # ./mlc --memory_bandwidth_scan
Intel(R) Memory Latency Checker - v3.11b
Command line parameters: --memory_bandwidth_scan 
Running memory bandwidth scan using 48 threads on numa node 0 accessing memory on numa node 0
Reserved 54 1GB pages
Now allocating 54 1GB pages. This may take several minutes..
1GB page allocation completed
Allocating remaining 74230536 KB memory in 4KB pages
Totally 122 GB memory allocated in 4KB+1GB pages on NUMA node 0
Measuring memory bandwidth for each of those 1GB memory regions..

Histogram report of BW in MB/sec across each 1GB region on NUMA node 0
BW_range(MB/sec)        #_of_1GB_regions
----------------        ----------------
[170000-174999] 3
[175000-179999] 119

Detailed BW report for each 1GB region allocated as contiguous 1GB page on NUMA node 0
phys_addr       GBaligned_page# MB/sec
---------       --------------- ------
0x440000000     17      173315
0x600000000     24      175668
0x640000000     25      175367
0x680000000     26      176248
0x6c0000000     27      175893
0x700000000     28      175804
0x740000000     29      175577
0x780000000     30      175870
0x7c0000000     31      175855
0x800000000     32      175587
0x840000000     33      175792
0x880000000     34      175772
0x8c0000000     35      175392
0x900000000     36      175680
0x940000000     37      175818
0x980000000     38      175693
0x9c0000000     39      175925
0xa00000000     40      175519
0xa40000000     41      175834
0xa80000000     42      175712
0xac0000000     43      175803
0xb40000000     45      176032
0xb80000000     46      175351
0xbc0000000     47      175613
0xc00000000     48      175305
0xc40000000     49      175619
0xcc0000000     51      175987
0xd00000000     52      176226
0xd80000000     54      175636
0xdc0000000     55      176033
0xe00000000     56      176087
0xe40000000     57      175848
0xe80000000     58      176803
0xf00000000     60      175517
0xf40000000     61      176320
0x1080000000    66      175739
0x1100000000    68      175487
0x11c0000000    71      175803
0x1280000000    74      175875
0x12c0000000    75      175990
0x1300000000    76      176194
0x1340000000    77      176459
0x1400000000    80      175514
0x1440000000    81      175833
0x1480000000    82      175455
0x1500000000    84      175898
0x1540000000    85      175807
0x1580000000    86      175335
0x15c0000000    87      175461
0x1640000000    89      175563
0x1680000000    90      175595
0x1a40000000    105     175343
0x1dc0000000    119     175726
0x1e80000000    122     175324

Detailed BW report for each 1GB region allocated as 4KB page on NUMA node 0
phys_addr       MB/sec
---------       ------
0xb0e7cd000     174968
0x1fffc6c000    174787
0x4d8995000     175322
0x2a791f000     175799
0x17dea8000     175319
0x4d7ab2000     176422
0x1ae3dbb000    177073
0x245dc4000     176582
0x1ac41ce000    176018
0x5af9d7000     175776
0x19201e1000    176089
0x18979ea000    175826
0x185f1f3000    175665
0x19561fd000    175805
0x1016e06000    176435
0x17fbe0f000    175859
0x17daa19000    176015
0x1761222000    175788
0x43fa2c000     175556
0xb10635000     176373
0x13e363e000    176616
0x3e9a48000     175985
0x575651000     176050
0x19d765b000    176468
0x195ca64000    176742
0x1ab426d000    175656
0x53c277000     175751
0xd56680000     176120
0x522a89000     176069
0x16d4693000    176231
0x58be9c000     176467
0x163e11c000    175962
0x279125000     176164
0x5cb22a000     175479
0x3b3235000     176027
0x59fe3e000     175802
0x13ab2c6000    176248
0xfb5acf000     175765
0x10ce6d8000    175744
0x1068db8000    176166
0x101b5c1000    176211
0x1fe19cb000    175696
0x1f399d4000    176518
0x1eca5dd000    176649
0x1da81e7000    176601
0x1c461f0000    176032
0x1c645f9000    176610
0x1d1ca03000    176076
0x1c9e20c000    176076
0x1bb6216000    176309
0x2c021f000     176642
0x496228000     176056
0xc82a32000     175401
0xf9ea3b000     175818
0x1045245000    175947
0x1195a8d000    175108
0x122a696000    175826
0x14cbaa0000    175953
0x16c56a9000    175988
0x178b2b2000    176711
0x19076bc000    175807
0x1ac12c5000    175802
0x1bc9acf000    175786
0x1c1c6d8000    175408
0x1d072e1000    175594
0x1e1f2eb000    175803
0x1f136f4000    175692
0x1f64afe000    175488


And just for the record, mlc executed without arguments.

srv ~/mlc_v3/Linux $ ./mlc 
Intel(R) Memory Latency Checker - v3.11b
Measuring idle latencies for sequential access (in ns)...
                Numa node
Numa node            0
       0         134.7

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      173464.8
3:1 Reads-Writes :      155992.6
2:1 Reads-Writes :      152662.5
1:1 Reads-Writes :      128919.2
Stream-triad like:      160139.3

Measuring Memory Bandwidths between nodes within system 
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0
       0        173251.7

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency Bandwidth
Delay   (ns)    MB/sec
==========================
 00000  189.48   172439.0
 00002  189.91   172130.1
 00008  188.10   170764.5
 00015  182.81   169457.8
 00050  165.67   161445.8
 00100  134.49   141202.0
 00200   99.04    75465.1
 00300   93.10    56414.2
 00400   88.26    44538.6
 00500   84.69    36080.3
 00700   81.36    26254.1
 01000   78.87    18758.2
 01300   77.71    14681.6
 01700   76.24    11466.5
 02500   73.84     8114.4
 03500   72.84     6065.9
 05000   71.98     4525.1
 09000   70.25     2934.0
 20000   69.19     1836.2

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency        50.6
Local Socket L2->L2 HITM latency        51.3

The second set of tests uses sysbench. The first run uses all 48 threads and reaches 195532.90 MiB/sec, the highest figure measured on this single processor system. The second run uses half of them, i.e. 24 threads, the number of CPU cores, and reaches just 133834.17 MiB/sec. As the system has a single NUMA node, the gap suggests that sysbench needs both hardware threads of each core to reach its full potential, rather than pointing to a NUMA effect.

srv ~ $ sysbench memory --memory-oper=read --memory-block-size=1K --memory-total-size=2000G --threads=48 run
sysbench 1.0.20 (using system LuaJIT 2.1.1731601260)

Running the test with following options:
Number of threads: 48
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 1KiB
  total size: 2048000MiB
  operation: read
  scope: global

Initializing worker threads...

Threads started!

Total operations: 2003243250 (200225686.14 per second)

1956292.24 MiB transferred (195532.90 MiB/sec)


General statistics:
    total time:                          10.0032s
    total number of events:              2003243250

Latency (ms):
         min:                                    0.00
         avg:                                    0.00
         max:                                    4.36
         95th percentile:                        0.00
         sum:                               134189.68

Threads fairness:
    events (avg/stddev):           41734234.3750/535217.67
    execution time (avg/stddev):   2.7956/0.02

srv ~ $ sysbench memory --memory-oper=read --memory-block-size=1K --memory-total-size=2000G --threads=24 run
sysbench 1.0.20 (using system LuaJIT 2.1.1731601260)

Running the test with following options:
Number of threads: 24
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 1KiB
  total size: 2048000MiB
  operation: read
  scope: global

Initializing worker threads...

Threads started!

Total operations: 1371023591 (137046191.81 per second)

1338890.23 MiB transferred (133834.17 MiB/sec)


General statistics:
    total time:                          10.0024s
    total number of events:              1371023591

Latency (ms):
         min:                                    0.00
         avg:                                    0.00
         max:                                    0.05
         95th percentile:                        0.00
         sum:                                62612.40

Threads fairness:
    events (avg/stddev):           57125982.9583/395910.15
    execution time (avg/stddev):   2.6088/0.01
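
sysbench reports MiB/sec (2^20 bytes), while mlc reports MB/sec (10^6 bytes), so the figures need a unit conversion before they can be compared. A minimal sketch with the two sysbench results from above; the function name is only illustrative.

```python
MIB = 2**20  # bytes in one MiB


def mib_s_to_gb_s(mib_per_s):
    """Convert MiB/s (2**20 bytes) to GB/s (10**9 bytes)."""
    return mib_per_s * MIB / 1e9


if __name__ == "__main__":
    for threads, mib_s in ((48, 195532.90), (24, 133834.17)):
        print(f"{threads} threads: {mib_s_to_gb_s(mib_s):.2f} GB/s")
```

The 48-thread result converts to about 205 GB/s, which is above the roughly 173 GB/s that mlc measures. With a block size of only 1K, a likely explanation is that part of the sysbench reads are served from the CPU caches rather than from RAM, so the two tools are not measuring quite the same thing.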

Notes

System cache

Before all tests, automatic NUMA balancing was disabled and the system cache was dropped:

echo 0 > /proc/sys/kernel/numa_balancing 
echo 3 > /proc/sys/vm/drop_caches 

Whether the cache is dropped or not may have a significant impact on the test results.

NUMA and hardware topology

The NUMA configuration is a single node:

srv ~ $ numactl -s
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 
cpubind: 0 
nodebind: 0 
membind: 0 
preferred:

The hardware CPU, cores, threads, cache and NUMA topology with likwid-topology:

srv ~ $ /usr/local/likwid/bin/likwid-topology 
--------------------------------------------------------------------------------
CPU name:       Intel(R) Xeon(R) Gold 6312U CPU @ 2.40GHz
CPU type:       Intel Icelake SP processor
CPU stepping:   6
********************************************************************************
Hardware Thread Topology
********************************************************************************
Sockets:                1
CPU dies:               1
Cores per socket:       24
Threads per core:       2
--------------------------------------------------------------------------------
HWThread        Thread        Core        Die        Socket        Available
0               0             0           0          0             *                
1               0             1           0          0             *                
2               0             2           0          0             *                
3               0             3           0          0             *                
4               0             4           0          0             *                
5               0             5           0          0             *                
6               0             6           0          0             *                
7               0             7           0          0             *                
8               0             8           0          0             *                
9               0             9           0          0             *                
10              0             10          0          0             *                
11              0             11          0          0             *                
12              0             12          0          0             *                
13              0             13          0          0             *                
14              0             14          0          0             *                
15              0             15          0          0             *                
16              0             16          0          0             *                
17              0             17          0          0             *                
18              0             18          0          0             *                
19              0             19          0          0             *                
20              0             20          0          0             *                
21              0             21          0          0             *                
22              0             22          0          0             *                
23              0             23          0          0             *                
24              1             0           0          0             *                
25              1             1           0          0             *                
26              1             2           0          0             *                
27              1             3           0          0             *                
28              1             4           0          0             *                
29              1             5           0          0             *                
30              1             6           0          0             *                
31              1             7           0          0             *                
32              1             8           0          0             *                
33              1             9           0          0             *                
34              1             10          0          0             *                
35              1             11          0          0             *                
36              1             12          0          0             *                
37              1             13          0          0             *                
38              1             14          0          0             *                
39              1             15          0          0             *                
40              1             16          0          0             *                
41              1             17          0          0             *                
42              1             18          0          0             *                
43              1             19          0          0             *                
44              1             20          0          0             *                
45              1             21          0          0             *                
46              1             22          0          0             *                
47              1             23          0          0             *                
--------------------------------------------------------------------------------
Socket 0:               ( 0 24 1 25 2 26 3 27 4 28 5 29 6 30 7 31 8 32 9 33 10 34 11 35 12 36 13 37 14 38 15 39 16 40 17 41 18 42 19 43 20 44 21 45 22 46 23 47 )
--------------------------------------------------------------------------------
********************************************************************************
Cache Topology
********************************************************************************
Level:                  1
Size:                   48 kB
Cache groups:           ( 0 24 ) ( 1 25 ) ( 2 26 ) ( 3 27 ) ( 4 28 ) ( 5 29 ) ( 6 30 ) ( 7 31 ) ( 8 32 ) ( 9 33 ) ( 10 34 ) ( 11 35 ) ( 12 36 ) ( 13 37 ) ( 14 38 ) ( 15 39 ) ( 16 40 ) ( 17 41 ) ( 18 42 ) ( 19 43 ) ( 20 44 ) ( 21 45 ) ( 22 46 ) ( 23 47 )
--------------------------------------------------------------------------------
Level:                  2
Size:                   1.25 MB
Cache groups:           ( 0 24 ) ( 1 25 ) ( 2 26 ) ( 3 27 ) ( 4 28 ) ( 5 29 ) ( 6 30 ) ( 7 31 ) ( 8 32 ) ( 9 33 ) ( 10 34 ) ( 11 35 ) ( 12 36 ) ( 13 37 ) ( 14 38 ) ( 15 39 ) ( 16 40 ) ( 17 41 ) ( 18 42 ) ( 19 43 ) ( 20 44 ) ( 21 45 ) ( 22 46 ) ( 23 47 )
--------------------------------------------------------------------------------
Level:                  3
Size:                   36 MB
Cache groups:           ( 0 24 1 25 2 26 3 27 4 28 5 29 6 30 7 31 8 32 9 33 10 34 11 35 12 36 13 37 14 38 15 39 16 40 17 41 18 42 19 43 20 44 21 45 22 46 23 47 )
--------------------------------------------------------------------------------
********************************************************************************
NUMA Topology
********************************************************************************
NUMA domains:           1
--------------------------------------------------------------------------------
Domain:                 0
Processors:             ( 0 24 1 25 2 26 3 27 4 28 5 29 6 30 7 31 8 32 9 33 10 34 11 35 12 36 13 37 14 38 15 39 16 40 17 41 18 42 19 43 20 44 21 45 22 46 23 47 )
Distances:              10
Free memory:            127356 MB
Total memory:           128438 MB
--------------------------------------------------------------------------------
