LLMs, or large language models, are very popular these days, and many people and organizations have begun to rely on them. This article continues the series of CPU-only benchmarks in the realm of LLM inference. Check out the other article on the subject, which uses a much better and more expensive processor – LLM inference benchmarks with llamacpp and AMD EPYC 9554 CPU. The setup presented in this article is several times cheaper, as is the AMD option here – LLM inference benchmarks with llamacpp and AMD EPYC 7282 CPU.
“Run it yourself” at home or within a business organization is always more secure and more privacy-friendly than cloud-based AI chat bots and assistants. There are many open source LLMs that hold their own in accuracy (intelligence?) and performance against the big ones such as OpenAI, Google Gemini and more.
The purpose of the article is to show the performance of the 3rd generation Intel Xeon Gold 6312U processor with 24 cores in a single socket board using 8 memory channels of DDR4 3200 MHz. The main testing software is llama.cpp with llama-bench.
Benchmark Results
Here are the benchmark results, summarized from the tests below. Each value is the mean of the five text-generation runs (tg128 to tg2048) for that model.
N | model | parameters | quantization | tokens per second |
---|---|---|---|---|
1 | DeepSeek R1 Llama | 8B | Q4_K_M | 21.25 |
2 | DeepSeek R1 Llama | 70B | Q4_K_M | 2.74 |
3 | Qwen – QwQ-32B | 32B | Q4_K_M | 5.67 |
4 | Llama 3.1 8B Instruct | 8B | Q4_K_M | 21.20 |
5 | Llama 3.3 70B Instruct | 70B | Q4_K_M | 2.74 |
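The summary values match the arithmetic mean of the five llama-bench text-generation results per model. A minimal sketch of that calculation follows; the numbers are copied from the llama-bench output in this article, while the dictionary and function names are only illustrative.

```python
from statistics import mean

# Text-generation speeds (t/s) copied from the llama-bench runs in this
# article, in the order tg128, tg256, tg512, tg1024, tg2048.
RUNS = {
    "DeepSeek R1 Llama 8B": [21.84, 21.67, 21.38, 20.99, 20.39],
    "DeepSeek R1 Llama 70B": [2.77, 2.76, 2.74, 2.72, 2.69],
    "Qwen QwQ-32B": [5.76, 5.74, 5.70, 5.62, 5.52],
    "Llama 3.1 8B Instruct": [21.80, 21.62, 21.32, 20.92, 20.35],
    "Llama 3.3 70B Instruct": [2.78, 2.77, 2.75, 2.73, 2.69],
}


def summarize(runs):
    """Return {model: mean tokens per second, rounded to 2 decimals}."""
    return {model: round(mean(speeds), 2) for model, speeds in runs.items()}


if __name__ == "__main__":
    for model, tps in summarize(RUNS).items():
        print(f"{model:<24} {tps:6.2f}")
```

The speed drops slightly as the generated sequence gets longer, so a single mean hides a small trend; the per-run tables further down keep the detail.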
Below the llama-bench output from the benchmarks there are also pure memory tests for this setup, made with the mlc tool (Intel® Memory Latency Checker v3.11b) and sysbench memory.
Hardware – what to expect from the Intel Xeon Gold 6312U
- Intel Xeon Gold 6312U – 24 cores / 48 threads CPU – an 8-channel memory processor with memory bandwidth around 170 GB/s
- 128G RAM in total, all memory channels are utilized – 1 x 8 channels.
- Supermicro – SPC621D8U-2T single socket board with 8 memory slots.
- 8 slots with 8 x 16G DDR4 Samsung 3200 MHz (NT16GA72D8PFX3K-JR)
- Intel Icelake SP architecture
- CPU dies: 1 per CPU
- With a similar single socket motherboard (and 128G RAM) the price on eBay in the first quarter of 2025 is $1800 ~ $2000 USD.
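For context on the "around 170 GB/s" figure, the theoretical peak of the memory subsystem can be estimated from the channel count and the transfer rate. The sketch below assumes a 64-bit data bus per DDR4 channel (ECC bits not counted) and reads "DDR4 3200 MHz" as 3200 mega-transfers per second; both are assumptions, not something stated in the hardware list. The measured value is the "ALL Reads" result from the mlc run later in this article.

```python
def ddr_peak_bandwidth_gbs(channels, mega_transfers_per_s, bus_width_bits=64):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 10**9 bytes).

    bus_width_bits=64 is an assumption (standard DDR4 data width per channel).
    """
    bytes_per_transfer = bus_width_bits // 8
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


peak = ddr_peak_bandwidth_gbs(channels=8, mega_transfers_per_s=3200)
measured = 173464.8 / 1000  # mlc "ALL Reads" in MB/s, converted to GB/s
efficiency = measured / peak

print(f"theoretical peak: {peak:.1f} GB/s")
print(f"measured share:   {efficiency:.1%}")
```

Under these assumptions the measured read bandwidth is roughly 85% of the theoretical peak.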
Software
All tests are made under Linux – Gentoo Linux.
- Gentoo Linux, everything built with “-march=native”
- Linux kernel – 6.13.5 (gentoo-kernel package)
- GNU GCC – gcc version 14.2.1 20241221
- Glibc – 2.41
Testing
Three groups of tests are presented here, all using llama.cpp for LLM inference.
- Deepseek R1 – Distill Llama 70B and Distill Llama 8B – Q4
- Qwen – QwQ-32B – Q4
- meta-llama – Llama 3.3 70B Instruct and Llama 3.1 8B Instruct – Q4
Testing benchmark with Deepseek R1 Distill Llama-70B
1. Deepseek R1 Distill Llama 70B
The first test uses 4-bit quantization (Q4_K_M) with the 70B Deepseek R1 Distill Llama. The files are downloaded from https://huggingface.co/bartowski/DeepSeek-R1-Distill-Llama-70B-GGUF, and this quantization is the one used in ollama by default.
srv ~ $ /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf -t 24 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |         tg128 |          2.77 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |         tg256 |          2.76 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |         tg512 |          2.74 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |        tg1024 |          2.72 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |        tg2048 |          2.69 ± 0.00 |
build: dfd6b2c0 (4818)
The generation speed is about 2.7 tokens per second, which is not very fast for a processor-only setup. It is really questionable whether it is usable.
2. Deepseek R1 Distill Llama 8B
When using Q4_K_M with the 8B Deepseek R1 Distill Llama, the speed is almost 8 times higher than with the 70B model. So about 8.8 times fewer model parameters (70.55 B versus 8.03 B) give about 7.8 times faster token generation. 21 tokens per second is fast for generating text for daily routines. The file was downloaded from here – huggingface.co.
srv ~ $ /root/llama.cpp/build/bin/llama-bench -m /root/models/bartowski/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -t 24 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |         tg128 |         21.84 ± 0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |         tg256 |         21.67 ± 0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |         tg512 |         21.38 ± 0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |        tg1024 |         20.99 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |        tg2048 |         20.39 ± 0.00 |
build: dfd6b2c0 (4818)
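The comparison between the 70B and 8B DeepSeek distills can be checked with a few lines of arithmetic. This is only a sketch: the parameter counts come from the llama-bench "params" column and the speeds from the summary table in this article, and the variable names are illustrative.

```python
# Values copied from the llama-bench tables and the summary table in this article.
PARAMS_B = {"70B": 70.55, "8B": 8.03}   # parameters, in billions
MEAN_TPS = {"70B": 2.74, "8B": 21.25}   # mean tokens per second

# How many times fewer parameters the 8B model has than the 70B model.
param_ratio = PARAMS_B["70B"] / PARAMS_B["8B"]
# How many times faster the 8B model generates tokens.
speed_ratio = MEAN_TPS["8B"] / MEAN_TPS["70B"]

print(f"{param_ratio:.2f}x fewer parameters -> {speed_ratio:.2f}x faster")
```

The speed-up is a little smaller than the reduction in parameters, so the scaling is close to linear but not exactly so.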
3. The Qwen model QwQ-32B developed by Alibaba Cloud
The Qwen models’ GGUF files can be downloaded from https://huggingface.co/Qwen.
srv ~ $ /root/llama.cpp/build/bin/llama-bench -m /root/models/Qwen/qwq-32b-q4_k_m.gguf -t 24 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      24 |         tg128 |          5.76 ± 0.00 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      24 |         tg256 |          5.74 ± 0.00 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      24 |         tg512 |          5.70 ± 0.00 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      24 |        tg1024 |          5.62 ± 0.00 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | BLAS,RPC   |      24 |        tg2048 |          5.52 ± 0.00 |
build: dfd6b2c0 (4818)
The new QwQ-32B model generates 5-6 tokens per second, so the model is probably usable on this processor for daily routines.
4. Meta Llama 3.3 70B Instruct
The Meta open source model Llama is widely used, and here is the benchmark with Llama 3.3 70B Instruct Q4_K_M.
srv ~ $ /root/llama.cpp/build/bin/llama-bench -m /root/models/MaziyarPanahi/Llama-3.3-70B-Instruct.Q4_K_M.gguf -t 24 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |         tg128 |          2.78 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |         tg256 |          2.77 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |         tg512 |          2.75 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |        tg1024 |          2.73 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | BLAS,RPC   |      24 |        tg2048 |          2.69 ± 0.00 |
build: dfd6b2c0 (4818)
5. Meta Llama 3.1 8B Instruct
A smaller model of the Meta Llama family. The results are around 21 tokens per second, which is fast enough for code generation. The GGUF file was downloaded from here – huggingface.co.
srv ~ $ /root/llama.cpp/build/bin/llama-bench -m /root/models/Meta-Llama/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -t 24 -p 0 -n 128,256,512,1024,2048
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |         tg128 |         21.80 ± 0.03 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |         tg256 |         21.62 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |         tg512 |         21.32 ± 0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |        tg1024 |         20.92 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | BLAS,RPC   |      24 |        tg2048 |         20.35 ± 0.00 |
build: dfd6b2c0 (4818)
Memory bandwidth tests
The first tests are with the Intel® Memory Latency Checker v3.11b. The mlc command-line tool manages to read around 173 GB/s on this single CPU.
srv ~/mlc_v3/Linux # ./mlc --memory_bandwidth_scan
Intel(R) Memory Latency Checker - v3.11b
Command line parameters: --memory_bandwidth_scan
Running memory bandwidth scan using 48 threads on numa node 0 accessing memory on numa node 0
Reserved 54 1GB pages
Now allocating 54 1GB pages. This may take several minutes..
1GB page allocation completed
Allocating remaining 74230536 KB memory in 4KB pages
Totally 122 GB memory allocated in 4KB+1GB pages on NUMA node 0
Measuring memory bandwidth for each of those 1GB memory regions..
Histogram report of BW in MB/sec across each 1GB region on NUMA node 0
BW_range(MB/sec)   #_of_1GB_regions
----------------   ----------------
[170000-174999]    3
[175000-179999]    119
Detailed BW report for each 1GB region allocated as contiguous 1GB page on NUMA node 0
phys_addr      GBaligned_page#   MB/sec
---------      ---------------   ------
0x440000000    17                173315
0x600000000    24                175668
0x640000000    25                175367
0x680000000    26                176248
0x6c0000000    27                175893
0x700000000    28                175804
0x740000000    29                175577
0x780000000    30                175870
0x7c0000000    31                175855
0x800000000    32                175587
0x840000000    33                175792
0x880000000    34                175772
0x8c0000000    35                175392
0x900000000    36                175680
0x940000000    37                175818
0x980000000    38                175693
0x9c0000000    39                175925
0xa00000000    40                175519
0xa40000000    41                175834
0xa80000000    42                175712
0xac0000000    43                175803
0xb40000000    45                176032
0xb80000000    46                175351
0xbc0000000    47                175613
0xc00000000    48                175305
0xc40000000    49                175619
0xcc0000000    51                175987
0xd00000000    52                176226
0xd80000000    54                175636
0xdc0000000    55                176033
0xe00000000    56                176087
0xe40000000    57                175848
0xe80000000    58                176803
0xf00000000    60                175517
0xf40000000    61                176320
0x1080000000   66                175739
0x1100000000   68                175487
0x11c0000000   71                175803
0x1280000000   74                175875
0x12c0000000   75                175990
0x1300000000   76                176194
0x1340000000   77                176459
0x1400000000   80                175514
0x1440000000   81                175833
0x1480000000   82                175455
0x1500000000   84                175898
0x1540000000   85                175807
0x1580000000   86                175335
0x15c0000000   87                175461
0x1640000000   89                175563
0x1680000000   90                175595
0x1a40000000   105               175343
0x1dc0000000   119               175726
0x1e80000000   122               175324
Detailed BW report for each 1GB region allocated as 4KB page on NUMA node 0
phys_addr      MB/sec
---------      ------
0xb0e7cd000    174968
0x1fffc6c000   174787
0x4d8995000    175322
0x2a791f000    175799
0x17dea8000    175319
0x4d7ab2000    176422
0x1ae3dbb000   177073
0x245dc4000    176582
0x1ac41ce000   176018
0x5af9d7000    175776
0x19201e1000   176089
0x18979ea000   175826
0x185f1f3000   175665
0x19561fd000   175805
0x1016e06000   176435
0x17fbe0f000   175859
0x17daa19000   176015
0x1761222000   175788
0x43fa2c000    175556
0xb10635000    176373
0x13e363e000   176616
0x3e9a48000    175985
0x575651000    176050
0x19d765b000   176468
0x195ca64000   176742
0x1ab426d000   175656
0x53c277000    175751
0xd56680000    176120
0x522a89000    176069
0x16d4693000   176231
0x58be9c000    176467
0x163e11c000   175962
0x279125000    176164
0x5cb22a000    175479
0x3b3235000    176027
0x59fe3e000    175802
0x13ab2c6000   176248
0xfb5acf000    175765
0x10ce6d8000   175744
0x1068db8000   176166
0x101b5c1000   176211
0x1fe19cb000   175696
0x1f399d4000   176518
0x1eca5dd000   176649
0x1da81e7000   176601
0x1c461f0000   176032
0x1c645f9000   176610
0x1d1ca03000   176076
0x1c9e20c000   176076
0x1bb6216000   176309
0x2c021f000    176642
0x496228000    176056
0xc82a32000    175401
0xf9ea3b000    175818
0x1045245000   175947
0x1195a8d000   175108
0x122a696000   175826
0x14cbaa0000   175953
0x16c56a9000   175988
0x178b2b2000   176711
0x19076bc000   175807
0x1ac12c5000   175802
0x1bc9acf000   175786
0x1c1c6d8000   175408
0x1d072e1000   175594
0x1e1f2eb000   175803
0x1f136f4000   175692
0x1f64afe000   175488
And just for the record, mlc executed without arguments.
srv ~/mlc_v3/Linux $ ./mlc
Intel(R) Memory Latency Checker - v3.11b
Measuring idle latencies for sequential access (in ns)...
             Numa node
Numa node         0
       0      134.7
Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :   173464.8
3:1 Reads-Writes :   155992.6
2:1 Reads-Writes :   152662.5
1:1 Reads-Writes :   128919.2
Stream-triad like:   160139.3
Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
             Numa node
Numa node         0
       0   173251.7
Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject   Latency   Bandwidth
Delay    (ns)      MB/sec
==========================
 00000   189.48    172439.0
 00002   189.91    172130.1
 00008   188.10    170764.5
 00015   182.81    169457.8
 00050   165.67    161445.8
 00100   134.49    141202.0
 00200    99.04     75465.1
 00300    93.10     56414.2
 00400    88.26     44538.6
 00500    84.69     36080.3
 00700    81.36     26254.1
 01000    78.87     18758.2
 01300    77.71     14681.6
 01700    76.24     11466.5
 02500    73.84      8114.4
 03500    72.84      6065.9
 05000    71.98      4525.1
 09000    70.25      2934.0
 20000    69.19      1836.2
Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT latency    50.6
Local Socket L2->L2 HITM latency   51.3
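One way to relate the memory bandwidth to the llama-bench results is a rough ceiling estimate. The sketch below assumes that generating one token requires reading all model weights from RAM once, which is a common rule of thumb and not something this article measures; it ignores caches, the KV cache and compute time. The bandwidth is the mlc "ALL Reads" value and the model sizes and speeds come from the llama-bench tables above.

```python
GIB = 2**30  # llama-bench reports model size in GiB


def tokens_per_second_ceiling(bandwidth_mb_s, model_size_gib):
    """Upper bound on t/s, assuming each token streams all weights from RAM once."""
    bytes_per_second = bandwidth_mb_s * 1e6  # mlc: 1 MB/sec = 1,000,000 bytes/sec
    model_bytes = model_size_gib * GIB
    return bytes_per_second / model_bytes


ALL_READS_MB_S = 173464.8  # from the mlc output in this article

# model: (size in GiB, mean measured tokens per second), from this article
MODELS = {
    "llama 70B Q4_K_M": (39.59, 2.74),
    "qwen2 32B Q4_K_M": (18.48, 5.67),
    "llama 8B Q4_K_M": (4.58, 21.25),
}

for name, (size_gib, measured) in MODELS.items():
    ceiling = tokens_per_second_ceiling(ALL_READS_MB_S, size_gib)
    print(f"{name:<18} ceiling {ceiling:6.2f} t/s, measured {measured:5.2f} t/s "
          f"({measured / ceiling:.0%} of ceiling)")
```

Under this assumption the measured speeds land at roughly 60-70% of the bandwidth ceiling, which is consistent with token generation being limited mostly by memory bandwidth on this system.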
The second set of tests uses sysbench. With 48 threads it reaches 195532.90 MiB/sec, the full potential of this single-processor system. The first run uses all 48 threads and the second run uses half of them, i.e. 24, which matches the number of physical cores. The second run reaches just 133834.17 MiB/sec, which suggests that sysbench did not use the full potential with 24 threads because of NUMA awareness.
srv ~ $ sysbench memory --memory-oper=read --memory-block-size=1K --memory-total-size=2000G --threads=48 run
sysbench 1.0.20 (using system LuaJIT 2.1.1731601260)
Running the test with following options:
Number of threads: 48
Initializing random number generator from current time
Running memory speed test with the following options:
  block size: 1KiB
  total size: 2048000MiB
  operation: read
  scope: global
Initializing worker threads...
Threads started!
Total operations: 2003243250 (200225686.14 per second)
1956292.24 MiB transferred (195532.90 MiB/sec)
General statistics:
    total time:                          10.0032s
    total number of events:              2003243250
Latency (ms):
         min:                                    0.00
         avg:                                    0.00
         max:                                    4.36
         95th percentile:                        0.00
         sum:                               134189.68
Threads fairness:
    events (avg/stddev):           41734234.3750/535217.67
    execution time (avg/stddev):   2.7956/0.02
srv ~ $ sysbench memory --memory-oper=read --memory-block-size=1K --memory-total-size=2000G --threads=24 run
sysbench 1.0.20 (using system LuaJIT 2.1.1731601260)
Running the test with following options:
Number of threads: 24
Initializing random number generator from current time
Running memory speed test with the following options:
  block size: 1KiB
  total size: 2048000MiB
  operation: read
  scope: global
Initializing worker threads...
Threads started!
Total operations: 1371023591 (137046191.81 per second)
1338890.23 MiB transferred (133834.17 MiB/sec)
General statistics:
    total time:                          10.0024s
    total number of events:              1371023591
Latency (ms):
         min:                                    0.00
         avg:                                    0.00
         max:                                    0.05
         95th percentile:                        0.00
         sum:                                62612.40
Threads fairness:
    events (avg/stddev):           57125982.9583/395910.15
    execution time (avg/stddev):   2.6088/0.01
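Note that sysbench reports MiB/sec (2^20 bytes) while mlc reports MB/sec (1,000,000 bytes), so the two sets of numbers are not directly comparable. A small conversion sketch follows; the function name is illustrative and the inputs are the two sysbench results above.

```python
MIB = 2**20  # bytes in one MiB, the unit sysbench uses


def mib_s_to_gb_s(mib_per_s):
    """Convert MiB/s to GB/s, with 1 GB = 10**9 bytes as used by mlc."""
    return mib_per_s * MIB / 1e9


sysbench_48_threads = mib_s_to_gb_s(195532.90)
sysbench_24_threads = mib_s_to_gb_s(133834.17)

print(f"48 threads: {sysbench_48_threads:.2f} GB/s")
print(f"24 threads: {sysbench_24_threads:.2f} GB/s")
```

After conversion the 48-thread result is about 205 GB/s, noticeably above the roughly 173 GB/s that mlc reports. One possible explanation is that the 1 KiB block used by sysbench is served largely from CPU caches rather than from DRAM; this is an interpretation, not something verified in these tests.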
Notes
System cache
Before all tests, automatic NUMA balancing was disabled and the system cache was dropped:
echo 0 > /proc/sys/kernel/numa_balancing
echo 3 > /proc/sys/vm/drop_caches
Whether or not the cache is dropped may have a significant impact on the test results.
NUMA and hardware topology
The NUMA configuration is a single node:
srv ~ $ numactl -s
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
cpubind: 0
nodebind: 0
membind: 0
preferred:
The hardware CPU, cores, threads, cache and NUMA topology with likwid-topology:
srv ~ $ /usr/local/likwid/bin/likwid-topology
--------------------------------------------------------------------------------
CPU name:	Intel(R) Xeon(R) Gold 6312U CPU @ 2.40GHz
CPU type:	Intel Icelake SP processor
CPU stepping:	6
********************************************************************************
Hardware Thread Topology
********************************************************************************
Sockets:		1
CPU dies:		1
Cores per socket:	24
Threads per core:	2
--------------------------------------------------------------------------------
HWThread   Thread   Core   Die   Socket   Available
0          0        0      0     0        *
1          0        1      0     0        *
2          0        2      0     0        *
3          0        3      0     0        *
4          0        4      0     0        *
5          0        5      0     0        *
6          0        6      0     0        *
7          0        7      0     0        *
8          0        8      0     0        *
9          0        9      0     0        *
10         0        10     0     0        *
11         0        11     0     0        *
12         0        12     0     0        *
13         0        13     0     0        *
14         0        14     0     0        *
15         0        15     0     0        *
16         0        16     0     0        *
17         0        17     0     0        *
18         0        18     0     0        *
19         0        19     0     0        *
20         0        20     0     0        *
21         0        21     0     0        *
22         0        22     0     0        *
23         0        23     0     0        *
24         1        0      0     0        *
25         1        1      0     0        *
26         1        2      0     0        *
27         1        3      0     0        *
28         1        4      0     0        *
29         1        5      0     0        *
30         1        6      0     0        *
31         1        7      0     0        *
32         1        8      0     0        *
33         1        9      0     0        *
34         1        10     0     0        *
35         1        11     0     0        *
36         1        12     0     0        *
37         1        13     0     0        *
38         1        14     0     0        *
39         1        15     0     0        *
40         1        16     0     0        *
41         1        17     0     0        *
42         1        18     0     0        *
43         1        19     0     0        *
44         1        20     0     0        *
45         1        21     0     0        *
46         1        22     0     0        *
47         1        23     0     0        *
--------------------------------------------------------------------------------
Socket 0:		( 0 24 1 25 2 26 3 27 4 28 5 29 6 30 7 31 8 32 9 33 10 34 11 35 12 36 13 37 14 38 15 39 16 40 17 41 18 42 19 43 20 44 21 45 22 46 23 47 )
--------------------------------------------------------------------------------
********************************************************************************
Cache Topology
********************************************************************************
Level:			1
Size:			48 kB
Cache groups:		( 0 24 ) ( 1 25 ) ( 2 26 ) ( 3 27 ) ( 4 28 ) ( 5 29 ) ( 6 30 ) ( 7 31 ) ( 8 32 ) ( 9 33 ) ( 10 34 ) ( 11 35 ) ( 12 36 ) ( 13 37 ) ( 14 38 ) ( 15 39 ) ( 16 40 ) ( 17 41 ) ( 18 42 ) ( 19 43 ) ( 20 44 ) ( 21 45 ) ( 22 46 ) ( 23 47 )
--------------------------------------------------------------------------------
Level:			2
Size:			1.25 MB
Cache groups:		( 0 24 ) ( 1 25 ) ( 2 26 ) ( 3 27 ) ( 4 28 ) ( 5 29 ) ( 6 30 ) ( 7 31 ) ( 8 32 ) ( 9 33 ) ( 10 34 ) ( 11 35 ) ( 12 36 ) ( 13 37 ) ( 14 38 ) ( 15 39 ) ( 16 40 ) ( 17 41 ) ( 18 42 ) ( 19 43 ) ( 20 44 ) ( 21 45 ) ( 22 46 ) ( 23 47 )
--------------------------------------------------------------------------------
Level:			3
Size:			36 MB
Cache groups:		( 0 24 1 25 2 26 3 27 4 28 5 29 6 30 7 31 8 32 9 33 10 34 11 35 12 36 13 37 14 38 15 39 16 40 17 41 18 42 19 43 20 44 21 45 22 46 23 47 )
--------------------------------------------------------------------------------
********************************************************************************
NUMA Topology
********************************************************************************
NUMA domains:		1
--------------------------------------------------------------------------------
Domain:			0
Processors:		( 0 24 1 25 2 26 3 27 4 28 5 29 6 30 7 31 8 32 9 33 10 34 11 35 12 36 13 37 14 38 15 39 16 40 17 41 18 42 19 43 20 44 21 45 22 46 23 47 )
Distances:		10
Free memory:		127356 MB
Total memory:		128438 MB
--------------------------------------------------------------------------------