llama-bench the Qwen3.6 27B and NVIDIA RTX 3090

Author:

main menu
tokens per second

For the LLM model the Alibaba’s Qwen 3.6 27B with different quantization are used to show the difference in token generation per second and memory consumption. Qwen 3.6 27B is relatively small LLM model, but it is extremely good in general AI and software coding. The software engineers could use it not only for sophisticated auto-complete, but also for local agent coding. It is published by the Alibaba giant, and in many cases it can be considered to offload some LLM work locally for free, especially IT. The model is dense with 27B total parameters. Bare in mind, this model is relatively small compared to the bigger one – Qwen 3.5 397B A17B, which is really big LLM with 300 GiB memory for the Q4 at least, but fighting with the very expensive Claude 4.6. The Qwen 3.6 27B is light-weight and dense model, which with enough context data may result in strong agent assistant coding help in the IT. The article is focused only showing the benchmark of the LLM tokens generations per second and there are other papers on the quality of the output for the different quantized version. This model is ideal for even a single NVIDIA RTX 3090 card for local inference even with Q5 or Q6 quantization – perfect for a home use. At present, this cards with RTX 3090 chip are around 1100-1200$. Building a low cost, but with good inference performance workstation is possible with couple of these cards. Here are the speed and the use can be expected. This article includes a BF16 (brain floating point) of the model using triple NVIDIA RTX 3090 24Gb (to load the whole model in the GPU memory), all other tests are with a single card.
With dual or triple NVIDIA RTX 3090 24Gb setup it is enough to load the model weights and a big context windows such as 130 000 or even the full supported context size of 262 144 tokens using not so aggressive quantization such as 8 and 16 bit weights.
The testing bench is:

  • Triple GPU NVIDIA RTX 3090 24Gb – 10496 shading units, 328 tensor cores.
  • 24GB RAM GDDR6, with 384 bit bus width. Total with three cards 72GB GDDR6 memory.
  • Test server – ASUS with single 1950X. Old and cheap CPU, but still good enough for a home lab.
  • 2 cards with Link Speed 16GT/s, Width x16 and 1 card with x4
  • Testing with LLAMA.CPP – llama-bench
  • theoretical memory bandwidth 936.2 GB/s (according to the official documents from NVIDIA)
  • the context window is the default 4K of the llama-bench tool. The memory consumption could vary greatly if context window is increased.
  • Price: around $1100 in ebay.com (Q2 2026).

Here are the results. The first benchmark test is Q4 and is used as a baseline for the diff column below, because Q4 are really popular and they offer a good quality and really small footprint related to the full sized model version.

N model parameters quantization memory diff t/s % tokens/s
1 Qwen3.6 27B 26.90 B Q4_K_M 15.65 GiB 0 39.878
2 Qwen3.6 27B 26.90 B Q5_K_M 18.16 GiB 10.08 35.856
3 Qwen3.6 27B 26.90 B Q6_K 20.97 GiB 11.76 31.636
4 Qwen3.6 27B 26.90 B Q8_0 26.62 GiB 15.28 26.802
5* Qwen3.6 27B 26.90 B BF16 50.10 GiB 42.11 15.514


It’s worth noting when the total output tokens increase, the tokens per second does not decrease such with the MoE models. The above table is generated using the least GPUs needed for the test, so Q4_K_M and Q5_K_M are on a single card, the Q6_K and Q8_0 is on dual cards setup and the last one, the triple cards for the BF16. From Q4 to BF16 the decrease is 61.09%, which is absolutely usable for local AI inference and agentic work.

The difference between the Q4 and BF16 in the tokens per second is 37.36% speed degradation and even the BF16 (brain floating point) with two cards is usable with tokens generation around 20 per second. Around 15 tokens per second is good and usable for daily use for a single user, which is what the GPU inference would offer easily.

Here are all the tests output. There are three benchmark tests – with single RTX 3090, with two and three. Of course, some quantization files could not fit in the memory, so no benchmark tests are available there. The BF16 only loads with three card setup.

1. Qwen 3.6 27B Q4_K_M

Using unsloth Qwen3.6-27B-Q4_K_M.gguf file.
Single NVIDIA RTX 3090:

/root/llama.cpp/build/bin/llama-bench --numa distribute -t 16 -p 0 -n 128,256,512,1024,2048  -m /root/models/unsloth/Qwen3.6-27B-Q4_K_M.gguf 
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24124 MiB):
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium       |  15.65 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg128 |         40.14 ± 0.08 |
| qwen35 27B Q4_K - Medium       |  15.65 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg256 |         40.06 ± 0.03 |
| qwen35 27B Q4_K - Medium       |  15.65 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg512 |         39.76 ± 0.02 |
| qwen35 27B Q4_K - Medium       |  15.65 GiB |    26.90 B | CUDA,BLAS  |      16 |          tg1024 |         39.77 ± 0.03 |
| qwen35 27B Q4_K - Medium       |  15.65 GiB |    26.90 B | CUDA,BLAS  |      16 |          tg2048 |         39.66 ± 0.00 |

build: ce8fd4b1a (8777)

Dual NVIDIA RTX 3090:

/root/llama.cpp/build/bin/llama-bench --numa distribute -t 16 -p 0 -n 128,256,512,1024,2048  -m /root/models/unsloth/Qwen3.6-27B-Q4_K_M.gguf
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 48250 MiB):
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium       |  15.65 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg128 |         40.61 ± 0.13 |
| qwen35 27B Q4_K - Medium       |  15.65 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg256 |         40.70 ± 0.03 |
| qwen35 27B Q4_K - Medium       |  15.65 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg512 |         40.43 ± 0.02 |
| qwen35 27B Q4_K - Medium       |  15.65 GiB |    26.90 B | CUDA,BLAS  |      16 |          tg1024 |         40.31 ± 0.02 |
| qwen35 27B Q4_K - Medium       |  15.65 GiB |    26.90 B | CUDA,BLAS  |      16 |          tg2048 |         40.17 ± 0.02 |

build: ce8fd4b1a (8777)

Tripple NVIDIA RTX 3090:

/root/llama.cpp/build/bin/llama-bench --numa distribute -t 16 -p 0 -n 128,256,512,1024,2048  -m /root/models/unsloth/Qwen3.6-27B-Q4_K_M.gguf
ggml_cuda_init: found 3 CUDA devices (Total VRAM: 72376 MiB):
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
  Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium       |  15.65 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg128 |         39.46 ± 0.10 |
| qwen35 27B Q4_K - Medium       |  15.65 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg256 |         39.62 ± 0.02 |
| qwen35 27B Q4_K - Medium       |  15.65 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg512 |         39.40 ± 0.01 |
| qwen35 27B Q4_K - Medium       |  15.65 GiB |    26.90 B | CUDA,BLAS  |      16 |          tg1024 |         39.38 ± 0.02 |
| qwen35 27B Q4_K - Medium       |  15.65 GiB |    26.90 B | CUDA,BLAS  |      16 |          tg2048 |         39.25 ± 0.01 |

build: ce8fd4b1a (8777)

So no difference, it’s 39.5~40 tokens/s.

2. Qwen 3.6 27B Q5_K_M

.Using unsloth Qwen3.6-27B-Q5_K_M.gguf file.
Single NVIDIA RTX 3090:

/root/llama.cpp/build/bin/llama-bench --numa distribute -t 16 -p 0 -n 128,256,512,1024,2048  -m /root/models/unsloth/Qwen3.6-27B-Q5_K_M.gguf 
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24124 MiB):
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen35 27B Q5_K - Medium       |  18.16 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg128 |         36.00 ± 0.07 |
| qwen35 27B Q5_K - Medium       |  18.16 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg256 |         36.02 ± 0.03 |
| qwen35 27B Q5_K - Medium       |  18.16 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg512 |         35.79 ± 0.01 |
| qwen35 27B Q5_K - Medium       |  18.16 GiB |    26.90 B | CUDA,BLAS  |      16 |          tg1024 |         35.77 ± 0.02 |
| qwen35 27B Q5_K - Medium       |  18.16 GiB |    26.90 B | CUDA,BLAS  |      16 |          tg2048 |         35.70 ± 0.00 |

build: ce8fd4b1a (8777)

Dual NVIDIA RTX 3090:

/root/llama.cpp/build/bin/llama-bench --numa distribute -t 16 -p 0 -n 128,256,512,1024,2048  -m /root/models/unsloth/Qwen3.6-27B-Q5_K_M.gguf
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 48250 MiB):
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen35 27B Q5_K - Medium       |  18.16 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg128 |         36.34 ± 0.10 |
| qwen35 27B Q5_K - Medium       |  18.16 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg256 |         36.41 ± 0.03 |
| qwen35 27B Q5_K - Medium       |  18.16 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg512 |         36.15 ± 0.02 |
| qwen35 27B Q5_K - Medium       |  18.16 GiB |    26.90 B | CUDA,BLAS  |      16 |          tg1024 |         36.13 ± 0.01 |
| qwen35 27B Q5_K - Medium       |  18.16 GiB |    26.90 B | CUDA,BLAS  |      16 |          tg2048 |         35.98 ± 0.02 |

build: ce8fd4b1a (8777)

Tripple NVIDIA RTX 3090:

/root/llama.cpp/build/bin/llama-bench --numa distribute -t 16 -p 0 -n 128,256,512,1024,2048  -m /root/models/unsloth/Qwen3.6-27B-Q5_K_M.gguf
ggml_cuda_init: found 3 CUDA devices (Total VRAM: 72376 MiB):
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
  Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen35 27B Q5_K - Medium       |  18.16 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg128 |         35.63 ± 0.08 |
| qwen35 27B Q5_K - Medium       |  18.16 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg256 |         35.72 ± 0.03 |
| qwen35 27B Q5_K - Medium       |  18.16 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg512 |         35.55 ± 0.01 |
| qwen35 27B Q5_K - Medium       |  18.16 GiB |    26.90 B | CUDA,BLAS  |      16 |          tg1024 |         35.56 ± 0.02 |
| qwen35 27B Q5_K - Medium       |  18.16 GiB |    26.90 B | CUDA,BLAS  |      16 |          tg2048 |         35.47 ± 0.01 |

build: ce8fd4b1a (8777)

3. Qwen 3.6 27B Q6_K

Using unsloth Qwen3.6-27B-Q6_K.gguf file.
Single NVIDIA RTX 3090:

/root/llama.cpp/build/bin/llama-bench --numa distribute -t 16 -p 0 -n 128,256,512,1024,2048  -m /root/models/unsloth/Qwen3.6-27B-Q6_K.gguf 
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24124 MiB):
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen35 27B Q6_K                |  20.97 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg128 |         31.80 ± 0.04 |
| qwen35 27B Q6_K                |  20.97 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg256 |         31.74 ± 0.02 |
| qwen35 27B Q6_K                |  20.97 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg512 |         31.56 ± 0.01 |
| qwen35 27B Q6_K                |  20.97 GiB |    26.90 B | CUDA,BLAS  |      16 |          tg1024 |         31.57 ± 0.01 |
| qwen35 27B Q6_K                |  20.97 GiB |    26.90 B | CUDA,BLAS  |      16 |          tg2048 |         31.51 ± 0.00 |

build: ce8fd4b1a (8777)

Dual NVIDIA RTX 3090:

/root/llama.cpp/build/bin/llama-bench --numa distribute -t 16 -p 0 -n 128,256,512,1024,2048  -m /root/models/unsloth/Qwen3.6-27B-Q6_K.gguf
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 48250 MiB):
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen35 27B Q6_K                |  20.97 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg128 |         32.21 ± 0.08 |
| qwen35 27B Q6_K                |  20.97 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg256 |         32.21 ± 0.02 |
| qwen35 27B Q6_K                |  20.97 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg512 |         32.02 ± 0.02 |
| qwen35 27B Q6_K                |  20.97 GiB |    26.90 B | CUDA,BLAS  |      16 |          tg1024 |         31.98 ± 0.01 |
| qwen35 27B Q6_K                |  20.97 GiB |    26.90 B | CUDA,BLAS  |      16 |          tg2048 |         31.84 ± 0.01 |

build: ce8fd4b1a (8777)

Tripple NVIDIA RTX 3090:

/root/llama.cpp/build/bin/llama-bench --numa distribute -t 16 -p 0 -n 128,256,512,1024,2048  -m /root/models/unsloth/Qwen3.6-27B-Q6_K.gguf
ggml_cuda_init: found 3 CUDA devices (Total VRAM: 72376 MiB):
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
  Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen35 27B Q6_K                |  20.97 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg128 |         31.73 ± 0.07 |
| qwen35 27B Q6_K                |  20.97 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg256 |         31.77 ± 0.01 |
| qwen35 27B Q6_K                |  20.97 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg512 |         31.65 ± 0.01 |
| qwen35 27B Q6_K                |  20.97 GiB |    26.90 B | CUDA,BLAS  |      16 |          tg1024 |         31.64 ± 0.02 |
| qwen35 27B Q6_K                |  20.97 GiB |    26.90 B | CUDA,BLAS  |      16 |          tg2048 |         31.58 ± 0.01 |

build: ce8fd4b1a (8777)

4. Qwen 3.6 27B Q8_0

Using unsloth Qwen3.6-27B-Q8_0.gguf file.
Dual NVIDIA RTX 3090:

/root/llama.cpp/build/bin/llama-bench --numa distribute -t 16 -p 0 -n 128,256,512,1024,2048  -m /root/models/unsloth/Qwen3.6-27B-Q8_0.gguf
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 48250 MiB):
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg128 |         26.86 ± 0.09 |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg256 |         26.89 ± 0.02 |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg512 |         26.76 ± 0.00 |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | CUDA,BLAS  |      16 |          tg1024 |         26.77 ± 0.01 |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | CUDA,BLAS  |      16 |          tg2048 |         26.73 ± 0.01 |

build: ce8fd4b1a (8777)

Tripple NVIDIA RTX 3090:

/root/llama.cpp/build/bin/llama-bench --numa distribute -t 16 -p 0 -n 128,256,512,1024,2048  -m /root/models/unsloth/Qwen3.6-27B-Q8_0.gguf
ggml_cuda_init: found 3 CUDA devices (Total VRAM: 72376 MiB):
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
  Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg128 |         26.50 ± 0.08 |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg256 |         26.59 ± 0.01 |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg512 |         26.50 ± 0.01 |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | CUDA,BLAS  |      16 |          tg1024 |         26.50 ± 0.01 |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | CUDA,BLAS  |      16 |          tg2048 |         26.46 ± 0.01 |

build: ce8fd4b1a (8777)

5. Qwen 3.6 27B BF16

Using unsloth Qwen3.6-27B-BF16-00001-of-00002.gguf and Qwen3.6-27B-BF16-00001-of-00002.gguf files.
Tripple NVIDIA RTX 3090:

/root/llama.cpp/build/bin/llama-bench --numa distribute -t 16 -p 0 -n 128,256,512,1024,2048  -m /root/models/unsloth/Qwen3.6-27B-BF16-00001-of-00002.gguf
ggml_cuda_init: found 3 CUDA devices (Total VRAM: 72376 MiB):
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
  Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen35 27B BF16                |  50.10 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg128 |         15.52 ± 0.01 |
| qwen35 27B BF16                |  50.10 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg256 |         15.53 ± 0.00 |
| qwen35 27B BF16                |  50.10 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg512 |         15.51 ± 0.00 |
| qwen35 27B BF16                |  50.10 GiB |    26.90 B | CUDA,BLAS  |      16 |          tg1024 |         15.51 ± 0.00 |
| qwen35 27B BF16                |  50.10 GiB |    26.90 B | CUDA,BLAS  |      16 |          tg2048 |         15.50 ± 0.00 |

build: ce8fd4b1a (8777)

Bear in mind that there is an additional latency because of the data transfer between the two/three cards using the PCI-E. Running BF16 is only possible with three cards with 24Gb RAM.

Many tokens output

When the output is large amount of tokens, the average speed of the tokens per second generation is not degradating such as with the MoE models. Here is a test with 32K tokens output and the performance matches the above performance tests!

/root/llama.cpp/build/bin/llama-bench --numa distribute -t 16 -p 0 -n 32768 -m /root/models/unsloth/Qwen3.6-27B-BF16-00001-of-00002.gguf
ggml_cuda_init: found 3 CUDA devices (Total VRAM: 72376 MiB):
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
  Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
| model                          |       size |     params | backend    | threads |              test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ----------------: | -------------------: |
| qwen35 27B BF16                |  50.10 GiB |    26.90 B | CUDA,BLAS  |      16 |           tg32768 |         15.51 ± 0.01 |

build: ce8fd4b1a (8777)

Before all tests the cleaning cache commands were executed:

echo 0 > /proc/sys/kernel/numa_balancing
echo 3 > /proc/sys/vm/drop_caches

For more LLM performance benchmarks with llama-bench check out – here.

Leave a Reply

Your email address will not be published. Required fields are marked *