Even building an LLM inference rig with multiple GPUs can be a challenge, because full x16 riser/extender cables are really chunky and can interfere with airflow. PCIe riser cables are short, thick ribbon cables that should not be bent; they are really only meant to move one or two video cards out of the physical PCIe slots of the computer. What if we want 10 or 20 video cards in one computer, or connected to it? We can use an Oculink to PCIe 4.0 x4 adapter, which connects over an SFF-8611/8612 cable to a PCIe x16 expansion card into which the GPU is slotted. The link is limited to x4 at most, but on a modern motherboard that supports PCIe bifurcation, a single x16 slot can host 4 GPUs using x4x4x4x4 bifurcation and 4 SFF-8611 cables. The SFF-8611 cable itself is not a thick ribbon cable, but a normal round cable only slightly thicker than usual, and it can be bent without problems. It is also available in 20, 30, 50 and 90 cm lengths, which makes it possible to keep the computer’s motherboard in one case and all the GPUs in multiple other cases.
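Before running any benchmarks it is worth checking what link every GPU actually negotiated, especially on a bifurcated slot. A minimal sketch, assuming the NVIDIA driver is installed (the query fields are standard nvidia-smi properties):

# Report the current PCIe generation and lane width negotiated by each GPU
nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current --format=csv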
The testing bench is:
- NVIDIA RTX3090 Founders Edition
- 24GB VRAM
- MS73-HB0 with dual Intel Xeon 8480+ CPUs
- Using only models that fit entirely in VRAM – 3 big models (32B and 30B parameters) and several smaller ones (12B and 8B) with different quantizations (Q4, Q8 and 16-bit).
- Testing with llama.cpp – llama-bench (a representative invocation is shown right after this list)
- Oculink to PCIe 4.0 X4 Adapter
- SFF-8611/8612 cable
- PCIe x16-to-x4 adapter board with ATX power. A SATA-powered one was also tested, but the GPU dropped out of the Linux system.
- Linux PCIe data links – slotted card: Speed 16GT/s, Width x16; using a PCIe riser: Speed 2.5GT/s (downgraded), Width x16; using Oculink: Speed 16GT/s, Width x4 (downgraded)
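Each configuration was benchmarked with the same llama-bench command line; here is a representative invocation (all flags and the model path are taken from the actual runs shown at the end of this article):

# Token-generation-only benchmark: -p 0 skips the prompt-processing tests,
# -n runs generation at 128..2048 tokens, -ngl 81 offloads every layer to the GPU,
# -t 112 CPU threads, --numa distribute interleaves memory across the NUMA nodes
/root/llama.cpp/build/bin/llama-bench --numa distribute \
  -m /root/models/bartowski/Qwen2.5-32B-Instruct-Q4_K_M.gguf \
  -t 112 -p 0 -n 128,256,512,1024,2048 -ngl 81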
Here are the results:
N | model | parameters | quantization | slot t/s | using riser t/s | oculink t/s |
---|---|---|---|---|---|---|
1 | Qwen 2.5 32B Instruct | 32B | Q4_K_M | 34.18 | 33.524 | 33.888 |
2 | Qwen 3 QwQ 32B | 32B | Q4_K_M | 33.778 | 34.124 | 34.066 |
3 | Qwen3 30B A3B | 30B | Q4_K_M | 118.182 | 118.87 | 116.908 |
4 | Gemma 3 27b it | 27B | Q4_K_M | 37.852 | 37.142 | 37.656 |
5 | Gemma 3 12b it | 12B | Q4_K_M | 71.986 | 71.324 | 71.784 |
6 | Gemma 3 12b it | 12B | Q8_0 | 49.608 | 49.8 | 49.462 |
7 | DeepSeek R1 Distill Llama 8B | 8B | Q4_K_M | 119.33 | 120.592 | 119.028 |
8 | DeepSeek R1 Distill Llama 8B | 8B | Q8_0 | 83.914 | 85.128 | 83.56 |
9 | DeepSeek R1 Distill Llama 8B | 8B | f16 | 50.906 | 51.73 | 50.762 |
10 | Meta Llama 3.1 | 8B | Q4_K_M | 120.622 | 119.566 | 120.176 |
11 | Meta Llama 3.1 | 8B | Q8_0 | 84.74 | 85.186 | 84.42 |
It appears the difference in tokens per second between the setups is below 1-2% for each test, and there are even benchmark runs where the riser or Oculink setup outputs more tokens per second than the slotted GPU. For LLM inference it does not matter whether the GPU is slotted directly into a PCIe slot on the motherboard, connected with a PCIe riser cable/board, or attached over an x4 Oculink connection.
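This is expected: once the model weights are fully loaded into VRAM, the PCIe link mostly carries the prompt and the sampled tokens, so even the slowest of the three links has plenty of headroom for single-GPU inference. A rough back-of-the-envelope sketch of the theoretical per-direction bandwidths (Gen1 uses 8b/10b encoding, Gen3/Gen4 use 128b/130b):

# 16 GT/s and 2.5 GT/s are per-lane raw rates; divide by 8 to get GB/s
awk 'BEGIN {
  printf "Gen4 x16 (slotted card):   %.1f GB/s\n", 16  * 16 * (128/130) / 8;
  printf "Gen1 x16 (riser, 2.5GT/s): %.1f GB/s\n", 2.5 * 16 * (8/10)    / 8;
  printf "Gen4 x4  (Oculink):        %.1f GB/s\n", 16  *  4 * (128/130) / 8;
}'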
Output of the tests:
RTX3090 video card slotted directly on the motherboard.
First, Linux reports the GPU link:
a8:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1) (prog-if 00 [VGA controller]) Subsystem: NVIDIA Corporation Device 147d Physical Slot: 5 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0 Interrupt: pin A routed to IRQ 266 NUMA node: 6 Region 0: Memory at e0000000 (32-bit, non-prefetchable) [size=16M] Region 1: Memory at 20cfe0000000 (64-bit, prefetchable) [size=256M] Region 3: Memory at 20cff0000000 (64-bit, prefetchable) [size=32M] Region 5: I/O ports at c000 [size=128] Expansion ROM at e1000000 [virtual] [disabled] [size=512K] Capabilities: [60] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+ Address: 00000000fee00058 Data: 0000 Capabilities: [78] Express (v2) Legacy Endpoint, IntMsgNum 0 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ TEE-IO- DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq- RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset- MaxPayload 256 bytes, MaxReadReq 4096 bytes DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend- LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk+ ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt- LnkSta: Speed 16GT/s, Width x16 TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR- 10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix- EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit- FRS- AtomicOpsCap: 32bit- 64bit- 128bitCAS- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- AtomicOpsCtl: ReqEn- IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq- 10BitTagReq+ OBFF Disabled, EETLPPrefixBlk- LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS- LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+ EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest- Retimer- 2Retimers- CrosslinkRes: unsupported Capabilities: [b4] Vendor Specific Information: Len=14 <?> Capabilities: [100 v1] Virtual Channel Caps: LPEVC=0 RefClk=100ns PATEntryBits=1 Arb: Fixed- WRR32- WRR64- WRR128- Ctrl: ArbSelect=Fixed Status: InProgress- VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans- Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256- Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01 Status: NegoPending- InProgress- Capabilities: [258 v1] L1 PM Substates L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+ PortCommonModeRestoreTime=255us PortTPowerOnTime=10us L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1- T_CommonMode=0us LTR1.2_Threshold=0ns L1SubCtl2: T_PwrOn=10us Capabilities: [128 v1] Power Budgeting <?> Capabilities: [420 v2] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- 
AtomicOpBlocked- TLPBlockedErr- PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr- PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr- PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CorrIntErr- HeaderOF- CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr- HeaderOF+ AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn- MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap- HeaderLog: 00000000 00000000 00000000 00000000 Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?> Capabilities: [900 v1] Secondary PCI Express LnkCtl3: LnkEquIntrruptEn- PerformEqu- LaneErrStat: 0 Capabilities: [bb0 v1] Physical Resizable BAR BAR 0: current size: 16MB, supported: 16MB BAR 1: current size: 256MB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB BAR 3: current size: 32MB, supported: 32MB Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?> Capabilities: [d00 v1] Lane Margining at the Receiver PortCap: Uses Driver+ PortSta: MargReady+ MargSoftReady+ Capabilities: [e00 v1] Data Link Feature <?> Kernel driver in use: nvidia Kernel modules: nouveau, nvidia_drm, nvidia
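The full lspci -vv dump above is verbose; when comparing the three setups it is enough to filter for the LnkCap and LnkSta lines (a8:00.0 is the bus address reported above):

# LnkCap = what the card supports, LnkSta = what was actually negotiated
sudo lspci -vv -s a8:00.0 | grep -E 'LnkCap:|LnkSta:'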
Second, the benchmark output using llama-bench:
root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/Qwen2.5-32B-Instruct-Q4_K_M.gguf -t 112 -p 0 -n 128,256,512,1024,2048 -ngl 81 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes | model | size | params | backend | ngl | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: | | qwen2 32B Q4_K - Medium | 18.48 GiB | 32.76 B | CUDA | 81 | tg128 | 35.18 ± 0.07 | | qwen2 32B Q4_K - Medium | 18.48 GiB | 32.76 B | CUDA | 81 | tg256 | 34.84 ± 0.06 | | qwen2 32B Q4_K - Medium | 18.48 GiB | 32.76 B | CUDA | 81 | tg512 | 34.03 ± 0.03 | | qwen2 32B Q4_K - Medium | 18.48 GiB | 32.76 B | CUDA | 81 | tg1024 | 33.71 ± 0.06 | | qwen2 32B Q4_K - Medium | 18.48 GiB | 32.76 B | CUDA | 81 | tg2048 | 33.14 ± 0.00 | build: 17a1f0d2 (5844) root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/unsloth/gemma-3-27b-it-Q4_K_M.gguf -t 112 -p 0 -n 128,256,512,1024,2048 -ngl 81 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes | model | size | params | backend | ngl | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: | | gemma3 27B Q4_K - Medium | 15.40 GiB | 27.01 B | CUDA | 81 | tg128 | 38.65 ± 0.25 | | gemma3 27B Q4_K - Medium | 15.40 GiB | 27.01 B | CUDA | 81 | tg256 | 38.60 ± 0.06 | | gemma3 27B Q4_K - Medium | 15.40 GiB | 27.01 B | CUDA | 81 | tg512 | 37.82 ± 0.04 | | gemma3 27B Q4_K - Medium | 15.40 GiB | 27.01 B | CUDA | 81 | tg1024 | 37.37 ± 0.05 | | gemma3 27B Q4_K - Medium | 15.40 GiB | 27.01 B | CUDA | 81 | tg2048 | 36.82 ± 0.01 | build: 17a1f0d2 (5844) root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/Qwen/qwq-32b-q4_k_m.gguf -t 112 -p 0 -n 128,256,512,1024,2048 -ngl 81 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes | model | size | params | backend | ngl | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: | | qwen2 32B Q4_K - Medium | 18.48 GiB | 32.76 B | CUDA | 81 | tg128 | 34.63 ± 0.09 | | qwen2 32B Q4_K - Medium | 18.48 GiB | 32.76 B | CUDA | 81 | tg256 | 34.39 ± 0.14 | | qwen2 32B Q4_K - Medium | 18.48 GiB | 32.76 B | CUDA | 81 | tg512 | 33.63 ± 0.07 | | qwen2 32B Q4_K - Medium | 18.48 GiB | 32.76 B | CUDA | 81 | tg1024 | 33.30 ± 0.12 | | qwen2 32B Q4_K - Medium | 18.48 GiB | 32.76 B | CUDA | 81 | tg2048 | 32.94 ± 0.00 | build: 17a1f0d2 (5844) root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/unsloth/Qwen3-30B-A3B-Q4_K_M.gguf -t 112 -p 0 -n 128,256,512,1024,2048 -ngl 99 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes | model | size | params | backend | ngl | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: | | qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | tg128 | 
123.29 ± 1.27 | | qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | tg256 | 122.63 ± 0.26 | | qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | tg512 | 118.30 ± 0.31 | | qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | tg1024 | 115.30 ± 0.50 | | qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | tg2048 | 111.39 ± 0.04 | build: 17a1f0d2 (5844) root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/unsloth/gemma-3-12b-it-Q4_K_M.gguf -t 112 -p 0 -n 128,256,512,1024,2048 -ngl 99 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes | model | size | params | backend | ngl | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: | | gemma3 12B Q4_K - Medium | 6.79 GiB | 11.77 B | CUDA | 99 | tg128 | 74.28 ± 0.34 | | gemma3 12B Q4_K - Medium | 6.79 GiB | 11.77 B | CUDA | 99 | tg256 | 73.88 ± 0.36 | | gemma3 12B Q4_K - Medium | 6.79 GiB | 11.77 B | CUDA | 99 | tg512 | 72.11 ± 0.19 | | gemma3 12B Q4_K - Medium | 6.79 GiB | 11.77 B | CUDA | 99 | tg1024 | 70.67 ± 0.07 | | gemma3 12B Q4_K - Medium | 6.79 GiB | 11.77 B | CUDA | 99 | tg2048 | 68.99 ± 0.04 | build: 17a1f0d2 (5844) root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/unsloth/gemma-3-12b-it-Q8_0.gguf -t 112 -p 0 -n 128,256,512,1024,2048 -ngl 99 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes | model | size | params | backend | ngl | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: | | gemma3 12B Q8_0 | 11.64 GiB | 11.77 B | CUDA | 99 | tg128 | 50.68 ± 0.41 | | gemma3 12B Q8_0 | 11.64 GiB | 11.77 B | CUDA | 99 | tg256 | 50.15 ± 0.43 | | gemma3 12B Q8_0 | 11.64 GiB | 11.77 B | CUDA | 99 | tg512 | 49.12 ± 0.06 | | gemma3 12B Q8_0 | 11.64 GiB | 11.77 B | CUDA | 99 | tg1024 | 49.39 ± 0.30 | | gemma3 12B Q8_0 | 11.64 GiB | 11.77 B | CUDA | 99 | tg2048 | 48.70 ± 0.00 | build: 17a1f0d2 (5844) root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -t 112 -p 0 -n 128,256,512,1024,2048 -ngl 99 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes | model | size | params | backend | ngl | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: | | llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 99 | tg128 | 123.10 ± 1.26 | | llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 99 | tg256 | 122.60 ± 0.10 | | llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 99 | tg512 | 119.26 ± 0.08 | | llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 99 | tg1024 | 117.30 ± 0.63 | | llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 99 | tg2048 | 114.39 ± 0.06 | build: 17a1f0d2 (5844) root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf -t 112 -p 0 -n 128,256,512,1024,2048 -ngl 99 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: 
GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes | model | size | params | backend | ngl | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: | | llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 99 | tg128 | 85.86 ± 0.25 | | llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 99 | tg256 | 85.13 ± 0.33 | | llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 99 | tg512 | 83.92 ± 0.02 | | llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 99 | tg1024 | 83.01 ± 0.32 | | llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 99 | tg2048 | 81.65 ± 0.06 | build: 17a1f0d2 (5844) root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/DeepSeek-R1-Distill-Llama-8B-f16.gguf -t 112 -p 0 -n 128,256,512,1024,2048 -ngl 99 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes | model | size | params | backend | ngl | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: | | llama 8B F16 | 14.96 GiB | 8.03 B | CUDA | 99 | tg128 | 51.58 ± 0.14 | | llama 8B F16 | 14.96 GiB | 8.03 B | CUDA | 99 | tg256 | 51.45 ± 0.04 | | llama 8B F16 | 14.96 GiB | 8.03 B | CUDA | 99 | tg512 | 50.87 ± 0.01 | | llama 8B F16 | 14.96 GiB | 8.03 B | CUDA | 99 | tg1024 | 50.57 ± 0.12 | | llama 8B F16 | 14.96 GiB | 8.03 B | CUDA | 99 | tg2048 | 50.06 ± 0.03 | build: 17a1f0d2 (5844) root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -t 112 -p 0 -n 128,256,512,1024,2048 -ngl 99 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes | model | size | params | backend | ngl | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: | | llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 99 | tg128 | 125.23 ± 0.68 | | llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 99 | tg256 | 123.86 ± 0.15 | | llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 99 | tg512 | 120.38 ± 0.11 | | llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 99 | tg1024 | 118.40 ± 0.58 | | llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA | 99 | tg2048 | 115.24 ± 0.02 | build: 17a1f0d2 (5844) root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -t 112 -p 0 -n 128,256,512,1024,2048 -ngl 99 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes | model | size | params | backend | ngl | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: | | llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 99 | tg128 | 86.51 ± 0.28 | | llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 99 | tg256 | 86.40 ± 0.12 | | llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 99 | tg512 | 84.75 ± 0.03 | | llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 99 | tg1024 | 83.76 ± 0.29 | | llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 99 | tg2048 | 82.28 ± 0.02 | build: 17a1f0d2 (5844)
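The per-setup numbers in the summary table above appear to be the mean of the five tg128…tg2048 results of each llama-bench run. A small helper could compute it from a saved log (bench-slot.log is just a hypothetical file name for the output of one run):

# Average the t/s column (8th pipe-separated field) over the tg128..tg2048 rows
awk -F'|' '/tg[0-9]+/ { sum += $8 + 0; n++ } END { if (n) printf "average: %.3f t/s over %d tests\n", sum/n, n }' bench-slot.log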