LLM Inference using a PCIe riser/extender cable and an OcuLink cable


Even building an LLM inference rig with multiple GPUs can be a challenge, because full x16 riser/extender cables are really chunky and can obstruct the airflow. PCIe riser cables are short, thick ribbon cables that should not be bent; they are really only meant to move one or two video cards out of the physical PCIe slots of the computer. What if we want 10 or 20 video cards in one computer, or connected to it? We can use an OcuLink to PCIe 4.0 x4 adapter, which uses an SFF-8611/8612 cable to connect the PCIe adapter to the PCIe x16 expansion card into which the GPU is slotted. The link is limited to at most x4, but on a modern motherboard that supports PCIe bifurcation, a single x16 slot can host 4 GPUs using x4x4x4x4 bifurcation and 4 SFF-8611 cables. The SFF-8611 cable itself is not a thick ribbon cable, but a normal round cable only slightly thicker than usual, and it can be bent without problems. The cables also come in 20, 30, 50 and 90 centimeter lengths, so the computer's motherboard can live in one case and all the GPUs in multiple other cases.
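
Whichever connection is used, it is worth verifying what link the GPU actually negotiated before benchmarking. A minimal sketch with stock Linux tools (the bus address a8:00.0 is the one from the test system below; find yours with the first command):

root@srv ~ # lspci | grep -i nvidia
root@srv ~ # lspci -vvv -s a8:00.0 | grep -E 'LnkCap:|LnkSta:'

LnkCap shows the maximum speed and width the device supports, while LnkSta shows what was actually negotiated, so a downgraded riser or x4 OcuLink link is immediately visible.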
The testing bench is:

  • NVIDIA RTX 3090 Founders Edition with 24GB VRAM
  • MS73-HB0 motherboard with dual Intel Xeon 8480+ CPUs
  • Using only models that fit entirely in VRAM – 3 big models (32B and 30B parameters) and several small ones (12B and 8B) with different quantizations (Q4, Q8 and 16-bit).
  • Testing with llama.cpp's llama-bench
  • OcuLink to PCIe 4.0 x4 adapter
  • SFF-8611/8612 cable
  • PCIe x16-to-x4 expansion board with ATX power. Another one with SATA power was tested, but the GPU dropped out of the Linux system.
  • PCIe data links reported by Linux (verification commands follow this list) – slotted card: Speed 16GT/s, Width x16; PCIe riser: Speed 2.5GT/s (downgraded), Width x16; OcuLink: Speed 16GT/s, Width x4 (downgraded)
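
The link state can also be cross-checked from the NVIDIA side. A minimal sketch using nvidia-smi's query interface (these are standard query fields; note that pcie.link.gen.current may read lower while an idle GPU downclocks its link):

root@srv ~ # nvidia-smi --query-gpu=name,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current --format=csv

On the OcuLink setup this should report a current width of 4 at generation 4 (16GT/s), matching the lspci LnkSta line shown later.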

Here are the results:

 N  Model                         Parameters  Quantization  Slot t/s  Riser t/s  OcuLink t/s
 1  Qwen 2.5 32B Instruct         32B         Q4_K_M         34.18     33.524     33.888
 2  Qwen QwQ 32B                  32B         Q4_K_M         33.778    34.124     34.066
 3  Qwen3 30B A3B                 30B         Q4_K_M        118.182   118.87     116.908
 4  Gemma 3 27b it                27B         Q4_K_M         37.852    37.142     37.656
 5  Gemma 3 12b it                12B         Q4_K_M         71.986    71.324     71.784
 6  Gemma 3 12b it                12B         Q8_0           49.608    49.8       49.462
 7  DeepSeek R1 Distill Llama 8B  8B          Q4_K_M        119.33    120.592    119.028
 8  DeepSeek R1 Distill Llama 8B  8B          Q8_0           83.914    85.128     83.56
 9  DeepSeek R1 Distill Llama 8B  8B          f16            50.906    51.73      50.762
10  Meta Llama 3.1 8B             8B          Q4_K_M        120.622   119.566    120.176
11  Meta Llama 3.1 8B             8B          Q8_0           84.74     85.186     84.42
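
Each value in the table appears to be the arithmetic mean of the five token-generation results (tg128 through tg2048) from the corresponding llama-bench run below; for example, the slotted Qwen 2.5 32B value is (35.18 + 34.84 + 34.03 + 33.71 + 33.14) / 5 = 34.18. A minimal awk sketch to compute such an average, assuming one run's output was saved to bench.log (a hypothetical file name):

root@srv ~ # awk -F'|' '/tg[0-9]+/ { sum += $8; n++ } END { if (n) printf "%.3f t/s average\n", sum / n }' bench.log

awk's numeric coercion extracts the leading number from the "35.18 ± 0.07" column, so no extra parsing is needed.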

The difference in tokens per second between the three setups is below 1-2% for every test (e.g., for Qwen 2.5 32B the slot-vs-riser gap is (34.18 − 33.524) / 34.18 ≈ 1.9%), and in several benchmarks the riser or OcuLink run even outputs more tokens per second than the slotted GPU. As long as the model fits entirely in VRAM, it does not matter whether the GPU used for LLM inference is slotted directly into a PCIe slot on the motherboard, connected through a PCIe riser cable/board, or attached over an x4 OcuLink link.

Output of the tests:

RTX 3090 video card slotted directly into the motherboard.

First, Linux reports the GPU link:
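
The dump was most likely produced with a command along these lines (an assumption based on the output format; full capability details require root, and the bus address comes from the dump's first line):

root@srv ~ # lspci -vvv -s a8:00.0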

a8:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: NVIDIA Corporation Device 147d
        Physical Slot: 5
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 266
        NUMA node: 6
        Region 0: Memory at e0000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at 20cfe0000000 (64-bit, prefetchable) [size=256M]
        Region 3: Memory at 20cff0000000 (64-bit, prefetchable) [size=32M]
        Region 5: I/O ports at c000 [size=128]
        Expansion ROM at e1000000 [virtual] [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee00058  Data: 0000
        Capabilities: [78] Express (v2) Legacy Endpoint, IntMsgNum 0
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ TEE-IO-
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 4096 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 16GT/s, Width x16
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
                         AtomicOpsCtl: ReqEn-
                         IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
                         10BitTagReq+ OBFF Disabled, EETLPPrefixBlk-
                LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
                LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [b4] Vendor Specific Information: Len=14 <?>
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=01
                        Status: NegoPending- InProgress-
        Capabilities: [258 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us
        Capabilities: [128 v1] Power Budgeting <?>
        Capabilities: [420 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
                        ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
                        ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
                        ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CorrIntErr- HeaderOF-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr- HeaderOF+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Capabilities: [bb0 v1] Physical Resizable BAR
                BAR 0: current size: 16MB, supported: 16MB
                BAR 1: current size: 256MB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB
                BAR 3: current size: 32MB, supported: 32MB
        Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?>
        Capabilities: [d00 v1] Lane Margining at the Receiver
                PortCap: Uses Driver+
                PortSta: MargReady+ MargSoftReady+
        Capabilities: [e00 v1] Data Link Feature <?>
        Kernel driver in use: nvidia
        Kernel modules: nouveau, nvidia_drm, nvidia

Second, the benchmark output using llama-bench:
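
A quick note on the llama-bench flags used below: --numa distribute spreads the CPU-side work across the NUMA nodes of the dual-socket system, -m selects the GGUF model file, -t 112 sets the number of CPU threads, -p 0 skips the prompt-processing test, -n 128,256,512,1024,2048 runs token-generation tests of those lengths, and -ngl sets how many model layers are offloaded to the GPU (81 and 99 are high enough to offload all layers of these models).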

root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/Qwen2.5-32B-Instruct-Q4_K_M.gguf -t 112  -p 0 -n 128,256,512,1024,2048 -ngl 81
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |           tg128 |         35.18 ± 0.07 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |           tg256 |         34.84 ± 0.06 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |           tg512 |         34.03 ± 0.03 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |          tg1024 |         33.71 ± 0.06 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |          tg2048 |         33.14 ± 0.00 |

build: 17a1f0d2 (5844)

root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/unsloth/gemma-3-27b-it-Q4_K_M.gguf -t 112  -p 0 -n 128,256,512,1024,2048 -ngl 81
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma3 27B Q4_K - Medium       |  15.40 GiB |    27.01 B | CUDA       |  81 |           tg128 |         38.65 ± 0.25 |
| gemma3 27B Q4_K - Medium       |  15.40 GiB |    27.01 B | CUDA       |  81 |           tg256 |         38.60 ± 0.06 |
| gemma3 27B Q4_K - Medium       |  15.40 GiB |    27.01 B | CUDA       |  81 |           tg512 |         37.82 ± 0.04 |
| gemma3 27B Q4_K - Medium       |  15.40 GiB |    27.01 B | CUDA       |  81 |          tg1024 |         37.37 ± 0.05 |
| gemma3 27B Q4_K - Medium       |  15.40 GiB |    27.01 B | CUDA       |  81 |          tg2048 |         36.82 ± 0.01 |

build: 17a1f0d2 (5844)

root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/Qwen/qwq-32b-q4_k_m.gguf -t 112  -p 0 -n 128,256,512,1024,2048 -ngl 81
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |           tg128 |         34.63 ± 0.09 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |           tg256 |         34.39 ± 0.14 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |           tg512 |         33.63 ± 0.07 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |          tg1024 |         33.30 ± 0.12 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |          tg2048 |         32.94 ± 0.00 |

build: 17a1f0d2 (5844)

root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/unsloth/Qwen3-30B-A3B-Q4_K_M.gguf -t 112  -p 0 -n 128,256,512,1024,2048 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | CUDA       |  99 |           tg128 |        123.29 ± 1.27 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | CUDA       |  99 |           tg256 |        122.63 ± 0.26 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | CUDA       |  99 |           tg512 |        118.30 ± 0.31 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | CUDA       |  99 |          tg1024 |        115.30 ± 0.50 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | CUDA       |  99 |          tg2048 |        111.39 ± 0.04 |

build: 17a1f0d2 (5844)

root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/unsloth/gemma-3-12b-it-Q4_K_M.gguf -t 112  -p 0 -n 128,256,512,1024,2048 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |           tg128 |         74.28 ± 0.34 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |           tg256 |         73.88 ± 0.36 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |           tg512 |         72.11 ± 0.19 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |          tg1024 |         70.67 ± 0.07 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |          tg2048 |         68.99 ± 0.04 |

build: 17a1f0d2 (5844)

root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/unsloth/gemma-3-12b-it-Q8_0.gguf -t 112  -p 0 -n 128,256,512,1024,2048 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma3 12B Q8_0                |  11.64 GiB |    11.77 B | CUDA       |  99 |           tg128 |         50.68 ± 0.41 |
| gemma3 12B Q8_0                |  11.64 GiB |    11.77 B | CUDA       |  99 |           tg256 |         50.15 ± 0.43 |
| gemma3 12B Q8_0                |  11.64 GiB |    11.77 B | CUDA       |  99 |           tg512 |         49.12 ± 0.06 |
| gemma3 12B Q8_0                |  11.64 GiB |    11.77 B | CUDA       |  99 |          tg1024 |         49.39 ± 0.30 |
| gemma3 12B Q8_0                |  11.64 GiB |    11.77 B | CUDA       |  99 |          tg2048 |         48.70 ± 0.00 |

build: 17a1f0d2 (5844)

root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -t 112  -p 0 -n 128,256,512,1024,2048 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |           tg128 |        123.10 ± 1.26 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |           tg256 |        122.60 ± 0.10 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |           tg512 |        119.26 ± 0.08 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |          tg1024 |        117.30 ± 0.63 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |          tg2048 |        114.39 ± 0.06 |

build: 17a1f0d2 (5844)

root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf -t 112  -p 0 -n 128,256,512,1024,2048 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |           tg128 |         85.86 ± 0.25 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |           tg256 |         85.13 ± 0.33 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |           tg512 |         83.92 ± 0.02 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |          tg1024 |         83.01 ± 0.32 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |          tg2048 |         81.65 ± 0.06 |

build: 17a1f0d2 (5844)

root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/DeepSeek-R1-Distill-Llama-8B-f16.gguf -t 112  -p 0 -n 128,256,512,1024,2048 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 8B F16                   |  14.96 GiB |     8.03 B | CUDA       |  99 |           tg128 |         51.58 ± 0.14 |
| llama 8B F16                   |  14.96 GiB |     8.03 B | CUDA       |  99 |           tg256 |         51.45 ± 0.04 |
| llama 8B F16                   |  14.96 GiB |     8.03 B | CUDA       |  99 |           tg512 |         50.87 ± 0.01 |
| llama 8B F16                   |  14.96 GiB |     8.03 B | CUDA       |  99 |          tg1024 |         50.57 ± 0.12 |
| llama 8B F16                   |  14.96 GiB |     8.03 B | CUDA       |  99 |          tg2048 |         50.06 ± 0.03 |

build: 17a1f0d2 (5844)

root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -t 112  -p 0 -n 128,256,512,1024,2048 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |           tg128 |        125.23 ± 0.68 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |           tg256 |        123.86 ± 0.15 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |           tg512 |        120.38 ± 0.11 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |          tg1024 |        118.40 ± 0.58 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |          tg2048 |        115.24 ± 0.02 |

build: 17a1f0d2 (5844)

root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -t 112  -p 0 -n 128,256,512,1024,2048 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |           tg128 |         86.51 ± 0.28 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |           tg256 |         86.40 ± 0.12 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |           tg512 |         84.75 ± 0.03 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |          tg1024 |         83.76 ± 0.29 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |          tg2048 |         82.28 ± 0.02 |

build: 17a1f0d2 (5844)
