LLM Inference using a riser/extender cable and an OcuLink cable

An RTX 3090 video card using an x4 OcuLink cable to connect to the motherboard's PCIe slot.

First, Linux (lspci -vv) reports the GPU link. Note the LnkSta line: the link runs at 16 GT/s but the width is downgraded from x16 to x4, as expected over the OcuLink cable:

38:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: NVIDIA Corporation Device 147d
        Physical Slot: 2
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 266
        NUMA node: 2
        Region 0: Memory at b1000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at 203fe0000000 (64-bit, prefetchable) [size=256M]
        Region 3: Memory at 203ff0000000 (64-bit, prefetchable) [size=32M]
        Region 5: I/O ports at 5000 [size=128]
        Expansion ROM at b2000000 [virtual] [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee00058  Data: 0000
        Capabilities: [78] Express (v2) Legacy Endpoint, IntMsgNum 0
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ TEE-IO-
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 4096 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 16GT/s, Width x4 (downgraded)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
                         AtomicOpsCtl: ReqEn-
                         IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
                         10BitTagReq+ OBFF Disabled, EETLPPrefixBlk-
                LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
                LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [b4] Vendor Specific Information: Len=14 <?>
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=01
                        Status: NegoPending- InProgress-
        Capabilities: [258 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us
        Capabilities: [128 v1] Power Budgeting <?>
        Capabilities: [420 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
                        ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
                        ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
                        ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CorrIntErr- HeaderOF-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr- HeaderOF+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Capabilities: [bb0 v1] Physical Resizable BAR
                BAR 0: current size: 16MB, supported: 16MB
                BAR 1: current size: 256MB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB
                BAR 3: current size: 32MB, supported: 32MB
        Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?>
        Capabilities: [d00 v1] Lane Margining at the Receiver
                PortCap: Uses Driver+
                PortSta: MargReady+ MargSoftReady+
        Capabilities: [e00 v1] Data Link Feature <?>
        Kernel driver in use: nvidia
        Kernel modules: nouveau, nvidia_drm, nvidia
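The LnkSta line above shows the link negotiated at 16 GT/s but only x4 wide. A quick back-of-the-envelope calculation (a sketch, assuming ideal PCIe 4.0 throughput with 128b/130b encoding and ignoring protocol overhead) shows what that costs in raw bandwidth and in model-load time:

```python
# PCIe 4.0: 16 GT/s per lane, 128b/130b encoding -> ~1.97 GB/s per lane per direction.
per_lane_gbs = 16e9 * (128 / 130) / 8 / 1e9

x4  = 4  * per_lane_gbs   # the negotiated OcuLink link
x16 = 16 * per_lane_gbs   # what the card is capable of
print(f"x4: {x4:.1f} GB/s, x16: {x16:.1f} GB/s")

# Rough time to copy an 18.48 GiB model (the Qwen2.5-32B Q4_K_M below) into VRAM:
model_bytes = 18.48 * 2**30
print(f"load over x4:  ~{model_bytes / (x4 * 1e9):.1f} s")
print(f"load over x16: ~{model_bytes / (x16 * 1e9):.1f} s")
```

So the narrow link mainly costs a couple of extra seconds at model load; once the weights are resident in VRAM, inference traffic over PCIe is small.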

Second, the benchmark output using llama-bench (token generation only, since -p 0 skips the prompt-processing tests):

root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/Qwen2.5-32B-Instruct-Q4_K_M.gguf -t 112  -p 0 -n 128,256,512,1024,2048 -ngl 81
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |           tg128 |         34.86 ± 0.07 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |           tg256 |         34.56 ± 0.07 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |           tg512 |         33.80 ± 0.07 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |          tg1024 |         33.37 ± 0.07 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |          tg2048 |         32.85 ± 0.02 |

build: 17a1f0d2 (5844)

root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/unsloth/gemma-3-27b-it-Q4_K_M.gguf -t 112  -p 0 -n 128,256,512,1024,2048 -ngl 81
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma3 27B Q4_K - Medium       |  15.40 GiB |    27.01 B | CUDA       |  81 |           tg128 |         38.82 ± 0.17 |
| gemma3 27B Q4_K - Medium       |  15.40 GiB |    27.01 B | CUDA       |  81 |           tg256 |         38.44 ± 0.08 |
| gemma3 27B Q4_K - Medium       |  15.40 GiB |    27.01 B | CUDA       |  81 |           tg512 |         37.58 ± 0.04 |
| gemma3 27B Q4_K - Medium       |  15.40 GiB |    27.01 B | CUDA       |  81 |          tg1024 |         37.06 ± 0.03 |
| gemma3 27B Q4_K - Medium       |  15.40 GiB |    27.01 B | CUDA       |  81 |          tg2048 |         36.38 ± 0.01 |

build: 17a1f0d2 (5844)

root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/Qwen/qwq-32b-q4_k_m.gguf -t 112  -p 0 -n 128,256,512,1024,2048 -ngl 81
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |           tg128 |         35.06 ± 0.05 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |           tg256 |         34.71 ± 0.04 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |           tg512 |         33.92 ± 0.05 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |          tg1024 |         33.62 ± 0.05 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |          tg2048 |         33.02 ± 0.02 |

build: 17a1f0d2 (5844)

root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/unsloth/Qwen3-30B-A3B-Q4_K_M.gguf -t 112  -p 0 -n 128,256,512,1024,2048 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | CUDA       |  99 |           tg128 |        121.71 ± 1.57 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | CUDA       |  99 |           tg256 |        121.25 ± 0.28 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | CUDA       |  99 |           tg512 |        117.07 ± 0.25 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | CUDA       |  99 |          tg1024 |        114.20 ± 0.42 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | CUDA       |  99 |          tg2048 |        110.31 ± 0.07 |

build: 17a1f0d2 (5844)

root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/unsloth/gemma-3-12b-it-Q4_K_M.gguf -t 112  -p 0 -n 128,256,512,1024,2048 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |           tg128 |         73.06 ± 0.44 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |           tg256 |         73.61 ± 0.62 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |           tg512 |         72.21 ± 0.05 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |          tg1024 |         70.88 ± 0.15 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |          tg2048 |         69.16 ± 0.02 |

build: 17a1f0d2 (5844)

root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/unsloth/gemma-3-12b-it-Q8_0.gguf -t 112  -p 0 -n 128,256,512,1024,2048 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma3 12B Q8_0                |  11.64 GiB |    11.77 B | CUDA       |  99 |           tg128 |         50.51 ± 0.26 |
| gemma3 12B Q8_0                |  11.64 GiB |    11.77 B | CUDA       |  99 |           tg256 |         50.34 ± 0.12 |
| gemma3 12B Q8_0                |  11.64 GiB |    11.77 B | CUDA       |  99 |           tg512 |         49.46 ± 0.09 |
| gemma3 12B Q8_0                |  11.64 GiB |    11.77 B | CUDA       |  99 |          tg1024 |         48.80 ± 0.03 |
| gemma3 12B Q8_0                |  11.64 GiB |    11.77 B | CUDA       |  99 |          tg2048 |         48.20 ± 0.19 |

build: 17a1f0d2 (5844)

root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -t 112  -p 0 -n 128,256,512,1024,2048 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |           tg128 |        123.19 ± 0.95 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |           tg256 |        122.31 ± 0.08 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |           tg512 |        118.90 ± 0.10 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |          tg1024 |        116.88 ± 0.61 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |          tg2048 |        113.86 ± 0.10 |

build: 17a1f0d2 (5844)

root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf -t 112  -p 0 -n 128,256,512,1024,2048 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |           tg128 |         85.39 ± 0.38 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |           tg256 |         85.15 ± 0.04 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |           tg512 |         83.42 ± 0.17 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |          tg1024 |         82.62 ± 0.33 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |          tg2048 |         81.22 ± 0.06 |

build: 17a1f0d2 (5844)

root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/DeepSeek-R1-Distill-Llama-8B-f16.gguf -t 112  -p 0 -n 128,256,512,1024,2048 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 8B F16                   |  14.96 GiB |     8.03 B | CUDA       |  99 |           tg128 |         51.43 ± 0.08 |
| llama 8B F16                   |  14.96 GiB |     8.03 B | CUDA       |  99 |           tg256 |         51.34 ± 0.02 |
| llama 8B F16                   |  14.96 GiB |     8.03 B | CUDA       |  99 |           tg512 |         50.73 ± 0.01 |
| llama 8B F16                   |  14.96 GiB |     8.03 B | CUDA       |  99 |          tg1024 |         50.41 ± 0.12 |
| llama 8B F16                   |  14.96 GiB |     8.03 B | CUDA       |  99 |          tg2048 |         49.90 ± 0.04 |

build: 17a1f0d2 (5844)

root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -t 112  -p 0 -n 128,256,512,1024,2048 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |           tg128 |        124.86 ± 0.57 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |           tg256 |        123.48 ± 0.16 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |           tg512 |        119.98 ± 0.05 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |          tg1024 |        117.91 ± 0.52 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |          tg2048 |        114.65 ± 0.05 |

build: 17a1f0d2 (5844)

root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -t 112  -p 0 -n 128,256,512,1024,2048 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |           tg128 |         86.43 ± 0.31 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |           tg256 |         86.00 ± 0.11 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |           tg512 |         84.40 ± 0.02 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |          tg1024 |         83.43 ± 0.29 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |          tg2048 |         81.84 ± 0.04 |

build: 17a1f0d2 (5844)
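Each generated token streams essentially all the model weights from VRAM once, so t/s multiplied by model size approximates the effective memory bandwidth the run achieved. A quick sanity check over the tg128 rows above (a rough model that ignores KV-cache and activation traffic):

```python
# (size in GiB, tg128 tokens/s) copied from the llama-bench tables above
results = {
    "Qwen2.5-32B Q4_K_M": (18.48,  34.86),
    "gemma3 27B Q4_K_M":  (15.40,  38.82),
    "llama 8B Q4_K_M":    ( 4.58, 123.19),
    "llama 8B Q8_0":      ( 7.95,  85.39),
    "llama 8B F16":       (14.96,  51.43),
}
for name, (gib, tps) in results.items():
    # bytes streamed per second ~= weights read once per token
    print(f"{name}: ~{gib * tps:.0f} GiB/s effective")
```

Every model lands in the same ~560-770 GiB/s band: well below the 3090's ~936 GB/s theoretical VRAM bandwidth, and two orders of magnitude above what an x4 Gen4 link can deliver. In other words, token generation is VRAM-bandwidth bound on the card itself, and the OcuLink x4 connection does not limit these results.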
