An RTX 3090 video card using an x4 OCuLink cable to connect to the motherboard's PCIe slot.
First, how Linux reports the GPU link (lspci output):
38:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1) (prog-if 00 [VGA controller])
    Subsystem: NVIDIA Corporation Device 147d
    Physical Slot: 2
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 266
    NUMA node: 2
    Region 0: Memory at b1000000 (32-bit, non-prefetchable) [size=16M]
    Region 1: Memory at 203fe0000000 (64-bit, prefetchable) [size=256M]
    Region 3: Memory at 203ff0000000 (64-bit, prefetchable) [size=32M]
    Region 5: I/O ports at 5000 [size=128]
    Expansion ROM at b2000000 [virtual] [disabled] [size=512K]
    Capabilities: [60] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Address: 00000000fee00058  Data: 0000
    Capabilities: [78] Express (v2) Legacy Endpoint, IntMsgNum 0
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ TEE-IO-
        DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
            RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
            MaxPayload 256 bytes, MaxReadReq 4096 bytes
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
            ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk+
            ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 16GT/s, Width x4 (downgraded)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR-
            10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
            EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
            FRS-
            AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
            AtomicOpsCtl: ReqEn-
            IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
            10BitTagReq+ OBFF Disabled, EETLPPrefixBlk-
        LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
        LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
            EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
            Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [b4] Vendor Specific Information: Len=14 <?>
    Capabilities: [100 v1] Virtual Channel
        Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
        Arb:    Fixed- WRR32- WRR64- WRR128-
        Ctrl:   ArbSelect=Fixed
        Status: InProgress-
        VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
            Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
            Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=01
            Status: NegoPending- InProgress-
    Capabilities: [258 v1] L1 PM Substates
        L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
            PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
        L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
            T_CommonMode=0us LTR1.2_Threshold=0ns
        L1SubCtl2: T_PwrOn=10us
    Capabilities: [128 v1] Power Budgeting <?>
    Capabilities: [420 v2] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
            UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr- PoisonTLPBlocked-
            DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
            UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr- PoisonTLPBlocked-
            DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
            UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr- PoisonTLPBlocked-
            DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CorrIntErr- HeaderOF-
        CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr- HeaderOF+
        AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Capabilities: [900 v1] Secondary PCI Express
        LnkCtl3: LnkEquIntrruptEn- PerformEqu-
        LaneErrStat: 0
    Capabilities: [bb0 v1] Physical Resizable BAR
        BAR 0: current size: 16MB, supported: 16MB
        BAR 1: current size: 256MB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB
        BAR 3: current size: 32MB, supported: 32MB
    Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?>
    Capabilities: [d00 v1] Lane Margining at the Receiver
        PortCap: Uses Driver+
        PortSta: MargReady+ MargSoftReady+
    Capabilities: [e00 v1] Data Link Feature <?>
    Kernel driver in use: nvidia
    Kernel modules: nouveau, nvidia_drm, nvidia
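The important detail is easy to miss in that wall of output: LnkCap says the card supports x16, but LnkSta shows the negotiated link is only x4 (downgraded), as expected over the OCuLink cable. A small Python sketch that pulls the numbers out of the two lines quoted above and converts them to usable bandwidth (the 128b/130b encoding overhead for PCIe 3.0/4.0 links is standard; the helper names are my own):

```python
import re

# Lines copied from the lspci output above
lnkcap = "LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1"
lnksta = "LnkSta: Speed 16GT/s, Width x4 (downgraded)"

def parse_link(line):
    """Extract (speed in GT/s, lane count) from an lspci LnkCap/LnkSta line."""
    speed = float(re.search(r"Speed ([\d.]+)GT/s", line).group(1))
    width = int(re.search(r"Width x(\d+)", line).group(1))
    return speed, width

def usable_gb_s(speed_gts, width):
    """Payload bandwidth in GB/s: raw bit rate minus 128b/130b encoding overhead."""
    return speed_gts * width * (128 / 130) / 8

cap_speed, cap_width = parse_link(lnkcap)  # what the card supports
sta_speed, sta_width = parse_link(lnksta)  # what was actually negotiated
print(f"capable:    x{cap_width} -> {usable_gb_s(cap_speed, cap_width):.1f} GB/s")
print(f"negotiated: x{sta_width} -> {usable_gb_s(sta_speed, sta_width):.1f} GB/s")
```

This works out to roughly 31.5 GB/s for the full x16 link versus about 7.9 GB/s over the x4 OCuLink connection, i.e. a quarter of the host bandwidth.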
Second, benchmark results from llama-bench:
root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/Qwen2.5-32B-Instruct-Q4_K_M.gguf -t 112 -p 0 -n 128,256,512,1024,2048 -ngl 81
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |           tg128 |         34.86 ± 0.07 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |           tg256 |         34.56 ± 0.07 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |           tg512 |         33.80 ± 0.07 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |          tg1024 |         33.37 ± 0.07 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |          tg2048 |         32.85 ± 0.02 |

build: 17a1f0d2 (5844)
root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/unsloth/gemma-3-27b-it-Q4_K_M.gguf -t 112 -p 0 -n 128,256,512,1024,2048 -ngl 81
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma3 27B Q4_K - Medium       |  15.40 GiB |    27.01 B | CUDA       |  81 |           tg128 |         38.82 ± 0.17 |
| gemma3 27B Q4_K - Medium       |  15.40 GiB |    27.01 B | CUDA       |  81 |           tg256 |         38.44 ± 0.08 |
| gemma3 27B Q4_K - Medium       |  15.40 GiB |    27.01 B | CUDA       |  81 |           tg512 |         37.58 ± 0.04 |
| gemma3 27B Q4_K - Medium       |  15.40 GiB |    27.01 B | CUDA       |  81 |          tg1024 |         37.06 ± 0.03 |
| gemma3 27B Q4_K - Medium       |  15.40 GiB |    27.01 B | CUDA       |  81 |          tg2048 |         36.38 ± 0.01 |

build: 17a1f0d2 (5844)
root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/Qwen/qwq-32b-q4_k_m.gguf -t 112 -p 0 -n 128,256,512,1024,2048 -ngl 81
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |           tg128 |         35.06 ± 0.05 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |           tg256 |         34.71 ± 0.04 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |           tg512 |         33.92 ± 0.05 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |          tg1024 |         33.62 ± 0.05 |
| qwen2 32B Q4_K - Medium        |  18.48 GiB |    32.76 B | CUDA       |  81 |          tg2048 |         33.02 ± 0.02 |

build: 17a1f0d2 (5844)
root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/unsloth/Qwen3-30B-A3B-Q4_K_M.gguf -t 112 -p 0 -n 128,256,512,1024,2048 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | CUDA       |  99 |           tg128 |        121.71 ± 1.57 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | CUDA       |  99 |           tg256 |        121.25 ± 0.28 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | CUDA       |  99 |           tg512 |        117.07 ± 0.25 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | CUDA       |  99 |          tg1024 |        114.20 ± 0.42 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB |    30.53 B | CUDA       |  99 |          tg2048 |        110.31 ± 0.07 |

build: 17a1f0d2 (5844)
root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/unsloth/gemma-3-12b-it-Q4_K_M.gguf -t 112 -p 0 -n 128,256,512,1024,2048 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |           tg128 |         73.06 ± 0.44 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |           tg256 |         73.61 ± 0.62 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |           tg512 |         72.21 ± 0.05 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |          tg1024 |         70.88 ± 0.15 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |          tg2048 |         69.16 ± 0.02 |

build: 17a1f0d2 (5844)
root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/unsloth/gemma-3-12b-it-Q8_0.gguf -t 112 -p 0 -n 128,256,512,1024,2048 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma3 12B Q8_0                |  11.64 GiB |    11.77 B | CUDA       |  99 |           tg128 |         50.51 ± 0.26 |
| gemma3 12B Q8_0                |  11.64 GiB |    11.77 B | CUDA       |  99 |           tg256 |         50.34 ± 0.12 |
| gemma3 12B Q8_0                |  11.64 GiB |    11.77 B | CUDA       |  99 |           tg512 |         49.46 ± 0.09 |
| gemma3 12B Q8_0                |  11.64 GiB |    11.77 B | CUDA       |  99 |          tg1024 |         48.80 ± 0.03 |
| gemma3 12B Q8_0                |  11.64 GiB |    11.77 B | CUDA       |  99 |          tg2048 |         48.20 ± 0.19 |

build: 17a1f0d2 (5844)
root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -t 112 -p 0 -n 128,256,512,1024,2048 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |           tg128 |        123.19 ± 0.95 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |           tg256 |        122.31 ± 0.08 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |           tg512 |        118.90 ± 0.10 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |          tg1024 |        116.88 ± 0.61 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |          tg2048 |        113.86 ± 0.10 |

build: 17a1f0d2 (5844)
root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf -t 112 -p 0 -n 128,256,512,1024,2048 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |           tg128 |         85.39 ± 0.38 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |           tg256 |         85.15 ± 0.04 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |           tg512 |         83.42 ± 0.17 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |          tg1024 |         82.62 ± 0.33 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |          tg2048 |         81.22 ± 0.06 |

build: 17a1f0d2 (5844)
root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/DeepSeek-R1-Distill-Llama-8B-f16.gguf -t 112 -p 0 -n 128,256,512,1024,2048 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 8B F16                   |  14.96 GiB |     8.03 B | CUDA       |  99 |           tg128 |         51.43 ± 0.08 |
| llama 8B F16                   |  14.96 GiB |     8.03 B | CUDA       |  99 |           tg256 |         51.34 ± 0.02 |
| llama 8B F16                   |  14.96 GiB |     8.03 B | CUDA       |  99 |           tg512 |         50.73 ± 0.01 |
| llama 8B F16                   |  14.96 GiB |     8.03 B | CUDA       |  99 |          tg1024 |         50.41 ± 0.12 |
| llama 8B F16                   |  14.96 GiB |     8.03 B | CUDA       |  99 |          tg2048 |         49.90 ± 0.04 |

build: 17a1f0d2 (5844)
root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -t 112 -p 0 -n 128,256,512,1024,2048 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |           tg128 |        124.86 ± 0.57 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |           tg256 |        123.48 ± 0.16 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |           tg512 |        119.98 ± 0.05 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |          tg1024 |        117.91 ± 0.52 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |          tg2048 |        114.65 ± 0.05 |

build: 17a1f0d2 (5844)
root@srv ~ # /root/llama.cpp/build/bin/llama-bench --numa distribute -m /root/models/bartowski/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -t 112 -p 0 -n 128,256,512,1024,2048 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |           tg128 |         86.43 ± 0.31 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |           tg256 |         86.00 ± 0.11 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |           tg512 |         84.40 ± 0.02 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |          tg1024 |         83.43 ± 0.29 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | CUDA       |  99 |          tg2048 |         81.84 ± 0.04 |

build: 17a1f0d2 (5844)
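A rough sanity check on these numbers: with all layers offloaded, single-stream token generation is mostly VRAM-bandwidth bound, since every generated token reads the full set of weights once. So t/s multiplied by model size approximates the memory bandwidth actually achieved. A back-of-the-envelope sketch using a few of the tg128 figures from the tables above (this simple model ignores KV-cache and activation traffic, so treat the results as lower bounds):

```python
# (model, weight size in GiB, tg128 tokens/s) taken from the llama-bench runs above
runs = [
    ("qwen2 32B Q4_K_M",  18.48,  34.86),
    ("gemma3 27B Q4_K_M", 15.40,  38.82),
    ("llama 8B Q4_K_M",    4.58, 123.19),
    ("llama 8B Q8_0",      7.95,  85.39),
    ("llama 8B F16",      14.96,  51.43),
]
for name, size_gib, tok_s in runs:
    # each token generated reads all weights once => effective bandwidth estimate
    print(f"{name:<18} ~{size_gib * tok_s:5.0f} GiB/s effective")
```

All the estimates land in the 560-770 GiB/s range: below the 3090's ~936 GB/s spec, but two orders of magnitude above what the x4 link could supply. In other words, fully-offloaded generation is limited by VRAM bandwidth, and the OCuLink x4 connection costs essentially nothing here; it would mainly hurt model load times and any setup that streams weights over PCIe.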