Ryzen Threadripper kernel problems with PCIe – wifi, network, video problems

If you have Ryzen Threadripper (in my case 1950X) and Ubuntu 17.10 (because the others versions do not boot at all!) with the latest kernel 4.15.x and the default vmlinuz-4.13.0-21-generic you might experience frequently wifi, network cards and video cards problems like wifi disconnects and network cards lost connections and some kind of freezes for a second or two of the video! If you have Ryzen Threadripper 1950X and or ASUS ROG ZENITH EXTREME (or any other motherboard with the chipset AMD X399) and ubuntu 17.10 or probably any linux distro with any kernel you might experience really unstable system. Look at your dmesg and if you see something like

[   18.669950] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[   18.669966] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[   18.669970] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[   18.669973] pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00000040/00006000
[   18.669975] pcieport 0000:00:01.1:    [ 6] Bad TLP               
[   23.805640] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[   23.805664] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[   23.805669] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[   23.805674] pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00000040/00006000
[   23.805678] pcieport 0000:00:01.1:    [ 6] Bad TLP               
[   23.838698] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[   23.838716] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[   23.838720] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[   23.838725] pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00000040/00006000
[   23.838728] pcieport 0000:00:01.1:    [ 6] Bad TLP               
[   36.124101] atlantic: link change old 0 new 100
[   36.124222] IPv6: ADDRCONF(NETDEV_CHANGE): enp65s0: link becomes ready
[   52.294413] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[   52.294436] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[   52.294441] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[   52.294446] pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00000040/00006000
[   52.294449] pcieport 0000:00:01.1:    [ 6] Bad TLP               
[   53.275256] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[   53.275289] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[   53.275293] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[   53.275298] pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00000040/00006000
[   53.275301] pcieport 0000:00:01.1:    [ 6] Bad TLP               
[   53.418493] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[   53.418517] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[   53.418522] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[   53.418527] pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00000040/00006000
[   53.418530] pcieport 0000:00:01.1:    [ 6] Bad TLP               
[   53.649938] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[   53.649962] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[   53.649967] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[   53.649972] pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00000040/00006000
[   53.649975] pcieport 0000:00:01.1:    [ 6] Bad TLP               
[  113.294411] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[  113.294434] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[  113.294439] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[  113.294444] pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00000040/00006000
[  113.294450] pcieport 0000:00:01.1:    [ 6] Bad TLP               
[  265.469397] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[  265.469419] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[  265.469424] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Transmitter ID)
[  265.469429] pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00001000/00006000
[  265.469438] pcieport 0000:00:01.1:    [12] Replay Timer Timeout  
[  533.031982] dpc 0000:00:01.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
[  533.032004] pcieport 0000:00:01.1: AER: Corrected error received: id=0000
[  533.032009] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
[  533.032014] pcieport 0000:00:01.1:   device [1022:1453] error status/mask=00000080/00006000
[  533.032022] pcieport 0000:00:01.1:    [ 7] Bad DLLP      

Apparently everything on the pci express could be affected! So till the kernel team fixes the issues, you can use the following workaround: add to the kernel boot parameters

pcie_aspm=off

Under Ubuntu you can add in the file

/etc/default/grub

the following line

GRUB_CMDLINE_LINUX="pcie_aspm=off"

And then execute

update-grub

the changes to take effect in the grub configuration. Then reboot the machine!

Under CentOS 7 the configuration file and the line to put are the same just the update of the grub configuration is different:

  • For UEFI-based systems execute:
    grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
    
  • For legacy-BIOS systems execute:
    grub2-mkconfig -o /boot/grub2/grub.cfg
    

*You should know what is your setup, probably UEFI-based, you could check in your boot directory, if there is /boot/efi/EFI/centos/grub.cfg you probably should use the option for UEFI-based, if you do not have any “efi” directory under “/boot” use the second option for legacy-BIOS.
Then reboot the machine!