Degraded ext4 performance with millions of files because of high system time in ext4_mb_good_group (100% kworker/u32+flush)

A strange behavior has been observed on a storage server holding more than 20 million files, which is not a really big figure, because we have storage servers with more than 300 million files. The server has a RAID5 of 4 x 10T HGST Ultrastar He10 disks, and the storage partition is formatted with ext4 with a total size of 27T. Despite inode usage being only 7% and free space still at 20%, the server began to experience high load averages with almost no IO utilization across the disks and the RAID device.
So the numbers are (a quick way to collect them is shown right after the list):

  • 20% free space.
  • 93% free inodes.
  • no disk IO utilization above 10-20% on any of the disks. In fact, IO above 10% is very rare.
  • really high load, above the number of the server’s cores. The server has 8 virtual processors under Linux and the load is constantly above 10-14.
  • no processes blocked in Disk Sleep (D) or Running (R) state to justify the high load.
  • the top command reports one or two kernel threads running at 100% – kworker/u32:1+flush-9:3 (so does flushing the data take that much time?)
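For reference, these numbers can be collected with standard tools. A minimal sketch, assuming the ext4 partition is mounted under /mnt/storage (the mount point is hypothetical – use the real one):

df -h /mnt/storage                 # used and free space
df -i /mnt/storage                 # used and free inodes
uptime                             # load average
nproc                              # number of CPUs to compare the load against
top -b -n 1 -o %CPU | head -n 20   # look for kworker/uNN:N+flush-N:N at ~100% CPU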

If the problem were related to the disks or the hardware, it would probably have shown up as high utilization across the disks, because the physical operations would need more time to execute. High IO utilization on any disk may, indeed, result in high system time (i.e. kernel time). Apparently, there is no high IO utilization across the disks and not even any processes blocked in Disk Sleep.
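To double-check that picture, iostat and mpstat from the sysstat package show where the time actually goes – the disks stay mostly idle while the CPU time lands in %sys, not %iowait. A short sketch (intervals and the D-state filter are just examples):

dnf install -y sysstat
iostat -x 5 3                             # per-disk %util stays low
mpstat -P ALL 5 3                         # the time shows up as %sys
ps -eo state,pid,comm | awk '$1 == "D"'   # any processes in uninterruptible (D) sleep?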
The FlameGraph tool may help to narrow down the area the problem is related to.
First, install git and the perf tool (there are two interesting links about it with examples – link1 and link2):

dnf install -y perf git perl-open
git clone https://github.com/brendangregg/FlameGraph

Record 60 seconds of sample data with:

[root@srv ~]# perf record -F 99 -a -g -- sleep 60
[ perf record: Woken up 24 times to write data ]
[ perf record: Captured and wrote 7.132 MB perf.data (35687 samples) ]
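Before building the graph, the hottest kernel symbols can also be checked directly in the terminal with perf report; a quick look like the one below is often enough to spot a single function dominating the samples:

[root@srv ~]# perf report --stdio --sort comm,dso,symbol | head -n 30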

Then dump the recorded samples with perf script and generate an SVG image of all the functions executed during this 60-second time frame. It will give a good picture of what has been executed and how much time it took in kernel and user space.

[root@srv ~]# perf script > out.perf
[root@srv ~]# cd FlameGraph
[root@srv FlameGraph]# ./stackcollapse-perf.pl ../out.perf > out.folded
[root@srv FlameGraph]# ./flamegraph.pl out.folded > kernel.svg
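As a side note, the three steps can be chained in a single pipeline without the intermediate files (run from the directory holding perf.data):

[root@srv ~]# perf script | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > kernel.svg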

The SVG is shared below. It turned out that during most of this 60-second sample period, the Linux kernel was in ext4_mb_good_group, which is related to the ext4 file system.
What fixed the problem (for now) – by removing files, it was noticed that the system time went down and the load went down, too. After removing approximately a million files, the server’s performance returned to normal, on par with the other, average storage servers (this one has a unique hardware setup). So by removing files, i.e. inodes, from the ext4 file system, the load average and the server’s ext4 performance became completely different from what they had been before.
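For completeness, a minimal sketch of how such a cleanup could be scripted, assuming a hypothetical directory /mnt/storage/old-data and a 180-day age threshold (both are examples, not the actual layout of this server):

find /mnt/storage/old-data -type f -mtime +180 | wc -l    # count the candidates first
find /mnt/storage/old-data -type f -mtime +180 -delete    # then remove them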
The interesting part is that inode usage never went above 7% (around 24 million of 454 million available) and the free space never dropped below 20%, yet when a good deal of files were removed, the bad ext4 performance disappeared. It could still be related to the physical disks or some strange ext4 bug, but this article is intended to report this behavior and to propose some kind of a solution (even if only a workaround).

SCREENSHOT 1) The kernel process running at 100% – kworker/u32:2-flush-9:3.


SCREENSHOT 2) Almost half of the time was spent at ext4 related calls.


Keep on reading!