High load average

On one of our Proxmox 6 servers we recently had the issue of Icinga constantly complaining about a high load average. The load average itself is a bit of a weird metric: it measures the length of the run queue, i.e. how many processes are waiting to run - either because they are waiting for CPU time to become available or for data to arrive from the disk. It doesn't count tasks that are deliberately sleep()ing or waiting for network I/O (unless it's an NFS mount, in which case it counts as disk I/O, not network I/O). Basically, if it's too high it tells you something is wrong, but you have to dig deeper to figure out exactly what.
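For reference, these numbers come straight from the kernel and can be checked without any monitoring stack; the first three fields of /proc/loadavg are the 1-, 5- and 15-minute averages (the values below are just illustrative):

# cat /proc/loadavg
8.15 7.98 7.63 3/812 24512

The fourth field is runnable/total tasks, the fifth the last PID the kernel handed out.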
So, on our system top shows a CPU utilization of around 10%, and iotop just a few kilobytes up to a few megabytes of disk activity per second... What could it be? There's a ps invocation that shows only the processes actually in the run queue, so we can get a better idea of what's stuck. Let's run it:

# ps r -e

  PID TTY      STAT   TIME COMMAND
  442 ?        D      0:00 [z_unlinked_drai]
10102 ?        D      0:01 [z_unlinked_drai]
14311 ?        D      0:00 zfs recv -F -- rpool/data/subvol-100-disk-0
18114 pts/9    R+     0:00 ps r -e
19302 ?        D      0:00 zfs recv -F -- rpool/data/subvol-101-disk-0

Besides the "ps" process itself we're seeing two ZFS userspace processes and two ZFS kernel threads, all stuck in the D state (uninterruptible sleep) and all coming from the Proxmox replication service. It seems to be a bug in zfsonlinux - too bad :/ - because there's nothing we can do about it directly: processes stuck in an uninterruptible syscall ignore signals, and kernel threads can't be killed at all.
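If you're curious where exactly such a D-state process hangs, there are two standard tricks: the wchan column of ps names the kernel function the process is sleeping in, and, as root (and assuming the kernel exposes it), you can dump the full kernel stack from procfs. PID 442 here is just the stuck kernel thread from the listing above:

# ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'
# cat /proc/442/stack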
Reading up on it doesn't reveal a clear reason for the behaviour. We're running our ZFS pool on spinning rust with two NVMe disks for L2ARC and ZIL; swap is also on NVMe.
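While the problem is happening it can't hurt to glance at the pool health and the per-device I/O to rule out a dying disk - plain ZFS tooling, using the rpool name visible in the ps listing above:

# zpool status rpool
# zpool iostat -v rpool 5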
The only thing to do is install all updates, reboot the server and hope it doesn't happen again.
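On Proxmox that's the usual Debian dance - note that the Proxmox docs want dist-upgrade rather than a plain apt upgrade, which can leave the system in an inconsistent state:

# apt update
# apt dist-upgrade
# reboot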


Update, March 2021: The bug in zfsonlinux still isn't fixed. We are in the process of moving many of our VMs to Ceph storage on NVMe drives for performance reasons. We keep the hard disks on ZFS for bulk storage and backups. Having fewer VMs with ZFS replication should reduce the need for constant reboots.
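To see which replication jobs are configured on a node (and whether any of them are currently failing), Proxmox ships the pvesr tool:

# pvesr list
# pvesr status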
