Add additional disk checks to monitoring
Closed, Declined (Public)

Description

In T9794, the HDD storage volume on cloud12 ran out of space, which corrupted the filesystem on gluster122 and caused an outage. This should have been detected much sooner than it was.

I propose adding monitoring for the cloud* volumes and adding an Icinga alert for read-only filesystems.
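As a rough illustration of the read-only alert, here is a minimal sketch of a check that follows the standard Nagios/Icinga plugin exit codes (0 = OK, 2 = CRITICAL). It assumes a Linux host where `/proc/mounts` is readable. The function names and the list of ignored filesystem types are hypothetical and would need tuning per host; this is not an existing Miraheze check.

```python
# Assumption: filesystem types that are virtual or read-only by design,
# so they should not raise an alert. Adjust per host.
IGNORED_FS_TYPES = {"proc", "sysfs", "tmpfs", "devtmpfs", "squashfs", "iso9660"}


def find_readonly_mounts(mounts_text, ignored_types=IGNORED_FS_TYPES):
    """Return mount points whose options include 'ro' in /proc/mounts-style text."""
    readonly = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        if fs_type in ignored_types:
            continue
        # Options are comma-separated; match 'ro' exactly, not as a substring.
        if "ro" in options.split(","):
            readonly.append(mount_point)
    return readonly


def check(mounts_text):
    """Return (exit_code, message) using Nagios/Icinga plugin conventions."""
    readonly = find_readonly_mounts(mounts_text)
    if readonly:
        return 2, "CRITICAL: read-only filesystems: " + ", ".join(readonly)
    return 0, "OK: no read-only filesystems"


def main():
    """Read the live mount table, print the status line, return the exit code."""
    with open("/proc/mounts") as f:
        code, message = check(f.read())
    print(message)
    return code
```

To use it as a plugin, the script would end with `import sys; sys.exit(main())`, so that Icinga picks up the status from the exit code.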

Event Timeline

Void triaged this task as Normal priority. Oct 17 2022, 23:13
Void created this task.
John claimed this task.
John subscribed.

Disks aren't exposed to the OS, which makes monitoring them difficult within Icinga. I've tried a PVE monitoring check, but it doesn't seem to work, and I can't find any alternatives.

Long term, the two big fixes are:

  • ensure disk space is allocated correctly
  • use LVM, not LVM thin pools, for disk storage
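While thin pools remain in use, one possible stopgap is to watch pool fill levels on the hypervisor itself. The sketch below is not something that was tried in this task. It assumes `lvs` can be run as root on the cloud* hosts with `--reportformat json`. The thresholds and function names are hypothetical.

```python
import json
import subprocess


def read_lvs_json():
    """Run lvs and return its JSON report as text (assumes root on the host)."""
    result = subprocess.run(
        ["lvs", "--reportformat", "json",
         "-o", "vg_name,lv_name,lv_attr,data_percent"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def thin_pool_usage(lvs_json_text):
    """Map 'vg/lv' to data usage percent for each thin pool in the report."""
    usage = {}
    for section in json.loads(lvs_json_text).get("report", []):
        for lv in section.get("lv", []):
            # The first lv_attr character is 't' for a thin pool.
            if not lv.get("lv_attr", "").startswith("t"):
                continue
            percent = lv.get("data_percent", "")
            if percent:
                usage[f"{lv['vg_name']}/{lv['lv_name']}"] = float(percent)
    return usage


def check_thin_pools(usage, warn=80.0, crit=90.0):
    """Return (exit_code, message); thresholds are illustrative assumptions."""
    worst = 0
    parts = []
    for name, percent in sorted(usage.items()):
        parts.append(f"{name}={percent:.1f}%")
        if percent >= crit:
            worst = max(worst, 2)
        elif percent >= warn:
            worst = max(worst, 1)
    label = {0: "OK", 1: "WARNING", 2: "CRITICAL"}[worst]
    return worst, f"{label}: " + (", ".join(parts) or "no thin pools found")
```

A plugin wrapper would call `check_thin_pools(thin_pool_usage(read_lvs_json()))`, print the message, and exit with the returned code.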