
Infrastructure - MariaDB - SLO Availability/Error Failure
Closed, Resolved · Public

Description

For December 2022 SLO Reporting - MariaDB failed the SLO for Errors and Availability.

The Error SLO agreed was: 5%.
The Performance achieved was: 7.95%.

The Availability SLO agreed was: 99.5%.
The Performance achieved was: 98.7%.

Please investigate the reasons behind not meeting the SLO and provide a clear summary on this task identifying whether:

  • the failure was transient due to factors outside of the team's control, or
  • the failure was preventable, and clear steps have been taken to investigate and implement controls to minimise the risk of failing in January 2023.

Event Timeline

John triaged this task as Normal priority. Dec 30 2022, 22:35
John created this task.
Herald added subscribers: Unknown Object (User), Unknown Object (User), Void, Reception123. Dec 30 2022, 22:35
John claimed this task.

Availability - having reviewed this, I am confident the failure is attributable to two things: one beyond our control, and one covered by an open task that is blocked on Technology-Team (MediaWiki) for a resolution.

db121's availability has not been the best in December, with at least 4 separate outages. These total 71 minutes, which gives db121 specifically an uptime of 99.84% and places it above the agreed SLO target. Therefore, I do not believe the problems with db121 caused the SLO failure; the steps taken to address them are good but not imperative to future success.

db141 has definitely been the cause of this SLO failure for availability - Grafana recorded around 1024 minutes of downtime in December for this one server - 17 hours! This would mean db141 had an uptime of 97.03% - significantly below the SLO target.
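As a rough cross-check of the figures above, here is a minimal sketch that converts downtime minutes into an uptime percentage. It assumes the measurement window is the full 31 days of December; the function names are illustrative and not taken from any Miraheze tooling. Under that assumption, db121's 71 minutes gives 99.84%, while db141's 1024 minutes gives roughly 97.71% rather than the quoted 97.03%. The quoted figure may have been measured over a different window or taken from a different Grafana panel. Either value is well below the 99.5% target.

```python
# Assumption: the SLO window is the whole of December (31 days).
MINUTES_IN_DECEMBER = 31 * 24 * 60  # 44,640 minutes


def uptime_percent(downtime_minutes, period_minutes=MINUTES_IN_DECEMBER):
    """Return uptime as a percentage of the period."""
    return 100.0 * (1 - downtime_minutes / period_minutes)


def downtime_budget_minutes(slo_percent, period_minutes=MINUTES_IN_DECEMBER):
    """Return how many minutes of downtime an availability SLO allows."""
    return period_minutes * (1 - slo_percent / 100.0)


if __name__ == "__main__":
    slo = 99.5  # agreed availability SLO from the task description
    reported_downtime = {"db121": 71, "db141": 1024}  # minutes, from the comment

    for host, minutes in reported_downtime.items():
        pct = uptime_percent(minutes)
        verdict = "meets" if pct >= slo else "misses"
        print(f"{host}: {pct:.2f}% uptime ({verdict} the {slo}% target)")

    print(f"Downtime budget at {slo}%: {downtime_budget_minutes(slo):.1f} minutes")
```

On the same assumption, a 99.5% target leaves a budget of about 223 minutes per server for the month, so db121's 71 minutes fits comfortably and db141's 1024 minutes exceeds it several times over.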

db101, db112 and db131 had 100% uptime, while db142 reported an uptime of 94.97% since its late introduction in 2022. Based on this information, I can confidently say the failure by the Infrastructure team to meet the SLO objective for MariaDB availability is down to the unfortunate errors with db141 and the maintenance window mistakes later in December - steps are being taken to mitigate these.

For errors - this has been looked into, and it can be shown that the metric used for this SLO cycle was incorrect. Please see the new SLO metric on the dashboard, which suggests the error rate over the last month was actually 0.045%. Further, based on historical data, I would like to set the new SLO objective at 0.5% for now.