Infrastructure - Swift - SLO Errors Failure
Closed, Resolved (Public)

Description

For January 2023 SLO Reporting - Swift failed the SLO for Errors.

The SLO agreed was: 1%.
The performance achieved was: 1.06%.

Please investigate the reasons behind not meeting the SLO and provide a clear summary on this task identifying whether:

the failure was transient due to factors outside of the team's control, or
the failure was preventable and clear steps have been taken to investigate and implement controls to minimise the risk of failing in February 2023.

Event Timeline

John triaged this task as Normal priority. (Feb 4 2023, 12:25)
John created this task.

@Paladox we're halfway through Feb, we really need to look into this ASAP

Unknown Object (User) unsubscribed. (Mar 18 2023, 03:38)

The error rate in swift dropped off massively in late February, at about the time T10510 was closed, so I'm assuming that task covered both the cause and the fix.

To summarize the SLO failure cleanly: the image handler was improperly logging some 404 messages as 500 errors. This has been fixed in our configuration, but it may be worth chasing down a better long-term solution upstream.
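
To illustrate how that kind of misclassification can push an error-rate figure over a 1% budget, here is a minimal Python sketch. It is not our actual reporting code: the function name, the `misreport_404_as_500` flag and the request counts are all hypothetical, and the counts were picked only so the result lands on the 1.06% reported above. Real traffic numbers are not recorded on this task.

```python
from collections import Counter

SLO_ERROR_BUDGET_PCT = 1.0  # the agreed SLO from this task


def error_rate_pct(status_codes, misreport_404_as_500=False):
    """Return the percentage of requests counted as server errors (5xx)."""
    counts = Counter()
    for code in status_codes:
        if misreport_404_as_500 and code == 404:
            code = 500  # hypothetical stand-in for the logging bug
        counts["error" if 500 <= code <= 599 else "ok"] += 1
    total = counts["error"] + counts["ok"]
    return 100.0 * counts["error"] / total if total else 0.0


# Invented sample: 10,000 requests, 50 genuine 5xx, 56 missing images (404).
sample = [200] * 9894 + [500] * 50 + [404] * 56

buggy = error_rate_pct(sample, misreport_404_as_500=True)
fixed = error_rate_pct(sample)
print(f"404s counted as 500s: {buggy:.2f}% (within SLO: {buggy <= SLO_ERROR_BUDGET_PCT})")
print(f"404s counted as 404s: {fixed:.2f}% (within SLO: {fixed <= SLO_ERROR_BUDGET_PCT})")
```

With these made-up counts, the same traffic fails the SLO when missing images are counted as server errors and passes once they are counted as client errors, which is the shape of what we saw.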

I'm not entirely sure if this is correct, but the better long-term solution may be using checkImages.php to find all missing images and then purging them from the database. I'll raise this matter with the MediaWiki team later, but I doubt it will be actioned before a point where we may have to review our local changes.