
503 Backend fetch failed (October 3, 2022)
Closed, Resolved, Public

Description

Error 503 Backend fetch failed, forwarded for 2a01:cb04:74c:da00:55c:4bee:f20:73f5, 127.0.0.1
(Varnish XID 275677553) via cp23 at Mon, 03 Oct 2022 20:27:26 GMT.

It's been a few minutes already. There was maintenance last week around the same hours, but I'm not aware of a new one. Unless I missed something?
Not restricted to a single wiki btw, all Miraheze wikis seem to be affected.

Event Timeline

It is indeed down right now, confirmed on Discord. They are working on it, currently no ETA.

Seems to be up again now :) So probably fixed?

Unknown Object (User) closed this task as Resolved. Oct 3 2022, 21:07
Unknown Object (User) claimed this task.
Unknown Object (User) triaged this task as Unbreak Now! priority.
Unknown Object (User) added projects: MediaWiki, MediaWiki (SRE).
Clarasiir subscribed.

It still isn't fixed. Sometimes pages just load insanely slowly, and for the past hour I have repeatedly been getting 503/502 errors.

Error 503 Backend fetch failed, forwarded for 2601:247:4200:cf50:4942:817e:58c5:2de4, 127.0.0.1
(Varnish XID 599097859) via cp32 at Mon, 03 Oct 2022 21:53:33 GMT.

Unknown Object (User) closed this task as Resolved. Oct 3 2022, 23:13

Thank you for all your patience here. We should now be up again for the most part, though we still have to see whether we will remain up. We apologise for the inconvenience this has caused. We are working to improve performance, and we hope that the far-too-often downtime will soon become a distant memory. Over the last few weeks we have finally been able to make a lot of progress on improving things (better performance and fewer 502/503 errors), and we hope to be able to continue improving things in the near future. Again, we are very sorry for the inconvenience this has caused.

Unknown Object (User) reopened this task as Open. Oct 3 2022, 23:29
Unknown Object (User) removed Unknown Object (User) as the assignee of this task.

Seems we may be going intermittently down again, so re-opening until fully resolved.

MacFan4000 subscribed.

It was said that Gluster may be causing some of the issues.

Unknown Object (User) added a comment. Oct 4 2022, 01:09

Yes, Gluster is definitely at least a part of what is causing issues.

Void claimed this task.
Void subscribed.

Here are some details on the problem:

  • Cloud12 ran out of storage space on the HDD. This is likely due to a mistaken overallocation of the HDD storage node, as the VHD files are smaller than their total allocated space (until the disk is actually filled).
  • This caused the VHD for gluster122 to become corrupted. I'm assuming that either the file was expanded to fit additional partition information that could not be added, or another VHD expanded into gluster122's allocation. There is no way to confirm.
  • fsck was run on gluster122 by paladox after clearing some space, but it failed to entirely fix the disk.
  • I moved phab121 to the SSD storage and ran fsck to fully fix the problem. The first failure was likely because the prior operation cleared only a negligible amount of disk space.

It appears we have booted properly following the second fsck operation.
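
The overallocation described above can be illustrated with a minimal sketch. It assumes thin-provisioned disks, where each VHD file only occupies the space actually written but may grow up to its allocated size. The names `VirtualDisk` and `overallocation_report` and all of the numbers are hypothetical and are not taken from the Miraheze infrastructure; this is not the tooling that was used during the incident.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class VirtualDisk:
    """Hypothetical description of one thin-provisioned virtual disk."""
    name: str
    allocated_bytes: int  # maximum size the disk is allowed to grow to
    used_bytes: int       # space the backing file occupies right now


def overallocation_report(capacity_bytes, disks):
    """Compare what the disks may grow to against the physical capacity."""
    allocated = sum(d.allocated_bytes for d in disks)
    used = sum(d.used_bytes for d in disks)
    return {
        "allocated_bytes": allocated,
        "used_bytes": used,
        "free_bytes": capacity_bytes - used,
        # Anything above zero means the disks cannot all fill up safely.
        "overallocated_bytes": max(0, allocated - capacity_bytes),
        "ratio": allocated / capacity_bytes,
    }


# Made-up example: a 1000 GiB storage node hosting two 600 GiB disks.
example_disks = [
    VirtualDisk("vm-a", 600 * GIB, 100 * GIB),
    VirtualDisk("vm-b", 600 * GIB, 200 * GIB),
]
report = overallocation_report(1000 * GIB, example_disks)
```

In this made-up case the node still has free space today, yet the disks are allowed to grow 200 GiB beyond what physically exists, which is the kind of situation that only shows up once the disks actually fill.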

Given how long the visible outage/issues lasted there should probably be an incident report.

Unknown Object (User) added a comment. Oct 4 2022, 15:44

Given how long the visible outage/issues lasted there should probably be an incident report.

Definitely should be, not probably.

MacFan4000 lowered the priority of this task from Unbreak Now! to Normal.

Reopening and lowering prio pending an incident report.

While MediaWiki was visibly affected, it had nothing to do with this issue. This was overallocation on Cloud Infrastructure.

Created and working on https://meta.miraheze.org/wiki/Special:IncidentReports/53; it should be published for public viewing once ready.