
Frequent 503s during past few days
Closed, Resolved, Public

Description

For the past few days I (and other users as well) have been experiencing 503s quite frequently when accessing wiki pages. This seems unusual: around 2 weeks ago, before my break, I don't remember experiencing any 503s, so something seems off. Could the mw111 alerts that get sent to SRE from hund.io be related? In any case, this doesn't seem to be the usual thing; it seems to have started happening frequently all of a sudden.

Event Timeline

Herald added subscribers: Unknown Object (User), Unknown Object (User). Jul 25 2022, 14:37

I'd also add that I've seen users complain about performance being slower than usual, so to me it indicates that it's not our usual performance issues (which had been improved thanks to @Universal_Omega in the past few weeks) but something new.

I can add that I just got a 503 when refreshing this very task so that seems to indicate it's not an exclusively MW issue.

My experience as a daily user and editor: 1. frequent 503 errors and 2. slow performance.

Supplement: When using Special:Import, if the number of imported templates reaches a certain number at one time, this error will also occur, resulting in failure to import normally.

I'm bumping this conversation to warn that it's getting worse and worse. The 503s or 502s are more and more frequent; sometimes the site even fails several times in a row for several minutes.

Unknown Object (User) added a comment. Jul 27 2022, 17:30

Supplement: When using Special:Import, if the number of imported templates reaches a certain number at one time, this error will also occur, resulting in failure to import normally.

That's not really related to our current issues. We need to kill off connections after a certain period of time to ensure that no single connection hogs too much server processing time, so if the import takes MediaWiki longer than a minute to process, that'll happen.

We are hoping that new servers will be in production by tomorrow and that should hopefully resolve the 503 issues.
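
To illustrate the cutoff described in the comment above, here is a minimal sketch of a request that is cancelled once it exceeds a time limit. It is only a simulation under stated assumptions: the function names, the shortened timeout values, and the choice of 503 as the resulting status are hypothetical and are not taken from Miraheze's actual configuration.

```python
import asyncio

# Hypothetical limit; the comment above mentions roughly one minute.
REQUEST_TIMEOUT_SECONDS = 60.0


async def handle_import(work_seconds: float) -> int:
    """Stand-in for a long-running request such as a large import."""
    await asyncio.sleep(work_seconds)
    return 200


async def serve(work_seconds: float, timeout: float) -> int:
    """Run the request, cancelling it if it exceeds the timeout."""
    try:
        return await asyncio.wait_for(handle_import(work_seconds), timeout)
    except asyncio.TimeoutError:
        # The request was cut off so it cannot hold the backend indefinitely.
        # 503 is assumed here only because that is what users reported seeing.
        return 503


def status_for(work_seconds: float, timeout: float = REQUEST_TIMEOUT_SECONDS) -> int:
    """Synchronous wrapper returning the simulated HTTP status."""
    return asyncio.run(serve(work_seconds, timeout))
```

With these assumptions, a call such as `status_for(0.01, timeout=0.2)` finishes in time, while `status_for(0.5, timeout=0.05)` is cancelled and reports the error status.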

Unknown Object (User) added a subscriber: FatBurn0000.

Perhaps related, but the semantic "upgrade key" error suddenly came back after 10:30 tonight for whatever reason. (On my wiki, and the handful of others with SMW installed.)

Unknown Object (User) added a comment. Aug 3 2022, 03:33

Perhaps related, but the semantic "upgrade key" error suddenly came back after 10:30 tonight for whatever reason. (On my wiki, and the handful of others with SMW installed.)

That was unrelated, but I have now fixed it.

Update: The 503 errors I had are no longer an issue, sorry for wasting your time.

Unknown Object (User) added a subscriber: Sivisvivam.

Another update: Actually, I'm starting to experience them again.

Unknown Object (User) added a comment. Aug 4 2022, 00:11

We're aware. The current set of issues is caused by the Gluster migration, which should end in a day or so.

Periods of repeated 503s across multiple wikis are now taking place hourly, and lasting for minutes at a time. We can't tell our users to wait for "a day or so" - please provide a more precise estimate of how much longer this is going to continue.

Unknown Object (User) added a comment. Aug 4 2022, 03:19

Periods of repeated 503s across multiple wikis are now taking place hourly, and lasting for minutes at a time. We can't tell our users to wait for "a day or so" - please provide a more precise estimate of how much longer this is going to continue.

I said about a day because last I heard from Infrastructure, it would take 27 hours when I said that. Since then, the estimate has dropped down to 22 hours per SRE Infrastructure.

In T9581#194390, @Agent_Isai wrote:

Periods of repeated 503s across multiple wikis are now taking place hourly, and lasting for minutes at a time. We can't tell our users to wait for "a day or so" - please provide a more precise estimate of how much longer this is going to continue.

I said about a day because last I heard from Infrastructure, it would take 27 hours when I said that. Since then, the estimate has dropped down to 22 hours per SRE Infrastructure.

It has been more than the 27 hours that Infrastructure originally reported. This is still taking place (and there is no mention of the situation on the Twitter feed that is included in the 503 page). Does anybody have any idea as to when normal service will be restored?

Unknown Object (User) added a comment. Aug 5 2022, 16:03

We deeply apologise for the issues. The time the system gave us was incorrect. It was about 90 hours off by my own estimate, which is based on the progress it has already made. There should be about 82 hours left based on how much has already completed.

Again, on behalf of SRE, we truly apologise for all the issues this has caused to the community, and understand the frustration it has also caused.
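
As an illustration of how a progress-based estimate like the one above could be derived, here is a minimal sketch. It assumes a simple linear extrapolation (the transfer continues at its average rate so far); the thread does not say how the estimate was actually calculated, and the function name and all figures in the usage note are hypothetical.

```python
def estimate_remaining_hours(total_gb: float, done_gb: float, elapsed_hours: float) -> float:
    """Extrapolate remaining time, assuming the average rate so far continues.

    All parameters are hypothetical inputs for illustration; none of them
    are figures reported in this task.
    """
    if done_gb <= 0 or elapsed_hours <= 0:
        raise ValueError("need some completed progress and elapsed time to extrapolate")
    if done_gb > total_gb:
        raise ValueError("completed amount cannot exceed the total")
    remaining_gb = total_gb - done_gb
    # remaining / rate, where rate = done_gb / elapsed_hours
    return remaining_gb * elapsed_hours / done_gb
```

For example, with a made-up 1000 GB total, 250 GB done after 30 hours, `estimate_remaining_hours(1000, 250, 30)` extrapolates 90 hours remaining. An estimate of this kind shifts whenever the transfer rate changes, which is one reason such figures can be far off.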

Unknown Object (User) added subscribers: Omega64, MirahezeBot.

I am truly grateful for your work. But I also encounter these errors from time to time.

Unknown Object (User) added a subscriber: Anoo2771.

When the site comes back from the current outage period, could somebody please put a Miraheze-wide message onto the wikis saying why this is happening and how long it is expected to continue? I seriously doubt that I'll have time to put sitenotices on all three wikis that I have mod rights on before the next outage, given the frequency and length of the outages, but a Steward would probably have enough time to do it for everybody at once.

Was going to make a new task but since this is known...

503 errors seem to be systematic on my side right now. This morning I got a bunch of these as well, but not to the extent I'm seeing now. Also, could you put some sort of info message about the issue on the Twitter account, since apparently there are at least 3 or 4 days remaining, if I understood correctly?

I don't get how this is happening randomly instead of a continuous shutdown.

They are really becoming an issue now.

Took me over a minute for an OAuth token to log in here. The wiki is seriously slow, if it works at all.

Unplanned, unannounced 90 hour maintenance windows are... sigh. Let's just say I'm looking forward to reading the incident report when this is all over.

I looked to see if there was anything there now, but Special:IncidentReports returns, reproducibly:

[c7a4ab9d2632fe0020d33e16] 2022-08-07 08:17:52: Fatal exception of type "DomainException"

Seems like the 503s are finally starting to let up. I haven't got one this afternoon.

Unknown Object (User) added a subscriber: Sirlance3000.

Seems like the 503s are finally starting to let up. I haven't got one this afternoon.

I can't say I've noticed any improvements at all with my wiki. I'm still frequently getting 503/502 errors that last for minutes at a time.

Edit: I think it's actually getting worse. My wiki has been down with a 503 error for over 10 minutes now.

Me too, I am still frequently getting 503 errors.

Still happening here too...
Error 503 Backend fetch failed, forwarded for 118.208.135.232, 127.0.0.1
(Varnish XID 228099867) via cp20 at Mon, 08 Aug 2022 10:36:31 GMT.

You guys are not alone, 'cause I keep getting bombarded by those 503s as well...

We deeply apologise for the issues. The time the system gave us was incorrect. It was about 90 hours off by my own estimate, which is based on the progress it has already made. There should be about 82 hours left based on how much has already completed.

Again, on behalf of SRE, we truly apologise for all the issues this has caused to the community, and understand the frustration it has also caused.

We're coming up on the end of this period. Could we have a status report, please?

Unknown Object (User) added a comment. Aug 8 2022, 12:25

We're coming up on the end of this period. Could we have a status report, please?

About 24-36 hours, but that's not precise and it could be shorter or longer. This is, again, just my own estimate.

Thank you - I'll update sitenotices the next time the 503s stop.

In a possibly-related matter, https://status.miraheze.wiki/ does not appear to have been updated since August 4.

For me, it isn't letting up at all. Bugs are still frequent; it's a disaster.

82 Hours Left ****

{F1842876}

Unknown Object (User) added a comment. Aug 9 2022, 02:25

40GB left to transfer. Hopefully it will be done soon.

Unknown Object (User) closed this task as Resolved. Aug 9 2022, 04:06
Unknown Object (User) claimed this task.

The migration is now complete. If 503s once again become quite frequent, this can be reopened. Thank you everyone for your patience with this and once again, we deeply apologise for all the issues over the past days.

Unknown Object (User) added a comment. Aug 9 2022, 05:15

They still seem to happen to me.

image.png (471×883 px, 39 KB)

Is it consistent and frequent?

Yeah, it just happened again despite the completion of the migration. I did receive 502/503s after, but it did not cause an outage that lasted for minutes and prevented me from accessing sites. Right now, I just bumped right into one and could barely navigate the site.

This task isn't quite resolved yet.

Bugs and slowdowns are not fixed at all. Impossible to navigate or even do anything.

Everything works perfectly now, thanks for fixing!

Just happened to me again:
Error 503 Backend fetch failed, forwarded for 118.208.135.232, 127.0.0.1
(Varnish XID 38998613) via cp30 at Wed, 10 Aug 2022 10:08:57 GMT.

Unknown Object (User) added a comment. Aug 10 2022, 10:35

They will still happen, just less frequently, or at least I would hope they are less frequent.

Could you open this again? It's obviously not solved yet.

Edit: I opened the task again.

Unknown Object (User) added a comment. Aug 10 2022, 15:57

Could you open this again? It's obviously not solved yet.

Edit: I opened the task again.

It will still happen, though hopefully not as often, and unfortunately, there is nothing more we can do at the moment, to my knowledge, to fix it. More work is planned for the future, but as of now, I think everything that can be done to mitigate this at this time has been done.

However, I will still look and see whether anything else can be done to further reduce the issue.

Unknown Object (User) closed this task as Resolved. Aug 10 2022, 21:21

Per above:

It will still happen, though hopefully not as often, and unfortunately, there is nothing more we can do at the moment, to my knowledge, to fix it.

If it does become very frequent it can be reopened again.