
db121 frequently OOMs
Closed, Resolved · Public

Description

I hesitated to make this task high priority given that there are already so many high-priority tasks and the OOMs aren't that frequent, but the issue isn't stopping and some users appear to be worried. Given the recent db141 incident, it's not a good look to have another DB server go down every once in a while. As far as I can tell from the SAL, db121 has OOM'd on 10 December, 22 November, 13 November and 8 November.

Event Timeline

Reception123 created this task.

There's minimal opportunity to grow memory on db121. As far as I know, the cause is likely parsercache, which would mean the easiest fix is to reduce the amount of caching on the MW side.
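
A minimal sketch of what reducing caching on the MW side could look like, assuming a LocalSettings.php-style configuration; the variable is MediaWiki's standard $wgParserCacheExpireTime, and the value shown is purely illustrative:

  // Illustrative only: shorten how long rendered pages are kept in the
  // parser cache so fewer rows accumulate in the parsercache tables.
  $wgParserCacheExpireTime = 86400 * 10; // ten days, in seconds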

Unknown Object (User) added a comment. Dec 13 2022, 07:37
In T10117#203977, @John wrote:

There's minimal opportunity to grow memory on db121. As far as I know, the cause is likely parsercache, which would mean the easiest fix is to reduce the amount of caching on the MW side.

I have also mentioned a couple of times that the cause is likely parsercache.

I was thinking, however: what if we moved parsercache to, say, db141, and increased memory on db141 instead, since we have more available on cloud14? Another option would be investing in a smaller DB server, perhaps on cloud14, dedicated to parsercache, so that if it goes down it won't also bring down MediaWiki wikis.

I am not certain, but decreasing caching on the MW side doesn't seem ideal to me either; caching less just doesn't seem like the preferred route, since caching more lets us keep more rendered content and benefits overall performance.

Once the cloud13 reboots happen, a request for a smaller DB server could be considered, as could moving parsercache to another server.
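
For reference, a minimal sketch of what pointing the parser cache at a dedicated DB host could look like, using MediaWiki's standard SqlBagOStuff backend; the host name, database name and credentials below are hypothetical placeholders, not the actual Miraheze setup:

  // Hypothetical dedicated parsercache backend; host/dbname/credentials are placeholders.
  $wgObjectCaches['parsercache-dedicated'] = [
      'class'  => 'SqlBagOStuff',
      'server' => [
          'type'     => 'mysql',
          'host'     => 'db-pc1.example.org', // a smaller DB server used only for parsercache
          'dbname'   => 'parsercache',
          'user'     => 'wikiuser',
          'password' => $wgDBpassword,
          'flags'    => 0,
      ],
  ];
  $wgParserCacheType = 'parsercache-dedicated';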

Re-assigning, as the action identified above is one for the MediaWiki team and not Infrastructure.

Noting that the last OOM was on January 4.

Unknown Object (User) lowered the priority of this task from High to Normal. Feb 7 2023, 07:38

Lowering to Normal: per the above, this is essentially blocked on the extra disk space on cloud13 and the removal of cloud10, and until that is done it cannot progress. Since there is not much that can be done right now and there has been no progress on this task, I am lowering the priority.

What are the plans for the extra disk space on cloud13? This just happened again.

Unknown Object (User) added a comment (edited). Mar 4 2023, 00:44

I have now tried something to help with this; specifically, I have done two things:

  • Sharded parsercache across 10 tables, so instead of nearly 6,000,000 rows in one table we would have somewhere around 500,000 rows in each of 10 tables. In practice each server should hold even less, roughly a third of that, because of the second change:
  • I have changed the parsercache setup so that it is no longer only on db121: it is now also spread across db131 and db142, as a multi-server parsercache cluster of all three, to reduce the impact parsercache can have on any single server (see the sketch after this list).
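
A minimal sketch of what that sharded, multi-server layout could look like with MediaWiki's SqlBagOStuff backend; the database name and credentials are placeholders, and only the host names mirror the servers mentioned above:

  // 'shards' splits keys across 10 tables per server; 'servers' hashes keys
  // across db121, db131 and db142. dbname/user/password are placeholders.
  $pcServers = [];
  foreach ( [ 'db121', 'db131', 'db142' ] as $pcHost ) {
      $pcServers[] = [
          'type'     => 'mysql',
          'host'     => $pcHost,
          'dbname'   => 'parsercache',
          'user'     => 'wikiuser',
          'password' => $wgDBpassword,
      ];
  }
  $wgObjectCaches['multi-parsercache'] = [
      'class'   => 'SqlBagOStuff',
      'servers' => $pcServers,
      'shards'  => 10,
  ];
  $wgParserCacheType = 'multi-parsercache';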

db121 has just crashed again; @Universal_Omega's ideas above do not seem to be working. Have there been any updates on cloud13 and cloud10?

Unknown Object (User) added a comment. Mar 9 2023, 06:00

db121 has just crashed again; @Universal_Omega's ideas above do not seem to be working. Have there been any updates on cloud13 and cloud10?

I am removing parsercache from db121 entirely right now, leaving it only on db131/db142, to see if that helps.

Ah, okay. Hopefully so.

Unknown Object (User) edited projects, added Infrastructure (SRE); removed MediaWiki (SRE). Mar 11 2023, 19:06
In T10117#203977, @John wrote:

There's minimal opportunity to grow memory on db121. As far as I know, the cause is likely parsercache, which would mean the easiest fix is to reduce the amount of caching on the MW side.

This is confirmed to have nothing to do with parsercache: after removing parsercache from db121, the OOMs still happen, so the action identified for MediaWiki is now irrelevant.

Unknown Object (User) unsubscribed. Mar 18 2023, 03:33

I intend to reevaluate this task soon, hopefully. Based on the MW-side changes, it seems likely that this is caused by something on the database configuration side. I can't guarantee that I have the knowledge or ability to solve it, but I'll see what I can do.

BrandonWM raised the priority of this task from Normal to High. May 25 2023, 14:46

This has continued to happen lately. Moving the priority to High as it affects all wikis on db121, though I'm fine with it being moved back down if preferred.

The read-only-on-restart behaviour will be addressed here whenever I or someone else gets the chance to deploy the fix.

Void lowered the priority of this task from High to Normal. Jun 1 2023, 20:02
Void moved this task from Incoming to Long Term on the Infrastructure (SRE) board.

The config change to remove read-only has been merged. Leaving the task open as a long-term goal of tuning MySQL better.

Paladox claimed this task.

Managed to reduce the OOMs by changing the config and giving the server more RAM.
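
For context, a hedged sketch of the kind of MariaDB (my.cnf) tuning implied by "changing the config and giving it more RAM"; every value below is illustrative and would need to be sized against the host's actual memory rather than copied as-is:

  [mysqld]
  # Keep the buffer pool comfortably below total RAM so mysqld plus OS
  # overhead cannot push the host into the OOM killer.
  innodb_buffer_pool_size = 8G
  # Per-connection buffers multiply with the connection count, so keep
  # both bounded.
  max_connections         = 200
  tmp_table_size          = 64M
  max_heap_table_size     = 64M
  # Do not come back up read-only after a restart (the behaviour removed by
  # the earlier config change).
  read_only               = 0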