FSSLC: The actual next step in Miraheze's growth
New servers unleash performance like never before

2021, 2022, and 2023 were extremely rough years. We went from being always slow (2021) to being sometimes slow (early 2022) to being very slow, with a broken database that halted 1,600 wikis for a few weeks (late 2022), to almost shutting down (2023). It has been a rollercoaster for everyone, but we want to reaffirm that Miraheze is never going away. Now more than ever, we are dedicated to our lovely users, and we are doing our absolute best to improve your experience and to make you proud of the farm you host your wikis on.

To that end, WikiTide has helped Miraheze improve its infrastructure tenfold and build a solid foundation for the first time in almost a decade. We are pleased to announce our newest server cluster, FSSLC. Named following the convention of our previous server clusters, FS stands for Fiberstate and SLC stands for Salt Lake City, Utah, in the western United States, where the data center is located.

Our new servers have improved performance and load speeds by a huge amount, and you are already reaping some of the benefits of this change! With it, we intend to ensure far greater stability, future-proofing, and true resilience.

These servers are far better than the old ones in our SCSVG cluster, which were, as one Site Reliability Engineer put it, "on their last leg" and "on life support". We've come a very long way since 2021. We know many users may be curious about how we began this odyssey, so we have written a summary of how we ended up here and what FSSLC will mean for Miraheze's future.

The birth of SCSVG and the idea of owning our own hardware

If you've been a long-time Miraheze user, you may remember how slow Miraheze was in 2021. It was even slower than during the worst slow periods of 2023. Many new communities joined our platform that year, including various colossal wikis. While the farm was growing exponentially, our servers weren't. With every new wiki that migrated to Miraheze, the platform got slower and slower until it loaded at a snail's pace.

This crisis was not lost on Miraheze management. Over the previous year there had been several server migrations, none of which helped for long and some of which ended in disaster. After considering all options carefully, it was decided that we had outgrown renting servers and needed to buy our own. Owning the hardware would mean we could upgrade it later without paying a host more every month; we would only face the one-time cost of the parts we needed.

At the time, this was the sanest option. To upgrade our resources to the projected need while remaining on rented bare-metal servers at OVH, our then provider, we estimated we'd need to go from our old budget of roughly £320 GBP (~$415 USD at the time) to £500 GBP (~$650 USD). That $235 USD climb was deemed a death knell for the project: we simply could not afford it, and our donations would not allow for it. Meanwhile, colocation would cost around £900 GBP per quarter, or £300 per month. That price was very attractive and fit within our project budget.

Once the decision was settled, the hardware was bought over the course of a month and assembled, and by the end of December 2021 the servers were racked at the data center and virtual machines were being set up. J17 was published to celebrate this. Finally, on 14 January 2022, we migrated over to our new data center, codenamed SCSVG (SC standing for ServerChoice, our colocation provider, and SVG for Stevenage, the town where the data center was located).

The woes roll right on in

While we had hoped the migration to SCSVG would be smooth as butter, it was far from it. To transfer all the databases from our old servers to the new ones in a timely manner, some tables, such as the parsercache and recentchanges tables, weren't moved over. These tables can be quite big, but the data in them is easily reproducible, so the plan was to rebuild them after the migration. With the parser cache gone, load times skyrocketed during the initial days following the migration, leading to slow wikis. And because the recentchanges table was missing, many wikis were erroneously marked as inactive and closed.

After a few days things stabilized, but as soon as the week following the migration, users began reporting occasional 5xx errors and slowness, though far less frequently than before and typically only during peak usage. Evidently, even with the added capacity, this hardware was still not enough to sustain the platform entirely. This had been foreseen as a potential issue, however, and the roadmap included a plan to upgrade components bit by bit. The hardware we had was essentially a stopgap, meant to be used only temporarily while we mustered up the funds to buy better parts.

We had anticipated that, through the fundraiser, Miraheze would generate not only enough to cover an entire year of server costs but also extra to cover part upgrades. That could have happened, had one significant event not occurred.

UK energy crisis: the beginning of the end for Miraheze

Our hopes came to a crashing halt when ServerChoice informed us that they would be raising our quarterly bill from £900 to £1,530 (~$1,940 USD): an extra £2,520 GBP (~$3,200 USD) per year, for a yearly total of £6,120 GBP (~$7,760 USD), up from our initial £3,600 GBP (~$4,565 USD). Imagine what we could've bought with that extra $3,200: new servers, more RAM, better processors, more disks, the whole package! In the final months of our contract, they charged us as much as £1,650 ($2,090 USD) per quarter, for a total of £6,600 GBP (~$8,370 USD) per year, or £550 GBP (~$700 USD) a month! Our worst nightmare had come true.

To recap, in case you got lost in the sea of numbers: we migrated to our own hardware so we could pay only £300 a month instead of the £500 we'd have needed for rented servers, and in the end we wound up paying £550.
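For anyone who wants to double-check the sea of numbers, the comparison works out like this (GBP figures exactly as quoted in this post; the code is just arithmetic, not anything official):

```python
# Cost figures from the post, in GBP (rounded, as quoted).
ovh_old_monthly = 320         # old rented-server budget at OVH
ovh_upgraded_monthly = 500    # projected OVH cost after the needed upgrade
colo_quarterly_initial = 900  # ServerChoice's initial colocation quote
colo_quarterly_raised = 1530  # quote after the energy-crisis price hike
colo_quarterly_final = 1650   # what was charged in the final months

colo_monthly_initial = colo_quarterly_initial / 3  # 300.0
colo_monthly_final = colo_quarterly_final / 3      # 550.0

# Colocation started out cheaper than even the old OVH budget...
assert colo_monthly_initial < ovh_old_monthly
# ...but ended up costing more than the upgraded OVH plan would have.
assert colo_monthly_final > ovh_upgraded_monthly

# Extra cost per year caused by the price hike: 2520
print((colo_quarterly_raised - colo_quarterly_initial) * 4)
```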

With that, our dreams of upgrading were crushed entirely. There was absolutely zero room to upgrade: any fundraiser would only cover basic upkeep for our servers, never upgrades. We were stuck in a predicament and could do absolutely nothing.

This proved to be the beginning of the end for Miraheze. Remember how we said the hardware we initially bought was a stopgap? That decision would come back to bite us terribly.

Temporary hardware that far outlived its shelf life

In retrospect, many may criticize management for the hardware we ended up buying. It was rather slow and old, but had our server costs not nearly doubled, it would've been a brilliant plan.

The shelf life of our hardware came and went. Equipment that was meant to last only a few months, buying us time to upgrade, was now living well past its originally planned end of life.

On 16 November 2022, at around noon UTC, cloud14 locked up entirely. A disk had failed. The disk in question was a consumer-grade Western Digital Green SSD which, as described by Western Digital themselves, is "best [suited] for everyday light computing" (emphasis ours). Again, this hardware was meant to be a stopgap, not permanent, so it was always a matter of when, not if, the drives would fail: they were being used well beyond their intended workload and projected service life.

Some techies in the audience may wonder, "Why didn't you set up RAID to ensure redundancy and protect against a single disk failure?" We did set up RAID, but one disk had failed ten hours earlier, leaving the array with no redundancy, and then a second disk failed, taking the entire array down. All the disks in the array were the same model, so they were likely all worn out equally badly, as none of them were intended for the workload we put on them.
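The post doesn't say whether this was hardware or Linux software RAID. Assuming the latter (mdadm), a degraded array announces itself in `/proc/mdstat` as an underscore in the member-status brackets, e.g. `[U_]` instead of `[UU]`. A minimal sketch of a health check along those lines (the sample mdstat text is illustrative, not Miraheze's real layout):

```python
import re

def degraded_arrays(mdstat_text: str) -> list[str]:
    """Return names of md arrays whose member status shows a failed disk.

    mdstat prints each array's member health as e.g. [UU] (all members up)
    or [U_] (one member down); any underscore means redundancy is lost.
    """
    bad = []
    current = None
    for line in mdstat_text.splitlines():
        m = re.match(r"^(md\d+)\s*:", line)
        if m:
            current = m.group(1)
        status = re.search(r"\[[U_]+\]", line)
        if status and current and "_" in status.group(0):
            bad.append(current)
    return bad

# Example /proc/mdstat content: md0 is healthy, md1 has lost a member.
sample = """\
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
      976630336 blocks super 1.2 [2/2] [UU]
md1 : active raid1 sdc1[0]
      976630336 blocks super 1.2 [2/1] [U_]
"""
print(degraded_arrays(sample))  # ['md1']
```

Running something like this from a cron job (or letting `mdadm --monitor` send mail) flags the first failure while the array is degraded but still alive, which is exactly the ten-hour window that was missed here.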

The aftermath of the cloud14 incident was messy, stressful, and exhausting. The resulting burnout contributed to the June 2023 mass resignation, where all these factors compounded to leave a very disgruntled group of core volunteers.

New and old farm, united by old struggles

When WikiTide forked from Miraheze, it initially ran on Amazon Web Services (AWS), using leftover resources from WikiForge. While AWS was enough for WikiForge and its budget, WikiTide wreaked havoc on those servers: the specs simply weren't enough, and WikiTide soon outgrew AWS. Sustaining the service there properly would have required a huge budget. In three months, WikiForge/WikiTide racked up a bill of $1,200, with WikiTide using around 85-95% of the resources, yet the service deteriorated and became slower and slower. This was unsustainable.

As a newer service, WikiTide had far less 'baggage' and greater license to try new things. It shopped around among server providers, getting quotes and negotiating, and in the end secured a very good deal on a powerful server that was overkill for WikiTide alone and gave it room to breathe and expand quite liberally.

During that same period, Miraheze was not doing so well. Despite bleeding wikis, it was still very slow. With a great number of volunteers fleeing to WikiTide, there were big backlogs in every single area, from SRE to Stewards. Miraheze was running on borrowed time: the disks were still on the verge of failure, with absolutely no solution in sight. The failing RAID array meant the disks responded extremely slowly (despite being SSDs), and if one more drive failed, the whole array would fail yet again, leading to another cloud14/db141 incident, perhaps even worse now that many of the volunteers who had helped during that incident had left.

United as one, once again?

WikiTide had been shopping around for deals, speaking with providers and negotiating lower rates, and came upon a great one: a server twice as powerful as its current one that wasn't much more expensive.

At that same time, the Miraheze Limited Board of Directors requested that a Request for Comments (RfC) be published on the future of Miraheze. The intended 'new' operators of Miraheze, Miraheze Foundation, had hit rough waters due to a big misfire by a Director. A different Miraheze Foundation Director joked that WikiTide should also throw its hat into the ring to see whether the community would humor a merger and reunification of the fork. WikiTide accepted, and the community favored the option by a landslide.

Let's get to work!

After the RfC passed and the needed paperwork was finished, Miraheze was brought under the auspices of the WikiTide Foundation. Now united once again, WikiTide shared its improvements to various tools like CreateWiki, ManageWiki, and Puppet and, most importantly, its servers. Leveraging preexisting talks with our new hosting provider, we were able to secure more servers at a far better cost (around the cost of our old OVH servers). These servers were a huge upgrade: we now had CPUs and RAM that weren't almost old enough to get a driving permit, and much, much faster NVMe drives.

As soon as the last server was delivered on 22 January 2024, we got right to work, beginning the Swift migration on 24 January, followed by the databases a day later on 25 January. We tried to make the migration as smooth as possible, but our servers weren't too happy about that.

"Let's go out with a bang", wait, no, not you db131!

Out of nowhere, on 26 January, db131 restarted. When it came back, it refused to accept connections, citing "an error reading communication packets". A single database server going down is bad enough, but this was even worse: mhglobal and metawiki, the two core global databases, lived on that server, meaning every single wiki was down as a consequence.

This was the last straw. Seeing how cryptic the error was, we decided to migrate everything over the "rugged" way (shutting down MySQL and copying over its data directory, rather than being fancy and dumping every database or setting up replication, which takes longer to arrange but would've meant less downtime had it actually worked) before something else crashed and burned or another disk failed. The migration started at around 6 AM UTC and finished in roughly 12 hours.
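In spirit, the "rugged" approach is just this (a Python sketch with made-up paths and a placeholder service name; the real migration was presumably plain shell copying the directory across the network):

```python
import shutil
import subprocess
from pathlib import Path

def rugged_copy(datadir: Path, dest: Path,
                stop_cmd=("systemctl", "stop", "mariadb")):
    """Stop the database server, then copy its data directory wholesale.

    Crude but effective: with MySQL/MariaDB cleanly stopped, the files on
    disk are consistent, so a single file copy captures every database at
    once. The trade-off is that the wikis stay down for the whole copy,
    which is why replication would normally be preferred.
    """
    subprocess.run(stop_cmd, check=True)           # flush and stop the server
    shutil.copytree(datadir, dest, dirs_exist_ok=True)
```

The same consistency argument is why copying the data directory of a *running* server is unsafe: pages still in the buffer pool would be missing from the copy.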

Enough textwalls! Is this better?

"For everyone, yes - on paper!" - John, J17

A few days later, the farm is at its zenith of speed. No slowdowns, no 5xx errors, nothing, so far. The server statistics agree: no server is running out of memory, workers, or space! It's smooth sailing for now, so yes, this is better, not only on paper but in practice!

Each server has 40 cores/80 threads at a 3.8 GHz clock speed, 256 GB of RAM, 4 TB of U.2 NVMe storage, and a 10 Gbps connection. This is a humongous improvement over our old servers; poor SCSVG.

Miraheze is finally in a state where our service isn't a laughing stock because of slow speeds. We're reaching a golden era like never before. We faced many challenges and climbed many mountains, but we are now at a point of stability like none other. Truly stable servers are the bedrock on which Miraheze can firmly stand. We can now recruit volunteers, handle more wikis, and be in a much more stable position thanks to this.

We would like to thank our donors, who have absolutely pulled through this past fundraiser. While we're not at our goal just yet, we've progressed enough to make these improvements. If we've been able to do so much already, imagine what we'll be able to do!

With your support, Miraheze will break barriers and become the best wiki-hosting service on the planet. We're not done with improvements yet; we want to improve waaaaaay more, like deploying CirrusSearch (update: already done! see T11743). Thank you, everyone, for sticking with us through thick and thin, and for being part of our big family!

Written by Universal_Omega on Feb 3 2024, 07:01.
Director of Site Reliability Engineering

Event Timeline

Now that is a well-written piece, and an honest one at that.
Normally you read excuses and promises of golden mountains.
Since I've only been a member for a few months, I haven't experienced all the hardships, and even though I see a lot of warning signs I would normally stay away from, I trust that the openness and honesty will stay, and as long as that is the case, I'll be staying as well.

I have experienced a willing mentality, as well as a core of people who always help to the best of their ability.
So thanks Team for making the hard decisions and making the transition as painless as possible.

Nice to see the AWS surprise-bill meme is more than just a meme :).

Glad to see you guys back around!

I'm very glad that after all the nightmares Miraheze had to go through, we made it out bigger, better, and stronger!
I didn't realize just how bad it was; I knew there was stuff going wrong, but I never thought it was like this!

Could I put this onto Miraheze Meta (on Community portal) for everyone else to see?
(I probably will if no one answers for a while, and don't worry, I'll credit)

Alright, I have put this on the community portal.