
Cloud11 SSD disk failure; swiftac111 down
Closed, ResolvedPublic

Description

It seems like there's an SSD disk failure on cloud11 causing the disk to be read only and various services like Swift to fail.
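One way to confirm the read-only symptom described above is to look for the `ro` option in the host's mount table. The following is a minimal sketch, not the procedure the responders used. It assumes a Linux host where `/proc/mounts` lists one mount per line as device, mount point, filesystem type, and comma-separated options. The device names in `SAMPLE` are made up for illustration and are not taken from cloud11.

```python
def read_only_mounts(mounts_text):
    """Return (mount_point, device) pairs whose mount options include 'ro'.

    Expects text in the /proc/mounts layout:
    device mount_point fstype options dump pass
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, _fstype, options = fields[:4]
        if "ro" in options.split(","):
            found.append((mount_point, device))
    return found


# Illustrative input only (hypothetical devices, not real incident data).
SAMPLE = """\
/dev/mapper/pve-root / ext4 ro,relatime 0 0
/dev/sda2 /boot ext4 rw,relatime 0 0
"""

for mount_point, device in read_only_mounts(SAMPLE):
    print(f"read-only: {mount_point} ({device})")
```

On a real host, the contents of `/proc/mounts` would be passed in instead of `SAMPLE`. Some filesystems are mounted read-only on purpose, so a hit is only a starting point for diagnosis.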

Event Timeline

Reception123 triaged this task as Unbreak Now! priority.Apr 11 2023, 11:46
Reception123 created this task.

Just a reminder: we ask all users to refrain from complaining about the issue or troubleshooting it here. See Meta's tech noticeboard for help with that; this thread is for technical work on the problem. Thanks to all, and apologies. Hopefully we can resolve the issue soon.

^
The change above, from what I read, gets rid of some 500 errors in an attempt to keep the platform stable while they fix Swift.

The affected disks have been successfully reconnected. Currently working out how to repair the LVM volumes on the drive, as pve/data is reporting as corrupted. It is currently unknown whether the swiftac server is affected. Both the Proxmox host and swiftproxy are replaceable, but swiftac is not.

Good. Diagnose swiftac, I think.

From the looks of things, swiftac is going to need to be rebuilt. Some of the data can be salvaged, but much of the VM was corrupted beyond the point of usability. I also believe that Proxmox will need to be reinstalled on cloud11, as the previous install is likely corrupted in a similar manner to swiftac.

Thank you, Void. I won't have to see any more Times New Roman font errors when I'm reading about logos.

Let's hope that this issue can be fixed ASAP!

We ask that non-technical members refrain from making comments on this task as 24 people are subscribed to this task and get notifications for every comment. Additionally, unnecessary comments may bury useful ones from system administrators so kindly refrain from commenting. Thank you.

I apologize. I just felt over-optimistic and wanted to cheer on! Thank you for letting me know.

Additionally, the GitHub comment was to explain to everyone what it does.

OrangeStar renamed this task from Cloud11 SSD disk failiure; swift proxies down to Cloud11 SSD disk failure; swiftac111 down.Apr 14 2023, 16:40

The operating system for cloud11 looks too corrupted for recovery, so @Paladox and I will be handling a reinstall sometime tomorrow.

Void lowered the priority of this task from Unbreak Now! to High.Apr 16 2023, 03:31

Cloud11 and swiftac111 have been reinstalled. It appears that Swift is rebuilding the account and container indexes, so it may take a long while before everything shows back up. Will be keeping an eye on it to see what happens. Tentatively going to say this might be resolved, but not all container databases are rebuilt, and it will take some time to know if anything is actually missing.

Ran a few tests with setContainersAccess.php and it looks like the script restores the containers without damaging any of the files. I think, after testing with meta and commons, that it is completely safe to run it against all wikis to restore everything without waiting on the containers to repopulate on their own.
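The rollout described above (test on a couple of wikis first, then run against everything) could be sketched roughly as follows. This is an illustration under stated assumptions, not the commands that were actually run. The database names `metawiki` and `commonswiki`, the bare script path, and the `php` binary name are placeholders. `--wiki=` is the standard MediaWiki maintenance option for selecting a wiki. The sketch only builds command lines and executes nothing.

```python
import shlex


def staged_order(all_wikis, canaries):
    """Put the already-tested wikis first, then every other wiki in order."""
    canary_set = set(canaries)
    first = [w for w in canaries if w in all_wikis]
    rest = [w for w in all_wikis if w not in canary_set]
    return first + rest


def build_commands(wikis, script="setContainersAccess.php", php="php"):
    """Return one argv list per wiki. Script path and php binary are placeholders."""
    commands = []
    for wiki in wikis:
        wiki = wiki.strip()
        if not wiki:
            continue  # ignore empty entries in the wiki list
        commands.append([php, script, f"--wiki={wiki}"])
    return commands


def render(commands):
    """Render argv lists as shell-safe strings for review before running."""
    return [shlex.join(argv) for argv in commands]


# Hypothetical wiki list; the real list would come from the farm's database list.
wikis = staged_order(
    ["examplewiki", "metawiki", "commonswiki"],
    canaries=["metawiki", "commonswiki"],
)
for line in render(build_commands(wikis)):
    print(line)
```

Printing the commands rather than running them keeps the sketch a dry run. Each line can be reviewed and then executed by whatever mechanism the operators normally use.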

In T10717#216495, @Void wrote:

Ran a few tests with setContainersAccess.php and it looks like the script restores the containers without damaging any of the files. I think, after testing with meta and commons, that it is completely safe to run it against all wikis to restore everything without waiting on the containers to repopulate on their own.

This has now completed.

RhinosF1 lowered the priority of this task from High to Normal.

Assigning to Reception123 to see why the l10n deploy didn’t log and confirm that the Special:Upload message was removed.

This then just needs an incident report (IR).

Script ran; leaving open for the IR.

It would be best for @Paladox and/or @Void to handle the IR as the main responders.

Just to note: this incident has now happened again.

Will be resolved by the new servers, no further action needed on this task.