
Fix swift logging a 404 as 500
Closed, Resolved (Public)

Description

In the logs, requests like:

/v1/AUTH_mw/miraheze-pokeclickerwiki-local-thumb/f/f1/ExplosiveCharge.png/15px-ExplosiveCharge.png

show as a 500 but are in fact either 404 or 429 (rate limited on the MW side).

Swift should therefore be logging the correct status.

Event Timeline

Paladox triaged this task as Normal priority. Feb 16 2023, 17:38

Could this be why T10434 failed? If so, I think this task is high priority given the timeline left on Feb's SLO reporting

So it appears that I got it wrong; at least, further testing shows the below:

curl -I https://pokeclicker.miraheze.org/w/thumb_handler.php/4/41/Oricorio_\(Pom-pom\).png/60px-Oricorio_\(Pom-pom\).png
HTTP/2 500

In Swift, we go to thumb_handler if the thumb lookup inside Swift returns 404 (i.e. it cannot find the thumb within its storage).
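The fallback described above can be sketched roughly as follows. This is only an illustration, not the actual rewrite.py code: `serve_thumb`, `storage_get` and `thumb_handler` are hypothetical names, and responses are reduced to `(status, body)` tuples. It shows why whatever status thumb_handler returns (429, 500, ...) is what ends up attributed to Swift.

```python
from typing import Callable, Tuple

# Simplified response: (HTTP status code, body). Hypothetical shape for illustration.
Response = Tuple[int, bytes]


def serve_thumb(
    path: str,
    storage_get: Callable[[str], Response],
    thumb_handler: Callable[[str], Response],
) -> Response:
    """Serve a thumbnail from storage, falling back to thumb_handler on 404."""
    status, body = storage_get(path)
    if status != 404:
        # Storage answered (hit, or some other error): return it as-is.
        return status, body
    # Storage does not have the thumb: ask MediaWiki to render it.
    # Whatever MediaWiki returns is passed through unchanged, so a 429 or
    # 500 from thumb_handler becomes the status seen (and logged) for Swift.
    return thumb_handler(path)
```

For example, with a storage stub that always returns 404 and a handler stub that returns 429, `serve_thumb` returns the 429 unchanged.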

OK, my terminal was escaping the URL, what the hell.

curl -I "https://pokeclicker.miraheze.org/w/thumb_handler.php/4/41/Oricorio_(Pom-pom).png/60px-Oricorio_(Pom-pom).png"
HTTP/2 429
curl -I "https://pokeclicker.miraheze.org/w/thumb_handler.php/4/41/Oricorio_(Pom-pom).png/62px-Oricorio_(Pom-pom).png"
HTTP/2 500

I'm not sure how we resolve this. The status is what MW actually returns, and it is passed through to Swift.

BrandonWM raised the priority of this task from Normal to High. Feb 17 2023, 05:12
BrandonWM subscribed.
In T10510#211692, @John wrote:

Could this be why T10434 failed? If so, I think this task is high priority given the timeline left on Feb's SLO reporting

Moving to high priority per John above.

So one proposal I’ve thought of is this:

  • Create two monitoring URLs in the rewrite module, one for the frontend and one for the backend. We shouldn’t be monitoring the MW backend, as the SLO is for Swift.
  • Somehow get this monitoring into Prometheus?

I was looking at the WMF module for some clues, which is what brought me to two monitoring URLs. See https://github.com/wikimedia/operations-puppet/blob/production/modules/swift/files/python3.9/SwiftMedia/wmf/rewrite.py#L343

Essentially, when the frontend URL is hit, the application returns a status of OK. The backend one is a bit more advanced: it calls out to the storage to ensure the storage works.
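A minimal sketch of that idea as WSGI middleware, assuming nothing about how the WMF module actually implements it. The paths `/monitoring/frontend` and `/monitoring/backend`, the class name, the 503 on failure, and the `storage_check` callable are all assumptions made for illustration.

```python
class MonitoringMiddleware:
    """WSGI middleware exposing two hypothetical health-check URLs."""

    def __init__(self, app, storage_check):
        self.app = app
        # storage_check: hypothetical callable returning True if storage works.
        self.storage_check = storage_check

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        if path == "/monitoring/frontend":
            # Frontend check: the middleware itself is up, nothing else is tested.
            return self._respond(start_response, "200 OK", b"OK\n")
        if path == "/monitoring/backend":
            # Backend check: call out to storage; MediaWiki is never involved.
            try:
                healthy = bool(self.storage_check())
            except Exception:
                healthy = False
            if healthy:
                return self._respond(start_response, "200 OK", b"OK\n")
            return self._respond(
                start_response, "503 Service Unavailable", b"storage check failed\n"
            )
        # Anything else goes to the wrapped application untouched.
        return self.app(environ, start_response)

    @staticmethod
    def _respond(start_response, status, body):
        headers = [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))]
        start_response(status, headers)
        return [body]
```

Because neither URL touches thumb_handler, a 429 or 500 from MW would not affect these checks, which matches the point that the SLO is for Swift.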

That URL is used by the WMF for Icinga. I’m not sure how we’d use it in Prometheus. Alternatively, we could change the status code if we get a 500 from thumb_handler. The WMF uses Thumbor now.

Thoughts, @John? I’m not sure if what I proposed is acceptable, as it’s more suited to Icinga than to Prometheus: it doesn’t really generate numbers, only whether something is up or down. Maybe you know how we could incorporate that into the SLO.
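One possible way to bridge the up/down result into Prometheus, offered only as a sketch: an up/down check can be exposed as a gauge with value 1 or 0 in the Prometheus text exposition format. The metric name `swift_monitoring_up`, the `check` label and the `render_gauge` helper are hypothetical, and label values are assumed to need no escaping.

```python
def render_gauge(name, value, labels=None, help_text=""):
    """Render one gauge sample in the Prometheus text exposition format.

    Assumes label values contain no quotes, backslashes or newlines.
    """
    label_str = ""
    if labels:
        pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    lines = []
    if help_text:
        lines.append(f"# HELP {name} {help_text}")
    lines.append(f"# TYPE {name} gauge")
    lines.append(f"{name}{label_str} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical usage: 1 if the backend check passed, 0 otherwise.
backend_ok = True
print(render_gauge("swift_monitoring_up", int(backend_ok), {"check": "backend"}))
```

A gauge like this could then be averaged over time to get an availability ratio, though whether that fits the existing SLO definition is an open question here.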

Another idea is to always return 200 for thumb_handler within rewrite.py (not sure if we fancy that?).

I've submitted this change: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/889623. I'm not sure if it will be accepted upstream.

John assigned this task to Paladox.

Change deployed locally