
Varnish (Project)
Active · Public

Members

  • This project does not have any members.

Watchers

  • This project does not have any watchers.

Details

Description

Project for issues relating to Varnish configuration and deployment.

Recent Activity

Yesterday

Collei updated the task description for T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.
Wed, Feb 28, 07:05 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Collei updated the task description for T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.
Wed, Feb 28, 07:04 · Infrastructure (SRE), Varnish, MediaWiki, Production Error

Mon, Feb 26

Xena added a comment to T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.

A user has reported it happening again; it's possible the issue wasn't fully resolved.

Mon, Feb 26, 17:00 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Collei added a comment to T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.

Sounds good

Mon, Feb 26, 05:13 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Dicto added a comment to T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.

Hmm, that's weird, but now I don't get Error 500 either when importing pages on gameshows or when editing with the code editor on chernowiki. Looks like the problem is actually resolved.

Mon, Feb 26, 04:13 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Collei added a comment to T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.

Visual editor being broken is already tracked in T11903. As for the other issues, can you reproduce this on any wikis other than that one?

Mon, Feb 26, 01:15 · Infrastructure (SRE), Varnish, MediaWiki, Production Error

Sun, Feb 25

Dicto added a comment to T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.

Still got Error 500 when trying to import pages on gameshows.miraheze.org

Sun, Feb 25, 23:52 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Agent_Isai closed T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions as Resolved.

Once again purged 13-16G of Varnish logs.

Sun, Feb 25, 13:54 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
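
For context on the resolution above: the 500s cleared once 13-16G of Varnish logs were purged, i.e. disk space on the cache proxies was the immediate problem. A minimal sketch of the kind of disk-usage check that could flag this earlier is below; the log path (/var/log/varnish) and the 80% threshold are illustrative assumptions, not values taken from the task.

```python
#!/usr/bin/env python3
"""Warn when the filesystem holding Varnish logs is nearly full.

Sketch only: the path and threshold below are assumptions, not values
taken from T11891.
"""
import shutil
import sys
from pathlib import Path

LOG_DIR = Path("/var/log/varnish")   # assumed log location
THRESHOLD = 0.80                      # warn above 80% disk usage (arbitrary)

def check(path: Path, threshold: float) -> int:
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total
    log_bytes = sum(f.stat().st_size for f in path.glob("*") if f.is_file())
    print(f"{path}: {used_fraction:.0%} of disk used, "
          f"{log_bytes / 1024**3:.1f} GiB of log files")
    return 1 if used_fraction > threshold else 0

if __name__ == "__main__":
    sys.exit(check(LOG_DIR, THRESHOLD))
```

Run periodically, the non-zero exit status could feed whatever alerting hook is already in place.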
Collei renamed T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions from 500 Internal Server Error - uploading images and editing pages to 500 Internal Server Error - uploading images, editing pages, and taking other actions.
Sun, Feb 25, 01:54 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Collei merged T11900: XML ImportDump feature gives a "500 Internal Server" error into T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.
Sun, Feb 25, 01:53 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Collei added a comment to T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.

Several Discord users have experienced this

Sun, Feb 25, 01:51 · Infrastructure (SRE), Varnish, MediaWiki, Production Error

Sat, Feb 24

RhinosF1 edited projects for T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions, added: Infrastructure (SRE); removed MediaWiki (SRE).
Sat, Feb 24, 22:47 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
RhinosF1 raised the priority of T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions from High to Unbreak Now!.
Sat, Feb 24, 22:46 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Collei updated the task description for T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.
Sat, Feb 24, 22:23 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Collei raised the priority of T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions from Normal to High.

After reviewing the Discord and Phabricator issues needing triage, I think this is probably a larger issue than I first assumed

Sat, Feb 24, 21:26 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Collei renamed T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions from 500 Internal Server Error - uploading images to 500 Internal Server Error - uploading images and editing pages.
Sat, Feb 24, 21:25 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Collei merged T11892: Can not make changes to page into T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.
Sat, Feb 24, 21:25 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Collei added a comment to T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.

To be clear, I lowered this to Normal because it only appears to be happening on some wikis and not all of their pages. Most functionality still works. Feel free to change it back if I'm wrong about this triage.

Sat, Feb 24, 21:19 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Collei merged T11899: 500 Internal Server Error into T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.
Sat, Feb 24, 21:18 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Collei added a comment to T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.

This is occurring again; see T11899

Sat, Feb 24, 21:17 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Collei reopened T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions as "Open".
Sat, Feb 24, 21:17 · Infrastructure (SRE), Varnish, MediaWiki, Production Error

Fri, Feb 23

Universal_Omega closed T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions as Resolved.
Fri, Feb 23, 20:57 · Infrastructure (SRE), Varnish, MediaWiki, Production Error

Thu, Feb 22

Universal_Omega added a comment to T11887: Vanish cannot be executed.

https://github.com/miraheze/MirahezeMagic/pull/469

Thu, Feb 22, 04:35 · Configuration, MediaWiki (SRE)
1108-Kiju renamed T11887: Vanish cannot be executed from Vanish cannot be executed to varnish cannot be executed.
Thu, Feb 22, 04:31 · Configuration, MediaWiki (SRE)
1108-Kiju triaged T11887: Vanish cannot be executed as Normal priority.
Thu, Feb 22, 04:30 · Configuration, MediaWiki (SRE)

Jan 24 2024

Agent_Isai closed T11052: Do something about broken twiter feed as Invalid.

503s no longer display a Twitter feed. They instead link to a static help page on GitHub Pages, which explains what may have happened and links to our social media and status page, so technically this is invalid?

Jan 24 2024, 00:34 · Varnish, Technical-Debt, Design
labster added a comment to T11052: Do something about broken twiter feed.

T&S exists now, and @Agent_Isai is likely the best person to approve what comes next.

Jan 24 2024, 00:28 · Varnish, Technical-Debt, Design

Oct 25 2023

OrangeStar added a comment to T11052: Do something about broken twiter feed.

If the problem is with CSP reviews, I'd argue emfed has a better shot than Facebook

Oct 25 2023, 14:10 · Varnish, Technical-Debt, Design

Oct 24 2023

MacFan4000 added a comment to T11052: Do something about broken twiter feed.

Replacing it with Mastodon is the easiest route, since you already have that up and running. A quick search brings up https://sampsyo.github.io/emfed/. I could write a PR including emfed from the jsdelivr cdn if wanted.

Oct 24 2023, 22:35 · Varnish, Technical-Debt, Design
OrangeStar added a comment to T11052: Do something about broken twiter feed.

Replacing it with Mastodon is the easiest route, since you already have that up and running. A quick search brings up https://sampsyo.github.io/emfed/. I could write a PR including emfed from the jsdelivr cdn if wanted.

Oct 24 2023, 19:22 · Varnish, Technical-Debt, Design

Sep 11 2023

Paladox closed T11207: Consider adding 'browsing-topics=()' to permission-policy header as Resolved.
Sep 11 2023, 13:11 · Infrastructure (SRE), Varnish, revi
RhinosF1 added projects to T11207: Consider adding 'browsing-topics=()' to permission-policy header: Varnish, Infrastructure (SRE).

Makes sense to me

Sep 11 2023, 11:59 · Infrastructure (SRE), Varnish, revi
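
T11207 above concerns adding browsing-topics=() to the Permissions-Policy response header, which opts pages out of Chrome's Topics API. A minimal sketch for verifying the header is actually being served, using only the Python standard library; the URL checked is an example, not taken from the task.

```python
#!/usr/bin/env python3
"""Check that a site sends Permissions-Policy: browsing-topics=().

Sketch only: the URL below is an example, not taken from T11207.
"""
import sys
from urllib.request import urlopen

URL = "https://meta.miraheze.org/"  # example URL, adjust as needed

def has_browsing_topics_opt_out(url: str) -> bool:
    with urlopen(url) as resp:
        policy = resp.headers.get("Permissions-Policy", "")
    # Directives in Permissions-Policy are comma-separated.
    directives = [d.strip() for d in policy.split(",")]
    return "browsing-topics=()" in directives

if __name__ == "__main__":
    ok = has_browsing_topics_opt_out(URL)
    print("browsing-topics=() present" if ok else "browsing-topics=() missing")
    sys.exit(0 if ok else 1)
```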

Aug 9 2023

PlanToSaveNoWork added a comment to T11133: [trash].

This is the truth: Miraheze is burning like a house and the supposed firefighters are sitting around relaxing, having a coffee, instead of helping

Aug 9 2023, 17:52 · Trash
PlanToSaveNoWork triaged T11133: [trash] as Unbreak Now! priority.
Aug 9 2023, 17:51 · Trash

Jul 11 2023

MacFan4000 triaged T11052: Do something about broken twiter feed as Normal priority.
Jul 11 2023, 19:04 · Varnish, Technical-Debt, Design

May 19 2023

MacFan4000 removed a member for Varnish: Southparkfan.
May 19 2023, 20:08

Apr 16 2022

RhinosF1 renamed T8983: 23 Mar 2022 DoS from 23 Mar 2022 DDoS to 23 Mar 2022 DoS.
Apr 16 2022, 09:17 · MediaWiki, Infrastructure (SRE), Varnish, Security

Mar 26 2022

John changed the visibility for T8983: 23 Mar 2022 DoS.
Mar 26 2022, 19:05 · MediaWiki, Infrastructure (SRE), Varnish, Security
John closed T8983: 23 Mar 2022 DoS as Resolved.

Spoke with @Paladox and no further action is needed on this task.

Mar 26 2022, 19:05 · MediaWiki, Infrastructure (SRE), Varnish, Security
John moved T8983: 23 Mar 2022 DoS from Incoming to Short Term on the Infrastructure (SRE) board.
Mar 26 2022, 17:14 · MediaWiki, Infrastructure (SRE), Varnish, Security
Reception123 added a comment to T8983: 23 Mar 2022 DoS.

@RhinosF1 Do we still need this task open since the incident has passed?

Mar 26 2022, 08:13 · MediaWiki, Infrastructure (SRE), Varnish, Security

Mar 23 2022

RhinosF1 added a comment to T8983: 23 Mar 2022 DoS.

NCSC are aware

Mar 23 2022, 23:04 · MediaWiki, Infrastructure (SRE), Varnish, Security
RhinosF1 assigned T8983: 23 Mar 2022 DoS to Paladox.

Blocked at firewall level globally; let's keep an eye on it.

Mar 23 2022, 22:48 · MediaWiki, Infrastructure (SRE), Varnish, Security
RhinosF1 raised the priority of T8983: 23 Mar 2022 DoS from High to Unbreak Now!.
Mar 23 2022, 21:36 · MediaWiki, Infrastructure (SRE), Varnish, Security

Mar 14 2022

RhinosF1 edited projects for T8930: Persistent 503s on multiple wikis, including `metawiki`, added: Database, Infrastructure (SRE); removed MediaWiki (SRE).

00:08:29 <JohnLewis> dmehus: yeah, IO on cloud11's SSDs is pretty high because of piwik db migration

Mar 14 2022, 07:26 · MediaWiki (SRE), MediaWiki
RhinosF1 added a comment to T8930: Persistent 503s on multiple wikis, including `metawiki`.

php-fpm looks to be struggling to keep up again.

Mar 14 2022, 06:54 · MediaWiki (SRE), MediaWiki
RobLa added a comment to T8930: Persistent 503s on multiple wikis, including `metawiki`.

When I tried to reach https://robla.miraheze.org about 20 minutes ago, I received the following error:

Mar 14 2022, 04:30 · MediaWiki (SRE), MediaWiki
Dmehus added a comment to T8930: Persistent 503s on multiple wikis, including `metawiki`.
PROBLEM - matomo101 PowerDNS Recursor on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
20:19 
PROBLEM - test101 Current Load on test101 is CRITICAL: CRITICAL - load average: 2.09, 2.06, 1.84
20:20 
RECOVERY - matomo101 SSH on matomo101 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0)
20:20 
RECOVERY - matomo101 PowerDNS Recursor on matomo101 is OK: DNS OK: 3.172 seconds response time. miraheze.org returns 198.244.148.90,2001:41d0:801:2000::1b80,2001:41d0:801:2000::4c25,51.195.220.68
20:20 
PROBLEM - db101 Current Load on db101 is CRITICAL: CRITICAL - load average: 8.22, 7.19, 6.98
20:21 
RECOVERY - cp30 Stunnel HTTP for mw101 on cp30 is OK: HTTP OK: HTTP/1.1 200 OK - 14562 bytes in 0.312 second response time
20:21 
RECOVERY - cp31 Varnish Backends on cp31 is OK: All 12 backends are healthy
20:21 
PROBLEM - test101 Current Load on test101 is WARNING: WARNING - load average: 1.07, 1.69, 1.73
20:22 
PROBLEM - db101 Current Load on db101 is WARNING: WARNING - load average: 6.24, 6.69, 6.81
20:23 
RECOVERY - cp30 Varnish Backends on cp30 is OK: All 12 backends are healthy
20:23 
RECOVERY - test101 Current Load on test101 is OK: OK - load average: 1.10, 1.50, 1.66
20:25 
PROBLEM - matomo101 PowerDNS Recursor on matomo101 is CRITICAL: CRITICAL - Plugin timed out while executing system call
20:26 <dmehus> Doug 
!sre
20:26 <icinga-miraheze> IRC echo bot 
PROBLEM - matomo101 SSH on matomo101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
20:26 
RECOVERY - db101 Current Load on db101 is OK: OK - load average: 5.97, 6.63, 6.77
20:27 
PROBLEM - cp30 Stunnel HTTP for mw101 on cp30 is CRITICAL: HTTP CRITICAL - No data received from host
20:27 
PROBLEM - cp31 Stunnel HTTP for phab121 on cp31 is CRITICAL: HTTP CRITICAL - No data received from host
20:27 
RECOVERY - matomo101 PowerDNS Recursor on matomo101 is OK: DNS OK: 1.177 second response time. miraheze.org returns 198.244.148.90,2001:41d0:801:2000::1b80,2001:41d0:801:2000::4c25,51.195.220.68
20:27 
PROBLEM - cp30 Stunnel HTTP for mw111 on cp30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
20:27 
PROBLEM - cp21 Stunnel HTTP for mw122 on cp21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
20:27 
PROBLEM - cp21 Stunnel HTTP for mw111 on cp21 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 328 bytes in 0.011 second response time
20:27 
PROBLEM - cp31 Stunnel HTTP for mw111 on cp31 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 328 bytes in 0.317 second response time
20:27 
PROBLEM - cp20 Stunnel HTTP for mw111 on cp20 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 328 bytes in 0.012 second response time
20:27 
PROBLEM - mw111 MediaWiki Rendering on mw111 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 1595 bytes in 0.008 second response time
20:28 <dmehus> Doug 
Can reproduce the above persistently
20:28 <icinga-miraheze> IRC echo bot 
PROBLEM - cp31 Stunnel HTTP for mw122 on cp31 is CRITICAL: HTTP CRITICAL - No data received from host
20:28 
PROBLEM - matomo101 conntrack_table_size on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
20:28 
PROBLEM - cp30 Stunnel HTTP for mw122 on cp30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds

^ Additional icinga alerts

Mar 14 2022, 03:29 · MediaWiki (SRE), MediaWiki
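
The Icinga paste above (and the longer one below) follows a regular "PROBLEM/RECOVERY - <host> <service> on <host> is <STATE>: ..." shape, so it can be summarised mechanically rather than read line by line. A minimal sketch that tallies which host/service pairs flapped most, assuming the paste has been saved to a local file (alerts.txt is a placeholder name):

```python
#!/usr/bin/env python3
"""Summarise pasted Icinga PROBLEM/RECOVERY lines by host and service.

Sketch only: 'alerts.txt' is a placeholder for a file containing a paste
like the ones in the comments on T8930.
"""
import re
from collections import Counter

# e.g. "PROBLEM - mw112 Current Load on mw112 is CRITICAL: ..."
LINE_RE = re.compile(r"^(PROBLEM|RECOVERY) - (\S+) (.+?) on \S+ is (\w+):")

def summarise(path: str = "alerts.txt") -> Counter:
    counts: Counter = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            m = LINE_RE.match(line.strip())
            if m:
                kind, host, service, state = m.groups()
                counts[(host, service, kind)] += 1
    return counts

if __name__ == "__main__":
    for (host, service, kind), n in summarise().most_common(10):
        print(f"{n:3d}  {kind:8s} {host:12s} {service}")
```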
Dmehus added a comment to T8930: Persistent 503s on multiple wikis, including `metawiki`.
<icinga-miraheze> IRC echo bot 
RECOVERY - mw121 Current Load on mw121 is OK: OK - load average: 6.68, 8.41, 8.47
18:17 
PROBLEM - db112 Current Load on db112 is WARNING: WARNING - load average: 5.19, 5.81, 5.32
18:17 
RECOVERY - mw112 Current Load on mw112 is OK: OK - load average: 6.64, 7.32, 8.32
18:19 
PROBLEM - db112 Current Load on db112 is CRITICAL: CRITICAL - load average: 6.57, 6.00, 5.44
18:19 
RECOVERY - gluster101 Current Load on gluster101 is OK: OK - load average: 3.19, 3.18, 3.16
18:20 
alerting : [FIRING:1] (PHP-FPM Worker Usage High yes mediawiki) https://grafana.miraheze.org/d/dsHv5-4nz/mediawiki
18:20 
RECOVERY - mw111 Current Load on mw111 is OK: OK - load average: 7.34, 7.62, 8.49
18:21 
PROBLEM - mw112 Current Load on mw112 is WARNING: WARNING - load average: 8.84, 8.74, 8.71
18:23 
PROBLEM - db112 Current Load on db112 is WARNING: WARNING - load average: 2.62, 4.79, 5.11
18:24 
PROBLEM - mw111 Current Load on mw111 is WARNING: WARNING - load average: 8.70, 8.49, 8.68
18:25 
RECOVERY - db112 Current Load on db112 is OK: OK - load average: 2.99, 4.25, 4.87
18:27 
→ darkmatterman450 joined (~darkmatte@user/darkmatterman450)
18:27 <icinga-miraheze> IRC echo bot 
PROBLEM - mw112 Current Load on mw112 is CRITICAL: CRITICAL - load average: 10.21, 9.05, 8.85
18:28 
PROBLEM - mw111 Current Load on mw111 is CRITICAL: CRITICAL - load average: 10.69, 9.69, 9.15
18:29 
PROBLEM - mw112 Current Load on mw112 is WARNING: WARNING - load average: 9.93, 9.14, 8.90
18:30 
PROBLEM - mw111 Current Load on mw111 is WARNING: WARNING - load average: 7.77, 9.17, 9.04
18:30 
PROBLEM - mw121 Current Load on mw121 is WARNING: WARNING - load average: 8.33, 8.92, 8.52
18:31 
PROBLEM - mw112 Current Load on mw112 is CRITICAL: CRITICAL - load average: 10.86, 9.82, 9.19
18:34 
PROBLEM - mw111 Current Load on mw111 is CRITICAL: CRITICAL - load average: 11.45, 10.20, 9.45
18:34 
PROBLEM - mw121 Current Load on mw121 is CRITICAL: CRITICAL - load average: 10.20, 9.44, 8.79
18:36 
PROBLEM - mw111 Current Load on mw111 is WARNING: WARNING - load average: 9.32, 9.86, 9.42
18:36 
PROBLEM - mw121 Current Load on mw121 is WARNING: WARNING - load average: 7.81, 8.99, 8.72
18:41 
PROBLEM - mw112 Current Load on mw112 is WARNING: WARNING - load average: 8.68, 9.52, 9.53
18:42 
PROBLEM - mw102 Current Load on mw102 is WARNING: WARNING - load average: 8.73, 7.72, 6.97
18:43 
PROBLEM - mw112 Current Load on mw112 is CRITICAL: CRITICAL - load average: 10.80, 10.19, 9.78
18:44 
RECOVERY - mw102 Current Load on mw102 is OK: OK - load average: 7.10, 7.53, 7.00
18:44 
PROBLEM - mw122 Current Load on mw122 is CRITICAL: CRITICAL - load average: 10.86, 8.60, 8.05
18:45 
PROBLEM - mw112 Current Load on mw112 is WARNING: WARNING - load average: 9.56, 9.96, 9.76
18:47 
PROBLEM - mw112 Current Load on mw112 is CRITICAL: CRITICAL - load average: 11.72, 10.28, 9.87
18:48 
PROBLEM - mw122 Current Load on mw122 is WARNING: WARNING - load average: 9.66, 9.10, 8.36
18:50 
PROBLEM - cp31 Current Load on cp31 is CRITICAL: CRITICAL - load average: 2.54, 1.96, 1.29
18:51 
PROBLEM - cp30 Current Load on cp30 is WARNING: WARNING - load average: 1.75, 1.65, 1.28
18:52 
PROBLEM - cp30 Stunnel HTTP for matomo101 on cp30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:52 
PROBLEM - cp21 Stunnel HTTP for matomo101 on cp21 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 358 bytes in 0.157 second response time
18:52 
PROBLEM - matomo101 Current Load on matomo101 is CRITICAL: CRITICAL - load average: 20.08, 9.20, 4.29
18:52 
PROBLEM - cp31 Stunnel HTTP for matomo101 on cp31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:52 
RECOVERY - mw111 Current Load on mw111 is OK: OK - load average: 7.08, 7.80, 8.46
18:52 
RECOVERY - cp31 Current Load on cp31 is OK: OK - load average: 1.10, 1.63, 1.25
18:53 
PROBLEM - db101 Current Load on db101 is CRITICAL: CRITICAL - load average: 9.95, 7.83, 6.37
18:53 
PROBLEM - matomo101 HTTPS on matomo101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:53 
PROBLEM - matomo101 PowerDNS Recursor on matomo101 is CRITICAL: CRITICAL - Plugin timed out while executing system call
18:53 
PROBLEM - cp20 Stunnel HTTP for matomo101 on cp20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:53 
RECOVERY - cp30 Current Load on cp30 is OK: OK - load average: 1.21, 1.54, 1.29
18:53 
PROBLEM - mw112 Current Load on mw112 is WARNING: WARNING - load average: 8.99, 9.74, 9.82
18:54 
RECOVERY - mw122 Current Load on mw122 is OK: OK - load average: 5.58, 7.61, 8.01
18:55 
RECOVERY - matomo101 PowerDNS Recursor on matomo101 is OK: DNS OK: 2.725 seconds response time. miraheze.org returns 198.244.148.90,2001:41d0:801:2000::1b80,2001:41d0:801:2000::4c25,51.195.220.68
18:57 
PROBLEM - cp31 Varnish Backends on cp31 is CRITICAL: 1 backends are down. mw111
18:58 
PROBLEM - matomo101 Redis Process on matomo101 is CRITICAL: PROCS CRITICAL: 0 processes with args 'redis-server'
18:58 
PROBLEM - mw102 Current Load on mw102 is WARNING: WARNING - load average: 8.76, 8.09, 7.53
18:58 
RECOVERY - mw121 Current Load on mw121 is OK: OK - load average: 6.47, 7.93, 8.49
18:59 
PROBLEM - db101 Current Load on db101 is WARNING: WARNING - load average: 6.51, 7.73, 6.89
18:59 
RECOVERY - cp31 Varnish Backends on cp31 is OK: All 12 backends are healthy
19:00 
RECOVERY - matomo101 Redis Process on matomo101 is OK: PROCS OK: 1 process with args 'redis-server'
19:00 
PROBLEM - matomo101 SSH on matomo101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:00 
RECOVERY - mw102 Current Load on mw102 is OK: OK - load average: 8.47, 8.20, 7.64
19:01 
PROBLEM - db101 Current Load on db101 is CRITICAL: CRITICAL - load average: 9.59, 8.57, 7.31
19:01 
PROBLEM - test101 Current Load on test101 is CRITICAL: CRITICAL - load average: 2.07, 1.82, 1.51
19:02 
PROBLEM - matomo101 PowerDNS Recursor on matomo101 is CRITICAL: CRITICAL - Plugin timed out while executing system call
19:03 
PROBLEM - matomo101 NTP time on matomo101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:03 
PROBLEM - test101 Current Load on test101 is WARNING: WARNING - load average: 1.51, 1.73, 1.52
19:03 
PROBLEM - matomo101 Puppet on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:05 
PROBLEM - matomo101 conntrack_table_size on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:05 
RECOVERY - mw112 Current Load on mw112 is OK: OK - load average: 5.63, 6.85, 8.34
19:05 
PROBLEM - gluster101 Current Load on gluster101 is CRITICAL: CRITICAL - load average: 4.93, 4.15, 3.36
19:05 
PROBLEM - test101 Current Load on test101 is CRITICAL: CRITICAL - load average: 2.16, 1.89, 1.60
19:05 
PROBLEM - gluster111 Current Load on gluster111 is CRITICAL: CRITICAL - load average: 4.66, 3.35, 2.84
19:06 
PROBLEM - cp30 Stunnel HTTP for test101 on cp30 is CRITICAL: HTTP CRITICAL - No data received from host
19:06 
PROBLEM - cp30 Stunnel HTTP for mw121 on cp30 is CRITICAL: HTTP CRITICAL - No data received from host
19:07 
PROBLEM - gluster101 Current Load on gluster101 is WARNING: WARNING - load average: 3.16, 3.74, 3.31
19:07 
PROBLEM - gluster111 Current Load on gluster111 is WARNING: WARNING - load average: 3.91, 3.31, 2.87
19:08 
PROBLEM - matomo101 ferm_active on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:08 
RECOVERY - cp30 Stunnel HTTP for test101 on cp30 is OK: HTTP OK: HTTP/1.1 200 OK - 14564 bytes in 0.338 second response time
19:08 
RECOVERY - cp30 Stunnel HTTP for mw121 on cp30 is OK: HTTP OK: HTTP/1.1 200 OK - 14556 bytes in 0.852 second response time
19:08 
PROBLEM - ns2 GDNSD Datacenters on ns2 is CRITICAL: CRITICAL - 2 datacenters are down: 149.56.140.43/cpweb, 2607:5300:201:3100::929a/cpweb
19:09 
PROBLEM - gluster121 Current Load on gluster121 is CRITICAL: CRITICAL - load average: 4.93, 4.00, 3.10
19:09 
PROBLEM - cp30 Varnish Backends on cp30 is CRITICAL: 7 backends are down. mw101 mw102 mw111 mw112 mw121 mw122 mediawiki
19:09 
PROBLEM - ns1 GDNSD Datacenters on ns1 is CRITICAL: CRITICAL - 2 datacenters are down: 149.56.140.43/cpweb, 2607:5300:201:3100::929a/cpweb
19:09 
PROBLEM - matomo101 Redis Process on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:09 
PROBLEM - gluster101 Current Load on gluster101 is CRITICAL: CRITICAL - load average: 4.99, 4.13, 3.49
19:09 
PROBLEM - matomo101 Disk Space on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:09 
PROBLEM - test101 Current Load on test101 is WARNING: WARNING - load average: 1.69, 1.87, 1.67
19:09 
PROBLEM - matomo101 php-fpm on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:09 
PROBLEM - gluster111 Current Load on gluster111 is CRITICAL: CRITICAL - load average: 4.20, 3.56, 3.01
19:10 
RECOVERY - ns2 GDNSD Datacenters on ns2 is OK: OK - all datacenters are online
19:11 
RECOVERY - cp30 Varnish Backends on cp30 is OK: All 12 backends are healthy
19:11 
RECOVERY - ns1 GDNSD Datacenters on ns1 is OK: OK - all datacenters are online
19:11 
PROBLEM - gluster111 Current Load on gluster111 is WARNING: WARNING - load average: 3.54, 3.78, 3.18
19:11 
PROBLEM - gluster101 Current Load on gluster101 is WARNING: WARNING - load average: 2.38, 3.59, 3.38
19:11 
RECOVERY - test101 Current Load on test101 is OK: OK - load average: 1.37, 1.70, 1.63
19:11 
RECOVERY - matomo101 Disk Space on matomo101 is OK: DISK OK - free space: / 1205 MB (12% inode=80%);
19:11 
RECOVERY - matomo101 Puppet on matomo101 is OK: OK: Puppet is currently enabled, last run 51 minutes ago with 0 failures
19:11 
RECOVERY - matomo101 php-fpm on matomo101 is OK: PROCS OK: 5 processes with command name 'php-fpm7.4'
19:11 
RECOVERY - matomo101 Redis Process on matomo101 is OK: PROCS OK: 1 process with args 'redis-server'
19:11 
RECOVERY - matomo101 NTP time on matomo101 is OK: NTP OK: Offset -0.005568474531 secs
19:12 
RECOVERY - matomo101 SSH on matomo101 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0)
19:13 
RECOVERY - matomo101 ferm_active on matomo101 is OK: OK ferm input default policy is set
19:13 
RECOVERY - matomo101 conntrack_table_size on matomo101 is OK: OK: nf_conntrack is 0 % full
19:13 
PROBLEM - db101 Current Load on db101 is WARNING: WARNING - load average: 6.85, 7.63, 7.55
19:13 
RECOVERY - cp30 Stunnel HTTP for matomo101 on cp30 is OK: HTTP OK: HTTP/1.1 200 OK - 66463 bytes in 3.464 second response time
19:13 
RECOVERY - cp21 Stunnel HTTP for matomo101 on cp21 is OK: HTTP OK: HTTP/1.1 200 OK - 66463 bytes in 0.889 second response time
19:13 
RECOVERY - matomo101 PowerDNS Recursor on matomo101 is OK: DNS OK: 0.811 seconds response time. miraheze.org returns 198.244.148.90,2001:41d0:801:2000::1b80,2001:41d0:801:2000::4c25,51.195.220.68
19:13 
PROBLEM - gluster111 Current Load on gluster111 is CRITICAL: CRITICAL - load average: 5.26, 4.24, 3.41
19:13 
RECOVERY - cp20 Stunnel HTTP for matomo101 on cp20 is OK: HTTP OK: HTTP/1.1 200 OK - 66463 bytes in 0.705 second response time
19:13 
PROBLEM - gluster101 Current Load on gluster101 is CRITICAL: CRITICAL - load average: 4.51, 3.95, 3.54
19:13 
RECOVERY - cp31 Stunnel HTTP for matomo101 on cp31 is OK: HTTP OK: HTTP/1.1 200 OK - 66463 bytes in 0.663 second response time
19:14 
RECOVERY - matomo101 HTTPS on matomo101 is OK: HTTP OK: HTTP/1.1 200 OK - 66479 bytes in 1.038 second response time
19:14 
PROBLEM - gluster121 Current Load on gluster121 is WARNING: WARNING - load average: 3.53, 3.88, 3.40
19:15 
PROBLEM - gluster111 Current Load on gluster111 is WARNING: WARNING - load average: 2.95, 3.69, 3.31
19:15 
ok : [RESOLVED] (PHP-FPM Worker Usage High yes mediawiki) https://grafana.miraheze.org/d/dsHv5-4nz/mediawiki
19:16 
PROBLEM - mw111 Current Load on mw111 is WARNING: WARNING - load average: 9.25, 8.08, 7.21
19:16 
PROBLEM - gluster121 Current Load on gluster121 is CRITICAL: CRITICAL - load average: 8.47, 4.85, 3.77
19:17 
PROBLEM - gluster101 Current Load on gluster101 is WARNING: WARNING - load average: 3.26, 3.93, 3.66
19:18 
RECOVERY - mw111 Current Load on mw111 is OK: OK - load average: 7.62, 7.88, 7.25
19:19 
PROBLEM - gluster111 Current Load on gluster111 is CRITICAL: CRITICAL - load average: 4.06, 3.89, 3.45
19:19 
PROBLEM - test101 Current Load on test101 is WARNING: WARNING - load average: 1.77, 1.64, 1.59
19:20 
PROBLEM - mw112 Current Load on mw112 is WARNING: WARNING - load average: 8.11, 8.72, 8.18
19:20 
PROBLEM - mw121 Current Load on mw121 is WARNING: WARNING - load average: 9.69, 8.68, 7.66
19:21 
RECOVERY - db101 Current Load on db101 is OK: OK - load average: 6.29, 6.11, 6.73
19:21 
PROBLEM - gluster111 Current Load on gluster111 is WARNING: WARNING - load average: 2.87, 3.62, 3.41
19:21 
alerting : [FIRING:1] (PHP-FPM Worker Usage High yes mediawiki) https://grafana.miraheze.org/d/dsHv5-4nz/mediawiki
19:21 
RECOVERY - test101 Current Load on test101 is OK: OK - load average: 1.21, 1.50, 1.55
19:22 
PROBLEM - cp31 Stunnel HTTP for matomo101 on cp31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:22 
RECOVERY - mw112 Current Load on mw112 is OK: OK - load average: 6.77, 8.29, 8.10
19:22 
RECOVERY - mw121 Current Load on mw121 is OK: OK - load average: 5.47, 7.77, 7.48
19:23 
RECOVERY - gluster111 Current Load on gluster111 is OK: OK - load average: 2.23, 3.29, 3.32
19:23 
PROBLEM - cp30 Stunnel HTTP for matomo101 on cp30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:23 
PROBLEM - cp20 Stunnel HTTP for matomo101 on cp20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:23 
PROBLEM - matomo101 HTTPS on matomo101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:23 
PROBLEM - cp21 Stunnel HTTP for matomo101 on cp21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:24 
PROBLEM - matomo101 PowerDNS Recursor on matomo101 is CRITICAL: CRITICAL - Plugin timed out while executing system call
19:24 
PROBLEM - gluster121 Current Load on gluster121 is WARNING: WARNING - load average: 1.87, 3.56, 3.75
19:24 
PROBLEM - matomo101 SSH on matomo101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:24 
PROBLEM - cp30 Stunnel HTTP for mail121 on cp30 is CRITICAL: HTTP CRITICAL - No data received from host
19:24 
PROBLEM - cp31 Stunnel HTTP for mon111 on cp31 is CRITICAL: HTTP CRITICAL - No data received from host
19:25 
PROBLEM - db101 Current Load on db101 is WARNING: WARNING - load average: 7.20, 6.62, 6.79
19:25 
PROBLEM - cp20 Stunnel HTTP for mw111 on cp20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:25 
PROBLEM - cp21 Stunnel HTTP for mw121 on cp21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:25 
PROBLEM - cp20 Stunnel HTTP for mw121 on cp20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:25 
PROBLEM - cp30 Stunnel HTTP for mw111 on cp30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:25 
RECOVERY - gluster101 Current Load on gluster101 is OK: OK - load average: 1.47, 2.91, 3.37
19:25 
PROBLEM - cp30 Stunnel HTTP for phab121 on cp30 is CRITICAL: HTTP CRITICAL - No data received from host
19:26 
PROBLEM - matomo101 NTP time on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:26 
PROBLEM - cp20 Stunnel HTTP for mw101 on cp20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:26 
PROBLEM - matomo101 Puppet on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:26 
RECOVERY - cp30 Stunnel HTTP for mail121 on cp30 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 427 bytes in 0.241 second response time
19:27 
RECOVERY - db101 Current Load on db101 is OK: OK - load average: 6.06, 6.37, 6.67
19:27 
PROBLEM - cp30 Varnish Backends on cp30 is CRITICAL: 3 backends are down. mw101 mw102 mw122
19:27 
RECOVERY - cp20 Stunnel HTTP for mw111 on cp20 is OK: HTTP OK: HTTP/1.1 200 OK - 14562 bytes in 7.008 second response time
19:27 
RECOVERY - cp21 Stunnel HTTP for mw121 on cp21 is OK: HTTP OK: HTTP/1.1 200 OK - 14556 bytes in 7.478 second response time
19:27 
RECOVERY - cp20 Stunnel HTTP for mw121 on cp20 is OK: HTTP OK: HTTP/1.1 200 OK - 14556 bytes in 7.526 second response time
19:27 
RECOVERY - cp30 Stunnel HTTP for mw111 on cp30 is OK: HTTP OK: HTTP/1.1 200 OK - 14562 bytes in 5.210 second response time
19:27 
PROBLEM - cp31 Varnish Backends on cp31 is CRITICAL: 5 backends are down. mw102 mw111 mw112 mw121 mw122
19:27 
RECOVERY - cp30 Stunnel HTTP for phab121 on cp30 is OK: HTTP OK: Status line output matched "500" - 2855 bytes in 0.353 second response time
19:28 
PROBLEM - cp31 Stunnel HTTP for mw101 on cp31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:28 
PROBLEM - cp21 Stunnel HTTP for mw101 on cp21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:28 
RECOVERY - matomo101 NTP time on matomo101 is OK: NTP OK: Offset -0.006324976683 secs
19:28 
PROBLEM - mw101 MediaWiki Rendering on mw101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:28 
PROBLEM - cp30 Stunnel HTTP for mw101 on cp30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:28 
RECOVERY - cp31 Stunnel HTTP for mon111 on cp31 is OK: HTTP OK: HTTP/1.1 200 OK - 33915 bytes in 1.185 second response time
19:29 
RECOVERY - cp30 Varnish Backends on cp30 is OK: All 12 backends are healthy
19:29 
PROBLEM - cp20 Varnish Backends on cp20 is CRITICAL: 1 backends are down. mw101
19:30 
RECOVERY - gluster121 Current Load on gluster121 is OK: OK - load average: 2.30, 2.80, 3.33
19:31 
PROBLEM - db101 Current Load on db101 is WARNING: WARNING - load average: 6.54, 6.84, 6.83
19:33 
RECOVERY - db101 Current Load on db101 is OK: OK - load average: 6.42, 6.67, 6.77
19:33 
RECOVERY - cp20 Varnish Backends on cp20 is OK: All 12 backends are healthy
19:34 
RECOVERY - cp21 Stunnel HTTP for mw101 on cp21 is OK: HTTP OK: HTTP/1.1 200 OK - 14556 bytes in 3.751 second response time
19:34 
PROBLEM - cp31 Stunnel HTTP for mw122 on cp31 is CRITICAL: HTTP CRITICAL - No data received from host
19:34 
PROBLEM - cp31 Stunnel HTTP for mw112 on cp31 is CRITICAL: HTTP CRITICAL - No data received from host
19:34 
RECOVERY - mw101 MediaWiki Rendering on mw101 is OK: HTTP OK: HTTP/1.1 200 OK - 22336 bytes in 3.518 second response time
19:34 
PROBLEM - cp30 Stunnel HTTP for mw112 on cp30 is CRITICAL: HTTP CRITICAL - No data received from host
19:34 
PROBLEM - matomo101 conntrack_table_size on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:34 
RECOVERY - cp30 Stunnel HTTP for mw101 on cp30 is OK: HTTP OK: HTTP/1.1 200 OK - 14562 bytes in 0.313 second response time
19:34 
RECOVERY - cp20 Stunnel HTTP for mw101 on cp20 is OK: HTTP OK: HTTP/1.1 200 OK - 14562 bytes in 0.015 second response time
19:35 
PROBLEM - cp30 Varnish Backends on cp30 is CRITICAL: 7 backends are down. mw101 mw102 mw111 mw112 mw121 mw122 mediawiki
19:35 
PROBLEM - cp31 Stunnel HTTP for phab121 on cp31 is CRITICAL: HTTP CRITICAL - No data received from host
19:35 <dmehus> Doug 
SRE: persistent 503s on multiple wikis
19:35 <icinga-miraheze> IRC echo bot 
RECOVERY - cp31 Stunnel HTTP for mw101 on cp31 is OK: HTTP OK: HTTP/1.1 200 OK - 14562 bytes in 0.312 second response time
19:36 
RECOVERY - cp31 Stunnel HTTP for mw112 on cp31 is OK: HTTP OK: HTTP/1.1 200 OK - 14562 bytes in 0.325 second response time
19:36 
RECOVERY - cp31 Stunnel HTTP for mw122 on cp31 is OK: HTTP OK: HTTP/1.1 200 OK - 14556 bytes in 3.995 second response time
19:36 
RECOVERY - cp30 Stunnel HTTP for mw112 on cp30 is OK: HTTP OK: HTTP/1.1 200 OK - 14562 bytes in 0.358 second response time
Mar 14 2022, 02:37 · MediaWiki (SRE), MediaWiki
Dmehus triaged T8930: Persistent 503s on multiple wikis, including `metawiki` as Unbreak Now! priority.
Mar 14 2022, 02:36 · MediaWiki (SRE), MediaWiki