Status | Assigned | Task
---|---|---
Resolved | None | T12538 4 Sept PowerDNS Outage across on some servers
Open | Universal_Omega | T12539 Fix our ability to respond to outages that require Proxmox
Event Timeline
Tried rebooting some servers, but that doesn't seem to be a lasting solution. Not sure what the exact problem is. I suspect something network-related, but all I really have to go on is:
Sep 4 05:42:49 swiftproxy171 pdns-recursor[608]: msg="Failed to update . records" error="Too much time waiting for .|NS, timeouts: 5, throttles: 0, queries: 6, 7508msec" subsystem="housekeeping" level="0" prio="Error" tid="0" ts="1725428569.875" exception="ImmediateServFailException"
Sep 4 05:42:49 swiftproxy171 pdns-recursor[608]: msg="Failed to update . records" subsystem="housekeeping" level="0" prio="Warning" tid="0" ts="1725428569.875" rcode="-1"
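For anyone triaging similar entries, the `key="value"` fields in these log lines can be pulled apart with a few lines of Python. This is only a rough sketch: the regular expression and the `parse_fields` helper are illustrative and not part of PowerDNS, and it assumes values never contain an embedded double quote.

```python
import re
from datetime import datetime, timezone

# Matches key="value" pairs as they appear in the log line above.
# Assumption: values never contain an embedded double quote.
FIELD_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_fields(line: str) -> dict:
    """Return the key="value" pairs of one log line as a dict (hypothetical helper)."""
    return dict(FIELD_RE.findall(line))


line = (
    'Sep 4 05:42:49 swiftproxy171 pdns-recursor[608]: msg="Failed to update . records" '
    'error="Too much time waiting for .|NS, timeouts: 5, throttles: 0, queries: 6, 7508msec" '
    'subsystem="housekeeping" level="0" prio="Error" tid="0" ts="1725428569.875" '
    'exception="ImmediateServFailException"'
)

fields = parse_fields(line)

# The ts field is a Unix timestamp; convert it to an aware UTC datetime.
when = datetime.fromtimestamp(float(fields["ts"]), tz=timezone.utc)

# Pull the timeout and query counters out of the free-text error field.
counters = re.search(r"timeouts: (\d+).*queries: (\d+)", fields["error"])
timeouts, queries = (int(n) for n in counters.groups())

print(fields["prio"], fields["subsystem"], when.isoformat(), timeouts, queries)
```

Here 5 of 6 queries for the root NS set timed out, which points at upstream reachability rather than at the recursor itself.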
Finished rebooting affected servers and we seem to be all up now. Barring a proper post-mortem, this is hopefully resolved. cc @OrangeStar @Universal_Omega
Tentatively resolved again, waiting for confirmation from the datacenter on what exactly happened. They did say it was resolved, but I would like to continue monitoring for a bit to ensure services don't go back down.
@Void yep, this is resolved and we have an explanation of what happened from FiberState.
It was related to neighbor discovery on one of our distribution switches. We've made a change to the VLAN that should take care of it.
Closing this.
This is likely due to those servers losing IPv6 connectivity and thus being unable to query the DNS.
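One way to test that theory from an affected host is to send a single `. NS` query over IPv6 and see whether anything comes back. The sketch below is a hedged illustration using only the Python standard library; the function names are made up for this example, and the address in the usage note (`2001:503:ba3e::2:30`, a.root-servers.net) is an assumption about which server to probe, not something taken from our configuration.

```python
import socket
import struct


def build_root_ns_query(query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for the root NS set (". NS IN")."""
    # Header: ID, flags (all zero, no recursion desired), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # Question: root name (single zero byte), QTYPE=2 (NS), QCLASS=1 (IN).
    question = b"\x00" + struct.pack("!HH", 2, 1)
    return header + question


def can_query_over_ipv6(server: str, timeout: float = 2.0) -> bool:
    """Return True if `server` answers a root NS query over UDP/IPv6 in time."""
    query = build_root_ns_query()
    try:
        with socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(4096)
    except OSError:
        # Covers timeouts, "network unreachable", and missing IPv6 support.
        return False
    # A usable reply has at least a 12-byte header and echoes our query ID.
    return len(reply) >= 12 and reply[:2] == query[:2]
```

Calling `can_query_over_ipv6("2001:503:ba3e::2:30")` on a host that has lost IPv6 connectivity should return `False` once the timeout expires, while a healthy host should return `True`.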