Page MenuHomeMiraheze

Varnish should clean up ramdisk after failure
Closed, ResolvedPublic

Description

In parent task T7318, varnish crashed due to the OOM killer (cause = my work, shouldn't have done that). However, it doesn't look like Varnish recovers properly:

May 16 19:25:19 cp12.miraheze.org kernel: oom_reaper: reaped process 17048 (cache-main), now anon-rss:0kB, file-rss:0kB, shmem-rss:82104kB
May 16 19:25:19 cp12.miraheze.org varnishd[24046]: Child (17048) died signal=9
May 16 19:25:19 cp12.miraheze.org varnishd[24046]: Child cleanup complete
May 16 19:25:19 cp12.miraheze.org varnishd[24046]: Child (29936) Started
May 16 19:25:19 cp12.miraheze.org varnishd[24046]: Child (29936) not responding to CLI, killed it.
May 16 19:25:19 cp12.miraheze.org varnishd[24046]: Child (29936) Pushing vcls failed:
                                                   CLI communication error (hdr)
May 16 19:25:19 cp12.miraheze.org varnishd[24046]: Stopping Child
May 16 19:25:20 cp12.miraheze.org varnishd[24046]: Child (29936) died signal=6
May 16 19:25:20 cp12.miraheze.org varnishd[24046]: Child (29936) said Child starts
May 16 19:25:20 cp12.miraheze.org varnishd[24046]: Child (29936) said Assert error in vsmw_newcluster(), common/common_vsmw.c line 247:
May 16 19:25:20 cp12.miraheze.org varnishd[24046]: Child (29936) said   Condition((VFIL_allocate(fd, (off_t)len, 1)) == 0) not true.
May 16 19:25:20 cp12.miraheze.org varnishd[24046]: Child (29936) said   errno = 28 (No space left on device)
May 16 19:25:20 cp12.miraheze.org varnishd[24046]: Child (29936) said Wrong turn in child_signal_handler(), cache/cache_main.c line 284: Signal 6 (Aborted) received at 0x73000074f0 si_code
May 16 19:25:20 cp12.miraheze.org varnishd[24046]: Child (29936) said   errno = 28 (No space left on device)
May 16 19:25:20 cp12.miraheze.org varnishd[24046]: Child cleanup complete

Fix: ask systemd to clean up ramdisk before start?

Event Timeline

Southparkfan triaged this task as Normal priority.May 16 2021, 19:35
Southparkfan created this task.

Same issue as T6952. Questions; is ExecStopPost (rPUPC1b7d9bd772c8d7f9a7474e0e909485e0807ca6c9) executed after a failure? And does systemd see the cache-main crash as a failure?

Varnish has multiple threads: cache-main is one of them. An OOM'ing cache-main should be marked as a service failure by systemd, but if it isn't, it could explain some of the symptoms.

Gloves345 raised the priority of this task from Normal to Unbreak Now!.Jun 13 2021, 08:52
RhinosF1 lowered the priority of this task from Unbreak Now! to Normal.Jun 13 2021, 09:05
RhinosF1 raised the priority of this task from Normal to High.Aug 6 2021, 14:06

This needs doing. Another outage extended today by the mount not working (server was rebooted).

RhinosF1 added a parent task: T7761: cp13 down.

And I'm assuming why cp13 is down now. Removing SPF as he isn't working on this but is the cause of a now 16 hour outage.

John claimed this task.
John subscribed.

Hopefully with the above commit this time.

Void subscribed.

Both cp12 and cp13 OOM'd tonight and didn't restart cleanly. Logs suggest this issue isn't fixed.

Looking at grafana for cp13, when the software OOMs, the disk space reduces significantly immediately - which suggests there is proper disk clean up occurring. This is replicate on cp12 as well.

System logs show the child restarts, no errors are displayed. Logs show that ramdisk is clearing now.

Closing this task as resolved as there's no evidence that this is still a problem, but with lack of logs to suggest a new problem - this is stuck on waiting for it to occur again so it can be debugged in the live environment.