Service disruption on 25 March 2025

Posted: | Tags: status homelab linux networking

Last month, my homelab experienced an incident that resulted in a service disruption of 3 hours and 8 minutes, and because I like to cosplay as a sysadmin in my free time, this is a post-incident analysis. The effects of the disruption were mostly external, mainly affecting the main site and ytrss; internal systems remained unaffected.

A virtual machine hosted on Linode needed to be moved between physical hosts for urgent, unplanned maintenance. Similar migrations have happened in the past, both as planned and unplanned events, so I had opportunities to prepare, yet the host still failed to recover automatically this time. Improvements made to my infrastructure and deployment strategy over the years kept this disruption shorter than most, but it was still longer than expected.

Background

Most services are hosted on the internal network at home, and any communication to or from an external network goes through a jump server on Linode that also acts as a web server. This server is called armona, named after a Portuguese island. The internal hosts add armona as a WireGuard peer, which routes all incoming and outgoing traffic.
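As a rough sketch (not the actual configuration), the peer definition on an internal host might look something like this; the interface name, addresses, port and keys below are placeholders.

```
# /etc/wireguard/wg0.conf on an internal host -- illustrative values only
[Interface]
Address = 10.10.0.2/24
PrivateKey = <internal-host-private-key>

[Peer]
# armona, the Linode jump server / web server
PublicKey = <armona-public-key>
Endpoint = <armona-public-ip>:51820
# Tunnel subnet routed via armona; 0.0.0.0/0 would send all traffic through it
AllowedIPs = 10.10.0.0/24
# Keep the tunnel alive from behind the home NAT
PersistentKeepalive = 25
```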

Observability

Multiple incidents in the past made it clear that health checks needed to be conducted for each service and host, with a reliable notification system in place. UptimeKuma was adopted a few years ago for this purpose. Each service and host is registered in UptimeKuma and health checks run every 2 minutes. If two consecutive failures are seen, the service or host is marked as degraded. Degradation notifications are sent to a private Discord channel, which immediately pushes a notification to my phone. In other words, I am notified within 4 minutes of a host or service failure.

In addition to health checks, system metrics such as CPU, memory and network utilisation are collected from each host and visualised using Grafana. This has helped troubleshoot issues in the past and could help prevent incidents through proactive notification via Grafana Alerting, although that has not been set up yet.

Mitigating common issues

Over the years, a number of steps have been taken to avoid disruptions at the storage and application layers. Persistent issues with mounting NFS shares over remote networks resulted in efforts to localise storage. Hosting local NFS shares solved the mounting and performance-related incidents, but problems with the JBOD enclosure in use still remained. In 2023 the JBOD enclosure failed, which forced a move to directly attached storage for applications, and this eliminated the storage-related issues entirely.

In terms of applications, a strong preference is given to containerised workloads, including migrating away from unsupported applications or ones that do not work well in a Docker environment. This allows applications to be segmented, easily migrated between hosts and quickly restarted on failure. Most applications run in containers, and each is defined in a Docker Compose file. Using Compose allows containers to be restarted on host reboots and configuration changes to be tracked over time.
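As an illustration (simplified, not the exact files in use), a service definition might look something like the following, with the restart policy doing the work of bringing the container back after crashes and host reboots; the paths and port are placeholders.

```
# docker-compose.yml for a single service -- simplified example
services:
  jellyfin:
    image: jellyfin/jellyfin:latest
    restart: unless-stopped   # restart after crashes and host reboots
    volumes:
      - ./config:/config      # configuration lives alongside the compose file
      - /srv/media:/media:ro  # placeholder media path
    ports:
      - "8096:8096"
```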

Incident timeline

At 15:56 UTC a support ticket was opened by Linode informing me of an issue on the physical host armona was located on and the emergency maintenance it required. The service disruption began at 16:01 UTC when armona was migrated to a new physical host. Post-migration, armona booted up immediately and successfully but did not attempt to establish connectivity to its peers on the internal (home) network.

The failure to communicate with all internal services and hosts resulted in multiple alerts, which were received through Discord at 16:03 UTC. The notification reached me successfully, but I dismissed it as I was finishing up work for the day.

At 18:55 UTC my wife noticed Jellyfin did not load, and this prompted an immediate investigation. Remembering the alerts indicating all services and hosts were offline, I went straight to check armona; this was also when I noticed the emails from Linode about the host migration.

After logging into the host at 19:01 UTC, I confirmed that the WireGuard interface was down; bringing it up brought all services back online. By 19:10 UTC connections to all services were restored and verified.
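For the record, the diagnosis and fix amounted to roughly the following; the interface name wg0 is a placeholder for whatever the tunnel is actually called.

```
sudo wg show             # nothing listed, so the tunnel never came up
ip link show wg0         # "does not exist" confirms the interface is absent
sudo wg-quick up wg0     # bring the tunnel up from /etc/wireguard/wg0.conf
sudo wg show             # peers now show recent handshakes
```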

Improvements

The failure of the WireGuard interface to come up after boot was a result of human error. I followed my own guide to enable WireGuard on boot, and on 2025-03-26 at 17:27 UTC a test reboot of armona was done to verify the configuration. The WireGuard connection was successfully established, allowing all services to communicate externally. As a precaution, all other hosts were checked to verify that WireGuard is enabled on boot.
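On a systemd host using wg-quick, enabling the tunnel on boot is a single command; wg0 is again a placeholder for the actual interface name, and the guide referenced above may do this differently.

```
sudo systemctl enable --now wg-quick@wg0   # start now and on every boot
systemctl is-enabled wg-quick@wg0          # should print "enabled"
```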

