Tailscale Throw Route Lost After needrestart Restarted networkd, 2026-05-02

Summary

All Incus containers became unreachable from tailnet peers after unattended-upgrades upgraded kmod and needrestart restarted systemd-networkd. The networkd restart wiped routing table 52, including the throw 10.10.10.0/24 entry that Tailscale uses to bypass the exit node for the Incus subnet. Tailscale detected the deletion but did not restore the route, leaving a bare default dev tailscale0 in table 52. Packets destined for 10.10.10.x containers were silently routed through the tsunagaru exit node instead of being forwarded locally, and HAProxy on tsunagaru returned 503 for all proxied services for 3 hours 28 minutes. The fix was systemctl restart tailscaled.

This incident shares its root mechanism with the 2026-04-10 incident. The ManageForeignRoutingPolicyRules=no fix applied after that incident protects Tailscale's ip rules (priorities 5210–5270) but does not protect routing table entries inside table 52 itself.

Situation

The Incus server acts as a Tailscale subnet router advertising 10.10.10.0/24 (all Incus containers) and several 192.168.1.x physical LAN hosts into the tailnet.

The Incus server also uses the Sakura VPS (tsunagaru) as a Tailscale exit node for all outbound internet traffic. When an exit node is active, Tailscale installs:

  • default dev tailscale0 in table 52, catching all outbound traffic and routing it to the exit node
  • throw <subnet> in table 52 for each advertised subnet, making lookups for those destinations fall through to the main routing table so packets reach local interfaces instead

Without the throw 10.10.10.0/24 entry, any packet destined for a container is caught by the default entry and exits through tailscale0 toward tsunagaru, which has no path to 10.10.10.x and drops it. This includes forwarded traffic from HAProxy on tsunagaru, which proxies services via the Tailscale subnet route.
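
The presence of the throw entry can be checked mechanically. The following is a minimal sketch, assuming ip route show table 52 prints throw routes as "throw <subnet>"; has_throw_route is a hypothetical helper name, not part of Tailscale or iproute2.

```shell
# has_throw_route SUBNET
# Reads `ip route show table N` output on stdin and succeeds only if a
# throw entry for SUBNET is present. The helper name is hypothetical.
has_throw_route() {
    # Compare whole fields so trailing whitespace or extra attributes
    # on the line do not affect the match.
    awk -v net="$1" '$1 == "throw" && $2 == net { found = 1 } END { exit !found }'
}
```

On the Incus server this would be run as ip route show table 52 | has_throw_route 10.10.10.0/24. A non-zero exit status corresponds to the failure state described above, where only the default dev tailscale0 entry remains.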

Timeline (JST)

The Incus server runs in UTC. All times below are JST (UTC+9). Server log times are exact; investigation times are approximate.

Time (JST) | Event
2026-05-02 15:36:54 | apt-daily-upgrade.service runs; unattended-upgrades upgrades kmod
2026-05-02 15:37:15 | needrestart restarts systemd-networkd. networkd flushes routing table 52. tailscaled logs monitor: RTM_DELROUTE: src=, dst=10.10.10.0/24, gw=, outif=0, table=52
2026-05-02 15:37:16 | tailscaled logs "allowing exit node access to local IPs: [10.10.10.0/24 ...]" but the throw route does not come back: networkd is still mid-teardown when Tailscale attempts the write
2026-05-02 15:40:31 | Uptime Kuma detects 503 on all proxied services
2026-05-02 ~19:00 | Incident noticed while traveling; investigation begins from hotel. ip route get 10.10.10.196 confirms routing via tailscale0 table 52 instead of incusbr0. ip route show table 52 shows throw 10.10.10.0/24 missing
2026-05-02 ~19:05 | journalctl -u tailscaled reviewed; the 15:37 JST networkd flush and Tailscale's failed recovery confirmed
2026-05-02 19:07:47 | systemctl restart tailscaled. throw 10.10.10.0/24 restored
2026-05-02 19:08:31 | Uptime Kuma confirms 200 OK
2026-05-02 ~19:10 | /etc/needrestart/conf.d/incus.conf updated to blacklist systemd-networkd and tailscaled from auto-restart

Total downtime: 3 hours 28 minutes (2026-05-02 15:40:31 to 19:08:31 JST).
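
The durations quoted in this report can be recomputed from the timeline timestamps. This is a small sketch that assumes GNU date (the -d option); elapsed_seconds and format_hm are hypothetical helper names.

```shell
# elapsed_seconds START END
# Both arguments are "YYYY-MM-DD HH:MM:SS" in the same time zone, so the
# zone cancels out and parsing them as UTC is sufficient.
elapsed_seconds() {
    start=$(date -u -d "$1" +%s)
    end=$(date -u -d "$2" +%s)
    echo $((end - start))
}

# format_hm SECONDS -> "<hours>h <minutes>m"
format_hm() {
    echo "$(( $1 / 3600 ))h $(( ($1 % 3600) / 60 ))m"
}
```

For example, format_hm "$(elapsed_seconds '2026-05-02 15:40:31' '2026-05-02 19:08:31')" gives the downtime, and elapsed_seconds '2026-05-02 15:37:15' '2026-05-02 15:40:31' gives the delay between the route deletion and detection.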

Root Cause

unattended-upgrades upgraded the kmod package, which is a purely userspace tool (provides modprobe, depmod, etc.). Despite no kernel module being loaded or unloaded, needrestart detected the upgrade and restarted systemd-networkd.

When systemd-networkd restarts, it performs a full interface teardown and rebuild. As part of that teardown it flushes all routes it considers unmanaged, including entries in table 52 that Tailscale installed. This fires a stream of RTM_DELROUTE netlink events.

Tailscale's netlink monitor detected the deletion of throw 10.10.10.0/24 from table 52 and its recovery path logged "allowing exit node access to local IPs". However, setRoutes() was called while networkd was still mid-teardown and flushing. Either the ip route add call failed silently, or networkd flushed the route again milliseconds after Tailscale re-added it. By the time networkd stabilized, Tailscale had already finished its recovery attempt and did not retry, so the route stayed missing until tailscaled was restarted.
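
To find the deletions in the tailscaled journal, the RTM_DELROUTE lines can be filtered by table. The sketch below assumes the log format shown in the timeline ("monitor: RTM_DELROUTE: src=, dst=..., gw=, outif=0, table=52"); deleted_routes is a hypothetical helper name.

```shell
# deleted_routes TABLE
# Reads tailscaled log lines on stdin and prints the dst= value of every
# RTM_DELROUTE event recorded for the given routing table.
deleted_routes() {
    awk -v want="table=$1" '
        /RTM_DELROUTE/ {
            dst = ""; tbl = ""
            for (i = 1; i <= NF; i++) {
                f = $i
                sub(/,$/, "", f)              # drop the trailing comma
                if (f ~ /^dst=/)   dst = substr(f, 5)
                if (f ~ /^table=/) tbl = f
            }
            if (tbl == want && dst != "") print dst
        }'
}
```

Because the server logs in UTC, the 15:37 JST event falls at 06:37 UTC, so a matching query would look like journalctl -u tailscaled --since "2026-05-02 06:30" --until "2026-05-02 06:45" | deleted_routes 52.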

The ManageForeignRoutingPolicyRules=no setting applied after the 2026-04-10 incident prevents networkd from flushing Tailscale's ip policy rules (table 52 rule, priorities 5210–5270). It does not prevent networkd from flushing routes inside table 52 itself. The ip rules survived this incident; the table 52 routes did not.
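
For reference, the setting in question can be sketched as a networkd.conf drop-in. The drop-in path and the write_networkd_dropin helper are assumptions for illustration; the server may set the option directly in networkd.conf. The commented-out ManageForeignRoutes= line is a separate option documented in networkd.conf(5); it was not part of the 2026-04-10 fix and has not been tested on this host.

```shell
# write_networkd_dropin PATH
# Writes a networkd.conf drop-in to PATH (hypothetical helper and path).
write_networkd_dropin() {
    mkdir -p "$(dirname "$1")"
    cat > "$1" <<'EOF'
[Network]
# Applied after the 2026-04-10 incident: keeps networkd from removing
# Tailscale's ip policy rules. Does not cover routes inside table 52.
ManageForeignRoutingPolicyRules=no
# Untested assumption, not part of the original fix:
# ManageForeignRoutes=no
EOF
}
```

A possible invocation is write_networkd_dropin /etc/systemd/networkd.conf.d/10-tailscale.conf. Applying a change to this file needs a networkd restart, which per this incident should be followed by systemctl restart tailscaled.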

The direct trigger, needrestart restarting systemd-networkd after a kmod upgrade, is unnecessary. kmod is not a daemon and does not need a service restart to take effect.

Impact

  • All Incus containers on 10.10.10.0/24 became unreachable from any tailnet peer, including tsunagaru
  • HAProxy on tsunagaru returned 503 for all proxied services (Forgejo, Mastodon, Miniflux, etc.) for the duration of the outage, since it reaches containers via the Tailscale subnet route
  • The Incus server itself could not reach its own containers: ip route get 10.10.10.x returned dev tailscale0 table 52 instead of dev incusbr0
  • Physical LAN hosts advertised as 192.168.1.x were unaffected: their throw routes were also deleted but packets for those hosts reach them via the LAN interface regardless
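
The wrong path seen from the Incus server can be detected by extracting the outgoing interface from ip route get. This is a minimal sketch; route_dev is a hypothetical helper, and it assumes the interface name follows the literal token "dev" in the output.

```shell
# route_dev
# Reads `ip route get ADDRESS` output on stdin and prints the interface
# name that follows the "dev" token, if any.
route_dev() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}
```

During the outage, ip route get 10.10.10.196 | route_dev would have printed tailscale0; on a healthy server it should print incusbr0.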

What Went Well

  • Uptime Kuma caught the outage about 3 minutes after the route deletion (15:37:15 to 15:40:31)
  • ip route get 10.10.10.196 immediately pinpointed the wrong routing path
  • ip route show table 52 made the missing throw entry obvious
  • journalctl -u tailscaled gave a precise timestamp and confirmed the networkd flush as the trigger, enabling full timeline reconstruction from logs alone
  • systemctl restart tailscaled was sufficient to recover; no manual route surgery required

What Did Not Go Well

  • Response took about 3 hours 20 minutes: Uptime Kuma caught the outage at 15:40, but the alert was not acted on until ~19:00, when investigation began from a hotel while traveling
  • The 2026-04-10 action items did not prevent recurrence. ManageForeignRoutingPolicyRules=no only covers ip rules, not table 52 routes, so the class of bug was not fully closed
  • needrestart restarting systemd-networkd after a kmod upgrade is unnecessary and causes cascading damage to unrelated services. This is a needrestart configuration gap, not a kernel requirement

Lessons Learned

  • ManageForeignRoutingPolicyRules=no in networkd.conf protects ip policy rules but not routing table entries. Tailscale's table 52 throw routes remain vulnerable to any networkd restart
  • Tailscale's recovery path for RTM_DELROUTE events appears to race with networkd mid-restart. The only recovery confirmed in this incident is a full systemctl restart tailscaled once networkd has stabilized
  • needrestart will restart systemd-networkd (and potentially tailscaled) when packages like kmod are upgraded, even when no restart is necessary. Services that own kernel routing state should be in a needrestart blacklist
  • Tailnet subnet route health should be monitored from the tailnet consumer side, not just from the advertising side. tailscale status showing a route as "advertised and approved" does not mean packets are actually being forwarded correctly
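
The recovery step from these lessons could in principle be automated. The following is only a sketch of one possible watchdog and is not one of the action items below: the variable names, the 5-second settle delay, and the idea of running it from a timer are all assumptions. SHOW_ROUTES and RESTART are overridable so the logic can be exercised without touching real routes.

```shell
# Hypothetical watchdog sketch; every name and default here is an assumption.
SUBNET="${SUBNET:-10.10.10.0/24}"
TABLE="${TABLE:-52}"
SHOW_ROUTES="${SHOW_ROUTES:-ip route show table $TABLE}"
RESTART="${RESTART:-systemctl restart tailscaled}"

# Succeeds when the throw entry for SUBNET is absent from the table.
throw_route_missing() {
    if $SHOW_ROUTES | awk -v net="$SUBNET" \
        '$1 == "throw" && $2 == net { ok = 1 } END { exit !ok }'; then
        return 1
    fi
    return 0
}

# Restart tailscaled only if the route is still missing after a short
# pause, so a networkd restart in progress has time to settle first.
repair_once() {
    if throw_route_missing; then
        sleep "${SETTLE_SECONDS:-5}"
        if throw_route_missing; then
            $RESTART
        fi
    fi
}
```

Calling repair_once periodically, for example from a systemd timer, would bound the outage to roughly the timer interval. It checks only the advertising side, so it complements consumer-side monitoring rather than replacing it.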

Action Items

  • Restart tailscaled to restore the missing throw route in table 52
  • Add systemd-networkd and tailscaled to the needrestart blacklist in /etc/needrestart/conf.d/incus.conf to prevent auto-restart of either service during unattended upgrades (see Incus Server: needrestart Configuration)
  • File a Tailscale issue describing the race condition in the RTM_DELROUTE recovery path that leaves table 52 routes unrestored when networkd is still tearing down
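
The blacklist entry might look like the sketch below. The actual contents of /etc/needrestart/conf.d/incus.conf are documented separately and are not reproduced in this report, so the regexes and the write_needrestart_blacklist helper are assumptions; $nrconf{blacklist_rc} is the needrestart option for services that should be ignored entirely.

```shell
# write_needrestart_blacklist PATH
# Writes a needrestart conf.d fragment (Perl syntax) to PATH.
# Helper name and file contents are illustrative assumptions.
write_needrestart_blacklist() {
    cat > "$1" <<'EOF'
# Never auto-restart services that own kernel routing state.
$nrconf{blacklist_rc} = [
    qr(^systemd-networkd),
    qr(^tailscaled),
];
EOF
}
```

A possible invocation is write_needrestart_blacklist /etc/needrestart/conf.d/incus.conf. Note that this assignment replaces any blacklist_rc entries defined earlier in the needrestart configuration.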