# Tailscale Subnet Route Lost After networkd Restart, 2026-04-10
## Summary

The Incus server acts as a Tailscale subnet router advertising a handful of 192.168.1.0/24 host addresses into the tailnet. After a `systemd-networkd` restart wiped ip rules earlier in the day, Tailscale's recovery path restored only the ip rules and skipped a full route reconfiguration, leaving the `throw 192.168.1.0/24` entry missing from routing table 52. Any tailnet peer trying to reach a 192.168.1.x host hit a routing loop on the Incus server instead of being forwarded to the physical LAN.
## Situation

The Incus server runs Tailscale (1.96.4) as a subnet router, advertising:

- `10.10.10.0/24` (Incus internal subnet)
- `192.168.1.3/32`, `192.168.1.4/32`, `192.168.1.90/32` (physical LAN hosts)
- `fd42:10:10:10::/64`

Tailscale manages its own policy routing via table 52. For each advertised subnet, it installs a `throw` route in table 52, which causes lookups for those destinations to fall back to the main routing table, so the packet is forwarded out the physical interface `enp1s0` instead of being caught by table 52's `default dev tailscale0` entry.
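The `throw` mechanism can be sketched without touching a live router. The table contents below are reconstructed from this report, not captured output:

```shell
# Healthy table 52 on this subnet router, reconstructed from the report.
healthy_table='throw 10.10.10.0/24
throw 192.168.1.0/24
default dev tailscale0'

# A lookup for 192.168.1.90 must match a throw entry, which aborts the
# table 52 lookup and falls back to the main table (and out enp1s0).
# Without it, the packet is swallowed by "default dev tailscale0".
if printf '%s\n' "$healthy_table" | grep -q '^throw 192\.168\.1\.0/24'; then
    echo 'throw present: lookup falls through to the main table'
else
    echo 'throw missing: packet hits default dev tailscale0 and loops'
fi
```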
The Incus server also uses the Sakura VPS as a Tailscale exit node for outbound traffic, which continued to work correctly throughout this incident.
## Timeline (JST)

| Time | Event |
|---|---|
| 15:47:13 | `apt-daily-upgrade.service` runs, `unattended-upgrades` starts |
| 15:47:25 | `unattended-upgrades` upgrades the `incus` package and triggers `incus.service` to stop |
| 15:52:25 | `incus.service` finishes restarting; gracefully stopping all running VMs and QEMU processes took around 5 minutes |
| 15:52:46 | Incus brings `incusbr0` and other virtual interfaces back up. `systemd-networkd` restarts in response and flushes ip rules. `tailscaled` logs `router: somebody (likely systemd-networkd) deleted ip rules; restoring Tailscale's`. The ip rules are restored, but the `throw 192.168.1.0/24` entry in table 52 is not re-added; `throw 10.10.10.0/24` remains present |
| 15:53:34 | Monitoring flags the `192.168.1.x` hosts as unreachable from the tailnet, about 48 seconds after the networkd restart |
| ~21:30 | Investigation starts. `tailscale status` on the subnet router shows the routes still advertised. `ping` from a tailnet peer (lavie, 100.119.140.15) to 192.168.1.90 fails |
| ~21:35 | `tcpdump -i tailscale0` shows the ICMP echo request arriving from 100.119.140.15, immediately followed by a looped copy sourced from the Incus server's Tailscale IP 100.98.98.126 to the same destination. `tcpdump -i enp1s0` shows nothing. Routing loop confirmed |
| ~21:38 | `ip route show table 52` shows `throw 10.10.10.0/24` and `default dev tailscale0` but no `throw 192.168.1.0/24`. Root cause identified |
| ~21:40 | `journalctl -u tailscaled` reviewed, confirming the `systemd-networkd` flush at 15:52 and Tailscale's partial recovery |
| 21:42:34 | `systemctl restart tailscaled` forces a full Reconfig, which reinstalls all `throw` routes in table 52. Monitoring recovers |
Total downtime: 5 hours 49 minutes (15:53:34 to 21:42:34).
## Root Cause

The trigger was `unattended-upgrades` upgrading the `incus` package at 15:47:25, which restarted `incus.service`. When Incus came back up at 15:52:46 and recreated its virtual interfaces (`incusbr0` and friends), `systemd-networkd` restarted in response and flushed routing policy rules it considered unmanaged.

The actual bug sits in Tailscale's recovery path. Tailscale's route monitor detects the flush and logs `router: somebody (likely systemd-networkd) deleted ip rules; restoring Tailscale's`, which shows it knows what just happened and is meant to put things back. In practice it only re-installs the ip rules at priorities 5210 to 5270 and does not re-sync the routes inside table 52. A full `wgengine: Reconfig: configuring router` would reinstall both, but the recovery path skips that step.

In this case the `throw 192.168.1.0/24` entry was lost and never put back. With the ip rules in place but the `throw` route missing, any packet destined for a 192.168.1.x host hit the table 52 `default dev tailscale0` route instead of falling through to the main table. The packet was then re-emitted on `tailscale0` with the Incus server's Tailscale IP as source, looped back into Tailscale's forwarding path, and never reached `enp1s0`.

The `throw 10.10.10.0/24` entry happened to survive, which is why the Incus internal subnet kept working and masked the severity of the problem until a tailnet peer tried to reach a physical LAN host. Why one `throw` route survived and the other did not is unclear from the logs, and is part of what makes this worth reporting upstream.

Because the trigger is any `incus` package upgrade, this chain will repeat on every future Incus update unless one of its links is broken.
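Until the chain is broken, the broken state is at least easy to detect mechanically. A sketch of such a check, with the route-table dump read from stdin so it can be exercised without root; the subnet list mirrors this incident, and the timer wiring mentioned below is an assumption, not something deployed:

```shell
# check_missing_throws: read an `ip route show table 52` dump on stdin
# and print each advertised subnet whose throw route is absent.
check_missing_throws() {
    table="$(cat)"
    for subnet in 10.10.10.0/24 192.168.1.0/24; do
        printf '%s\n' "$table" | grep -q "^throw $subnet" ||
            printf 'missing throw %s\n' "$subnet"
    done
}

# Demo against the broken state observed at ~21:38:
printf 'throw 10.10.10.0/24\ndefault dev tailscale0\n' | check_missing_throws
# prints: missing throw 192.168.1.0/24
```

On the real server this would run from a timer as `ip route show table 52 | check_missing_throws`, restarting `tailscaled` whenever anything is printed.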
## Impact

- Tailnet peers could not reach any `192.168.1.x` host advertised as a subnet route through the Incus server
- The Incus internal subnet (`10.10.10.0/24`) was unaffected because its `throw` route survived the partial recovery
- Outbound traffic from the Incus server through the Sakura VPS exit node was unaffected
- Public-facing self-hosted services were unaffected, since they do not depend on the `192.168.1.x` subnet routes
## What Went Well

- Monitoring flagged the failure about 48 seconds after the networkd restart, so the incident window was bounded by response time rather than detection time
- `tcpdump` on `tailscale0` and `enp1s0` immediately showed the routing loop: request in on `tailscale0`, looped copy right after, nothing on `enp1s0`
- `ip route show table 52` made the missing `throw 192.168.1.0/24` entry obvious once compared with the surviving `throw 10.10.10.0/24`
- The `tailscaled` journal clearly logged the `systemd-networkd` flush event, pinpointing the trigger
- `systemctl restart tailscaled` was enough to restore correct state; no manual route surgery required
## What Did Not Go Well

- Response time was almost 6 hours: monitoring caught the outage at 15:53, but the alert was not acted on until ~21:30
- Tailscale's recovery path logs that it is "restoring Tailscale's" rules, which reads like a full recovery, but it only restores ip rules, not table 52 routes. The log line gives false confidence, and the failure mode is invisible from the journal alone
- Nothing on the Incus server reloads Tailscale's routing state when `systemd-networkd` restarts, so any future networkd restart is likely to reproduce this
## Lessons Learned

- `systemd-networkd` will flush routing policy rules on restart unless explicitly told not to. Tailscale recovers ip rules but does not always re-sync table 52 routes
- Tailscale's subnet router installs `throw` routes in table 52 for each advertised subnet. If one of those `throw` routes is missing, packets for that subnet hit the fallback `default dev tailscale0` entry and loop back out the Tailscale interface
- A routing loop in a subnet router shows up in `tcpdump` as a duplicated packet: the original from the tailnet peer, immediately followed by a copy sourced from the subnet router's own Tailscale IP
- A full `systemctl restart tailscaled` forces a complete router reconfiguration, which is sufficient to recover from this partial recovery bug
## Action Items

- Restart `tailscaled` to restore the missing `throw` route in table 52
- Set `ManageForeignRoutingPolicyRules=no` under `[Network]` in `/etc/systemd/networkd.conf` so networkd stops flushing Tailscale's ip rules when it restarts. This breaks the chain at the networkd level, regardless of what triggered the restart
- File a Tailscale issue describing the partial recovery path that restores ip rules without re-syncing table 52 routes
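For reference, the networkd change from the second action item would look like this (whether a `[Network]` section already exists in the file depends on the current configuration):

```ini
# /etc/systemd/networkd.conf
# Stop systemd-networkd from flushing routing policy rules it does not
# manage (Tailscale's rules at priorities 5210-5270) when it restarts.
[Network]
ManageForeignRoutingPolicyRules=no
```

Restarting `systemd-networkd` applies the setting; note that the restart itself is exactly the event the setting guards against, so it is worth verifying table 52 afterwards.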
## Considered Alternatives

- `PartOf=incus.service` on `tailscaled`: would tie Tailscale's lifecycle to Incus, but would also stop Tailscale whenever Incus is stopped for planned maintenance, cutting off tailnet SSH access to the server while Incus is down. Rejected
- An `ExecStartPost=systemctl restart tailscaled` drop-in on `incus.service`: would force a clean Tailscale re-init after every Incus restart. Works, but only covers the Incus trigger; anything else that causes a networkd restart would still reproduce the bug. Not selected
- Exclude `incus` from `unattended-upgrades`: would trade automated security updates for manual control. Does not fix the underlying mechanism and only papers over this particular trigger. Rejected