DNS can be a nightmare

2024-04-29

I'm a simple dog. I like my coffee black, my music at a reasonable volume, and my networking fast. Lately, I've been looking to further that third one by reducing network latency and network calls where I can. While poking around my NextDNS dashboard, as I do from time to time, I realised that wow, my devices are making a lot of queries to NextDNS servers. Surely I can reduce the amount of time it takes to resolve a DNS query by caching those responses closer to my devices?

In my line of work, we talk a lot about "running at the edge", as close to the people using the service as possible. What's closer than physically beside my desk, on my home subnet? Therefore, I set out with a simple task - build a shared DNS cache for my home network.

Conceptually, this is not an unusual task. Routers usually do this by default, plenty of clients cache DNS queries, and so on. So I was fairly confident it would be easy to do: a couple of LXC containers running in Proxmox, connected to Tailscale, and I'd be golden. The idea was to run two dnsmasq instances, one at home and one in the Floofy.tech Proxmox server (for redundancy more than anything), that all the devices on my Tailscale network would use for DNS queries. Those instances would forward to the NextDNS profile I have set up specifically for those devices (with several DNS rewrites to point to Tailscale IPs) and then cache the response so that subsequent requests from the same or other devices would be much faster.

I'll admit, the setup of dnsmasq itself was very smooth. All I had to do was grab the configuration NextDNS generates for me and pop it into /etc/dnsmasq.d/nextdns.conf, then tweak the cache size and the interface it should listen on. dnsmasq itself is intended to be run, from what I can tell anyways, on a single device and only handle the queries from that device, and I found that the Debian package installs the init.d script with this in mind. Once I'd configured it, I pointed my Tailscale DNS settings to the machine's Tailnet IP, and it was off to the races! The results were pretty immediate (if somewhat imperceptible to the human eye), and I quickly set up the second LXC in the Floofy.tech Proxmox server. I was happy.

https://cdn.gabrielsimmer.com/images/dns-query-time.png DNS resolution time dropped dramatically, time in ms
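If you want to check numbers like these yourself without a dashboard, here is a minimal sketch (not what produced the graph above) that times a single UDP query against a chosen resolver using only the Python standard library. The function names `build_query` and `time_query` are made up for this example, the address in the usage comment is a placeholder for your cache's Tailnet IP, and it only handles IPv4 servers and A records.

```python
import random
import socket
import struct
import time


def build_query(hostname: str, query_id: int) -> bytes:
    """Build a minimal DNS query packet for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A record, QCLASS 1 = IN (internet).
    return header + qname + struct.pack("!HH", 1, 1)


def time_query(server: str, hostname: str, port: int = 53, timeout: float = 2.0) -> float:
    """Send one query to `server` and return the round-trip time in milliseconds."""
    query_id = random.randint(0, 0xFFFF)
    packet = build_query(hostname, query_id)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.perf_counter()
        sock.sendto(packet, (server, port))
        response, _ = sock.recvfrom(512)
        elapsed_ms = (time.perf_counter() - start) * 1000
    # Sanity check: the reply should carry the same ID we sent.
    if struct.unpack("!H", response[:2])[0] != query_id:
        raise RuntimeError("response ID does not match query ID")
    return elapsed_ms


# Example usage (placeholder address, replace with your own resolver):
#   first = time_query("100.64.0.53", "example.com")   # likely forwarded upstream
#   second = time_query("100.64.0.53", "example.com")  # likely served from cache
#   print(f"first: {first:.1f} ms, second: {second:.1f} ms")
```

Running the same query twice in a row is the interesting part: the first one usually has to go upstream, while the second should come straight out of the cache.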

My hubris was met not with immediate consequence, but a slow, creeping one. I found that the Floofy.tech LXC was being temperamental, losing its network connection completely. It was later in the day, so I opted to just remove it from the Tailscale DNS settings and figure it out in the morning.

I woke up to my personal infrastructure on fire, but quietly. At some point during the night, the dnsmasq LXC on my personal Proxmox box also fell over, this time a little more dramatically, knocking the entire host offline. Groggy, I rebooted the machine and started on the rest of the recovery: I switched off "override DNS" in Tailscale's dashboard, disabled the Tailscale DNS machinery on my desktop, and set about figuring out what had happened. As far as I could tell, it had nothing to do with dnsmasq, but rather the way the /dev/net/tun device needs to be mounted into the LXC container for Tailscale to function. My dreams of a lightweight container running the cache in tatters, I swapped to using virtual machines on both boxes.

Everything going wrong once again reminded me how badly things can end up if your DNS goes down or misbehaves. The result isn't always immediate, or obvious. It creeps up on you as caches expire, slowly cutting off your ability to connect to websites. This can be especially frustrating when the very thing you use to manage DNS configuration is a remote website.
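To make that delayed failure concrete, here is a toy sketch of a TTL-based cache in Python. It is not how dnsmasq is implemented; `TTLCache`, the fake clock, the hostname, and the addresses are all invented for illustration. The point is that lookups keep succeeding after the upstream resolver dies, right up until the cached entry's TTL runs out.

```python
class TTLCache:
    """Toy DNS-style cache: serve stored answers until their TTL expires."""

    def __init__(self, resolve, clock):
        self._resolve = resolve  # callable: name -> (address, ttl_seconds)
        self._clock = clock      # callable: () -> current time in seconds
        self._entries = {}       # name -> (address, expiry_time)

    def lookup(self, name):
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and entry[1] > now:
            return entry[0]  # still fresh, upstream is never contacted
        # Expired or missing: this is the moment an upstream outage becomes visible.
        address, ttl = self._resolve(name)
        self._entries[name] = (address, now + ttl)
        return address


# Simulated outage using a fake clock and a fake upstream resolver.
now = [0.0]
upstream_up = [True]


def upstream(name):
    if not upstream_up[0]:
        raise ConnectionError("upstream resolver unreachable")
    return ("100.64.0.10", 300)  # hypothetical answer with a 5 minute TTL


cache = TTLCache(upstream, lambda: now[0])
cache.lookup("git.example.internal")  # t=0: fetched from upstream and cached

upstream_up[0] = False                # the resolver falls over overnight
now[0] = 100.0
cache.lookup("git.example.internal")  # t=100: still answered from cache

now[0] = 301.0
try:
    cache.lookup("git.example.internal")  # t=301: TTL expired, failure surfaces
except ConnectionError as err:
    print(f"lookup failed only after the TTL ran out: {err}")
```

With every device and application holding entries with different TTLs, the failures trickle in rather than arriving all at once, which is what makes a DNS outage so confusing to diagnose.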

Anyways! The result is fairly reasonable. I've cut down DNS resolution times by a significant amount, which was really the end goal. I also configured my router's DNS properly to route DNS requests through its own dnsmasq instance (no custom built router... yet...) so generally devices on my network get faster responses, even if I have less control over how it works. DNS rewriting with NextDNS still works, since the caches will query it for answers.

https://cdn.gabrielsimmer.com/images/dnsmasq-dashboard.png My dnsmasq Grafana dashboard using dnsmasq_exporter

https://cdn.gabrielsimmer.com/images/dns-nextdns-drop.png Grafana dashboard showing a substantial drop in queries to NextDNS

Now we'll see how long it lasts.