Monitoring and Me

2023-09-22

And also a bit about Nix

I've had a bit of a mixed history when it comes to monitoring services I'm responsible for. A long while ago, I built Platypus, which I believed would serve as a good way to monitor some of my machines, using websockets to stream metrics to a host machine which in turn streamed the updates to browser clients. It did actually work, and probably still does, but in the years since building it I've learned a lot about monitoring and gaining insight into how infrastructure and applications are performing: what metrics can be collected, which metrics are actually useful, what is important to alert on, and so on. And thus, Platypus was no longer a good fit.

Initially, when building Platypus, I assumed CPU, memory and disk usage were the only metrics I'd care about, and at first this was true. But as I gained more professional experience, and saw the wonderful graphs that could be produced by Grafana dashboards, I fell in love. Grafana especially is a wonderful bit of open source that I highly recommend getting familiar with. Displaying the metrics is relatively trivial though, especially since there are so many community dashboards for various systems. How do we actually collect the data?

The go-to for metrics collection (ignoring telemetry and logs for the moment) seems to be either Prometheus or DataDog. Having used both, I prefer the flexibility and free nature of Prometheus, but it also comes with the significant downside of being rather heavy - because it keeps a lot of metrics in memory for querying, I usually see memory usage hovering around 2-3GB. This is usually trivial in a production deployment, where you'd have a dedicated machine running Prometheus scraping metrics from other machines/services, but becomes a little more notable when running within Kubernetes, depending on how you set up your node groups and pod scheduling. And the impact is felt even more in a resource-constrained environment like a Pi cluster, which was the situation I found myself in. The first instinct may be to reach for something like DataDog to ship metrics off-site. But wait! Prometheus can do this too!

Back in 2021, Prometheus introduced Agent Mode, which lets a Prometheus process scrape metrics endpoints as usual but forward everything it collects, via remote write, to another Prometheus instance, which in turn handles the storage and actual querying of data. This was really wonderful for my use case, where memory is a concern but the benefits of using the Prometheus Kubernetes Operator made it almost essential. So, I quickly set about putting together a small remote machine to act as the ingest, running both Prometheus and Grafana. I opted for a 2 vCPU / 4GB memory virtual machine from Hetzner, running on their ARM offering. While I could have run this off of my NAS, which has plenty of power to spare, I'm of the mindset that monitoring is usually better isolated from the infrastructure it's keeping an eye on where possible.
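
To make the agent side a little more concrete: stripped of the Operator machinery, an agent-mode Prometheus is just a normal scrape configuration plus a `remote_write` block pointing at the ingest server, started with the `--enable-feature=agent` flag. A minimal sketch (the hostnames and addresses here are placeholders, not my actual setup):

```yaml
# prometheus.yml for an agent-mode Prometheus, started with --enable-feature=agent.
# In agent mode it scrapes and forwards, but doesn't store or serve queries itself.
global:
  scrape_interval: 30s

scrape_configs:
  # Illustrative job; in my cluster the Prometheus Operator generates these for me
  - job_name: example-app
    static_configs:
      - targets: ["example-app:9090"]

remote_write:
  # Hypothetical address for the remote ingest Prometheus
  - url: http://monitoring.example.internal:9090/api/v1/write
```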

One may ask "why run Prometheus in your cluster at all?". The answer is generally because the operator is able to query the Kubernetes API to automatically discover services that it can scrape for metrics. Running this externally would require you to generate access credentials for an external server to reach your Kubernetes cluster's API, and to expose the metrics endpoints to the world. Running Prometheus within the cluster avoids this.
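
For anyone who hasn't used the operator: scrape targets are declared as ServiceMonitor resources, which the operator discovers through the Kubernetes API and turns into scrape configuration. A rough, hypothetical example (the names are illustrative, not from my cluster):

```yaml
# Hypothetical ServiceMonitor: tells the operator to scrape any Service
# labelled app: example-app on its "metrics" port every 30 seconds.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
    - port: metrics
      interval: 30s
```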

Naturally, I selected NixOS for the job. Using the wonderful (albeit perhaps slightly dated) nixos-infect, and Xe's guide, I had a simple server up and running fairly quickly. It acts as both a scraper, mostly collecting node-exporter metrics from my various machines over Tailscale, and the ingest point for the Prometheus Agent running in my Kubernetes cluster (you can find my NixOS configuration for the monitoring server here). So far, I'm kind of loving the experience and have no complaints. Except that sometimes node-exporter dies on my NAS, but that's not really a complaint about the overall setup.
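
My actual configuration lives in Nix (linked above), but the Prometheus config the NixOS module ends up generating is roughly this shape - a plain scrape job pointed at node-exporter on each machine's tailnet address, with the ingest side also needing Prometheus started with its remote write receiver enabled (the `--web.enable-remote-write-receiver` flag on recent versions) so the in-cluster agent has something to push to. The hostnames below are made up:

```yaml
scrape_configs:
  # node-exporter on each machine, reached over the tailnet
  # (placeholder hostnames, not my real machines)
  - job_name: node
    static_configs:
      - targets:
          - nas.example.ts.net:9100
          - pi-1.example.ts.net:9100
          - pi-2.example.ts.net:9100
```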

Dashboard-wise, I usually grab whatever the community has already put together and use it as a starting point to discover metrics and how to present them in a useful way, slowly building my own visualisations on top. There is a wealth of dashboards available for just about any exporter under the sun, and you can always modify the ones you import to fit your needs (and you'll usually want to at some point). Admittedly, figuring out how to display data can be a bit tricky, and learning PromQL is certainly an experience, but Grafana offers a decent query builder and Prometheus has its own interface for querying the data set, so it's fairly easy to explore the metrics being collected.
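
To give a flavour of the PromQL you end up writing, here's a standard node-exporter CPU query wrapped up as a recording rule (the rule name is just illustrative):

```yaml
# rules.yml - average CPU usage per host over the last 5 minutes, as a percentage
groups:
  - name: node
    rules:
      - record: instance:cpu_usage:percent
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```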

While collecting the metrics is useful, I don't want to sit in front of my screen and watch the graphs every moment of my day waiting to see something go wrong. Rather, I want the system to tell me when something happens. Typically I reach for Grafana's own alerting system for this, but this time around I went with Alertmanager, which receives the alerts fired by Prometheus' alerting rules and handles the logic of routing and delivering them. It's not quite as robust as Grafana's offering, and requires editing configuration files to set up the rules and contact points, but theoretically it could be more reliable than Grafana's, since Prometheus pushes to it rather than Grafana pulling from Prometheus. We'll see how this works out long term.
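
The configuration itself is pretty small - a routing tree and a list of receivers. The sketch below uses a generic webhook receiver with a placeholder URL; the exact wiring between Alertmanager and ntfy is its own topic, so don't read this as my literal config:

```yaml
# alertmanager.yml - one route, one contact point
route:
  receiver: notify
  group_by: ["alertname"]
  group_wait: 30s
  repeat_interval: 4h

receivers:
  - name: notify
    webhook_configs:
      # Placeholder: anything that accepts Alertmanager's webhook payload
      - url: http://localhost:8080/alerts
```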

While writing this post (I started writing this on the 5th of September), it occurred to me that I really want to keep an eye on repeat jobs, like systemd timers or cron jobs, but Prometheus can't really do that on its own. So, I turned to Google and got pointed towards Healthchecks.io. Whenever one of these scheduled jobs runs, it makes a GET request to my Healthchecks.io instance (as seen here). When the job starts, it makes an initial request to the /start endpoint, which allows me to track how long the job took to run. If a job misses its check-in window, Healthchecks can either send me a notification directly or, as I've configured it, Prometheus will scrape the read-only metrics endpoint and send the failure to Alertmanager, which takes care of sending it to me (via ntfy). Some tools integrate really well with the optional endpoints Healthchecks.io offers - borgmatic, which we use for backing up the floofy.tech database, automatically calls the start, log and end endpoints.
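
The Prometheus side of that is just an ordinary alerting rule over the metrics Healthchecks exposes. If I recall correctly the gauge is called hc_check_up, but treat the metric and label names here as assumptions rather than gospel:

```yaml
# Alert when a Healthchecks.io check has been reporting down for a while.
# hc_check_up and the "name" label are my recollection of the read-only
# Prometheus endpoint's output - double-check against your own instance.
groups:
  - name: healthchecks
    rules:
      - alert: ScheduledJobMissed
        expr: hc_check_up == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Healthchecks check {{ $labels.name }} is down"
```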

The next steps for this setup will likely be around experimenting with various Thanos components to archive metrics data. I don't really need long-term retention of metrics, but it is a system I want to get familiar with, so ultimately that means I need to deploy it for myself. I also need to look at getting set up with logging - likely with Loki, since it integrates well with Grafana and is much more lightweight than the full Elasticsearch, Logstash and Kibana (ELK) stack.