01 - Making a Splash

I started with our metrics stack: we’d have a host go down and our dashboard wouldn’t show it as down for about 40ish minutes… Not great. I hadn’t worked with Prometheus or Grafana before, and it took a week or two to wrap my head around PromQL. After learning how Prometheus was supposed to work, tangling with some PromQL queries, and talking to our on-call folks about the time ranges when hosts had gone down, I made my first discovery, one many are familiar with: our setup was non-standard. Instead of using node_exporter, we relied on a custom script and an intermediate container that Prometheus scraped. So instead of standard practice, we had a custom, undocumented solution.

I came up with some gnarly Prometheus queries that pulled data over a time range and were more sensitive to stale data. While I was testing one of them, a host showed as down. I went over to one of the senior admins and asked, “Hey, is host X down?” He tried to SSH to the box: connection timed out. We had a short-term fix for our metrics stack. My first win. We could catch errors earlier, and our on-call people had something more responsive. My lead was happy, and I set out to figure out how to migrate to node_exporter and deploy it with Puppet. I was excited to have solved a problem; I had tackled the technical side of it.
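For a sense of what "more sensitive to stale data" means, here is a minimal sketch in Python rather than PromQL. It is not the query from this story: the host names, the timestamps, the five-minute threshold, and the `find_stale_hosts` helper are all made up for illustration. The idea is simply to compare each host's newest sample against the current time and flag anything that has gone quiet for too long.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: a host counts as stale once its newest
# sample is older than this. The real value depends on scrape intervals.
STALE_AFTER = timedelta(minutes=5)


def find_stale_hosts(last_seen, now, stale_after=STALE_AFTER):
    """Return the hosts whose newest sample is older than stale_after.

    last_seen maps a host name to the timestamp of its most recent sample.
    The result is sorted by host name so the output is stable.
    """
    return sorted(
        host for host, ts in last_seen.items() if now - ts > stale_after
    )


if __name__ == "__main__":
    # Made-up data: host-b last reported 40 minutes ago.
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    last_seen = {
        "host-a": now - timedelta(seconds=30),
        "host-b": now - timedelta(minutes=40),
    }
    print(find_stale_hosts(last_seen, now))  # prints ['host-b']
```

A check like this only catches hosts that have stopped reporting; it says nothing about why, which is the same limit the SSH test above ran into.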

But I had not thought about the structural side.

I was told I’d come back to the metrics work once we’d handled other priorities. I never would.