
How a statsd pipeline saved $1M

TL;DR

Don't send UDP packets for every function call. Do batch tiny packets into bigger ones.

I had always kinda idly wondered "How much does shitty code cost?" Like, in dollars. Evidently I wasn't the only one, as I found a dashboard that broke down AWS spend by autoscaling group. As a member of the Streaming Platform team, I was surprised to see a streaming-related name bubble up to the top of the list.

Curiosity thoroughly piqued, I pull at this thread and uncover an ingestion system with 1000+ shards and a matching number of boxes. Powering this leviathan is the KCL MultiLangDaemon, a Java process that handles the communication with Kinesis and spins up a process for each shard. That process, in our case, is a Python program.

One shard per box seems wasteful. It's single-threaded Python... I would expect at least one shard per core. I'm not a fan of speculating, so I jump onto a randomly-chosen box and start poking around with perf and py-spy. I open the flame graph, find the chunkiest slice, and take a gander. UDP send.

Kinesis is happening over stdin/stdout, so I gotta look elsewhere. UDP to localhost:8125 is statsd. I'm not surprised, we use statsd everywhere. tcpdump reveals that we're sending a shitload of teeny tiny packets. I let my big, dumb, human eyes latch onto a metric name and follow it through the codebase.

It feels like Every. Damn. Function. is decorated with a statsd timer, meaning that once your decorated function returns, the timer helpfully calls send() for you. Calling one of these functions (or, shudder, a chain of them) means that you're going to be sending a UDP packet for every function call. We were sending something like 30 or 40 bytes at a time, when a single statsd datagram can safely carry around 512 bytes' worth of newline-separated metrics. It adds up!
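The per-call pattern looked something like this. The names here are made up (our in-house decorator was different), but the shape is faithful: a decorator that times the function and fires one UDP datagram the instant it returns.

```python
import socket
import time
from functools import wraps

# Hypothetical sketch of the anti-pattern; not the real client.
STATSD_ADDR = ("127.0.0.1", 8125)
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def timed(metric):
    """Time a function and send a statsd timer the moment it returns."""
    def deco(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                ms = (time.perf_counter() - start) * 1000
                # One ~30-40 byte datagram. Per. Call.
                _sock.sendto(f"{metric}:{ms:.3f}|ms".encode(), STATSD_ADDR)
        return wrapper
    return deco

@timed("ingest.parse_record")
def parse_record(raw):
    return raw.strip()
```

Stack a few of these in a hot loop and the syscall overhead dwarfs the work being timed.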

At this point, I'm pretty hopeful! This seems like such an obvious win I can't believe it hasn't been done before. All's we gotta do is stick a statsd pipeline everywhere we can, and bang! free money. This thing had a lot of loops-in-huge-loops, on top of itself being a record processing loop. I dunno what big-O(the alphabet) works out to, but I know it's a shitload of packets.

Threading pipelines through it wasn't as mindless as I had hoped, but it was still pretty easy. I had some idea about how many times each call was getting hit (because I had seent it in the tcpdump), which dictated where I decided to flush. Did I really need to put that much thought into where I flushed? Probably not, but I was having GC-pause flashbacks. If it's gotta pause, I'd rather it be consistent and predictable.

Anywho, I freed up enough CPU to smush way more shards on each box. This allowed us to shrink the ASG, which saved us a bunch of money. For my efforts I got to move to the Efficiency team and work on this stuff full-time.
