That time the RAID cards synced up
The Problem
It's Hadoopland. I hear the grumbles about a weekly occurrence; the cluster goes to shit on the weekends. How many weeks? Long enough to be certain it's weekly!
Investigation!
Any time I hear about a periodic occurrence, cron springs to mind. Well... cron-like things: systemd timers, runit tasks, container schedulers, the uszh. I found loads of cron jobs. None of them were set to get upset on the weekend, though.
Serendipity
I was following the existing efforts in Slack when a thread caught my eye: a discussion revealing that our 'RAID' setup... was actually JBOD! (duh duh DUH!)
The setup becomes clearer. Each machine has a dozen or so disks, one of which is the operating system drive. The rest are configured as Just a Bunch of Disks (JBOD). Not a faux-JBOD made of a bunch of single-disk RAID-0s, either! This isn't actually surprising in any way; Hadoop was set to use each spindle on its own and handle failures all the way up at the HDFS layer.
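Purely for illustration, a JBOD layout like that usually shows up in hdfs-site.xml as one data directory per spindle. The mount points and tolerance value here are hypothetical, not our actual config:

```xml
<!-- hdfs-site.xml: one data dir per physical disk (made-up mount points,
     middle entries elided). HDFS spreads blocks across these directories
     and rides out a dead spindle on its own, so no RAID underneath. -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/disk01/dfs,/data/disk02/dfs,...,/data/disk11/dfs</value>
</property>
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>1</value>
</property>
```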
By sheer luck, I had been perusing the manuals for the RAID cards. They have a feature called 'Patrol Reads'. To make a long story bearable: the card itself periodically scans all the disks, checking for media errors as an early-detection scheme. This extra read load has performance implications, so these controllers politely wait for a lesser-loaded time to impose on the disks.
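If you want to peek at this on your own hardware, LSI/Broadcom controllers expose patrol read state through their CLI tools. Commands from memory, so treat them as a starting point rather than gospel:

```
# MegaCli: dump patrol read mode, schedule, and current state for all adapters
MegaCli64 -AdpPR -Info -aALL

# The storcli equivalent for controller 0 on newer tooling
storcli /c0 show patrolread
```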
Hypothesis
The RAID cards are running their patrol reads on the weekend, and the extra read load is causing the DataNodes to fall behind on their HDFS replication.
Testing
To collect data, I wrote a script to poll the RAID cards for their patrol read status, fed the results to DataDog, and whipped up some graphs.
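The script itself is long gone, but a minimal sketch of the idea looks something like this. It assumes MegaCli is installed and a DogStatsD agent is listening on the default localhost UDP port; the metric name and the output parsing are my guesses, not gospel:

```python
#!/usr/bin/env python3
"""Poll the RAID controller for patrol read activity, report it to DataDog.

Illustrative sketch: assumes MegaCli on the box and a DogStatsD agent
listening on the default 127.0.0.1:8125 UDP port.
"""
import socket
import subprocess
import time

DOGSTATSD = ("127.0.0.1", 8125)
METRIC = "raid.patrol_read.active"  # hypothetical metric name


def patrol_read_active() -> bool:
    # MegaCli prints the current patrol read state in its -AdpPR -Info output.
    out = subprocess.run(
        ["MegaCli64", "-AdpPR", "-Info", "-aALL"],
        capture_output=True, text=True, check=True,
    ).stdout
    # The exact phrasing varies by firmware; this string is a guess.
    return "Active" in out


def send_gauge(name: str, value: int) -> None:
    # DogStatsD speaks a plain-text UDP protocol: "metric:value|g" for gauges.
    payload = f"{name}:{value}|g".encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, DOGSTATSD)


if __name__ == "__main__":
    while True:
        send_gauge(METRIC, int(patrol_read_active()))
        time.sleep(60)  # one datapoint a minute is plenty for a weekly pattern
```

Plain UDP to DogStatsD keeps the script dependency-free, which matters when you are shipping it out to every DataNode in a cluster.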