Editor’s note: This analysis originally appeared in Bob Sullivan’s newsletter, The Red Tape Chronicles.
There have been so many hot takes about the CrowdStrike disaster that I don’t feel any need to add mine. But when you see what Delta Air Lines is *still* doing to passengers some five days after this one piece of bad code ate the internet, you can’t blame CrowdStrike any longer. This wasn’t a single point of failure. It was an event cascade.
Delta’s backup plan was to fail. Get used to it. Our digital age is teeming with what is often referred to as the “single point of failure” problem, and many large corporations just don’t invest in realistic backup plans. So, the backup plan fails. And by definition, you no longer have a single point of failure; you have an event cascade. An often preventable event cascade.
There’s a structural reason for this, and it’s one we simply won’t fix until there’s massive political will to do so.
The issue is simple. A truly workable “Plan B” is very expensive, and keeping it current is even more expensive — no Wall Street-driven company will ever invest the money unless it is forced to by some kind of regulation. Meanwhile, roughly half of America is instantly repelled by any mention of the word regulation. So, here we are.
I’ve been fascinated by the problem of backup plans for almost 15 years, since I wrote this piece — “Why Plan B Often Goes Badly” — for NBC News in the wake of the Fukushima nuclear power plant disaster. That case study offers plenty of lessons, and because it doesn’t involve the failure of a U.S. company or U.S. regulators, it seems a little easier to be clear-eyed about the blame.
Here’s the very short story: An earthquake caused a massive power outage, threatening the plant’s cooling capability. Then the subsequent tsunami knocked out backup generators. There were backup batteries (plan C, if you will!) but they only lasted a few hours, not nearly long enough to conduct repairs under trying circumstances like the aftermath of a tsunami. So, nuclear plant disaster.
This story is important because it shows that the phrase “single point of failure” is often a bit of a misnomer. What happened at Fukushima was an event cascade: multiple things failing, one after another. And that’s real life. Sure, Delta’s computers suffered a blue screen of death. But a fix was available shortly after.
Still, for days Delta found itself unable to perform basic tasks like assigning crews to aircraft. I look forward to official explanations of this, but clearly, Delta’s Plan B was a miserable failure. And now the airline finds itself with airports full of stranded customers nearly a week after the initial problem.
To be fair, it’s impossible to plan for everything that might happen in life — there’s always the possibility of an ultra-rare Black Swan event. But a bad software update is hardly a Black Swan event.
The more common limiting function is this: spending on Plan B cannot be infinite. There’s always a risk calculation when investing in redundant off-site data storage, or extra fire suppression equipment, or battery backup size.
Then there’s the problem of training. As anyone who’s ever run a “tabletop” incident fire drill will tell you, your imagination only takes you so far. One cannot really simulate all main production computers going offline at once; doing so would require shutting down a company. Without an alternate universe or an amazing simulator nearby, the fire drill you are running will always fall short of training people for the real thing. The last time your company ran an actual fire drill, you realized this, I’m sure, as many critical employees simply ignored the blaring alert to leave.
So what will the fallout from CrowdStrike be? What should it be? Sure, the firm’s stock price will take a hit. Maybe some companies will switch to new software, though that’s unlikely. Delta will have to issue so many refunds (thank goodness for new federal refund rules!) that I bet it takes some kind of one-time hit in its quarterly report.
So what? Will that really force better planning for the next software glitch? Perhaps if there were genuine competition in the airline industry, and angry consumers could vote with their feet by rewarding other firms. But most consumers have little or no real choice when booking tickets. So, I’m quite confident this will happen again.
Unless there is political will to change that. Because I promise you, the next software glitch is coming to your airline, or your bank, or your connected home, very soon.
Source: GeekWire, July 24, 2024 at 12:10 AM. https://ift.tt/zwVG4Rk