TL;DR: Giants in the tech industry are falling left and right. Is Mercury in retrograde or something? Probably not. It just means that no matter how big you are, tech will go wrong sometimes. Redundancy and mitigation are key, and you can learn something from this.
At about midnight EST, Azure experienced a global outage that affected its virtual machine services. Initial statements point to a VM Guest Agent Extension being ported to a new platform using Azure Resource Manager, but a full accounting of what went wrong is still underway.
The outage lasted approximately 6.5 hours before engineers identified and mitigated the problem and restored service. A Root Cause Analysis (RCA) is promised within 72 hours.
Remember When Facebook Vanished?
Similarly, Facebook experienced a massive outage on October 4 that also lasted about 6 hours. Their root cause, though, was updates that broke their Border Gateway Protocol (BGP) routing. The outage affected not only Facebook but also a number of their other services, including Instagram and WhatsApp.
The Achilles Heel of Tech Giants is… IT?
So let’s be blunt and call it what it is… essentially, a couple of updates fried some of the biggest services on the internet? Well… yeah, absolutely.
Having been involved on both the development and IT infrastructure sides of the tech world, we understand just enough to be dangerous. That said, we can confidently state that given the complexity of today’s IT, it is a wonder any of it works at all!
Think of the vast network of different technologies involved: internet protocols, data storage, applications and APIs (oh, my!). They are coded in a variety of programming languages, by different teams of people, across time, with minimal standardization between them. All of it needs to work together to deliver these services at a scale so massive it boggles the mind. And 99.9 percent of the time… it works.
Are we really that surprised when something breaks? Are we that spoiled by modern conveniences that we bring out the tar and feathers when something goes down for a few hours?
Well… yeah, absolutely.
This is a perfect example of how tech dominates society rather than society utilizing technology. It’s odd to hear a tech blog take such a stance, I’m sure, but Ctrl Alt Develop is rooted in the idea of responsible and efficient use of technology to enrich, not to overtake, our lives and addict us to kitten videos.
Don’t mistake us to mean that these organizations shouldn’t be more careful in the future. Redundancy, Rollbacks and Recovery should be absolutely top of mind when doing updates on any system. But even the most redundant and backed-up system is absolutely going to have issues at some point in its lifespan.
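For the tinkerers, here is roughly what "rollback" can look like in practice. This is a minimal sketch, not how Azure or Facebook actually deploy anything: the function name update_with_rollback and the deploy and is_healthy callbacks are hypothetical stand-ins for whatever your own system uses to push a version and check that it still works.

```python
from typing import Callable


def update_with_rollback(
    current: str,
    candidate: str,
    deploy: Callable[[str], None],
    is_healthy: Callable[[], bool],
) -> str:
    """Try the candidate version; fall back to the current one if it misbehaves.

    All names here are hypothetical. `deploy` pushes a version and
    `is_healthy` reports whether the system works afterwards.
    Returns the version that is left running.
    """
    try:
        deploy(candidate)
        if is_healthy():
            return candidate
    except Exception:
        # A deploy or health check that blows up is treated like a failed check.
        pass
    # Roll back to the last known-good version.
    deploy(current)
    return current


if __name__ == "__main__":
    # Toy "system": a dict holding the running version.
    state = {"version": "v1"}

    def deploy(version: str) -> None:
        state["version"] = version

    def is_healthy() -> bool:
        # Pretend any version labeled "broken" fails its health check.
        return "broken" not in state["version"]

    print(update_with_rollback("v1", "v2-broken", deploy, is_healthy))
    print(state["version"])
```

The point is less the code than the habit: never push an update without knowing which version you would go back to, and how.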
The Good News
Perhaps the best news is that despite our panic, our outrage… our unabashed meltdown in the face of not seeing memes or being able to manage our virtual servers (to post memes)… the world didn’t end. And everything went back to normal in about 6 hours.
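To put those six hours in perspective, here is a quick back-of-the-envelope calculation. It is only a sketch under a loud assumption: that the six-hour outage was the only downtime in a 365-day year, which is our simplification and not a figure published by either company. The function names are ours, too.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability_percent(downtime_hours: float,
                         period_hours: float = HOURS_PER_YEAR) -> float:
    """Percentage of the period during which the service was up."""
    return 100.0 * (1 - downtime_hours / period_hours)


def downtime_budget_hours(target_percent: float,
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime allowed in the period at a given availability target."""
    return period_hours * (1 - target_percent / 100.0)


if __name__ == "__main__":
    # One six-hour outage in an otherwise perfect year.
    print(f"{availability_percent(6):.3f}% available")
    # How much downtime "99.9 percent of the time" leaves room for.
    print(f"{downtime_budget_hours(99.9):.2f} hours per year")
```

Under that assumption, a single six-hour outage still works out to roughly 99.93 percent availability for the year, and a 99.9 percent target leaves room for about 8.76 hours of downtime annually.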
In a world obsessed with “uptime”, the reality is that in our hearts, we know even the most well maintained and “too big to fail” systems… can still fail. There is no such thing as a perfect system, and that can be freeing to tech enthusiasts.
See, it means that you can experiment with tech and fail and learn and get better, and the world won’t end in the meantime.
If we outlined each time we took down our home network for a couple of hours, or each time we’ve had to recover our home servers, it would be fodder for posts that would span years on this site. At the end of the day, though, we now have indoor automated gardens, home-based media servers, and network-accessible file storage like Dropbox without any of the monthly fees. All because we were willing to try, fail, have outages, and bring it back to life.
Sure, in many cases it’s very convenient to offload that work and responsibility, but just like learning to maintain your car, even if you decide to have someone else do it for you, you should at least know how to do basic maintenance.
This is the key to controlling tech rather than having it control you. See you in the “how-to” posts…