The 2024 Microsoft Outage: Retrospective

Tami Yousafi | Software | November 27, 2024

On July 19, I got up at 5 AM. I was struggling with insomnia at the time and hadn’t slept a single minute that night. Nonetheless, I was excited for my short trip to LA.

“I’ll just sleep on the plane,” I thought to myself. I got to the airport at 6:30 AM for my 9 AM flight, only to step into chaos. 

The United section of the airport was crowded with long lines of people sitting and lying down as if they’d been there for hours. I walked back and forth through the United area. The lines to the check-in kiosks stretched farther than I could see, and there were no employees at the desks. I finally found an employee who was walking around, and he said that all the systems were down. Not knowing what to do, I just picked a random line and stood in it. None of the lines were moving anyway, so it wouldn’t make a difference which one I chose.

All around me I saw people walking and standing around, some frantic and some dispirited. People who were just arriving asked others what was going on. The employees tried to answer customer questions calmly. I also saw some people recording short videos of the crowd to share on social media so the world could see what pandemonium this was.

I went on my phone and Googled “United airlines” to find out what was going on. I found that there was a Microsoft outage that affected several airlines and many other businesses. I thought to myself, “Wow, Microsoft of all companies? Really?”

At one point, an employee walked through the crowd with a piece of paper listing the destinations of canceled flights. Thankfully, LA wasn’t on the list.

As I stood there, I constantly refreshed the news section of Google for any updates on the outage. I prayed that the systems would miraculously turn back on, and after about an hour, at around 7:45, I saw people finally walking from the bag-check desks to the entrance of the security checkpoint. I thought, “Are the systems back on?” Thinking quickly, I checked in on the United app, which sent me to the bag-drop shortcut, a much shorter line whose end I could actually see.

I fast-walked to the security checkpoint, and as I waited for my bag to get through the TSA machine, to my dismay, it rolled to the side for an inspection. 

I thought, “How wonderful, something more to keep me from getting to my gate in time for my flight.” The time was 8:00. The TSA agent went through my bag, stopping at my pepper spray. 

I was confused about why my pepper spray was a problem, considering it hadn’t triggered the TSA machine the last two times I flew. Still, I thought, “Take it! I don’t care! I’ll buy a new one! Who needs personal protection? Just let me get to my gate!”

I finally got through the security checkpoint, albeit minus my pepper spray, and fast-walked my way to the shuttles, through the airport, and past other gates.

By the time I finally got to my gate, at 8:50 AM, I was breathing hard and my legs were aching. I saw that my flight was delayed, but not canceled, and sat down with a big sigh of relief. I rested my aching legs and caught my breath for a few minutes before getting up to get a much-needed coffee and breakfast.

The employee at the front desk announced that the flight would be delayed indefinitely. I ate my breakfast, sipped my coffee, and read my book. In the back of my mind, I couldn’t stop thinking about how big of an impact this outage had on the operations of the world. 

At 9:40, they started the boarding process. 

“Wow, only a 40-minute delay,” I thought to myself. Fast-forward 4 hours and I landed in LA, got a Lyft to my hotel, and passed out on the bed. 

______________________

Now, I want to put on my developer cap and analyze what happened to me and everyone else in the world on this fateful day. 

The Problem

According to an article from CNBC, the outage happened because “cybersecurity giant CrowdStrike experienced a major disruption following an issue with a recent tech update.” The CEO of CrowdStrike described it as a “defect found in a single content update for Windows hosts.” While it’s not surprising that many businesses rely on Microsoft, the machines running Microsoft’s Windows rely on software from other companies as well.

Additionally, an article from Baseline described the outage as a “DDoS attack” which was “exacerbated by an error in the implementation of the defense mechanisms, which amplified the attack’s impact rather than mitigating it.” That description, however, lines up with what Microsoft reported about a separate Azure outage on July 30; CrowdStrike’s CEO stated that the July 19 incident was not a security incident or cyberattack. Either way, it’s surprising to me that something like this could happen to even the biggest tech giants. I’m not sure what the tech departments look like over at CrowdStrike or Microsoft, but I imagine large teams of developers and quality assurance analysts looking over every piece of software before they release it.

CrowdStrike also explained in an article from The Verge that “a flawed sensor configuration update was the culprit. The company blames a bug in test software for not properly validating the content update that was pushed out to millions of machines on Friday.” CrowdStrike is also “promising to more thoroughly test its content updates, improve its error handling, and implement a staggered deployment.” Indeed, CrowdStrike needs to ensure this doesn’t happen again.
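
To make those two promises concrete for myself, here is a minimal Python sketch of validating an update and then deploying it in stages. This is not CrowdStrike’s actual process. The `ContentUpdate` class, the `EXPECTED_FIELD_COUNT` schema size, the stage fractions, and the `is_healthy` check are all hypothetical names and values I made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ContentUpdate:
    """Hypothetical stand-in for a configuration/content update."""
    version: str
    fields: list[str]


EXPECTED_FIELD_COUNT = 3  # assumed schema size, for illustration only


def validate(update: ContentUpdate) -> bool:
    """Reject updates whose shape does not match what the consumer expects."""
    return len(update.fields) == EXPECTED_FIELD_COUNT and all(update.fields)


def staged_rollout(update, hosts, stages=(0.01, 0.10, 1.0),
                   is_healthy=lambda host: True):
    """Deploy to growing fractions of hosts, halting at the first unhealthy one.

    Returns the list of hosts that received the update.
    """
    if not validate(update):
        return []  # never ship an update that fails validation

    deployed = []
    for fraction in stages:
        # Cumulative number of hosts that should have the update after this stage.
        target = max(1, int(len(hosts) * fraction))
        for host in hosts[len(deployed):target]:
            deployed.append(host)
            if not is_healthy(host):
                return deployed  # stop the rollout; remaining hosts are untouched
    return deployed
```

With 100 hosts, a malformed update reaches no machines at all, and an update that breaks the sixth host stops after six machines instead of all 100. The point of the design is that a bad update gets two chances to be caught before it reaches everyone.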

While CrowdStrike has identified the issue, there is another theory that Microsoft’s recent layoffs played a part in this event as well. According to an article from the Digital Journal, “Microsoft was laying off hundreds of employees from its Azure cloud unit. This decision was part of a broader restructuring effort aimed at improving operational efficiency and focusing on core business areas amid a challenging economic environment.” The tech industry is in some turmoil right now with countless layoffs and hiring freezes, affecting thousands of people nationwide, myself included. Had there been more employees to thoroughly check the software, it’s possible that this issue might not have happened.

The Aftermath

There is a feud between Microsoft and Delta. In summary, Delta is blaming Microsoft for the drastic drop in its operations caused by the outage, while Microsoft is defending itself, claiming that the fault lies with CrowdStrike instead. There are many layers to this. It’s hard to assign blame to Delta and other businesses that were affected.

Future Prevention 

There was yet another outage on July 30, where some “customers experienced intermittent connection errors, timeouts, or latency spikes while connecting to Microsoft services that leverage Azure Front Door and Azure Content Delivery Network” (source: InformationWeek). It surprises me that another outage occurred so soon after the July 19 one. However, it wouldn’t surprise me if Microsoft and CrowdStrike take steps to prevent incidents like this from happening again.

There have been reports that “Microsoft is announcing plans to make changes to Windows that will help CrowdStrike and other security vendors operate outside of the Windows kernel” (source: The Verge). This is a good solution, since a faulty update running outside the kernel is far less likely to crash the entire machine. It can serve as a cushion to land on for millions of users worldwide.

During my time in university, a phrase I often heard was “Why reinvent the wheel?” If there is a service that is already created for consumers to use, then they should be able to use it. The question is, what will happen if those services cease to operate?

Businesses that rely on services from other companies have every right to hold those companies accountable when operations come to a halt. In that case, the service providers need to take responsibility, be transparent about what is going on, and fix it ASAP. Microsoft could have some backup systems in place to ensure that its millions of users and clients can still operate to a certain capacity.
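
As a rough illustration of what “operating to a certain capacity” could look like in code, here is a small Python sketch of a fallback wrapper. Everything in it is hypothetical: `with_fallback`, `online_check_in`, and `manual_check_in` are names I invented, and the primary function simply simulates an outage.

```python
def with_fallback(primary, backup):
    """Return a callable that tries `primary` and falls back to `backup` on failure."""
    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # Primary service is unavailable; degrade instead of halting.
            return backup(*args, **kwargs)
    return call


def online_check_in(passenger):
    """Hypothetical primary system; here it simulates an outage."""
    raise ConnectionError("check-in system is down")


def manual_check_in(passenger):
    """Hypothetical backup: slower, but keeps operations moving."""
    return f"{passenger}: checked in manually"


check_in = with_fallback(online_check_in, manual_check_in)
print(check_in("passenger-1"))
```

The backup path does less than the primary one, but the caller still gets an answer rather than an error, which is the difference between a slow morning and a stalled one.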

As a developer, something I can learn from this is the importance of quality assurance and of moving software through separate environments, such as development, stage, and production, to ensure it functions properly. Each environment should get multiple rounds of testing through different user stories and paths, so that if an issue comes up, the teams can fix it before it sees the light of day.
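
A minimal sketch of that idea, assuming a build has to pass every user-story check in development and then in stage before it is allowed into production. The `release` function, the environment names as strings, and the example checks are my own illustrative choices, not any particular company’s pipeline.

```python
def release(build, checks, gates=("development", "stage")):
    """Run every check in each gate environment, in order.

    Returns "production" if all checks pass everywhere; otherwise returns a
    message naming the environment and the checks that blocked the build.
    """
    for env in gates:
        failed = [name for name, check in checks.items() if not check(build, env)]
        if failed:
            return f"stopped in {env}: {', '.join(failed)}"
    return "production"


# Hypothetical user-story checks: each takes the build and the environment name.
checks = {
    "login": lambda build, env: True,
    "checkout": lambda build, env: env != "stage",  # simulate a stage-only bug
}

print(release("build-1", checks))
```

Here the simulated bug only shows up in stage, so the build is stopped there and never reaches production.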

Needless to say, I sincerely hope that massive tech outages like this don’t happen again and, if they do, that they are handled promptly while backup systems ensure that businesses all over the world continue to run smoothly. 

Tami Yousafi
Software Developer
