Yves here. While this post provides useful detail about the Crowdstrike 404, I must confess to not understanding the headline claim about liability shields. And the post does not explain that either. Delta is suing Crowdstrike with top litigator David Boies as lead counsel, so Delta sure as hell does not think Crowdstrike can bunker itself from the cost of its bad actions.
The post implies that the failure is regulatory, and results from Crowdstrike’s very large market share, as in it should have been a target of antitrust. But I don’t buy that argument from the perspective of liability. In contract and product/service dealings, most of all when you have big corporate buyers who are or can be assumed to be reasonably sophisticated, the liability exposure revolves primarily around tort and contract claims. So I am at a loss to understand the liability theory here.
I have yet to see anyone say that Crowdstrike customers agreed to any sort of indemnification. And even if there were such a provision, it could conceivably be voided if Crowdstrike made false representations or operated in a manner not contemplated in its legal agreements.
The implicit argument may be one that applies to outsourcing generally: that companies can outsource tasks, but that does not amount to shifting liability for poor performance of that task to the vendor. If you hire an accountant and he screws up your books, which then leads you to underpay your taxes, the IRS goes after you and you then have to attempt to recover from the accountant. Most accountants get clients to agree to limit their liability to fees paid. But those clauses are often not enforceable. For instance, a liability cap in a “take it or leave it” fee letter with no negotiation of terms could be contested successfully, per a rough and ready list of pitfalls here.
Readers more knowledgeable about the twists and turns are encouraged to speak up.
The discussion of this topic is a reminder I really should read the Delta filing….if I can somehow find time.
By Lynn Parramore, Senior Research Analyst at the Institute for New Economic Thinking. Originally published at the Institute for New Economic Thinking website
July 19 dawned with despair as CrowdStrike’s update sparked a seismic cybersecurity disaster.
A minuscule code glitch transformed Windows computers into ticking time bombs, causing widespread crashes that paralyzed airlines, banks, hospitals, and government agencies. The fallout was massive: over $5 billion in direct losses for Fortune 500 companies, with healthcare and banking sectors facing nearly $3 billion in combined damage. Stranded passengers, disrupted 911 centers, and delayed surgeries underscored the disaster’s profound impact. Indirect losses for people whose plans and activities were interrupted will likely run even higher.
This debacle casts a glaring spotlight on the fragility of the cybersecurity industry — a brutal reminder of the risks inherent in a market where consolidation, lack of oversight, and inadequate testing breed vulnerability. With firms like CrowdStrike holding sway over critical systems, a single misstep can set off a chain reaction of chaos — a wake-up call for lawmakers and regulators to step up their game.
Digital security analyst Muayyad Al-Chalabi joins the Institute for New Economic Thinking to advocate for a more resilient and diverse cybersecurity infrastructure, identifying key players responsible for widespread failures that hit ordinary people the hardest.
Lynn Parramore: What exactly happened in the wee hours of Friday, July 19th?
Muayyad Al-Chalabi: CrowdStrike, a leading cybersecurity firm, monitors global threats by collecting and analyzing data to detect anomalies and malware. Essentially, it’s a massive surveillance operation. The idea is that they do that monitoring, and they learn things in order to “protect” the end devices — your laptops, your computers — whatever that end device may be. They issue updates to look for the anomalies so that they can stop them or quarantine them.
I believe they perform these updates multiple times a day. On July 19th, one of these updates had an error that caused laptops, computers, and other end devices to enter what they call “panic mode,” resulting in a freeze and a blue screen.
LP: The dreaded “blue screen of death”—the digital equivalent of your computer yelling, “I’m melting down right before your eyes!”
MA: Correct. Once that happens, there’s a recovery process, which is manual. Fortunately, the manual process on your laptop or computer is straightforward—though it might seem hard if you’re a layperson and not familiar with moving files around.
The challenge is that large businesses and enterprises use a security feature called BitLocker that issues a key, which is about forty-eight characters long. Each machine needs its own recovery key from some IT group. Since your machine is already locked up, you can’t access it to retrieve the key. Somebody has to give you the recovery key on the phone or some other method. That’s why the recovery process became very cumbersome and it took a long time.
In summary, an error occurred, and the recovery procedure, which is manual and requires a lot of elaborate stuff, took a significant amount of time.
LP: Why did this one buggy update do so much damage?
MA: The global spread of the infection happened very quickly. It began at 4:09 AM UTC. CrowdStrike rolled out a fix at 5:27 AM UTC the same day, but in the 78 minutes it took their team to address the issue, the infection had already spread worldwide. The distribution mechanism was so efficient that it quickly affected everyone.
LP: CrowdStrike is blaming the testing software that it uses for this error. Why wasn’t the testing better?
MA: I just think they got away with it for such a long time.
LP: So it wasn’t a one-off problem?
MA: No, it wasn’t. This is not the software. There are two things going on. One had to do with CrowdStrike’s engine software, called “Falcon,” the cloud-based security platform that’s supposed to keep your endpoints safe and spot threats in real-time.
The other piece has to do with what they call “signatures.” These are the files that get updated to look for things. CrowdStrike calls them “sensors.” It’s a driver that sits inside the operating system. It’s that file, which is a very small file, that caused the havoc. So it’s not, I would think, an engine software error per se. Rather it’s the file that caused the operating system, which is the Windows operating system, to malfunction.
LP: Do you think the file issue was a pre-existing problem, or did it arise specifically because of this testing error?
MA: There’s no indication that the file error happened just because they made changes to their testing procedures or they implemented testing. I don’t believe that’s the case.
LP: What’s your view of who is responsible for the fiasco?
MA: There are really four elements of responsibility. One is CrowdStrike. Second is Microsoft and the operating system. Third is the enterprises that accept these auto updates without checking on the effects they will have on them. And the fourth element is this whole regulatory system and the issue of market concentration of businesses and so on.
LP: A lot of blame to go around.
MA: Right. CrowdStrike issues these updates, and they get away with it for a long time. The interesting thing on the Microsoft side is that Microsoft has a process to do testing before a rollout, including drivers from various vendors.
It’s not just CrowdStrike that has these files; think about other companies like Netflix or the manufacturers of printers—all of these involve drivers. Microsoft has a procedure to test and validate these drivers before they’re rolled out to ensure everything works properly. However, there’s a loophole that allows some code to bypass the usual testing process. And that’s what CrowdStrike has been using so that they don’t have to go through the longer cycle of testing through Microsoft.
LP: So, a weakness on Microsoft’s side allowed CrowdStrike’s unsafe updates to go on for a long time.
MA: Yes, the reason goes two ways. If you are CrowdStrike, you want to do these updates frequently because you want to catch the bad guys, and testing through Microsoft might take you longer. You might be behind. This creates a dilemma: balancing the need for rapid updates with the time required for thorough testing. They got away with using the loophole to bypass Microsoft testing for a long time. But this time, an error popped up.
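To make the loophole concrete, here is a conceptual sketch in Python rather than real driver code: the platform verifies the engine once, but the content files the engine loads afterward never go through an equivalent review. The hash check, function names, and file format are illustrative assumptions, not Microsoft’s or CrowdStrike’s actual mechanisms.

```python
# Illustrative only: shows why shipping new logic as "content" sidesteps the
# slower certification path that the engine binary itself went through.
import hashlib

CERTIFIED_ENGINE_HASH = "0" * 64   # hypothetical value recorded when the driver passed review

def platform_certifies_engine(engine_bytes: bytes) -> bool:
    """Stand-in for the one-time, slower driver certification path."""
    return hashlib.sha256(engine_bytes).hexdigest() == CERTIFIED_ENGINE_HASH

def engine_loads_content(content_bytes: bytes) -> None:
    """Runs on every content push, with no equivalent third-party review.

    If parsing here can fault in kernel context, each rapid update carries the
    very risk the certification path was meant to catch.
    """
    rules = content_bytes.splitlines()   # toy "parse"; this is where validation has to live
    apply_rules(rules)

def apply_rules(rules: list[bytes]) -> None:
    pass  # placeholder for detection logic driven by the content file
```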
LP: What about independent testing? There seem to be indications that CrowdStrike made an effort to restrict it.
MA: I honestly don’t know. Generally, most security software companies operate the same way when it comes to handling proprietary information and so on. You’ve got McAfee and Norton plus others, and the biggest two are supposed to be Microsoft and CrowdStrike according to market research firms such as Gartner. I believe independent testing is usually not performed.
With large enterprises, however, things are different. For instance, some enterprises don’t allow updates to be rolled out without thorough testing first. This is because enterprises have specialized applications that interact with the operating system, and even minor updates to the operating system can mess up these applications. So updates are usually tested against their specific applications before they are rolled out. If the enterprises don’t do this, it’s a problem. That’s why I say that you have four areas of concern: CrowdStrike, Microsoft, the enterprises themselves, as well as the regulatory issues.
LP: Looks like a colossal market failure on several fronts: You’ve got seller-side problems with CrowdStrike’s questionable updating and testing practices and Microsoft’s failure to address vulnerabilities – something downplayed in the press. On the buyer side, many large companies are also neglecting proper testing, while smaller companies lack the resources to do so. I empathize with smaller companies—they’re like patients relying on a doctor’s expertise, trusting these major players to provide reliable solutions.
MA: Yes, with the smaller companies/businesses, it’s an issue of trust. I would think they trusted both CrowdStrike and Microsoft. As for the larger companies, like Delta, for example, I’m sorry, shame on you.
LP: Because the big guys do have the money and the resources to make sure that these products work, and they didn’t bother to do it.
MA: Yes. Sometimes, you may need to prioritize thorough checks over speed, meaning it’s crucial to catch issues early, even if it slows down the process. People often take risks. And again, it’s a problem on the regulatory side, because these companies exploit liability loopholes. If liability were a factor, companies might be less inclined to take these risks, in my opinion. That issue lies in regulatory and liability laws, which effectively buffer Microsoft and others from accountability.
LP: So if these companies were held accountable, they might prioritize resilience more.
MA: Yes. Resilience has two components. One is stopping issues from happening, which I think is very hard. You’re going to get hit. The second, more important component, is how quickly you can recover when they do. You may not be able to avoid problems entirely — there are things outside your control — it’s like getting sick. But you do have some control over how effectively you respond and recover. The focus should be on improving the response and recovery when you do get hit. I think what the CrowdStrike situation has shown is that nobody was prepared for the recovery procedures. Or the recovery procedures were inadequate.
I want to give you an example of companies addressing the issue of retrieving recovery keys via automated systems when laptops lock up. The main challenge is the labor-intensive process of having to call into call centers to get the key. We developed a system that automates this process. You could call in, and based on the phone number, the user could be authenticated and receive the recovery key automatically without human involvement. Somebody still has to manually enter the key, but this system reduces dependency on call center staff and speeds up the recovery process.
The CrowdStrike problem arose because many end devices required a recovery key, and there wasn’t enough staff to handle the high volume of requests, whether in large enterprises or small businesses. For instance, in hotels and gyms, it took up to three days to reboot machines simply because they didn’t have the recovery key.
The key to resilience is having a fast recovery procedure. It seems that many enterprises have not invested enough in planning what to do when things go wrong.
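As an illustration of the automated recovery-key retrieval described above, here is a minimal sketch; the escrow table, the caller-ID check, and every name in it are hypothetical stand-ins rather than any vendor’s actual system, and a real deployment would want stronger authentication than a phone number alone.

```python
# Hypothetical sketch: automated BitLocker recovery-key retrieval.
from dataclasses import dataclass

@dataclass
class Device:
    machine_id: str
    owner_phone: str
    recovery_key: str  # 48-digit BitLocker recovery key held in escrow

# Toy escrow table; a real deployment would query a directory service or an HSM-backed vault.
ESCROW = {
    "LAPTOP-001": Device("LAPTOP-001", "+15551230001",
                         "111111-222222-333333-444444-555555-666666-777777-888888"),
}

def lookup_recovery_key(caller_phone: str, machine_id: str) -> str | None:
    """Return the recovery key only if the caller's number matches the device owner."""
    device = ESCROW.get(machine_id)
    if device is None or device.owner_phone != caller_phone:
        return None  # fail closed: no key without a match
    return device.recovery_key

if __name__ == "__main__":
    print(lookup_recovery_key("+15551230001", "LAPTOP-001") or "Not authorized")
```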
LP: And yet things do go wrong pretty often. With the CrowdStrike disaster, some were reminded of the SolarWinds cybersecurity breach where attackers inserted malicious code into updates of SolarWinds’ Orion software, compromising thousands of organizations, including U.S. government agencies and major corporations. The attackers reportedly breached the system using the absurdly insecure password “solarwinds123”!
MA: Network failures have impacted major companies like Amazon, Microsoft, and Google in the past, often caused by misconfigurations in files and routing tables. Recovery processes were slow and cumbersome. So that’s one group.
Then you’ve got the second group: hackers and other nefarious actors. In the case of SolarWinds, hackers targeted the company because it was the predominant provider of network management systems. By compromising SolarWinds, they were able to piggyback into multiple systems, much like a stowaway sneaking onto a plane to cause havoc.
With CrowdStrike, even if the issue wasn’t due to something nefarious happening internally, it raises a crucial question: who is protecting the protectors? What are their procedures for safeguarding themselves? These virus scanners hold vast amounts of data from businesses—such as traffic analyses, application usage, transaction frequencies, and more—through their surveillance and monitoring. CrowdStrike and other cybersecurity firms are sitting on a vast amount of sensitive information.
Recall what happened with the AT&T data breach. The hackers didn’t go after AT&T directly — they went after the cloud provider, Snowflake, and stole the data from there. CrowdStrike, for example, uses Microsoft as its cloud provider, and other security firms do the same.
Now we’re delving into dependencies: these vendors rely on each other, and in the end, only a few key players have a comprehensive view of everything.
LP: The “too big to fail” analogy has surfaced here. These companies, like the big banks, face little regulatory pressure or liability, as you note, and when things go wrong, it’s the everyday people who end up getting screwed.
MA: I use the analogy of the one percenters that cause havoc for everybody else. And guess what? In this case, it was exactly one percent of the system that went down (8.5 million devices, according to Microsoft). But it caused major disruptions for everyone else. This highlights a bigger issue that nobody has looked at: even though we have what appear to be multiple vendors and players, if they all use the same small component—like one feature on an operating system—it can create massive problems. I don’t believe anybody has done a true supply chain dependency analysis.
What I mean by dependency is about impact, not just connections. A one percent failure can have huge consequences worldwide. In this case, a one percent failure essentially brought the entire global system to its knees. Hospitals in the US, Europe, and elsewhere couldn’t operate because they couldn’t access patient records, and MRIs failed because they relied on precise dosage information. This illustrates the cascading effects of such failures. We tend to ignore the one percent at our peril. China was not affected since they do not use US security firms such as CrowdStrike, though some US hotel chains in China were affected.
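The point that dependency is about impact, not just connections, can be made concrete with a toy graph: compute everything reachable downstream of a single failed component. The graph below is invented for illustration; a real supply-chain analysis would be built from actual vendor and system inventories.

```python
# Toy impact analysis: which systems are (transitively) affected if one shared component fails?
from collections import deque

# edges point from a component to the things that depend on it (all names invented)
DEPENDENTS = {
    "endpoint-agent": ["airline-checkin", "hospital-records", "bank-teller"],
    "airline-checkin": ["boarding", "parking-payments"],
    "hospital-records": ["mri-scheduling"],
    "bank-teller": [],
    "boarding": [], "parking-payments": [], "mri-scheduling": [],
}

def blast_radius(failed: str) -> set[str]:
    """Everything downstream of a failed component, found by breadth-first search."""
    seen, queue = set(), deque([failed])
    while queue:
        node = queue.popleft()
        for dep in DEPENDENTS.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

if __name__ == "__main__":
    # one "small" component, but most of the toy system sits downstream of it
    print(sorted(blast_radius("endpoint-agent")))
```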
LP: What you’re saying highlights the fact that on the regulatory side, we need experts who truly understand the complexities involved. We need better models for systemic risk, more information on how these issues occur, and effective strategies for prevention and response.
MA: Yes. Some have defended Microsoft and argued that they had to grant third parties like CrowdStrike access to their operating system due to EU regulations, specifically the General Data Protection Regulation (GDPR), which mandated this access as part of their settlement with the EU. That’s their argument. I think that’s bogus.
Granting access without proper checks and balances is not what the regulation says. To build resilience and incentivize companies to improve, the focus should be on three key areas: First, ensuring the right talent is in place to handle security and recovery. Second, developing robust systems for recovery, such as BitLocker. Third, establishing effective processes to support these systems and ensure resilience. Both selling companies and buyers, like Delta and others, need to address these aspects to enhance their security and recovery capabilities.
It’s people, systems, and processes.
LP: What happens if we don’t tackle these issues?
MA: From a risk management standpoint, these guys are thinking, oh, gee, it happens once a year, so let me take the risk. I think that’s the mode we’re in. So, from a risk management perspective, addressing these issues only once a year, once every three years, or just a single time is not good enough.
LP: How does CrowdStrike’s market share affect its role for big companies and impact interdependencies?
MA: There are other firms besides CrowdStrike, which holds an 18 percent market share. However, CrowdStrike services about 60 percent of the Fortune 500 and 40 percent of the Fortune 1000 companies. This means that many large companies heavily rely on CrowdStrike without fully vetting their products, creating a dependency on this single provider.
Small businesses have largely outsourced their cloud services to various providers, even down to simple apps for everyday services like laundry. We live in a world with extensive interdependencies. For example, a significant issue that wasn’t widely discussed with CrowdStrike was that airport parking lots became inoperable because payment systems for credit cards were down. People could not get out. At Baltimore Washington International Airport, hotels couldn’t check guests in due to a lack of access to credit card verification systems. I was able to check in because I was a regular guest, but it took them three days to validate my credit card.
Even though CrowdStrike has only an 18% market share, the rest of the market is still significantly dependent on that 18%, and this dependency has not been fully addressed or understood.
It seems we’re constantly relearning these lessons. Despite repeated experiences, businesses often seem surprised and need to learn again. Over the past 20 years, many companies have prioritized short-run revenue over resilience, which is crucial but costly. Investing in resilience costs money, but it’s essential for long-term stability.
LP: Focusing on short-term shareholder value often leads companies to skimp on security, but does it ultimately cost them more in the long run?
MA: Something I want to know, and it’s more of a question than anything else: How effective are these cybersecurity companies at stopping attacks relative to their cost? In other words, how many cyberattacks have they prevented over the last umpteen years compared to how much companies spend on them? It’s a question. No doubt businesses do suffer as a result of security breaches.
Supposedly, CrowdStrike’s website has a lot of information. They boast about how businesses have consolidated their security needs with them, presenting numerous case studies that claim customers saved six dollars for every dollar spent with CrowdStrike. Is that true? It’s on their website, but how did they come up with that number?
We’ve observed various issues that come in different flavors, but they often share a common root cause. First, the shift to cloud services has led to significant breaches, such as the theft of AT&T’s 100 million customer records. Second, major cloud providers like Amazon and Azure have experienced massive network failures. Third, there are vulnerabilities in management systems that allow for infiltration. Lastly, failures in surveillance systems have led to widespread disruptions.
Think about airline failures: planes potentially crashing when sensors fail. For redundancy, three sensors are installed. When there is a mismatch between the sensors’ data, the majority rule is used, i.e., trust the two sensors that give the same data. Sometimes, an event like a bird strike could make two sensors malfunction. There’s an overreliance these days on sensing, specifically sensing that is given to automated agents. We need to reevaluate how sensor failures affect automated systems. When sensors fail, the resulting issues can quickly propagate through these systems due to their rapid processing capabilities, amplifying the original problem.
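A worked toy version of the two-out-of-three voting just described shows how a common-cause failure (two sensors knocked out the same way) lets the majority rule settle on the wrong reading. The tolerance and the readings are invented for illustration.

```python
# Toy triple-modular-redundancy vote: trust any two sensors that agree.
TOLERANCE = 0.5  # invented: how close two readings must be to count as "agreeing"

def majority_vote(a: float, b: float, c: float) -> float | None:
    readings = [a, b, c]
    for i in range(3):
        for j in range(i + 1, 3):
            if abs(readings[i] - readings[j]) <= TOLERANCE:
                return (readings[i] + readings[j]) / 2  # two agree: use them
    return None  # no two agree: the system knows it cannot trust its sensors

# Normal case: one sensor drifts, the other two outvote it.
print(majority_vote(10.0, 10.1, 42.0))   # 10.05

# Common-cause failure: two sensors fail the same way, and the vote picks the bad pair.
print(majority_vote(42.0, 42.1, 10.0))   # 42.05, even though 10.0 was the true value
```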
The national security implications are significant due to these dependencies we’ve been discussing. I’m concerned about what can happen with the government increasingly relying on commercial companies, which handle extensive data collection. The business models of some companies include a lot of government data, and you see the CEO reportedly earning over a billion-dollar salary. Many of these companies are either directly owned or heavily influenced by billionaires, intertwining corporate practices with political interests—a complex issue that merits further examination.
LP: Right, billionaires are tightly linked to our regulatory system, as they are showing us right now in the election cycle.
MA: Yes. Reid Hoffman, the LinkedIn billionaire, just gave $7 million to the Harris campaign, and then he goes on CNN demanding the firing of people at the FTC who regulate.
LP: He’s unhappy with FTC Chair Lina Khan, who has taken a strong stance against Big Tech, and is now demanding her removal.
MA: Lots of Silicon Valley is not happy with Lina Khan. There are legitimate concerns for smaller companies—the costs of regulatory approval can be prohibitively high. For example, I know of a company that was acquiring another for $40 billion. The regulatory approval alone cost $27 million. While this is a minor expense for big firms, it’s a huge burden for smaller ones. This cost might make them think twice about future acquisitions and instead, opt to license technology and hire a few smart people as an alternative. This shift can increase dependencies across the board in ways that aren’t always visible. Many hadn’t heard of CrowdStrike until a major incident brought it to attention. Smaller companies, though less visible, are deeply integrated into critical systems.
On the indemnity & lawsuit topic (of course I’ll start by saying IANAL, take with a grain of salt), many of the largest “enterprises” (customers in the Fortune 500) will negotiate custom legal terms that have a direct correlation with the amount of $$ they spend on a contract, especially the very first contract. So when you purchase $500k for a single year, you probably get boilerplate CrowdStrike legal terms but when you purchase $10M you get to extract your pound of flesh by spending 3 months negotiating legal custom terms which may include things like an indemnity $$ cap.
I don’t work for CrowdStrike so maybe their legal team doesn’t do this but it is very common in the cloud software sales world, I’d be surprised if Delta didn’t have their own custom legal agreements with CS.
Very helpful, thanks!
+1 to this.
I am currently running a small software vendor in the Atlassian space, with both customer-hosted products and Cloud-based ones (actually Software as a Service), and I have experience working in software across sectors, services, in-house, and product.
Enterprises will often ask for different terms. Our sales price is not really high enough to justify it so we will push back. Or if the changes asked for make sense we might apply them to our EULA with all customers. Keeping track of different terms is hard and expensive for us.
Enterprises might want:
Variation in liability,
A different SLA (service level agreement) – potentially with service credits for downtime or loss of service,
Guarantees related to data sovereignty – where your data is located. This can cause huge friction between European clients and American service providers. If you want a search term to start from, look up GDPR and bounce from there.
For customer-hosted software, they might ask for the source code to be made available, or held in escrow in case of software supplier failure.
There might be specific terms and due diligence around the security aspects of service provision. Terms to look for here are ISO27001, SOC2 (security standards), and HIPAA for healthcare, as three examples.
Those are top of mind when talking custom terms.
There are other quality standards from ISO, but they are not top of mind – which tells you something about how often people ask about them.
When it comes to IT availability the key phrases to look for are Disaster Recovery (DR) and Business Continuity Planning or Management (BCP). Both are referenced in ISO27001 and other standards.
Both are hard to test, expensive to implement and often politically difficult.
There is a big asymmetry here in incentives, gains, politics, and visibility when it comes to thinking about preventing risk versus effecting other change, i.e. it is much harder to prove the value of expensive work that mitigates risk than of an equally expensive programme with a tangible positive return (“we didn’t have this bad thing happen!” vs. “I made this small good change happen”).
and even testing is fraught.
picture the conversation
“I want to run a disaster recovery test”
“what does that involve”
“we’re going to fail over all our systems to this other data centre and see if the business still runs…”
“what’s the risk?”
“it might not work as we expect and then nobody can work for a day.”
It is not something you do lightly. And even then they test a SUBSET of the whole complex IT mess we live in, typically within one contractual boundary (one company, maybe some suppliers), rather than across many.
That is my experience anyway of enterprise IT.
there is another way but that’s another comment!
1. from what I read in the Twitter-sphere, it sounded like Crowdstrike (arguably) crossed the negligence line;
2. we need to see the Delta-CRWD contract. It sounds like this is not going straight to arbitration.
(Arguably) I’d bet that the Crowdstrike team just wanted a mission-critical, high-profile client, so the business side told their lawyers to give Delta anything that Delta wanted in the indemnity-liability legalese.
I doubt that Delta’s legal team would allow a mission-critical function to have weak protections for Delta.
And Delta’s lawyers probably said, “yes, please. thank you..I demand XYZ, no exceptions.”
Also, the Crowdstrike CEO has a history of “failing upwards”: as CTO of McAfee, he oversaw another meltdown.
My suspicion is that Crowdstrike’s “growth at any price” mentality would result in a very pro-Delta contract when it came to the “boring”, “meaningless”, boilerplate-legalese.
I think the lesson that should be learned is that continuing to centralize data and systems “on the cloud” makes the system LESS resilient.
It seems to me we’ve gone from mainframes to PCs and now back to even larger mainframes called “The Cloud.” Progression or regression?
My IT friends say that if your business depends 100% on the cloud, then you are at the mercy of the cloud provider. Some businesses have options for how much to put where, and complete reliance on what amounts to a single point of failure is too risky.
Your friend is right, but that is missing the situation beforehand.
If your business depends 100% on your internal IT team, then you are at the mercy of your internal IT team.
It is the dependency that is critical and important to manage.
We are all at the mercy of those providing services to us. Whether it is internal or external (i.e. across a company / entity / contractual boundary) is moot.
I always remind people to substitute the words “The Cloud” with “someone else’s computer”.
In many many cases using someone else’s computer is better than your own. Costs are spread across multiple clients, as is training, hiring, usage, etc. You might typically pay for someone else when the skills are too specialized, or infrequently used for you to develop in house. I don’t see that as any different to paying for some other specialized service (like a particular specialist legal counsel).
You might develop these skills in house where ultimate control and flexibility is important. Banking springs to mind here. But the costs are high.
Lastly… considering resilience of software systems is something that needs to be deliberate and tested. It’s not something that happens by magic. It’s something that needs to be intentional and built in – that is as true of cloud solutions as it is of your own IT-hosted solutions.
This is an informative interview. It covers a number of elements in the recent CrowdStrike debacle. But once again, there is no discussion, or apparently interest, in the very first question that arose in my mind when this failure occurred: how the hell did a company like CrowdStrike, with its proven past failures, dishonesty, willingness to provide false or misleading information favoring particular political interests, and ties to the intelligence community, become such a prominent part of our “cybersecurity” infrastructure so quickly? That seems like such an obvious question that I feel like I’m in an episode of the Twilight Zone when it is not discussed at all, even in passing. This is not a tangential issue in my opinion. The article asks this question:
“With CrowdStrike, even if the issue wasn’t due to something nefarious happening internally, it raises a crucial question: who is protecting the protectors? What are their procedures for safeguarding themselves? These virus scanners hold vast amounts of data from businesses—such as traffic analyses, application usage, transaction frequencies, and more—through their surveillance and monitoring. CrowdStrike and other cybersecurity firms are sitting on a vast amount of sensitive information.”
The importance of this question was certainly demonstrated. But I guess my question is: who is protecting us *from* the protectors? When this first occurred I joked that CrowdStrike’s fake Russian cyberattack software must have activated prematurely. But much of our intelligence activity, surveillance, propaganda, etc. is now laundered through the private sector. As the interviewee states:
“The national security implications are significant due to these dependencies we’ve been discussing. I’m concerned about what can happen with the government increasingly relying on commercial companies, which handle extensive data collection. The business models of some companies include a lot of government data, and you see the CEO reportedly earning over a billion-dollar salary. Many of these companies are either directly owned or heavily influenced by billionaires, intertwining corporate practices with political interests—a complex issue that merits further examination.”
I’m worried about billionaires as well. But no one seems to be discussing this event in the context of the growing integration of these interdependent commercial enterprises with our national security apparatus. So with that in mind, I ask again how this nefarious company was able to gain so much prominence so quickly.
I found this exchange to be useful:
LP: Looks like a colossal market failure on several fronts: You’ve got seller-side problems with CrowdStrike’s questionable updating and testing practices and Microsoft’s failure to address vulnerabilities – something downplayed in the press. On the buyer side, many large companies are also neglecting proper testing, while smaller companies lack the resources to do so. I empathize with smaller companies—they’re like patients relying on a doctor’s expertise, trusting these major players to provide reliable solutions.
MA: Yes, with the smaller companies/businesses, it’s an issue of trust. I would think they trusted both CrowdStrike and Microsoft. As for the larger companies, like Delta, for example, I’m sorry, shame on you.
The problem is rooted in the fact that the internet’s inherent vice is that it is an incredible surveillance tool.
What the public values as an incredible tool for communication and education, is valued by both business interests, and government, more and more in open collusion, for its ability to surveil everything about those of us who use it.
This fact is coming more and more to the fore as the deep corruption endemic in our economic/political culture reaches the point of saturation, where every organized endeavor is guided by leaders vetted primarily to ensure their absolute loyalty to the neoliberal consensus.
This has resulted in the degradation of every product and service offered by the “Markets”, by the constant corrosive effects of market concentration, price fixing, crony-capitalism, nepotism, fraud, and its attendant cost-cutting, corner-cutting and all-around toleration of every kind of cheating.
The neoliberal ideal of unfettered business has led to corruption of government, and with it, the universally bought-off politicians and corrupt intelligence agencies.
The fact that both business, and government interests have required, and enforced absolute, real-time access to all data immersed in the internet universe leads not only to the perverse abandonment of all individual rights to privacy, but also to systemic vulnerabilities that have recently plagued us in the form of cyber crime and wide-spread outages in vital systems.
The internet ideally would be a public utility, provided to everyone, at no cost, and secured by a competent government agency.
But because of our culture’s belief in the superiority of the “Private Sector” meaning unfettered capitalism, the internet has become the property of big business, that inevitably demands the right to monopoly behaviors and delivers shoddy products and services at exorbitant prices.
CrowdStrike delivered services to the corrupt DNC, and colluded with them and the Intel community to cover-up their underhanded behavior as regards our recent elections.
In return for these favors, and because of the name recognition related to their ‘service to the public’ they have been helped to acquire lucrative contracts with government and big business.
According to my sources, CrowdStrike discovered a new vulnerability and rushed to be the first to publish the solution. The rush entailed short-changing the testing phase of product development, with testing limited to a virtual environment rather than real-world conditions.
This rush to be first led to cutting corners, and there it is: SOP = FUBAR.
This is America, we’re exceptional, and this is how we do things.
A local law firm is seeking clients for a class action suit against CrowdStrike. I applied for a job there…and suggested a local firm as a potential client.
One wonders if they could bring up CrowdStrike’s past conduct as evidence…
“So it’s not, I would think, an engine software error per se. Rather it’s the file that caused the operating system, which is the Windows operating system, to malfunction.”
It seems that the “file” is essentially a list of patterns to check, e.g. to block denial of service attacks. The program that uses that list should not have the ability to crash the system, WHATEVER the content of that list. For example, if it relies on a correct format, that format should be checked first. Suppose that you rely on me to end sentences with a dot, exclamation mark, or question mark. But I fail to do it
—
It would be weird if you freeze upon reading such a line. A computer may attempt to store the content, and write blanks to some buffer until “\n” is encountered or it overflows… So you set a limit on the length of what you read “in one shot”. In general, it seems that there was an assumption that the freshly distributed file will not cause the program reading that file to crash because the file was written by CrowdStrike.
In Baltimore harbour, there was a protective structure for each bridge pillar which would work if a ship proceeding toward those pillars had a course parallel to the channel direction, and who would guess it? A ship that lost power additionally lost the ability to maintain its direction. In short, it was idiotic. Whatever authority installed those “dolphins,” it was aware that there have to be multiple dolphins for each pillar, so I guess that they planned to add more later, but the spending priorities were altered. E.g. exactly at the crash time, crews were fixing the pavement on the bridge.
It is even easier to make such an error in software, but whatever error was in the data file, a critical program should not crash the system.
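To make the commenter’s point concrete, here is a minimal sketch of a pattern-file loader written so that no possible file content can crash the consumer; the entry format, the length cap, and the file handling are assumptions for illustration, since the real channel-file format is not public.

```python
# Illustrative only: shows the "validate everything, never crash on bad input" discipline.
import re

MAX_LINE_LEN = 4096                      # hard cap so an unterminated line cannot blow up a buffer
PATTERN_LINE = re.compile(r"[0-9A-Fa-f]{2}(:[0-9A-Fa-f]{2})*")   # assumed entry format for this sketch

def load_patterns(path: str) -> list[str]:
    """Return only well-formed entries; malformed lines are skipped, never fatal."""
    patterns = []
    try:
        with open(path, "r", errors="replace") as fh:
            for raw in fh:
                line = raw.strip()
                if not line or len(line) > MAX_LINE_LEN:
                    continue                       # ignore blanks and oversized lines
                if PATTERN_LINE.fullmatch(line):
                    patterns.append(line)
                # anything else is dropped (or logged), never trusted, never fatal
    except OSError:
        return []                                  # a missing or unreadable file degrades, it does not crash
    return patterns
```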
The assumption you cited is very imprecise (not your fault but frustrating to see that happen). Crowdstrike had access to the kernel, and Microsoft did not check updates that would affect kernel operations, as it should have, because adequate safety checks were deemed too cumbersome and slow…..for a security program!
Yves, you’ll probably be interested in this:
From Bits About Money:
Why the CrowdStrike bug hit banks hard.
Adds some particulars to my comment above.
Belle is correct here. Crowdstrike pushed a config file that the Crowdstrike kernel driver could not load correctly, so it crashed the kernel, requiring a reboot. But then the endpoint running Crowdstrike couldn’t boot up, because Crowdstrike was marked as required for boot, and the bad config file just crashed the driver again when the endpoint came up. Many of the affected computers were running on encrypted hard drives, which meant someone would have to enter a key to have the computer even boot.
Crowdstrike was avoiding kernel driver verification by putting changes in the config file, so MS wouldn’t have picked that up. However MS should stop letting third party code run in kernel space so they aren’t completely innocent. Apple no longer lets third party modules into the kernel and Linux has a kernel subsystem eBPF that is harder to crash.
All the endpoint protection/antivirus stuff is swathed in secrecy because the bad guys might find out something! So nobody is allowed to see what really goes on. All these things get to push secret updates on their schedule to all the most critical servers in an organization. I don’t think they actually do much beyond detect known malware – mostly they are good at marketing.
There is a decent explanation in this article: https://www.wired.com/story/crowdstrike-outage-update-windows/
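One standard defense against exactly this boot-loop failure mode is a “last known good” scheme: validate a newly pushed rules file before activating it, and fall back to the previous good copy if the active one will not load. The sketch below uses invented file names and a stand-in validator; it does not describe how CrowdStrike’s updater actually works.

```python
# Sketch of a "last known good" update scheme; file names and the validation check are invented.
import os, shutil

ACTIVE = "rules.active"        # file the agent actually loads at startup
CANDIDATE = "rules.candidate"  # freshly downloaded update
BACKUP = "rules.lastgood"      # last version that validated and loaded cleanly

def validate(path: str) -> bool:
    """Stand-in for a real parser/self-test; must never raise."""
    try:
        with open(path, "rb") as fh:
            data = fh.read()
        return data.startswith(b"RULESv1")  # hypothetical magic-header check
    except OSError:
        return False

def apply_update() -> bool:
    """Only swap in the new file if it passes validation; otherwise keep running as-is."""
    if not validate(CANDIDATE):
        return False
    if os.path.exists(ACTIVE):
        shutil.copy2(ACTIVE, BACKUP)       # preserve a rollback target
    os.replace(CANDIDATE, ACTIVE)          # atomic swap on the same filesystem
    return True

def load_rules_at_boot() -> str | None:
    """Prefer the active file, roll back to the last good copy, or run degraded."""
    if validate(ACTIVE):
        return ACTIVE
    if os.path.exists(BACKUP):
        os.replace(BACKUP, ACTIVE)         # roll back instead of crashing again
        return ACTIVE
    return None                            # run without the rules rather than wedge the machine
```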
I do not understand how the ability of a driver to crash the system upon reading a changeable config file is not a profound error in the driver. Normally, if you put illegal values in a config file for, say, vim, that line is ignored, so an intended alias does not work, the coloring scheme remains the default, etc. Sure, catching all possible ways a config file can be “illegal” is not simple, but it has an established methodology.
From my limited experience, absorbing a config file is a one-time operation during a much longer program execution, so there should not be a performance problem in saturating it with “sanity checks”.
Yves – a great resource for reading about societal impacts and related considerations of computer security, especially failures, is Schneier on Security. Extremely well informed, technically precise, and – if you can face wading through Clive Robinson’s comments – the commenters often give valuable insights.
To give just one example of his perspicacity, he foresaw technofeudalism over a decade ago.
I’d put in more, but I’m on mobile, unfortunately. Apologies.
I occasionally listen to the Security Now podcast, put out by Steve Gibson, developer and security enthusiast, along with Leo Laporte. His stuff is usually very technical, and he tries to leave politics at the door (not always successful, and their views on China and Russia are tech boilerplate), but his most recent segment on the CrowdStrike bungle (Episode – Platform Key Disclosure) was fascinating.
A big benefit of his podcasts is the detailed notes and transcripts, if you would rather search and read. I didn’t always understand the topics, but at worst it’s ambient tech education. And his site is blessedly free of modern web trappings.
https://www.grc.com/securitynow.htm
If nothing else, peruse the notes and transcripts on the affair.
one more post on this… (finally something on NC I feel qualified to comment on!)
Our whole global IT landscape is… a complex dynamic picture of 10s of thousands of systems and types of systems.
Change and risk management in such an environment is hard.
There are two techniques relevant here to share (I hope useful):
1 is continuous X (continuous integration, delivery, deployment, etc.)
2 is “canary testing”
So what is behind continuous delivery?
Back in the old days… developers worked away for weeks, months, or quarters, integrated all their changes, and released all that change in one big go. Now the issue with that is that all that change (risk) is bundled into one big delivery. Failures were legion. Big releases were typically followed by a patch a few days later (a patch is a small software change). We were essentially concentrating the risks into one big release.
What is the approach now? By doing continuous delivery you spread the risk. In theory, small risks incurred every day should lead to lower downtimes and lower overall risk of failure. Continuous delivery in this way might mean releasing changes to your software EVERY day. Some of the better teams at our work may release multiple times a day.
This is great for software teams – instead of trying to figure out which one of 1000 changes caused the problem… you learn quickly that the one single change you deployed was the issue. It becomes much simpler to deal with. And as hairless monkeys that make mistakes all the time, simple is better.
What of Canary testing?
How do we know our simple changes are working ok for our customers?
This is where canaries come in. Like the phrase “canary in the coalmine.”
Let’s say you have 10,000 customers.
Instead of deploying the change to all 10,000 in one go…
You might deploy to 10 internal “customers” first. Monitor the change. Do a test on a real system. Wait 10 minutes.
Then deploy to 100 of the smaller or more tolerant customers, or just a suitable geographical spread of your customers. Then monitor the changes. See if there are support calls. Look at your operational systems for signs of errors.
Then more phases, potentially, until you have deployed to all 10,000.
This phased testing approach dilutes and spreads the risk – rather than taking all the deployment risk in one hit, you spread it and learn from it.
This is how ALL the big cloud and SaaS companies should or will be operating.
It has consequences that are not obvious – e.g. it means there is no “one version” of Facebook, Instagram, or Google. There is the version I see, which might have multiple user interface experiments and be in different infrastructure canary groups, and there is the version you see, which is running on a subtly different code and infrastructure stack.
Hope that is interesting. They’re pretty important concepts in distributed systems development.
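For what it’s worth, here is a toy sketch of the phased rollout described above; the wave sizes, the error budget, and the fake deploy/health check are placeholders for whatever a real deployment pipeline and monitoring stack would provide.

```python
# Toy canary rollout: deploy in widening waves, watch health, stop and roll back on trouble.
import random, time

WAVES = [10, 100, 1000, 10_000]     # cumulative targets: internal, friendly, broad, everyone
ERROR_BUDGET = 0.01                 # abort if more than 1% of a wave reports failures
SOAK_SECONDS = 1                    # stand-in for the real "wait and monitor" interval

def deploy(host: str) -> bool:
    """Placeholder for pushing the change to one endpoint and reading back health."""
    return random.random() > 0.001  # pretend ~0.1% of endpoints have problems

def canary_rollout(hosts: list[str]) -> bool:
    done = 0
    for target in WAVES:
        wave = hosts[done:target]
        if not wave:
            continue
        failures = sum(0 if deploy(h) else 1 for h in wave)
        time.sleep(SOAK_SECONDS)                     # soak: watch dashboards, support calls, error logs
        if failures / len(wave) > ERROR_BUDGET:
            print(f"halting at wave of {len(wave)}: {failures} failures, rolling back")
            return False                             # roll back this wave only; the rest never saw it
        done = target
    return True

if __name__ == "__main__":
    fleet = [f"endpoint-{i}" for i in range(10_000)]
    print("full rollout succeeded:", canary_rollout(fleet))
```

In practice the “monitor” step is dashboards, error budgets, and support tickets rather than a sleep call, but the shape of the control flow is the same.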
I know nothing about these things, but the first thing I thought about upon hearing of the CrowdStrike mishap was why the update wasn’t tested on a smaller segment before it was released to the whole domain. Why aren’t these systems walled off from each other? The second thing I thought about, and no one has directly mentioned it, is: are our and other countries’ nuclear weapons systems vulnerable to this type of error?