How to prevent your software update from triggering a world-stopping outage

The beat of a butterfly’s wings can be felt on the other side of the world – or at least, so the Chinese proverb goes. The implicit message is, of course, that even the tiniest actions have consequences. In the case of a glitch-ridden software update to CrowdStrike’s endpoint protection system on 19 July 2024, that meant grounded passenger aircraft, GP offices resorting to pen and paper, and at least one broadcast news station in the UK being forced off the air.

That episode earlier this year only illustrated the degree to which many aspects of our lives and the stability of commercial organisations are increasingly hostage to the integrity of one or more software platforms. The CrowdStrike outage alone is estimated to have incurred up to $5.4bn in possible damages for the now-beleaguered cybersecurity firm. For those other companies operating software platforms in multiple, sensitive spots in supply chains, the lesson is abundantly clear: update early, often and carefully, lest your organisation spark its own worldwide cyber drama.

The complexity conundrum

The CrowdStrike outage was hardly unpredictable. Neither was it unique. Rickety third-party software providers that have unwittingly gained outsized importance in their supply chains have caused multiple wide-ranging outages, affecting sectors from the restaurant industry to financial services. While the cause of each outage may vary, there is an overarching trend behind them. Modern IT systems and cloud environments have become too complex to control and manage using siloed toolsets. In fact, 86% of technology leaders say their technology stack has gone beyond human ability to manage, which makes it easier to make mistakes or overlook issues.

As worrisome as the potential revenue loss are the real-world consequences that outages can have for those who depend on digital services. For example, if a payment system at a supermarket is unavailable for any length of time, consumers may be unable to buy essential groceries or fill up their cars. A technical glitch at a hospital could delay patients from receiving life-saving care.

Digital dominoes

Organisations today exist in a world where IT systems behave very differently to the way they once did. As businesses continue to transform, their digital environments have become hyper-connected. A single disruption can trigger a chain reaction, rippling across multiple interconnected systems and services. Organisations can no longer think about the health of their systems in silos. Instead, they must consider how those systems interact across hybrid and multi-cloud environments and third-party services.

Take e-gates, for example. In partnership with the government, airports have introduced technology to ease the flow of travellers, reduce the reliance on staff, and have a more accurate view of who is entering and leaving the country. However, even if the e-gates themselves work, there is a whole chain of events that could impact the user experience. A flight may be delayed, disrupting planned passenger flow and leading to long queues and a poor experience for passengers.

If the airport monitors the user experience holistically – analysing the health of the e-gates in concert with other factors, such as flight arrival times and passenger footfall – it can make better decisions to optimise travellers’ journeys.

Beyond the reactive

There is no denying that the digital world is becoming more complex, but organisations need to continue to innovate without compromising service reliability or introducing unforeseen risk to their customers or business.

This can best be achieved with a proactive approach to managing the health of digital services. Observability platforms combine monitoring, analytics and automation capabilities that enable teams to reduce the risk of outages and minimise the impact when outages do occur. For example, synthetic monitoring, which runs scripted checks against a service on a schedule, can help to detect and resolve potential user experience issues before they become an outage, and to ensure fast action if an incident does occur.
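To make the idea of synthetic monitoring concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any vendor's product: the names `run_synthetic_check`, `should_alert` and `CheckResult`, the two-second latency budget and the three-failure alert rule are all hypothetical choices made for this example. The probe is passed in as a plain function, so it could wrap an HTTP request, a login script or any other scripted user journey.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    ok: bool
    latency_s: float
    reason: str


def run_synthetic_check(
    probe: Callable[[], bool],
    max_latency_s: float = 2.0,  # assumed latency budget, purely illustrative
    clock: Callable[[], float] = time.monotonic,
) -> CheckResult:
    """Run one scripted probe and classify the outcome."""
    start = clock()
    try:
        passed = probe()
    except Exception as exc:
        # A probe that crashes is treated as a failed check, not a monitor crash.
        return CheckResult(False, clock() - start, f"probe raised {type(exc).__name__}")
    latency = clock() - start
    if not passed:
        return CheckResult(False, latency, "probe reported failure")
    if latency > max_latency_s:
        return CheckResult(False, latency, "probe exceeded latency budget")
    return CheckResult(True, latency, "healthy")


def should_alert(results: list[CheckResult], consecutive_failures: int = 3) -> bool:
    """Alert only when the most recent N checks have all failed (hypothetical rule)."""
    recent = results[-consecutive_failures:]
    return len(recent) == consecutive_failures and all(not r.ok for r in recent)
```

Requiring several consecutive failures before alerting is one simple way to avoid paging a team over a single transient blip. A real deployment would tune both thresholds to the service in question.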

Seeing through complexity

Part of the challenge in large IT environments is that so many different problems can lead to outages, including hardware failures, software bugs, cyber-attacks, and human error. As we’ve recently seen, even a routine software update can become a major point of failure. Organisations need a way to see the smoke before the fire starts to burn and take preventative action.

In this respect, AI-driven approaches to monitoring and observability are essential. Such solutions, if deployed correctly, can give teams real-time insights into systems health and help them to prioritise the actions they take to minimise disruption during an incident.

As well as revealing the source and cause of problems, these insights need to illustrate the impact of outages so that IT leaders have the information the C-suite needs to keep shareholders and other key stakeholders informed on their response efforts.

But it’s not enough to know that an application was offline for a given length of time – business leaders need to understand the impact on the outcomes they are measured against, such as the number of customers impacted, or the amount of revenue lost. Real User Monitoring (RUM) capabilities can be invaluable for meeting this need, giving teams a detailed view of user journeys and conversion rates to better understand the financial impact of an incident.
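As a rough illustration of how user-journey data might be turned into a business-impact figure, the Python sketch below compares an incident window against a baseline window. Everything here is an assumption made for the example: the `WindowStats` fields, the `estimate_impact` function, and the simplification that, without the incident, users would have converted at the baseline rate and spent the baseline average per conversion.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    sessions: int      # user sessions observed in the window
    conversions: int   # completed purchases (or other goal) in the window
    revenue: float     # revenue attributed to those conversions


def estimate_impact(baseline: WindowStats, incident: WindowStats) -> dict:
    """Estimate conversions and revenue lost during an incident window.

    Assumes incident-window sessions would otherwise have converted at the
    baseline rate and at the baseline average revenue per conversion.
    """
    baseline_rate = baseline.conversions / baseline.sessions if baseline.sessions else 0.0
    revenue_per_conversion = (
        baseline.revenue / baseline.conversions if baseline.conversions else 0.0
    )
    expected_conversions = baseline_rate * incident.sessions
    # Never report a negative loss if the incident window outperformed baseline.
    lost_conversions = max(0.0, expected_conversions - incident.conversions)
    return {
        "lost_conversions": lost_conversions,
        "estimated_lost_revenue": lost_conversions * revenue_per_conversion,
    }
```

The point of the sketch is the shape of the calculation rather than the numbers. A real estimate would also need to account for seasonality, sessions that never started because the service was unreachable, and customers who returned later to complete a purchase.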

Ideally, then, software providers should have no excuse for failing to prevent the kind of crippling, sector-wide outages the world is increasingly witnessing. Every organisation should consider the types of incidents that could impact its services and identify how it can be ready to respond quickly and minimise disruption when the next major outage strikes. To succeed, organisations need to change the way they manage and deliver IT services, taking a proactive approach supported by a holistic view of the business. Those that make this shift will likely be amongst those leading the pack in an increasingly connected digital future.

Bob Wambach is the vice president for product portfolio at Dynatrace
