The year’s not yet half over, but dozens of major companies (and possibly many more) have already experienced serious IT outages. Some, including those at TSB Bank, Woolworths, and NHS Wales, can aptly be described as “meltdowns.” While no system is perfect, and it may indeed be impossible to completely prevent these increasingly common service interruptions, they should not be dismissed as inevitable. A look at some of the year’s biggest IT mishaps sheds light on their common causes and on possible remedies such as smart IT. It also shows what organizations can do to get ahead of the next outage before it turns nuclear.
IT Outages: What's the Root Cause?
In theory, a service outage can occur for any number of reasons. With access to the right data, however, very few are truly unforeseeable. And as any IT professional will tell you, certain root causes arise time and again.
Change Process Failure
Most organizations have processes in place that should be followed when an outage occurs. However, if the company has low IT resilience, it will be unable to maintain a reasonable level of service when confronted with an outage. In short, even a minor disruption could bring an entire company’s operations to a near standstill, potentially costing the business thousands (or millions) of dollars in lost revenue and resulting in a lot of dissatisfied customers.
Earlier in the year, the Welsh branch of the UK’s National Health Service experienced a country-wide disruption in its network. Patient information, including test results, couldn’t be accessed, and healthcare workers couldn’t even log notes from patient consultations. The pharmaceutical system was also affected, preventing the distribution of prescription medication. The cause of the outage was a network failure at both of the NHS’s data centers, located just 30 miles apart. On the surface, this type of scenario is rare and hard to predict.
That being said, a network failure in two data centers could be the result of a failed change implemented simultaneously across both networks. A System of Intelligence could have flagged the risky change using machine learning models that identify significant combinations of risk factors. In turn, the team could have developed better change rollout, data redundancy, and failover procedures.
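To make the idea of "significant combinations of risk factors" concrete, here is a minimal sketch in Python. It is not how any particular System of Intelligence product works. It simply counts how often past changes failed for each combination of factors and scores a new change by the worst combination it contains. The factor names, the history records, and the `min_samples` cutoff are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical change history: (risk factors present, whether the change failed).
HISTORY = [
    ({"multi_site", "no_rollback_plan"}, True),
    ({"multi_site", "no_rollback_plan"}, True),
    ({"multi_site"}, False),
    ({"no_rollback_plan"}, False),
    ({"off_hours"}, False),
    ({"multi_site", "off_hours"}, False),
    ({"multi_site", "no_rollback_plan", "off_hours"}, True),
    ({"no_rollback_plan", "off_hours"}, False),
]


def failure_rates(history, max_size=2):
    """Map each factor combination (up to max_size) to (failure rate, sample count)."""
    counts = defaultdict(lambda: [0, 0])  # combo -> [failures, total]
    for factors, failed in history:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(factors), size):
                counts[combo][1] += 1
                if failed:
                    counts[combo][0] += 1
    return {combo: (f / n, n) for combo, (f, n) in counts.items()}


def score_change(factors, rates, max_size=2, min_samples=2):
    """Score a proposed change by the highest failure rate among its combinations.

    Combinations seen fewer than min_samples times are ignored as too sparse.
    """
    best = 0.0
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(factors), size):
            rate, n = rates.get(combo, (0.0, 0))
            if n >= min_samples:
                best = max(best, rate)
    return best


rates = failure_rates(HISTORY)
risky = score_change({"multi_site", "no_rollback_plan"}, rates)
routine = score_change({"off_hours"}, rates)
```

In this toy history, neither factor is alarming on its own, but every change that combined a multi-site rollout with no rollback plan failed, so `risky` comes out far higher than `routine`. A real model would need much more data and proper statistical validation before anyone acted on its scores.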
Human Error
From miscommunication to insufficient training and the misallocation of user privileges, human error is responsible for more than its fair share of IT outages. This holds true across sectors. Even the most reliable brands appear to be susceptible, as was seen back in March when AWS’s S3 product experienced an outage.
The cloud-based storage service normally offers near-continuous availability, backed by a Service Level Agreement (SLA). But when an employee debugging a technical issue inadvertently took down multiple servers, East Coast clients were left out in the cold. Of those impacted by the IT outage, Atlassian, Twilio, and Alexa users experienced the most significant downtime.
Even banking, which is subject to the highest level of regulatory oversight, isn’t immune to human error. Ulster Bank, which is owned by Royal Bank of Scotland (RBS), has drawn the ire of its customers many times over the last few years. Many claim this is the result of poor governance. The most recent incident, caused by a single employee, left thousands of customers unable to see or access the money in their accounts.
Better resource planning driven by a System of Intelligence could have prevented this banking IT catastrophe by ensuring that the right resources with the right skills and training were assigned to the right tasks.
Software Quality
While migrating from legacy systems, TSB Bank experienced an extreme outage scenario, leaving 1.9 million customers without access to their accounts for a total of three days. Some customers were even shown details from other clients’ accounts. According to various contractors, rushed and incomplete testing may have played a significant role. However, at the time of the incident it was hard to pinpoint whether the issue lay in software quality, hardware, or the network.
Had a System of Intelligence been in place, management and IT could have improved the quality of software delivery by identifying issues such as areas of the application with insufficient test coverage or gaps in communication between teams. Additionally, confidence in the software’s quality would have made it easier to isolate other root causes, such as hardware or network issues.
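As a rough illustration of spotting "areas of the application with insufficient test coverage," the sketch below flags modules whose line coverage falls under a threshold and lists the worst first. The module names, the coverage figures, and the 80 percent threshold are hypothetical. In practice, the numbers would come from whatever coverage tool the team already runs.

```python
# Hypothetical coverage figures: module -> (lines covered by tests, total lines).
COVERAGE = {
    "payments": (450, 500),
    "account_migration": (120, 400),
    "login": (300, 500),
    "statements": (380, 400),
}


def under_tested(coverage, threshold=0.8):
    """Return (module, coverage ratio) pairs below the threshold, worst first."""
    flagged = []
    for module, (covered, total) in coverage.items():
        # Treat a module with no measurable lines as uncovered.
        ratio = covered / total if total else 0.0
        if ratio < threshold:
            flagged.append((module, ratio))
    return sorted(flagged, key=lambda item: item[1])


flagged = under_tested(COVERAGE)
flagged_names = [module for module, _ in flagged]
```

With these made-up figures, `flagged_names` lists `account_migration` ahead of `login`, which is the kind of signal that could prompt extra testing before a migration goes live.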
Hardware Asset Failures
There are times when an IT outage is simply the result of a technical issue. Networking failures, faulty technology, software integration incompatibility, bugs, or even a basic system upgrade can all wreak havoc on business operations.
Woolworths supermarkets in Australia got a taste of just how wrong things can go when the register system across all 500 stores shut down for 30 minutes, leaving customers and employees alike bewildered. Root cause analysis showed that the mishap was due to faulty technology, which an upgraded IT system could have prevented.
Can Smart IT Decrease the Number of IT Outages?
Smart IT, such as a System of Intelligence, is a powerful resource that can give companies the insights that they need to prevent outages and significantly reduce costly IT downtime. However, humans still have a big role to play in preventing outages.
It’s also important to keep in mind that no one except the companies involved knows what information was available before the outages happened. Perhaps some organizations simply overlooked the signs that trouble was ahead. Others may have been completely unaware of what lay ahead because they lacked a System of Intelligence to alert them to the coming storm, or to prevent it altogether.
One thing we do know for certain is that many of the most common causes of IT outages are indeed preventable. IT teams just need access to all of the relevant data to make preventing them a reality.