IT Operations Metrics That Can Predict a Looming Major Incident
Your teams have spent years building your organization’s networked systems and connected applications. Now those systems are undergoing a major trial by fire. With so many employees working remotely, your business depends on reliable IT applications and services to keep its operations going, its employees productive, and its customers happy.
A single major incident can jeopardize business operations when we depend on our technology more than ever. But with IT analytics, organizations can recognize possible threats before they have a chance to cause major service disruptions.
Using machine learning (ML), an IT analytics solution can identify signals that act as warning signs of an impending major incident. These incident prediction signals allow IT organizations to proactively monitor for threats. When a combination of signals crosses thresholds that indicate a high probability of an impending major incident, IT leaders can respond with the appropriate risk mitigation action.
ML algorithms automate the process of discovering which signals have the best predictive properties.
While the signals that predict a major incident may not directly demonstrate a cause-and-effect relationship, they do tend to appear with high frequency when an incident is about to occur. This risk prediction model is similar to the one the National Weather Service uses to predict possible tornadoes. Just as with tornado warnings, the goal is to identify a possible threat early. Then, those most likely to be affected can be warned, triggering a response that can make a positive difference.
The correlative factors the ML-driven risk prediction engine identifies tend to be neither random nor surprising. Several IT incident prediction factors pop up time and time again.
In our experience, some common signals include:
- Problem backlog / Average problem age
- Day of week / Day of month
- Planned change activity
- Technology health
- Days between major incidents
- Days since a major incident
- Minor incident growth rate
Problem Backlog / Average Problem Age
There can be many reasons to delay action on current problems, but the ignored problems of today become the recurring incidents of tomorrow. An inability to clear problem backlogs can be indicative of broader issues that make a major incident more likely.
Without a known pattern or benchmark to guide them, IT may not know when to declare problem volume a cause for major concern. There may be confusion or uncertainty. How many unsolved problems are too many?
ML can answer that question with reference to your unique IT environment. Risk factor identification models can discover the problem volume or average problem age thresholds that indicate elevated major incident risk. The model may also be able to identify incident topic clusters or common root causes.
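As a rough illustration of the two metrics named above, the following Python sketch computes backlog size and average problem age from a list of open dates. The sample dates and the `AGE_THRESHOLD_DAYS` value are hypothetical placeholders; in practice the threshold would be learned from your own incident history rather than hard-coded.

```python
from datetime import date

def backlog_metrics(open_dates, today):
    """Return (backlog size, average age in days) for open problem records."""
    ages = [(today - opened).days for opened in open_dates]
    if not ages:
        return 0, 0.0
    return len(ages), sum(ages) / len(ages)

# Hypothetical open problem records, represented only by the date each was opened.
open_problems = [date(2024, 1, 2), date(2024, 1, 20), date(2024, 2, 7)]
size, avg_age = backlog_metrics(open_problems, today=date(2024, 2, 19))

# Placeholder threshold; a trained model would supply this value.
AGE_THRESHOLD_DAYS = 25
backlog_at_risk = avg_age > AGE_THRESHOLD_DAYS
```

Tracking both values over time, rather than as a single snapshot, makes it easier to see whether the backlog is growing or simply aging.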
Day of Week / Day of Month
Period-associated risk can arise for a number of reasons. Problems can be related to scheduled business activities that occur during certain times each month, possibly triggering a major incident. Incident risk can also be related to seasonal volume, such as an online banking system getting higher usage after scheduled paydays.
An ML algorithm can review incident frequency by day of the week, day of the month, and time of year to test if there’s any correlation among historical data. If there is a correlation, organizations can respond by committing more resources for incident management or problem resolution in advance of periods that tend to generate the most problems.
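A first pass at this kind of review can be as simple as counting historical incidents per weekday. The sketch below uses made-up incident dates and a hypothetical helper name, `incidents_by_weekday`. A raw count is only a starting point: it does not by itself show that the pattern is statistically significant.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def incidents_by_weekday(incident_dates):
    """Count incidents per weekday, always returning all seven days."""
    counts = Counter(WEEKDAYS[d.weekday()] for d in incident_dates)
    return {name: counts.get(name, 0) for name in WEEKDAYS}

# Hypothetical major incident dates.
history = [
    date(2024, 3, 1),
    date(2024, 3, 8),
    date(2024, 3, 12),
    date(2024, 3, 15),
    date(2024, 3, 25),
]
by_day = incidents_by_weekday(history)
busiest_day = max(by_day, key=by_day.get)
```

The same grouping can be repeated by day of the month or by month of the year by swapping `d.weekday()` for `d.day` or `d.month`.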
Planned Change Activity
When deploying changes to platforms, both the specific type of change and the overall volume of change can have a large impact on incident volume. One Numerify customer reported, “Change activity caused 67% of our problems.”
Gartner estimates that more than 80% of outages are change-related, and high-risk upcoming changes can increase the chances of a major incident.
AI models can be instrumental in predicting which changes are high-risk. When deciding when to deploy high-risk changes, the current level of major incident risk should be an important factor. Potential risk remediation actions could include deferring high-risk changes until major incident risk subsides.
Technology Health
Technology infrastructure usually doesn’t fail without warning, so it is important to listen for primary health signals. Some of the most important to monitor include:
- Incident volume: Observability systems automatically create incident tickets in your IT Service Management (ITSM) system when certain thresholds are breached or certain conditions are met. A spike in these tickets may point to an upcoming major incident.
- Service degradation / Service availability: Degrading service quality often points to an upcoming major incident.
- Critical services alerts: These alerts may be generated based on sustained threshold breaches in metrics like CPU usage, memory, or disk I/O. Events associated with these alerts can escalate to a major incident if left unchecked.
Technology health signals act as one component of incident prediction that can be considered in the context of other signals – such as a rise in user-reported incidents – to determine the best course of action.
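To make the idea of a sustained threshold breach concrete, here is a minimal Python sketch. The function name, the 90% CPU threshold, and the three-reading window are all illustrative assumptions, not settings from any particular monitoring product.

```python
def sustained_breach(samples, threshold, min_consecutive):
    """Return True if samples stay above threshold for at least
    min_consecutive readings in a row."""
    run = 0
    for value in samples:
        # Extend the current run on a breach, otherwise reset it.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False

# Hypothetical CPU utilization readings, in percent.
cpu_percent = [62, 91, 93, 70, 92, 94, 96, 95]
alert = sustained_breach(cpu_percent, threshold=90, min_consecutive=3)
```

Requiring several consecutive breaches filters out brief spikes, such as the two-reading burst early in the sample data, while still catching the longer run at the end.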
Average Days Between Major Incidents / Days Since Last Major Incident
One method to build a risk prediction model is to consider major incidents as inevitable. From the perspective of historical incident patterns, the model can make predictions as to when the next major incident is likely to occur relative to the last one. Then, as the threshold time period approaches, IT operations teams can be especially vigilant for other major incident warning factors.
Although incident occurrence dates can seem random, models incorporating metrics like “average days between major incidents” and “cumulative days since the last major incident” can have a high degree of predictive success. As organizations take proactive measures to lengthen either metric, the model can be updated to reflect the current incident rate pattern.
Metrics relating to the days between incidents can also have a relationship to time-period-based metrics like day of the week or day of the month, allowing the major incident prediction engine to reveal complex relationships between the two risk factor types.
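Both metrics are straightforward to derive from a list of past major incident dates. The sketch below is one simple way to compute them, using invented dates. The final ratio is an illustrative convenience rather than an established metric: values approaching or exceeding 1.0 mean the usual gap between incidents has nearly elapsed.

```python
from datetime import date

def days_between_stats(incident_dates, today):
    """Return (average days between major incidents, days since the last one)."""
    if len(incident_dates) < 2:
        raise ValueError("need at least two incidents to compute a gap")
    ordered = sorted(incident_dates)
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    avg_gap = sum(gaps) / len(gaps)
    days_since = (today - ordered[-1]).days
    return avg_gap, days_since

# Hypothetical major incident history.
major_incidents = [date(2024, 1, 10), date(2024, 2, 9), date(2024, 3, 20)]
avg_gap, days_since = days_between_stats(major_incidents, today=date(2024, 4, 19))

# Share of the typical gap that has already elapsed (illustrative only).
elapsed_ratio = days_since / avg_gap
```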
Minor Incident Growth Rate
It is critical to monitor the background level of minor incidents and problem clusters. Emerging incidents can sneak up and cause cascading problems that trigger a domino effect and a corresponding major incident.
Additionally, trending incident categories often reveal areas of concern in need of proactive action. Without this level of diligence, IT organizations are essentially waiting for small incidents to become big over time.
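One simple way to watch the background level is to track period-over-period growth in minor incident counts. The following sketch assumes weekly counts and a hypothetical 20% growth threshold; both the data and the threshold are placeholders.

```python
def period_growth_rates(counts):
    """Return the fractional change between each pair of consecutive periods.
    A period following a zero count yields None, since growth is undefined."""
    rates = []
    for previous, current in zip(counts, counts[1:]):
        rates.append(None if previous == 0 else (current - previous) / previous)
    return rates

# Hypothetical weekly minor incident counts.
weekly_minor_incidents = [40, 44, 55, 60]
rates = period_growth_rates(weekly_minor_incidents)

# Placeholder threshold: flag any week that grew by more than 20%.
GROWTH_THRESHOLD = 0.20
flagged = [rate is not None and rate > GROWTH_THRESHOLD for rate in rates]
```

Running the same calculation per incident category can help surface the trending categories mentioned above.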
Incident Prediction Metrics Empower Proactive Action
Mapping correlations between risk prediction metrics and major incidents not only helps teams prepare but can also allow for the creation of a highly accurate causation model.
Even when correlating factors do not have a strict causative effect, they indicate the presence of risk in areas that should prompt IT leaders to sit up and take notice.
Expressive KPIs, like scoring models, can enable streamlined IT operations decision-making in response to measured major incident risk.
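As a sketch of what a scoring model might look like, the example below combines several normalized signals into a single 0 to 100 score using a weighted average. The signal names, weights, input values, and the cutoff of 60 are all hypothetical; a real model would learn its weights from historical data rather than use hand-picked numbers.

```python
def risk_score(signals, weights):
    """Weighted average of signals (each scaled 0..1), expressed as 0..100."""
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * signals[name] for name in weights)
    return round(100 * weighted_sum / total_weight, 1)

# Hypothetical weights reflecting how predictive each signal is assumed to be.
weights = {
    "problem_backlog": 2,
    "change_activity": 3,
    "tech_health": 3,
    "days_since_last": 1,
    "minor_growth": 1,
}

# Hypothetical current readings, each normalized to the range 0..1.
signals = {
    "problem_backlog": 0.5,
    "change_activity": 1.0,
    "tech_health": 0.5,
    "days_since_last": 0.0,
    "minor_growth": 1.0,
}

score = risk_score(signals, weights)
# Placeholder cutoff for deciding when to act.
band = "high" if score >= 60 else "normal"
```

A single score with a small number of bands gives decision-makers one number to act on, while the underlying signals remain available for anyone who needs to see what is driving it.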
Over time, IT leaders can use a major incident prediction engine to spend less time grappling with the costs of major incidents and more time committing resources to innovation, lean operations, and agility. These changes allow IT organizations to shift from a reactive to a proactive posture while aligning them more closely with overarching business goals. This role has never been more important than it is now, when a single business service disruption could trigger major losses in reputation and revenue.