How to Test the Effectiveness of a Machine Learning-Driven Risk-Prediction Engine


At a time when operational efficiency and business continuity are top priorities, reducing risk from an IT standpoint has never been more important. Using Machine Learning (ML) algorithms, IT leaders can meet their performance goals by predicting the level of risk posed to an organization’s key business applications and services.

For instance, an ML-driven predictive analytics engine can determine whether a certain change creates the risk of a major incident, or whether the presence of certain risk factors indicates a possible major incident on the horizon.

These risk-prediction models have generated significant value for IT organizations, but they also raise an important question: how do you know if they’re working? When IT leaders have to weigh each investment carefully, knowing whether an ML-driven risk-prediction engine is effective becomes critical.

Evaluating the effectiveness of an ML prediction engine can be quite tricky. For one, if the risk prediction engine works, then the organization will be able to take proactive actions and avert negative consequences. Therefore, whatever event or negative outcome the risk prediction engine portends will, ideally, not occur. 

But how does the organization know that the negative outcome was going to occur in the first place? In other words: How can you tell if the risk prediction engine was right versus the possibility that there was no real risk to begin with?

To answer this question, organizations can turn to two main methods of statistical analysis:

  1. Testing the risk prediction models’ ability to predict incidents using known data
  2. Benchmarking the rate of risk-associated adverse outcomes before and after the ML-driven engine was brought online, and then comparing the two

Both activities can be considered analytics best practices. They not only allow IT organizations to determine the effectiveness of the ML-driven engine but also help them calculate its ultimate return on investment (ROI), confirming that the risk-prediction engine is delivering benefit and helping to minimize business disruptions.

Evaluating Machine Learning-Driven Risk Prediction Engines Using Historical Data

Historical data is your most powerful ally in evaluating the potential performance of an ML-driven predictive engine. It can be used to test the model’s predictive capabilities and to retrain the model so it maintains optimal performance over time.

There are a number of methods available to test ML models using known historical data. A few of the most common ones are:

Test the ML Model’s Ability to Predict Recorded Past Incidents (Bootstrapping)

Bootstrapping is a common way to validate a supervised ML model against historical data. The model is trained on resampled data sets with known metrics and known outcomes, and then scored on held-out records whose outcomes are hidden from it. This tests the model’s ability to correctly predict an outcome without being told whether it actually occurred.

For instance, the ML model is given historical IT data containing one or more metrics that act as risk indicators for a major incident. Without knowing whether an incident actually occurred, the model analyzes those factors and makes a risk prediction. If an incident did, in fact, occur during or immediately after the chosen assessment period, a well-performing model should have flagged the elevated risk.

A confusion matrix can be used to compare the volume of false positives and false negatives with the number of true positives and true negatives. This comparison tests the model’s ability to make accurate predictions and provides information to calibrate the model and improve its performance over time.
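
Below is a minimal sketch of this kind of holdout test. It uses synthetic stand-in data and a generic scikit-learn classifier, and a simple train/test split stands in for full bootstrap resampling; the feature and label definitions are illustrative placeholders rather than any particular product’s schema.

```python
# Minimal sketch: hold out a slice of historical data, score the model on it,
# and summarize the results with a confusion matrix. Feature and label names
# are illustrative placeholders, not a specific product schema.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

# X: risk-factor metrics per change/assessment window; y: 1 if a major
# incident occurred during or immediately after that window, else 0.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 8))                       # stand-in for real historical metrics
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 1).astype(int)

# Train on records with known outcomes, then predict on records the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are actual outcomes, columns are predicted outcomes:
# [[true negatives, false positives], [false negatives, true positives]]
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["no incident", "major incident"]))
```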

Compare the Model’s Performance to Random Results

Many tests of statistical significance evaluate whether a model’s outputs behave differently from a random pattern.

If, for example, someone came up with a function that answers “yes” or “no” at random to the question, “Will there be a major incident soon?” then that random model should have a much lower success rate than an actual ML-driven risk prediction engine.

IT organizations can use this test to determine whether the ML models they use are, in fact, performing cogent analyses, or whether their results are no better than random output (or maybe even worse).

Statistical analysis tools such as lift charts and decile tables can aid in this exploration, especially when a predictive analytics engine is supposed to generate a particular binary result.
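
To make this comparison concrete, the sketch below continues with the model and held-out data from the previous example. It scores the model against purely random guesses and builds a simple decile (lift) table with pandas; the variable names carry over from that earlier illustrative example.

```python
# Sketch: compare the trained model against purely random "yes/no" guessing and
# build a simple decile (lift) table. Continues from the previous example
# (model, X_test, y_test are reused).
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

model_scores = model.predict_proba(X_test)[:, 1]
random_scores = np.random.default_rng(0).uniform(size=len(y_test))  # random baseline

print("model AUC:  ", roc_auc_score(y_test, model_scores))    # should sit well above 0.5
print("random AUC: ", roc_auc_score(y_test, random_scores))   # hovers around 0.5

# Decile (lift) table: rank records by predicted risk, split them into 10 buckets,
# and compare each bucket's incident rate to the overall rate. A useful model
# concentrates incidents in the top deciles; a random one does not.
df = pd.DataFrame({"score": model_scores, "actual": np.asarray(y_test)})
df["decile"] = pd.qcut(df["score"].rank(method="first"), 10, labels=False) + 1
lift = df.groupby("decile")["actual"].mean() / df["actual"].mean()
print(lift.sort_index(ascending=False))
```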

Test the Model’s Ability to Detect Relationships Using Target Shuffling

“Correlation does not imply causation.” It’s a refrain you’ll hear in every introductory statistics course, and an important reminder that some correlations are just happy coincidences rather than evidence of a cause-and-effect relationship. Just as the outcome of the U.S. presidential race does not depend on whether the Washington Redskins win their last home game, some of the correlations an ML model uncovers may not actually provide predictive value.

One method of ensuring that identified correlations are meaningful and not just random is referred to as “target shuffling.” To perform this analysis, you take a set of known historical data, shuffle the outcome (target) values so they no longer line up with the predictors, and then see what correlative relationships the model can still find.

As an example, a major incident prediction engine may have determined that a majority of incidents tend to happen on a Monday. During a target shuffling exercise, the recorded incident outcomes are shuffled so they no longer line up with the true day of the week, effectively making any day-of-week pattern random.

When this happens, the ML model should ideally fail to find a statistically significant correlation, because there is no longer a true pattern. If correlations of similar strength and statistical significance keep turning up in the shuffled data, then the original relationship is probably noise, and the model is finding signal where there isn’t any.
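
The sketch below illustrates one way to run a target shuffling check, again reusing the illustrative model and data from the earlier examples; scikit-learn’s permutation_test_score utility automates essentially the same idea.

```python
# Sketch of target shuffling: repeatedly shuffle the outcome labels so any real
# relationship is destroyed, refit the model, and record the "skill" it appears
# to have on the shuffled data. Reuses X_train, y_train, X_test, y_test, model.
import numpy as np
from sklearn.base import clone
from sklearn.metrics import roc_auc_score

real_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

rng = np.random.default_rng(0)
shuffled_aucs = []
for _ in range(50):
    y_shuffled = rng.permutation(y_train)              # break the feature-outcome link
    shuffled_model = clone(model).fit(X_train, y_shuffled)
    shuffled_aucs.append(
        roc_auc_score(y_test, shuffled_model.predict_proba(X_test)[:, 1])
    )

# If the real AUC is not clearly above anything the shuffled targets can produce,
# the "pattern" the model found may just be noise.
print("real AUC:", real_auc)
print("shuffled AUC range:", min(shuffled_aucs), "to", max(shuffled_aucs))
```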

Automated Model Testing

All of the above analyses can be performed automatically in order to test, grade, and retrain models, or to promote the models with the best repeatable performance.
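
As a rough illustration, the sketch below evaluates several candidate models on the same held-out data and only “promotes” one that clears minimum thresholds. The candidate list, metrics, and threshold values are hypothetical placeholders rather than recommendations, and the training data carries over from the earlier examples.

```python
# Hypothetical sketch of an automated evaluation gate: score candidate models on
# shared held-out data and promote the best one only if it clears minimum
# thresholds. Thresholds and candidates are illustrative, not prescriptive.
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, recall_score

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

MIN_AUC, MIN_RECALL = 0.70, 0.60        # example promotion thresholds

results = {}
for name, candidate in candidates.items():
    candidate.fit(X_train, y_train)
    scores = candidate.predict_proba(X_test)[:, 1]
    preds = candidate.predict(X_test)
    results[name] = {
        "auc": roc_auc_score(y_test, scores),
        "recall": recall_score(y_test, preds),   # share of real incidents caught
    }

passing = {n: r for n, r in results.items()
           if r["auc"] >= MIN_AUC and r["recall"] >= MIN_RECALL}
best = max(passing, key=lambda n: passing[n]["auc"]) if passing else None
print(results)
print("promoted model:", best)
```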

Measuring Trends in Adverse Outcome Reduction

The second form of testing an ML-driven predictive analytics engine is to monitor the “before” and “after” occurrence rates for the given adverse outcome the engine is supposed to help prevent.

For instance, if the engine is supposed to predict possible major incidents by uncovering and measuring certain risk factors, then in theory the volume of major incidents will go down over time.

To test this hypothesis, organizations can aggregate all of their incident volume data for a substantial period before the ML-driven risk-prediction engine was implemented. Say, for example, the organization took the data for every recorded major incident from the six-month period preceding the ML model’s implementation.

Then, the organization can obtain a benchmark rate by averaging the volume of major incidents per month and controlling for factors that may have created outliers.

With this benchmark in hand, the organization can repeat the process for a six-month period after the tool was fully operational and integrated into risk management processes. If the tool works, there should be a measurable difference in the volume of risk-associated outcomes before and after it saw full use.
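
A rough sketch of this benchmark comparison is shown below. It assumes incident records have been exported to a CSV with an opened_at timestamp and an incident_id column, and assumes a go-live date; the file name, column names, and date are all hypothetical.

```python
# Sketch: compare average monthly major-incident volume for the six months before
# and the six months after the prediction engine went live. File, column names,
# and go-live date are hypothetical; the t-test is only a rough significance check.
import pandas as pd
from scipy import stats

incidents = pd.read_csv("major_incidents.csv", parse_dates=["opened_at"])
go_live = pd.Timestamp("2023-01-01")

# Count major incidents per calendar month.
monthly = (
    incidents.set_index("opened_at")
    .resample("MS")["incident_id"]
    .count()
    .rename("major_incidents")
)

before = monthly[monthly.index < go_live].tail(6)    # six months preceding go-live
after = monthly[monthly.index >= go_live].head(6)    # first six months after go-live

print("benchmark (before):", before.mean(), "incidents/month")
print("after go-live:     ", after.mean(), "incidents/month")

# With only six data points on each side, treat the result as indicative, not proof.
t_stat, p_value = stats.ttest_ind(before, after, equal_var=False)
print("p-value:", round(p_value, 3))
```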

Organizations can also trend the volume of risk-associated adverse events over time, allowing them to detect fluctuating levels in major incidents or other events. Ideally, this trend study will produce a graph showing the adverse events in question going down over time. 

If the trend graph does not reveal this downward pattern, the organization should investigate whether certain risk factors grew or conditions changed in ways that made adverse events more likely. Controlling for outside conditions ensures that your statistical analysis is objective and reflects how your organization’s overall risk exposure has changed over time.
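
Building on the monthly series from the previous sketch, a short example of trending incident volume with a rolling average follows; the plotting step assumes matplotlib is available.

```python
# Sketch: trend monthly major-incident volume with a rolling average so the
# overall direction stands out from month-to-month noise. Builds on the
# `monthly` series from the previous example.
import matplotlib.pyplot as plt

trend = monthly.to_frame()
trend["3_month_rolling_avg"] = monthly.rolling(window=3, min_periods=1).mean()

# Ideally the rolling average slopes downward after go-live; an upward turn is a
# prompt to check whether risk factors or outside conditions have changed.
trend.plot(title="Major incidents per month")
plt.show()
```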

Recognizing the Benefits of the Prediction Engine in a Vacuum

The main purpose of a risk prediction engine is to help IT organizations predict and avert adverse events. However, an ML-driven predictive analytics engine can also deliver secondary benefits thanks to its ability to highlight certain risk factors.

Predictive ML models are capable of automatically assessing risk factors and evaluating their correlation with adverse events in the IT environment. This methodology works similarly to how the National Weather Service (NWS) predicts tornadoes by looking at certain meteorological risk factors.

However, both ML-driven predictive engines and the NWS tornado prediction model will make mistakes from time to time. That does not mean their insights have no value in those moments: they bring awareness to certain risk factors while encouraging an overall culture of preparedness.

If a tornado watch is the reason you remove debris from your yard and yet no tornado touches down, you have still made your environment safer in the event of future bad weather. Similarly, most risk factors identified by an ML-driven risk prediction engine indicate areas of weak IT performance. A high volume of unclosed problems, for example, is bad for organizational performance whether or not it indicates that a specific incident is looming on the horizon.

Risk factors that are not performance-based can still raise awareness. If incidents tend to occur on certain days or under certain user traffic patterns, staying alert to those patterns can encourage proactive actions that benefit IT service and operations performance even when no specific incident is predicted.

Overall, ML-driven risk prediction engines can demonstrate their performance worth through analytics best practices, but they also lead to a culture where incidents are less likely overall thanks to the awareness they bring. This shift adds accountability to IT’s role in helping maintain critical business services while giving them more tools than ever to reduce disruptions when those services are needed most. 

Risk prediction engines play a pivotal role in improving IT organizations’ major incident management and response techniques. Learn more in our recent webinar: “How to improve major incident management using predictive analytics & AI”

Watch the Webinar
