Evaluating the Data You Are Using to Manage Production Risk (Part 2)
Using the right IT analytics techniques reveals which risks are present, how likely they are to result in production environment problems, and the scope of impact they might have. IT leaders and Change Advisory Boards (CABs) can respond to this data proactively, helping them manage these risks and — in many instances — prevent major incidents from disrupting business processes.
To accomplish the goal of using data to manage production risk, IT leaders have to be prepared to evaluate their own data from two standpoints:
- Creating informative metrics and Key Performance Indicators (KPIs) that can be monitored and investigated using in-depth interactive visualizations
- Performing quality assurance (QA) on the data to ensure that it meets the standards required for accuracy, descriptiveness, and consistency across all data sources
Best practices for point #1 above, i.e., establishing change risk metrics in IT operations, have been described in an earlier blog post. As for point #2, this requires three types of ongoing activities:
- Automated review of data for missing fields and outliers
- Data discovery to ensure all data sources needed to provide a complete view are being leveraged
- Auditing the models, metrics, and KPIs being used to ensure they are accurate and effective
Automating Data Quality Assurance for IT Production Risk Assessment
The most important step between gathering IT data from various sources and analyzing it for insights is ensuring that the data supports accurate apples-to-apples comparisons through standardized aggregation.
A canonical data model goes a long way towards establishing this parity; it fixes the fields and metadata expected of a piece of data before that data is allowed to intermingle in a pooled data resource. However, a canonical data model cannot by itself repair the damage inflicted by data that is woefully incomplete. Missing fields skew the data, which in turn skews the insights drawn from it.
IT process changes can ensure more complete data by emphasizing the importance of using all prioritized data fields. Algorithms can “clean up” behind human employees who still miss these fields. AI can automatically identify missing fields, and machine learning tools can complete them with an increasing degree of accuracy over time.
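As a rough illustration of the automated review for missing fields, the sketch below reports how often each prioritized field is left blank across a set of change records. The field names and the sample records are hypothetical; substitute whatever your own data model requires.

```python
from collections import Counter

# Hypothetical field names; replace with the fields your data model prioritizes.
REQUIRED_FIELDS = ("change_id", "category", "assignee", "risk_level")


def is_missing(value):
    """Treat None and blank or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and not value.strip())


def missing_field_rates(records, required=REQUIRED_FIELDS):
    """Return the fraction of records in which each required field is missing."""
    counts = Counter()
    for record in records:
        for field in required:
            if is_missing(record.get(field)):
                counts[field] += 1
    total = len(records)
    return {f: (counts[f] / total if total else 0.0) for f in required}


# Made-up change records for illustration only.
changes = [
    {"change_id": "C1", "category": "network", "assignee": "ana", "risk_level": "low"},
    {"change_id": "C2", "category": "", "assignee": "raj", "risk_level": None},
    {"change_id": "C3", "category": "database", "assignee": "  "},
    {"change_id": "C4", "category": "network", "assignee": "lee", "risk_level": "high"},
]

rates = missing_field_rates(changes)
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} missing")
```

A report like this only identifies the gaps. Completing them, whether by process change or by a trained model, is a separate step.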
Algorithms can also gradually rank the importance of certain fields as they piece together relationships between specific data values and the corresponding risk of change-related incidents. These models can not only predict possible change-related incidents but also highlight which risk factors are the strongest signals that conditions favor an incident.
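One very simple way to picture this kind of ranking, offered as a stand-in for a real machine learning feature-importance method, is to compare incident rates across the values of each field and rank fields by how widely those rates differ. The field names, the `caused_incident` flag, and the records below are all hypothetical.

```python
from collections import defaultdict


def incident_rate_by_value(records, field, outcome="caused_incident"):
    """Incident rate for each distinct value of a field."""
    totals = defaultdict(int)
    incidents = defaultdict(int)
    for record in records:
        value = record[field]
        totals[value] += 1
        if record[outcome]:
            incidents[value] += 1
    return {value: incidents[value] / totals[value] for value in totals}


def rank_fields(records, fields, outcome="caused_incident"):
    """Rank fields by the spread between their highest and lowest incident rates."""
    spreads = {}
    for field in fields:
        rates = incident_rate_by_value(records, field, outcome)
        spreads[field] = max(rates.values()) - min(rates.values())
    return sorted(spreads.items(), key=lambda item: item[1], reverse=True)


# Made-up history of changes and whether each one led to an incident.
history = [
    {"after_hours": True, "category": "db", "caused_incident": True},
    {"after_hours": True, "category": "net", "caused_incident": True},
    {"after_hours": True, "category": "db", "caused_incident": False},
    {"after_hours": False, "category": "net", "caused_incident": False},
    {"after_hours": False, "category": "db", "caused_incident": False},
    {"after_hours": False, "category": "net", "caused_incident": False},
]

ranking = rank_fields(history, ["category", "after_hours"])
for field, spread in ranking:
    print(f"{field}: spread in incident rate = {spread:.2f}")
```

In this toy data, `after_hours` separates risky changes from safe ones while `category` does not, so it ranks first. With real data, small sample sizes per value would make a spread like this unreliable, which is one reason a proper model is preferable.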
Humans must also play a part in the critical decisions about how to set policies that ensure better data quality. They may need to determine ways to optimize fields, such as revising or completely removing fields that are frequently left blank or used improperly (although algorithms can assist in this). They must also decide how to handle data outliers: whether to include them in the analysis model or categorically exclude them to prevent excess noise.
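To make the outlier decision concrete, the sketch below flags outliers with the common interquartile range (IQR) rule and returns them separately rather than silently dropping them, so a person can still decide whether they belong in the analysis. The IQR rule, the 1.5 multiplier, and the sample durations are assumptions chosen for illustration, not a method prescribed here.

```python
from statistics import quantiles


def split_outliers(values, k=1.5):
    """Split values into (kept, flagged) using the IQR rule with multiplier k."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    flagged = [v for v in values if v < low or v > high]
    return kept, flagged


# Made-up change implementation durations, in minutes.
durations = [30, 35, 40, 42, 45, 48, 50, 400]

kept, flagged = split_outliers(durations)
print("kept:", kept)
print("flagged for review:", flagged)
```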
Retracing Your Steps to Account for All Necessary Data Sources
Some IT risk management processes neglect to include source data from every application in use. These gaps may exist for the sake of convenience or because integrating a source is genuinely difficult, but omitting key applications as data sources can easily lead to an inaccurate view of IT operations risks.
A helpful practice can be for IT data evaluation teams to audit the current data in use and ask, “Where does this metric come from?” Metrics produced from multiple data sources tend to be more robust, reliable, and accurate thanks to the complete 360° view they provide. Metrics produced from a single source may be subject to biases, inaccuracies, or blindness to relevant secondary factors.
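The "Where does this metric come from?" audit can be sketched as a simple lookup that lists metrics fed by a single source and sources that feed no metric at all. The metric names, source names, and the mapping itself are hypothetical placeholders for whatever inventory a team actually maintains.

```python
# Hypothetical mapping of each metric to the data sources that feed it.
METRIC_SOURCES = {
    "change_failure_rate": {"itsm_tickets", "deploy_logs", "monitoring_alerts"},
    "mean_time_to_restore": {"itsm_tickets", "monitoring_alerts"},
    "emergency_change_ratio": {"itsm_tickets"},
}

# Hypothetical list of sources the organization could draw on.
AVAILABLE_SOURCES = ["itsm_tickets", "deploy_logs", "monitoring_alerts", "cmdb"]


def single_source_metrics(metric_sources):
    """Metrics that rely on one source (or none) and may be biased or incomplete."""
    return sorted(name for name, sources in metric_sources.items() if len(sources) <= 1)


def unused_sources(metric_sources, available_sources):
    """Available sources that no metric currently draws on."""
    used = set().union(*metric_sources.values())
    return sorted(set(available_sources) - used)


single = single_source_metrics(METRIC_SOURCES)
unused = unused_sources(METRIC_SOURCES, AVAILABLE_SOURCES)
print("single-source metrics:", single)
print("unused sources:", unused)
```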
IT leaders should aim to have daily operational activities and logs serve as the primary source for data analysis. While the task of integrating all relevant applications, ITSM systems, and monitoring services can be difficult, pre-built adapters can accelerate the process through automated data source integration.
Evaluating Machine Learning and Analytics Models
The models IT relies upon to predict risk and avert major incidents should not be taken for granted. They must be constantly tested for their predictive capabilities and their track record of accuracy.
IT leaders can automate analytics model evaluation by establishing metrics that measure their accuracy and effectiveness. They can ask questions like:
- How much has this model reduced the types of problems or incidents it is supposed to predict?
- In training data sets, and ideally in historical data the model has not seen, how often has the model accurately predicted an incident related to a particular change?
- Does this model agree with other risk evaluation reports and human anecdotal observations?
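The question of how often a model correctly predicts incidents can be turned into tracked metrics. The sketch below computes precision and recall from a model's past predictions and the incidents that actually followed. The sample data is made up, and precision and recall are one reasonable choice of metric rather than the only one.

```python
def evaluate_predictions(predicted, actual):
    """Compare predicted incident flags against what actually happened.

    precision: of the changes flagged as risky, the share that caused incidents.
    recall: of the changes that caused incidents, the share that were flagged.
    """
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if not p and a)
    flagged = true_pos + false_pos
    occurred = true_pos + false_neg
    return {
        "precision": true_pos / flagged if flagged else 0.0,
        "recall": true_pos / occurred if occurred else 0.0,
    }


# Made-up history: one entry per change, in the same order in both lists.
predicted_incident = [True, True, False, True, False, False]
actual_incident = [True, False, False, True, True, False]

scores = evaluate_predictions(predicted_incident, actual_incident)
print(f"precision={scores['precision']:.2f} recall={scores['recall']:.2f}")
```

Recomputing these figures on each new batch of changes gives a running track record, which makes a decline in the model's accuracy visible early.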
Models that have a high degree of inaccuracy or a low signal-to-noise ratio may need to be adjusted to account for their “house effect,” similar to how certain political poll aggregators adjust for polls that tend to go consistently beyond the expected margin of error.
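A minimal sketch of such an adjustment, assuming the "house effect" is measured as the model's average signed error against observed incident rates: estimate that bias from past periods, then subtract it from new scores. This simple mean-bias correction and the sample figures are illustrative assumptions; poll aggregators and production models typically use more elaborate methods.

```python
from statistics import mean


def house_effect(predicted_risk, observed_rate):
    """Average signed error: positive means the model tends to overstate risk."""
    return mean(p - o for p, o in zip(predicted_risk, observed_rate))


def adjust(score, bias):
    """Subtract the estimated bias and keep the result within [0, 1]."""
    return min(1.0, max(0.0, score - bias))


# Made-up figures: predicted incident probability vs. observed incident rate
# for three past periods.
predicted = [0.30, 0.50, 0.40]
observed = [0.20, 0.40, 0.30]

bias = house_effect(predicted, observed)
corrected = adjust(0.60, bias)
print(f"estimated bias={bias:.2f}, corrected score={corrected:.2f}")
```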
There’s also something to be said about models that have been tested across multiple organizations. While no two production environments are alike, pre-tested solutions can eliminate much of the trial-and-error process to allow for quicker time to value and more operational agility overall.
Evaluate Your Data Sources for Clearer, More Accurate Insights from IT Analytics
The value-producing step of analyzing data to produce metrics and assess IT production risks cannot succeed if data sources are incomplete, or if they are taken at face value despite inaccuracies or quality issues. The right data from the right source with the right level of quality can be used to create advanced behavioral KPIs, such as the “change manager credit score” described in our recent blog “Establish Change Risk Metrics to Drive IT Agility.”
But these insights can only be achieved if data quality isn’t taken for granted. Perform the necessary work, identify issues within IT Service Management (ITSM) processes that can lead to incomplete data fields, and audit model/KPI performance regularly.
All this work ensures you can actually know what you know when trying to proactively assess and mitigate risks in the production environment.
Learn more about reducing risks to IT Operations in the production environment in our White Paper co-authored by EMA: “Change Risk Mitigation Best Practices – How To Significantly Reduce Change Management Risk in Production”