How to Make Decisions to Mitigate Change Risk (Part 4)
Once risks in the production environment are identified, an organization must make appropriate response decisions without unnecessary delays. Making response decisions in an agile way requires two things: information, and a framework for interpreting that information in a way that prompts decisive action.
Generally speaking, IT operations has four options when responding to risk:
- Acceptance of the issues or vulnerabilities a specific change might have
- Mitigation of the risk, minimizing its ability to affect critical services
- Elimination of the risk by solving issues preemptively, modifying the planned change, quarantining the issue so that it cannot affect vital services, etc.
- Avoidance, where IT leaders or the Change Advisory Board (CAB) decides that a specific change should be delayed or that all changes should be frozen until the risk can be reduced to acceptable levels
Data obtained from expressive, high-order change risk Key Performance Indicators (KPIs) and presented visually allows IT operations to rapidly assess change risks and respond according to policy. Data visualization in IT is crucial because it lets teams understand the possible scope, impact, and likelihood of a risk at the same time.
Certain risks may reflect relationships between critical IT performance metrics; having a visual dashboard is, therefore, essential for obtaining both a diagnosis of current possible issues and a global picture of the health of the overall production environment.
When metrics alert IT leaders to possible risks, a dashboard’s drill-down capability allows them to pinpoint the root cause or examine the risk on a micro level. This facilitates more targeted and effective actions that strike as close to the root cause as possible.
Depending on the level of sophistication of the IT analytics product being used, predictive analytics and Machine Learning (ML) can aid agile decision-making by rapidly prescribing risk response actions based on past protocols.
Using metrics to plot a production risk on a risk remediation continuum
When assessing a risk, plotting it on a visual framework can help establish its relative threat to an organization’s services. One such visual framework, recommended by analyst firm Enterprise Management Associates (EMA), is based on two axes:
- X axis: The risk’s probability of causing an issue or failure
- Y axis: The risk’s projected impact on critical business services
Organizations can use this framework to rapidly determine the best course of action when faced with a production environment risk.
- Low likelihood, low impact risks can be accepted and largely ignored
- High likelihood, high impact risks must be avoided, possibly by adopting strategies to delay or modify the proposed changes
- Risks that fall at the high end of one axis but the low end of the other can be handled in ways that eliminate them
- All other risks may be mitigated or have their impact reduced through various methods
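The four quadrant rules above can be sketched as a small lookup. This is a minimal illustration, not EMA's published method: the 0-to-1 scale, the `LOW` and `HIGH` cut-offs, and the names `Response` and `recommend_response` are all assumptions made for the example.

```python
from enum import Enum


class Response(Enum):
    ACCEPT = "accept"
    MITIGATE = "mitigate"
    ELIMINATE = "eliminate"
    AVOID = "avoid"


# Hypothetical cut-offs on a 0.0-1.0 scale; the framework itself does not define them.
LOW = 0.25
HIGH = 0.75


def recommend_response(probability: float, impact: float) -> Response:
    """Map a risk's position on the two axes to one of the four responses."""
    for value in (probability, impact):
        if not 0.0 <= value <= 1.0:
            raise ValueError("probability and impact must be between 0 and 1")

    prob_low, prob_high = probability <= LOW, probability >= HIGH
    impact_low, impact_high = impact <= LOW, impact >= HIGH

    if prob_low and impact_low:
        return Response.ACCEPT      # low likelihood, low impact
    if prob_high and impact_high:
        return Response.AVOID       # high likelihood, high impact
    if (prob_high and impact_low) or (prob_low and impact_high):
        return Response.ELIMINATE   # high on one axis, low on the other
    return Response.MITIGATE        # everything else
```

In practice the cut-offs would come from an organization's own risk policy, and the result would be a suggestion for the CAB to review rather than a final decision.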
Using expressive KPIs can help to reveal a full picture of a risk, including its potential sources. If, for example, changes from a certain team correlate strongly with performance issues in the production environment, then a high proportion of contributions by that team to a change deployment can signal that extra caution and analysis are needed.
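One way to picture the team example is to weight each team's historical incident rate by its share of a deployment. The sketch below assumes a simple history of (team, caused_incident) records; the function names and the weighting scheme are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def incident_rate_by_team(history):
    """history: iterable of (team, caused_incident) pairs from past changes."""
    totals = defaultdict(int)
    incidents = defaultdict(int)
    for team, caused_incident in history:
        totals[team] += 1
        if caused_incident:
            incidents[team] += 1
    return {team: incidents[team] / totals[team] for team in totals}


def deployment_risk(contributions, rates, default_rate=0.0):
    """Weight each team's historical incident rate by its share of the deployment.

    contributions: dict of team -> number of changes in this deployment.
    rates: dict of team -> historical incident rate (0.0-1.0).
    """
    total = sum(contributions.values())
    if total == 0:
        return 0.0
    return sum(
        count / total * rates.get(team, default_rate)
        for team, count in contributions.items()
    )
```

A deployment dominated by a team with a high historical rate scores higher than one dominated by a team with a clean record, which is the signal for extra review described above. Correlation alone does not show that the team is the cause, so the number is a prompt for investigation rather than a verdict.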
As EMA wrote: “Being able to proactively identify production change risk enables an enterprise to make calculated decisions regarding the degree of production change risk they are willing to accept, as well as how they should consider the mix of people, processes, and technology to eliminate or reduce their production risk exposure.”
How drilldowns can enable specific IT risk response actions
High-order KPIs can tell the global story of a risk, but many risks must be dealt with on a micro level to address or mitigate their potential impact.
For example, parts of an application that must access specific cloud-hosted services may be correlated with service delays or “request not found” errors. If these issues are related to a specific type of upload action, it can indicate the need for a change in cloud server configuration, a different method of access, or a change to a more reliable cloud host.
Some change risks can be associated with a specific team or even team member, allowing for investigation and recommendations to improve their practices. One IT analytics client created a “change risk credit score” to apply to specific change managers. The custom metric measured their current docket of unclosed issues, their rate of introducing changes that caused issues, and so on. Highlighting specific problematic change managers holds teams and individuals accountable, which allows them to not only address shortcomings but measure individual performance improvements objectively.
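The client's actual formula is not described here, so the following is only a sketch of what a "change risk credit score" might look like. The 0-to-100 scale, the weights, the backlog cap, and the `ChangeManagerStats` fields are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ChangeManagerStats:
    open_issues: int          # current docket of unclosed issues
    changes_deployed: int     # changes introduced in the review period
    changes_with_issues: int  # of those, how many caused issues


# Hypothetical weights and cap; the real metric's parameters are unknown.
MAX_SCORE = 100.0
FAILURE_WEIGHT = 60.0
BACKLOG_WEIGHT = 40.0
BACKLOG_CAP = 20  # open issues at or above this count incur the full backlog penalty


def change_risk_credit_score(stats: ChangeManagerStats) -> float:
    """Return a 0-100 score where a higher score means lower risk."""
    if stats.changes_deployed > 0:
        failure_rate = stats.changes_with_issues / stats.changes_deployed
    else:
        failure_rate = 0.0
    backlog_ratio = min(stats.open_issues, BACKLOG_CAP) / BACKLOG_CAP
    penalty = FAILURE_WEIGHT * failure_rate + BACKLOG_WEIGHT * backlog_ratio
    # Clamp so inconsistent input cannot push the score below zero.
    return round(max(0.0, MAX_SCORE - penalty), 1)
```

Because the score is computed the same way every period, it can be tracked over time, which supports the objective measurement of individual improvement mentioned above.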
In other words, assessing risks at the macro level can inform risk management strategies at the behavioral level of the organization. It can also reveal the need to amend processes that have been taken for granted.
Opening the door for predictive analytics and prescriptive AI
High-order metrics can effectively indicate potential risks, yet they may not be able to keep a watchful eye on all true risk factors. By adding artificial intelligence (AI) and ML to an analytics solution, IT leaders can ensure that all potential predictive parameters are known and accounted for in dashboards.
Analysis of past change failures, for instance, can allow an AI-backed analytics solution to automatically identify factors that tend to correlate with risk. Certain times of the month, for example, can lead to a spike in incidents, yet, without ML, this correlating factor may be overlooked. Machine learning can improve the selection of metrics included in the risk model, allowing human actors to promote the ML models that have the most accurate predictive capabilities.
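The time-of-month example can be illustrated without any ML at all. The sketch below is a plain frequency analysis, not a machine learning model: it compares the failure rate in each week of the month against the overall rate. The input shape and the function name are assumptions for the example; an ML-backed product would search many such factors automatically.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Tuple


def failure_lift_by_week_of_month(
    changes: Iterable[Tuple[date, bool]],
) -> Dict[int, float]:
    """changes: (deploy_date, failed) pairs from past change records.

    Returns week-of-month (1-5) -> that week's failure rate divided by the
    overall failure rate. Values well above 1.0 mark riskier weeks.
    """
    changes = list(changes)
    total_failures = sum(1 for _, failed in changes if failed)
    if not changes or total_failures == 0:
        return {}
    baseline = total_failures / len(changes)

    totals, failures = Counter(), Counter()
    for deploy_date, failed in changes:
        week = (deploy_date.day - 1) // 7 + 1  # days 1-7 -> 1, ..., 29-31 -> 5
        totals[week] += 1
        if failed:
            failures[week] += 1
    return {
        week: (failures[week] / totals[week]) / baseline
        for week in sorted(totals)
    }
```

A week whose lift stays well above 1.0 across a large enough sample is a candidate factor to add to the risk model. With few records per week the ratio is noisy, so it should be checked against more data before anyone acts on it.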
All of this information gives change leaders and IT teams an in-depth, 360-degree view of operational risks along with a scoring of those risks to indicate what response actions may be most appropriate. Insights revealed could lead to proactive change risk mitigation or perhaps even new policies to prevent risk factors from cropping up in the first place.
As AI models become more advanced and integrated within the decision-making process, they can even prescribe certain response actions in light of known risk factors. Some IT operations are able to automate certain change risk mitigation practices, such as flagging certain changes to send back to change managers or automatically reviewing code before deployment to amend common sources of problems.
Adopting AI in this way could streamline IT operations by reducing the effort it takes to address 70% or so of change-related incidents, handling them automatically before they can have an impact. IT risk response teams and CABs can then free up resources to focus on the remaining 30% of problems that require their full attention and brain power.
For now, human intervention and supervision remain necessary to ensure the drivers of IT production risk are not oversimplified. Even so, actionable visualizations and AI-backed risk modeling can empower processes that accelerate risk identification and mitigation, allowing organizations to keep up in an increasingly competitive agile world.
Learn more about how IT analytics, AI, and ML can combine into a powerful engine that reduces the weight of risk mitigation decisions and adds agility to IT ops in a white paper by EMA, commissioned by Numerify: “Change Risk Mitigation Best Practices – How To Significantly Reduce Change Management Risk in Production.”