Suppose you drive over a pothole and your bumper comes off. You decide to use duct tape to reattach it. However, each time you go over a pothole or speed bump, the bumper falls off again, and you keep reattaching it with new duct tape.
Until you identify the root cause of the problem and find a sustainable fix (one that saves you time and many rolls of duct tape), you will continue to patch your car whenever you encounter a bump in the road.
Similarly, far too many businesses focus on the symptoms of their IT woes, such as crashes and performance issues, and apply quick fixes such as rebooting the system. Unless the underlying problem is addressed, however, you’ll continue to experience the same difficulties again and again.
In this article, we’ll go over everything you need to know about root cause analysis: what it is, why it’s important, and how you can implement it within your business.
What is Root Cause Analysis (RCA)?
Root cause analysis (RCA) is the process of finding the root causes of IT problems. RCA aims to get to the bottom of an issue so that you can both solve the problem (instead of patching it) and prevent it in the future.
When a system outage occurs, for example, the most urgent task is to get the system up and running again. Afterward, however, you should perform RCA to uncover the underlying issue, which might be technical problems with a server or application, insufficient testing protocols, or staff who aren’t well trained in maintaining the system.
Solutions to the root causes may include changing your technology, redesigning your processes, and retraining or hiring new personnel.
The first step in any root cause analysis is to define the problem or event and establish a timeline. From there, techniques for tracing the problem back to its causes include:
- “5 Whys” technique: repeatedly asking “why” something occurred until you trace the chain back to the root cause.
- “Fishbone” diagrams (aka Ishikawa diagrams): start with the immediate problem at the “head” of the diagram and add the causes extending off the “backbone” in branches and sub-branches.
Techniques such as the “5 Whys” and fishbone diagrams help you distinguish between items that are root causes and those that are simply causal factors. When done right, RCA should help you diagnose and solve the ultimate source of the problem, preventing it from recurring.
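As a rough illustration, a “5 Whys” chain can be recorded as an ordered list of answers, where each answer explains the one before it. In this sketch, the last answer is treated as the candidate root cause and the earlier answers as causal factors. The outage scenario, the `WhyChain` class, and every answer below are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class WhyChain:
    """An ordered "5 Whys" chain: each answer explains the one before it."""

    problem: str
    answers: list[str]

    def causal_factors(self) -> list[str]:
        # Every answer except the last is an intermediate causal factor.
        return self.answers[:-1]

    def root_cause(self) -> str:
        # The last answer is the deepest cause found so far,
        # so it is only a *candidate* root cause.
        return self.answers[-1]


# Hypothetical example data, not taken from a real incident.
chain = WhyChain(
    problem="The order application was down for two hours",
    answers=[
        "The application server ran out of memory",
        "A hotfix introduced a memory leak",
        "The hotfix was not load-tested before release",
        "The release checklist has no load-testing step",
        "Nobody owns the release checklist",
    ],
)

print("Problem:", chain.problem)
for depth, factor in enumerate(chain.causal_factors(), start=1):
    print(f"Why #{depth}: {factor}")
print("Candidate root cause:", chain.root_cause())
```

Whether the final answer really is a root cause depends on how far the questioning went; if asking “why” once more still yields a meaningful answer, the chain is probably not finished.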
The Difference Between “Technical Causes” and “Root Causes”
Technical causes are what many people (erroneously) think of as root causes. They are the first-level causes that most immediately explain the observed problem. In fishbone diagrams, technical causes are the intermediate branches of the diagram between the observed problem and the root causes at the end of the branch.
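One way to picture this distinction is to model a fishbone diagram as a tree. The sketch below assumes that any cause with sub-branches is a technical (intermediate) cause and that any cause with none is a root cause. The `Cause` class, the `classify` function, and the example labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cause:
    """One node in a fishbone (Ishikawa) diagram."""

    label: str
    children: list["Cause"] = field(default_factory=list)


def classify(problem: Cause) -> tuple[list[str], list[str]]:
    """Split the causes under `problem` into (technical causes, root causes).

    Assumption: a cause with sub-branches is an intermediate (technical)
    cause; a cause at the end of a branch is a root cause.
    """
    technical: list[str] = []
    roots: list[str] = []

    def walk(node: Cause) -> None:
        for child in node.children:
            if child.children:
                technical.append(child.label)
            else:
                roots.append(child.label)
            walk(child)

    walk(problem)
    return technical, roots


# Hypothetical diagram for an application outage.
outage = Cause("Application outage", [
    Cause("Hotfix applied just before the outage", [
        Cause("Hotfix applied incorrectly", [
            Cause("Team not trained on hotfix best practices"),
        ]),
    ]),
])

technical, roots = classify(outage)
print("Technical causes:", technical)
print("Root causes:", roots)
```

The classification is only as good as the diagram: a branch that was not explored far enough will end in a technical cause that the sketch mislabels as a root cause.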
When an incident occurs due to an application outage, for example, one technical cause could be a hotfix that was applied just before the outage. However, the root cause of the incident may extend even deeper, requiring you to investigate and ask more questions:
- Was the hotfix applied incorrectly?
- Are hotfixes routinely applied incorrectly by a particular assignment group?
- Have the members of this assignment group been adequately trained on the best practices for applying hotfixes?
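Questions like these can often be checked against change and incident records. The following is a minimal sketch that computes, for each assignment group, the share of its hotfixes that were followed by an incident. The group names and records are invented, and the `(group, caused_incident)` record shape is an assumption rather than a real ITSM schema.

```python
from collections import defaultdict

# Hypothetical hotfix records: (assignment group, did the hotfix cause an incident?)
hotfixes = [
    ("Database Team", False),
    ("Database Team", False),
    ("Web Team", True),
    ("Web Team", True),
    ("Web Team", False),
    ("Network Team", False),
]


def failure_rate_by_group(records):
    """Return {group: fraction of that group's hotfixes that caused an incident}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for group, caused_incident in records:
        totals[group] += 1
        if caused_incident:
            failures[group] += 1
    return {group: failures[group] / totals[group] for group in totals}


rates = failure_rate_by_group(hotfixes)

# List the groups with the highest failure rate first.
for group, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{group}: {rate:.0%} of hotfixes caused an incident")
```

A high rate for one group does not prove a training gap by itself, but it tells you where to ask the next “why.”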
Why is Root Cause Analysis Important?
RCA gives you back control over your business IT by helping you identify the factors that contribute to different problems and events. Instead of being reactive to unexpected situations, putting out different IT “fires” across your business, you can proactively manage and monitor your IT environment.
It’s important not to oversimplify, blame a single factor, or build an incomplete solution that treats the symptoms rather than the underlying problem. RCA is an in-depth, considered process that makes sure you get to the bottom of any IT issues you encounter.
Which IT Analytics Techniques Can Help With RCA?
RCA is especially worthwhile for IT problems that recur frequently. Several analytical techniques can assist you in performing RCA:
- Descriptive and diagnostic analytics (aka Business Intelligence) on data from your IT operations (e.g., ITSM data) can help identify hidden trends and performance issues.
- Machine learning techniques, like clustering, can help to make sense of the massive quantities of data that your ITSM processes generate by grouping together incidents that share similar combinations of field values, across thousands or millions of incidents.
- Natural language processing (NLP) algorithms can index and analyze the text from incident descriptions, uncovering hidden issues beyond the standard structured fields.
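To make the first and last of these ideas concrete, here is a deliberately simplified sketch using only the Python standard library. It counts incidents per category (a basic descriptive statistic) and counts how many incident descriptions mention each term. The term counting is a crude stand-in for real NLP, not an NLP algorithm. The incident records, the field names, and the stopword list are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical incident records; the field names are not a real ITSM schema.
incidents = [
    {"category": "Email", "description": "Outlook crashes after the latest patch"},
    {"category": "Email", "description": "Cannot send email, Outlook crashes on startup"},
    {"category": "Network", "description": "VPN drops every hour"},
    {"category": "Email", "description": "Outlook freezes after patch install"},
]

# A tiny, hand-picked stopword list (an assumption, far from complete).
STOPWORDS = {"a", "an", "the", "after", "on", "every", "cannot"}


def tokenize(text):
    """Lowercase the text and return its alphabetic words, minus stopwords."""
    return [word for word in re.findall(r"[a-z]+", text.lower())
            if word not in STOPWORDS]


# Descriptive view: how many incidents fall into each structured category?
category_counts = Counter(incident["category"] for incident in incidents)

# Text view: in how many incident descriptions does each term appear?
# set() ensures a term is counted at most once per incident.
term_counts = Counter(
    term
    for incident in incidents
    for term in set(tokenize(incident["description"]))
)

print("Incidents per category:", dict(category_counts))
print("Most frequent terms:", term_counts.most_common(3))
```

In this toy data, the structured category says only “Email,” while the free-text terms point toward a specific application and a recent patch, which is the kind of detail the structured fields alone would miss.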
Numerify’s analytical insights offer you visibility into your IT operations, making it easy to perform root cause analysis.
Want to learn more about how we can help you perform RCA, making your business more efficient and productive?
Download our white paper “Visibility to Drive Digital Transformation: Why IT Needs a System of Intelligence.”