Eliminate Recurring Incidents

Please Note: Some content may reference images that are only available in the PDF version of the white paper, which is free to download above.

Situation:

A Service Desk typically handles about fifty tickets per day per thousand end users with a staff of about twenty analysts per thousand end users. But there is often a significant number of recurring tickets – in some companies as many as 10-20 percent.

Aside from the fact that it is a waste of effort for people to do the same work over and over again, unnecessary tickets increase the risk of mishandling, with associated escalations and a reduction in customer satisfaction. Even teams that have already implemented problem management often leave much low-hanging fruit unpicked.

Solution:

If the Service Desk is inundated with incidents, job #1 is to get control of the situation. Create a repeatable process for registering incidents and resolving them quickly. Once that has been accomplished with a basic service level defined for incident resolution, there is still a lot more that can be done to streamline incident handling.

Establishing a discipline for Root Cause Analysis using the ITIL Problem Management process provides a repeatable method to identify and eliminate recurring failures. With the right reports, IT can establish baselines and demonstrate the benefits of process improvements.

Simple reports cannot do much other than track the count of incidents and the number of new problem records. Deeper insight into root causes comes from using advanced analytics to highlight terms common to many incidents and to track progress over time.

Getting the most from Problem Management

The goal for Problem Management is quite simple – to reduce the number of recurring incidents that are wasting time and effort. But if success is achieved, and common Incidents occur less frequently, it may not be easy to detect “non-events”.

Unless the IT environment is particularly chaotic, it’s unlikely that there will be wholesale drops in ticket volume. Reasons for this include new users, bring your own device, new releases, and the increasing complexity and pace of change in the IT infrastructure. With all that said, there are ways to tell if problem management is succeeding:

  • Availability of services – regularly meeting business SLAs
  • Backlog of problems – staffed appropriately
  • Average age of problems – showing progress

Showing the real benefits of implementing Problem Management requires a detailed baseline of incident rates. With a baseline in place, it is possible to show improvements in specific categories, classes of resolution and other attributes. To assess the value of preventative work, estimate the effort to resolve such incidents and assign a cost based on the FTE hourly rate.
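
The cost estimate described above can be sketched in a few lines of Python. This is only an illustration: the function name, the incident counts, the effort per incident and the hourly rate are all hypothetical values, not figures from this paper.

```python
def avoided_cost(baseline_count, current_count, hours_per_incident, fte_hourly_rate):
    """Estimate the cost avoided when a class of incidents becomes less frequent.

    All inputs are assumptions supplied by the analyst:
    - baseline_count: incidents in the baseline period
    - current_count: incidents in a comparable later period
    - hours_per_incident: average effort to resolve one incident
    - fte_hourly_rate: loaded hourly cost of one FTE
    """
    # Never report a negative saving if volume went up instead of down.
    incidents_avoided = max(baseline_count - current_count, 0)
    return incidents_avoided * hours_per_incident * fte_hourly_rate


# Hypothetical example: a category drops from 120 to 80 incidents per month,
# each taking 1.5 hours to resolve at a rate of 60 per hour.
print(avoided_cost(120, 80, 1.5, 60.0))  # 3600.0
```

Comparing like-for-like periods (for example, month against month within the same category) keeps the estimate from being distorted by seasonal swings in ticket volume.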

In a business that depends on availability of certain IT services, for example an e-commerce web site, calculate revenue saved based on higher availability and historical norms of revenue generation.
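
One possible way to frame the revenue calculation is sketched below. It assumes, hypothetically, that revenue accrues evenly over time, so every additional hour of availability is worth the historical average revenue per hour. The function name and all figures are illustrative.

```python
def revenue_protected(baseline_availability, current_availability,
                      hours_in_period, revenue_per_hour):
    """Estimate revenue protected by an availability improvement.

    Availability values are fractions (0.999 means 99.9 percent).
    revenue_per_hour is the historical norm and is an assumption.
    """
    extra_uptime_hours = (current_availability - baseline_availability) * hours_in_period
    # An availability drop is reported as zero protected revenue.
    return max(extra_uptime_hours, 0.0) * revenue_per_hour


# Hypothetical example: availability rises from 99.5 to 99.9 percent over a
# 720-hour month for a site that normally earns 5000 per hour.
print(round(revenue_protected(0.995, 0.999, 720, 5000), 2))
```

A real e-commerce site would weight peak hours more heavily than quiet ones, so this even-revenue assumption is best treated as a first approximation.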

BEST PRACTICE APPROACH

1. Establish the Need

Recurring incidents are a waste of resources, preventing the most valuable IT resources from working on activities that increase the pace, quality and quantity of delivered IT innovations.

The incident process should restore service as quickly as possible. Documenting resolution activities in the incident record creates a treasure trove for subsequent analysis of common modes of failure and restoration. Problem management can then analyze the top categories of incidents to identify common incidents that could be prevented rather than being handled reactively through the Service Desk.

Key analysis: top-n incidents by category

Top categories for incident volume (category). A high volume of incidents indicates a higher risk of impact on the business and effort being applied to maintain service. Once the top categories are identified, use text analytics to determine the top keywords for each category.

Search the incident repository using the top keywords to quickly find common requests or recurring incidents. If an incident can reasonably be handled by self-service or a Level 1 Service Desk analyst, have a domain expert write a knowledge article explaining how to diagnose and resolve the issue.
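
The top-n analysis and keyword extraction can be prototyped with the Python standard library before investing in a text analytics tool. The sketch below assumes incidents have been exported as records with hypothetical `category` and `summary` fields. The sample data and the stop-word list are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical export from the incident repository.
incidents = [
    {"category": "Email", "summary": "Outlook mailbox full, cannot send"},
    {"category": "Email", "summary": "Mailbox full warning in Outlook"},
    {"category": "Network", "summary": "VPN connection drops every hour"},
    {"category": "Email", "summary": "Cannot send mail, mailbox quota exceeded"},
    {"category": "Network", "summary": "VPN client will not connect"},
    {"category": "Printing", "summary": "Printer offline on third floor"},
]

# A tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"in", "on", "the", "a", "will", "not", "every", "cannot"}


def top_categories(records, n):
    """Return the n categories with the most incidents, with their counts."""
    return Counter(r["category"] for r in records).most_common(n)


def top_keywords(records, category, n):
    """Return the n most frequent non-stop-words in one category's summaries."""
    words = Counter()
    for r in records:
        if r["category"] == category:
            tokens = re.findall(r"[a-z]+", r["summary"].lower())
            words.update(t for t in tokens if t not in STOP_WORDS)
    return words.most_common(n)


print(top_categories(incidents, 2))         # [('Email', 3), ('Network', 2)]
print(top_keywords(incidents, "Email", 1))  # [('mailbox', 3)]
```

The resulting keywords ("mailbox" in this made-up data) are the search terms to run against the repository when looking for candidates for knowledge articles.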

2. Build the Practice

The next step is to implement proactive steps to eliminate recurring incidents.

The tactic so far has been to minimize the impact of incidents by enabling faster resolution at the Service Desk. The second phase of attack seeks to understand why incidents are recurring. Problem management introduces Root Cause Analysis to identify the “why” of an incident, and develops proposed solutions to prevent them from happening in the first place. Any incident has the potential to be a recurring incident. Without a more structured approach, a problem manager can easily get overloaded.

Key analysis: Incidents with a temporary workaround

Incidents with the temporary workaround flag set (#). This analysis of the incident repository identifies incidents for which a temporary workaround is being applied. These are ideal candidates for problem management and can be found without a broad review of all incidents.
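
A minimal sketch of this analysis follows, assuming each exported incident record carries a hypothetical boolean `workaround` field. Actual field names vary by service desk tool.

```python
from collections import Counter

# Hypothetical incident records; "workaround" marks a temporary fix.
incidents = [
    {"id": "INC001", "category": "Email", "workaround": True},
    {"id": "INC002", "category": "Network", "workaround": False},
    {"id": "INC003", "category": "Email", "workaround": True},
    {"id": "INC004", "category": "Printing", "workaround": True},
    {"id": "INC005", "category": "Network"},  # flag missing: treated as not set
]


def workaround_incidents(records):
    """Return the incidents that were closed with a temporary workaround."""
    return [r for r in records if r.get("workaround", False)]


def workaround_counts_by_category(records):
    """Count workaround incidents per category, most frequent first."""
    return Counter(r["category"] for r in workaround_incidents(records)).most_common()


print([r["id"] for r in workaround_incidents(incidents)])
print(workaround_counts_by_category(incidents))
```

Grouping the flagged incidents by category gives the problem manager a short, ranked list of candidates to investigate first.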

Key analysis: Multi-user incidents with no related event

Incidents with no related event (#). Event management aims to warn of failures before they become critical or user impacting. But inadequate coverage and poorly tuned rule thresholds can give a false sense of security. This analysis identifies incidents that are being detected by end users without prior warning from the NOC of the impending failure.
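
As an illustration, the filter below selects incidents that affected several users but have no linked monitoring event. The `affected_users` and `related_events` fields and the two-user threshold are assumptions made for this sketch.

```python
# Hypothetical incident records with links to monitoring events.
incidents = [
    {"id": "INC101", "affected_users": 12, "related_events": ["EVT-9"]},
    {"id": "INC102", "affected_users": 8, "related_events": []},
    {"id": "INC103", "affected_users": 1, "related_events": []},
    {"id": "INC104", "affected_users": 30, "related_events": []},
]


def undetected_multi_user(records, min_users=2):
    """Return IDs of multi-user incidents that monitoring did not flag."""
    return [
        r["id"]
        for r in records
        if r["affected_users"] >= min_users and not r["related_events"]
    ]


print(undetected_multi_user(incidents))  # ['INC102', 'INC104']
```

Each incident in the result points to a possible gap in monitoring coverage or a rule threshold that deserves review.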

3. Measure the Success

Show the value of problem management by tracking the progress.

Once root cause analysis has been completed, a proposed solution should be defined. Solutions range from software updates, application configuration improvements, and memory upgrades, to elimination of single points of failure. Tracking the reduction in incidents for that class of failure will highlight the savings compared with the earlier, reactive approach to service restoration.

Key analysis: problem backlog

Number of open problem records (#). Track the number of open problem investigations awaiting completion. The backlog provides an indication of risk associated with the current infrastructure and processes. Once the process is well established, a sudden increase in backlog should give cause for concern and may warrant assignment of additional resources.
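
The backlog can be computed for any reporting date from the opened and closed dates of each problem record. The sketch below uses hypothetical record fields and dates. A record with no closed date is treated as still open.

```python
from datetime import date

# Hypothetical problem records; "closed" is None while the investigation is open.
problems = [
    {"id": "PRB1", "opened": date(2024, 1, 2), "closed": date(2024, 1, 20)},
    {"id": "PRB2", "opened": date(2024, 1, 10), "closed": None},
    {"id": "PRB3", "opened": date(2024, 1, 15), "closed": date(2024, 2, 1)},
    {"id": "PRB4", "opened": date(2024, 1, 25), "closed": None},
]


def open_backlog(records, as_of):
    """Count problem records that were open on the given date."""
    return sum(
        1
        for p in records
        if p["opened"] <= as_of and (p["closed"] is None or p["closed"] > as_of)
    )


# Sample the backlog on a few reporting dates to see the trend.
for day in (date(2024, 1, 14), date(2024, 1, 31), date(2024, 2, 1)):
    print(day.isoformat(), open_backlog(problems, day))
```

Plotting these counts week by week makes a sudden rise in the backlog easy to spot.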

Key analysis: problem aging by time bucket

Percentage of problem records whose Root Cause Analysis activities have not yet been completed, by time bucket (%). It is good practice to assign an SLA for RCA. For simple incidents, consider an RCA time of 3 days; for more complex incidents it may take 5-10 days.
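
A simple aging report can be sketched as follows. The bucket boundaries of 3 and 10 days are an assumption chosen to line up with the RCA targets mentioned above, and the dates are invented for illustration.

```python
from collections import Counter
from datetime import date

# Bucket labels in reporting order (boundaries are an assumption).
BUCKETS = ["0-3 days", "4-10 days", "over 10 days"]


def age_bucket(age_days):
    """Map the age of an open problem record to a time bucket."""
    if age_days <= 3:
        return "0-3 days"
    if age_days <= 10:
        return "4-10 days"
    return "over 10 days"


def aging_by_bucket(opened_dates, as_of):
    """Return the percentage of open problem records in each age bucket."""
    total = len(opened_dates)
    if total == 0:
        return {}
    counts = Counter(age_bucket((as_of - d).days) for d in opened_dates)
    return {b: 100.0 * counts[b] / total for b in BUCKETS}


# Hypothetical opened dates of problems whose RCA is not yet complete.
open_rca = [date(2024, 3, 14), date(2024, 3, 13), date(2024, 3, 8), date(2024, 3, 1)]
print(aging_by_bucket(open_rca, date(2024, 3, 15)))
# {'0-3 days': 50.0, '4-10 days': 25.0, 'over 10 days': 25.0}
```

A growing share in the oldest bucket suggests that RCA targets are being missed and that the backlog needs attention.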