Four Problem Management SLAs you really can't live without

Simon Higginson

This article has been contributed by Simon Higginson.

Problem Management is the intriguing discipline of the Service Management suite.  The IT Department is continually being asked to be proactive not reactive.

Often in IT we presuppose what our customers in the business require, then give them a solution to issues that they didn’t know that they had.  But what happens when that business customer is asking IT for a permanent solution to an issue we might not have known that we had, or to an issue where we know only a sticking plaster fix is in place?

Your Problem Manager is the key

Step forward the Problem Manager, the individual focussed on reacting to, and managing, issues that have already happened. They can’t really help but have a reactive mindset, rooted in the analysis of fact. The incident might be closed, but the Problem Manager is the person entrusted with ensuring that appropriate steps are taken so that the incident doesn’t repeat itself. It can be a stressful role: the systems were down, the company perhaps lost money (and may still be losing it), and trading has been impacted. People want to know what is being done. So what SLAs can be put in place between the Problem Manager and the service owner to support the Problem Manager’s activities and maybe give them breathing space, whilst at the same time ensuring that there is some focus on resolution?

Let’s look at the four problem management SLAs that you really can’t live without.

#1 – Provision of Problem Management reference number

A simple SLA to get you started.  This is simply an acknowledgement by the problem management team that the problem has been logged, referenced and is in the workflow of the team.  It provides reassurance that the problem is going to be dealt with.

#2 – Time to get to the root cause of the issue

This is where some breathing space is provided. The message given by this particular SLA is that there is a distinction between incident management and problem management. Incident management has resulted in a temporary fix to an issue; now it is the turn of problem management to work out what lay at the heart of the matter: what was the root cause?

Note this is an SLA about identifying and not resolving the root cause – that could take a significant time period involving redevelopment of code.

The outcome measured by the SLA is the production of a deliverable, perhaps a brief document or even just an email, that sets out the results of the root cause analysis. Each company will have to determine its own policy on what that deliverable should contain, but the SLA is there to measure the time between the formal closure of the incident and the formal delivery of problem management’s root cause analysis deliverable.
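As a rough illustration of that measurement, the sketch below compares the elapsed time between incident closure and delivery of the root cause analysis against a target. The function name `rca_sla_met` and the three-day target used in the usage lines are assumptions made for the example; the article leaves the actual target to each company.

```python
from datetime import datetime, timedelta


def rca_sla_met(incident_closed: datetime,
                rca_delivered: datetime,
                target: timedelta) -> bool:
    """Return True if the RCA deliverable arrived within `target`
    of the formal incident closure (hypothetical helper)."""
    elapsed = rca_delivered - incident_closed
    return elapsed <= target


# Usage with an assumed target of three calendar days.
closed = datetime(2024, 3, 4, 9, 0)
on_time = rca_sla_met(closed, datetime(2024, 3, 6, 17, 0), timedelta(days=3))
late = rca_sla_met(closed, datetime(2024, 3, 8, 9, 1), timedelta(days=3))
```

The clock here runs on calendar time for simplicity; a company that agrees the target in working hours would need to adjust the elapsed-time calculation accordingly.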

#3 – Measurement of provision of Root Cause Analysis documentation.  To be provided within X working days of initial notification.

So, you’ve acknowledged receipt of the problem, and you’ve determined the root cause. The next SLA is in place to ensure that a formal document is delivered in a timely fashion. It should have a set format and set down the timeline of events that caused the problem, and the actions that have been taken to provide a workaround. It should then list all of the actions and recommendations needed to fix the problem, each with a clearly identified owner and a realistic completion date. A suggested target would be 3 working days for simple problems, and 5 or 10 working days for increasingly complex ones.
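A minimal sketch of how those suggested targets could be turned into a due date is shown below. The complexity labels (`simple`, `moderate`, `complex`) and both function names are assumptions for illustration, and the calculation treats Monday to Friday as working days and ignores public holidays.

```python
from datetime import date, timedelta

# Suggested targets from the article, in working days.
# The three labels are hypothetical names for the complexity tiers.
TARGET_WORKING_DAYS = {"simple": 3, "moderate": 5, "complex": 10}


def add_working_days(start: date, working_days: int) -> date:
    """Step forward from `start`, counting only Monday to Friday."""
    current = start
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def rca_document_due(notified: date, complexity: str) -> date:
    """Due date for the RCA document, given the notification date."""
    return add_working_days(notified, TARGET_WORKING_DAYS[complexity])


# Usage: a problem notified on Thursday 7 March 2024.
due_simple = rca_document_due(date(2024, 3, 7), "simple")
due_complex = rca_document_due(date(2024, 3, 7), "complex")
```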

#4 – Measurement of progress on root cause analysis actions as agreed (Target dates not to change more than twice)

In the previous SLA we have measured the time to produce the root cause analysis.  This SLA takes over where the previous clock stopped.

The root cause analysis work will have identified actions that need to be undertaken and implemented to effect a permanent fix to the original issue and allow the sticking plaster solution to be superseded.

However, not all resolutions will be equal in complexity, effort and duration, so there will be an initial estimate of a target date for live implementation of a permanent fix. Moving the target completion date is allowed, but this SLA limits how often that can happen, to prevent action timescales from drifting.
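One way such a limit might be tracked is sketched below. The `ProblemAction` class, its field names and the reading of "not more than twice" as "a third move breaches the SLA" are all assumptions for the example rather than anything prescribed by the article.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List

# Limit taken from the SLA heading: target dates may move at most twice.
MAX_TARGET_DATE_CHANGES = 2


@dataclass
class ProblemAction:
    """A single RCA action with a history of its target dates (hypothetical model)."""
    description: str
    target_date: date
    previous_targets: List[date] = field(default_factory=list)

    def move_target(self, new_date: date) -> None:
        """Record a change of target date; setting the same date is not a change."""
        if new_date != self.target_date:
            self.previous_targets.append(self.target_date)
            self.target_date = new_date

    @property
    def changes(self) -> int:
        return len(self.previous_targets)

    @property
    def sla_breached(self) -> bool:
        return self.changes > MAX_TARGET_DATE_CHANGES


# Usage: two moves stay within the SLA, a third breaches it.
action = ProblemAction("Redevelop faulty batch job", date(2024, 4, 1))
action.move_target(date(2024, 4, 15))
action.move_target(date(2024, 5, 1))
within_limit = not action.sla_breached
action.move_target(date(2024, 5, 20))
breached = action.sla_breached
```

Keeping the history of earlier dates, rather than a bare counter, also shows how far each action has drifted from its original estimate.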

This article has been contributed by Simon Higginson of Frimley Green Ltd. Simon’s expertise is helping clients get the best out of their service suppliers and creating win-win partnerships.

7 thoughts on “Four Problem Management SLAs you really can't live without”

  1. Hi Simon
    You are quite right about the importance of Problem Management.

    A pedantic point: An SLA is a contract between two people or human groups. An SLT is a target within that SLA. These four targets are SLTs not “SLAs”. (To be really pedantic, they may often be within OLAs not SLAs)

    A couple of less-pedantic points:
    1) There are many causes to a problem. Nominating one as the “root” cause is highly subjective. So if a timeframe is set, I’ll always comply. “Oops, we’re out of time. Yup, that one looks like the root to me”.
    2) time to diagnose? time to fix? it is worth measuring these to spot trends. But how can we reasonably set them as targets? “Doctor, you must diagnose all patients within two hours” “Fireman, you must extinguish all small fires within three days, all medium ones within eight hours, and all enormous ones within two hours”.
    SLTs are not just metrics. They are thresholds that people are held accountable for. If they are not fair then they won’t work, and I don’t think some time-based SLTs are fair.
    #1 and #3 are fine: they are defined standardised sequences of steps that can be repeatably performed, and improved to meet the SLT. #2 and #4 are anyone’s guess: each problem is unique and some are unsolvable. By all means measure the trend, but the time for a particular problem is out of any person’s control.

  2. I should say up front that I’m not a fan of SLAs, mainly because they so often drive the wrong behaviors, and customers have come to see them as CYA for IT.

    I agree with Rob that #2 and #4 are not particularly useful targets. My concern is that they would drive poor behaviors as often as good.

    My bigger concern is that this set of targets focuses on only a small part of problem management. Fundamentally, should problem management have SLAs? I’m not sure it should. Timeliness and efficiency are important, but I think the issue you are really trying to address is one of transparency. I like #1 and #3 because they directly address transparency. Even then I don’t think they should be SLAs. They should be documented measurable steps in the process, and make for useful KPIs (what % of problem records include full RCA documentation?).

    Problem management exists for the purpose of reducing the quantity and/or impact of Incidents. If the SLAs don’t directly drive behaviors accomplishing one or both of those outcomes, they shouldn’t be there.

    SLAs “to support the Problem Manager’s activities and maybe give them breathing space”? Customer agreements are not the place to address that problem. Transparency and communication go a long way to relieving the Problem Manager’s stress.

  3. I tweeted that I could not disagree more with this article. Now Rob and Dan have already gone over some of the problems in the article but here are my main concerns:
    1) All ITSM processes and functions exist to create value. If a process does not create value, it’s worthless even if it runs perfectly. You must measure value.
    2) The concept of root cause should be forgotten. (Actually also incident and problem but that is a different story.) When a service fails, it needs to be restored asap. This may well require finding the causes behind the failure.
    The service should run without failures. To ensure it, the service provider should have a risk management team which studies the operation and tries to prevent or mitigate potential failures. Recent failures are a good source for improvement actions BUT THEY ARE NOT THE ONLY SOURCE.
    So, unlearn ITIL bad practice and study Risk Management. ISO 31000 is a good source of Real Good Practice.

  4. I blogged some other suggested KPIs for Problem Management.

    They are

    Mean time to first respond to problems, by priority

    Satisfaction of the service desk and incident management team(s) with problem management

    Reduction in a composite risk index for the problem portfolio

    Percentage of problems resolved

    Percentage of recurring problems

    Percentage of problems “standardised”: documented so they can be dealt with in a pre-defined manner when they recur

    Costs and resource usage in problem management

    Percentage of shelved problems (“cold cases”)

    Number of knowledge articles produced – especially workarounds – by the problem management technicians
