Availability, Incident and Problem Management – The New Holy Trinity? (part 1)

So here’s the thing. We all know that incident and problem management, if working well, can reduce interruptions to the end user and improve service quality for the business. From an end user’s perspective though, availability is the name of the game. While most organisations have the basics covered with incident management, how many use problem & availability management to look at the underlying cause of incidents at a service level as well as a component level?

Working together effectively, availability, incident & problem management can improve both quality of service and the business perception of IT. Getting back to basics, incident management is a purely reactive process. We sort things out so that the business can carry on as usual. Problem management is both reactive and proactive. We look at what went wrong but also how to stop it from happening again. Availability management looks at all availability issues at both a component & service level, and ensures that we consider availability at the point of service design as well as monitoring uptime during normal operations.

When describing the three processes, I call incident management the superheroes of ITIL. They save the world several times a day, fighting fires and making people smile. Problem management are detectives. They get to the root cause and sort it out to stop the same issues from recurring. Availability management are the scientists of the ITIL world. Like the guys from The Big Bang Theory, they design the service to keep it up & running as much as possible based on user requirements.

Today, IT service issues are constantly in the news. With the advent of social media, news of service downtime can be spread globally in minutes, which is kind of embarrassing, especially if you are a highly visible entity such as a bank or government department. Putting aside the embarrassment factor for a minute, what about financial implications such as fines and service credits? Or regulatory impact, such as failing to comply with any standards mandated by your management? Let’s not forget the angry mob waiting outside to make their dissatisfaction known if downtime is an own goal such as a poorly managed change. With this in mind, I’ve put together ten tips on how to use availability, incident and problem management to maximise service effectiveness. This article covers the first three.


Tip 1: Getting your facts straight

Have separate records for availability, incident & problem management. Incident management records (“fix it quick”) should focus on getting the user details and a full description of the issue. Some of the information captured by incident records could include:


When managing an incident, different support teams may need different views, e.g.

  • Networks team – by location
  • Service desk – by customer satisfaction
  • Desktop support – by hardware
  • Development – by software application
  • Capacity management – by resource usage
  • Service delivery managers – by business impact
  • Change management – by date / time to compare with the change schedule
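
As a rough illustration of those views, here is a minimal Python sketch that groups the same set of incident records by whichever field a team cares about. The record fields, the sample data and the view_by helper are all made up for the example and are not taken from any particular service desk tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical fields, chosen only to match some of the views listed above.
    ref: str
    location: str
    hardware: str
    application: str
    business_impact: str


def view_by(incidents, field):
    """Group incident references by the value of one field."""
    groups = defaultdict(list)
    for incident in incidents:
        groups[getattr(incident, field)].append(incident.ref)
    return dict(groups)


# Invented sample records, for illustration only.
incidents = [
    Incident("INC001", "London", "laptop", "payroll", "high"),
    Incident("INC002", "Leeds", "printer", "email", "low"),
    Incident("INC003", "London", "laptop", "email", "medium"),
]

networks_view = view_by(incidents, "location")      # networks team
desktop_view = view_by(incidents, "hardware")       # desktop support
development_view = view_by(incidents, "application")  # development
```

The point of the sketch is that nobody needs their own copy of the data: one incident record, captured properly, can feed every team’s view.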

Problem management records focus on establishing the root cause and actions to prevent recurrence. Problem records can contain the following information:


Availability records should look at planning for the appropriate level of availability and ensuring that availability & recovery criteria are considered when designing new services. Your availability plan should contain the following information:


Tip 2: Identify roles & responsibilities

Be organised so there’s no duplication or wasted effort. In short, the incident manager is concerned with speed, the problem manager is concerned with investigation and diagnosis, and the availability manager is concerned with the end-to-end service.

Key priorities for the incident manager will include co-ordinating the incident, managing communications with both technical support teams and business customers, and ensuring that the issue is fixed ASAP.

The problem manager will focus on root cause investigation, trending (has this issue popped up before?), finding a fix (interim workarounds and permanent resolution) and ensuring that any lessons learned are documented & acted on.

The availability manager will look at ensuring the service is designed with the appropriate levels of availability, working with service operations to tackle issues at both a service and component level and using the extended incident cycle to look at trends and how the service can be improved.

Tip 3: Keeping up to date

It’s really important to keep an eye on BAU, as seemingly small incidents can spiral out of control and have a negative effect on availability levels and customer satisfaction. Simple things can make a big difference. For example, place a whiteboard near the service desk with a list of the top ten problems, so that it’s easy for service desk analysts to link incidents to problems and trends can be identified later on. If the service desk have a team meeting, ask to attend and update them on any new problems as well as updates and workarounds on existing problems. Don’t forget to close the loop and let the service desk know when a problem record has been fixed and closed off. There’s nothing worse for a service desk than having to call a list of customers about an issue that was sorted out months ago!

Get proactive! Work as a team to view service availability throughout the month. Have a process to automatically raise a new proactive problem record if availability targets are threatened so that things can be done to prevent further issues. Don’t just sit there waiting to fail the SLA!
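
To make that trigger a bit more concrete, here is a minimal Python sketch of the arithmetic behind it. It assumes availability is measured as agreed service time minus downtime, divided by agreed service time, and that a proactive problem record gets raised once half of the month’s downtime allowance has been used. The function names, the 99.5% target and the 50% warning level are all assumptions for the example, not values from any SLA or tool.

```python
def availability_pct(agreed_minutes, downtime_minutes):
    """Availability as a percentage of agreed service time."""
    return 100.0 * (agreed_minutes - downtime_minutes) / agreed_minutes


def allowed_downtime(target_pct, agreed_minutes):
    """Minutes of downtime the target permits over the period."""
    return agreed_minutes * (100.0 - target_pct) / 100.0


def should_raise_proactive_problem(target_pct, agreed_minutes,
                                   downtime_so_far, warning_fraction=0.5):
    """True once downtime so far has used up the warning share of the allowance.

    warning_fraction is a hypothetical tuning knob: 0.5 means "raise a
    problem record when half of the allowed downtime has gone".
    """
    budget = allowed_downtime(target_pct, agreed_minutes)
    return downtime_so_far >= warning_fraction * budget


# Assumed example: a 24x7 service over a 30-day month with a 99.5% target.
agreed = 30 * 24 * 60                      # 43,200 agreed minutes
budget = allowed_downtime(99.5, agreed)    # 216 minutes allowed in the month

quiet_month = should_raise_proactive_problem(99.5, agreed, downtime_so_far=60)
rough_month = should_raise_proactive_problem(99.5, agreed, downtime_so_far=120)
```

In this example, 60 minutes of downtime is still under the warning level, while 120 minutes trips it even though the SLA itself has not yet been breached. That gap is the window in which a proactive problem record can actually do some good.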


In part two, I will continue with a further seven tips on how to use availability, incident and problem management to maximise service effectiveness.

2 thoughts on “Availability, Incident and Problem Management – The New Holy Trinity? (part 1)”

  1. Hi Vawns, as a problem manager I couldn’t agree more with your article. Availability is experienced directly by the customer and they will perceive that as service quality. At this point, no matter how quickly we restore service or how detailed the Root Cause Analysis, the perception is a much harder challenge to turn around.

    It’s very easy to look at servers, helpdesk hardware and network firewalls and
    lose that connection to the end user. I work with a large number of Scrum
    Masters and product owners and have been working to balance the drive for
    new functionality with reliability/availability.
    I’m looking forward to the next instalment.
