Recently I’ve been working on Incident Management, and specifically on Major Incident planning.
During my time in IT Operations I saw teams handle Major Incidents in a number of different ways. I actually found that in some cases all process and procedure went out of the window during a Major Incident, which has a horrible irony about it. Logically it would seem that this is the time that applying more process to the situation would help, especially in the area of communications.
For example in an organisation I worked in previously we had a run of Storage Area Network outages. The first couple caused absolute mayhem and I could see people pushing back against the idea of breaking out the process-book because all that mattered was finding the technical fix and getting the storage back up and running.
At the end of the Incident, once we’d restored the service we found that we, maybe unsurprisingly had a lot of unhappy customers! Our retrospective on that Incident showed us that taking just a short time at the beginning of the outage to sort out our communications plan would have helped the users a lot.
ITIL talks about Major Incident planning in a brief but fairly helpful way:
A separate procedure, with shorter timescales and greater urgency, must be used for ‘major’ incidents. A definition of what constitutes a major incident must be agreed and ideally mapped on to the overall incident prioritization system – such that they will be dealt with through the major incident process.
So, the first thing to note is that we don’t need a separate ITIL process for handling Major Incidents. The aim of the Incident Management process is to restore service to the users of a service, and that outcome suits us fine for Major Incidents too.
The Incident model, its categories and states ( New > Work In Progress > Resolved > Closed ) all work fine, and we shouldn’t be looking to stray too far from what we already have in terms of tools and process.
What is different about a Major Incident is that both the urgency and impact of the Incident are higher than a normal day-to-day Incident. Typically you might also say that a Major Incident affects multiple customers.
Working with a Major Incident
When working on a Major Incident we will probably have to think about communications a lot more, as our customers will want to know what is going on and rough timings for restoration of service.
Where a normal Incident will be handled by a single person (The Incident Owner) we might find that multiple people are involved in a Major Incident – one to handle the overall co-ordination for restoring service, one to handle communications and updates and so on.
Having a named person as a point of contact for users is a helpful trick. In my experience the one thing that users hate more than losing their service is not knowing when it will be restored, or receiving confusing or conflicting information. With one person responsible for both the technical fix and user communications this is bound to happen – split those tasks.
If your ITSM suite has functionality for a news ticker, or a SocialIT feed it might be a good idea to have a central place to update customers about the Major Incident you are working on. If you run a service for the paying public you might want to jump onto Twitter to stop the Twitchfork mob discussing your latest outage without you being part of the conversation!
What is a Major Incident
It is up to each organisation to clearly define what consitutes a Major Incident. Doing so is important, otherwise the team won’t know under what circumstances to start the process. Or you might find that without clear guidance a team will treat a server outage one week as Major (with excellent communciations) and not the next week with poor communications.
Having this defined is an important step, but will vary between organisations.
Roughly speaking a generic definition of a Major Incident could be
- An Incident affecting more than one user
- An Incident affecting more than one business unit
- An Incident on a device on a certain type – Core switch, access router, Storage Area Network
- Complete loss of a service, rather than degregation
Is a P1 Incident a Major Incident?
No, although I would say that every Major Incident would be a P1. An urgent Incident affecting a single user might not be a Major Incident, especially if the Incident has a documented workaround or can be fixed straightaway.
Confusing P1 Incidents with Major Incidents would be a mistake. Priority is a calculation of Impact and Urgency, and the Major Incident plan needs to be reserved for the absolute maximum examples of both, and probably where the impact is over multiple users.
Do I need a single Incident or multiple Incidents for logging a Major Incident?
This question might depend on your ITSM toolset, but my preference is to open a separate Incident for each user affected in the Incident when they contact the Servicedesk.
The reason for this is that different users will be impacted in different ways. A user heading off to a sales pitch will have different concerns to a user just about to go on holiday for 2 weeks. We might want to apply different treatment to these users (get the sales pitch user some sort of service straight away) and this becomes confusing when you work in a single Incident record.
If you have a system of Hierarchical escalation you might find that one customer would escalate the Major Incident (to their sales rep for example) where another customer isn’t too bothered because they use the affected service less frequently.
Having an Incident opened for each user/customer allows you to judge exactly the severity of the Incident. The challenge then becomes to manage those Incidents easily, and be able to communicate consistently with your customers.
Is a Major Incident a Problem?
No, although if we didn’t have a Problem record open for this Major Incident I think we should probably do so.
Remember the intended outcome of the Incident and Problem Management processes:
- Incident Management: The outcome is a restoration of service for the users
- Problem Management: The outcome is the identification and possibly removal of the causes of Incidents
The procedure is started when an Incident matches our definition of a Major Incident. It’s outcome is to restore service and to handle the communication with multiple affected users. That restoration of service could come from a number of different sources – The removal of the root cause, a documented Workaround or possibly we’ll have to find a Workaround.
Whereas the Major Incident plan and Problem Management process will probably work closely together it is not true to say that a Major Incident IS a Problem.
How can I measure my Major Incident Procedure?
I have some metrics for measuring the Major Incident procedure and I’d love to know your thoughts in the comments for this article.
- Number of Incidents linked to a Major Incident: Where we are creating Incidents for each customer affected by a Major Incidents we should be able to measure the relative impact of each occurance.
- The number of Major Incidents: We’d like to know how often we invoke the Major Incident plan
- Mean Time Between Major Incidents: How much time elapses between Major Incidents being logged. This would be interesting in an organisation with service delivery issues, and they would hope to see Major Incidents happen less frequently
There you go. In summary handling Major Incidents isn’t a huge leap from the method that you use to handle day-to-day Incidents. It requires enhanced communciation and possibly measurement.
I hope that you found this article helpful.