Following a sparkly pressie from the guys at TOPdesk, we got to thinking here at Enterprise Opinions towers about what should go in our emergency kit for dealing with Major Incidents.
To be fair, my Starbucks habit is slightly worrying but staying caffeinated helps me stay on the ball. When dealing with a crisis, sometimes you just need a second to figure out the next step. Taking a sip of your drink, be it coffee, green tea or water, takes you out of the situation momentarily and gives you a chance to clear your head and come up with a plan. That said, this is effectively me on a bad day.
Key Phone Numbers
Picture the scene: my second day at a Problem Management gig, and a contractor accidentally hits the EPO button in the data centre. For the uninitiated, an EPO or Emergency Power Off button instantly cuts the power to a room and is there as a safety measure in the event of a fire or someone suffering an electric shock. They’re usually bright red and labelled EPO. Unfortunately, in this case its proximity to the door meant the chap in question mistook it for the door release button and pressed it, taking out all services to the building and 8 major customers. As this wasn’t a DR test, there was no way to fail over: cue an epic Major Incident, and the Service Desk sat in shock because they had no working phones and no corporate address book to look up phone numbers. Not our finest hour. No one was saying anything, so I did the only thing I could think of at the time: told everyone to use their mobiles to call everyone they had numbers for, starting from the top down, until we were able to restore power, promising that I would personally pay their mobile bills if the finance department rejected the resulting expense claims. It was the only option we had at the time, and luckily we were back up with the basics in about 30 minutes, but my overriding memory of that day is feeling really out of control. Let’s face it, that level of faffery in a Major Incident is never good.
I love my iPhone. It’s pink (obvs), has every app I can think of and goes everywhere with me. It unfortunately has naff-all battery life, so I carry a charger with me at all times.
Feedback that I’ve had time and time again when I’ve done Incident Management type roles is how calm I seem when things are kicking off. I have no idea why people think I’m calm; I can promise you that it’s all a huge act. Inside my head, I’m having kittens, or reciting every swear word I can think of, or wishing I could hide under my desk, but when I have a Service Desk full of analysts relying on me, I’m not going to let everyone down by panicking and then making mistakes. I guess you could say it’s a bit like parenting. As a mum of three I can tell you that kids can sense uncertainty, fear and, in the case of my little darlings, chocolate buttons at twenty paces, so the trick is to have a total air of “I’ve got this”. Fake it til you make it: act as though everything’s grand, you’ll calm down, which will in turn calm everyone around you, and you can focus on getting everything fixed.
What would you have in your “break glass in case of emergency kit”? Let us know in the comments!
Most readers have got the story now from my recent articles: Cherry Valley, Illinois, 2009, rain bucketing down, huge train-load of ethanol derails, fire, death, destruction.
Eventually the Canadian National’s crews and the state’s emergency services cleaned up the mess, and CN rebuilt the track-bed and the track, and trains rolled regularly through Cherry Valley again.
Then the authorities moved in to find out what went wrong and to try to prevent it happening again. In this case the relevant authority is the US National Transportation Safety Board (NTSB).
Keep asking why?
Every organisation should have a review process for requests, incidents, problems and changes, with some criteria for triggering a review.
In this case it was serious enough that an external agency reviewed the incident. The NTSB had a good look and issued a report. Read it as an example of what a superb post-incident review looks like. Some of our major IT incidents involve as much financial loss as this one and sadly some also involve loss of life.
IT has a fascination with “root cause”. Root Cause Analysis (RCA) is a whole discipline in its own right. The Kepner-Tregoe technique (ITIL Service Operation 2011, Appendix C) calls it “true cause”.
The rule of thumb is to keep asking “Why?” until the answers aren’t useful any more, then that – supposedly – is your root cause.
This belief in a single underlying cause of things going wrong is a misguided one. The world doesn’t work that way – it is always more complex.
The NTSB found a multitude of causes for the Cherry Valley disaster. Here are just some of them:
It was extreme weather
The CN central rail traffic controller (RTC) didn’t put out a weather warning to the train crew which would have made them slow down, although required to do so and although he was in radio communication with the crew
The RTC did not notify track crews
The track inspector checked the area at 3pm and observed no water build-up
Floodwater washed out a huge hole in the track-bed under the tracks, leaving the rails hanging in the air.
Railroads normally post their contact information at all grade crossings but the first citizen reporting the washout could not find the contact information at the crossing where the washout was, so he called 911
The police didn’t communicate well with CN about the washed out track: they first alerted two other railroads
There wasn’t a well-defined protocol for such communication between police and CN
Once CN learned of the washout they couldn’t tell the RTC to stop trains because his phone was busy
Although the train crew saw water up to the tops of the rails in some places they did not slow down of their own accord
There was a litany of miscommunication between many parties in the confusion after the accident
The federal standard for ethanol cars didn’t require them to be double-skinned or to have puncture-proof bulkheads (it will soon: this tragedy triggered changes)
There had been a previous washout at the site and a 36” pipe was installed as a relief drain for flooding. Nobody calculated what size pipe was needed and nobody investigated where the flood water was coming from. After the washout the pipe was never found.
The county’s storm-water retention pond upstream breached in the storm. The storm retention pond was only designed to handle a “ten year storm event”.
Local residents produced photographic evidence that the berm and outlet of the pond had been deteriorating for several years beforehand.
OK, you tell me: which is the “root cause”?
Causes don’t normally arrange themselves in a nice tree all leading back to one. There are several fundamental contributing causes. Anyone who watches Air Crash Investigation knows it takes more than one thing to go wrong before we have an accident.
Sometimes one of them stands out like the proverbial. So instead of calling it root cause I’m going to call it primary cause. Sure the other causes contributed but this was the biggest contributor, the primary.
Ian Clayton once told me that root cause …er… primary cause analysis is something you do after the fact, as part of the review and wash-up. In the heat of the crisis, who gives a toss what the primary cause is? Any disaster rests on multiple causes, so removing the most accessible one will likely restore service. Then, when we have time to consider what happened and take steps to prevent a recurrence, we should probably try to address all the causes. Don’t do Root Cause Analysis, just do Cause Analysis, seeking the multiple contributing causes. If we need to focus our efforts then the primary cause is the place to start, which implies that a key factor in deciding primacy is how broad the potential is for causing further incidents.
All complex systems are broken
It is not often you read something that completely changes the way you look at IT. This paper How Complex Systems Fail rocked me. Reading this made me completely rethink ITSM, especially Root Cause Analysis, Major Incident Reviews, and Change Management. You must read it. Now. I’ll wait.
It says that all complex systems are broken. It is only when the broken bits line up in the right way that the system fails.
It dates from 1998! Richard Cook is a doctor, an MD. He seemingly knocked this paper off on his own. It is a whole four pages long, and he wrote it with medical systems in mind. But that doesn’t matter: it is deeply profound in its insight into any complex system and it applies head-on to our delivery and support of IT services.
“Complex systems run as broken systems”
“Change introduces new forms of failure”
“Views of ‘cause’ limit the effectiveness of defenses against future events… likelihood of an identical accident is already extraordinarily low because the pattern of latent failures changes constantly.”
“Failure free operations require experience with failure.”
Many times the person “to blame” for a primary cause was just doing their job. All complex systems are broken. Every day the operators make value judgements and risk calls. Sometimes they don’t get away with it. There is a fine line between considered risks and incompetence – we have to keep that line in mind. Just because they caused the incident doesn’t mean it is their fault. Think of the word “fault” – what they did may not have been faulty, it may just be what they have to do every day to get the job done. Too often, when they get away with it they are considered to have made a good call; when they don’t they get crucified.
That’s not to say negligence doesn’t happen. We should keep an eye out for it, and deal with it when we find it. Equally we should not set out on cause analysis with the intent of allocating blame. We do cause analysis for the purpose of preventing a recurrence of a similar Incident by removing the existing Problems that we find.
I will close by once again disagreeing with ITIL’s idea of Problem Management. As I said in my last article, pro-active Problem Management is not about preventing problems from occurring, or preventing causes of service disruption from entering the environment, or making systems more robust or higher quality.
It is overloading Problem Management to also make it deal with “How could this happen?” and “How do we prevent it from happening again?” That is dealt with by Risk Management (an essential practice that ITIL does not even recognise) feeding into Continual Service Improvement to remove the risk. The NTSB were not doing Problem Management at Cherry Valley.
Next time we will look at continual improvement and how it relates to problem prevention.
Recently I’ve been working on Incident Management, and specifically on Major Incident planning.
During my time in IT Operations I saw teams handle Major Incidents in a number of different ways. I actually found that in some cases all process and procedure went out of the window during a Major Incident, which has a horrible irony about it. Logically it would seem that this is the time that applying more process to the situation would help, especially in the area of communications.
For example in an organisation I worked in previously we had a run of Storage Area Network outages. The first couple caused absolute mayhem and I could see people pushing back against the idea of breaking out the process-book because all that mattered was finding the technical fix and getting the storage back up and running.
At the end of the Incident, once we’d restored the service, we found that we, maybe unsurprisingly, had a lot of unhappy customers! Our retrospective on that Incident showed us that taking just a short time at the beginning of the outage to sort out our communications plan would have helped the users a lot.
ITIL talks about Major Incident planning in a brief but fairly helpful way:
A separate procedure, with shorter timescales and greater urgency, must be used for ‘major’ incidents. A definition of what constitutes a major incident must be agreed and ideally mapped on to the overall incident prioritization system – such that they will be dealt with through the major incident process.
So, the first thing to note is that we don’t need a separate ITIL process for handling Major Incidents. The aim of the Incident Management process is to restore service to the users of a service, and that outcome suits us fine for Major Incidents too.
The Incident model, its categories and states ( New > Work In Progress > Resolved > Closed ) all work fine, and we shouldn’t be looking to stray too far from what we already have in terms of tools and process.
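That shared lifecycle can be sketched as a tiny state machine. This is purely illustrative, not the model from any particular ITSM tool; the state names come straight from the text, the transition map (including the reopen path) is an assumption:

```python
# Minimal sketch of the Incident lifecycle described above.
# State names follow the text; the transition map is illustrative.
VALID_TRANSITIONS = {
    "New": {"Work In Progress"},
    "Work In Progress": {"Resolved"},
    "Resolved": {"Closed", "Work In Progress"},  # reopen if the fix didn't hold
    "Closed": set(),
}

class Incident:
    def __init__(self, summary, major=False):
        self.summary = summary
        self.major = major  # a Major Incident uses the same states, just with more urgency
        self.state = "New"

    def move_to(self, new_state):
        # Refuse any transition the model doesn't allow.
        if new_state not in VALID_TRANSITIONS[self.state]:
            raise ValueError(f"Cannot go from {self.state} to {new_state}")
        self.state = new_state

inc = Incident("SAN outage", major=True)
inc.move_to("Work In Progress")
inc.move_to("Resolved")
inc.move_to("Closed")
```

The point of the sketch is that the `major` flag changes nothing about the lifecycle itself, which is exactly the argument above: same states, same tools, different urgency.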
What is different about a Major Incident is that both the urgency and impact of the Incident are higher than a normal day-to-day Incident. Typically you might also say that a Major Incident affects multiple customers.
Working with a Major Incident
When working on a Major Incident we will probably have to think about communications a lot more, as our customers will want to know what is going on and rough timings for restoration of service.
Where a normal Incident will be handled by a single person (the Incident Owner), we might find that multiple people are involved in a Major Incident – one to handle the overall co-ordination of restoring service, another to handle communications and updates, and so on.
Having a named person as a point of contact for users is a helpful trick. In my experience the one thing users hate more than losing their service is not knowing when it will be restored, or receiving confusing or conflicting information. With one person responsible for both the technical fix and user communications, that is bound to happen, so split those tasks.
If your ITSM suite has functionality for a news ticker, or a SocialIT feed it might be a good idea to have a central place to update customers about the Major Incident you are working on. If you run a service for the paying public you might want to jump onto Twitter to stop the Twitchfork mob discussing your latest outage without you being part of the conversation!
What is a Major Incident?
It is up to each organisation to clearly define what constitutes a Major Incident. Doing so is important, because otherwise the team won’t know under what circumstances to start the procedure. Without clear guidance you might find that a team treats a server outage as Major one week (with excellent communications) but not the next, with poor communications as a result.
Having this defined is an important step, but will vary between organisations.
Roughly speaking, a generic definition of a Major Incident could be:
An Incident affecting more than one user
An Incident affecting more than one business unit
An Incident on a device on a certain type – Core switch, access router, Storage Area Network
Complete loss of a service, rather than degradation
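A definition like the list above only works if it is unambiguous, and one way to make it unambiguous is to encode it as a simple check. The function below mirrors the generic bullets; the device types and thresholds are illustrative placeholders that each organisation would replace with its own:

```python
# Illustrative check against the generic Major Incident criteria listed above.
# The device-type list is a placeholder; substitute your own critical CI types.
CRITICAL_DEVICE_TYPES = {"core switch", "access router", "storage area network"}

def is_major_incident(users_affected, business_units_affected,
                      device_type, complete_loss_of_service):
    """Return True if the incident matches any of the generic criteria."""
    return (
        users_affected > 1
        or business_units_affected > 1
        or device_type.lower() in CRITICAL_DEVICE_TYPES
        or complete_loss_of_service
    )
```

The value of writing it down like this is that two analysts looking at the same outage in different weeks reach the same answer, which is exactly the consistency problem described above.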
Is a P1 Incident a Major Incident?
No, although I would say that every Major Incident would be a P1. An urgent Incident affecting a single user might not be a Major Incident, especially if the Incident has a documented workaround or can be fixed straightaway.
Confusing P1 Incidents with Major Incidents would be a mistake. Priority is a calculation of Impact and Urgency, and the Major Incident plan needs to be reserved for the absolute maximum examples of both, and probably where the impact is over multiple users.
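The “Priority is a calculation of Impact and Urgency” point is often expressed as a lookup matrix. The sketch below assumes a three-level scale for each axis (1 = highest), which is one common convention rather than anything mandated; it illustrates why every Major Incident is a P1 but not the other way round:

```python
# Hypothetical 3x3 priority matrix: impact and urgency run 1 (high) to 3 (low).
# The mapping is illustrative; organisations tune their own matrix.
PRIORITY_MATRIX = {
    (1, 1): "P1", (1, 2): "P2", (1, 3): "P3",
    (2, 1): "P2", (2, 2): "P3", (2, 3): "P4",
    (3, 1): "P3", (3, 2): "P4", (3, 3): "P5",
}

def priority(impact, urgency):
    """Priority is derived from impact and urgency, never assigned directly."""
    return PRIORITY_MATRIX[(impact, urgency)]

# A Major Incident sits at maximum impact AND urgency, so it lands on P1.
# But a P1 can also arise from, say, maximum impact on one critical user,
# which need not trigger the Major Incident procedure.
```

In other words, P1 is a cell in the matrix; “Major” is a separate decision reserved for the extreme corner of it, usually with multiple users affected.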
Do I need a single Incident or multiple Incidents for logging a Major Incident?
This question might depend on your ITSM toolset, but my preference is to open a separate Incident for each affected user when they contact the Service Desk.
The reason for this is that different users will be impacted in different ways. A user heading off to a sales pitch will have different concerns to a user just about to go on holiday for 2 weeks. We might want to apply different treatment to these users (get the sales pitch user some sort of service straight away) and this becomes confusing when you work in a single Incident record.
If you have a system of Hierarchical escalation you might find that one customer would escalate the Major Incident (to their sales rep for example) where another customer isn’t too bothered because they use the affected service less frequently.
Having an Incident opened for each user/customer allows you to judge exactly the severity of the Incident. The challenge then becomes to manage those Incidents easily, and be able to communicate consistently with your customers.
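One way to keep those per-user Incidents manageable is to link each one to a single Major Incident record, so impact can be read off directly and communications go out once, consistently. This is a hypothetical sketch of that parent/child shape, not any vendor’s data model:

```python
# Hypothetical parent/child structure: one Major Incident record,
# one linked child Incident per affected user.
class MajorIncident:
    def __init__(self, description):
        self.description = description
        self.linked_incidents = []  # one child Incident per affected user

    def log_user_incident(self, user, impact_note):
        # Each user's impact is recorded separately, so different treatment
        # (e.g. fast-tracking the user off to a sales pitch) stays visible.
        incident = {"user": user, "impact": impact_note, "state": "New"}
        self.linked_incidents.append(incident)
        return incident

    def severity(self):
        # The count of linked Incidents is a direct measure of impact.
        return len(self.linked_incidents)

    def broadcast(self, update):
        # One update, sent consistently to every affected user.
        return [f"To {i['user']}: {update}" for i in self.linked_incidents]

mi = MajorIncident("SAN outage")
mi.log_user_incident("sales_user", "needs service for pitch at 2pm")
mi.log_user_incident("holiday_user", "away for 2 weeks, low urgency")
print(mi.severity())  # → 2
```

The `broadcast` helper is the interesting bit: because every affected user hangs off one record, consistent communication stops being a manual chore.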
Is a Major Incident a Problem?
No, although if we don’t already have a Problem record open for this Major Incident we should probably raise one.
Remember the intended outcome of the Incident and Problem Management processes:
Incident Management: The outcome is a restoration of service for the users
Problem Management: The outcome is the identification and possibly removal of the causes of Incidents
The procedure is started when an Incident matches our definition of a Major Incident. Its outcome is to restore service and to handle the communication with multiple affected users. That restoration of service could come from a number of different sources: removal of the root cause, a documented Workaround, or possibly a Workaround we have to find.
While the Major Incident plan and the Problem Management process will probably work closely together, it is not true to say that a Major Incident IS a Problem.
How can I measure my Major Incident Procedure?
I have some metrics for measuring the Major Incident procedure and I’d love to know your thoughts in the comments for this article.
Number of Incidents linked to a Major Incident: where we create an Incident for each customer affected by a Major Incident, we should be able to measure the relative impact of each occurrence.
The number of Major Incidents: we’d like to know how often we invoke the Major Incident plan.
Mean Time Between Major Incidents: how much time elapses between Major Incidents being logged. This would be interesting in an organisation with service delivery issues, which would hope to see Major Incidents happen less frequently.
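Mean Time Between Major Incidents falls straight out of the logged start times. This is a small illustrative helper under the assumption that each Major Incident record carries a start timestamp; it is not taken from any toolset:

```python
from datetime import datetime, timedelta

def mean_time_between(major_incident_starts):
    """Mean gap between consecutive Major Incident start times."""
    times = sorted(major_incident_starts)
    if len(times) < 2:
        return None  # need at least two Major Incidents to measure a gap
    gaps = [b - a for a, b in zip(times, times[1:])]
    return sum(gaps, timedelta()) / len(gaps)

# Hypothetical start dates for three Major Incidents:
starts = [datetime(2014, 1, 5), datetime(2014, 2, 2), datetime(2014, 3, 30)]
print(mean_time_between(starts))  # → 42 days, 0:00:00
```

Tracked over time, a rising mean is the trend an organisation with service delivery issues would hope to see.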
There you go. In summary, handling Major Incidents isn’t a huge leap from the method you use to handle day-to-day Incidents. It just requires enhanced communication and possibly some extra measurement.