Following a sparkly pressie from the guys at TOPdesk, we got to thinking here at Enterprise Opinions towers about what should go in our emergency kit for dealing with Major Incidents.
Coffee

To be fair, my Starbucks habit is slightly worrying, but staying caffeinated helps me stay on the ball. When dealing with a crisis, sometimes you just need a second to figure out the next step. Taking a sip of your drink, be it coffee, green tea or water, takes you out of the situation momentarily and gives you a chance to clear your head and come up with a plan.
Key Phone Numbers
Picture the scene: my second day at a Problem Management gig, and a contractor accidentally hits the EPO button in the data centre. For the uninitiated, an EPO or Emergency Power Off button instantly cuts the power to a room and is there as a safety measure in the event of a fire or someone suffering an electrical shock. They’re usually bright red and labelled EPO. Unfortunately, in this case its proximity to the door meant the chap in question mistook it for the door release button and pressed it, taking out all services to the building and 8 major customers. As this wasn’t a DR test, there was no way to fail over; cue an epic Major Incident, and the Service Desk sat in shock because they had no working phones and no corporate address book to look up phone numbers. Not our finest hour.

No one was saying anything, so I did the only thing I could think of at the time: told everyone to use their mobiles to call everyone they had numbers for, starting from the top down, until we were able to restore power, and promised I would personally pay their bills if the finance department rejected the resulting expense claims. It was the only option we had at the time and luckily we were back up with the basics in about 30 minutes, but my overriding memory of that day is feeling really out of control. Let’s face it, that level of faffery in a Major Incident is never good. Ever since, a printed list of key phone numbers has lived in my kit.
A Phone Charger

I love my iPhone. It’s pink (obvs), has every app I can think of and goes everywhere with me. Unfortunately, it has naff all battery life, so I carry a charger with me at all times.
A Calm Head

Feedback I’ve had time and time again in Incident Management-type roles is how calm I seem when things are kicking off. I have no idea why people think I’m calm; I can promise you it’s all a huge act. Inside my head I’m having kittens, or reciting every swear word I can think of, or wishing I could hide under my desk, but when I have a Service Desk full of analysts relying on me, I’m not going to let everyone down by panicking and then making mistakes. I guess you could say it’s a bit like parenting. As a mum of three, I can tell you that kids can sense uncertainty, fear and, in the case of my little darlings, chocolate buttons at twenty paces, so the trick is to have a total air of “I’ve got this”. Fake it til you make it: act as though everything’s grand and you’ll calm down, which will in turn calm everyone around you, and you can focus on getting everything fixed.
What would you have in your “break glass in case of emergency” kit? Let us know in the comments!
One of the things that isn’t covered as much as it should be is how to respond to a crisis directly linked to Change activity. This is one of those situations where, despite your Change process, your sparkly toolset and your fab policies & work instructions, something has gone pear-shaped on a massive scale and you’re staring down the barrel of a Change Management-related crisis.
Here is my guide to dealing with the fallout, without having to resort to mainlining chocolate or vodka.
Keep calm! Easier said than done I know (and believe me, I know) but panicking or making reactive, snap decisions won’t help things and might actually make them worse. Take a deep breath, roll up your sleeves and get stuck in.
Don’t believe me about how panicking can make things worse? A couple of years ago, I was working on a client site in Milton Keynes. This client had a financial system for each EU country, accessed via a command-line interface. One fateful Wednesday, the German financial system experienced an issue and the tech support guy had to run a fix script and restart the system. The Service Delivery Manager for Germany, in her infinite wisdom, thought that standing over the poor techie (we’ll call him Bob) and shouting at him would speed things up. It didn’t. What actually happened was that poor Bob typed the command for rebooting the UK system rather than the German one into his console, taking out all transactions for the UK and doubling the impact of the Incident. Not our finest hour, I think you’ll agree.
Get a handle on where you’re at. Is the Incident still ongoing? First and foremost, look after your people: is the environment safe? Are there any imminent hazards? Chances are Incident Management or Problem Management are running with the drama or, if it’s really serious, a Crisis Manager or IT Service Continuity Management. Regardless of who is running with the fallout, you need to play a supporting role in helping them understand what the Change was meant to do and what went wrong. Hopefully, you already have a documented process, with roles and responsibilities, for invoking the help of Incident Management when a Change fails. If there’s nothing in the process for what happens when a Change fails spectacularly, then step up and help co-ordinate the fix effort. You can look at lessons learned, the chain of command and who sorts out what later on, once the immediate danger has been contained.
Figure out who’s going to handle comms. It’s not just the generic Service Desk e-mail; depending on the magnitude of the issue, you may well have to communicate with:
(1) Angry customers
(2) Angry stakeholders within your business
(3) The press
(4) Regulatory bodies
Make sure only authorised people speak to the relevant parties so that you don’t make an already bad situation worse.
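One way to keep that straight under pressure is to agree the mapping in advance. Here’s a minimal sketch of a comms matrix in Python; the audiences and roles are illustrative examples of mine, not prescriptions from any standard:

```python
# Illustrative comms matrix: which role is authorised to speak to which audience.
# Audiences and roles are example values, not from any framework.
COMMS_MATRIX = {
    "customers": "Service Delivery Manager",
    "internal stakeholders": "Major Incident Manager",
    "press": "PR / Communications team only",
    "regulators": "Compliance Officer",
}

def authorised_speaker(audience: str) -> str:
    """Return the agreed spokesperson, escalating anything unexpected."""
    return COMMS_MATRIX.get(audience, "Escalate to the Crisis Manager")

print(authorised_speaker("press"))  # -> PR / Communications team only
```

The exact roles matter less than the fact that everyone knows the mapping before the crisis, not during it.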
As a general rule, I think that for internal reporting, the more detail the better. This will help you understand what went wrong, how it was fixed, whether anything could have been done differently and what can be done to stop it happening again. Internal reporting should cover exactly what happened, the technical details and whether there was any human error. For external customers, all we need to cover is what went wrong, how we fixed the issue as quickly as possible, whether there were any opportunities for Continual Service Improvement (CSI) and that we have taken the appropriate action to prevent recurrence. Basically, saying that the engineer booted in daft mode isn’t something that will ever look acceptable in the eyes of most customers.
Have we got a fix? Test, test and test again. Check and double-check anything that goes out to your customers. First things first: have we got the right people working on the fix, and do they have enough support? If not, work with your peers to get agreement that, as it’s a crisis situation, break-fix work must be prioritised. Support your technical teams by helping them get the Emergency Change raised and, if appropriate, setting up an E-CAB (Emergency Change Advisory Board). Every organisation is different and for some, the paperwork can be raised retrospectively so there is no delay to implementing the fix. Either way, to save time, your Emergency Change probably won’t be as detailed or nicely formatted as a BAU Change request, but it should at least capture the following (there’s a rough sketch of such a record after this list):
• The nature of the work
• Who is carrying it out
• How it’s been tested
• Linked Incidents
• How it has been verified and how we can prove it has fixed the issue
• Who’s authorised it (even if it’s a senior manager shouting JFDI, make a note of it on the Change)
• Any customer communications
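To make that concrete, here’s a minimal sketch of what an Emergency Change record might look like captured as structured data. This is purely illustrative; the field names and example values are mine, not from any particular ITSM toolset:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class EmergencyChange:
    """Bare-bones Emergency Change record; fields mirror the checklist above."""
    description: str                 # the nature of the work
    implementer: str                 # who is carrying it out
    testing_notes: str               # how it's been tested
    linked_incidents: List[str] = field(default_factory=list)
    verification: str = ""           # how we can prove it fixed the issue
    authorised_by: str = ""          # even a shouted JFDI gets recorded here
    customer_comms: str = ""         # what was sent to customers, and when

# Hypothetical example of a record raised mid-crisis:
change = EmergencyChange(
    description="Roll back payment gateway to previous release",
    implementer="Bob, Tech Support",
    testing_notes="Rollback rehearsed in staging at 14:20",
    linked_incidents=["INC0042"],
    verification="Error rate back to baseline; transaction queue draining",
    authorised_by="Head of IT Ops (verbal JFDI, 14:35)",
    customer_comms="Holding statement sent 14:40; resolution notice to follow",
)
```

However scrappy the format, having all seven fields filled in means the retrospective paperwork practically writes itself.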
Let your customers & stakeholders know when the Change is due to be implemented, then send out a second communication once the Change has gone in, the Incident has been resolved and all is well.
The post mortem, aka the Incident Review, Change Review or drains-up session. This is not, I repeat, not a witch hunt. Co-ordinate your review activities with Incident & Problem Management; the last thing you want is for your guys to be stuck in three separate meetings, answering the same questions asked in slightly different ways. Set ground rules and reassure everyone in the room that the meeting is to look at what happened and how it can be prevented from recurring, not to assign blame. At this point I use something called my umbrella speech to get everyone in the room to relax. I won’t bore you with a big waffly speech, but the gist is something like the following:
“I just want to understand what happened. Think of me as your umbrella. I ask all the difficult/horrible/controversial questions so that you’re not getting interrogated by senior managers or irate customers. You can trust me because you know me, we’re all in the same team, plus we’ll all be going down the pub together for a stiff drink once this is over.”
Getting people to understand that they’re in a safe environment is hugely important. If your guys feel supported, then an honest, constructive conversation can take place to understand the root cause, short term fix actions, long term fix actions and anything else that could prevent recurrence.
Follow up with customers and stakeholders. This isn’t a one-off activity; on the day of the crisis, you may find yourself in hourly meetings to keep senior management or customers updated. Once the issue has been resolved, you can issue a short holding statement that explains what happened and the actions being taken to stop a repeat performance. The next step is important: if there are lots of post-Change / Incident / Crisis actions, commit to regular updates in an agreed format. These updates could take the form of an updated Problem record, a weekly update e-mail or a Service Improvement Plan, depending on the magnitude of the failure.
Look after your people. Chances are, they’re tired, stressed out and frazzled from working all hours to put things right. Now is the time for that trip to the pub or a morale boost in the form of caffeine, chocolate or pizza.
Look at your Forward Schedule of Change (FSC). Are there any similar pieces of work planned that need to be cancelled, reworked or rescheduled? How are you going to reassure your customers that lessons have been learned and the same mistakes will not happen again? As a Change Manager, you need to ensure that, if a similar piece of work is planned in the future, its implementation plan has actions at every stage to stop the issues you’ve just sorted out coming back to haunt you.
Sometimes, the most appropriate action is to delay the Change as a goodwill gesture, giving the customer additional time to communicate with their onward customers or to take additional steps externally to mitigate risks. I’m not saying it’s always the right thing to do, or the easiest, but sometimes a little time and breathing space can work wonders.
Remember your lessons learned. When you had your post mortem, you will have come away with a lovely long list of actions to make things better. I know it’s easier said than done, but don’t just file them away under “ugh, horrible day, glad it’s over”; add them to your lessons learned log so that those actions are documented, reviewed and acted on. If you don’t have a lessons learned log, start one up. You need to be able to refer back to it, to read your actions and to share them, for example with Availability Management for downtime issues or Capacity Management for performance glitches.
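If you’re starting a lessons learned log from scratch, it doesn’t need to be fancy; even a flat file beats nothing. A minimal sketch, assuming a simple CSV with columns of my own choosing rather than from any formal framework:

```python
import csv
from datetime import date
from pathlib import Path

# Illustrative only: the columns are a suggestion, not from any standard.
LOG = Path("lessons_learned.csv")
COLUMNS = ["logged", "source", "lesson", "action", "owner", "review_by", "status"]

def log_lesson(source, lesson, action, owner, review_by):
    """Append one lesson to the log, creating the file with headers if needed."""
    new_file = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(COLUMNS)
        writer.writerow([date.today().isoformat(), source, lesson,
                         action, owner, review_by, "open"])

# Hypothetical entry following a failed Change:
log_lesson(
    source="INC0042 / failed payment gateway Change",
    lesson="No rollback rehearsal before production deployment",
    action="Add rollback rehearsal to the implementation plan template",
    owner="Change Manager",
    review_by="next CAB",
)
```

The point isn’t the tooling; it’s that every action ends up with an owner, a review date and a status someone can chase.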
If you don’t have the Service Desk and Incident Management on your CAB, invite them immediately. One of the things that never fails to surprise me out on consultancy gigs is how many organisations simply deploy Changes into the production environment without thinking to mention it to the people who have to pick up the pieces if it all goes horribly wrong. Ditto Problem Management and IT Service Continuity Management, as simply assuming that they will be able to swoop in and save the day should the Change fail is taking blind optimism a step too far.
As someone who has seen and managed her fair share of own goals and Change-related failures, here are some things I find it helpful to refer back to.
(1) This too shall pass/Nothing endures. This quote and its connected story were recently referenced in “The Big Bang Theory”. The fable goes that there was once a king who assembled a group of wise men to create something that would make sad men happy and happy men sad. The result was a ring inscribed with the phrase “This Too Shall Pass”. Take a deep breath… nothing lasts forever!
(2) It helps to have a sense of humour when dealing with the bad stuff. This is one of my favourite quotes from the author Terry Pratchett:
“Some humans would do anything to see if it was possible to do it. If you put a large switch in some cave somewhere, with a sign on it saying ‘End-of-the-World Switch. PLEASE DO NOT TOUCH’, the paint wouldn’t even have time to dry.”
In short, no matter how good your process is, and no matter how hard you try, things will go wrong sometimes. You can’t stop every problem; what matters is how you deal with them. So keep calm, take a deep breath and let’s sort things out.