One of the things that I wish was covered in more detail during the ITIL intermediate training is how to properly impact assess Changes. Change Managers are the guardians of the production environment so making sure that all changes are properly assessed and sanity checked is a key part of service delivery. Asses too low and high risk rangers go through unchallenged, assess too high and you block up the process by examining every change no matter how small as if they could kill your organisation.
Here are the things that I look for when assessing a Change:
Does it highlight the affected services so that it’s easy to identify in any reports?
Is it clear and does it make sense? Sounds basic I know but let’s make it easier for the other people assessing and authorising the Change.
Why are we doing the Change? Remember, this isn’t just about technology, what about business and financial benefits?
What are the risks in carrying out this Change? Has a risk matrix been used to give it a tangible risk score or is it a case of “reboot that critical server in the middle of the day? Be grand”. Imagine explaining to senior management what went wrong if the Change implodes – have you looked at risk mitigation? Using a formal risk categorisation matrix is key here. Don’t just assume technicians know what makes a change low risk. One of the key complaints from the business is that IT does not understand their pain. Creating a change assessment risk matrix IN A REPEATABLE FORMAT should be your first priority as a Change Manager. If you can’t assess the risk of a change in the same way each time, learning from any mistakes, then you’re not doing Change Management. Period.
Does the proposed timing work with the approved Changes already on the Change Schedule (CS)? Has the Change been clash checked so there are no potential conflicts over services or resources?
Look at the proposed start and end times. Are they sensible (i.e. not rebooting a business critical server at 9 o’clock on Monday morning)? Does the implementation window leave time for anything going wrong or needing to roll back the Change?
Are there any special circumstances that need to be considered? I used to work for Virgin Media; we had Change restrictions and freezes on our TV platforms during key times like the Olympics or World Cup to protect our customer’s experience. If you don’t know when your business critical times are then ask! The business will thank you for it.
The Technical Details:
Have all affected services been identified? What about supporting services? Has someone checked the CMS to ensure all dependencies have been accounted for? Have we referenced the Service Catalogue so that business approvers know what they’re authorising?
Technical Teams Affected
Who will support the Change throughout testing and implementation? Will additional support be needed? What about outside support from external suppliers? Has someone checked the contract to ensure any additional costs have been approved?
User Base Affected
Check and check again. The last thing you want to do is deploy a Change to the wrong area of the business.
What do you mean what environments are we covering? Surely the only environment we need to worry out is our production environment right? Let me share the story of my worst day at work, ever. A long time ago and pre-kids, I worked for a large investment bank in London. A so called routine code change to one of the most business critical systems (the market data feed to our trade floors) took longer than expected so instead of updating both the production and DR environments, only the production environment was updated. The implementation team planned on updating the DR environment but got distracted with other operational priorities (i.e. doing the bidding of whichever senior manager shouted the loudest). Fast forward to 6 weeks later, a crisis hits the trading floor, the call is made to invoke DR but we couldn’t because our market data services were out of sync. Cue a hugely stressful 2 hours where the whole IT organisation and its mum desperately scrambled to find a fix and an estimated cost to the business of over $8 million. Moral of the story? If you have a DR environment; keep it in sync with production.
Are there any licensing implications? Don’t forget, changes in the number of people accessing a system, number of CPUs, or (especially) the way in which people work (moving from dev to prod) have huge impacts on licences.
Pre Implementation Testing
How do we make sure the Change will go as planned? Has the Change been properly tested in an appropriate environment? Has the testing been signed off and have all quality requirements been met?
Post Implementation Verification
OK; the Change has gone in, how do we make sure everything is as it should be? Is there any smoke testing we can carry out? This is particularly important in transactional services; I once saw a Change that went in, everything looked grand but when customers tried to log in the next day, they couldn’t make any changes in their online banking session. I’ll spare you the details of the very shouty senior management feedback; let’s just say fun was most definitely not had that day. If at all possible; test that everything is working; the last thing you need is a total inability to support usual processes following a Change.
Does it make sense and does everyone involved know what they are meant to be doing and when. If other teams are involved are they aware and do we have contact details for them? Are there any dodgy areas where we might need check point calls? Do we need additional support in place such as additional on call / shift resource on duty senior manager to mitigate risk? The plan doesn’t have to be fancy; if you need some inspiration I can share some template implementation plans in our members / subscribers area.
Back Out Plan
What happens if something goes wrong during the Change? Do we fix on fail or roll back? Are the Change implementers empowered to make a decision or is escalation needed? In that case; are senior management aware of the Change and will a designated manager / decision maker be available? Can the Change be backed out in the agreed implementation window or do we need more time? If it looks like restoration work will cause the Change to overrun; warn the business sooner rather than later so that they can put any mitigation plans / workarounds in place.
What Early Life Support Is Planned?
What early Life support is planned? Are floorwalkers needed? Are extra team members needed that day to cope with any questions? Have we got defined exit criteria in place?
Is The Service Desk Aware?
Has someone made the Service Desk aware? Have they been given any training if needed? I know it sounds basic but only a couple of months ago; I had to sit down and explain to an engineer why it was a good idea to let the Service Desk know before any Changes went live. Let’s face it; if something goes wrong the Service Desk are going to be at the sharp end of things. And speaking as an ex Service Desk manager (a very long time ago when they were still called Help Desks) there is nothing worse than having to deal with customers suffering from the fallout of a Change that you know nothing about.
Has the Change been comm’ed out properly? Do we have nice templates so Change notification have a consistent look and feel?
If the business are pushing for a Change to be fast tracked with minimum testing can you ask them to formally acknowledge the risk by relaxing any SLA?
The above list isn’t exhaustive but it’s a sensible starting point. There’s lots of guidance out there; ITIL has the 7 R’s of Change Management and COBIT has advice on governance. What do you look for when assessing Changes? Let me know in the comments!
Recently I’ve been working on Incident Management, and specifically on Major Incident planning.
During my time in IT Operations I saw teams handle Major Incidents in a number of different ways. I actually found that in some cases all process and procedure went out of the window during a Major Incident, which has a horrible irony about it. Logically it would seem that this is the time that applying more process to the situation would help, especially in the area of communications.
For example in an organisation I worked in previously we had a run of Storage Area Network outages. The first couple caused absolute mayhem and I could see people pushing back against the idea of breaking out the process-book because all that mattered was finding the technical fix and getting the storage back up and running.
At the end of the Incident, once we’d restored the service we found that we, maybe unsurprisingly had a lot of unhappy customers! Our retrospective on that Incident showed us that taking just a short time at the beginning of the outage to sort out our communications plan would have helped the users a lot.
ITIL talks about Major Incident planning in a brief but fairly helpful way:
A separate procedure, with shorter timescales and greater urgency, must be used for ‘major’ incidents. A definition of what constitutes a major incident must be agreed and ideally mapped on to the overall incident prioritization system – such that they will be dealt with through the major incident process.
So, the first thing to note is that we don’t need a separate ITIL process for handling Major Incidents. The aim of the Incident Management process is to restore service to the users of a service, and that outcome suits us fine for Major Incidents too.
The Incident model, its categories and states ( New > Work In Progress > Resolved > Closed ) all work fine, and we shouldn’t be looking to stray too far from what we already have in terms of tools and process.
What is different about a Major Incident is that both the urgency and impact of the Incident are higher than a normal day-to-day Incident. Typically you might also say that a Major Incident affects multiple customers.
Working with a Major Incident
When working on a Major Incident we will probably have to think about communications a lot more, as our customers will want to know what is going on and rough timings for restoration of service.
Where a normal Incident will be handled by a single person (The Incident Owner) we might find that multiple people are involved in a Major Incident – one to handle the overall co-ordination for restoring service, one to handle communications and updates and so on.
Having a named person as a point of contact for users is a helpful trick. In my experience the one thing that users hate more than losing their service is not knowing when it will be restored, or receiving confusing or conflicting information. With one person responsible for both the technical fix and user communications this is bound to happen – split those tasks.
If your ITSM suite has functionality for a news ticker, or a SocialIT feed it might be a good idea to have a central place to update customers about the Major Incident you are working on. If you run a service for the paying public you might want to jump onto Twitter to stop the Twitchfork mob discussing your latest outage without you being part of the conversation!
What is a Major Incident
It is up to each organisation to clearly define what consitutes a Major Incident. Doing so is important, otherwise the team won’t know under what circumstances to start the process. Or you might find that without clear guidance a team will treat a server outage one week as Major (with excellent communciations) and not the next week with poor communications.
Having this defined is an important step, but will vary between organisations.
Roughly speaking a generic definition of a Major Incident could be
An Incident affecting more than one user
An Incident affecting more than one business unit
An Incident on a device on a certain type – Core switch, access router, Storage Area Network
Complete loss of a service, rather than degregation
Is a P1 Incident a Major Incident?
No, although I would say that every Major Incident would be a P1. An urgent Incident affecting a single user might not be a Major Incident, especially if the Incident has a documented workaround or can be fixed straightaway.
Confusing P1 Incidents with Major Incidents would be a mistake. Priority is a calculation of Impact and Urgency, and the Major Incident plan needs to be reserved for the absolute maximum examples of both, and probably where the impact is over multiple users.
Do I need a single Incident or multiple Incidents for logging a Major Incident?
This question might depend on your ITSM toolset, but my preference is to open a separate Incident for each user affected in the Incident when they contact the Servicedesk.
The reason for this is that different users will be impacted in different ways. A user heading off to a sales pitch will have different concerns to a user just about to go on holiday for 2 weeks. We might want to apply different treatment to these users (get the sales pitch user some sort of service straight away) and this becomes confusing when you work in a single Incident record.
If you have a system of Hierarchical escalation you might find that one customer would escalate the Major Incident (to their sales rep for example) where another customer isn’t too bothered because they use the affected service less frequently.
Having an Incident opened for each user/customer allows you to judge exactly the severity of the Incident. The challenge then becomes to manage those Incidents easily, and be able to communicate consistently with your customers.
Is a Major Incident a Problem?
No, although if we didn’t have a Problem record open for this Major Incident I think we should probably do so.
Remember the intended outcome of the Incident and Problem Management processes:
Incident Management: The outcome is a restoration of service for the users
Problem Management: The outcome is the identification and possibly removal of the causes of Incidents
The procedure is started when an Incident matches our definition of a Major Incident. It’s outcome is to restore service and to handle the communication with multiple affected users. That restoration of service could come from a number of different sources – The removal of the root cause, a documented Workaround or possibly we’ll have to find a Workaround.
Whereas the Major Incident plan and Problem Management process will probably work closely together it is not true to say that a Major Incident IS a Problem.
How can I measure my Major Incident Procedure?
I have some metrics for measuring the Major Incident procedure and I’d love to know your thoughts in the comments for this article.
Number of Incidents linked to a Major Incident: Where we are creating Incidents for each customer affected by a Major Incidents we should be able to measure the relative impact of each occurance.
The number of Major Incidents: We’d like to know how often we invoke the Major Incident plan
Mean Time Between Major Incidents: How much time elapses between Major Incidents being logged. This would be interesting in an organisation with service delivery issues, and they would hope to see Major Incidents happen less frequently
There you go. In summary handling Major Incidents isn’t a huge leap from the method that you use to handle day-to-day Incidents. It requires enhanced communciation and possibly measurement.