Following on from part one, here are my next seven tips on how to use availability, incident and problem management to maximise service effectiveness.
Tip 4: If you can’t measure it, you can’t manage it
Ensure that your metrics map all the way back to your process goals via KPIs and CSFs, so that when you measure service performance you get clear, tangible results rather than a confused set of metrics that no one ever reads, let alone takes into account when reviewing operational performance. In simple terms, your service measurements should follow a defined flow like the following:
Start with a mission statement so that you have a very clearly defined goal. An example could be something like “to monitor, manage and restore our production environment effectively, efficiently & safely”.
Next come your critical success factors or CSFs. CSFs are the next level down in your reporting hierarchy. They take the information held in the goal statement and break it down into manageable chunks. Example CSFs could be:
- “To monitor our production environment effectively, efficiently & safely”
- “To manage our production environment effectively, efficiently & safely”
- “To restore our production environment effectively, efficiently & safely”
KPIs or key performance indicators are the next step. KPIs provide the level of granularity needed so that you know you are hitting your CSFs. Some example KPIs could be:
- Over 97% of our production environment is monitored
- 98% of all alerts are responded to within 5 minutes
- Over 95% of calls to the Service Desk are answered within 10 seconds
- Service A achieves an availability of 99.5% during 9 – 5, Monday – Friday
Ensure that your metrics, KPIs & CSFs map all the way back to your mission statement & process goals so that when you measure service performance you get clear, tangible results. If your metrics are linked in a logical fashion and your performance goes to amber during the month (e.g. threat of a service level breach), you can look at your KPIs and come up with an improvement plan. This will also help you move towards a balanced scorecard model as your process matures.
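The flow above can be sketched in code. This is a minimal, hypothetical example of evaluating monthly KPI measurements against targets and flagging a red/amber/green status; the KPI names, targets and amber margins are illustrative, not taken from any particular toolset.

```python
# Hypothetical KPI targets with an "amber margin" below which the status
# turns red. Values mirror the example KPIs above, but are assumptions.
KPIS = {
    "monitored_coverage_pct":     {"target": 97.0, "amber_margin": 1.0},
    "alerts_within_5min_pct":     {"target": 98.0, "amber_margin": 1.0},
    "calls_within_10s_pct":       {"target": 95.0, "amber_margin": 2.0},
    "service_a_availability_pct": {"target": 99.5, "amber_margin": 0.2},
}

def rag_status(kpi: str, measured: float) -> str:
    """Return 'green', 'amber' or 'red' for a measured KPI value."""
    target = KPIS[kpi]["target"]
    amber_floor = target - KPIS[kpi]["amber_margin"]
    if measured >= target:
        return "green"
    if measured >= amber_floor:
        # Threat of a service level breach: time for an improvement plan.
        return "amber"
    return "red"

# Illustrative monthly measurements.
month = {
    "monitored_coverage_pct": 97.8,
    "alerts_within_5min_pct": 97.4,
    "calls_within_10s_pct": 91.0,
    "service_a_availability_pct": 99.6,
}

for kpi, value in month.items():
    print(f"{kpi}: {value} -> {rag_status(kpi, value)}")
```

An amber result here is exactly the trigger described above: the KPI tells you where to focus the improvement plan before the service level actually breaches.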
Tip 5: Attend CAB!
Availability, incident and problem managers should be key and vocal members of the CAB. 70%-80% of incidents can be traced to poorly implemented changes.
Problem management should have a regular agenda item to report on problems encountered, especially where these are caused by changes. Incident management should also attend so that if a planned change does go wrong, they are aware and can respond quickly & effectively. In a very real sense, forewarned is forearmed: if a high-risk change has been authorised, having that information can help the service desk manager to plan ahead, for example by having extra analysts on shift the morning of a major release.
Start to show the effects of poorly planned and designed changes, using downtime information to alter the mind-sets of implementation teams. If people see the consequences of poor planning, or of not following the agreed plan, there is a greater incentive to learn from them. By prompting teams to think about quality, change execution will improve, related incidents and problems will reduce, and availability will increase.
Tip 6: Link your information
You must be able to link your information. Working in your own little bubble no longer works; you need to engage with other teams to add value. The best example of this is linking incidents to problem records to identify trends, but it doesn’t stop there. The next step is to look at the trends and at how they can be fixed. This could be reactive, e.g. raising a change record to replace a piece of server hardware which has resulted in downtime. It could also be proactive, for example: “we launched service A and experienced X, Y and Z faults which caused a hit to our availability; we’re now launching service B, so what can we do to make sure we don’t make the same mistakes? Different hardware? More resilience? Using the cloud?”
You need to have control over the quality of the information that can be entered. Out-of-date information is harmful, so make sure that validation checks are built into your process. One way to do this is to do a “deep dive” into your incident information: look at the details to ensure a common theme exists and that each incident is linked to the correct problem record.
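A deep dive of this kind can be automated in outline. The sketch below groups incident records by their linked problem record, surfacing both trends (problems with repeat incidents) and unlinked incidents that need review. The record structure and field names (`id`, `problem_id`, `summary`) are assumptions for illustration, not the schema of any specific ITSM tool.

```python
from collections import defaultdict

# Illustrative incident records; in practice these would come from
# your ITSM tool's API or a report export.
incidents = [
    {"id": "INC001", "problem_id": "PRB010", "summary": "Server X disk failure"},
    {"id": "INC002", "problem_id": "PRB010", "summary": "Server X disk failure"},
    {"id": "INC003", "problem_id": None,     "summary": "Email outage"},
    {"id": "INC004", "problem_id": "PRB011", "summary": "Login timeout"},
]

by_problem = defaultdict(list)
unlinked = []
for inc in incidents:
    if inc["problem_id"]:
        by_problem[inc["problem_id"]].append(inc["id"])
    else:
        unlinked.append(inc["id"])

# Problems with repeat incidents are trend candidates for a fix,
# e.g. raising a change record to replace faulty hardware.
trends = {prb: incs for prb, incs in by_problem.items() if len(incs) > 1}

print("Trending problems:", trends)
print("Incidents needing linkage review:", unlinked)
```

The unlinked list is where the validation check earns its keep: every incident in it either needs linking to an existing problem record or may justify raising a new one.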
Your information needs to be accessible and easy to read. Your audience sees Google and their expectation is that all search engines work in the same way.
Talk to people! Ask relationship and service delivery managers what keeps them awake at night, and if there is no problem record or SIP then raise one. Ask technical teams what their top ten technical concerns are. I’ve said it before and I’ll say it again: forewarned is forearmed. If you know there’s an issue or potential for risk, you can do something about it, or escalate to the manager or team that can. Ask the customer if there is anything they are worried about. Is there a critical product launch due? Are the auditors coming? This is where you can be proactive and limit risk, for example by working with change management to implement a change freeze.
Tip 7: Getting the right balance of proactive and reactive activities
It’s important to look at both the proactive and reactive sides of the coin and get a balance between the two. If you focus on reactive activities only, you never fix the root cause or make it better; you’ll just keep putting out the same fires. If you focus on proactive activities only, you will lose focus on the BAU and your service quality could spiral out of control.
Proactive actions could include building new services with availability in mind, working with problem management to identify trends, and ensuring that high availability systems have the appropriate maintenance (e.g. regular patches, reboots, agreed release schedules). Other activities could include identifying VBFs (more on that later) and SPOFs (single points of failure).
Reactive activities could include working with incident management to analyse service uptime / downtime in more granularity with the expanded incident cycle and acting on lessons learned from previous failures.
Tip 8: Know your VBFs
No, not your very best friends, your vital business functions! Talk to your customers and ask them what they consider to be critical. Don’t assume. That sparkling new CRM system may be sat in the corner gathering dust. That spreadsheet, on the other hand, built on an ancient version of Excel with tens of nested tables and lots of macros, could be a critical business tool for capturing customer information. Go out and talk to people. Use your service catalogue. Once you have a list of things you must protect at all costs, you can work through the list and mitigate risk.
Tip 9: Know how to handle downtime
No more hiding under your desk or running screaming from the building! With the best will in the world, things will go wrong, so plan accordingly. The ITIL Service Design book states that “recognising that when services fail, it is still possible to achieve business, customer & user satisfaction and recognition: the way a service provider acts in a failure situation has a major influence on customer & user perception & expectation.”
Have a plan for when downtime strikes. Page 1 should have “Don’t Panic” written in bright, bold text – it sounds obvious, but it’s amazing how many people panic and freeze in the event of a crisis. Work with incident and problem management to come up with criteria for a major incident that work for your organisation. Build the process and document everything, even the blindingly obvious (because you can’t teach common sense). Agree in advance who will coordinate the fix effort (probably incident management) and who will investigate the root cause (problem management). Link in to your IT service continuity management process. When does an incident become so bad that we need to invoke DR? Have we got the criteria documented? Who makes the call? Who is their back-up in case they’re on holiday or off sick? Speak to capacity management – they look at performance. At what point could a performance issue become so bad that the system becomes unusable? Does that count as downtime? Who investigates further?
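Once those questions are answered, the agreed criteria can be captured as something executable rather than buried in a document. This is a hypothetical sketch; the specific thresholds (users affected, VBF impact, outage duration, DR trigger) are examples to agree with your own incident, problem and continuity managers, not a standard.

```python
def is_major_incident(users_affected: int,
                      vbf_impacted: bool,
                      expected_outage_mins: int) -> bool:
    """True if any agreed major-incident criterion is met.

    Thresholds here are illustrative placeholders.
    """
    return (users_affected >= 100          # widespread impact
            or vbf_impacted                # a vital business function is down
            or expected_outage_mins >= 60) # prolonged outage

def should_invoke_dr(expected_outage_mins: int,
                     dr_threshold_mins: int = 240) -> bool:
    """Escalate to IT service continuity once the expected outage
    exceeds the agreed DR threshold (example default: 4 hours)."""
    return expected_outage_mins >= dr_threshold_mins
```

Encoding the criteria this way forces the ambiguities out early: if you can’t write the rule down, the on-call coordinator can’t apply it at 3 a.m. either.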
Tip 10: Keep calm and carry on
Your availability, incident and problem management processes will improve and mature over time. Use any initial “quick wins” to demonstrate the added value and get more buy-in. As service levels improve, your processes will gather momentum, as it’s human nature to want to jump on the bandwagon when something is a storming success.
As your process matures, you can look to other standards and frameworks. Agile and Lean can be used to make efficiency savings. COBIT can help you gauge process maturity and offers practical guidance on getting to the next level. PRINCE2 can help with project planning and timescales. You can also review your metrics to reflect greater process maturity; for example, you could add critical-to-quality (CTQ) and operational performance indicator (OPI) measures to your existing deck of goals, CSFs and KPIs.
I’d like to conclude by saying that availability, incident and problem management processes are critical to service quality. They add value on their own, but aligning them and running them together will not only drive improvement but also reduce repeat (boring) incidents, move knowledge closer to the front line and increase service uptime.
In conclusion, having availability, incident and problem management working together as a trio is one of the most important steps in moving an IT department from system management to service management as mind-sets start to change, quality improves and customer satisfaction increases.