In the run-up to this year’s itSMF UK conference, ITSM14, I chatted with Tobias Nyberg about his upcoming session entitled “Bring me problems – not solutions”.
Q. Hi Tobias, can you give a quick intro to your session at the itSMF UK Conference?
The presentation will address the essence of problems and show why the ways we look at problems are important to all organisations and professionals. It will also give attendees hands-on advice based on how we have worked with this simple but effective method to improve change and development in my organisation.
The presentation isn’t core IT Service Management but the method we have used to define problems for better solutions is applicable for most change efforts within any ITSM organisation – big and small. It’s definitely not rocket science but at the same time, sometimes the most obvious things are also the ones we most easily forget or ignore.
When we started to work with problem management everyone was very eager to start fixing things, and we sometimes missed clearly defining and commonly agreeing on the root causes. Even when we tried using some of the popular methods like “Ishikawa” and “The 5 Whys”, we quite rapidly drifted into discussions on how to solve things.
When addressing this I found that people are worried and often feel anxiety when talking about problems, and that owning a problem is seen as a bad thing. It was hard to keep them in the problem-zone because they felt uncomfortable – they wanted to jump into the solution-zone as quickly as possible, because everyone loves a problem solver.
Q. What are likely to be the potential issues an organisation may experience with attempting to work in this way?
The emotional and cultural relationships people and organisations have with problems differ somewhat from case to case, but it seems to be difficult to have a positive attitude towards problems.
So this presentation expands on that, in the hope that the audience returns to their workplaces with a new view of problems and a different perspective on their value.
Tobias Nyberg is a Process Owner and Process Manager at Svenska Handelsbanken in Stockholm, Sweden. He has a growing interest in IT Service Management and how ITSM can deliver value to companies and people. Tobias strongly believes in sharing as the best way of boosting knowledge in the ITSM community and is an active member of the Swedish itSMF chapter.
Tobias’ session is on day one of ITSM14 and featured within the Back to Basics track. To find out more or to book your conference place please visit itSMF UK
Problem Management is an intriguing discipline within the Service Management suite. The IT Department is continually being asked to be proactive, not reactive.
Often in IT we presuppose what our customers in the business require, then give them a solution to issues that they didn’t know that they had. But what happens when that business customer is asking IT for a permanent solution to an issue we might not have known that we had, or to an issue where we know only a sticking plaster fix is in place?
Your Problem Manager is the key
Step up to the plate the Problem Manager: the individual focussed on reacting to, and managing, issues that have already happened. They can’t really help but have a reactive mindset, rooted in the analysis of fact. The incident might be closed but the Problem Manager is the person entrusted with ensuring that appropriate steps are taken to guarantee the incident doesn’t repeat itself. It can be a stressful role: the systems were down, the company has perhaps lost, and may still be losing, money; trading has been impacted. People want to know what is being done. So what SLAs can be put in place between the Problem Manager and the service owner to support the Problem Manager’s activities and maybe give them breathing space, whilst at the same time ensuring that there is some focus on resolution?
Let’s look at the four problem management SLAs that you really can’t live without:
#1 – Provision of Problem Management reference number
A simple SLA to get you started. This is simply an acknowledgement by the problem management team that the problem has been logged, referenced and is in the workflow of the team. It provides reassurance that the problem is going to be dealt with.
#2 – Time to get to the root cause of the issue
So this is where some breathing space is provided. The message being given in this particular SLA is that there is a distinction between incident management and problem management. Incident management has resulted in a temporary fix to an issue, now it is the turn of problem management to actually work out what lay at the heart of the matter – what was the root cause.
Note this is an SLA about identifying and not resolving the root cause – that could take a significant time period involving redevelopment of code.
The outcome that is being measured by the SLA is going to be the production of a deliverable, perhaps in the form of a brief document or even just an email that highlights the results of the root cause analysis. Each company will have to determine its own policy of what that deliverable might contain, but the SLA is there to measure the time between the formal closure of the incident and the formal provisioning time of problem management’s root cause analysis deliverable.
#3 – Measurement of provision of Root Cause Analysis documentation. To be provided within X working days of initial notification.
So, you’ve acknowledged receipt of the problem, and you’ve determined the root cause. The next SLA is in place to ensure that a formal document is delivered in a timely fashion. It should have a set format and set down the timeline of events that caused the problem, and actions that have been taken to provide a workaround. It should then list all of the actions and recommendations, together with clearly identified owners, that need to be completed by realistic dates in order to fix the problem. A suggested target would be 3 days for simple problems, and 5 or 10 days for increasingly complex ones.
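The timing check behind SLAs #2 and #3 can be sketched in a few lines. This is only an illustration: the complexity tiers mirror the suggested 3/5/10-day targets above, and for simplicity it counts calendar days rather than working days (a real implementation would consult a business calendar).

```python
from datetime import datetime, timedelta

# Illustrative complexity-to-target mapping from the suggested targets above:
# 3 days for simple problems, 5 or 10 for increasingly complex ones.
RCA_TARGET_DAYS = {"simple": 3, "moderate": 5, "complex": 10}

def rca_delivered_on_time(incident_closed, rca_delivered, complexity):
    """Was the RCA deliverable provided within the target window?

    Measures the gap between formal closure of the incident and formal
    provision of the root cause analysis deliverable. Calendar days are
    used here for brevity; swap in working-day logic as needed.
    """
    target = timedelta(days=RCA_TARGET_DAYS[complexity])
    return (rca_delivered - incident_closed) <= target

# A simple problem delivered three days after incident closure meets target:
print(rca_delivered_on_time(datetime(2024, 3, 1), datetime(2024, 3, 4), "simple"))
```

The same shape of check works for SLA #1 (time to acknowledge and reference the problem) with a much shorter target.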
#4 – Measurement of progress on root cause analysis actions as agreed (Target dates not to change more than twice)
In the previous SLA we have measured the time to produce the root cause analysis. This SLA takes over where the previous clock stopped.
The root cause analysis work will have identified actions that need to be undertaken and implemented to effect a permanent fix to the original issue and allow the sticky plaster solution to be superseded.
However, all resolutions will not be equal in complexity, effort and duration, therefore there will be an initial estimation of a target date for live implementation of a permanent fix. Moving the target completion date is allowed, however this SLA limits how often this can occur to prevent action timescales drifting.
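SLA #4 reduces to counting how often the target date has moved. A minimal sketch, assuming the problem record keeps an ordered history of target dates (a hypothetical field, not from any particular tool):

```python
def target_date_moves(target_date_history):
    """Count how many times the implementation target date has moved.

    target_date_history is the ordered list of target dates recorded
    against the problem; the first entry is the initial estimate.
    """
    return max(len(target_date_history) - 1, 0)

def within_sla(target_date_history, max_moves=2):
    # SLA #4: target dates should not change more than twice.
    return target_date_moves(target_date_history) <= max_moves

# Two moves from the initial estimate is still within the SLA:
print(within_sla(["2024-05-01", "2024-06-01", "2024-07-01"]))
```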
This article has been contributed by Simon Higginson of Frimley Green Ltd, Simon’s expertise is helping clients get the best out of their service suppliers and creating win-win partnerships.
Most readers have got the story now from my recent articles: Cherry Valley, Illinois, 2009, rain bucketing down, huge train-load of ethanol derails, fire, death, destruction.
Eventually the Canadian National’s crews and the state’s emergency services cleaned up the mess, and CN rebuilt the track-bed and the track, and trains rolled regularly through Cherry Valley again.
Then the authorities moved in to find out what went wrong and to try to prevent it happening again. In this case the relevant authority is the US National Transportation Safety Board (NTSB).
Keep asking “why?”
Every organisation should have a review process for requests, incidents, problems and changes, with some criteria for triggering a review.
In this case it was serious enough that an external agency reviewed the incident. The NTSB had a good look and issued a report. Read it as an example of what a superb post-incident review looks like. Some of our major IT incidents involve as much financial loss as this one and sadly some also involve loss of life.
IT has a fascination with “root cause”. Root Cause Analysis (RCA) is a whole discipline in its own right. The Kepner-Tregoe technique (ITIL Service operation 2011 Appendix C) calls it “true cause”.
The rule of thumb is to keep asking “Why?” until the answers aren’t useful any more, then that – supposedly – is your root cause.
This belief in a single underlying cause of things going wrong is a misguided one. The world doesn’t work that way – it is always more complex.
The NTSB found a multitude of causes for the Cherry Valley disaster. Here are just some of them:
It was extreme weather
The CN central rail traffic controller (RTC) didn’t put out a weather warning to the train crew which would have made them slow down, although required to do so and although he was in radio communication with the crew
The RTC did not notify track crews
The track inspector checked the area at 3pm and observed no water build-up
Floodwater washed out a huge hole in the track-bed under the tracks, leaving the rails hanging in the air.
Railroads normally post their contact information at all grade crossings but the first citizen reporting the washout could not find the contact information at the crossing where the washout was, so he called 911
The police didn’t communicate well with CN about the washed out track: they first alerted two other railroads
There wasn’t a well-defined protocol for such communication between police and CN
Once CN learned of the washout they couldn’t tell the RTC to stop trains because his phone was busy
Although the train crew saw water up to the tops of the rails in some places they did not slow down of their own accord
There was a litany of miscommunication between many parties in the confusion after the accident
The federal standard for ethanol cars didn’t require them to be double-skinned or to have puncture-proof bulkheads (it will soon: this tragedy triggered changes)
There had been a previous washout at the site and a 36” pipe was installed as a relief drain for flooding. Nobody calculated what size pipe was needed and nobody investigated where the flood water was coming from. After the washout the pipe was never found.
The county’s storm-water retention pond upstream breached in the storm. The storm retention pond was only designed to handle a “ten year storm event”.
Local residents produced photographic evidence that the berm and outlet of the pond had been deteriorating for several years beforehand.
OK, you tell me which is the “root cause”.
Causes don’t normally arrange themselves in a nice tree all leading back to one. There are several fundamental contributing causes. Anyone who watches Air Crash Investigation knows it takes more than one thing to go wrong before we have an accident.
Sometimes one of them stands out like the proverbial. So instead of calling it root cause I’m going to call it primary cause. Sure the other causes contributed but this was the biggest contributor, the primary.
Ian Clayton once told me that root cause …er… primary cause analysis is something you do after the fact, as part of the review and wash-up. In the heat of the crisis who gives a toss what the primary cause is – remove the most accessible of the causes. Any disaster is based on multiple causes, so removing any one cause of an incident will likely restore service. Then, when we have time to consider what happened and take steps to prevent a recurrence, we should probably try to address all causes. Don’t do Root Cause Analysis, just do Cause Analysis, seeking the multiple contributing causes. If we need to focus efforts then the primary cause is the one, which implies that a key factor in deciding primacy is how broad the potential is for causing more incidents.
All complex systems are broken
It is not often you read something that completely changes the way you look at IT. This paper How Complex Systems Fail rocked me. Reading this made me completely rethink ITSM, especially Root Cause Analysis, Major Incident Reviews, and Change Management. You must read it. Now. I’ll wait.
It says that all complex systems are broken. It is only when the broken bits line up in the right way that the system fails.
It dates from 1998! Richard Cook is a doctor, an MD. He seemingly knocked this paper off on his own. It is a whole four pages long, and he wrote it with medical systems in mind. But that doesn’t matter: it is deeply profound in its insight into any complex system and it applies head-on to our delivery and support of IT services.
“Complex systems run as broken systems”
“Change introduces new forms of failure”
“Views of ‘cause’ limit the effectiveness of defenses against future events… likelihood of an identical accident is already extraordinarily low because the pattern of latent failures changes constantly.”
“Failure free operations require experience with failure.”
Many times the person “to blame” for a primary cause was just doing their job. All complex systems are broken. Every day the operators make value judgements and risk calls. Sometimes they don’t get away with it. There is a fine line between considered risks and incompetence – we have to keep that line in mind. Just because they caused the incident doesn’t mean it is their fault. Think of the word “fault” – what they did may not have been faulty, it may just be what they have to do every day to get the job done. Too often, when they get away with it they are considered to have made a good call; when they don’t they get crucified.
That’s not to say negligence doesn’t happen. We should keep an eye out for it, and deal with it when we find it. Equally we should not set out on cause analysis with the intent of allocating blame. We do cause analysis for the purpose of preventing a recurrence of a similar Incident by removing the existing Problems that we find.
I will close by once again disagreeing with ITIL’s idea of Problem Management. As I said in my last article, pro-active Problem Management is not about preventing problems from occurring, or preventing causes of service disruption from entering the environment, or making systems more robust or higher quality.
It is overloading Problem Management to also make it deal with “How could this happen?” and “How do we prevent it from happening again?” That is dealt with by Risk Management (an essential practice that ITIL does not even recognise) feeding into Continual Service Improvement to remove the risk. The NTSB were not doing Problem Management at Cherry Valley.
Next time we will look at continual improvement and how it relates to problem prevention.
I was wondering – do you have a Known Error Database? And are you getting the maximum value out of it?
The concept of a KEDB is interesting to me because it is easy to see how it benefits end users. Also because it is dynamic and constantly updated.
Most of all because it makes the job of the Servicedesk easier.
It is true to say that an effective KEDB can both increase the quality of Incident resolution and decrease the time it takes.
The Aim of Problem Management and the Definition of “The System”
One of the aims of Problem Management is to identify and manage the root causes of Incidents. Once we have identified the causes we could decide to remove these problems to prevent further users being affected.
Obviously this might be a lengthy process – replacing a storage device that has an intermittent fault might take some scheduling. In the meantime Problem Managers should be investigating temporary resolutions or measures to reduce the impact of the Problem for users. This is known as the Workaround.
When talking about Problem Management it helps to have a good definition of “Your System”. There are many possible causes of Incidents that could affect your users including:
Networks, connectivity, VPN
Services – in-house and outsourced
Policies, procedures and governance
Documentation and Training materials
Any of these components could cause Incidents for a user. Consider the idea that incorrect or misleading documentation would cause an Incident. A user may rely on this documentation and make assumptions on how to use a service, discover they can’t and contact the Servicedesk.
This documentation component has caused an Incident and would be considered the root cause of the Problem.
Where the KEDB fits into the Problem Management process
The Known Error Database is a repository of information that describes all of the conditions in your IT systems that might result in an incident for your customers and users.
As users report issues support engineers would follow the normal steps in the Incident Management process. Logging, Categorisation, Prioritisation. Soon after that they should be on the hunt for a resolution for the user.
This is where the KEDB steps in.
The engineer would interact with the KEDB in a very similar fashion to any Search engine or Knowledgebase. They search (using the “Known Error” field) and retrieve information to view the “Workaround” field.
The “Known Error”
The Known Error is a description of the Problem as seen from the user’s point of view. When users contact the Servicedesk for help they have a limited view of the entire scope of the root cause. We should use screenshots of error messages, as well as the text of the message, to aid searching. We should also include accurate descriptions of the conditions that users have experienced. These are the types of things we should be describing in the Known Error field. A good example of a Known Error would be:
When accessing the Timesheet application using Internet Explorer 6 users experience an error message when submitting the form.
The Known Error should be written in terms reflecting the customer’s experience of the Problem.
The “Workaround”
The Workaround is a set of steps that the Servicedesk engineer could take in order to either restore service to the user or provide temporary relief. A good example of a Workaround would be:
To workaround this issue add the timesheet application to the list of Trusted sites
1. Open Internet Explorer
2. Tools > Options > Security Settings
[ etc etc ]
The Known Error is a search key. A Workaround is what the engineer is hoping to find – a search result. Having a detailed Workaround, a set of technical actions the Servicedesk should take to help the user, has multiple benefits – some more obvious than others.
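The search interaction described above can be sketched very simply: the Known Error text is the search key, and the Workaround is the result. The record and field names below are illustrative, not from any specific ITSM tool.

```python
# A minimal in-memory KEDB: each record pairs a Known Error (the
# search key, written from the user's point of view) with a Workaround.
kedb = [
    {
        "known_error": "Timesheet application error on submit in Internet Explorer 6",
        "workaround": "Add the timesheet application to the list of Trusted sites",
    },
]

def search_kedb(query):
    """Return every record whose Known Error contains all query terms."""
    terms = query.lower().split()
    return [
        record for record in kedb
        if all(term in record["known_error"].lower() for term in terms)
    ]

# The engineer searches with words taken from the user's description:
for hit in search_kedb("timesheet submit"):
    print(hit["workaround"])
```

In practice the search would run inside your ITSM tool or knowledge base, but the principle is the same: good Known Error wording determines whether the Workaround is findable.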
Seven Benefits of Using a Known Error Database (KEDB)
Faster restoration of service to the user – The user has lost access to a service due to a condition that we already know about and have seen before. The best possible experience that the user could hope for is an instant restoration of service or a temporary resolution. Having a good Known Error which makes the Problem easy to find also means that the Workaround should be quicker to locate. All of the time required to properly understand the root cause of the user’s issue can be removed by allowing the Servicedesk engineer quick access to the Workaround.
Repeatable Workarounds – Without a good system for generating high-quality Known Errors and Workarounds we might find that different engineers resolve the same issue in different ways. Creativity in IT is absolutely a good thing, but repeatable processes are probably better. Two users contacting the Servicedesk for the same issue wouldn’t expect a variance in the speed or quality of resolution. The KEDB is a method of introducing repeatable processes into your environment.
Avoid Re-work – Without a KEDB we might find that engineers are often spending time and energy trying to find a resolution for the same issue. This would be likely in distributed teams working from different offices, but I’ve also seen it commonly occur within a single team. Have you ever asked an engineer if they know the solution to a user’s issue, only to be told “Yes, I fixed this for someone else last week!”? Would you have preferred to have found that information in an easier way?
Avoid skill gaps – Within a team it is normal to have engineers at different levels of skill. You wouldn’t want to employ a team that are all experts in every functional area and it’s natural to have more junior members at a lower skill level. A system for capturing the Workaround for complex Problems allows any engineer to quickly resolve issues that are affecting users. Teams are often cross-functional. You might see a centralised application support function in a head-office with users in remote offices supported by their local IT teams. A KEDB gives all IT engineers a single place to search for customer facing issues.
Avoid dangerous or unauthorised Workarounds – We want to control the Workarounds that engineers give to users. I’ve had moments in the past when I chatted to engineers, asked how they fixed issues and internally winced at the methods they used: disabling antivirus to avoid unexpected behaviour, upgrading whole software suites to fix a minor issue. I’m sure you can relate to this. Published, approved Workarounds can help eliminate dangerous ones.
Avoid unnecessary transfer of Incidents – A weak point in the Incident Management process is the transfer of ownership between teams. This is the point where a customer issue goes to the bottom of someone else’s queue of work, often with not enough detailed context or background information. Enabling the Servicedesk to resolve issues themselves prevents transfer of ownership for issues that are already known.
Get insights into the relative severity of Problems – Well written Known Errors make it easier to associate new Incidents with existing Problems. Firstly this avoids duplicate logging of Problems. Secondly it gives better metrics about how severe the Problem is. Consider two Problems in your system:
A condition that affects a network switch and causes it to crash once every 6 months
A transactional database that is running slowly and adding 5 seconds to timesheet entry
You would expect that the first Problem would be given a high priority and the second a lower one. It stands to reason that a network outage on a core switch would be more urgent than a slowly running timesheet system. But which would cause more Incidents over time? You might be associating 5 new Incidents per month with the timesheet problem whereas the switch only causes issues irregularly. Being able to quickly associate Incidents with existing Problems allows you to judge the relative impact of each one.
The KEDB implementation
Technically, when we talk about the KEDB we are really talking about the Problem Management database rather than a completely separate store of data. At least, a decent implementation would have it set up that way.
There is a one-to-one mapping between Known Error and Problem so it makes sense that your standard data representation of a Problem (with its number, assignment data, work notes etc) also holds the data you need for the KEDB.
It isn’t incorrect to implement this in a different way – storing the Problems and Known Errors in separate locations – but my own preference is to keep it all together.
Known Error and Workaround are both attributes of a Problem
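One possible data representation of this one-to-one mapping is sketched below: the Known Error and Workaround are simply nullable attributes of the Problem record, alongside its number, assignment data and work notes. All field names here are illustrative assumptions, not any particular tool’s schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Problem:
    """A Problem record that doubles as the KEDB entry.

    known_error and workaround start empty and are populated (published)
    as the Problem Management investigation progresses.
    """
    number: str                           # e.g. "PRB0001234" (illustrative)
    assignment_group: str
    known_error: Optional[str] = None     # the published search key
    workaround: Optional[str] = None      # the published resolution steps
    state: str = "Open"
    work_notes: list = field(default_factory=list)

# The record is created first, then the KEDB attributes are published later:
p = Problem(number="PRB0001234", assignment_group="App Support")
p.known_error = "Timesheet form fails to submit in Internet Explorer 6"
p.workaround = "Add the timesheet application to the list of Trusted sites"
```

Keeping the KEDB fields on the Problem record means closing or retiring the Problem naturally retires its Known Error too, which matches the lifecycle discussed below.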
Is the KEDB the same as the Knowledge Base?
This is a common question. There are a lot of similarities between Known Errors and Knowledge articles.
I would argue that although your implementation of the KEDB might store its data in the Knowledgebase they are separate entities.
Consider the lifecycle of a Problem, and therefore the Known Error which is, after all, just an attribute of that Problem record.
The Problem should be closed when it has been removed from the system and can no longer affect users or be the cause of Incidents. At this stage we could retire the Known Error and Workaround as they are no longer useful – although we would want to keep them for reporting so perhaps we wouldn’t delete them.
Knowledgebase articles have a more permanent use. Although they too might be retired, if they refer to an application due to be decommissioned, they don’t have the same lifecycle as a Known Error record.
Knowledge articles refer to how systems should work or provide training for users of the system. Known Errors document conditions that are unexpected.
There is benefit in using the Knowledgebase as a repository for Known Error articles however. Giving Incident owners a single place to search for both Knowledge and Known Errors is a nice feature of your implementation and typically your Knowledge tools will have nice authoring, linking and commenting capabilities.
What if there is no Workaround?
Sometimes there just won’t be a suitable Workaround to provide to customers.
I would use an example of a power outage to provide a simple illustration. With power disrupted to a location you could imagine that there would be disruption to services with no easy workaround.
It is perhaps a lazy example as it doesn’t allow for many nuances. Having power is normally a binary state – you either have adequate power or not.
A better and more topical example can be found in the Cloud. As organisations take advantage of the resource charging model of the Cloud they also outsource control.
If you rely on a Cloud SaaS provider for your email and they suffer an outage you can imagine that your Servicedesk will take a lot of calls. However there might not be a Workaround you can offer until your provider restores service.
Another example would be the February 29th Microsoft Azure outage. I’m sure a lot of customers experienced a Problem using many different definitions of the word but didn’t have a viable alternative for their users.
In this case there is still value to be found in the Known Error Database. If there really is no known workaround it is still worth publishing to the KEDB.
Firstly, to aid in associating new Incidents with the Problem (using the Known Error as a search key), and to stop engineers wasting time searching for an answer that doesn’t exist.
You could also avoid engineers trying to implement potentially damaging workarounds by publishing the fact that the correct action to take is to wait for the root cause of the Problem to be resolved.
Lastly, with a lot of Problems in our system we might struggle to prioritise our backlog. Having the Known Error published to help route new Incidents to the right Problem will bring the benefit of being able to prioritise your most impactful issues.
A user’s Known Error profile
With a populated KEDB we now have a good understanding of the possible causes of Incidents within our system.
Not all Known Errors will affect all users – a network switch failure in one branch office would be very impactful for the local users but not for users in another location.
If we understand our users’ environments through systems such as the Configuration Management System (CMS) or Asset Management processes, we should be able to determine a user’s exposure to Known Errors.
For example, when a user phones the Servicedesk complaining of an interruption to service we should be able to quickly learn about her configuration: where she is geographically, which services she connects to, and her personal hardware and software environment.
With this information, and some Configuration Item matching, the Servicedesk engineer should have a view of all of the Known Errors that the user is vulnerable to.
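That Configuration Item matching can be sketched as a set intersection: each Known Error is tagged with the CIs it affects, and we list every Known Error whose CIs overlap the user’s. The problem IDs and CI names below are hypothetical.

```python
# Hypothetical mapping from Problem (Known Error) to the Configuration
# Items it affects, as might be derived from the CMS.
known_error_cis = {
    "PRB001": {"branch-office-switch-07"},          # local switch failure
    "PRB002": {"timesheet-app", "internet-explorer-6"},  # timesheet defect
}

def known_errors_for(user_cis):
    """Return the Known Errors whose affected CIs overlap the user's CIs."""
    user_set = set(user_cis)
    return [
        problem for problem, cis in known_error_cis.items()
        if cis & user_set  # any shared Configuration Item means exposure
    ]

# A head-office user with the timesheet app is exposed only to PRB002:
print(known_errors_for(["timesheet-app", "outlook", "laptop-4411"]))
```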
Measuring the effectiveness of the KEDB
As with all processes we should take measurements and ensure that we have a healthy process for updating and using the KEDB.
Here are some metrics that would help give your KEDB a health check.
Number of Problems opened with a Known Error
Of all the Problem records opened in the last X days how many have published Known Error records?
We should be striving to create as many high quality Known Errors as possible.
The value of a published Known Error is that Incidents can be easily associated with Problems avoiding duplication.
Number of Problems opened with a Workaround
How many Problems have a documented Workaround?
The Workaround allows for the customer Incident to be resolved quickly and using an approved method.
Number of Incidents resolved by a Workaround
How many Incidents are resolved using a documented Workaround? This measures the value provided to users of IT services and confirms the benefits of maintaining the KEDB.
Number of Incidents resolved without a Workaround or Knowledge
Conversely, how many Incidents are resolved without using a Workaround or another form of Knowledge?
If we see Servicedesk engineers having to research and discover their own solutions for Incidents does that mean that there are Known Errors in the system that we aren’t aware of?
Are there gaps in our Knowledge Management, meaning that customers are contacting the Servicedesk and we don’t have an answer readily available?
A high number in our reporting here might be an opportunity to proactively improve our Knowledge systems.
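The four health-check metrics above reduce to simple counts over problem and incident records. A minimal sketch, with field names that are assumptions rather than any specific tool’s schema:

```python
# Illustrative records: flags indicate whether a Known Error/Workaround
# was published, and whether an Incident was resolved via a Workaround.
problems = [
    {"id": "PRB001", "known_error": True,  "workaround": True},
    {"id": "PRB002", "known_error": True,  "workaround": False},
    {"id": "PRB003", "known_error": False, "workaround": False},
]
incidents = [
    {"id": "INC100", "resolved_by_workaround": True},
    {"id": "INC101", "resolved_by_workaround": False},
]

metrics = {
    "problems_with_known_error": sum(p["known_error"] for p in problems),
    "problems_with_workaround": sum(p["workaround"] for p in problems),
    "incidents_resolved_by_workaround": sum(
        i["resolved_by_workaround"] for i in incidents
    ),
    "incidents_resolved_without_workaround": sum(
        not i["resolved_by_workaround"] for i in incidents
    ),
}
print(metrics)
```

Trending these counts over a rolling window (the “last X days” above) is what turns them into a health check rather than a snapshot.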
We want to ensure that Known Errors are quickly written and published in order to allow Servicedesk engineers to associate incoming Incidents with existing Problems.
One method of measuring how quickly we are publishing Known Errors is to use Operational Level Agreements (or SLAs if your ITSM tool doesn’t define OLAs).
We should be using performance measurements to ensure that our Problem Management function is publishing Known Errors in a timely fashion.
You could consider tracking Time to generate Known Error and Time to generate Workaround as performance metrics for your KEDB process.
Additionally, we could also measure how quickly Workarounds are researched, tested and published. If there is no known Workaround, that is still valuable information for the Servicedesk as it eliminates effort in trying to find one, so an OLA would be appropriate here.
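The “Time to generate Known Error” and “Time to generate Workaround” measurements suggested above are just elapsed-time calculations from the Problem’s timestamps. A small sketch, with illustrative dates:

```python
from datetime import datetime

def hours_between(opened_at, published_at):
    """Elapsed hours between opening a Problem and publishing an artefact
    (the Known Error or the Workaround) against it."""
    return (published_at - opened_at).total_seconds() / 3600

# Problem opened at 09:00; Known Error published the same day at 17:00:
opened = datetime(2024, 3, 1, 9, 0)
ke_published = datetime(2024, 3, 1, 17, 0)
print(hours_between(opened, ke_published))  # 8.0
```

Comparing these elapsed times against the OLA target gives Problem Management a concrete, reportable measure of how quickly it is feeding the Servicedesk.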