Rob England: Proactive Problem Management

Just because you rebuild the track doesn’t mean the train won’t derail again.

Rebuilding the track was reactive Problem Management

We have been looking in past articles at the tragic events in little Cherry Valley, Illinois in 2009.  One person died and several more were seriously injured when a train-load of ethanol derailed at a level crossing. We talked about the resulting Incident Management, which focused on customers, trains and cargo – ensuring the services still operated, employing workarounds. Then we considered the Problem Management: the injured people and the wreck and the broken track – removing the causes of service disruption, restoring normal service.

A Problem is a problem, whether it has caused an Incident yet or not

In a previous article I said ITIL has an odd definition of Problem.  ITIL says a Problem is the cause of “one or more incidents”.   ITIL promotes proactive (better called pre-emptive) Problem Management, and yet apparently we need to wait until something causes at least one Incident before we can start treating it as a Problem.  I think the washout in Cherry Valley was a problem long before train U70691-18 barrelled into town.  A Problem is in fact the cause of zero or more Incidents.  A Problem is a problem, whether it has caused an Incident yet or not.

We talked about how I try to stick to a nice crisp simple model of Incident vs. Problem.  To me, an incident is an interruption to service and a problem is an underlying (potential) cause of incidents.  Incident Management is concerned with the restoration of expected levels of service to the users.  Problem Management is concerned with removing the underlying causes.

ITIL doesn’t see it that crisply delineated: the two concepts are muddied together.  ITIL – and many readers – would say that putting out the fires, clearing the derailed tankers, rebuilding the roadbed, and relaying the rails can be regarded as part of the Incident resolution process because the service isn’t “really” restored until the track is back.

Problems can be resolved with urgency

In the last article I said this thinking may arise because of the weird way ITIL defines a Problem. I have a hunch that there is a second reason: people consider removing the cause of the incident to be part of the incident because they see Incident=Urgent, Problem=Slow. They want the Incident Manager and the Service Desk staff to hustle until the cause is removed. This is just silly. There is no reason why Problems can’t be resolved with urgency. Problems should be categorised by severity and priority and impact just like Incidents are. The Problem team should go into urgent mode when necessary to mobilise resources, and the Service Desk can hustle the Problem along just as they would an Incident.
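To make that concrete, here is a minimal sketch of one work queue in which Problems and Incidents are ranked by the same rule. The impact-times-urgency scoring, the `Level` scale, and the `Ticket` fields are all illustrative assumptions, not something taken from ITIL or from any particular tool.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def priority(impact: Level, urgency: Level) -> int:
    """Hypothetical scoring: 1 is the most pressing, 5 the least."""
    return 7 - (impact + urgency)


@dataclass
class Ticket:
    kind: str      # "incident" or "problem"
    summary: str
    impact: Level
    urgency: Level


queue = [
    Ticket("incident", "Customer deliveries delayed, reroute trains", Level.HIGH, Level.MEDIUM),
    Ticket("problem", "Track washed out at the level crossing", Level.HIGH, Level.HIGH),
    Ticket("problem", "Weather alerts not relayed to train crews", Level.MEDIUM, Level.LOW),
]

# The same rule ranks both kinds of ticket; nothing makes a Problem "slow".
queue.sort(key=lambda t: priority(t.impact, t.urgency))
```

With these made-up values the washed-out track, a Problem, sorts to the head of the queue ahead of the Incident.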

This inclusion of cause-removal over-burdens and de-focuses the Incident Management process.  Incident Management should have a laser focus on the user and by implication the customer.  It should be performed by people who are expert at serving the user.  Its goal is to meet the user’s needs.   Canadian National’s incident managers were focused on getting deliveries to customers despite a missing bit of track.

Problem Management is about fixing faults. It is performed by people expert at fixing technology. The Canadian National incident managers weren’t directing clean-up operations in Cherry Valley: they left that to the track engineers and the emergency services.

Problem management is a mess

But the way ITIL has it, some causes are removed as part of Incident resolution and some are categorised as Problems, with the distinction being unclear (“For some incidents, it will be appropriate…”, ITIL Service Operation 2011). The moment you make Incident Management responsible for sometimes fixing the fault as well as meeting the user’s needs, you have a mashup of two processes, with two sometimes-conflicting goals, and performed by two very different types of people. No wonder it is a mess.

It is a mess from a management point of view when we get a storm of incidents. Instead of linking all related incidents to an underlying Problem, we relate them to some “master incident” (this isn’t actually in ITIL but it is common practice).

It is a mess from a prioritisation point of view.   The poor teams who fix things are now serving two processes:  Incident and Problem.  In order to prioritise their work they need to track a portfolio of faults that are currently being handled as incidents and faults that are being handled as problems, and somehow merge a holistic picture of both.  Of course they don’t.   The Problem Manager doesn’t have a complete view of all faults nor does the Incident Manager, and the technical teams are answerable to both.

It is a mess from a data modelling point of view as well. If you want to determine all the times that a certain asset broke something, you need to look for incidents it caused and problems it caused.

Every cause of a service impact (or potential impact) should be recorded immediately as a problem, so we can report and manage them in one place.
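As a minimal sketch of that data model (the class names, fields, and reference numbers are illustrative assumptions, not anything defined by ITIL or a specific tool), every fault is a Problem record linked to zero or more Incidents, so a question like "what has this asset broken?" is answered from one place:

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    """An interruption to service, as experienced by users."""
    ref: str
    summary: str


@dataclass
class Problem:
    """An underlying (potential) cause of incidents."""
    ref: str
    asset: str       # hypothetical field: the asset at fault
    summary: str
    incidents: list[Incident] = field(default_factory=list)  # zero or more


def problems_for_asset(problems: list[Problem], asset: str) -> list[Problem]:
    """Every fault recorded against an asset, whether or not it has caused an incident yet."""
    return [p for p in problems if p.asset == asset]


washout = Problem("PRB-1", "track", "Faulty drainage behind the embankment")
washout.incidents.append(Incident("INC-1", "Derailment at the level crossing"))

# A Problem with no Incidents yet is still a Problem.
alerts = Problem("PRB-2", "dispatch", "Weather alerts not relayed to train crews")

register = [washout, alerts]
track_faults = problems_for_asset(register, "track")
```

Because incidents hang off the Problem record, there is no second list of "faults handled as incidents" to reconcile.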

All that tirade is by way of introducing the idea of reactive and proactive Problem Management.

Cherry Valley needed Reactive Problem Management

Reactive Problem Management responds to an incident to remove the cause of the disruption to service.  The ITIL definition is more tortuous because it treats “restoring the service” as Incident Management’s job, but it ends up saying a similar thing: “Reactive problem management is concerned with solving problems in response to one or more incidents” (SO 2011 4.4.2).

Proactive Problem Management fixes problems that aren’t currently causing an incident, to prevent them from causing incidents (ITIL says “further” incidents).

So cleaning up the mess in Cherry Valley and rebuilding the track was reactive Problem Management.

Once the trains were rolling they didn’t stop there.  Clearly there were some other problems to address.  What caused the roadbed to be washed away in the first place?  Why did a train thunder into the gap at normal track speed?  Why did the tank-cars rupture and how did they catch fire?

Find the problems that need fixing

In Cherry Valley, the drainage was faulty.  Water was able to accumulate behind the railway roadbed embankment, causing flooding and eventually overflowing the roadbed, washing out below the track, leaving rails dangling in the air.  The next time there was torrential rain, it would break again.  That’s a problem to fix.

Canadian National’s communication processes were broken.  The dispatchers failed to notify the train crew of a severe weather alert, which they were supposed to do.  If they had, the train would have operated at reduced speed.  That’s a problem to fix.

The CN track maintenance processes worked, perhaps lackadaisically but they worked as designed.  The processes could have been a lot better, but were they broken?  No.

The tank cars were approved for transporting ethanol. They were not required to be equipped with head shields (extra protection at the ends of the tank to resist puncturing), jackets, or thermal protection. In March 2012 the US National Transportation Safety Board (NTSB) recommended (R-12-5) “that all newly manufactured and existing general service tank cars authorized for transportation of denatured fuel ethanol … have enhanced tank head and shell puncture resistance systems”. The tank-cars weren’t broken (before the crash). This is not fixing a problem; it is improving the safety to mitigate the risk of rupture.

Proactive Problem Management prevents the recurrence of Incidents

I don’t think proactive Problem Management is about preventing problems from occurring, or preventing causes of service disruption from entering the environment, or making systems more robust or higher quality. That is once again over-burdening a process. If you delve too far into preventing future problems, you cross over into Availability and Capacity and Risk Management and Service Improvement (and Change Management!), not Problem Management.

ITIL agrees: “Proactive problem management is concerned with identifying and solving problems and known errors before further incidents related to them can occur again”.   Proactive Problem Management prevents the recurrence of Incidents, not Problems.

In order to ensure that incidents will not recur, we need to dig down to find all the underlying causes.  In many methodologies we go after that mythical beast, the Root Cause.  We will talk about that next time.

4 thoughts on “Rob England: Proactive Problem Management”

    1. I think what you mean is that we have differing definitions of a problem. I spent a whole series of articles – including this one – explaining what I think a problem is. So I’m pretty sure I do understand it.
      If you have a differing view of a problem, we could discuss and debate it here. Or you could just be patronising.

  1. As is often the case, I mostly agree with you, but there are some sticking points…
    While ITIL does say that a Problem is the underlying cause of one or more Incidents, it also says that an Incident doesn’t necessarily have to be an interruption or degradation of service delivery. An Incident CAN BE a threat to interruption or degradation of service delivery. Therefore, there can be a Problem – even if there is no customer impact at all.

    The other sticking point for me is that sometimes resolving an Incident is about restoring normal service expectations, sometimes it’s about providing an alternate means of customer outcome realization, and sometimes it requires removal of the cause…

    Distinct separation example:
    Incident: I don’t have my file.
    Problem: Email server is down.
    Incident Resolution: FTP the file.
    Incident Closure: Bring the email server back up.
    Problem Resolution: Figure out why it crashed and take steps to avoid recurrence.
    Problem Closure: Fully implement and test the proposed permanent fix.

    Alternatively, what if the file can only be delivered through a specific secure file transfer utility, because it’s the only delivery method the client will use?
    The only way to resolve the Incident, in this case, is to restore the functionality of the file transfer utility, and if we don’t know why it’s down, what happens? What should happen (speaking from a former Incident Manager’s perspective) is Incident Management should still retain control over managing the Incident, and a high priority should be placed on accessing the people who would be best at RCA for that particular system. In this situation, I would have Problem Management on that call as a resource, but I still owned the Incident… In that moment, they reported to me, or they got the Hell out of my way.
    Once the file transfer utility was back up and the client got the file, only then would I allow Problem Management to assume ownership of the situation.

    The line between Incident and Problem activities will always be somewhat dynamic, but ownership and accountability must be crystal clear.
