Problem Management Defined

Rob England: Problem Management Defined

Railways (railroads) remind us of how the real world works.

In our last article, we left Cherry Valley, Illinois in its own little piece of hell.

For those who missed the article, in 2009 a Canadian National railroad train carrying eight million litres of ethanol derailed at a level crossing in the little town of Cherry Valley after torrential rain washed out the roadbed beneath the track. 19 tankers of ethanol derailed, 13 of them split or spilled, and the mess somehow caught fire in the downpour.

One person in the cars waiting at the crossing died and several more were seriously injured.

Incidents vs. Problems

In that previous article we looked at Incident Management. As I said then, an incident is an interruption to service and a problem is an underlying cause of incidents. Incident Management is concerned with the restoration of expected levels of service to the users. Problem Management is concerned with removing the underlying causes. I also mentioned that ITIL doesn’t delineate the two that crisply. Anyway, let us return to Cherry Valley…

One group of people worked inside office buildings making sure the trains kept rolling around the obstruction so that the railroad met its service obligations to its users. This was the Incident Management practice: restoring service to the users, focusing on perishable deliveries such as livestock and fruit.

Another group thrashed around in the chaos that was Cherry Valley, trying to fix a situation that was very, very broken. Their initial goal was containment: save and treat people in vehicles, evacuate surrounding houses, stop the fire, stop the spills, move the other 100 tank-cars of ethanol away, get rid of all this damn flooding and mud.

The Shoo-fly

The intermediate goal was repair and restore: get trains running again. Often this is done with a “shoo-fly”: a temporary stretch of track laid around the break, which trains inch gingerly across whilst more permanent repairs are effected. This is not a Workaround as we use the term in ITSM. The Workaround was to get trains onto alternate routes or pass freight to other companies. A shoo-fly is temporary infrastructure: it is part of the problem fix, just as a temporary VM server instance would be. While freight ran on other roads or on a shoo-fly, they would crane the derailed tankers back onto the track or cart them away, then start the big job of rebuilding the road-base that had washed away – hopefully with better drains this time – and relaying the track. Compared to civil engineering, our IT repairs look quick, and certainly less strenuous.

Which brings us to the longer-term goal: permanent remediation of the problem. Not only does the permanent fix include new rail roadbed and proper drainage; the accident report makes it clear that CN’s procedures and communications were deficient as well. Cherry Valley locals were calling 911 an hour beforehand to report the wash-out.

Damage Limitation

We will talk more about the root causes and long-term improvement later. Let’s stay in Cherry Valley for now. It is important to note that the lives and property the emergency responders were saving were unconnected to the services, users or customers of the railroad. All the people working on all these aspects of the problem had only a secondary interest in the timeliness of pigs and oranges and expensive petrol. They were not measured on freight delivery times: they were measured on speed, quality and permanence of the fix, and prevention of any further damage.

If you read the books and listen to the pundits you will get more complex models that seem to imply that everything done until trains once more rolled smoothly through Cherry Valley is Incident Management. I beg to differ. To me it is pretty clear: Incident and Problem practices are delineated by different activities, teams, skills, techniques, tools, goals and metrics. Incident: user service levels. Problem: causes.

While I am arguing with ITIL definitions, let’s look at another aspect of Incidents. ITIL says that something broken is an Incident if it could potentially cause a service interruption in future. Once again this ignores the purpose, roles, skills and tools of Incident Management and Problem Management. Such a fault is clearly a Problem, a (future) cause of an Incident.

(Incidentally, it is hard to imagine many faults in IT that aren’t potentially the cause of a future interruption or degradation of service. If we follow this reasoning to its absurd conclusion, every fault is an incident and nothing is a problem.)

Perhaps one reason ITIL hangs these “potential incidents” where it does is another odd definition: ITIL says a Problem is the cause of “one or more incidents”. What’s odd about that? ITIL promotes pro-active (better called pre-emptive) problem management, and yet apparently we need to wait until something causes at least one incident before we can start treating it as a problem. I think the washout in Cherry Valley was a problem long before train U70691-18 barrelled into town. (Actually, ITIL V3 dropped proactive problem management, but it was hastily restored in ITIL 2011.)
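For readers who think in records rather than railroads, here is a rough sketch of the distinction as I have argued it. It is only an illustration: the class and field names (`Incident`, `Problem`, `ProblemStatus`, `is_preemptive`) are invented for this example and do not come from ITIL or from any ITSM tool. The point it shows is that a Problem can exist with zero linked incidents, and that restoring service closes the incident without fixing the problem.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ProblemStatus(Enum):
    """Hypothetical lifecycle for a problem record."""
    OPEN = "open"
    TEMPORARY_FIX = "temporary fix in place"  # e.g. a shoo-fly
    RESOLVED = "permanently fixed"


@dataclass
class Incident:
    """An interruption to service; done when service is restored."""
    summary: str
    service_restored: bool = False


@dataclass
class Problem:
    """An underlying cause; may exist before any incident occurs."""
    summary: str
    status: ProblemStatus = ProblemStatus.OPEN
    incidents: List[Incident] = field(default_factory=list)

    def is_preemptive(self) -> bool:
        # True while the fault has not yet interrupted service.
        return not self.incidents


# The washout is a problem as soon as it exists (or is reported).
washout = Problem("Roadbed washed out at the crossing")
was_preemptive = washout.is_preemptive()

# The derailment interrupts service: now there is an incident too.
derailment = Incident("Line blocked by derailment")
washout.incidents.append(derailment)

# Rerouting trains restores service, but the cause is still there.
derailment.service_restored = True
still_open = washout.status is ProblemStatus.OPEN
```

Under this model the two records are worked by different people against different measures: the incident is judged on `service_restored`, the problem on its `status`.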

Human Eyeball

One of my favourite railroad illustrations is about watching trains. When a train rolls by, keep an eye on nearby staff: those on platforms, down by the track, on waiting trains. On most railroads, staff will stop what they are doing and watch the train – the whole train, watching until it has all gone by. In the old days they would wave to the guard (conductor) on the back of the train. Nowadays they may say something to the driver via radio.

Laziness? Sociability? Railfans? Possibly. But quite likely it is part of their job – it may well be company policy that everybody watches every passing train. The reason is visual inspection. Even in these days of radio telemetry from the FRED (Flashing Rear End Device, a little box on the back that replaces the caboose/guard’s van of old) and track-side detectors for cracked wheels and hotboxes (overheating bearings), there is still no substitute for the good old human eyeball for spotting anything from joyriders to dragging equipment. It is everyone’s responsibility to watch and report: not a bad policy in IT either.

What they are spotting are Problems. The train is still rolling so the service hasn’t been interrupted … yet.

Other Problems make themselves known by interrupting the service. A faulty signal stops a train. In the extreme case the roadbed washes away. We can come up with differing names for things that have and haven’t interrupted/degraded service yet, but I think that is arguing about angels dancing on pinheads. They are all Problems to me: the same crews of people with heavy machinery turn out to fix them while the trains roll by delivering they care not what to whom. Oh sure, they have a customer focus: they care that the trains are indeed rolling and on time, but the individual service levels and customer satisfaction are not their direct concern. There are people in cozy offices who deal with the details of service levels and incidents.

Next time we will return to the once-again sleepy Cherry Valley to discuss the root causes of this accident.