BAU Improvements

In my last article on service improvement, I laid out four premises that underlie how I think we should approach CSI:

Process improvements evolve with time on railroads
  • Everything we change is service improvement.
  • Improvement planning comes first.
  • We don’t have enough resource to execute all desired improvements.
  • We choose the wrong unit of work for improvements.

What are the desired business outcomes?

We must focus on what is needed.  To understand the word ‘needed’ we go back to the desired business outcomes.  Then we can make a list of the improvement outputs that will deliver those outcomes, and hence the pieces of work we need to do.

Even then we will find that the list can be daunting, and some sort of ruthless expediency will have to be applied to choose what does and doesn’t get done.

How will you resource the improvements?

The other challenge will be resourcing the improvements, no matter how ruthlessly we cut down the list.  Almost all of us work in an environment of shrinking budgets and desperate shortages of every resource: time, people and money.  One way to address this is to do some of the work as part of BAU.

These are all aspects of my public-domain improvement planning method, Tipu:

  • Alignment to business outcomes
  • Ruthless decision making
  • Doing much of the work as part of our day jobs

Let me give you two more premises that build on the first four and take us to the heart of how I approached service improvement with Tipu.

Fifth premise: Improvement is part of a professional’s day job

Railroads work this way.  Process improvements evolve over time on the job.    The only time they have a formal process improvement project is for a major review: e.g. a safety campaign with experts checking the practices for safety risks; or a cost-cutting drive with time-and-motion analysts squeezing out efficiencies (we call it Lean these days).  Most of the time, middle managers and line workers talk and decide a better way as part of their day jobs, often locally and often passed on as unwritten lore.  Nobody in head office knows how each industrial track is switched (the wagons shuffled around: loads in, empties out).  The old hands teach it to the newcomers.

Most improvement is not a project.   Improvement is normal behaviour for professionals: to devote a certain percentage of our time to improving the systems we work with.  We should all expect that things will be better next year.   We should all expect that we will make a difference and leave systems better than we found them.   Improvement is part of business as usual.

As a culture, IT doesn’t take kindly to ad-hoc, local grass-roots, unmanaged improvements.  We need to get over that – we don’t have good alternatives if we are going to make progress.

Sixth premise: Software and hardware have to be near-perfect.  Practices and processes don’t.

The tolerances for the spacing of wheels and rails are specified in fractions of a millimetre on high-speed track.   Even slow freight lines must be correct to a few millimetres, over the thousands of kilometres of a line.  And no, the standard 4’8.5” gauge has nothing to do with Roman chariots.  It was one of many gauges in use for mine carts when George Stephenson started building railways, but his first employer happened to use 4’8”.  Sorry to spoil a good story about horses’ butts and space shuttles.

Contrast the accuracy of the technology with the practices used to operate a railroad.  In the USA, freight train arrival times cannot be predicted to the nearest half-day. (Let’s not get into a cultural debate by contrasting this with say Japanese railroads.  To some, the USA looks sloppy.  They say it is flexible.)   Often US railroads need to drive out a new crew to take over a train because the current crew have done their legally-limited 12 hours.  Train watchers will tell you that two different crews may well switch a location (shuffle the wagons about) differently.  Compared to their technology, railroads’ practices are loose.  Just like us.

In recent years railroad practices have been tightened for greater efficiency (the New Zealand Railways carry more freight now with about 11,000 staff than they once did with 55,000) and especially for greater human safety.  But practices are still not “to the nearest millimetre” by any means.

Perfection is impossible

We operate with limited resources and information in an imperfect world.  It is impossible for an organisation to improve all practices to an excellent level in a useful time.  Therefore it is essential to make the hard decisions about which ones we address.  Equally it is impossible – or at least not practical – to produce the perfect solution for each one.  In the real world we do what we can and move on.  Good enough is near enough except in clearly identified situations where Best is essential for business reasons.  Best Practice frameworks are not a blueprint: they are a comparison reference or benchmark to show what would be achieved with unlimited resources in unlimited time – they are aspirational.

Some progress is better than nothing.  If we try to take a formalised project-managed approach to service improvement, the outcome for the few aspects addressed by the projects will be a good complete solution… eventually, when the projects end, if the money holds.  Unfortunately, the outcome for the many aspects of service delivery not included in the projects’ scope is likely to be nothing.   Most organisations don’t have enough funds, people or time to do a formal project-based improvement of every aspect of service management.  Aim to address a wider scope than projects can – done less formally, less completely, and less perfectly than a project would.

We can do this by making improvements as we go, at our day jobs in BAU.  We will discuss this ‘relaxed’ approach more fully in future.

We need an improvement programme to manage the improvements we choose to make.   That programme should encompass both projects and BAU improvements.

Project management is a mature discipline

The management of projects is a mature discipline: see Prince2 and Managing Successful Programmes and Management of Portfolios and Portfolio Programme and Project Office, to name just the four bodies of knowledge from the UK Cabinet Office.

What we are not so mature about is managing improvements as part of BAU.

The public-domain Tipu method focuses on improving the creation and operation of services, not the actual service systems themselves.  The former is what BAU improvements should focus on: Tipu improves the way services are delivered, not the functionality of the service (although it could conceivably be used for that too).

Service owners need to take responsibility for improvements

The improvement of the actual services themselves – their quality and functionality – is the domain of the owners of the services: our IT customers.   They make those decisions to improve and they should fund them, generally as projects.

On the other hand, decisions about improving the practices we use to acquire/build and operate the IT machinery of services can be taken within IT: they are practices under our control, our authority, our accountability.  They are areas that we are expected to improve as part of our day jobs, as part of business as usual.

We’ll get into the nitty-gritty of how to do that next time.

image credit – © Tomas Sereda – Fotolia.com

Everything is improvement

Traditionally Continual Service Improvement (CSI) is too often thought of as the last bit we put in place when formalising ITSM.  In fact, we need to start with CSI, and we need to plan a whole portfolio of improvements encompassing formal projects, planned changes, and improvements done as part of business-as-usual (BAU) operations.  And the ITIL ‘process’ is the wrong unit of work for those improvements, despite what The Books tell you. Work with me here as I take you through a series of premises to reach these conclusions and see where it takes us.

In my last article, I said service portfolio management is a superset of organisational change management.  Service portfolio decisions are decisions about what new services go ahead and what changes are allowed to update existing services, often balancing them off against each other and against the demands of keeping the production services running.  Everything we change is service improvement. Why else would we do it?  If we define improvement as increasing value or reducing risk, then everything we change should be to improve the services to our customers, either directly or indirectly.
Therefore our improvement programme should manage and prioritise all change.  Change management and service improvement planning are one and the same.

Everything is improvement

First premise: Everything we change is service improvement

Look at a recent Union Pacific Railroad quarterly earnings report.  (The other US mega-railroad, BNSF, is now the personal train-set of Warren Buffett – that’s a real man’s toy – but luckily UP is still publicly listed and tells us what it is up to).

I don’t think UP management let one group decide to get into the fracking materials business and another decide to double-track the Sunset Route.  Governors and executive management have an overall figure in mind for capital spend.  They allocate that money across both new services and infrastructure upgrades.

They manage the new and existing services as a portfolio.  If the new fracking sand traffic requires purchase of a thousand new covered hoppers then the El Paso Intermodal Yard expansion may have to wait.  Or maybe they borrow the money for the hoppers against the expected revenues because the rail-yard expansion can’t wait.  Or they squeeze operational budgets.  Either way the decisions are taken holistically: offsetting new services against BAU and balancing each change against the others.

Our improvement programme should manage and prioritise all change, including changes to introduce or upgrade (or retire) services, and changes to improve BAU operations.  Change management and service portfolio management are both aspects of the same improvement planning activity.  Service portfolio management makes the decisions; change management works out the details and puts them into effect.

It is all one portfolio

Second premise: Improvement planning comes first

Our CSI plan is the FIRST thing we put together, not some afterthought we put in place after an ‘improvement’ project or – shudder – ‘ITIL Implementation’ project.
UP don’t rush off and do $3.6 billion in capital improvements then start planning the minor improvements later.  Nor do they allow their regular track maintenance teams to spend any more than essential on the parts of the Sunset Route that are going to be torn up and double tracked in the next few years.  They run down infrastructure that they know is going to be replaced.  So the BAU improvements have to be planned in conjunction with major improvement projects.  It is all one portfolio, even if separate teams manage the sub-portfolios.  Sure miscommunications happen in the real world, but the intent is to prevent waste, duplication, shortages and conflicts.

Welcome to the real world

Third premise: we don’t have enough resource to execute all desired improvements

In the perfect world all trains would be flawlessly controlled by automated systems, eliminating human error, running trains so close they were within sight of each other for maximum track utilisation, and never ever crashing or derailing.  Every few years governments legislate towards this, because political correctness says it is not enough to be one of the safest modes of transport around: not even one person may be allowed to die, ever.  The airlines can tell a similar story.   This irrational decision-making forces railroads to spend billions that otherwise would be allocated to better trackwork, new lines, or upgraded rolling stock and locos.  The analogy with – say – CMDB is a strong one: never mind all the other clearly more important projects, IT people can’t bear the idea of imperfect data or uncertain answers.
Even if our portfolio decision-making were rational, we can’t do everything we’d like to, in any organisation.  Look at a picture of all the practices involved in running IT.

You can’t do everything

The meaning of most of these labels should be self-evident.  You can find out more here.  Ask yourself which of those activities (practices, functions, processes… whatever you want to call them) could use some improvement in your organisation.  I’m betting most of them.
So even without available funds being gobbled up by projects inspired by political correctness, a barmy new boss, or a genuine need in the business, what would be the probability of you getting approval and money for projects to improve all of them?  Even if you work at Google and money is no problem, assuming a mad boss signed off on all of them what chance would you have of actually getting them all done?  Hellooooo!!!

What are we doing wrong?

Fourth premise: there is something very wrong with the way we approach ITSM improvement projects, which causes them to become overly big and complex and disruptive.  This is because we choose the wrong unit of work for improvements.

How to cover everything that needs to be looked at?  The key word there is ‘needs’.  We should understand our business goals for service, derive from those goals the required outcomes from service delivery, and then focus on improvements that deliver those required outcomes… and nothing else.

One way to improve focus is to work on smaller units than a whole practice.  A major shortcoming of many IT service management projects is that they take the ITIL ‘processes’ as the building blocks of the programme.  ‘We will do Incident first’.  ‘We can’t do Change until we have done Configuration’.  Even some of the official ITIL books promote this thinking.

Put another way, you don’t eat an elephant one leg at a time: you eat it one steak at a time… and one mouthful at a time within the meal.  Especially when the elephant has about 80 legs.

Don’t eat the whole elephant

We must decompose the service management practices into smaller, more achievable units of work, which we assemble Lego-style into a solution to the current need.  The objective is not to eat the elephant, it is to get some good meals out of it.
Or to get back to railroads: the Sunset Route is identified as a critical bottleneck that needs to be improved, so they look at trackwork, yards, dispatching practices, traffic flows, alternate routes, partner and customer agreements…. Every practice of that one part of the business is considered.  Then a programme of improvements is put in place that includes a big capital project like double-tracking as much of it as is essential; but also includes lots of local minor improvements across all practices – not improvements for their own sake, not improvements to every aspect of every practice, just a collection of improvements assembled to relieve the congestion on Sunset.
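To make the Lego-style assembly a little more concrete, here is a minimal Python sketch.  The units of work, effort figures and the “need” label are entirely hypothetical, invented for the example – this is not something prescribed by Tipu or ITIL, just an illustration of choosing small units that serve one identified need instead of improving whole practices:

```python
# Hypothetical illustration: small units of work, each tagged with the
# practice it touches and the business outcomes it serves. We assemble a
# programme for ONE identified need rather than improving whole practices.

from dataclasses import dataclass

@dataclass
class UnitOfWork:
    name: str
    practice: str        # e.g. "Dispatching", "Infrastructure", "Change"
    outcomes: set        # business outcomes this unit contributes to
    effort_days: int

catalogue = [
    UnitOfWork("Severe-weather alert checklist", "Dispatching",
               {"relieve congestion", "safety"}, 3),
    UnitOfWork("Double-track the worst bottleneck", "Infrastructure",
               {"relieve congestion"}, 400),
    UnitOfWork("Standardise yard switching notes", "Operations",
               {"relieve congestion"}, 10),
    UnitOfWork("Rewrite the full Change policy", "Change",
               {"tidiness"}, 60),
]

def assemble(need: str, budget_days: int):
    """Pick only the units that serve the stated need, within the budget."""
    relevant = sorted((u for u in catalogue if need in u.outcomes),
                      key=lambda u: u.effort_days)
    plan, spent = [], 0
    for u in relevant:
        if spent + u.effort_days <= budget_days:
            plan.append(u)
            spent += u.effort_days
    return plan

for u in assemble("relieve congestion", budget_days=50):
    print(u.practice, "-", u.name, f"({u.effort_days} days)")
```

The point is the shape of the decision: the need comes first, and the practices are only the source of candidate pieces.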

Make improvement real

So take these four premises and consider the conclusions we can draw from them:

  1. Everything we change is service improvement.
  2. Improvement planning comes first.
  3. We don’t have enough resource to execute all desired improvements.
  4. We choose the wrong unit of work for improvements.

We should begin our strategic planning of operations by putting in place a service improvement programme.  That programme should encompass all change and BAU: i.e. it manages the service portfolio.

The task of “eating 80-plus elephant legs” is overwhelming. We can’t improve everything about every aspect of doing IT.   Some sort of expediency and pragmatism is required to make it manageable.  A first step down that road is to stop trying to fix things practice-by-practice, one ITIL “process” at a time.

Focus on needs

We must focus on what is needed.  To understand the word ‘needed’ we go back to the desired business outcomes.  Then we can make a list of the improvement outputs that will deliver those outcomes, and hence the pieces of work we need to do.

Even then we will find that the list can be daunting, and some sort of ruthless expediency will have to be applied to choose what does and doesn’t get done.

The other challenge will be resourcing the improvements, no matter how ruthlessly we cut down the list.  Almost all of us work in an environment of shrinking budgets and desperate shortages of every resource: time, people and money.  One way to address this – as I’ve already hinted – is to do some of the work as part of BAU.

These are all aspects of my public-domain improvement planning method, Tipu:

  • Alignment to business outcomes
  • Ruthless decision making
  • Doing much of the work as part of our day jobs

More of this in my next article when we look closer at the Tipu approach.

Service Improvement at Cherry Valley

Problem, risk, change, CSI, service portfolio, projects: they all make changes to services.  How they inter-relate is not well defined or understood.  We will try to make the model clearer and simpler.

Problem and Risk and Improvement

The crew was not warned of the severe weather ahead

In this series of articles, we have been talking about an ethanol train derailment in the USA as a case study for our discussions of service management.  The US National Transportation Safety Board wrote a huge report about the disaster, trying to identify every single factor that contributed and to recommend improvements.  The NTSB were not doing Problem Management at Cherry Valley.  The crews cleaning up the mess and rebuilding the track were doing problem management.  The local authorities repairing the water reservoir that burst were doing problem management.  The NTSB was doing risk management and driving service improvement.

Arguably, fixing procedures which were broken was also problem management.   The local dispatcher failed to tell the train crew of a severe weather warning as he was supposed to do, which would have required the crew to slow down and watch out.  So fixing that with training and prompts could be considered problem management.

But somewhere there is a line where problem management ends and improvement begins, in particular what ITIL calls continual service improvement or CSI.

In the Cherry Valley incident, the police and railroad could have communicated better with each other.  Was the procedure broken?  No, it was just not as effective as it could be.  The tank cars approved for ethanol transportation were not required to have double bulkheads on the ends to reduce the chance of them getting punctured.  Fixing that is not problem management, it is improving the safety of the tank cars.  I don’t think improving that communications procedure or the tank car design is problem management; otherwise, if you follow that thinking to its logical conclusion, every improvement is problem management.

A distinction between risks and problems

But wait: unreliable communications procedure and the single-skinned tank cars are also risks.  A number of thinkers, including Jan van Bon, argue that risk and problem management are the same thing.  I think there is a useful distinction: a problem is something that is known to be broken, that will definitely cause service interruptions if not fixed; a “clear and present danger”.  Risk management is something much broader, of which problems are a subset.  The existence of a distinct problem management practice gives that practice the focus it needs to address the immediate and certain risks.

(Risk management is an essential practice that ITIL – strangely – does not even recognise as a distinct practice; the 2011 edition of ITIL’s Continual Service Improvement book attempts to plug this hole.  COBIT does include risk management, big time.  USMBOK does too, though in its own distinctive way it lumps risk management under Customer services; I disagree: there are risks to our business too that don’t affect the customer.)

So risk management and problem management aren’t the same thing.  Risk management and improvement aren’t the same thing either.  CSI is about improving the value (quality) as well as reducing the risks.

To summarise all that: problem management is part of risk management which is part of service improvement.

Service Portfolio and Change

Now for another piece of the puzzle.  Service Portfolio practice is about deciding on new services, improvements to services, and retirement of services.  Portfolio decisions are – or should be – driven by business strategy: where we want to get to, how we want to approach getting there, what bounds we put on doing that.

Portfolio decisions should be made by balancing value and risk.  Value is benefits minus costs.  There is a negative benefit and a set of risks associated with the impact on existing services of building a new service: there is the impact of the project dragging people and resources away from production, and the ongoing impact of increased complexity, the draining of shared resources, and so on.  So portfolio decisions need to be made holistically, in the context of both the planned and live services.  And in the context of retired services too: “tell me again why we are planning to build a new service that looks remarkably like the one we killed off last year?”.  A lot of improvement is about capturing the lessons of the past.
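To illustrate the balancing act – the services, numbers and risk weighting below are invented for the example, not a prescribed formula – a back-of-envelope portfolio scoring might look like this:

```python
# Illustrative sketch: value = benefits - costs, then balance value against
# risk across planned, live and even retired services in ONE portfolio view.

portfolio = [
    # name,                                   benefits, costs, risk (0-10), status
    ("New fracking-sand service",                  900,   600,   6, "planned"),
    ("El Paso yard expansion",                     500,   350,   3, "planned"),
    ("Keep existing coal traffic running",         400,   250,   2, "live"),
    ("Rebuild the service we killed off last year", 300,  280,   5, "proposed"),
]

RISK_WEIGHT = 20   # hypothetical: how much one point of risk offsets value

def score(item):
    name, benefits, costs, risk, status = item
    value = benefits - costs          # value is benefits minus costs
    return value - RISK_WEIGHT * risk # trade value off against risk

for item in sorted(portfolio, key=score, reverse=True):
    print(f"{item[0]:<45} score={score(item):>5}")
```

In reality the weighting is a governance judgement, not a constant in code; the sketch only shows that value and risk get traded off in one place, across everything.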

Portfolio management is a powerful technique that is applied at multiple levels.  Project and Programme Portfolio Management is all the rage right now, but it only tells part of the story.  Managing projects in programmes and programmes in portfolios only manages the changes that we have committed to make; it doesn’t look at those changes in the context of existing live services as well.  When we allocate resources across projects in PPM we are not looking at the impact on business-as-usual (BAU); we are not doling out resources across projects and BAU from a single pool.  That is what a service portfolio gives us: the truly holistic picture of all the effort in our organisation across change and BAU.

A balancing act

Service portfolio management is a superset of organisational change management.  Portfolio decisions are – or should be – decisions about what changes go ahead for new services and what changes are allowed to update existing services, often balancing them off against each other and against the demands of keeping the production services running.  “Sure the new service is strategic, but the risk of not patching this production server is more urgent and we can’t do both at once because they conflict, so this new service must wait until the next change window”.  “Yes, the upgrade to Windows 13 is overdue, but we don’t have enough people or money to do it right now because the new payments system must go live”.  “No, we simply cannot take on another programme of work right now: BAU will crumble if we try to build this new service before we finish some of these other major works”.

Or in railroad terms: “The upgrade to the aging track through Cherry Valley must wait another year because all available funds are ear-marked for a new container terminal on the West Coast to increase the China trade”.  “The NTSB will lynch us if we don’t do something about Cherry Valley quickly.  Halve the order for the new double-stack container cars”.

Change is service improvement

Everything we change is service improvement. Why else would we do it?  If we define improvement as increasing value or reducing risk, then everything we change should be to improve the services to our customers, either directly or indirectly.

Therefore our improvement programme should manage and prioritise all change.  Change management and service improvement planning are one and the same.

So organisational change management is CSI. They are looking at the beast from different angles, but it is the same animal.  In generally accepted thinking, organisational change practice tends to be concerned with the big chunky changes and CSI tends to be focused more on the incremental changes.  But try to find the demarcation between the two.   You can’t decide on major change without understanding the total workload of changes large and small.  You can’t plan a programme of improvement work for only minor improvements without considering what major projects are planned or happening.

In summary, change/CSI is one part of service portfolio management, which also considers delivery of BAU live services.  A railroad will stop doing minor sleeper (tie) replacements and other track maintenance when they know they are going to completely re-lay or re-locate the track in the near future.  After decades of retreat, railroads in the USA are investing in infrastructure to meet a coming boom (China trade, ethanol madness, looming shortage of truckers); but they better beware not to draw too much money away from delivering on existing commitments, and not to disrupt traffic too much with major works.

Simplifying service change

ITIL as it is today seems to have a messy complicated story about change.  We have a whole bunch of different practices all changing our services, from  Service Portfolio to Change Management to Problem Management to CSI.  How they relate to each other is not entirely clear, and how they interact with risk management or project management is undefined.

There are common misconceptions about these practices.  CSI is often thought of as “twiddling the knobs”, fine-tuning services after they go live.  Portfolio management is often thought of as being limited to deciding what new services we need.  Risk management is seen as just auditing and keeping a list.  Change Management can mean anything from production change control to organisational transformation depending on who you talk to.

It is confusing to many.  If you agree with the arguments in this article then we can start to simplify and clarify the model:

Rob England: ITSM Model
I have added in Availability, Capacity, Continuity, Incident and Service Level Management practices as sources of requirements for improvement.  These are the feedback mechanisms from operations.  In addition, the strategy, portfolio and request practices are sources of new improvements.  I’ve also placed the operational change and release practices in context.

These are merely  the thoughts of this author.  I can’t map them directly to any model I recall, but I am old and forgetful.  If readers can make the connection, please comment below.

Next time we will look at the author’s approach to CSI, known as Tipu.

Image credit: © tycoon101 – Fotolia.com

Root Cause – Railways don't like derailments

Most readers have got the story now from my recent articles: Cherry Valley, Illinois, 2009, rain bucketing down, huge train-load of ethanol derails, fire, death, destruction.

Eventually the Canadian National’s crews and the state’s emergency services cleaned up the mess, and CN rebuilt the track-bed and the track, and trains rolled regularly through Cherry Valley again.

Then the authorities moved in to find out what went wrong and to try to prevent it happening again.  In this case the relevant authority is the US National Transportation Safety Board (NTSB).

Keep asking why?

We have a fascination with finding a root cause - chances are that there is not just one

Every organisation should have a review process for requests, incidents, problems and changes, with some criteria for triggering a review.

In this case it was serious enough that an external agency reviewed the incident.  The NTSB had a good look and issued a report.  Read it as an example of what a superb post-incident review looks like.  Some of our major IT incidents involve as much financial loss as this one and sadly some also involve loss of life.

IT has a fascination with “root cause”.  Root Cause Analysis (RCA) is a whole discipline in its own right.  The Kepner-Tregoe technique (ITIL Service Operation 2011 Appendix C) calls it “true cause”.

The rule of thumb is to keep asking “Why?” until the answers aren’t useful any more, then that – supposedly – is your root cause.

This belief in a single underlying cause of things going wrong is a misguided one.  The world doesn’t work that way – it is always more complex.

The NTSB found a multitude of causes for the Cherry Valley disaster.   Here are just some of them:

  • It was extreme weather
  • The CN central rail traffic controller (RTC) didn’t put out a weather warning to the train crew which would have made them slow down, although required to do so and although he was in radio communication with the crew
  • The RTC did not notify track crews
  • The track inspector checked the area at 3pm and observed no water build-up
  • Floodwater washed out a huge hole in the track-bed under the tracks, leaving the rails hanging in the air.
  • Railroads normally post their contact information at all grade crossings but the first citizen reporting the washout could not find the contact information at the crossing where the washout was, so he called 911
  • The police didn’t communicate well with CN about the washed out track: they first alerted two other railroads
  • There wasn’t a well-defined protocol for such communication between police and CN
  • Once CN learned of the washout they couldn’t tell the RTC to stop trains because his phone was busy
  • Although the train crew saw water up to the tops of the rails in some places they did not slow down of their own accord
  • There was a litany of miscommunication between many parties in the confusion after the accident
  • The federal standard for ethanol cars didn’t require them to be double-skinned or to have puncture-proof bulkheads (it will soon: this tragedy triggered changes)
  • There had been a previous washout at the site and a 36” pipe was installed as a relief drain for flooding.  Nobody calculated what size pipe was needed and nobody investigated where the flood water was coming from.  After the washout the pipe was never found.
  • The county’s storm-water retention pond upstream breached in the storm.  The storm retention pond was only designed to handle a “ten year storm event”.
  • Local residents produced photographic evidence that the berm and outlet of the pond had been deteriorating for several years beforehand.

OK you tell me which is the “root cause”

Causes don’t normally arrange themselves in a nice tree all leading back to one. There are several fundamental contributing causes. Anyone who watches Air Crash Investigation knows it takes more than one thing to go wrong before we have an accident.

Sometimes one of them stands out like the proverbial. So instead of calling it root cause I’m going to call it primary cause. Sure the other causes contributed but this was the biggest contributor, the primary.

Ian Clayton once told me that root cause …er… primary cause analysis is something you do after the fact as part of the review and wash-up. In the heat of the crisis who gives a toss what the primary cause is – remove the most accessible of the causes. Any disaster is based on multiple causes, so removing any one cause of an incident will likely restore service.  Then when we have time to consider what happened and take steps to prevent a recurrence, we should probably try to address all causes.  Don’t do Root Cause Analysis, just do Cause Analysis, seeking the multiple contributing causes.  If we need to focus efforts then the primary cause is the one to address, which implies that a key factor in deciding primacy is how broad the potential is for causing more incidents.
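For illustration only – the causes are drawn loosely from the NTSB list above and the “breadth of potential” scores are invented – choosing a primary cause that way can be as simple as:

```python
# Illustrative sketch: address ALL contributing causes, but if effort must be
# focused, rank them by how broad their potential is to cause further incidents.

causes = {
    # cause: rough (hypothetical) estimate of how many future incidents it could contribute to
    "Weather warnings not reliably relayed by dispatchers": 40,
    "Undersized drainage pipe at one location":              3,
    "Single-skinned ethanol tank cars (fleet-wide)":         60,
    "No posted railroad contact number at the crossing":     15,
}

ranked = sorted(causes.items(), key=lambda kv: kv[1], reverse=True)

print("Address all of:", [cause for cause, _ in ranked])
print("Primary (broadest potential):", ranked[0][0])
```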

All complex systems are broken

It is not often you read something that completely changes the way you look at IT. This paper How Complex Systems Fail rocked me. Reading this made me completely rethink ITSM, especially Root Cause Analysis, Major Incident Reviews, and Change Management.  You must read it.  Now.  I’ll wait.

It says that all complex systems are broken.   It is only when the broken bits line up in the right way that the system fails.

It dates from 1998!  Richard Cook is a doctor, an MD.  He seemingly knocked this paper off on his own.  It is a whole four pages long, and he wrote it with medical systems in mind.  But that doesn’t matter: it is deeply profound in its insight into any complex system and it applies head-on to our delivery and support of IT services.

“Complex systems run as broken systems”

“Change introduces new forms of failure”

“Views of ‘cause’ limit the effectiveness of defenses against future events… likelihood of an identical accident is already extraordinarily low because the pattern of latent failures changes constantly.”

“Failure free operations require experience with failure.”

Many times the person “to blame” for a primary cause was just doing their job. All complex systems are broken. Every day the operators make value judgements and risk calls. Sometimes they don’t get away with it. There is a fine line between considered risks and incompetence – we have to keep that line in mind. Just because they caused the incident doesn’t mean it is their fault. Think of the word “fault” – what they did may not have been faulty, it may just be what they have to do every day to get the job done. Too often, when they get away with it they are considered to have made a good call; when they don’t they get crucified.

That’s not to say negligence doesn’t happen.   We should keep an eye out for it, and deal with it when we find it.  Equally we should not set out on cause analysis with the intent of allocating blame.   We do cause analysis for the purpose of preventing a recurrence of a similar Incident by removing the existing Problems that we find.

I will close by once again disagreeing with ITIL’s idea of Problem Management.  As I said in my last article, pro-active Problem Management is not about preventing problems from occurring, or preventing causes of service disruption from entering the environment, or making systems more robust or higher quality.

It is overloading Problem Management to also make it deal with “How could this happen?” and “How do we prevent it from happening again?”  That is dealt with by Risk Management (an essential practice that ITIL does not even recognise) feeding into Continual Service Improvement to remove the risk.  The NTSB were not doing Problem Management at Cherry Valley.

Next time we will look at continual improvement and how it relates to problem prevention.

Image credit – © Van Truan – Fotolia.com

Rob England: Proactive Problem Management

Just because you rebuild the track doesn’t mean the train won’t derail again.

Rebuilding the track was reactive Problem Management

We have been looking in past articles at the tragic events in little Cherry Valley, Illinois in 2009.  One person died and several more were seriously injured when a train-load of ethanol derailed at a level crossing. We talked about the resulting Incident Management, which focused on customers, trains and cargo – ensuring the services still operated, employing workarounds. Then we considered the Problem Management: the injured people and the wreck and the broken track – removing the causes of service disruption, restoring normal service.

A Problem is a problem, whether it has caused an Incident yet or not

In a previous article I said ITIL has an odd definition of Problem.  ITIL says a Problem is the cause of “one or more incidents”.   ITIL promotes proactive (better called pre-emptive) Problem Management, and yet apparently we need to wait until something causes at least one Incident before we can start treating it as a Problem.  I think the washout in Cherry Valley was a problem long before train U70691-18 barrelled into town.  A Problem is in fact the cause of zero or more Incidents.  A Problem is a problem, whether it has caused an Incident yet or not.

We talked about how I try to stick to a nice crisp simple model of Incident vs. Problem.  To me, an incident is an interruption to service and a problem is an underlying (potential) cause of incidents.  Incident Management is concerned with the restoration of expected levels of service to the users.  Problem Management is concerned with removing the underlying causes.

ITIL doesn’t see it that crisply delineated: the two concepts are muddied together.  ITIL – and many readers – would say that putting out the fires, clearing the derailed tankers, rebuilding the roadbed, and relaying the rails can be regarded as part of the Incident resolution process because the service isn’t “really” restored until the track is back.

Problems can be resolved with urgency

In the last article I said this thinking may arise because of the weird way ITIL defines a Problem.  I have a hunch that there is a second reason: people consider removing the cause of the incident to be part of the Incident because they see Incident=Urgent, Problem=Slow.  They want the Incident Manager and the Service Desk staff to hustle until the cause is removed.  This is just silly.   There is no reason why Problems can’t be resolved with urgency.  Problems should be categorised by severity and priority and impact just like Incidents are.  The Problem team should go into urgent mode when necessary to mobilise resources, and the Service Desk are able to hustle the Problem along just as they would an Incident.

This inclusion of cause-removal over-burdens and de-focuses the Incident Management process.  Incident Management should have a laser focus on the user and by implication the customer.  It should be performed by people who are expert at serving the user.  Its goal is to meet the user’s needs.   Canadian National’s incident managers were focused on getting deliveries to customers despite a missing bit of track.

Problem Management is about fixing faults.  It is performed by people expert at fixing technology.  The Canadian National incident managers weren’t directing clean-up operations in Cherry Valley: they left that to the track engineers and the emergency services.

Problem management is a mess

But the way ITIL has it, some causes are removed as part of Incident resolution and some are categorised as Problems, with the distinction being unclear (“For some incidents, it will be appropriate…” ITIL Service Operation 2011 4.2.6.4).  The moment you make Incident Management responsible for sometimes fixing the fault as well as meeting the user’s needs, you have a mashup of two processes, with two sometimes-conflicting goals, and performed by two very different types of people.  No wonder it is a mess.

It is a mess from a management point of view when we get a storm of incidents.  Instead of linking all related incidents to an underlying Problem, we relate them to some “master incident” (this isn’t actually in ITIL but it is common practice).

It is a mess from a prioritisation point of view.   The poor teams who fix things are now serving two processes:  Incident and Problem.  In order to prioritise their work they need to track a portfolio of faults that are currently being handled as incidents and faults that are being handled as problems, and somehow merge a holistic picture of both.  Of course they don’t.   The Problem Manager doesn’t have a complete view of all faults nor does the Incident Manager, and the technical teams are answerable to both.

It is a mess from a data modelling point of view as well.  If you want to determine all the times that a certain asset broke something, you need to look for incidents it caused and problems it caused.

Every cause of a service impact (or potential impact) should be recorded immediately as a problem, so we can report and manage them in one place.
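As a minimal sketch of what that “one place” could look like – the record structure here is hypothetical, not an ITIL or tool-vendor schema – every fault becomes a Problem record linked to zero or more Incidents:

```python
# Hypothetical sketch: every fault is recorded as a Problem the moment we know
# about it, linked to zero or more Incidents it has caused. Then "show me every
# time this asset broke something" is ONE query, not a trawl of two registers.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Problem:
    id: str
    asset: str                 # the thing that is (or was) broken
    description: str
    incident_ids: List[str] = field(default_factory=list)  # zero or more

problems = [
    Problem("P1", "track-drainage", "Culvert too small for storm runoff", ["I1001"]),
    Problem("P2", "dispatch-procedure", "Weather alerts not always relayed", ["I1001"]),
    Problem("P3", "track-drainage", "Standing water reported, no incident yet"),
]

def fault_history(asset: str):
    """All recorded faults for an asset, whether or not they have caused incidents yet."""
    return [p for p in problems if p.asset == asset]

for p in fault_history("track-drainage"):
    print(p.id, "-", p.description, "-> incidents:", p.incident_ids or "none yet")
```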

All that tirade is by way of introducing the idea of reactive and proactive Problem Management.

Cherry Valley needed Reactive Problem Management

Reactive Problem Management responds to an incident to remove the cause of the disruption to service.  The ITIL definition is more tortuous because it treats “restoring the service” as Incident Management’s job, but it ends up saying a similar thing: “Reactive problem management is concerned with solving problems in response to one or more incidents” (SO 2011 4.4.2).

Pro-active Problem Management fixes problems that aren’t currently causing an incident to prevent them causing incidents (ITIL says “further” incidents).

So cleaning up the mess in Cherry Valley and rebuilding the track was reactive Problem Management.

Once the trains were rolling they didn’t stop there.  Clearly there were some other problems to address.  What caused the roadbed to be washed away in the first place?  Why did a train thunder into the gap at normal track speed?  Why did the tank-cars rupture and how did they catch fire?

Find the problems that need fixing

In Cherry Valley, the drainage was faulty.  Water was able to accumulate behind the railway roadbed embankment, causing flooding and eventually overflowing the roadbed, washing out below the track, leaving rails dangling in the air.  The next time there was torrential rain, it would break again.  That’s a problem to fix.

Canadian National’s communication processes were broken.  The dispatchers failed to notify the train crew of a severe weather alert, which they were supposed to do.  If they had, the train would have operated at reduced speed.  That’s a problem to fix.

The CN track maintenance processes worked, perhaps lackadaisically, but they worked as designed.  The processes could have been a lot better, but were they broken?  No.

The tank cars were approved for transporting ethanol.  They were not required to be equipped with head shields (extra protection at the ends of the tank to resist puncturing), jackets, or thermal protection.  In March 2012 the US National Transportation Safety Board (NTSB) recommended (R-12-5) “that all newly manufactured and existing general service tank cars authorized for transportation of denatured fuel ethanol … have enhanced tank head and shell puncture resistance systems”.  The tank cars weren’t broken (before the crash).  This is not fixing a problem; it is improving the safety to mitigate the risk of rupture.

Proactive Problem Management prevents the recurrence of Incidents

I don’t think pro-active Problem Management is about preventing problems from occurring, or preventing causes of service disruption from entering the environment, or making systems more robust or higher quality.  That is once again over-burdening a process.  If you delve too far into preventing future problems, you cross over into Availability and Capacity and Risk Management and Service Improvement, (and Change Management!), not Problem Management.

ITIL agrees: “Proactive problem management is concerned with identifying and solving problems and known errors before further incidents related to them can occur again”.   Proactive Problem Management prevents the recurrence of Incidents, not Problems.

In order to ensure that incidents will not recur, we need to dig down to find all the underlying causes.  In many methodologies we go after that mythical beast, the Root Cause.  We will talk about that next time.

Image credit

Rob England: Problem Management Defined

Railways (railroads) remind us of how the real world works.

In our last article, we left Cherry Valley, Illinois in its own little piece of hell.

For those who missed the article, in 2009 a Canadian National railroad train carrying eight million litres of ethanol derailed at a level crossing in the little town of Cherry Valley after torrential rain washed out the roadbed beneath the track. 19 tankers of ethanol derailed, 13 of them split or spilled, and the mess somehow caught fire in the downpour.

One person in the cars waiting at the crossing died and several more were seriously injured.

Incidents vs. Problems

In that previous article we looked at the Incident Management. As I said then, an incident is an interruption to service and a problem is an underlying cause of incidents. Incident Management is concerned with the restoration of expected levels of service to the users. Problem Management is concerned with removing the underlying causes. I also mentioned that ITIL doesn’t see it that crisply delineated. Anyway, let us return to Cherry Valley…

One group of people worked inside office buildings making sure the trains kept rolling around the obstruction so that the railroad met its service obligations to its users. This was the Incident Management practice: restoring service to the users, focusing on perishable deliveries such as livestock and fruit.

Another group thrashed around in the chaos that was Cherry Valley, trying to fix a situation that was very very broken. Their initial goal was containment: save and treat people in vehicles, evacuate surrounding houses, stop the fire, stop the spills, move the other 100 tank-cars of ethanol away, get rid of all this damn flooding and mud.

The Shoo-fly

The intermediate goal was repair and restore: get trains running again. Often this is done with a “shoo-fly”: a temporary stretch of track laid around the break, which trains inch gingerly across whilst more permanent repairs are effected. This is not a Workaround as we use the term in ITSM. The Workaround was to get trains onto alternate routes or pass freight to other companies. A shoo-fly is temporary infrastructure: it is part of the problem fix just as a temporary VM server instance would be. While freight ran on other roads or on a shoo-fly, they would crane the derailed tankers back onto the track or cart them away, then start the big job of rebuilding the road-base that had washed away – hopefully with better drains this time – and relaying the track. Compared to civil engineering our IT repairs look quick, and certainly less strenuous.

Which brings us to the longer-term goal: permanent remediation of the problem. Not only does the permanent fix include new rail roadbed and proper drainage; the accident report makes it clear that CN’s procedures and communications were deficient as well. Cherry Valley locals were calling 911 an hour beforehand to report the wash-out.

Damage Limitation

We will talk more about the root causes and long term improvement later. Let’s stay in Cherry Valley for now. It is important to note that the lives and property the emergency responders were saving were unconnected to the services, users or customers of the railroad. All the people working on all these aspects of the problem had only a secondary interest in the timeliness of pigs and oranges and expensive petrol. They were not measured on freight delivery times: they were measured on speed, quality and permanence of the fix, and prevention of any further damage.

If you read the books and listen to the pundits you will get more complex models that seem to imply that everything done until trains once more rolled smoothly through Cherry Valley is Incident Management. I beg to differ. To me it is pretty clear: Incident and Problem practices are delineated by different activities, teams, skills, techniques, tools, goals and metrics. Incident: user service levels. Problem: causes.

While I am arguing with ITIL definitions, let’s look at another aspect of Incidents. ITIL says that something broken is an Incident if it could potentially cause a service interruption in future. Once again this ignores the purpose, roles, skills and tools of Incident Management and Problem Management. Such a fault is clearly a Problem, a (future) cause of an Incident.

(Incidentally, it is hard to imagine many faults in IT that aren’t potentially the cause of a future interruption or degradation of service. If we follow this reasoning to its absurd conclusion, every fault is an incident and nothing is a problem).

Perhaps one reason ITIL hangs these “potential incidents” where it does is because of another odd definition: ITIL says a Problem is the cause of “one or more incidents”. What’s odd about that? ITIL promotes pro-active (better called pre-emptive) problem management, and yet apparently we need to wait until something causes at least one incident before we can start treating it as a problem. I think the washout in Cherry Valley was a problem long before train U70691-18 barrelled into town. (Actually ITIL lost proactive problem management from ITIL V3 but it was hastily restored in ITIL 2011).

Human Eyeball

One of my favourite railroad illustrations is about watching trains. When a train rolls by, keep an eye on nearby staff: those on platforms, down by the track, on waiting trains. On most railroads, staff will stop what they are doing and watch the train – the whole train, watching until it has all gone by. In the old days they would wave to the guard (conductor) on the back of the train. Nowadays they may say something to the driver via radio.

Laziness? Sociability? Railfans? Possibly. But quite likely it is part of their job – it may well be company policy that everybody watches every passing train. The reason is visual inspection. Even in these days of radio telemetry from the FRED (Flashing Rear End Device, a little box on the back that replaces the caboose/guard’s van of old) and track-side detectors for cracked wheels and hotboxes (overheating bearings), there is still no substitute for the good old human eyeball for spotting anything from joyriders to dragging equipment. It is everyone’s responsibility to watch and report: not a bad policy in IT either.

What they are spotting are Problems. The train is still rolling so the service hasn’t been interrupted … yet.

Other Problems make themselves known by interrupting the service. A faulty signal stops a train. In the extreme case the roadbed washes away. We can come up with differing names for things that have and haven’t interrupted/degraded service yet, but I think that is arguing about angels dancing on pinheads. They are all Problems to me: the same crews of people with heavy machinery turn out to fix them while the trains roll by delivering they care not what to whom. Oh sure, they have a customer focus: they care that the trains are indeed rolling and on time, but the individual service levels and customer satisfaction are not their direct concern. There are people in cozy offices who deal with the details of service levels and incidents.

Next time we will return to the once-again sleepy Cherry Valley to discuss the root causes of this accident.

Rob England: Incident Management at Cherry Valley, Illinois

It had been raining for days in and around Rockford, Illinois that Friday afternoon in 2009, some of the heaviest rain locals had ever seen. Around 7:30 that night, people in Cherry Valley – a nearby dormitory suburb – began calling various emergency services: the water that had been flooding the road and tracks had broken through the Canadian National railroad’s line, washing away the trackbed.

An hour later, in driving rain, freight train U70691-18 came through the level crossing in Cherry Valley at 36 m.p.h., pulling 114 cars (wagons) mostly full of fuel ethanol – 8 million litres of it – bound for Chicago. Although ten cross-ties (sleepers) dangled in mid air above running water just beyond the crossing, somehow two locomotives and about half the train bounced across the breach before a rail weld fractured and cars began derailing. As the train tore in half the brakes went into emergency stop. 19 ethanol tank-cars derailed, 13 of them breaching and catching fire.

In a future article we will look at the story behind why one person waiting in a car at the Cherry Valley crossing died in the resulting conflagration, 600 homes were evacuated and $7.9M in damages were caused.

Today we will be focused on the rail traffic controller (RTC) who was the on-duty train dispatcher at the CN‘s Southern Operations Control Center in Homewood, Illinois. We won’t be concerned for now with the RTC’s role in the accident: we will talk about that next time. For now, we are interested in what he and his colleagues had to do after the accident.

While firemen battled to prevent the other cars going up in what could have been the mother of all ethanol fires, and paramedics dealt with the dead and injured, and police struggled to evacuate houses and deal with the road traffic chaos – all in torrential rain and widespread surface flooding – the RTC sat in a silent heated office 100 miles away watching computer monitors. All hell was breaking loose there too. Some of the heaviest rail traffic in the world – most of it freight – flows through and around Chicago; and one of the major arteries had just closed.

Back in an earlier article we talked about the services of a railroad. One of the major services is delivering goods, on time. Nobody likes to store materials if they can help it: railroads deliver “just in time”, such as giant ethanol trains, and the “hotshot” trans-continental double-stack container trains with nine locomotives that get rail-fans like me all excited. Some of the goods carried are perishables: fruit and vegetables from California, stock and meat from the midwest, all flowing east to the population centres of the USA.

The railroad had made commitments regarding the delivery of those goods: what we would call Service Level Targets. Those SLTs were enshrined in contractual arrangements – Service Level Agreements – with penalty clauses. And now trains were late: SLTs were being breached.

A number of RTCs and other staff in Homewood switched into familiar routines:

  • The US rail network is complex – a true network. Trains were scheduled to alternate routes, and traffic on those routes was closed up as tightly bunched together as the rules allowed to create extra capacity.
  • Partner managers got on the phone to the Union Pacific and BNSF railroads to negotiate capacity on their lines under reciprocal agreements already in place for situations just such as this one.
  • Customer relations staff called clients to negotiate new delivery times.
  • Traffic managers searched rail yard inventories for alternate stock of ethanol that could be delivered early.
  • Crew managers told crews to pick up their trains in new locations and organised transport to get them there.

Fairly quickly, service was restored: oranges got squeezed in Manhattan, pigs and cows went to their deaths, and corn hootch got burnt in cars instead of all over the road in Cherry Valley.

This is Incident Management.

None of it had anything to do with what was happening in the little piece of hell that Cherry Valley had become. The people in heavy waterproofs, hi-viz and helmets, splashing around in the dark and rain, saving lives and property and trying to restore some semblance of local order – that’s not Incident Management.

At least I don’t think it is. I think they had a problem.

An incident is an interruption to service and a problem is an underlying cause of incidents. Incident Management is concerned with the restoration of expected levels of service to the users. Problem Management is concerned with removing the underlying causes.

To me that is a simple definition that works well. If you read the books and listen to the pundits you will get more complex models that seem to imply that everything done until trains once more rolled smoothly through Cherry Valley is Incident Management. I beg to differ. If the customer gets steak and orange juice then Cherry Valley could be still burning for all they care: Incident Management has met its goals.

Image Credit

Rob England: The People in ITSM

Maori proverb: "He aha te mea nui? He tangata. He tangata. He tangata." What is the most important thing? It is people, it is people, it is people.

It’s all about the people.

A service exists to serve people.  It is built and delivered by people.

Even in the most technical domains like IT, the service is about managing information for people to use, and managing the way they use it.

When we change IT, a lot of the time we are ultimately changing the way people behave, the way they do things.

There is an old mantra “People, Process, Technology” to which I always add “…in that order”.  By which I mean prioritise in that order, and start in that order.

People, Practices, Things.

Actually I don’t like that mantra; I prefer “People, Practices, Things” as a broader, more inclusive set.  Either way, it all starts with people.

We’ve been using railways (railroads) as examples for this series of articles.  Ask a railway how important the people are.  Railways are complex and very dynamic: they need to adapt to constantly changing conditions, on a daily basis and across the decades.  We are slowly getting to the point where we can automate the running of railways, but only because the trains run in a tightly designed, constructed and operated environment that relies on people to make it work and keep it safe.  Much like IT.

I’ve never bought into this feel-good stuff about successful companies being dependent on a caring people culture.  Some of the most successful railroads in the ultra-competitive US environment have pretty rough people cultures – they treat their staff like cattle.  And other railroads are good to their people – though most of them sit toward what we would consider the tough end of the spectrum.  I don’t think it correlates.  I could say the same about software companies I have worked for: from second family to sweatshop.

However it is probably true that all successful companies have a strong culture.  Staff know how it works.  They may or may not like the culture but if it is strong they identify with it and align to it, to some extent.  So culture is important.

And cultural change is hard – in fact it is a black art.  The bad news is that changing the way people behave – remember our first paragraph? – is cultural change.  Behaviours only change permanently if the underlying attitudes change.  And people’s attitudes only change if their beliefs move at least a little bit.  Culture change.  Fifty years ago railroads were places where men – all men – died regularly, learned on the job, and fought as hard as they worked.  Now the people are trained professionals and the first priority of most railroads is safety.  Twenty years ago the New Zealand Railways had 56,000 employees, couldn’t move anything without losing it, lost millions, and wouldn’t know what to do with a container.  Now 11,000 move record volumes of freight and do it profitably.

“Just because you can change software in seconds doesn’t mean organisational change happens like that”

You can’t make those transformations in short timeframes.  Just because you can change software in seconds doesn’t mean organisational change happens like that.  You would think railroads take longer to change hardware technology than we do in IT because it is all big chunky stuff, but really our hardware and software platforms change at about the same pace; years and even decades.   Plenty of Windows XP still around.

Technology is the fast changer compared to people and process.  Just because you rolled out a flash new technology doesn’t mean people are going to use it any differently, unless you ensure that they change and their processes change.  That human rate of change is slow.  People will change quickly in response to external pressures like war or threatening managers, but the change won’t stick until their attitudes and beliefs shift.  I bet the safety culture on US railroads took at least one generational cycle to really embed.

In response to a few high-profile crashes, governments in the US, UK and other places have mandated the introduction of higher levels of automation in train control over recent decades (despite the much higher carnage on the roads but that’s another discussion).  Much of this push for automation stems from frustration over driving change in behaviours.  Does any of this remind you of IT initiatives like DevOps?

Culture can change, and sometimes it can change quite quickly, by human standards.  It takes strong and motivational leadership, a concerted programme, and some good cultural-change science.  The leading set of ideas is John Kotter’s Eight Steps to change, but there are many other ideas and models in this area now.  In IT, everyone should read Balanced Diversity by Karen Ferris.  And you will find a multitude of suggestions on my free site He Tangata.

Whatever methods you use for change, pay attention to three aspects:  motivation, communication and development.

Motivate them in these ways:

  1. by getting them involved and consulted;
  2. by showing how they benefit from the change;
  3. by making them accountable and measuring that accountability;
  4. and by incenting them.

Communicate early, communicate often, and be as transparent about decision-making as you can.  Tough decisions are more palatable if people understand why.  Communication is two-way: consult, solicit feedback (including anonymous), run workshops and town-halls.

Development is not just one training course.   Training should be followed up, refreshed, and repeated for new entrants. Training is not enough: practical workshops, on-the-job monitoring, coaching support, local super-users and many other mechanisms all help people learn what they need to make change successful.

One final thought: examine your current and planned IT projects, and work out how much effort and money is being spent on the people aspects of the changes the project wants to achieve.  I’d love to see some comparative research on the proportions of money spent on people aspects of projects in different industries like railroading, because we in IT still seem to suffer the delusion that we work with information and technology.

Rob England: What is a Technical Service Catalogue?

Amtrak 14th Street Coach Yard (Chicago, IL, US): track gangs, dispatchers and yard crews at work – internal functions of a railway, not services provided to its customers.

We are looking at railways (railroads) as a useful case study for talking about service management.

Last time we looked at the service catalogue of a railway.

We concluded that first and foremost, a service catalogue describes what a service provider does.

How often and what flavour are only options to a service or package of services.

ITIL refers to a technical service catalogue (TSC).  Where does that fit?

One thing everyone agrees on is the audience: a TSC is for the internal staff of the service provider, to provide them with supplementary information about services – technical information – that the customers and users don’t need to see.

But the scope of a TSC – what services go into it – is a source of much debate, which can be crudely categorised into two camps:

  1. TSC is a technical view of the service catalogue
  2. TSC is a catalogue of technical services

Those are two very different things.  Let me declare my position up front: I believe the answer is #1, a technical view of the service catalogue.  ITIL V3 was ambiguous but ITIL 2011 comes down clearly with #2.  This is unfortunate, as we’ll discuss.

Go back to what a service catalogue is: a description of what a service provider provides to their customers (and hence their users).  A good way of thinking of a service in this context is as something that crosses a boundary: we treat the service provider as a black box, and the services are what come out of that box.  A service catalogue is associated with a certain entity, and it describes the services that cross the boundary of that entity.  If they don’t come out of the box, they aren’t services for that entity; it all depends on where we choose to draw the boundary.  To define what the services are, first define the boundary of the service provider.

Think of our railroad example from last time.  A railway’s service catalogue is some or all of:

  • Container transport
  • Bulk goods transport (especially coal, stone and ore)
  • Less-than-container-load (parcel) transport
  • Priority and perishables transport (customers don’t send fruit as regular containers or parcels: they need it cold and fast)
  • Door-to-door (trucks for the “last mile”)
  • Livestock transport
  • Passenger transport
  • etc etc

A railway provides other functions:

  • track gangs who maintain the trackwork
  • dispatchers who control the movement of trains
  • yard crews who shuffle and shift rolling stock within the yard limits
  • hostlers who prepare and park locomotives

It is clear that these are not services provided by the railway to its customers.  They are internal functions.

A railway provides track, rolling stock, tickets and stations, but these aren’t services either: they are equipment to support and enable the services.

A passenger railway provides

  • on train security
  • ticket collectors
  • porters
  • dining car attendants
  • passenger car cleaners

and a freight railway provides

  • container loading
  • consignment tracking
  • customs clearance
  • waybill paperwork

These all touch the user or customer, so are these services?  Not unless the customer pays for them separately as services or options to services.  In general these systems are just components of a service which the customer or user happens to be able to see.

So why then do some IT people insist that a technical service catalogue should list such “services” as networks, security or AV? (ITIL calls these “supporting services”). If the networks team wants to have their own catalogue of the services that only they provide, then they are drawing their own boundary around just their function, in which case it is not part of a technical service catalogue for all of IT, it is a service catalogue specifically for the networking team.  It is not a service provided by IT to the customer.

A technical service catalogue should be a technical view of the same set of services as any other type of service catalogue for the particular entity in question.   The difference is that it provides an internal technical view of the services, with additional information useful to technical staff when providing or supporting the services.  It includes information a customer or user doesn’t want or need to see.

A technical service catalogue for a railway would indeed refer to tickets and porters and stations and yard procedures and waybills, but only as components of the services provided – referred to within the information about those services – not listed as services in their own right.  I’m all for “supporting services” being described within a service catalogue, but not as services.  They are part of the information about a service.  Supporting services aren’t services: they are component systems – CIs – underpinning the real services we deliver to our customers.
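
As a sketch of what I mean, here is how a single catalogue entry might carry both views. The structure and field names (technical_view, supporting_cis) are my own illustration, not anything ITIL prescribes: the same services appear in every view, and the supporting pieces hang off them as components.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ServiceEntry:
    """One service in the catalogue – the unit is the same in every view."""
    name: str                                                      # what the customer buys
    description: str                                               # customer-facing wording
    technical_view: Dict[str, str] = field(default_factory=dict)   # extra detail for internal staff
    supporting_cis: List[str] = field(default_factory=list)        # yard switching, networks... components, not services

catalogue = [
    ServiceEntry(
        name="Container transport",
        description="Movement of standard containers between terminals",
        technical_view={"routing": "via the hump yard", "rolling stock": "well cars"},
        supporting_cis=["track maintenance", "dispatching", "yard switching"],
    ),
]

# The customer view and the technical view list the same services;
# only the depth of information differs.
customer_view = [(s.name, s.description) for s in catalogue]
technical_view = [(s.name, s.technical_view, s.supporting_cis) for s in catalogue]
```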

By adopting the concept of “supporting services” and allowing these to be called services within the catalogue of a wider entity that does not provide these services to a customer, ITIL 2011 contradicts its own description of “service”.

Service Design 4.2.4.3 says:

Supporting services: IT services that support or ‘underpin’ the customer-facing services.  These are typically invisible to the customer… such as infrastructure services, network services, application services or technical services.

Yet in the prior section, 4.2.4.2, such IT systems are clearly not a service:

IT staff often confuse a ‘service’ as perceived by the customer with an IT system.  In many cases one ‘service’ can be made up of other ‘services’ and so on, which are themselves made up of one or more IT systems within an overall infrastructure…  A good starting point is often to ask customers what IT services they use and how those services map onto and support their business processes

And of course it contradicts the generic ITIL definition of a service as something that delivers value to customers.  This is important because the concept of “supporting service” allows internal units within the service provider to limit their care and concern to focus on the “supporting service” they provide and allows them to become detached from the actual services provided to the customer.  There is no SLA applicable to “their” service, and it quite likely isn’t considered by service level reporting.

A railway ticket inspector shouldn’t ignore security because that is not part of his “service”.  A yard hostler should make sure he doesn’t obstruct the expeditious handling of rolling stock when moving locomotives, even though rolling stock isn’t part of “his” service.  The idea of “supporting service” allows and encourages an “I’m alright Jack” mentality which goes against everything we are trying to achieve with service management.

It is possible that Lou Hunnebeck and the team writing Service Design agree with me: that they intend there to be a distinction between supporting services and IT systems.  If so, that distinction is opaque.  And they should have thought more about how the “internal” services model would be misused – the problem I’m describing was common before ITIL 2011.

There is the case where the supporting services really are services: provided to us by a third party in support of our services to our customer.  For example, a railway often pays another company:

  • to clean the carriages out
  • to provide food for the bistro car
  • to repair rolling stock
  • to provide the trucking over “the last mile”

Where we bundle these activities as part of our service to a customer and treat them as an Underpinning Contract, then from the perspective of the services in our service catalogue – i.e. from the perspective of our customer – these are not services: they are CIs that should not be catalogued here.  If this – and only this scenario – is what Service Design means by a “supporting service”, I can’t see it called out explicitly anywhere.

A technical service catalogue should be a technical view of the services that we provide to our customers.  I wish ITIL had stuck to that clear, simple model of a catalogue and kept IT focused on what we are there for.

Rob England: What is a Service Catalogue?

"The menu analogies we see all the time when talking about service catalogue are misleading. "
"The menu analogies we see all the time when talking about service catalogue are misleading. "

We are looking at railways (railroads) as a useful case study for talking about service management.

What is the service catalogue of a railway?

If you said the timetable then I beg to differ.  If you said one-trip, return and monthly tickets I don’t agree either.

The menu analogies we see all the time when talking about service catalogue are misleading.

A menu (or timetable) represents the retail consumer case: where the customer and the user are one.  In many scenarios we deal with in business, the one paying is not the one consuming.

The service catalogue describes what the customer can buy.  The request catalogue is what the user can request.  Consider a railroad cook-wagon feeding a track crew out in the wilds: the cook decides with the railroad what to serve; the staff get a choice of two dishes.

The cook’s services are:

  • Buying and delivering and storing ingredients
  • Mobile cooking and eating facilities
  • Cooking food
  • Serving food onsite

That is the service catalogue.  The railway can choose to buy some or all of those services from the caterer, or to go elsewhere for some of them.

The menu is a service package option to the “cooking food” service.  The railroad chooses the options based on cost and staff morale.  The menu gives staff the illusion of choice.

First and foremost, a service catalogue describes what a service provider does. How often and what flavour are only options to a service or package of services.  A railway’s service catalogue is some or all of:

  • Container transport
  • Bulk goods transport (especially coal, stone and ore)
  • Less-than-container-load (parcel) transport
  • Priority and perishables transport (customers don’t send fruit as regular containers or parcels: they need it cold and fast)
  • Door-to-door (trucks for the “last mile”)
  • Dangerous goods transport (the ethanol delusion generates huge revenues for US railroads)
  • Large loads transport (anything oversize or super heavy: huge vehicles, transformers, tanks…)
  • Livestock transport
  • Rolling-stock transport (railways get paid to deliver empty wagons back to their owners)
  • Finance (a railway can provide credit services to customers)
  • Ancillary freight services: customs clearance, shipping, security…
  • Passenger transport
  • Luggage
  • Lost luggage
  • Bicycles
  • Pet transport
  • Food and drink services (onboard and in stations)
  • Accommodation (big Indian stations all have dormitories and rooms)
  • Tours and entertainment (party trips, scenic trips, winery trips…)
  • Land cruises (just like a cruise ship but on rails)
  • Travel agency
  • Bulk goods storage (railroads charge companies to hold bulk materials in wagons for them awaiting demand: they provide a buffering service)
  • Rolling stock storage (in the USA railroads make money storing surplus freight wagons for other railroads)
  • Rolling stock repair (railways repair private equipment for the owners)
  • Private carriage transport (in many countries you can own your own railroad carriage and have the railway carry you around; a fantasy of mine)
  • Property rental (many large railways are significant landlords)
  • Land sales

Where’s the timetable or ticket pricing now?  It has such a small part in the true scope of a railway’s services as to be trivial.  More to the point, it is not a service: tickets are request options associated with one of many services.  Users don’t request a service: “I’d like an email please”. No, they make a request for an option associated with a service: provision email, increase mailbox, delete account, retrieve deleted email etc…
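
A minimal sketch of that relationship, using the email example above and a railway one; the structure is just my illustration of the idea – a service catalogue of services, with a request catalogue of options hanging off them – not any standard schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Service:
    """What the provider does – an entry in the service catalogue."""
    name: str
    request_options: List[str] = field(default_factory=list)  # what users can actually ask for

email = Service("Email", ["provision mailbox", "increase mailbox quota",
                          "delete account", "retrieve deleted email"])
passenger = Service("Passenger transport",
                    ["single ticket", "return ticket", "monthly pass", "bicycle carriage"])

# The service catalogue lists what we do; the request catalogue is the
# flattened list of options users can order against those services.
service_catalogue = [s.name for s in (email, passenger)]
request_catalogue = [(s.name, option)
                     for s in (email, passenger)
                     for option in s.request_options]
```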

People confuse their personal consumer experience with business: they try to apply consumer-experience models to business processing.  Most customers don’t want a service catalogue “to look like Amazon”.  They want meaningful business information, and the best vehicle for that is usually a text document.  The users and consumers of a service may want to see the requests associated with it in an Amazon-like interface.  Sometimes there may even be a valid business case for building them a groovy automated request catalogue, but it is not the service catalogue.

The service catalogue defines what we do.  It is not simply an ordering mechanism for customers.  That is that personal/business thing again.  A service catalogue has multiple functions.

  1. Yes it is a brochure for customers to choose from.
  2. It also provides a structure to frame what we do as a service provider: availability planning, incident reporting, server grouping… Once we have a catalogue we find ourselves bringing it up in diverse contexts: the list of services shows up in the table of contents of all sorts of documents.
  3. It is a reference to compare against when debating a decision.
  4. It is a benchmark to compare against when reporting (especially the service levels, but not only the service levels).
  5. It becomes a touchstone, a rallying point, an icon, a banner to follow.  It brings people back to why we are here and what we are for as an organisation.

You don’t get that from Amazon.

Then we come to that endless source of confusion and debate: technical service catalogue.  That deserves a whole discussion of its own, so we will look at it next…