Root Cause – Railways don't like derailments

Most readers have got the story now from my recent articles: Cherry Valley, Illinois, 2009, rain bucketing down, huge train-load of ethanol derails, fire, death, destruction.

Eventually the Canadian National’s crews and the state’s emergency services cleaned up the mess, and CN rebuilt the track-bed and the track, and trains rolled regularly through Cherry Valley again.

Then the authorities moved in to find out what went wrong and to try to prevent it happening again.  In this case the relevant authority is the US National Transportation Safety Board (NTSB).

Keep asking why?

We have a fascination with finding a root cause - chances are that there is not just one

Every organisation should have a review process for requests, incidents, problems and changes, with some criteria for triggering a review.

In this case it was serious enough that an external agency reviewed the incident.  The NTSB had a good look and issued a report.  Read it as an example of what a superb post-incident review looks like.  Some of our major IT incidents involve as much financial loss as this one and sadly some also involve loss of life.

IT has a fascination with “root cause”.  Root Cause Analysis (RCA) is a whole discipline in its own right.  The Kepner-Tregoe technique (ITIL Service Operation 2011 Appendix C) calls it “true cause”.

The rule of thumb is to keep asking “Why?” until the answers aren’t useful any more, then that – supposedly – is your root cause.

This belief in a single underlying cause of things going wrong is a misguided one.  The world doesn’t work that way – it is always more complex.

The NTSB found a multitude of causes for the Cherry Valley disaster.   Here are just some of them:

  • It was extreme weather
  • The CN central rail traffic controller (RTC) did not issue a weather warning to the train crew (which would have made them slow down), even though he was required to do so and was in radio communication with the crew
  • The RTC did not notify track crews
  • The track inspector checked the area at 3pm and observed no water build-up
  • Floodwater washed out a huge hole in the track-bed under the tracks, leaving the rails hanging in the air.
  • Railroads normally post their contact information at all grade crossings but the first citizen reporting the washout could not find the contact information at the crossing where the washout was, so he called 911
  • The police didn’t communicate well with CN about the washed out track: they first alerted two other railroads
  • There wasn’t a well-defined protocol for such communication between police and CN
  • Once CN learned of the washout they couldn’t tell the RTC to stop trains because his phone was busy
  • Although the train crew saw water up to the tops of the rails in some places they did not slow down of their own accord
  • There was a litany of miscommunication between many parties in the confusion after the accident
  • The federal standard for ethanol cars didn’t require them to be double-skinned or to have puncture-proof bulkheads (it will soon: this tragedy triggered changes)
  • There had been a previous washout at the site and a 36” pipe was installed as a relief drain for flooding.  Nobody calculated what size pipe was needed and nobody investigated where the flood water was coming from.  After the washout the pipe was never found.
  • The county’s storm-water retention pond upstream breached in the storm.  The storm retention pond was only designed to handle a “ten year storm event”.
  • Local residents produced photographic evidence that the berm and outlet of the pond had been deteriorating for several years beforehand.

OK, you tell me which one is the “root cause”.

Causes don’t normally arrange themselves in a nice tree all leading back to one. There are several fundamental contributing causes. Anyone who watches Air Crash Investigation knows it takes more than one thing to go wrong before we have an accident.

Sometimes one of them stands out like the proverbial. So instead of calling it root cause I’m going to call it primary cause. Sure the other causes contributed but this was the biggest contributor, the primary.

Ian Clayton once told me that root cause …er… primary cause analysis is something you do after the fact, as part of the review and wash-up. In the heat of the crisis, who gives a toss what the primary cause is – remove the most accessible of the causes. Any disaster is based on multiple causes, so removing any one of them will likely restore service.  Then, when we have time to consider what happened and take steps to prevent a recurrence, we should probably try to address all the causes.  Don’t do Root Cause Analysis, just do Cause Analysis, seeking the multiple contributing causes.  If we need to focus our efforts, the primary cause is the place to start – which implies that a key factor in deciding primacy is how broad the potential is for causing more incidents.
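
To make that concrete, here is a minimal sketch (in Python, with made-up causes and numbers purely for illustration) of what cause analysis might record: log every contributing cause, address them all, and if effort must be focused, rank them by how broadly each cause could trigger further incidents.

    # Hypothetical post-incident cause analysis: every contributing cause is
    # recorded; "primacy" is judged by how broadly a cause could trigger
    # further incidents, not by hunting for a single root.
    causes = [
        {"cause": "no weather warning issued to the crew",  "potential_further_incidents": 8},
        {"cause": "undersized relief drain at the site",    "potential_further_incidents": 3},
        {"cause": "no police-to-railroad contact protocol", "potential_further_incidents": 12},
        {"cause": "single-skinned ethanol tank cars",       "potential_further_incidents": 20},
    ]

    # Address all of them; if we must focus, start with the primary cause.
    for c in sorted(causes, key=lambda c: c["potential_further_incidents"], reverse=True):
        print(f'{c["potential_further_incidents"]:>3}  {c["cause"]}')

    primary = max(causes, key=lambda c: c["potential_further_incidents"])
    print("Primary cause:", primary["cause"])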

All complex systems are broken

It is not often you read something that completely changes the way you look at IT. This paper How Complex Systems Fail rocked me. Reading this made me completely rethink ITSM, especially Root Cause Analysis, Major Incident Reviews, and Change Management.  You must read it.  Now.  I’ll wait.

It says that all complex systems are broken.   It is only when the broken bits line up in the right way that the system fails.

It dates from 1998!  Richard Cook is a doctor, an MD.  He seemingly knocked this paper off on his own.  It is a whole four pages long, and he wrote it with medical systems in mind.  But that doesn’t matter: it is deeply profound in its insight into any complex system and it applies head-on to our delivery and support of IT services.

“Complex systems run as broken systems”

“Change introduces new forms of failure”

“Views of ‘cause’ limit the effectiveness of defenses against future events… likelihood of an identical accident is already extraordinarily low because the pattern of latent failures changes constantly.”

“Failure free operations require experience with failure.”

Many times the person “to blame” for a primary cause was just doing their job. All complex systems are broken. Every day the operators make value judgements and risk calls. Sometimes they don’t get away with it. There is a fine line between considered risks and incompetence – we have to keep that line in mind. Just because they caused the incident doesn’t mean it is their fault. Think of the word “fault” – what they did may not have been faulty, it may just be what they have to do every day to get the job done. Too often, when they get away with it they are considered to have made a good call; when they don’t they get crucified.

That’s not to say negligence doesn’t happen.   We should keep an eye out for it, and deal with it when we find it.  Equally we should not set out on cause analysis with the intent of allocating blame.   We do cause analysis for the purpose of preventing a recurrence of a similar Incident by removing the existing Problems that we find.

I will close by once again disagreeing with ITIL’s idea of Problem Management.  As I said in my last article, pro-active Problem Management is not about preventing problems from occurring, or preventing causes of service disruption from entering the environment, or making systems more robust or higher quality.

It is overloading Problem Management to also make it deal with “How could this happen?” and “How do we prevent it from happening again?”  That is dealt with by Risk Management (an essential practice that ITIL does not even recognise) feeding into Continual Service Improvement to remove the risk.  The NTSB were not doing Problem Management at Cherry Valley.

Next time we will look at continual improvement and how it relates to problem prevention.

Image credit – © Van Truan – Fotolia.com

Expanding Customer Service To Twitter

“Providing quick, engaging and valuable support to your customers on Twitter can build a positive brand image and reduce cost. Twitter takes less time and money, and offers your company the ability to resolve issues quickly and efficiently.”

Image originally posted on Zengage, The Zendesk Blog

From Tel Aviv to Beverley Hills on an ITSM mission

Tel Aviv based SysAid Technologies has announced that residential real estate brokerage Realty ONE Group has selected SysAid 9.0 integrated ITSM software.

This deal is said to give Realty ONE Group the chance to use SysAid to manage its internal IT department and to assist with providing both in-office and mobile technical support for its nearly 4,000 agents across the western United States.

“We chose SysAid because it is solid functioning service desk software that assists our agents from the office and the field, allows for more efficient processes,” said Kuba Jewgieniew, CEO and owner of Realty ONE Group. “We are excited about the SysAid roll-out and look forward to taking advantage of the many IT tools and capabilities that the platform offers.”

SysAid’s ITSM solution, available in both in-house and cloud editions, is designed to provide IT administrators a variety of best-practice tools to oversee and manage company help desks and asset inventories from the office or on the go, from any Internet-connected device. SysAid’s mobile platform also allows company employees to submit IT requests from their own mobile devices.

Additionally, the solution includes Mobile Device Management (MDM) capabilities intended to allow administrators to extend their IT management policies across all mobile devices on the network — corporate and employee owned devices (the BYOD trend).

“We are pleased to have been selected by a well-known and reputable company such as Realty ONE Group,” said Israel Lifshitz, founder of SysAid. “This decision highlights the important value that a flexible IT management solution brings to Realty ONE’s entire business, and we are pleased to provide the company with a solution that meets its requirements.”

NOTE: SysAid Technologies also has offices in Australia and Brazil.

Getting Started with Social IT (Part 1 of 2)

Today’s post from Matthew Selheimer of ITinvolve is part one of a two-part feature on Social IT maturity; part two will follow soon.

"Most of your customers, employees and stakeholders are actively using social media"

Today, 98 percent of the online population in the USA uses social media sites, and worldwide nearly 6 out of every 10 people use social networks and forums.

From a business perspective, this means a very large percentage of your customers, employees and other stakeholders are already participating in the social media universe where smartphones, tablets, video communication and collaboration are a part of daily life. It almost goes without saying that, if you want to connect with new audiences and marketplaces today, there is no other platform that compares to social media in reach and frequency.

In fact, a recent McKinsey & Company report suggests that the growth businesses of tomorrow will be those that harness the power of social media and its potential benefits not only externally but internally as well:

‘Most importantly, we find that social technologies, when used within and across enterprises, have the potential to raise the productivity of the high-skill knowledge workers that are critical to the performance and growth in the 21st century by 20 to 25 percent.’

We are social by nature

How might IT departments take advantage of this social media potential? IT organizations are, in fact, quite social by nature. Knowledge and expertise reside in different teams, and specialists must frequently come together and collaborate to plan for changes and resolve issues. These social interactions, however, are typically ad hoc and take place across a wide variety of methods from in-person conversations and meetings, to email, to phone calls, to instant messaging, to wiki sites, and more.

How can IT build upon its existing social culture to deliver new value for the broader organization?

To be considered as more than just a ‘nice to have,’ social media must provide tangible benefits. The good news is that social media principles do provide real benefits when applied to IT – and they do so in a big way. For example, IT organizations that are using social media principles are finding that their staff can interact with users and each other in new and more immediate ways. They are also finding that they can much more easily capture and share the collective knowledge residing across their systems and teams; and then armed with this knowledge, they are able to better understand their IT environment and the complex relationships that exist among their IT assets.

Being social brings risks and rewards

This, in turn, is leading to increases in staff productivity and is making day-to-day tasks like resolving incidents and planning for changes more efficient and more accurate. The results include faster time to restore service when outages or degradations occur, a higher success rate when executing changes, and a greater overall throughput of IT process management activities – just to name a few.

But the adoption of social media principles in IT also comes with certain pitfalls.  In this article, we will explore a four-level model of social IT maturity (see Figure 1), including how to avoid the most common pitfalls.

  • At Level 1, organizations begin to explore how social IT can contribute by defining a milestone-based plan with clearly established benefits as their social IT maturity increases.
  • At Level 2, IT takes specific actions to add on social capabilities to existing operations, and begins to realize projected benefits around user intimacy and satisfaction.
  • At Level 3, social IT becomes embedded into and enhances IT operational processes, providing relevant context to improve collaboration among IT professionals thereby making IT teams more efficient and accurate in their daily work.
  • Finally, at Level 4, IT evolves into a socially driven organization with a self-sustaining community, recognition and rewards systems that further incentivize the expansion of the community, and a culture that harnesses the power of social collaboration for continuous process improvement.


Figure 1 - A Proposed Social IT Maturity Model

Level 1 Maturity: Social Exploration

The first level of social IT maturity is Social Exploration. The goal of Social Exploration is to learn, and the value delivered comes from defining your plan to improve social IT maturity.

Such a plan must include specific key performance measures that can be tied to financial or other tangible business benefits. Otherwise, your social IT plan is bound to be greeted skeptically by management.

Start by asking yourself simple questions like ‘How can social tools improve my ability to provide better IT service and support?’ and ‘What social IT capabilities are available in the market that I should know about and consider for my organization?’ If you’ve not started asking these types of questions, then you aren’t even on the social IT maturity scale yet. Exploring what social IT could mean for your IT organization is the critical first step.

To exit Level 1 and move to Level 2 on the maturity scale, you must have a documented plan for how you will improve your social IT maturity that incorporates specific key performance measures. The following sections will discuss a variety of elements and performance measures that you should consider.

Social IT Pitfall #1: Ungoverned Broadcasting

In your transition from Level 1 to Level 2 maturity, a common pitfall is to look for a ‘quick win’ such as broadcasting via Twitter or RSS. A number of IT management software vendors include this capability in their products today, so it seems like an easy way to ‘go social.’ However, if you haven’t taken the time to define your communications policies clearly, you could end up doing more harm than good. Posting IT service status to public feeds could leave your organization exposed or embarrassed. You wouldn’t want to see ‘My Company finance application unavailable due to network outage’ re-tweeted and publicly searchable on Google, would you?

You can do more harm than good if you try for a ‘quick win’ approach to social IT by broadcasting via Twitter or RSS. Posting IT service status to public feeds could leave your organization exposed or embarrassed.

Level 2 Maturity: Social Add-ons

The most important thing about getting to Level 2 maturity, Social Add-ons, is that you are now taking specific actions to leverage social capabilities as part of your overall IT management approach.

While some organizations may choose to move directly to Level 3 maturity, because of its greater value, a common next step in increasing social IT maturity is the adoption of one or more social capabilities as add-ons to your existing IT processes. The goals at this stage are typically to leverage social capabilities to improve communications with users and, to a lesser extent, within IT.

The value of Level 2 social IT maturity is defined in terms of metrics such as user satisfaction, the percentage of incidents or requests that have been acted upon within their prescribed SLAs, and the creation of formal social IT communications policies that clarify what should be communicated to whom and when.

A logical place to start is to evaluate the social add-on capabilities of your current IT management software. You may find that your current vendor offers some type of 1:1 chat (instant messaging, video-based, virtual chat agents, etc.), often with the ability to save or record that chat. You may also find support for news feeds and notifications (e.g. Twitter, RSS, Salesforce.com’s Chatter, Yammer, or Facebook integration). You might also consider using these approaches on a standalone basis outside of your current IT management software if your current provider does not offer these capabilities.

Define your communication policies

Remember the first social IT pitfall of broadcasting, though. Before you start communicating, you must define your formal communications policies. Most likely, you already have a policy that pertains to email or Intranet communications to users and employees. If you do, that’ll give you a head start to work from. In any case, here are a few good rules of thumb to follow:

  1. Only communicate externally what you are comfortable with the entire world knowing about. In most cases you will find there are very few things, if any, that fit into this category. For example, you might push out a tweet to a specific user’s Twitter account that their incident has now been closed, but without any details about the nature of the incident (a minimal sketch of this rule appears after the list).
  2. If you do want to communicate using social tools externally in a broader way, consider using private groups that are secure. For example Twitter, Chatter, and Facebook all support private groups, although there is administrative overhead for both users and IT departments to request to join them and to manage members over time.
  3. Make sure what you communicate is focused on a specific audience. Don’t broadcast status updates on every IT service to everyone. If you create too much noise, people will just tune out your communications, defeating their entire purpose.
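
By way of illustration only, here is a minimal sketch of rule 1 enforced in code – the function and field names are hypothetical, not from any particular tool – where the public notice carries nothing but a reference number and a thank-you, while the detail stays in the internal record:

    # Hypothetical enforcement of rule 1: external notifications never carry
    # incident detail, only a generic closure notice aimed at the affected user.
    def external_closure_notice(incident: dict) -> str:
        if incident["status"] != "closed":
            raise ValueError("Only closed incidents are announced externally")
        # No service name, no cause, no impact - just the reference number.
        return f'@{incident["user_twitter"]} Your request {incident["ref"]} has been closed. Thank you.'

    incident = {
        "ref": "INC-1234",
        "status": "closed",
        "user_twitter": "jane_doe",                         # hypothetical handle
        "detail": "Finance app outage - network failure",   # stays internal, never tweeted
    }

    print(external_closure_notice(incident))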

To exit Level 2 and start to move to Level 3 on the maturity scale, you need to shift both your thinking and your plans from social add-ons to how social capabilities can be embedded into the work IT does every day. This means expanding your social scope beyond IT and end user interactions, and working to improve collaboration within IT.

Social IT Pitfall #2: Feeds, Walls, and Noise – Oh My!

One critical success factor for social IT communications is to ensure you are targeting specific audiences. Some vendors offer a Facebook-like wall in addition to the ability to push updates out via Twitter or RSS. In addition to the exposure risk previously discussed, these approaches can also create a tremendous amount of noise, which will make it difficult for both business users and IT to identify useful information in the feed or on the wall.

Relying on a solitary Facebook-like wall for social IT, as well as pushing updates out via Twitter or RSS, can create a tremendous amount of noise, making it difficult for both business users and IT to identify useful information in the feed or on the wall.

There is a simple analogy to illustrate this point. Imagine you are invited to a dinner party and arrive as one of twenty guests. As you enter, you hear many conversations taking place at once, music playing, and glasses clinking behind the bar, and you smell food cooking. What’s the first thing you do? If you’re like most people, you look around the room to find someone you know or someone who appears interesting, or maybe you head toward the bar or the kitchen. What you’ve just done is to establish context for the party you’re attending. A single IT news feed or wall doesn’t provide useful context. It’s like listening to random sentences from each of the conversations at the party, and it contains a lot of noise that a business or IT user just doesn’t care about.

While news feeds and walls typically have a keyword search capability, both business users and IT staff will end up spending too much time trying to locate relevant information. As a result, over time they will likely start avoiding the feed or wall because it contains far too much information they don’t care about. What’s more, the feed can grow so long that it needs to be truncated periodically, causing useful information posted long ago to be lost to the organization.

Stay away from one-size-fits-all walls or feeds. They’re not useful and will hurt the credibility of your social IT project.

This is part one of a two-part feature on Social IT maturity; part two will follow soon.

CHANGE: Don't be a Statistic!

Change is inevitable; how you manage the organizational aspects will make all the difference

For decades, industry experts have been telling us that 70% of organizational change initiatives fail. These experts include recognizable names such as Kotter, Blanchard, and McKinsey. It’s a scary story! It means that 70% of changes fail to realize a return on investment or to achieve their stated goals and objectives.

In service management the change could be the introduction of new technology, an organizational restructure or a process improvement. These changes can represent a significant investment of time, money and resources. So can we afford not to break the cycle, and instead allow history to keep repeating itself?

Not only do failed changes result in wasted time, money and effort but they also make subsequent changes even harder. Failed changes result in cynicism, lost productivity, low morale and change fatigue. Expecting people to change their ways of working will be increasingly harder if they have been subject to a series of failed change initiatives in the past.

Failure is due to lack of focus on organizational change

It is my belief that 70% of changes fail due to a lack of focus on organizational change. Projects have specific objectives: on time, on budget and delivery of specified functionality. Projects install changes but do not implement them unless organizational change is included within the project.

If a change is to be truly embedded into the fabric of the organization and recognize the desired outcomes, there has to be a focus on the people. Organizational change is a challenge to many because it involves people and every one of those people is different. They have different desires, beliefs, values, attitudes, assumptions and behaviors. An individual may embrace one change because it is aligned with their value system and they can answer the “What’s in it for me?” (WIIFM) question. The next change may be rejected because that question cannot be answered or the change is perceived to have a negative impact on the individual, their role or their career.

Therefore, before we embark on any change we need to clearly identify the target audience: anyone who may be impacted in any way by this change, directly or indirectly. We need to understand the target audience and their level of awareness of the need to change.

By identifying the right sponsors for the change at every level within the organization, and equipping them with the skills and capability to raise awareness and create a desire to change, we are more likely to have a successful outcome.

Announcing is not implementing!

A communication strategy and plan is a key component of the organizational change program. It needs to address the needs of the audience, the key messages and how they will be packaged and delivered. The sender of the message is important. Messages around the business need for change are received better when they are delivered from the CxO level. Messages about how the change will affect an individual are better received from that person’s manager or supervisor.

We also need to select a variety of practices or activities to ensure that the change becomes embedded and people do not revert to old ways of working or their comfort zone.

In 2011 I wrote a book called ‘Balanced Diversity – A Portfolio Approach to Organizational Change’, which provides 59 distinct practices to choose from when embedding change.  Although the framework introduced in the book can be used for any change in any organization, I have discussed how it can specifically be used within IT service management.

In IT we often expect that providing people with some training or undertaking an ITSM maturity assessment will create the desire for change. It takes much more than that and because every change is different we need to use different practices that address the specific needs of the change and the target audience.

I don’t have the space here to discuss each of the 59 practices, but a white paper on the subject can be read here.

Keep checking – there is no room for ‘set and forget’

Finally, it is critical to keep checking back to ensure that the change is being successfully embedded into the organization. We often talk about the Deming cycle in IT service management but don’t apply it to organizational change. We need to plan what practices we are going to use to embed the change, do them, but then continually check back to ensure they are having the desired effect. If they are not, we need to act before the situation is irretrievable.

If you don’t want to become a 70% statistic and you want to ensure that your changes are a success, get some organizational change capability on your projects. It will be worth the investment.

Image – © Lilya – Fotolia.com

Rob England: Proactive Problem Management

Just because you rebuild the track doesn’t mean the train won’t derail again.

Rebuilding the track was reactive Problem Management

We have been looking in past articles at the tragic events in little Cherry Valley, Illinois in 2009.  One person died and several more were seriously injured when a train-load of ethanol derailed at a level crossing. We talked about the resulting Incident Management, which focused on customers, trains and cargo – ensuring the services still operated, employing workarounds. Then we considered the Problem Management: the injured people and the wreck and the broken track – removing the causes of service disruption, restoring normal service.

A Problem is a problem, whether it has caused an Incident yet or not

In a previous article I said ITIL has an odd definition of Problem.  ITIL says a Problem is the cause of “one or more incidents”.   ITIL promotes proactive (better called pre-emptive) Problem Management, and yet apparently we need to wait until something causes at least one Incident before we can start treating it as a Problem.  I think the washout in Cherry Valley was a problem long before train U70691-18 barrelled into town.  A Problem is in fact the cause of zero or more Incidents.  A Problem is a problem, whether it has caused an Incident yet or not.

We talked about how I try to stick to a nice crisp simple model of Incident vs. Problem.  To me, an incident is an interruption to service and a problem is an underlying (potential) cause of incidents.  Incident Management is concerned with the restoration of expected levels of service to the users.  Problem Management is concerned with removing the underlying causes.

ITIL doesn’t see it that crisply delineated: the two concepts are muddied together.  ITIL – and many readers – would say that putting out the fires, clearing the derailed tankers, rebuilding the roadbed, and relaying the rails can be regarded as part of the Incident resolution process because the service isn’t “really” restored until the track is back.

Problems can be resolved with urgency

In the last article I said this thinking may arise because of the weird way ITIL defines a Problem.  I have a hunch that there is a second reason: people consider removing the cause of the incident to be part of the Incident because they see Incident=Urgent, Problem=Slow.  They want the Incident Manager and the Service Desk staff to hustle until the cause is removed.  This is just silly.   There is no reason why Problems can’t be resolved with urgency.  Problems should be categorised by severity, priority and impact just like Incidents are.  The Problem team should go into urgent mode when necessary to mobilise resources, and the Service Desk are able to hustle the Problem along just as they would an Incident.
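
As a sketch of what that could look like in practice – my own illustration, not anything mandated by ITIL – the same impact/urgency matrix that drives Incident priority can drive Problem priority, so an urgent Problem mobilises resources exactly as an urgent Incident does:

    # Illustrative only: one shared priority matrix for Incidents and Problems,
    # so a Problem can be worked with the same urgency as an Incident.
    PRIORITY = {  # (impact, urgency) -> priority, where 1 is the most urgent
        ("high", "high"): 1,   ("high", "medium"): 2,   ("high", "low"): 3,
        ("medium", "high"): 2, ("medium", "medium"): 3, ("medium", "low"): 4,
        ("low", "high"): 3,    ("low", "medium"): 4,    ("low", "low"): 5,
    }

    def priority(record: dict) -> int:
        # Works for Incident and Problem records alike.
        return PRIORITY[(record["impact"], record["urgency"])]

    washout_problem = {"type": "problem", "impact": "high", "urgency": "high"}
    print(priority(washout_problem))   # 1 - handled with the same urgency as a top-priority Incident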

This inclusion of cause-removal over-burdens and de-focuses the Incident Management process.  Incident Management should have a laser focus on the user and by implication the customer.  It should be performed by people who are expert at serving the user.  Its goal is to meet the user’s needs.   Canadian National’s incident managers were focused on getting deliveries to customers despite a missing bit of track.

Problem Management is about fixing faults.  It is performed by people expert at fixing technology.  The Canadian National incident managers weren’t directing clean-up operations in Cherry Valley: they left that to the track engineers and the emergency services.

Problem management is a mess

But the way ITIL has it, some causes are removed as part of Incident resolution and some are categorised as Problems, with the distinction being unclear (“For some incidents, it will be appropriate…” ITIL Service Operation 2011 4.2.6.4).  The moment you make Incident Management responsible for sometimes fixing the fault as well as meeting the user’s needs, you have a mashup of two processes, with two sometimes-conflicting goals, and performed by two very different types of people.  No wonder it is a mess.

It is a mess from a management point of view when we get a storm of incidents.  Instead of linking all related incidents to an underlying Problem, we relate them to some “master incident” (this isn’t actually in ITIL but it is common practice).

It is a mess from a prioritisation point of view.   The poor teams who fix things are now serving two processes:  Incident and Problem.  In order to prioritise their work they need to track a portfolio of faults that are currently being handled as incidents and faults that are being handled as problems, and somehow merge the two into one holistic picture.  Of course they don’t.   The Problem Manager doesn’t have a complete view of all faults, nor does the Incident Manager, and the technical teams are answerable to both.

It is a mess from a data modelling point of view as well.  If you want to determine all the times that a certain asset broke something, you need to look for both the incidents it caused and the problems it caused.

Every cause of a service impact (or potential impact) should be recorded immediately as a problem, so we can report and manage them in one place.
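
To make the data-modelling point concrete, here is a minimal sketch of the model being argued for – illustrative names, not any real tool’s schema: every cause is recorded as a Problem, each Problem is linked to zero or more Incidents and to the asset at fault, and one query answers “how often did this asset break something?”:

    # Illustrative data model: a Problem is the cause of zero or more Incidents.
    from dataclasses import dataclass, field

    @dataclass
    class Incident:
        ref: str
        description: str

    @dataclass
    class Problem:
        ref: str
        asset: str                                     # the configuration item at fault
        description: str
        incidents: list = field(default_factory=list)  # zero or more Incidents

    problems = [
        Problem("PRB-1", "culvert-36in", "undersized relief drain",
                [Incident("INC-1", "washout and derailment")]),
        Problem("PRB-2", "culvert-36in", "blocked drain found on inspection"),  # no Incident yet, still a Problem
    ]

    # One place to look: every time a given asset broke (or could break) something.
    def history(asset: str):
        return [(p.ref, len(p.incidents)) for p in problems if p.asset == asset]

    print(history("culvert-36in"))   # [('PRB-1', 1), ('PRB-2', 0)]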

All that tirade is by way of introducing the idea of reactive and proactive Problem Management.

Cherry Valley needed Reactive Problem Management

Reactive Problem Management responds to an incident to remove the cause of the disruption to service.  The ITIL definition is more tortuous because it treats “restoring the service” as Incident Management’s job, but it ends up saying a similar thing: “Reactive problem management is concerned with solving problems in response to one or more incidents” (SO 2011 4.4.2).

Pro-active Problem Management fixes problems that aren’t currently causing an incident, in order to prevent them from causing incidents (ITIL says “further” incidents).

So cleaning up the mess in Cherry Valley and rebuilding the track was reactive Problem Management.

Once the trains were rolling they didn’t stop there.  Clearly there were some other problems to address.  What caused the roadbed to be washed away in the first place?  Why did a train thunder into the gap at normal track speed?  Why did the tank-cars rupture and how did they catch fire?

Find the problems that need fixing

In Cherry Valley, the drainage was faulty.  Water was able to accumulate behind the railway roadbed embankment, causing flooding and eventually overflowing the roadbed, washing out below the track, leaving rails dangling in the air.  The next time there was torrential rain, it would break again.  That’s a problem to fix.

Canadian National’s communication processes were broken.  The dispatchers failed to notify the train crew of a severe weather alert, which they were supposed to do.  If they had, the train would have operated at reduced speed.  That’s a problem to fix.

The CN track maintenance processes worked, perhaps lackadaisically, but they worked as designed.  The processes could have been a lot better, but were they broken?  No.

The tank cars were approved for transporting ethanol.   They were not required to be equipped with head shields (extra protection at the ends of the tank to resist puncturing), jackets, or thermal protection.  In March 2012 the US National Transportation Safety Board (NTSB) recommended (R-12-5) “that all newly manufactured and existing general service tank cars authorized for transportation of denatured fuel ethanol … have enhanced tank head and shell puncture resistance systems”.  The tank-cars weren’t broken (before the crash).  This is not fixing a problem; it is improving safety to mitigate the risk of rupture.

Proactive Problem Management prevents the recurrence of Incidents

I don’t think pro-active Problem Management is about preventing problems from occurring, or preventing causes of service disruption from entering the environment, or making systems more robust or higher quality.  That is once again over-burdening a process.  If you delve too far into preventing future problems, you cross over into Availability and Capacity and Risk Management and Service Improvement, (and Change Management!), not Problem Management.

ITIL agrees: “Proactive problem management is concerned with identifying and solving problems and known errors before further incidents related to them can occur again”.   Proactive Problem Management prevents the recurrence of Incidents, not Problems.

In order to ensure that incidents will not recur, we need to dig down to find all the underlying causes.  In many methodologies we go after that mythical beast, the Root Cause.  We will talk about that next time.

Image credit