Podcast Episode 11 – Structured Problem Analysis

In Episode 11 of The ITSM Review podcast, Rebecca Beach discusses structured problem analysis and problem-solving methodologies with guests Simon Morris and Tobias Nyberg.

Topics include:

  • Strategic activity
  • Kepner Tregoe
  • Slack time and overload
  • Bias/challenging assumptions
  • Maturity barriers
  • Playing the blame game
  • Sex, drugs and rock-n-roll

Books mentioned and further reading/information:

2013: A Year in ITSM Review

Merry Christmas and a Happy New Year!

As 2013 begins to draw to a close, I thought it would be nice to finish off the year with a final article giving an overview of what has happened at the ITSM Review over the last 12 months. That’s right, this will be our last post for 2013, because the entire team is heading off to fill their faces with mince pies and sherry. But don’t worry, we’ll be back in 2014 with slightly bigger waistlines and lots of exciting plans (insight into which you can find at the end of this article).

Ironically I like neither mince pies nor sherry. 

Visits and Growth

  • We have had nearly 230,000 page views this year, an increase of a whopping 210% from 2012!!! A huge thank you to the circa 120,000 of you for coming to read our content.
  • Visits to our site increased by an astounding 58% between the end of June and end of July alone, and then continued to grow on average by 5.5% every month.
  • Our Twitter followers increased by 193%.

One thing worth pointing out here is that the bulk of our readers are not actually situated in the UK (which is what a lot of people presume, given that this is where we are based). In 2013, 17% of our readers were from the UK, but an impressive 30% were actually from the USA. Perhaps we should open a US office?! A large proportion of visitors also came from India, Germany, Australia, Canada, The Netherlands, France and Sweden, as well as plenty of other countries too.

Owing to us attracting more and more visitors year-on-year from outside of the UK and America, we are increasingly being asked to produce region-specific content. We are therefore looking for practitioners, consultants or analysts based in Asia, South America, Africa, and Europe who would be interested in writing about their experiences of ITSM in other countries. If you are interested please get in touch.

What was popular?

The top 3 most-viewed articles of the year were:

  1. 7 Benefits of using a Known Error Database (by Simon Morris)
  2. Gartner Magic Quadrant for IT Service Support Management Tools (Martin Thompson)
  3. AXELOS: Capita and ITIL joint venture lift lid on new brand (Martin Thompson)

Of those articles only number 3 was actually written and published in 2013.

I have to say congratulations specifically to Simon Morris here as well, because his KEDB article was not only the most-read article of the year, but it achieved 37% more hits than the second most popular article of the year! (And that’s not counting the hits it originally got in the year it was published).

Of the articles written and contributed in 2013, the top 3 were:

  1. Future of ITIL workshop – a little insight (Stuart Rance and Stephen Mann)
  2. Four Problem Management SLAs you really can’t live without (Simon Higginson)
  3. 7 golden rules for getting the most from the Service Catalogue (Yemsrach Hallemariam)

Is there a specific topic that you would like us to write about? Are there practical pieces that you would like to see us cover to help you in your day-to-day job? Please let us know.

Content Contributors

In 2013, we were pleased to welcome 3 new, regular content contributors to the ITSM Review.  These are people who now write for us on a regular basis (roughly once a month), so you can expect to see a lot more great content from them in 2014. They are:

We also published content for the first time from the following companies: Cancer Research UK; EasyVista; Fruition Partners; GamingWorks; LANdesk; Macro4; Oregon Department of Transportation; Service Management Art Inc; and xMatters.

A great big thank-you to all of our regular and ad hoc contributors for helping to supply us with such fantastic content.

If you’re reading this and think you might be interested in contributing content (we welcome content from all), please get in touch.

Top Searches

Given that we had over 230,000 page views this year, I thought that many of you might be interested to see what people were searching for on our site. The top 20 searches of the year were as follows:

  1. KEDB
  2. AXELOS
  3. Known Error Database
  4. ITSM
  5. Issue Log
  6. Proactive Problem Management
  7. ITSM Software
  8. Gartner ITSM
  9. What is Service Management
  10. Cherwell Software Review
  11. Gartner ITSM Magic Quadrant
  12. ServiceNow Review
  13. ITSM Software Review
  14. ITSM News
  15. Major Incident Management Process
  16. Free ITIL Training
  17. RemedyForce Review
  18. BMC Footprints
  19. KEDB in ITIL
  20. Process Owner

Are there any search terms that you are surprised to see on there?  Or anything that you would have expected to see that isn’t?

Events

In 2013 we branched out and kicked off Media Partnerships at the itSMF UK Conference and Exhibition (Birmingham) and the itSMF Estonia Conference (Tallinn).

Our aim was not only to spread the word about The ITSM Review, but to spend time with delegates to find out what things they are struggling with and how we might be able to help them.

Next year you can expect to see us at the PINK conference in Las Vegas, and we hope to announce some other new, exciting partnerships for 2015 in the New Year!

Launches

In May we launched the ITSM Review App (Search ‘ITSM’ in the Apple App Store). 

Then there is the ITSM Tools Universe, which we launched at the end of November. The Tools Universe aims to shed light on the emerging ITSM players (as well as the major competitors) and, over time, the changes in the position of the companies involved and moves in market share. Most importantly, it is free to participate and, unlike any Magic Quadrant or Wave, the ITSM Tools Universe is open to ALL ITSM vendors. 9 vendors are already confirmed.

If you are a vendor and are interested in learning more about the ITSM Tools Universe, please contact us.

Additions to the team

As of 1st January 2013, the ITSM Review was still just the man you all know and love, Martin Thompson (he tried desperately to get me to remove what I just said there… modest and all that jazz).

However, ITSM Review finished 2013 with an additional 3 employees:

  • In January 2013 Glenn Thompson (you’d be right to suspect that they might be related) joined full-time as the company’s Commercial Director. For some reason there was no official announcement (we’ll blame Martin) so for some of you this might be the first you’ve heard of it! Without Glenn we’d struggle to continue to offer all of our content to readers free of charge, so despite the fact that he’s a Chelsea fan, you’ve got to like him.
  • In July, for some reason Martin decided it would be a good move to hire some strange blonde lady who liked penguins (that would be me) as the Marketing and Community Manager.
  • Finally, in October Rebecca Beach joined as a Research Analyst. Famous for being a “gobby midget”, Rebecca will be writing most of our ITSM research and reviews in 2014. Rebecca also spends time (in conjunction with me) making fun of Martin and Glenn on a regular basis (it’s not our fault they make it so easy).

So then there were four.

If you’re interested in any upcoming job opportunities at the ITSM Review (or ITAM Review), then please let us know. We certainly plan on increasing that number in 2014.

What’s planned for 2014?

Next year we are hoping to broaden our coverage of the ITSM space even further by securing new content contributors; participating in more industry events; launching new products (such as video product reviews, webinars, and case studies); and more.

We’re also looking very seriously at the possibility of running regular ‘social meet ups’ like we recently did with the Christmas get-together.

In addition to the publication of our ITSM Tools Universe in the spring, we will also be continuing our Group Tests; a full list of topics for the Group Test series will be published in early January.

In addition to the above, we also have some planned changes in the works for our website. Nothing too major (it will still look like the ITSM Review that you know and love), just some cosmetic updates to make it easier on the eye and to help you find what you are looking for.

Watch this space and we’ll keep you updated on our plans throughout 2014!

Oh and if you’re interested in the 2013 review and plans for 2014 from the ITAM Review, you can read them here.

Is there anything you would like to see us doing in 2014 that we’re not doing currently? Are there any changes that you would like to suggest to the website? Would you be interested in a tooling event or social get-togethers? Are you a Vendor who is interested in our Group Tests? We welcome your feedback, so please get in touch.

And so…

2013 is drawing to a close. Our success and growth throughout the year have made everybody here happy bunnies; but most importantly we hope that our content / site / presence this year has made YOU a bunch of happy bunnies. The whole purpose of the ITSM Review is to help ITSM practitioners, and everything we do has that end goal in mind. Even if we only gain an additional 5 readers in 2014, so long as our content aids those 5 people and makes their work lives easier, then these bunnies will continue to have smiles on their faces.

So with that image of turning the entire ITSM industry into smiley rabbits, I bid you all a Merry Christmas and a Happy New Year!  Thanks for reading throughout 2013; without you… the ITSM Review doesn’t exist.

Applying Agile principles to Service Management

Regular readers of my blog on The ITSM Review, or over at ScrumProUK know that I enjoy exploring the gap between the worlds of IT Service Management and Agile development methodologies.

Today I wanted to go back to basics and start over by explaining why you – as a reader from the world of Service Management or corporate IT – should care about Agile principles and what they can bring to your organisation.

I’m glad to see one commonly held misconception being broken down and disproven recently: the assumption that an organisation cannot be both Agile and IT Service Management orientated.

Even the oldest, most stubborn and most skeptical of Agile critics (his words!!) are coming around to the idea that an organisation can excel at both disciplines. Hurrah. There is a growing Google+ community dedicated to it – Kamu: Uniting DevOps and ITSM.

To introduce these principles to those who are unfamiliar with them, I’m taking inspiration from Tobias Mayer, who identified 5 benefits of Agile (in this particular case Scrum) that will help me orientate you.

Focus

“A man who chases two rabbits catches none.” ~ Roman Proverb

A core principle of popular Agile methodologies such as Scrum and Kanban is to limit Work In Progress (WIP). Scrum teams, for example, will agree to take on a small subset of work from the overall backlog within a timeboxed period, normally between 2 and 4 weeks.

By limiting the team’s focus and attention to what is most important, you enable them to complete work to the appropriate quality standards; and by limiting work in progress we train teams to finish work rather than start additional work. With focus comes attention to detail and fewer mistakes, a higher level of quality and ultimately happier customers.

Look around your IT department today as you read this article. Do you see teams that have more work than they can handle? (Probably.) Do those teams have a clear understanding of what is most important? (Probably not.)

How can Service Management teams adopt the Agile principle of providing focus for their teams?

Start by understanding your work. Where does it come from? How does work arrive in your department? Visualise your work using a tool like LeanKit, Trello or Kanbanize (all have free editions for you to try). Use one of these tools to identify which work items are the most important and challenge the team to finish those items.

By reducing the scope of work that a team is paying attention to you’ll see a change in behaviour, delivery time and quality.
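If it helps to make the mechanics concrete, here is a minimal sketch (in Python, with invented names) of the rule that Kanban-style tools enforce: work is pulled from a prioritised backlog only when the team has spare capacity, never pushed.

```python
from collections import deque

class KanbanBoard:
    """Toy model of a WIP-limited board. All names are illustrative."""

    def __init__(self, wip_limit=3):
        self.backlog = deque()   # prioritised work, most important first
        self.in_progress = []    # items the team is actively working on
        self.done = []
        self.wip_limit = wip_limit

    def add(self, item):
        self.backlog.append(item)

    def pull_next(self):
        """Pull work only when there is spare capacity under the WIP limit."""
        if len(self.in_progress) >= self.wip_limit:
            raise RuntimeError("WIP limit reached: finish something first")
        item = self.backlog.popleft()
        self.in_progress.append(item)
        return item

    def finish(self, item):
        self.in_progress.remove(item)
        self.done.append(item)
```

The refusal in `pull_next` is the whole discipline: the system will not let the team start new work until something already in progress is finished.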

Alignment

“What if we found ourselves building something that nobody wanted? In that case what did it matter if we built it on time and on budget….” ~ Eric Ries, The Lean Startup

Agile teams work with the principle that plans will change; that we will understand more about the work once we near completion and that no amount of planning really prepares us for the road ahead.

This is true for software development projects where Agile is accepted but of course it’s also true for IT maintenance and operational projects too. How many of your projects delivered exactly as predicted on day one?

Knowing that business requirements will change frequently and that the assumptions made before work begins are normally wrong, Agile teams handle this by working in iterations.

By planning months into the future with “just enough” detail and by focusing in granular detail on only the next 2 week sprint, a team can easily absorb changing business requirements.

By meeting with the business on a frequent basis, by examining the overall plan (in coarse detail) and by re-prioritising against the current business requirements, Agile teams achieve alignment with the business. They can plan for the next iteration in detail, knowing they are working on the most important thing based on today’s knowledge.

It’s no use being perfectly aligned at the start of the project if you have no system to cope with ever-changing demands. Changing requirements in a project are a good thing – they mean we will have a better solution in the end.

Do your IT project teams try to control changing requirements… or do you welcome them?

How can Service Management teams achieve alignment with the business?

By structuring work so that teams can focus in the short term but change direction to react to business demands. For Service Management teams this might mean short term focus on a set of metric goals to solve a particular business problem. Just having the routine of sitting with the business and reviewing priorities is a great first step.

Artful making

“I don’t test my code often but when I do I do it in production” ~ Internet meme

Earlier I mentioned Focus as a principle of Agile teams: by concentrating on a small subset of work that is most important to the business we can train teams to deliver, rather than having lots of work open and diluting the team’s attention.

There’s another benefit in limiting Work In Progress with regards to quality engineering. Imagine a team that has no control over its work and everything is urgent. The team has no Focus and no Alignment with the business – no understanding of what is truly important.

That team is likely to produce low-quality work. By trying to complete everything at once they’ll often do just enough to call it done. This results in risk lurking in your infrastructure; the worst kind of work that will leap out at you when you aren’t expecting it. Work you thought was done… but isn’t. Rework!!

Agile teams use the system of limited WIP as well as technical practices and standards to get work done once so they can move on to the next task.

Could you improve the quality of work by defining standards and shielding your team by limiting work in progress?

How can Service Management teams promote artful making?

Have a “Definition of Done” for common activities. Not a huge, heavy operations manual that no-one will ever read, but a collection of one-page definitions of what it means to be done with a server build, a software installation or a Problem ticket.

Make the Definition of Done visible and easy to use, and your engineers will know when they are finished with a piece of work before moving on.
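To show how lightweight such a definition can be, here is a sketch of one captured as a simple checklist (the items below are invented examples for a server build, not a recommended standard):

```python
# A hypothetical one-page "Definition of Done" for a server build.
SERVER_BUILD_DONE = [
    "OS patched to the current baseline",
    "Monitoring agent installed and reporting",
    "Backup schedule configured and first backup verified",
    "Server recorded in the CMDB",
    "Handover note published to the team wiki",
]

def is_done(completed_steps):
    """Work counts as 'done' only when every checklist item is satisfied."""
    return all(step in completed_steps for step in SERVER_BUILD_DONE)
```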

Self-organization

“None of us is as smart as all of us.” ~ Kenneth H. Blanchard

The best architectures, requirements, and designs emerge from self-organizing teams. Teams that are not controlled but enabled. Teams that stay together long enough to form an esprit de corps and that trust each other enough to have passionate debate and disagreement without destroying the team’s culture.

The worst experience that an engineer can have is to be presented work that was designed by someone else, work that has no scope for flexibility or creativity, and worst of all to be told how long it will take.

Have you ever worked on a project where the scope, implementation and deadline were predetermined by those that aren’t actually going to do the work? How does that even happen??

Agile teams are self-organising within the constraints of the organisation in which they operate. They receive requirements that describe the business need (the “WHY”) and acceptance criteria (the “WHAT”), and they, as a team, determine the solution (the “HOW”).

Self-organising teams scale much better than command-and-control style teams, where a manager delegates and defines the work.

Why would you want to have your expensive managers involved in assigning tasks and resource levelling? Members of a self-organising team know when they have spare capacity for more work and they pull work into their queue.

How can Service Management teams become more self-organising?

I think this is a simple one – do you have managers that delegate work or leaders that coach teams to success? If you have the former is that the best use of their time and skills? Give the team an opportunity to own their work and determine their own destiny, within the constraints of your organisation.

This loss of control by managers might result in a team more invested in its success, more motivated and higher performing.

Rhythm

“Rhythm is something you either have or don’t have, but when you have it, you have it all over.” ~ Elvis Presley

Agile teams are focused on the regular delivery of value into the businesses they serve. By limiting work to sprints, usually between 2 and 4 weeks long, they are able to continuously deliver work, building a partnership based on trust.

Because they focus on a subset of all possible work and they have quality standards, they can deliver work of a high quality, which deepens that trust.

Short time-boxes focus teams on an objective they have to meet – by self-organising they control the scope of work that is achievable within that sprint. When I started delivering work to a company using Scrum I asked my stakeholders which attribute of the work they valued most.

Was it the speed or the volume of work, or the number of features we delivered? No – organisations rely on predictability, and working in set time-boxes, or sprints, makes your team predictable.

Compare this to projects that defer the delivery of value until the end of the project. Rather than release early and often they buffer the features and aim to deliver all in one large batch.

If that deadline is delayed, two unfortunate things happen: firstly, trust between the team and the business is eroded; and secondly, the value represented in the features that are done but not released cannot be realised until all work is delivered.

Do you have trust between the IT organisation and the business that is built upon a rhythm of regularly delivered work?

How can Service Management teams get that sense of rhythm?

I love the idea of working within constraints. It focuses the mind and makes people creative. Even if you don’t work in software engineering, define a series of 2-week “sprints” for your Service Management team.

Declare an objective for the two-week sprint – “we are going to reduce the incident backlog to under 50”. Let the team self-organise, and think about your team’s objective for the next sprint.

In summary

Thanks to Tobias for his 5 attributes of Agile teams, which I’ve expanded and commented upon. My aim here was to outline the benefits of Agile to teams outside the world of software development. I hope that readers who work in IT Operations and engineering can compare the way they work currently against these ideals – all of which are simple and cheap to implement and realise.

Ultimately the ideas of focus, alignment, artful making, self-organisation and rhythm promote a culture of learning – about the work you handle, about how the team performs and about how you interact with the business.

Combine these 5 principles with the idea of regular, structured retrospection and I think you are well on the way to having a high-performing team.

I would love to have a discussion with you in the comments or on Twitter. Or come to the Kamu Google+ group and discuss with your peers there.

Simple steps towards Agility and Service Management improvement

There have been many hundreds of words recently written on the subject of Agile Development and IT Operations practices. For the average ITSM practitioner, however, a life where both are interwoven into the organisation’s day-to-day work seems as unattainable as ever.

Sure, you might work for one of the few organisations that practise DevOps. If so, congratulations… you’re one of the cool kids. Maybe you picked up a copy of “The Phoenix Project“** and the authors’ words resonated with you.

“I should start introducing Agile and Lean concepts into my IT organisation”

It’s not as if these words have fallen on deaf ears as such – it’s just that most ITSM practitioners are struggling to join the dots, unable to see how to apply Agile/Lean/DevOps to their own environments.

It’s hard to see how you get from your current position today to a position of continuous delivery and business agility, along with the bragging rights on Twitter about how great your aligned development and IT Operations organisations are.

You now want to improve… So what can you do to get started?

I have two quick tips for those IT Operations folk who want to start taking steps towards Agility and Service Management improvement. These tips won’t transform your IT department overnight, but they are both cheap and easy to implement (in fact you could do it this week).

Tip number 1: Hold retrospectives

The most valuable skill of a good Agile team is the ability to self-learn. Self-learners have a habit of looking at their performance as a team and can identify positive and negative characteristics from their recent behaviour. By learning from past experiences they pledge to improve in the future.

The mechanism for Agile teams to drive improvements is to hold regular retrospectives.

A retrospective is a time-boxed activity (a meeting) held at the end of a period of work – in Agile-speak, an “iteration”.

Development teams often work in regular short bursts of work called “sprints”, which in my company are always two weeks long, so we hold retrospectives on the last day of each sprint.

IT Operations work is not normally neatly defined in two-week iterations – you tend to deal with KTLO work (Keep The Lights On – Incidents and Problems) and perhaps projects. However, you should avoid the habit of only holding retrospectives at the end of projects or when things are going wrong.

If you want to take a few Agile steps in your IT organisation, my advice is to open your calendar application right now and set up a recurring one-hour meeting for your team every two weeks. Take this time to review work from that two-week period and identify improvements.

Build self-learning and improvement sessions into your schedule. Don’t leave opportunities for improvements to project post-mortems or to when things have already gone wrong.

So what happens in a retrospective session?

Firstly, it should be a facilitated session so you’ll need someone to lead the team, but this isn’t a daunting task (OK – it is the first time you do it but it gets easier after that). Secondly, it’s a structured session rather than an hour to ‘bitch and moan’ about the Incidents that came in during the last two weeks.

Retrospectives are structured meetings with a clear objective – not a general conversation about performance.

The objective of a retrospective is to get a documented commitment from the team to change one or two aspects of their behaviour. Documenting these commitments is covered below in tip number two.

Changing the behaviour of a team is absolutely not as challenging as it first seems. People only need a few things to happen to change their behaviour: to have their opinion heard, to be able to commit to the change, and to be held accountable. The format of a retrospective allows for all of this.

Also, with retrospectives we don’t focus purely on examples where things went wrong. I’ve been in many retrospective sessions where teams have focused on unexpected success, researched the factors that contributed to it, and committed to spreading whatever practice caused the success to the wider organisation.

Identifying what worked well for a team in the previous two weeks and pledging to repeat that behaviour is just as powerful as pledging not to repeat negative behaviours.

I mentioned that retrospective sessions are structured. This really helps, especially when a team starts out on a path of self-learning and improvement. The structure holds the meeting together and guides the team to its objective for the meeting – validation of existing Working Agreements and proposals for new ones.

Esther Derby and Diana Larsen, who both inspired me to focus on retrospectives with their book “Agile Retrospectives: Making Good Teams Great“, describe the structure for retrospectives very well. Take time to study and implement their meeting structure, summarised below.

What should the meeting structure look like?

The recommended meeting structure is as follows:

  • Set the stage
  • Gather data
  • Generate insights
  • Decide what to do
  • Close the retrospective

Each element in the meeting agenda is an opportunity for the facilitator to engage the team and run exercises to uncover what worked well (to be repeated) and what did not work well (to be avoided).

By structuring the meeting and facilitating people through the process, you avoid the temptation for people to spend the time simply complaining and placing blame for things that didn’t go well.

The meeting structure drives the retrospective towards its objective – an actionable set of Working Agreements for the team to use.

Tip number 2: Use Working Agreements

In a previous role in IT Operations and support I often felt the sensation of “spinning plates”. As soon as we put one fire out, another would flare up. Our problem as a team was that different people worked in different ways, which is a real issue in Infrastructure teams.

My solution at the time was to try and write an all-encompassing “rule book” which described how we as a team react to any given circumstance. We’d build this “rule book” up over time and end up with a comprehensive document to remove confusion on how to perform work.

I’m sure you can imagine the outcome – we started… we didn’t get far… and as soon as the rule book was of any decent size it became out of date and unwieldy.

What my team really needed then, and the way that my Agile development team now works, is a lightweight document explaining the rules of the road. We call this document our “Working Agreements”.

What should Working Agreements look like?

  • They should be small enough to fit on a single side of A3 paper
  • They should be agreed upon by the team
  • They should be the output of retrospective sessions, worded to enforce good behaviour or to prevent negative behaviour
  • They should be reviewed during each retrospective – do we still need this Working Agreement, or is it now part of our standard behaviour?
  • They should be very visible in the team’s work area

Having a lightweight set of agreements that the team commits to and that are reviewed regularly is a great way to drive cultural and technical changes that actually stick – rather than review meetings that mean nothing once the team leaves the room.

In summary

Driving improvements to a team means you are trying to change people’s behaviour, which is never an easy task. Teams will change if some basic needs are met: they need to be listened to, they need to commit to the change and they need to be held accountable for future behaviour.

This is possible in your IT Operations teams today – hold regular retrospectives to identify what works and what does not. Get the team to commit to Working Agreements that are agreed by everyone, meaningful and visible.

Let the improvements commence!

** If you didn’t nod when I mentioned The Phoenix Project then you aren’t one of the cool kids, and you’d better find out what it is… pronto!

ITSM User Personas

Persona: "the aspect of someone’s character that is presented to or perceived by others"

During my time in IT Service Management I’ve read my fair share of process and policy documentation. In fact, I think I’ve had the misfortune to read more than my fair share.

Process documentation is important, don’t get me wrong. Without someone taking the time to write down the intention, expected steps, outcome and quality checks within a module of work you don’t really have a process at all.

I was having a conversation this week with someone who was sure they had a process for something. It transpired that what they had was an unwritten set of best practices that everyone understood and followed. The outcome was actually fairly good and repeatable, but by no sensible definition was this a process.

In fact, the moment of realisation came when I asked if the best practices had changed over time. Of course they had, as the team had learned and improved. But without proper documentation outlining the new steps to take, nothing was truly repeatable. There was no real process.

Lean Process Documentation?

So, I am an advocate of writing documentation to support a process. But does it have to be so verbose, heavy in language, formal… and, let’s be honest, boring?

Lean principles teach us how to identify “waste”: anything more than the responsible minimum is waste. Do we actually need to include difficult-to-read text in our process documentation? If it doesn’t add value to the reader, surely it is wasteful and can be removed?

There are some practices that I use in my role in product development that I think would be really useful to IT Service Management professionals who are willing to take a fresh look at their process documentation.

User Personas in Action

A user persona is a method of representing an individual involved in a process. For example, in your Change Management process you will have a number of different people to think about:

  • The person requesting the change
  • The person that peer reviews
  • The approver
  • The person implementing the change
  • The person who does a post implementation review

Each of these people has different requirements, concerns and objectives when working within the Change Management process. Does your process documentation accurately represent these people?

Each user persona should fit on a single side of A4 – an example, ‘Angie – Change Manager’, is sketched below.

User personas represent real people within the process and I’d recommend using photos of your actual users. It’s a powerful thing in design meetings to have a set of personas pinned up on the wall and to actually see the faces of the people you are making decisions on behalf of.

Each persona has a short summary and then four sections.

  • What she’s thinking
  • What she’s hearing
  • What she’s saying
  • What she’s doing

The sections show the user’s concerns (“thinking”), the conversations that other people would start with her (“hearing”), the conversations she would start with others (“saying”) and her day-to-day actions within the process.
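To show how little structure a persona actually needs, here is a sketch of one as a simple data record (Python; every detail of ‘Angie’ below is invented for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class Persona:
    """One-page user persona; the four list fields mirror the sections above."""
    name: str
    role: str
    summary: str
    thinking: list = field(default_factory=list)  # her concerns
    hearing: list = field(default_factory=list)   # conversations others start with her
    saying: list = field(default_factory=list)    # conversations she starts with others
    doing: list = field(default_factory=list)     # day-to-day actions in the process

angie = Persona(
    name="Angie",
    role="Change Manager",
    summary="Chairs the weekly CAB and is measured on failed changes.",
    thinking=["Will this change collide with month-end processing?"],
    hearing=["Can you expedite this change for us?"],
    saying=["Where is the rollback plan for this change?"],
    doing=["Chairs the CAB", "Reviews post-implementation reports"],
)
```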

ITSM Caricatures

A persona should be a “caricature” of the different people who act that part within your process. You might seek out these people and interview them to discover what they are thinking, saying, hearing and doing. If you interview 4 different Change Managers and discover that they all have different concerns, your “Angie the Change Manager” persona would be a representation of a person who has ALL of those concerns. You are aiming to discover the extreme points of view that these people exhibit and document them.

We try to use personas throughout our product development process – when we design a process, defining these is a critical early step, and we rely on them during Acceptance Testing to ensure we are getting feedback from all of the people involved.

Back2ITSM Personas

A few months ago I mentioned personas in the Back2ITSM Facebook group and got a good response. As a community resource, I’d love to see a set of User Personas that cover all the roles in common ITSM processes. Imagine knowing that you need to write a new process for Configuration Management; heading over to Back2ITSM to download a set of user personas for each role in the process would be a huge head start.

I’d love to see process documentation move away from mimicking legal documents with reams of dense text and move towards a more user-centric representation of users’ requirements and concerns.

After all, your process affects real people. Let’s find out who they are at the design stage and make the process more suitable for their needs.

A structured approach to problem solving

Those who have worked in IT Operations have a strong affinity with the skills of problem solving and troubleshooting. Although a huge amount of effort goes into improving the resiliency and redundancy of IT systems, the ability to quickly diagnose the root cause of problems has never been more important and relevant.

IT Service Management has gone a long way towards making practices standardised and repeatable. For example, you don’t want individual creative input when executing standard changes or fulfilling requests. Standard Operating Procedures and process manuals mean that we expect our engineers and practitioners to behave in predictable ways. Those reluctant to participate in these newly implemented processes might even complain that all the fun has gone out of IT support.

A Home for Creative and Inquiring Minds?

However, there is still a place for creative and inquiring minds in ITSM-driven organisations. Complex systems are adept at finding new and interesting ways to break and stop functioning. Problem analysis still needs some creative input.

When I recruited infrastructure engineers into my team I was always keen to find good problem solvers. I’d find that some people were more naturally inclined to troubleshooting than others.

Some people would absolutely relish the pursuit of the cause of a difficult network or storage issue… thinking of possible causes, testing theories, hitting dead ends and starting again. They tackled problems with the mindset of a stereotypical criminal detective… finding clues, getting closer to the murder weapon, pulling network cables, tailing through the system log.

These kinds of engineers would rather puzzle over the debug output from their core switch than get stuck into the daily crossword. I’m sure if my HR manager let me medically examine these engineers, I’d find that their underlying brain activity and feeling of satisfaction were very similar to those of crossword puzzlers and sudoku players. I was paying these guys to do the equivalent of the Guardian crossword 5 days a week.

Others would shy away from troubleshooting sticky problems. They didn’t like the uncertainty of being responsible for fixing a situation they knew little about, or making decisions based on the loosest of facts.

They felt comfortable in executing routine tasks but lacked the capability to logically think through sets of symptoms and errors and work towards the root cause.

The problem I never solved

Working in a previous organisation, I remember a particularly tricky problem. Users of Apple computers running Microsoft PowerPoint would find that, on a regular basis, their open presentation would lock and stop them saving. They would have to save a new version and rename the file back to its original name.

It was a typical niggling problem that rumbled on for ages. We investigated different symptoms, spent a huge amount of time running tests and debugging network traces. We rebuilt computers, tried moving data to different storage devices and found the root cause elusive. We even moved affected users between floors to rule out network switch problems.

We dedicated very talented people to resolving the problem and made endless promises of progress to our customers. All of which proved false as we remained unable to find the root cause of the problem.

Our credibility ran thin with that customer and we were alarmed to discover that our previous good record of creatively solving problems in our infrastructure was under threat.

What’s wrong with creative troubleshooting?

The best troubleshooters in your organisation share some common traits.

  • They troubleshoot based on their own experiences
  • They (probably) aren’t always able to rationalise the root cause before attempting to fix it

Making assumptions based on your experiences is a natural thing to do – of course as you learn skills and go through cycles of problem solving you are able to apply your learnings to new situations. This isn’t a negative trait at all.

However it does mean that engineers approach new problems with a potentially limited set of skills and experiences. To network engineers all problems look like a potentially loose cable.

Not always being able to rationalise the root cause reflects the balance between intuition and evidence gathered through research. Your troubleshooter will work towards the root cause and will sometimes have hard evidence to confirm it.

“I can see this in the log… this is definitely the cause!”

But in some cases the cause might be suspected, but you aren’t able to prove anything until the fix is deployed.

Wrong decisions can be costly

Attempting the wrong fix is expensive in many ways, not least financially. It’s expensive in terms of time, user patience and most critically the credibility of IT to fix problems quickly.

Expert troubleshooters are able to provide rational evidence that confirms their root cause before a fix is attempted.

A framework is needed

As with a lot of other activities in IT, a process or framework can help troubleshooters identify the root cause of problems quickly. In addition to providing quick isolation of the root cause, the framework I’m going to discuss can provide evidence as to why we are suggesting this as the root cause.

Using a common framework has other benefits. For example:

  1. To allow collaboration between teams – Complex infrastructure problems can span multiple functional areas. You would expect to find subject matter experts from across the IT organisation working together to resolve problems. Using a common framework in your organisation allows teams to collaborate on problems in a repeatable way. Should the network team have a different methodology for troubleshooting than the application support team?
  2. To bring additional resources into a situation – Often ownership of Problems will be handed between teams in functional or hierarchical escalation. External resources may be brought in to assist with the problem. Having a common framework allows individuals to quickly get an appraisal of the situation and understand the progress that has already been made.
  3. To provide a common language for problem solvers – Structured problem analysis techniques have their own terminology. Having a shared understanding of “Problem Area”, “Root cause” and “Probable cause” will prevent misunderstandings and confusion during critical moments.

The Kepner Tregoe Problem Analysis process

Kepner-Tregoe is a global management consultancy firm specialising in improving the efficiency of their clients.

The founders, Chuck Kepner and Ben Tregoe, were social scientists living in California in the 1950s. They studied the methods of problem solvers and managers and consolidated their research into process definitions.

Their history is an interesting one and a biography of the organisation is outside the scope of this blog post – but definitely worth researching.

One of the processes developed, curated and owned by Kepner-Tregoe is Structured Problem Analysis, known as KT-PA.

KT-PA is used by hundreds of organisations to isolate problems and discover the root cause. It’s a framework used by problem solvers and troubleshooters to resolve issues and provide rational evidence that the investigation has discovered the correct cause.

Quick overview of the process

1. State the Problem

KT-PA begins with a clear definition of the Problem. A common mistake in problem analysis is a poor description of the problem, often leading to resources being dedicated to researching symptoms of the problem rather than the issue itself.

Having a clear and accurate Problem Statement is critical to finding the root cause quickly. KT-PA provides guidance on identifying the correct object and its deviation.

A typical Problem Statement might be:

Users of MyAccountingApplication are experiencing up to 2 second delays entering ledger information

This problem statement is explicit about the object (“Users of MyAccountingApplication”) and the deviation from normal service (“2 second delays entering ledger information”).

2. Specify the Problem

The process then defines how to specify the problem into Problem Areas. A Problem is specified in four dimensions, all of which should be considered – What, Where, When and Extent:

  1. What object has the deviation?
  2. What is the deviation?
  3. Where is the deviation on the object?
  4. When did the deviation occur?
  5. What is the extent of the deviation? (How many deviations are occurring? What is the size of one deviation? Is the number of deviations increasing or decreasing?)

The problem owner inspects the issue from these dimensions and documents the results. Results are recorded in the format of IS and IS NOT. Using the IS/IS NOT logical comparison starts to build a profile of the problem. Even at this early stage certain causes might become more apparent or less likely.

Already troubleshooters will be getting benefit from the process. The fact that, in the Where dimension, the 2 second delay “IS users in London” but “IS NOT users in New York” is hugely relevant.

The fact that the delay occurs in entering ledger information but not reading ledger information is also going to help subject matter experts think about possible causes.
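One way to picture the output of this step is as a simple table keyed by dimension. The sketch below (Python; the Extent entries are assumptions added to round out the example) records the specification for the problem statement above:

```python
# Illustrative IS / IS NOT specification; dimension names follow KT-PA.
problem_statement = ("Users of MyAccountingApplication are experiencing "
                     "up to 2 second delays entering ledger information")

specification = {
    "What":   {"IS": "2 second delay entering ledger information",
               "IS NOT": "a delay reading ledger information"},
    "Where":  {"IS": "users in London",
               "IS NOT": "users in New York"},
    "When":   {"IS": "first noticed August 2012",
               "IS NOT": "reported before 30th July"},
    "Extent": {"IS": "every ledger entry, and increasing",        # assumed
               "IS NOT": "isolated to individual workstations"},  # assumed
}
```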

3. Distinctions and Changes

Having specified the problem and made logical comparisons as to where the problem IS and IS NOT for each problem area, the next step is to examine Distinctions and Changes.

Each answer to a specifying question is examined for Distinctions and Changes.

  • What is distinct about users in London when compared to users in New York? What is different about their network, connectivity or workstation build?
  • What has changed for users in London?
  • What is distinct about August 2012 when compared to July?
  • What changed around the 30th July?

As these questions are asked and discussed possible root causes should become apparent. These are logged for testing in the next step.

4. Testing the cause

The stage of testing the cause before confirmation is, for me, the most valuable step in the KT-PA process. It isn’t particularly hard to think of possible root causes to a problem. Thinking back to “the problem I never solved”, we had many opinions from different technical experts on what the cause might be.

If we had used KT-PA with that problem we could have tested the cause against the problem specification to see how probable it is.

As an example, let’s imagine that during the Distinctions and Changes stage for our problem above, 3 possible root causes were suggested:

  • LAN connection issue with the switch the application server is connected to
  • The new anti-virus installation installed across the company in August is causing issues
  • Internet bandwidth in the London office is saturated

When each possible root cause is evaluated against the problem specification, you are able to test it using the following question:

“If a LAN connection issue with the switch the application server is connected to is the true cause of the problem, then how does it explain why users in London experience the issue but users in New York do not?”

This possible root cause doesn’t sound like a winner. If there were network connectivity issues with the server wouldn’t all users be affected?

“If the new anti-virus installation rolled out across the company in August is the true cause of the problem, then how does it explain why users in London experience the issue but users in New York do not?”

We came to this possible root cause because of a distinction and change in the WHEN problem dimension: in August a new version of anti-virus was deployed across the company. But it isn’t a probable root cause, for the same reason as before – New York users received the same deployment and aren’t affected.

“If saturated Internet bandwidth in the London office is the true cause of the problem, then how does it explain why users in London experience the issue but users in New York do not?”

So far this possible root cause sounds the most probable. The cause can explain the dimension of WHERE. Does it also explain the other dimensions of the problem?

“If saturated Internet bandwidth in the London office is the true cause of the problem, then how does it explain why the issue was first noticed in August 2012 and not reported before 30th July?”

Perhaps now we’d be researching Internet monitoring charts to see if the possible root cause can be confirmed.
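For larger problems it can help to keep this testing step as explicit bookkeeping. Here is a minimal sketch, with the True/False judgements invented to mirror the three candidates discussed above (each judgement is the problem owner’s answer to “does this cause explain the IS and IS NOT in this dimension?”):

```python
# Bookkeeping aid for step 4, not an oracle: humans make the judgements.
DIMENSIONS = ["What", "Where", "When", "Extent"]

candidates = {
    "LAN issue on the application server's switch":
        {"What": True, "Where": False, "When": False, "Extent": False},
    "Anti-virus rolled out company-wide in August":
        {"What": True, "Where": False, "When": True, "Extent": True},
    "Saturated Internet bandwidth in the London office":
        {"What": True, "Where": True, "When": True, "Extent": True},
}

for cause, judgements in candidates.items():
    gaps = [d for d in DIMENSIONS if not judgements.get(d, False)]
    verdict = "probable cause" if not gaps else "fails to explain " + ", ".join(gaps)
    print(f"{cause}: {verdict}")
```

Only a cause that explains every dimension survives; everything else is parked with a note of exactly which part of the specification it fails to explain.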

The New Rational Manager

You might find it hard to believe that my recommendation for one of the most relevant Problem Management books I’ve read is a book published in 1965.

But I’m recommending it anyway.

The New Rational Manager, written by Charles H Kepner and Benjamin B Tregoe, is a must-read for anyone who needs to solve problems, be they manufacturing, industrial, business or Information Technology problems.

It explains the process above in a readable way, with great examples. I think the word “computer” is mentioned once – this is not a book about modern technology – but it teaches the reader a process that can be applied to complex IT problems.

In Summary

Problem Management and troubleshooting are critical skills in ITSM and Infrastructure and Operations roles. Many talented troubleshooters make their reputation by applying creative technical knowledge to a problem and finding the root cause.

Your challenge is harnessing that creativity into a process to make their success repeatable in your organisation and to reduce the risk of fixing the wrong root cause.

7 Benefits of Using a Known Error Database (KEDB)

KEDB - a repository that describes all of the conditions in your IT systems that might result in an incident for your customers.

I was wondering – do you have a Known Error Database? And are you getting the maximum value out of it?

The concept of a KEDB is interesting to me because it is easy to see how it benefits end users. Also because it is dynamic and constantly updated.

Most of all because it makes the job of the Servicedesk easier.

It is true to say that an effective KEDB can both increase the quality of Incident resolution and decrease the time it takes.

The Aim of Problem Management and the Definition of “The System”

One of the aims of Problem Management is to identify and manage the root causes of Incidents. Once we have identified the causes we could decide to remove these problems to prevent further users being affected.

Obviously this might be a lengthy process – replacing a storage device that has an intermittent fault might take some scheduling. In the meantime Problem Managers should be investigating temporary resolutions or measures to reduce the impact of the Problem for users. This is known as the Workaround.

When talking about Problem Management it helps to have a good definition of “Your System”. There are many possible causes of Incidents that could affect your users including:

  • Hardware components
  • Software components
  • Networks, connectivity, VPN
  • Services – in-house and outsourced
  • Policies, procedures and governance
  • Security controls
  • Documentation and Training materials

Any of these components could cause Incidents for a user. Consider the idea that incorrect or misleading documentation could cause an Incident: a user may rely on this documentation, make assumptions on how to use a service, discover they can’t, and contact the Servicedesk.

This documentation component has caused an Incident and would be considered the root cause of the Problem.

Where the KEDB fits into the Problem Management process

The Known Error Database is a repository of information that describes all of the conditions in your IT systems that might result in an incident for your customers and users.

As users report issues, support engineers follow the normal steps in the Incident Management process: Logging, Categorisation, Prioritisation. Soon after that they should be on the hunt for a resolution for the user.

This is where the KEDB steps in.

The engineer would interact with the KEDB in a very similar fashion to any Search engine or Knowledgebase. They search (using the “Known Error” field) and retrieve information to view the “Workaround” field.

The “Known Error”

The Known Error is a description of the Problem as seen from the user’s point of view. When users contact the Servicedesk for help they have a limited view of the entire scope of the root cause. We should use screenshots of error messages, as well as the text of the message, to aid searching. We should also include accurate descriptions of the conditions that users have experienced. These are the types of things we should describe in the Known Error field. A good example of a Known Error would be:

When accessing the Timesheet application using Internet Explorer 6 users experience an error message when submitting the form.

The error message reads “Javascript exception at line 123”

The Known Error should be written in terms reflecting the customer’s experience of the Problem.

The “Workaround”

The Workaround is a set of steps that the Servicedesk engineer could take in order to either restore service to the user or provide temporary relief. A good example of a Workaround would be:

To workaround this issue add the timesheet application to the list of Trusted sites

1. Open Internet Explorer
2. Tools > Options > Security Settings [ etc etc ]

The Known Error is a search key. A Workaround is what the engineer is hoping to find – a search result. Having a detailed Workaround – a set of technical actions the Servicedesk should take to help the user – has multiple benefits, some more obvious than others.

Seven Benefits of Using a Known Error Database (KEDB)

  1. Faster restoration of service to the user – The user has lost access to a service due to a condition that we already know about and have seen before. The best possible experience that the user could hope for is an instant restoration of service or a temporary resolution. Having a good Known Error which makes the Problem easy to find also means that the Workaround should be quicker to locate. All of the time required to properly understand the root cause of the users issue can be removed by allowing the Servicedesk engineer quick access to the Workaround.
  2. Repeatable Workarounds – Without a good system for generating high-quality Known Errors and Workarounds we might find that different engineers resolve the same issue in different ways. Creativity in IT is absolutely a good thing, but repeatable processes are probably better. Two users contacting the Servicedesk for the same issue wouldn’t expect a variance in the speed or quality of resolution. The KEDB is a method of introducing repeatable processes into your environment.
  3. Avoid re-work – Without a KEDB we might find that engineers are often spending time and energy trying to find a resolution for the same issue. This would be likely in distributed teams working from different offices, but I’ve also seen it commonly occur within a single team. Have you ever asked an engineer if they know the solution to a user’s issue, only to be told “Yes, I fixed this for someone else last week!”? Would you have preferred to have found that information in an easier way?
  4. Avoid skill gaps – Within a team it is normal to have engineers at different levels of skill. You wouldn’t want to employ a team who are all experts in every functional area, and it’s natural to have more junior members at a lower skill level. A system for capturing the Workaround for complex Problems allows any engineer to quickly resolve issues that are affecting users. Teams are often cross-functional. You might see a centralised application support function in a head office with users in remote offices supported by their local IT teams. A KEDB gives all IT engineers a single place to search for customer-facing issues.
  5. Avoid dangerous or unauthorised Workarounds – We want to control the Workarounds that engineers give to users. I’ve had moments in the past when I chatted to engineers, asked how they fixed issues and internally winced at the methods they used. Disabling antivirus to avoid unexpected behaviour, upgrading whole software suites to fix a minor issue. I’m sure you can relate to this. Published, approved Workarounds can help eliminate dangerous ones.
  6. Avoid unnecessary transfer of Incidents – A weak point in the Incident Management process is the transfer of ownership between teams. This is the point where a customer issue goes to the bottom of someone else queue of work. Often with not enough detailed context or background information. Enabling the Servicedesk to resolve issues themselves prevents transfer of ownership for issues that are already known.
  7. Get insights into the relative severity of Problems – Well-written Known Errors make it easier to associate new Incidents with existing Problems. Firstly, this avoids duplicate logging of Problems. Secondly, it gives better metrics about how severe the Problem is. Consider two Problems in your system: a condition that affects a network switch and causes it to crash once every 6 months, and a transactional database that is running slowly and adding 5 seconds to timesheet entry. You would expect the first Problem to be given a high priority and the second a lower one. It stands to reason that a network outage on a core switch would be more urgent than a slowly running timesheet system. But which causes more Incidents over time? You might be associating 5 new Incidents per month with the timesheet problem, whereas the switch only causes issues irregularly. Being able to quickly associate Incidents with existing Problems allows you to judge the relative impact of each one.

The KEDB implementation

Technically, when we talk about the KEDB we are really talking about the Problem Management database rather than a completely separate store of data. At least, a decent implementation would have it set up that way.

There is a one-to-one mapping between Known Error and Problem, so it makes sense that your standard data representation of a Problem (with its number, assignment data, work notes etc.) also holds the data you need for the KEDB.

It isn’t incorrect to implement this in a different way – storing the Problems and Known Errors in separate locations – but my own preference is to keep it all together.

Known Error and Workaround are both attributes of a Problem
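A minimal sketch of that preference (Python; the field names are illustrative, not any particular product’s schema): the Known Error and Workaround live on the Problem record itself, and the Servicedesk searches across them.

```python
from dataclasses import dataclass

@dataclass
class Problem:
    """Problem record carrying its Known Error and Workaround as attributes."""
    number: str
    known_error: str   # user-facing description; the search key
    workaround: str    # the steps the Servicedesk follows
    status: str = "open"

kedb = [
    Problem("PRB0001",
            known_error=("Timesheet application in Internet Explorer 6 shows "
                         "'Javascript exception at line 123' on form submit"),
            workaround="Add the timesheet application to the Trusted sites list"),
]

def search_kedb(text):
    """Engineers search on the Known Error text and retrieve the Workaround."""
    return [(p.number, p.workaround) for p in kedb
            if p.status == "open" and text.lower() in p.known_error.lower()]

print(search_kedb("javascript exception"))  # -> [('PRB0001', 'Add the ...')]
```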

Is the KEDB the same as the Knowledge Base?

This is a common question. There are a lot of similarities between Known Errors and Knowledge articles.

I would argue that although your implementation of the KEDB might store its data in the Knowledgebase they are separate entities.

Consider the lifecycle of a Problem, and therefore the Known Error which is, after all, just an attribute of that Problem record.

The Problem should be closed when it has been removed from the system and can no longer affect users or be the cause of Incidents. At this stage we could retire the Known Error and Workaround as they are no longer useful – although we would want to keep them for reporting so perhaps we wouldn’t delete them.

Knowledgebase articles have a more permanent use. Although they too might be retired – if they refer to an application due to be decommissioned, for example – they don’t have the same lifecycle as a Known Error record.

Knowledge articles refer to how systems should work or provide training for users of the system. Known Errors document conditions that are unexpected.

There is benefit, however, in using the Knowledgebase as a repository for Known Error articles. Giving Incident owners a single place to search for both Knowledge and Known Errors is a nice feature of your implementation, and typically your Knowledge tools will have good authoring, linking and commenting capabilities.

What if there is no Workaround?

Sometimes there just won’t be a suitable Workaround to provide to customers.

I would use an example of a power outage to provide a simple illustration. With power disrupted to a location you could imagine that there would be disruption to services with no easy workaround.

It is perhaps a lazy example as it doesn’t allow for many nuances. Having power is normally a binary state – you either have adequate power or you don’t.

A better and more topical example can be found in the Cloud. As organisations take advantage of the resource charging model of the Cloud they also outsource control.

If you rely on a Cloud SaaS provider for your email and they suffer an outage you can imagine that your Servicedesk will take a lot of calls. However there might not be a Workaround you can offer until your provider restores service.

Another example would be the February 29th 2012 Microsoft Azure outage. I’m sure a lot of customers experienced a Problem, using many different definitions of the word, but didn’t have a viable alternative for their users.

Even in these cases there is value to be found in the Known Error Database: if there really is no known Workaround, that fact is still worth publishing to the KEDB.

Firstly, it aids in associating new Incidents with the Problem (using the Known Error as a search key) and stops engineers wasting time searching for an answer that doesn’t exist.

You can also stop engineers implementing potentially damaging workarounds by publishing the fact that the correct action is to wait for the root cause of the Problem to be resolved.

Lastly, with a lot of Problems in our system we might struggle to prioritise our backlog. Publishing the Known Error helps route new Incidents to the right Problem, which in turn lets you prioritise your most impactful issues.

A user’s Known Error profile

With a populated KEDB we now have a good understanding of the possible causes of Incidents within our system.

Not all Known Errors will affect all users – a network switch failure in one branch office would be very impactful for the local users but not for users in another location.

If we understand our users’ environments through systems such as the Configuration Management System (CMS) or Asset Management processes, we should be able to determine a user’s exposure to Known Errors.

For example, when a user phones the Servicedesk complaining of an interruption to service, we should be able to quickly learn about her configuration: where she is geographically, which services she connects to, and her personal hardware and software environment.

With this information, and some Configuration Item matching, the Servicedesk engineer should have a view of all of the Known Errors that the user is vulnerable to.
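Here is a rough sketch of that lookup in Python, assuming the CMS can list the Configuration Items a user depends on and that each KEDB record is linked to the CIs it affects. All of the names and records here are illustrative:

```python
from dataclasses import dataclass

@dataclass
class KnownErrorRecord:
    problem_number: str
    known_error: str
    workaround: str | None
    affected_cis: frozenset[str]   # Configuration Items this fault is known to impact

def known_error_profile(user_cis: set[str],
                        kedb: list[KnownErrorRecord]) -> list[KnownErrorRecord]:
    """Known Errors the user is exposed to: any record touching one of her CIs."""
    return [ke for ke in kedb if user_cis & ke.affected_cis]

# Example: a user whose environment includes the timesheet service
user_cis = {"switch-paris-01", "app-timesheet"}
kedb = [
    KnownErrorRecord("PRB0001", "Core switch crashes intermittently",
                     "Fail over to the standby switch",
                     frozenset({"switch-london-03"})),
    KnownErrorRecord("PRB0002", "Timesheet entry slowed by ~5 seconds",
                     "Enter timesheets in a batch at the end of the day",
                     frozenset({"app-timesheet"})),
]
for ke in known_error_profile(user_cis, kedb):
    print(ke.problem_number, "-", ke.known_error)   # only PRB0002 matches
```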

Measuring the effectiveness of the KEDB

As with all processes we should take measurements and ensure that we have a healthy process for updating and using the KEDB.

Here are some metrics that would help give your KEDB a health check.

Number of Problems opened with a Known Error

Of all the Problem records opened in the last X days how many have published Known Error records?

We should be striving to create as many high quality Known Errors as possible.

The value of a published Known Error is that Incidents can be easily associated with Problems avoiding duplication.

Number of Problems opened with a Workaround

How many Problems have a documented Workaround?

The Workaround allows for the customer Incident to be resolved quickly and using an approved method.

Number of Incidents resolved by a Workaround

How many Incidents are resolved using a documented Workaround? This measures the value provided to users of IT services and confirms the benefits of maintaining the KEDB.

Number of Incidents resolved without a Workaround or Knowledge

Conversely, how many Incidents are resolved without using a Workaround or another form of Knowledge?

If we see Servicedesk engineers having to research and discover their own solutions for Incidents does that mean that there are Known Errors in the system that we aren’t aware of?

Are there gaps in our Knowledge Management, meaning that customers are contacting the Servicedesk and we don’t have an answer readily available?

A high number in our reporting here might be an opportunity to proactively improve our Knowledge systems.
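As an illustration, here is a sketch of those health checks in Python, assuming Problem and Incident records exposed as dictionaries with the fields shown – in practice your ITSM suite’s own reporting tools would do this for you:

```python
from datetime import datetime, timedelta

def kedb_health(problems: list[dict], incidents: list[dict], days: int = 30) -> dict:
    """Counts for the four KEDB health checks described above."""
    cutoff = datetime.utcnow() - timedelta(days=days)
    recent = [p for p in problems if p["opened_at"] >= cutoff]
    return {
        "problems_with_known_error": sum(1 for p in recent if p["known_error"]),
        "problems_with_workaround": sum(1 for p in recent if p["workaround"]),
        "incidents_resolved_by_workaround":
            sum(1 for i in incidents if i["resolved_by"] == "workaround"),
        "incidents_resolved_unaided":
            sum(1 for i in incidents if i["resolved_by"] == "own_research"),
    }
```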

OLAs

We want to ensure that Known Errors are quickly written and published in order to allow Servicedesk engineers to associate incoming Incidents with existing Problems.

One method of measuring how quickly we are publishing Known Errors is to use Operational Level Agreements (or SLAs, if your ITSM tool doesn’t define OLAs).

We should be using performance measurements to ensure that our Problem Management function is publishing Known Errors in a timely fashion.

You could consider tracking “Time to generate Known Error” and “Time to generate Workaround” as performance metrics for your KEDB process.
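A tiny sketch of what that tracking might look like – the 8-hour target and the field names here are assumptions for illustration only:

```python
from datetime import datetime, timedelta
from typing import Optional

OLA_TARGET = timedelta(hours=8)   # assumed target for publishing a Known Error

def ola_breached(opened_at: datetime, published_at: Optional[datetime]) -> bool:
    """True if the Known Error (or Workaround) wasn't published within the OLA."""
    published = published_at or datetime.utcnow()   # unpublished = clock still running
    return published - opened_at > OLA_TARGET
```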

Additionally, we could measure how quickly Workarounds are researched, tested and published. If there is no known Workaround, that is still valuable information for the Servicedesk as it eliminates effort spent trying to find one, so an OLA is appropriate here too.

In summary

A well-maintained KEDB speeds up Incident resolution, protects users from risky or unauthorised fixes, and gives you the data you need to prioritise your Problem Management effort.

How to Provide Support for VIPs

One of the outcomes of IT Service Management is regulation, consistency and predictability in the delivery of services.

I remember working in IT before Service Management was adopted by our organisation and realising that we would over-service some customers and under-service others. Not intentionally – we simply didn’t have a way of regulating our work and making our output predictable.

Our method of work delivery seemed to be somewhere between “First come first served” and “She who shouts loudest shall get the best service”. Not the best way to manage service delivery.

Chris York tweeted an interesting message recently on this subject.

It’s a great topic to talk about and one that I remember having to deal with personally in previous jobs.

I have two different views on VIP treatment – I think it’s a complex subject and I’d love to know your thoughts in the comments below.

If your name’s not down, you’re not getting support

The Purist

Firstly IT Service Management is supposed to define exactly how services will be delivered to an organisation. The service definition includes the cost, warranty and utility that is to be provided.

Secondly, there is a difference between the Customer of the service and the User of the service. The Customer is characterised as the person (or group) that pays for the service; they also define and agree the service levels.

Users are characterised as individuals that use the service.

There are loads of great analogies to reinforce this point – from outsourced local government services (the local government is the customer, the local resident is the user) to restaurants and airports. The IT Skeptic has a good discussion on the subject.

It’s also true that the Customer might not be a user of the service at all, although in organisations I’ve worked in they usually are.

This presents an interesting dilemma for both the Provider and the Customer. Should the Customer expect more from the service than they originally negotiated with the Provider? The most common area where this dilemma occurs is end-user services – desktop support.

The people that would “sign on the dotted line” for the IT Services we used to provide would be Finance Directors, IT Directors, CFOs or CIOs – very senior people with responsibility for the cost of their services and for making sure the company gets a good deal.

Should we be surprised when senior people that ultimately pay for the service expect preferential treatment? No – but we should remind them of the service warranty that they agreed would be supplied.

Over-servicing VIPs has to be at the cost of someone else – and by artificially raising the quality of service for a few people we risk degrading the service for everyone.

The Pragmatist

The reality is that IT Service Management is a people business and a perception business, especially end-user services.

People call the Service desk when they want something (a Request) or they need help (an Incident). Both of these are quite emotional human states.

The performance and usability of someone’s IT equipment is fundamental to their productivity and their success. It feels very personal when the equipment you rely on stops functioning.

Although we can gather SLA and performance statistics for our stakeholder meetings, we have the problem that we are often seen as being only as good as our last experience with an individual person. It shouldn’t be this way – but it is.

I’ve been to meetings full of good news about the previous months service only to be ripped to pieces for a request submitted by the CEO that wasn’t actioned. I’ve been to meetings after a period of general poor service and had good reviews because the Customer had a (luckily) excellent experience with the Service desk.

Much as we might not like it, prioritising VIP support has an overall positive effect on perception when we do it.

The middle ground (or “How I’ve seen it done before”)

If you don’t like the Pragmatist view above there are ways to come to a compromise. Stephen Mann touched on an idea I have seen before:

Deciding business criticality is obviously a challenge.

In my previous role, in the advertising world, the most important people in an agency are the Creatives.

These guys churn out graphical and video content and work on billable hours. When their equipment fails the clock is ticking to get them back up and running again.

So calculating the financial cost of an individual’s downtime and assigning a role accordingly is one method of designating those who can expect prioritised support.

As a Service Provider in that last role, our customer base grew and our list of VIPs got longer. We eventually allocated 5% of each company’s headcount to have “VIP” status in our ITSM tool.
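A minimal sketch of that approach, assuming you can estimate a downtime cost per hour for each user – the field names and the 5% figure are illustrative:

```python
def allocate_vips(users: list[dict], fraction: float = 0.05) -> set[str]:
    """Flag the top slice of users, ranked by estimated hourly downtime cost."""
    quota = max(1, int(len(users) * fraction))
    ranked = sorted(users, key=lambda u: u["downtime_cost_per_hour"], reverse=True)
    return {u["name"] for u in ranked[:quota]}

# Example: the Creatives' billable hours put them at the top of the ranking
users = [
    {"name": "creative-1", "downtime_cost_per_hour": 300},
    {"name": "account-mgr", "downtime_cost_per_hour": 80},
    {"name": "back-office", "downtime_cost_per_hour": 40},
]
print(allocate_vips(users))   # {'creative-1'}
```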

I think there are ways to write VIP support into an IT Services contract that allow the provider to plan and scale their support to cater for it.

Lastly, we should talk about escalated Incidents. This is a more “formal” approach to Service Management (the Purist would be happy) where a higher level of service is allocated to resolving an Incident if it meets the criteria for being escalated.

When dealing with Users it is worth having a view of each person’s overall experience with the Service Provider. If a user already has one escalated Incident, should she expect a better service when she calls with another? Perhaps so – the Pragmatist would see that although we file each Incident separately, her perception of the service is based on the overall experience. With our ITSM suite we use informational messages to guide engineers as to the overall status of a User.


In summary…

I think everyone would agree that VIP support is a pain.

The Purist will have to deal with the fact that although he kept his service consistent regardless of the seniority of the caller he might have to do some unnecessary justification at the next review meeting.

The Pragmatist will have to suffer an unexpected drain on her resources when the CEO’s laptop breaks and everything must be focussed on restoring that one user’s service.

Those occupying the middle ground will be controlling the number of VIPs by defining a percentage of headcount for the Customer to allocate. Hopefully the Customer will understand the business well enough to allocate them to the correct roles (and probably herself).

The Middle Ground will also be looking at a user’s overall experience and adjusting service to make sure that escalated issues are dealt with quickly.

No-one said IT Service Management was going to be easy!


Planning for Major Incidents

Do regular processes go out of the window during a Major Incident?

Recently I’ve been working on Incident Management, and specifically on Major Incident planning.

During my time in IT Operations I saw teams handle Major Incidents in a number of different ways. I actually found that in some cases all process and procedure went out of the window during a Major Incident, which has a horrible irony about it. Logically it would seem that this is the time that applying more process to the situation would help, especially in the area of communications.

For example in an organisation I worked in previously we had a run of Storage Area Network outages. The first couple caused absolute mayhem and I could see people pushing back against the idea of breaking out the process-book because all that mattered was finding the technical fix and getting the storage back up and running.

At the end of the Incident, once we’d restored the service, we found that we had – perhaps unsurprisingly – a lot of unhappy customers! Our retrospective on that Incident showed us that taking just a short time at the beginning of the outage to sort out our communications plan would have helped the users a lot.

ITIL talks about Major Incident planning in a brief but fairly helpful way:

A separate procedure, with shorter timescales and greater urgency, must be used for ‘major’ incidents. A definition of what constitutes a major incident must be agreed and ideally mapped on to the overall incident prioritization system – such that they will be dealt with through the major incident process.

So, the first thing to note is that we don’t need a separate ITIL process for handling Major Incidents. The aim of the Incident Management process is to restore service to the users of a service, and that outcome suits us fine for Major Incidents too.

The Incident model, its categories and states (New > Work In Progress > Resolved > Closed) all work fine, and we shouldn’t be looking to stray too far from what we already have in terms of tools and process.

What is different about a Major Incident is that both the urgency and impact of the Incident are higher than a normal day-to-day Incident. Typically you might also say that a Major Incident affects multiple customers.

Working with a Major Incident

When working on a Major Incident we will probably have to think about communications a lot more, as our customers will want to know what is going on and rough timings for restoration of service.

Where a normal Incident will be handled by a single person (The Incident Owner) we might find that multiple people are involved in a Major Incident – one to handle the overall co-ordination for restoring service, one to handle communications and updates and so on.

Having a named person as a point of contact for users is a helpful trick. In my experience, the one thing users hate more than losing their service is not knowing when it will be restored, or receiving confusing or conflicting information. With one person responsible for both the technical fix and user communications, conflicting information is bound to happen – split those tasks.

If your ITSM suite has functionality for a news ticker, or a SocialIT feed it might be a good idea to have a central place to update customers about the Major Incident you are working on. If you run a service for the paying public you might want to jump onto Twitter to stop the Twitchfork mob discussing your latest outage without you being part of the conversation!

What is a Major Incident?

It is up to each organisation to clearly define what constitutes a Major Incident. Doing so is important, otherwise the team won’t know under what circumstances to start the process. Without clear guidance you might find that a team treats a server outage as Major one week (with excellent communications as a result) but not the next week (with poor communications).

Having this defined is an important step, but will vary between organisations.

Roughly speaking, a generic definition of a Major Incident could be (a simple classification sketch follows the list):

  • An Incident affecting more than one user
  • An Incident affecting more than one business unit
  • An Incident on a device of a certain type – core switch, access router, Storage Area Network
  • Complete loss of a service, rather than degradation
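Here is what encoding that definition might look like, so the team isn’t making a judgement call under pressure – a sketch only, with illustrative field names:

```python
CORE_DEVICE_TYPES = {"core switch", "access router", "storage area network"}

def is_major_incident(incident: dict) -> bool:
    """Apply the generic definition above to a single Incident record."""
    return (
        incident["affected_users"] > 1
        or incident["affected_business_units"] > 1
        or incident["device_type"] in CORE_DEVICE_TYPES
        or incident["service_state"] == "complete loss"
    )
```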

Is a P1 Incident a Major Incident?

No, although I would say that every Major Incident would be a P1. An urgent Incident affecting a single user might not be a Major Incident, especially if the Incident has a documented workaround or can be fixed straightaway.

Confusing P1 Incidents with Major Incidents would be a mistake. Priority is a calculation of Impact and Urgency, and the Major Incident plan needs to be reserved for the absolute maximum examples of both, and probably where the impact spans multiple users.
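To illustrate why the two concepts differ, here is a sketch of a typical Impact × Urgency priority matrix – the matrix values are an example, not a standard:

```python
# (impact, urgency) -> priority, where 1 is the highest rating on both axes
PRIORITY = {
    (1, 1): "P1", (1, 2): "P2", (2, 1): "P2",
    (1, 3): "P3", (2, 2): "P3", (3, 1): "P3",
    (2, 3): "P4", (3, 2): "P4", (3, 3): "P5",
}

def needs_major_incident_plan(impact: int, urgency: int, users_affected: int) -> bool:
    # Every Major Incident is a P1, but not every P1 is Major.
    return PRIORITY[(impact, urgency)] == "P1" and users_affected > 1
```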

Do I need a single Incident or multiple Incidents for logging a Major Incident?

This question might depend on your ITSM toolset, but my preference is to open a separate Incident for each affected user when they contact the Servicedesk.

The reason for this is that different users will be impacted in different ways. A user heading off to a sales pitch will have different concerns to a user just about to go on holiday for 2 weeks. We might want to apply different treatment to these users (get the sales pitch user some sort of service straight away) and this becomes confusing when you work in a single Incident record.

If you have a system of Hierarchical escalation you might find that one customer would escalate the Major Incident (to their sales rep for example) where another customer isn’t too bothered because they use the affected service less frequently.

Having an Incident opened for each user/customer allows you to judge exactly the severity of the Incident. The challenge then becomes to manage those Incidents easily, and be able to communicate consistently with your customers.
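As a sketch of the record-keeping, with illustrative names: each affected user gets their own Incident, linked back to the Major Incident for co-ordination and measurement:

```python
from dataclasses import dataclass, field

@dataclass
class MajorIncident:
    number: str
    description: str
    child_incidents: list[str] = field(default_factory=list)   # one Incident per user

    def log_user_incident(self, incident_number: str) -> None:
        """Link a newly logged user Incident to this Major Incident."""
        self.child_incidents.append(incident_number)

    @property
    def severity_indicator(self) -> int:
        """Relative impact: how many users have logged Incidents against it."""
        return len(self.child_incidents)
```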

Is a Major Incident a Problem?

No – although if we didn’t already have a Problem record open for this Major Incident, we should probably open one.

Remember the intended outcome of the Incident and Problem Management processes:

  • Incident Management: The outcome is a restoration of service for the users
  • Problem Management: The outcome is the identification and possibly removal of the causes of Incidents

The procedure is started when an Incident matches our definition of a Major Incident. Its outcome is to restore service and to handle the communication with multiple affected users. That restoration of service could come from a number of different sources: the removal of the root cause, a documented Workaround, or possibly a Workaround we have yet to find.

While the Major Incident plan and the Problem Management process will probably work closely together, it is not true to say that a Major Incident IS a Problem.

How can I measure my Major Incident Procedure?


I have some metrics for measuring the Major Incident procedure and I’d love to know your thoughts in the comments for this article.

  • Number of Incidents linked to a Major Incident: Where we create an Incident for each customer affected by a Major Incident, we can measure the relative impact of each occurrence.
  • The number of Major Incidents: We’d like to know how often we invoke the Major Incident plan.
  • Mean Time Between Major Incidents: How much time elapses between Major Incidents being logged (see the sketch below). This would be interesting in an organisation with service delivery issues, which would hope to see Major Incidents happen less frequently.
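A small sketch of how Mean Time Between Major Incidents could be calculated from the logged start times, assuming at least two Major Incidents on record:

```python
from datetime import datetime

def mean_time_between(starts: list[datetime]) -> float:
    """Average gap, in hours, between consecutive Major Incidents."""
    ordered = sorted(starts)
    gaps = [(later - earlier).total_seconds() / 3600
            for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)
```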

There you go. In summary, handling Major Incidents isn’t a huge leap from the method you use to handle day-to-day Incidents. It requires enhanced communication and possibly extra measurement.

I hope that you found this article helpful.

Photo Credit

Interview: Simon Morris, 'Sneaking ITIL into the Business'

Ignoring the obvious may lead to a nasty mess

I found Simon Morris via his remarkably useful ITIL in 140 app. Simon recently joined ServiceNow from a FTSE100 Advertising, Marketing and Communications group. He was Head of Operations and Engineering and part of a team that led the Shared Services IT organisation through its transition to IT Service Management process implementation. Here, Simon kindly shares his experiences of ITSM at the rock face.

ITSM Review: You state that prior to your ITSM transformation project you were ‘spending the entire time doing break-fix work and working yourselves into the ground with an ever-increasing cycle of work’. Looking back, can you remember any specific examples of what you were doing, that ITSM resolved?

Simon Morris:

Thinking back I can now see that implementing ITSM gave us the outcomes that we expected from the investment we made in time and money, as well as outcomes that we had no idea would be achieved. Because ITIL is such a wide-ranging framework I think it’s very difficult for organisations to truly appreciate how much is involved at the outset of the project.

We certainly had no idea how much effort would be spent overall on IT Service Management, but we were able to identify results early on which encouraged us to keep going. By the time I left the organisation we had multiple people dedicated to the practice, and of course ITSM processes affect all engineering staff on a day-to-day basis.

As soon as we finished our ITILv3 training we took the approach of selecting processes that we were already following, and adding layers of maturity to bring them into line with best practice.

I guess at the time we didn’t know it, but we started with Continual Service Improvement – looking at existing processes and identifying improvements. One example that I can recall is Configuration Management: with a very complex Infrastructure we previously had issues identifying the impact of maintenance work or unplanned outages. The Infrastructure had a high rate of change, and it felt impossible to keep a grip on how systems interacted and depended on each other.

Using Change Management we were able to regulate the rate of change, and keep on top of our Configuration data. Identifying the potential impact of an outage on a system was a process that went from hours down to minutes.

Q. What was the tipping point? How did the ITSM movement gather momentum from something far down the to do list to a strategic initiative? 

If I’m completely honest we had to “sneak it in”! We were under huge pressure to improve the level of professionalism, and to increase the credibility of IT, but constructing the business case for a full ITSM transition was very hard. Especially when you factor in the cost of training, certification, toolsets and the amount of time spent on process improvement. As I said, at the point I left the company we had full time headcount dedicated to ITSM, and getting approval for those additional people at the outset would have been impossible.

We were lucky to have some autonomy over the training budget and found a good partner to get a dozen or so engineers qualified to ITILv3 Foundation level. At that point we had enough momentum, and our influence at departmental head level to make the changes we needed to.

One of the outcomes of our “skunkworks” ITIL transition that we didn’t anticipate at the time was a much better financial appreciation of our IT Services. Before the project we were charging our internal business units on a bespoke rate card that didn’t accurately reflect the costs involved in providing the service. Within a year of the training we had built rate cards that both reflected the true cost of the IT Service, but also included long term planning for capacity.

This really commoditised IT Services such as Storage and Backup and we were able to apportion costs accurately to the business units that consumed the services.

Measuring the cost benefit of ITSM is something that I think the industry needs to do better in order to convince leaders that it’s a sensible business decision – I’m absolutely convinced that the improvements we made to our IT recharge model offset a sizeable portion of our initial costs. Plus we introduced benefits that were much harder to measure in a financial sense such as service uptime, reduced incident resolution times and increased credibility.

Q. How did you measure you were on the right track? What specifically were you measuring? How did you quantify success to the boss? 

Referring back to my point that we started by reviewing existing processes that were immature and then adding layers to them: we didn’t start out with process metrics, but we added them quite early on.

If I had the opportunity to start this process again I’d definitely start with the question of measurements and metrics. Before we introduced ITSM I don’t think we definitively knew where our problems were, although of course we had a good idea about Incident resolution times and customer satisfaction.

Although it’s tempting to jump straight into process improvement I’d encourage organisations at the start of their ITSM journey to spend time building a baseline of where they are today.

Surveys of your customers and users help to gauge the level of satisfaction before you start to make improvements. (Of course, this is a hard measurement to take, especially if you’ve never asked your users for honest feedback before – I’ve seen some pretty brutal survey responses in my time!)

Some processes are easier to monitor than others – Incident Management comes to mind as one that is fairly easy to gather metrics on; Event Management is another.

I would also say that having survived the ITIL Foundation course it’s important to go back into the ITIL literature to research how to measure your processes – it’s a subject that ITIL has some good guidance on with Critical Success Factors (CSFs) and Key Performance Indicators (KPIs).

Q. What would you advise to other companies that are currently stuck in the wrong place, ignoring the dog? (See Simon’s analogy here). Is there anything that you learnt on your journey that you would do differently next time? 

Wow, this is a big question.

Business outcomes

My first thought is that IT organisations should remember that our purpose is to deliver an outcome to the business, and your ITSM deployment should reflect this. In the same way that running IT projects with no clear business benefit, or alignment to an overall strategy is a bad idea – we shouldn’t be implementing ITIL just for the sake of doing it.

For every process that you design or improve, the first question should be “What is the business outcome?”, closely followed by “How am I going to prove that I delivered this outcome?”. An example for Incident Management would be an outcome of “restoring access to IT services within an agreed timeframe”, so the obvious answer to the second question is “measure the time to resolution”.

By analysing each process in this way you can get a clearer idea of what types of measurement you should take to:

  • Ensure that the process delivers value and
  • Demonstrate that value.

I actually think that you should start designing the process back-to-front. Identify the outcome, then the method of measurement and then work out what the process should be.

Every time I see an Incident Management form with hundreds of different choices for the category (Hardware, Software, Keyboard, Server etc.) I always wonder if the reporting requirements were ever considered – or did we just add fields for the sake of it?

Tool maturity

Next I would encourage organisations to consider their process maturity and ITSM toolset maturity as two different factors. There is a huge amount of choice in the ITSM suite market at the moment (of course I work for a vendor now, so I’m entitled to a bias!), but organisations should remember that all vendors offer a toolset and nothing more.

The tool has to support the process that you design, and it’s far too easy to take a great toolset and implement a lousy process. A year into your transition to ITSM you won’t be able to prove the worth of the time and money spent, and you run the risk of the process being devalued or abandoned.

Having a good process will drive the choice of tool, and the design decisions on how that tool is configured. Having the right toolset is a huge factor in the chances of a successful transition to ITSM. I’ve lived through experiences with legacy, unwieldy ITSM vendors and it makes the task so much harder.

Participation at every level

One of the best choices we made when we transitioned to ITSM was that we trained a cross-section of engineers across the company. Of the initial group of people to go through ITILv3 Foundation training we had engineers from the Service desk, PC and Mac support, Infrastructure, Service Delivery Managers, Asset management staff and departmental heads.

The result was that we had a core of people who were motivated enough to promote the changes we were making all across the IT department at different levels of seniority. Introducing change, and especially changes that measure the performance of teams and individuals will always induce fear and doubt in some people.

Had we limited the ITIL training to just the management team I don’t think we would have had the same successes. My only regret is that our highest level of IT management managed to swerve the training – I’ll send my old boss the link to this interview to remind him of this!

Find the right pace

A transition to ITSM processes is a marathon, not a sprint so it’s important to find the right tempo for your organisation. Rather than throwing an unsustainable amount of resource at process improvement for a short amount of time I’d advise organisations to recognise that they’ll need to reserve effort on a permanent basis to monitor, measure and improve their services.

ITIL burnout is a very real risk.

 


My last piece of advice is not to feel that you should implement every process on day one. I can’t think of one approach that would be more prone to failure. I’ve read criticism from ITSM pundits that it’s very rare to find a full ITILv3 implementation in the field. I think that says more about the breadth and depth of the ITIL framework than the failings of companies that implement it.

There’s an adage from the Free Software community – “release early, release often” that is great for ITSM process improvements.

By the time that I left my previous organisation we had iterated through 3 versions of Change Management, each time adding more maturity to the process and making incremental improvements.

I’d recommend reading “ITIL Lite, A road map to full or partial ITIL implementation” by Malcolm Fry. He outlines why ITILv3 might not be fully implemented and the reasons make absolute sense:

  • Cost
  • No customer support
  • Time constraints
  • Ownership
  • Running out of steam

IT Service Management is a cultural change, and it’s worth taking the time to alter people’s working habits gradually, rather than exposing them to a huge amount of process change quickly.

Q. Lastly, what do you do at ServiceNow?

I work as a developer in the Application Development Team in Richmond, London. We’re responsible for the ITSM and Business process applications that run on our Cloud platform. On a day-to-day basis this means reviewing our core applications (Incident, Problem, Change, CMDB) and looking for improvements based on customer requirements and best practice.

Obviously the recent ITIL 2011 release is interesting as we work our way through the literature and compare it against our toolset. Recently I’ve also been involved in researching how best to integrate Defect Management into our SCRUM product.

The fact that ServiceNow is growing at an amazing rate (we’re currently the second fastest-growing tech company in the US) shows that ITSM is being taken seriously by organisations, and that they are investing money to get the returns that a successful transition can offer. These should be encouraging signs to organisations that are starting their journey with ITIL.

@simo_morris
Beer and Speech
Photo Credit