SERVICE DESK 2.0 - The Service Desk is dead… long live the Service Desk!

Service Desk 2.0: More about services, products and capabilities, less about incidents and fixes.

We all know the world of IT is developing at a frightening pace.

Has Service Management been left in the dust?

I recently corresponded with Aale Roos, ITSM Consultant and founder at Pohjoisviitta Oy, who argues that the old perception of the Service Desk has to be replaced with a new way of thinking.

Q. The ITSM Review: Aale, could you tell me a bit about yourself?

In 1989 I left my job at a computer centre to become an ITSM consultant (we called it Data Processing Management Consulting back then; the company was called DPMC Oy).

In 1992 I started the Help Desk Institute in Finland. By 2002 I was completely bored with help desks, but I saw that ITIL was coming and went into ITIL training and consulting.

Then in 2007 I thought that ITIL V3 was a big mistake and concentrated on ISO 20000 instead.

Today I see a renewed interest in support but I think that ITIL is way behind. People don’t want to hear the same old stuff.

Q. What led to your Service Desk 2.0 Concept?

There are three major reasons why the good old Service Desk model is fast becoming obsolete:

1. Users Got IT Savvy

The concept of a Single Point of Contact (SPOC) was a great innovation. Instead of having several numbers to call (PC support, operators, telecommunications and so on), IT end users were given one single number and a promise that they would get help. The model was a great success; it was a major improvement on the previous situation, both for IT and the users of IT services.

There is nothing wrong with the SPOC model itself; it works fine if there is a fairly homogeneous group of customers who have the same problems. That happens when people are confronting something complicated which is new to them and the 20/80 rule applies: if you can solve the most common 20% of issues, you can resolve 80% of end user calls. That was the situation with IT before the year 2000. It was new and complicated, and people had recurring problems that were relatively easy to solve.

Today almost everyone is used to IT and can solve simple problems themselves. People are not afraid of computers as they were in the 1980s when this model was invented. There is no longer a homogeneous group of users with easy problems; users are different, and their problems and needs are more specific.

2. Diversity vs. Standardization

The second major change is the technology. There have been two major waves of computing and a third is emerging: first central computing with mainframes, then personal computing with PCs, and now consumer computing with iPads, apps and the cloud. One of the key concepts in ITSM is standardization. Support and maintenance are much easier if users have standard equipment. BYOD is anathema to this but is becoming a reality. People use the tools they want to use, and consumer products are now overtaking corporate IT. It is hard to support something you do not know.

3. Paradigm Shift in Support

The third change is the real game changer. The whole Service Desk/incident/problem way of thinking is based on the assumption that technology malfunctions but is easy to fix, so there must be one support person per x hundred users. This model would not work with consumer services, where one person can support a million users. Facebook has 845,000,000 users and 3,000 staff; I would be surprised if more than 845 of them were doing support, probably fewer. WordPress has 20 million customers and 10 Happiness Engineers to support them.

The only way to support millions of users with one person is to make products and services robust, reliable and easy to use and that is exactly what has happened.

Aale Roos

Q. What does Service Desk 2.0 mean in practice?

Do we still need a service desk then? Yes we do, but it has to change. The old ITIL Service Desk is like the old service station that includes a garage: handy if your car breaks down. The new service station does not fix cars but sells food: handy if you are hungry.

The new Service Desk 2.0 is like the Apple Store. It is not about incidents and fixes; it is about services and products. Or maybe it is really about capabilities: Service Desk 2.0 strives to give you better tools.

The new model plays down the SPOC idea. Yes, there is a number, but it is OK to contact the expert directly. The key is service, not incidents. Self-service and peer support are important. SD2 is the place for new solutions. Feedback is also important: SD2 listens to the customers and drives service improvements.

Q. There seems to be increased interest in the Service Catalogue – is this the answer to swapping the focus from call volumes to services and perceived value?

Yes, exactly.

Q. What key steps would you recommend for embracing the requirements of the modern day service desk?

  1. Learn to use new tools and keep up with your front-runner customers.
  2. Be active in sharing new solutions.
  3. Be visible in social media.
  4. Understand that peer-to-peer support happens.

Aale Roos is an ITSM Consultant and founder at Pohjoisviitta Oy.

See also:

7 Benefits of Using a Known Error Database (KEDB)

KEDB - a repository that describes all of the conditions in your IT systems that might result in an incident for your customers.

I was wondering – do you have a Known Error Database? And are you getting the maximum value out of it?

The concept of a KEDB is interesting to me because it is easy to see how it benefits end users. Also because it is dynamic and constantly updated.

Most of all because it makes the job of the Servicedesk easier.

It is true to say that an effective KEDB can both increase the quality of Incident resolution and reduce the time it takes.

The Aim of Problem Management and the Definition of “The System”

One of the aims of Problem Management is to identify and manage the root causes of Incidents. Once we have identified the causes we could decide to remove these problems to prevent further users being affected.

Obviously this might be a lengthy process – replacing a storage device that has an intermittent fault might take some scheduling. In the meantime Problem Managers should be investigating temporary resolutions or measures to reduce the impact of the Problem for users. This is known as the Workaround.

When talking about Problem Management it helps to have a good definition of “Your System”. There are many possible causes of Incidents that could affect your users including:

  • Hardware components
  • Software components
  • Networks, connectivity, VPN
  • Services – in-house and outsourced
  • Policies, procedures and governance
  • Security controls
  • Documentation and Training materials

Any of these components could cause Incidents for a user. Consider the idea that incorrect or misleading documentation could cause an Incident. A user may rely on this documentation, make assumptions about how to use a service, discover they can’t, and contact the Servicedesk.

This documentation component has caused an Incident and would be considered the root cause of the Problem.

Where the KEDB fits into the Problem Management process

The Known Error Database is a repository of information that describes all of the conditions in your IT systems that might result in an incident for your customers and users.

As users report issues support engineers would follow the normal steps in the Incident Management process. Logging, Categorisation, Prioritisation. Soon after that they should be on the hunt for a resolution for the user.

This is where the KEDB steps in.

The engineer would interact with the KEDB in a very similar fashion to any Search engine or Knowledgebase. They search (using the “Known Error” field) and retrieve information to view the “Workaround” field.

The “Known Error”

The Known Error is a description of the Problem as seen from the user’s point of view. When users contact the Servicedesk for help they have a limited view of the entire scope of the root cause. We should use screenshots of error messages, as well as the text of the message, to aid searching. We should also include accurate descriptions of the conditions that users have experienced. These are the types of things we should be describing in the Known Error field. A good example of a Known Error would be:

When accessing the Timesheet application using Internet Explorer 6 users experience an error message when submitting the form.

The error message reads “Javascript exception at line 123”

The Known Error should be written in terms reflecting the customer’s experience of the Problem.

The “Workaround”

The Workaround is a set of steps that the Servicedesk engineer could take in order to either restore service to the user or provide temporary relief. A good example of a Workaround would be:

To work around this issue, add the timesheet application to the list of Trusted Sites:

1. Open Internet Explorer
2. Tools > Options > Security Settings [ etc etc ]

The Known Error is a search key. A Workaround is what the engineer is hoping to find – a search result. Having a detailed Workaround, a set of technical actions the Servicedesk should take to help the user, has multiple benefits – some more obvious than others.
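
To make that search-key idea concrete, here is a minimal sketch in Python of how a tool might match an engineer’s search terms against Known Error text and return the associated Workaround. The record fields and the naive word-overlap scoring are illustrative assumptions, not a description of any particular ITSM product.

# Minimal sketch: searching Known Errors to find a Workaround.
# The records and the scoring are illustrative only.

known_errors = [
    {
        "problem_id": "PRB0001",
        "known_error": ("When accessing the Timesheet application using Internet "
                        "Explorer 6 users experience an error message when submitting "
                        "the form: 'Javascript exception at line 123'"),
        "workaround": "Add the timesheet application to the list of Trusted Sites.",
    },
]

def search_kedb(query: str, records=known_errors):
    """Return records ranked by how many query words appear in the Known Error text."""
    terms = set(query.lower().split())
    scored = []
    for record in records:
        hits = sum(1 for term in terms if term in record["known_error"].lower())
        if hits:
            scored.append((hits, record))
    return [record for hits, record in sorted(scored, key=lambda pair: -pair[0])]

for match in search_kedb("timesheet javascript exception"):
    print(match["problem_id"], "->", match["workaround"])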

Seven Benefits of Using a Known Error Database (KEDB)

  1. Faster restoration of service to the user – The user has lost access to a service due to a condition that we already know about and have seen before. The best possible experience the user could hope for is an instant restoration of service or a temporary resolution. Having a good Known Error that makes the Problem easy to find also means the Workaround should be quicker to locate. All of the time required to properly understand the root cause of the user’s issue can be removed by allowing the Servicedesk engineer quick access to the Workaround.
  2. Repeatable Workarounds – Without a good system for generating high-quality Known Errors and Workarounds we might find that different engineers resolve the same issue in different ways. Creativity in IT is absolutely a good thing, but repeatable processes are probably better. Two users contacting the Servicedesk with the same issue wouldn’t expect a variance in the speed or quality of resolution. The KEDB is a method of introducing repeatable processes into your environment.
  3. Avoid re-work – Without a KEDB we might find that engineers spend time and energy finding a resolution for the same issue more than once. This is likely in distributed teams working from different offices, but I’ve also seen it commonly occur within a single team. Have you ever asked an engineer if they know the solution to a user’s issue, only to be told “Yes, I fixed this for someone else last week!”? Wouldn’t you have preferred to find that information in an easier way?
  4. Avoid skill gaps – Within a team it is normal to have engineers at different levels of skill. You wouldn’t want to employ a team who are all experts in every functional area, and it’s natural to have more junior members at a lower skill level. A system for capturing the Workaround for complex Problems allows any engineer to quickly resolve issues that are affecting users. Teams are often cross-functional: you might see a centralised application support function in a head office with users in remote offices supported by their local IT teams. A KEDB gives all IT engineers a single place to search for customer-facing issues.
  5. Avoid dangerous or unauthorised Workarounds – We want to control the Workarounds that engineers give to users. I’ve had moments in the past when I chatted to engineers, asked how they fixed issues, and internally winced at the methods they used: disabling antivirus to avoid unexpected behaviour, upgrading whole software suites to fix a minor issue. I’m sure you can relate to this. Documented, approved Workarounds help eliminate dangerous improvised ones.
  6. Avoid unnecessary transfer of Incidents – A weak point in the Incident Management process is the transfer of ownership between teams. This is the point where a customer issue goes to the bottom of someone else’s queue of work, often without enough context or background information. Enabling the Servicedesk to resolve issues themselves prevents transfer of ownership for issues that are already known.
  7. Get insights into the relative severity of Problems – Well-written Known Errors make it easier to associate new Incidents with existing Problems. Firstly, this avoids duplicate logging of Problems. Secondly, it gives better metrics about how severe a Problem is. Consider two Problems in your system: a condition that affects a network switch and causes it to crash once every 6 months, and a transactional database that is running slowly and adding 5 seconds to timesheet entry. You would expect the first Problem to be given a high priority and the second a lower one; it stands to reason that a network outage on a core switch would be more urgent than a slowly running timesheet system. But which causes more Incidents over time? You might be associating 5 new Incidents per month with the timesheet problem, whereas the switch only causes issues irregularly. Being able to quickly associate Incidents with existing Problems allows you to judge the relative impact of each one (the sketch after this list shows one way to tally that).
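
As a rough illustration of point 7, a small Python sketch can count how many Incidents have been associated with each Problem; the record layout is an assumption for illustration only.

from collections import Counter

# Illustrative Incident records, each associated with an existing Problem.
incidents = [
    {"id": "INC1001", "problem_id": "PRB-TIMESHEET"},
    {"id": "INC1002", "problem_id": "PRB-TIMESHEET"},
    {"id": "INC1003", "problem_id": "PRB-CORE-SWITCH"},
    {"id": "INC1004", "problem_id": "PRB-TIMESHEET"},
]

# Count Incidents per Problem to compare their relative impact over time.
impact = Counter(incident["problem_id"] for incident in incidents)
for problem_id, count in impact.most_common():
    print(f"{problem_id}: {count} associated Incidents")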

The KEDB implementation

Technically, when we talk about the KEDB we are really talking about the Problem Management database rather than a completely separate store of data. At least, a decent implementation would have it set up that way.

There is a one-to-one mapping between Known Error and Problem so it makes sense that your standard data representation of a Problem (with its number, assignment data, work notes etc) also holds the data you need for the KEDB.

It isn’t incorrect to implement this in a different way – storing the Problems and Known Errors in separate locations – but my own preference is to keep it all together.

Known Error and Workaround are both attributes of a Problem
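
As a sketch of that one-to-one mapping, the Known Error and Workaround can simply live on the Problem record itself. The field names below are assumptions for illustration rather than any particular tool’s schema.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Problem:
    """One Problem record; the Known Error and Workaround are attributes of it."""
    number: str
    assignment_group: str
    work_notes: list = field(default_factory=list)
    known_error: Optional[str] = None   # customer-facing description, used as the search key
    workaround: Optional[str] = None    # approved steps the Servicedesk can follow

prb = Problem(
    number="PRB0001",
    assignment_group="Application Support",
    known_error="Timesheet submission fails in Internet Explorer 6 with 'Javascript exception at line 123'",
    workaround="Add the timesheet application to the list of Trusted Sites.",
)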

Is the KEDB the same as the Knowledge Base?

This is a common question. There are a lot of similarities between Known Errors and Knowledge articles.

I would argue that although your implementation of the KEDB might store its data in the Knowledgebase, they are separate entities.

Consider the lifecycle of a Problem, and therefore the Known Error which is, after all, just an attribute of that Problem record.

The Problem should be closed when it has been removed from the system and can no longer affect users or be the cause of Incidents. At this stage we could retire the Known Error and Workaround as they are no longer useful – although we would want to keep them for reporting so perhaps we wouldn’t delete them.

Knowledgebase articles have a more permanent use. Although they too might be retired (if they refer to an application due to be decommissioned, for example), they don’t have the same lifecycle as a Known Error record.

Knowledge articles refer to how systems should work or provide training for users of the system. Known Errors document conditions that are unexpected.

There is, however, benefit in using the Knowledgebase as a repository for Known Error articles. Giving Incident owners a single place to search for both Knowledge and Known Errors is a nice feature of an implementation, and typically your Knowledge tools will have good authoring, linking and commenting capabilities.

What if there is no Workaround?

Sometimes there just won’t be a suitable Workaround to provide to customers.

I would use an example of a power outage to provide a simple illustration. With power disrupted to a location you could imagine that there would be disruption to services with no easy workaround.

It is perhaps a lazy example as it doesn’t allow for many nuances. Having power is normally a binary state – you either have adequate power or not.

A better and more topical example can be found in the Cloud. As organisations take advantage of the resource charging model of the Cloud they also outsource control.

If you rely on a Cloud SaaS provider for your email and they suffer an outage you can imagine that your Servicedesk will take a lot of calls. However there might not be a Workaround you can offer until your provider restores service.

Another example would be the February 29th Microsoft Azure outage. I’m sure a lot of customers experienced a Problem using many different definitions of the word but didn’t have a viable alternative for their users.

In this case there is still value to be found in the Known Error Database. If there really is no known workaround it is still worth publishing to the KEDB.

Firstly, it aids in associating new Incidents with the Problem (using the Known Error as a search key), and secondly it stops engineers wasting time searching for an answer that doesn’t exist.

You could also avoid engineers trying to implement potentially damaging workarounds by publishing the fact that the correct action to take is to wait for the root cause of the Problem to be resolved.

Lastly, with a lot of Problems in our system we might struggle to prioritise our backlog. Having the Known Error published to help route new Incidents to the right Problem brings the benefit of being able to prioritise your most impactful issues.

A user’s Known Error profile

With a populated KEDB we now have a good understanding of the possible causes of Incidents within our system.

Not all Known Errors will affect all users – a network switch failure in one branch office would be very impactful for the local users but not for users in another location.

If we understand our users’ environments through systems such as the Configuration Management System (CMS) or Asset Management processes, we should be able to determine a user’s exposure to Known Errors.

For example, when a user phones the Servicedesk complaining of an interruption to service, we should be able to quickly learn about her configuration: where she is geographically, which services she connects to, and her personal hardware and software environment.

With this information, and some Configuration Item matching, the Servicedesk engineer should have a view of all of the Known Errors that the user is vulnerable to.
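
A minimal sketch of that Configuration Item matching might look like the following; the CI structure and the affected_cis field are assumptions for illustration, since real CMS data models vary widely.

# Sketch: derive a user's Known Error profile from the CIs they depend on.
# The data shapes here are assumptions, not a real CMS schema.

known_errors = [
    {"problem_id": "PRB0001", "summary": "Timesheet fails in IE6", "affected_cis": {"timesheet-app", "ie6"}},
    {"problem_id": "PRB0002", "summary": "Branch switch crashes", "affected_cis": {"switch-branch-07"}},
]

user_profile = {
    "name": "Jane",
    "location": "Branch 07",
    "configuration_items": {"timesheet-app", "ie6", "switch-branch-07", "laptop-4411"},
}

def known_error_profile(user, errors):
    """Return the Known Errors whose affected CIs overlap with the user's CIs."""
    return [e for e in errors if e["affected_cis"] & user["configuration_items"]]

for error in known_error_profile(user_profile, known_errors):
    print(error["problem_id"], "-", error["summary"])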

Measuring the effectiveness of the KEDB

As with all processes we should take measurements and ensure that we have a healthy process for updating and using the KEDB.

Here are some metrics that would help give your KEDB a health check.

Number of Problems opened with a Known Error

Of all the Problem records opened in the last X days how many have published Known Error records?

We should be striving to create as many high quality Known Errors as possible.

The value of a published Known Error is that Incidents can be easily associated with Problems, avoiding duplication.

Number of Problems opened with a Workaround

How many Problems have a documented Workaround?

The Workaround allows for the customer Incident to be resolved quickly and using an approved method.

Number of Incidents resolved by a Workaround

How many Incidents are resolved using a documented Workaround? This measures the value provided to users of IT services and confirms the benefits of maintaining the KEDB.

Number of Incidents resolved without a Workaround or Knowledge

Conversely, how many Incidents are resolved without using a Workaround or another form of Knowledge?

If we see Servicedesk engineers having to research and discover their own solutions for Incidents does that mean that there are Known Errors in the system that we aren’t aware of?

Are there gaps in our Knowledge Management, meaning that customers are contacting the Servicedesk and we don’t have an answer readily available?

A high number in our reporting here might be an opportunity to proactively improve our Knowledge systems.
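
Pulling those health-check numbers together, a sketch of the calculations might look like this; the record fields are assumptions for illustration.

# Sketch of the KEDB health-check metrics described above.
# Field names are illustrative assumptions.

problems = [
    {"id": "PRB0001", "known_error": True, "workaround": True},
    {"id": "PRB0002", "known_error": True, "workaround": False},
    {"id": "PRB0003", "known_error": False, "workaround": False},
]

incidents = [
    {"id": "INC1001", "resolved_by_workaround": True},
    {"id": "INC1002", "resolved_by_workaround": False},
    {"id": "INC1003", "resolved_by_workaround": True},
]

problems_with_known_error = sum(p["known_error"] for p in problems)
problems_with_workaround = sum(p["workaround"] for p in problems)
resolved_by_workaround = sum(i["resolved_by_workaround"] for i in incidents)
resolved_without = len(incidents) - resolved_by_workaround

print(f"Problems with a Known Error: {problems_with_known_error}/{len(problems)}")
print(f"Problems with a Workaround:  {problems_with_workaround}/{len(problems)}")
print(f"Incidents resolved by a Workaround: {resolved_by_workaround}")
print(f"Incidents resolved without one:     {resolved_without}")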

OLAs

We want to ensure that Known Errors are quickly written and published in order to allow Servicedesk engineers to associate incoming Incidents with existing Problems.

One method of measuring how quickly we are publishing Known Errors is to use Operational Level Agreements (or SLAs if your ITSM tool doesn’t define OLAs).

We should be using performance measurements to ensure that our Problem Management function is publishing Known Errors in a timely fashion.

You could consider tracking “Time to generate Known Error” and “Time to generate Workaround” as performance metrics for your KEDB process.
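
One way to track those timings is simply to record when the Problem was opened and when the Known Error and Workaround were published, then report the elapsed time against an agreed target. The field names and the 4-hour target below are assumptions for illustration.

from datetime import datetime, timedelta

# Illustrative Problem record with publication timestamps.
problem = {
    "id": "PRB0001",
    "opened_at": datetime(2012, 5, 1, 9, 0),
    "known_error_published_at": datetime(2012, 5, 1, 11, 30),
    "workaround_published_at": datetime(2012, 5, 1, 15, 45),
}

OLA_TARGET = timedelta(hours=4)  # assumed target for publishing a Known Error

time_to_known_error = problem["known_error_published_at"] - problem["opened_at"]
time_to_workaround = problem["workaround_published_at"] - problem["opened_at"]

print("Time to generate Known Error:", time_to_known_error,
      "(within OLA)" if time_to_known_error <= OLA_TARGET else "(OLA breached)")
print("Time to generate Workaround: ", time_to_workaround)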

Additionally, we could measure how quickly Workarounds are researched, tested and published. If there is no known Workaround, that is still valuable information for the Servicedesk as it eliminates effort spent trying to find one, so an OLA would be appropriate here too.

In summary

An effective KEDB speeds up Incident resolution, keeps Workarounds safe and repeatable, and gives insight into which Problems have the biggest impact on your users, so it is well worth measuring and maintaining.

Planning for Major Incidents

Do regular processes go out of the window during a Major Incident?

Recently I’ve been working on Incident Management, and specifically on Major Incident planning.

During my time in IT Operations I saw teams handle Major Incidents in a number of different ways. I actually found that in some cases all process and procedure went out of the window during a Major Incident, which has a horrible irony about it. Logically it would seem that this is the time that applying more process to the situation would help, especially in the area of communications.

For example, in an organisation I worked in previously, we had a run of Storage Area Network outages. The first couple caused absolute mayhem, and I could see people pushing back against the idea of breaking out the process book because all that mattered was finding the technical fix and getting the storage back up and running.

At the end of the Incident, once we’d restored the service, we found that we, perhaps unsurprisingly, had a lot of unhappy customers! Our retrospective on that Incident showed us that taking just a short time at the beginning of the outage to sort out our communications plan would have helped the users a lot.

ITIL talks about Major Incident planning in a brief but fairly helpful way:

A separate procedure, with shorter timescales and greater urgency, must be used for ‘major’ incidents. A definition of what constitutes a major incident must be agreed and ideally mapped on to the overall incident prioritization system – such that they will be dealt with through the major incident process.

So, the first thing to note is that we don’t need a separate ITIL process for handling Major Incidents. The aim of the Incident Management process is to restore service to the users of a service, and that outcome suits us fine for Major Incidents too.

The Incident model, its categories and states (New > Work In Progress > Resolved > Closed) all work fine, and we shouldn’t be looking to stray too far from what we already have in terms of tools and process.

What is different about a Major Incident is that both the urgency and impact of the Incident are higher than a normal day-to-day Incident. Typically you might also say that a Major Incident affects multiple customers.

Working with a Major Incident

When working on a Major Incident we will probably have to think about communications a lot more, as our customers will want to know what is going on and rough timings for restoration of service.

Where a normal Incident will be handled by a single person (The Incident Owner) we might find that multiple people are involved in a Major Incident – one to handle the overall co-ordination for restoring service, one to handle communications and updates and so on.

Having a named person as a point of contact for users is a helpful trick. In my experience the one thing that users hate more than losing their service is not knowing when it will be restored, or receiving confusing or conflicting information. With one person responsible for both the technical fix and user communications this is bound to happen – split those tasks.

If your ITSM suite has functionality for a news ticker or a SocialIT feed, it might be a good idea to have a central place to update customers about the Major Incident you are working on. If you run a service for the paying public you might want to jump onto Twitter to stop the Twitchfork mob discussing your latest outage without you being part of the conversation!

What is a Major Incident?

It is up to each organisation to clearly define what constitutes a Major Incident. Doing so is important, otherwise the team won’t know under what circumstances to start the process. You might also find that, without clear guidance, a team treats a server outage as Major one week (with excellent communications) but not the next week, with poor communications as a result.

Having this defined is an important step, but will vary between organisations.

Roughly speaking, a generic definition of a Major Incident could be any of the following (a classification sketch follows the list):

  • An Incident affecting more than one user
  • An Incident affecting more than one business unit
  • An Incident on a device of a certain type – core switch, access router, Storage Area Network
  • Complete loss of a service, rather than degradation
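
As a sketch only, the criteria above could be expressed as a simple classification rule; the thresholds and device list are assumptions that each organisation would need to agree for itself.

# Sketch: deciding whether an Incident should invoke the Major Incident procedure.
# The criteria below mirror the generic definition above and are assumptions only.

CRITICAL_DEVICE_TYPES = {"core switch", "access router", "storage area network"}

def is_major_incident(affected_users: int,
                      affected_business_units: int,
                      device_type: str = "",
                      complete_loss_of_service: bool = False) -> bool:
    return (
        affected_users > 1
        or affected_business_units > 1
        or device_type.lower() in CRITICAL_DEVICE_TYPES
        or complete_loss_of_service
    )

print(is_major_incident(affected_users=40, affected_business_units=2))            # True
print(is_major_incident(affected_users=1, affected_business_units=1,
                        device_type="core switch"))                               # True
print(is_major_incident(affected_users=1, affected_business_units=1))             # False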

Is a P1 Incident a Major Incident?

No, although I would say that every Major Incident would be a P1. An urgent Incident affecting a single user might not be a Major Incident, especially if the Incident has a documented workaround or can be fixed straightaway.

Confusing P1 Incidents with Major Incidents would be a mistake. Priority is a calculation of Impact and Urgency, and the Major Incident plan needs to be reserved for the absolute maximum examples of both, and probably where the impact is over multiple users.
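
To illustrate the distinction, priority is usually derived from an impact/urgency matrix along these lines; the matrix values are a common convention rather than a fixed standard, and the extra Major Incident check on top of it is an assumption for illustration.

# Sketch: priority derived from Impact and Urgency (1 = highest, 3 = lowest).
# The matrix and the extra Major Incident test are illustrative assumptions.

PRIORITY_MATRIX = {
    # (impact, urgency): priority
    (1, 1): "P1", (1, 2): "P2", (1, 3): "P3",
    (2, 1): "P2", (2, 2): "P3", (2, 3): "P4",
    (3, 1): "P3", (3, 2): "P4", (3, 3): "P5",
}

def priority(impact: int, urgency: int) -> str:
    return PRIORITY_MATRIX[(impact, urgency)]

def needs_major_incident_plan(impact: int, urgency: int, affected_users: int) -> bool:
    # A P1 alone is not enough: reserve the plan for maximum impact and urgency
    # affecting more than one user.
    return impact == 1 and urgency == 1 and affected_users > 1

print(priority(1, 1))                                      # P1
print(needs_major_incident_plan(1, 1, affected_users=1))   # False: a P1, but not Major
print(needs_major_incident_plan(1, 1, affected_users=50))  # True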

Do I need a single Incident or multiple Incidents for logging a Major Incident?

This question might depend on your ITSM toolset, but my preference is to open a separate Incident for each affected user when they contact the Servicedesk.

The reason for this is that different users will be impacted in different ways. A user heading off to a sales pitch will have different concerns to a user just about to go on holiday for 2 weeks. We might want to apply different treatment to these users (get the sales pitch user some sort of service straight away) and this becomes confusing when you work in a single Incident record.

If you have a system of hierarchical escalation you might find that one customer escalates the Major Incident (to their sales rep, for example) whereas another customer isn’t too bothered because they use the affected service less frequently.

Having an Incident opened for each user/customer allows you to judge the severity of the Incident exactly. The challenge then becomes to manage those Incidents easily and to communicate consistently with your customers.
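
One way to keep those per-user Incidents manageable is to link each of them to a single parent Major Incident record and drive communications from the parent. The structure below is a sketch under that assumption, not a feature of any particular toolset.

# Sketch: one Incident per affected user, all linked to a parent Major Incident.

major_incident = {"id": "MI0001", "summary": "Email service outage", "child_incidents": []}

def log_incident(incident_id: str, user: str, impact_note: str, parent=major_incident):
    incident = {"id": incident_id, "user": user, "impact_note": impact_note, "updates": []}
    parent["child_incidents"].append(incident)
    return incident

def broadcast_update(message: str, parent=major_incident):
    """Send one consistent update to every affected user."""
    for incident in parent["child_incidents"]:
        incident["updates"].append(message)

log_incident("INC2001", "alice", "About to present a sales pitch - needs access now")
log_incident("INC2002", "bob", "On holiday for 2 weeks - low urgency")
broadcast_update("Provider engaged; next update at 14:00.")

print(len(major_incident["child_incidents"]), "affected users recorded")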

Is a Major Incident a Problem?

No, although if we didn’t already have a Problem record open for this Major Incident, we should probably open one.

Remember the intended outcome of the Incident and Problem Management processes:

  • Incident Management: The outcome is a restoration of service for the users
  • Problem Management: The outcome is the identification and possibly removal of the causes of Incidents

The procedure is started when an Incident matches our definition of a Major Incident. Its outcome is to restore service and to handle the communication with multiple affected users. That restoration of service could come from a number of different sources: the removal of the root cause, a documented Workaround, or possibly a Workaround we still have to find.

While the Major Incident plan and the Problem Management process will probably work closely together, it is not true to say that a Major Incident IS a Problem.

How can I measure my Major Incident Procedure?

Simon Morris

I have some metrics for measuring the Major Incident procedure and I’d love to know your thoughts in the comments for this article.

  • Number of Incidents linked to a Major Incident: Where we are creating Incidents for each customer affected by a Major Incident, we should be able to measure the relative impact of each occurrence.
  • The number of Major Incidents: We’d like to know how often we invoke the Major Incident plan.
  • Mean Time Between Major Incidents: How much time elapses between Major Incidents being logged. This would be interesting in an organisation with service delivery issues, which would hope to see Major Incidents happen less frequently over time (see the sketch after this list).
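
Here is a sketch of the last metric, Mean Time Between Major Incidents, computed from the dates on which the plan was invoked; the dates are invented for illustration.

from datetime import date

# Dates on which the Major Incident plan was invoked (illustrative data).
major_incident_dates = sorted([
    date(2012, 1, 14),
    date(2012, 2, 29),
    date(2012, 4, 2),
])

gaps = [
    (later - earlier).days
    for earlier, later in zip(major_incident_dates, major_incident_dates[1:])
]
mean_time_between = sum(gaps) / len(gaps)

print(f"Major Incidents logged: {len(major_incident_dates)}")
print(f"Mean time between Major Incidents: {mean_time_between:.1f} days")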

There you go. In summary, handling Major Incidents isn’t a huge leap from the method that you use to handle day-to-day Incidents; it requires enhanced communication and possibly some extra measurement.

I hope that you found this article helpful.

Photo Credit