An Incident is defined as any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of service. A simplified Incident Management workflow is provided in the figure below.
When an Incident is reported, the Service Desk first attempts to resolve it by consulting the Known Error Database and the CMDB. If that fails, the Incident is classified and transferred to Incident Management. Incident Management typically consists of first line support specialists who can resolve most common Incidents. When they cannot, they escalate the Incident to the second line support team, and the process continues until the Incident is resolved. As per its charter, Incident Management aims for a quick resolution so that Service degradation or downtime is minimized.
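The escalation path described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Incident` fields, the `KNOWN_ERRORS` lookup, the keyword-based `classify` rule and the two support lines are all hypothetical stand-ins, not part of ITIL or of any particular tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    summary: str
    category: Optional[str] = None
    resolved_by: Optional[str] = None
    history: List[str] = field(default_factory=list)


# Hypothetical Known Error Database: symptom -> documented workaround.
KNOWN_ERRORS = {"disk full on app01": "purge old logs"}


def classify(incident):
    """Toy classification rule (an assumption made for this sketch)."""
    return "common" if "password" in incident.summary else "complex"


# Each support line is a (name, can_resolve) pair, tried in order.
SUPPORT_LINES = [
    ("first line", lambda inc: inc.category == "common"),
    ("second line", lambda inc: True),
]


def handle(incident, support_lines=SUPPORT_LINES):
    """Service Desk first, then escalate line by line until resolved."""
    incident.history.append("service desk")
    if incident.summary in KNOWN_ERRORS:
        incident.resolved_by = "service desk"
        return incident
    incident.category = classify(incident)
    for name, can_resolve in support_lines:
        incident.history.append(name)
        if can_resolve(incident):
            incident.resolved_by = name
            break
    return incident
```

Calling `handle(Incident("database replication lag"))` walks through the Service Desk and the first line before the second line resolves it. The `history` list records every hand-off, which is where the cost of each escalation becomes visible.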
So why is it hard?
There are several factors that make Incident Management one of the most difficult and expensive of all the ITIL processes. This is by no means an exhaustive list. Please feel free to add to it by commenting below.
Complex System Architecture
Over the last 60 years, the IT industry has seen breakneck growth. IT services have evolved to meet increasingly sophisticated and complex business demands. A typical IT service today includes the following:
- One or more servers or virtual machines
- SAN storage
- Network components
- Backup servers
- Hypervisor (if virtualized)
- Operating system
- One or more databases
- One or more web servers
- One or more application servers
- Load balancing servers
- Monitoring software
- Interfaces to internal and external services
The list above does not even include Business Continuity, which adds its own layers. The result is a complex architecture that is difficult to understand and manage. What’s more, the architecture is often inadequately documented and rarely up to date.
Poorly architected or missing processes
In addition to inadequate documentation, many IT departments lack processes to manage their IT services. This leads to ad hoc and sometimes unauthorized changes, which can have cascading effects.
Silo effect caused by super specialization among IT professionals
Complex architectures make super specialists necessary to manage them. This creates silos in which super specialists use jargon that is comprehensible only within their own silo. When serious Incidents are reported, it is not uncommon to find half a dozen domain experts spending valuable time on SWAT calls.
Incomplete monitoring of processes and systems
For a variety of reasons, not all of the processes and systems that belong to an IT Service are monitored. While cost and resource constraints seem to leave no alternative, the result is blind spots. An unmonitored Incident in one stack may trigger an unpredictable Incident in another, and the second Incident may take a long time to diagnose because no one is aware of the original, causal one.
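One way to make such blind spots visible is to compare a dependency map against the list of monitored components. The sketch below is a minimal illustration. The `DEPENDS_ON` map, the `MONITORED` set and the function names are hypothetical, and a real environment would draw this data from the CMDB and the monitoring platform.

```python
# Hypothetical dependency map: component -> components it depends on.
DEPENDS_ON = {
    "web": ["app"],
    "app": ["database", "san"],
    "database": ["san"],
    "san": [],
}

# Hypothetical set of components that currently have monitoring.
MONITORED = {"web", "app", "database"}


def transitive_deps(component, depends_on):
    """Return everything the component depends on, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(component, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, []))
    return seen


def blind_spots(depends_on, monitored):
    """Map each unmonitored component to the monitored ones that rely on it."""
    components = set(depends_on)
    for deps in depends_on.values():
        components.update(deps)
    return {
        unmonitored: sorted(
            c for c in monitored
            if unmonitored in transitive_deps(c, depends_on)
        )
        for unmonitored in sorted(components - monitored)
    }


print(blind_spots(DEPENDS_ON, MONITORED))
# prints {'san': ['app', 'database', 'web']}
```

In this toy example a fault in the unmonitored SAN would first surface as symptoms in the web, application or database tiers, which is exactly the situation in which diagnosis drags on.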
Lessons learned do not propagate
Even though domain experts may have excellent troubleshooting skills, once a difficult Incident has been resolved, they do not always have the tools to spread the knowledge. Search engines have reduced this problem somewhat by providing tag-based searches. However, complex Incidents that have multiple or cascading root causes cannot easily be captured in a community knowledge base. This results in frequent reinvention of the wheel.
Missing or unclear context in exception handling
IT hardware and software are often developed in an environment that is far removed from the ecosystems where they eventually end up. When exceptions do occur, the exception handlers usually do not understand the context and therefore do not provide a comprehensible explanation.
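As a small illustration of what carrying context can look like, the Python sketch below wraps a low-level error in one that names the affected service and component and suggests a next step. The `ServiceError` class, the `load_settings` function, the service names and the hint text are all hypothetical. Only the `raise ... from ...` chaining is standard Python.

```python
class ServiceError(Exception):
    """Hypothetical wrapper that adds operational context to a failure."""

    def __init__(self, message, *, service, component, hint):
        super().__init__(f"[{service}/{component}] {message} (hint: {hint})")
        self.service = service
        self.component = component
        self.hint = hint


def load_settings(path):
    """Read a settings file, translating low-level errors into context."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError as exc:
        # Keep the original exception as the cause so no detail is lost.
        raise ServiceError(
            f"cannot read settings file {path!r}",
            service="billing",
            component="app-server",
            hint="check that the config volume is mounted",
        ) from exc
```

A bare "No such file or directory" tells a first line specialist very little. The wrapped message says which service is affected and where to look first, while the original error stays attached as `__cause__` for the domain expert.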
There are many other reasons why Incident Management remains hard. There is a tendency to throw resources at Incidents when the underlying cause is poorly architected software, infrastructure or business process. Insufficient attention is paid to training IT professionals in troubleshooting, which remains an art form. Finally, it is getting more and more expensive to hire trained professionals at a time when IT budgets are shrinking.
Better automation and autonomics provide some relief to the Incident Management process. A cost-effective network management platform from Rustyice Solutions can help alleviate some of these all-too-common stresses and strains, which are found all across the industry. Call us today to arrange a chat with one of our Network Management Specialists.