This document defines the Incident Management Process. Incident management is one of the most crucial processes in any ITSM implementation. The process is based on ITSM best practices and can be modified to reflect requirements specific to your organization. This document may also interest IT staff members who execute a specific role within this process, and business organizations that want to better understand how the process is defined within the IT organization.
Common examples of incidents are:
- Network server slow or network not accessible
- File server not accessible
- Email cannot be sent or received
Incident Management Process Steps
Incident management needs a structured approach to managing an incident. Every step in the process should have a clearly defined purpose. This section presents the visual representation and explanation of incident management activities and their respective roles: how an incident is triggered, how it is prioritized and categorized, how investigation and diagnosis are carried out, how tickets are handled with third-party vendors, and how incidents are resolved and closed.
- Registration: Once an incident is detected, its details are logged in the ITSM tool to raise an incident ticket. The Service Desk refers to the KEDB (Known Error Database) to check whether it is a known error or issue.
- Categorisation: Assign a category, type, and item (CTI) to the ticket so that it can be routed to the correct group.
- Assignment: Assign the Incident to the appropriate resolution group. The assignment is based on the categorisation of the Incident.
- Diagnosis: The process by which the incident resolution team investigates the issue at hand. Diagnosis could involve reviewing system logs, looking at user errors, checking network configuration, etc.
- Resolution: When the root cause of the incident is found and fixed, the incident can be marked as resolved. Sending comms to affected users to inform them the incident is resolved is part of the resolution step.
- Closure: Incident closure generally involves conducting a PIR (Post Implementation Review) with the stakeholders and identifying the root cause. The critical takeaways from the closure process are lessons learned, action items, and residual risks.
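The six steps above can be read as a strictly ordered lifecycle. The following is a minimal sketch of that idea, not a definitive implementation. The `CTI_ROUTING` table, the `KEDB` entries, the group names, and the fallback to "Service Desk" for an unmatched CTI are all hypothetical values chosen for illustration; a real ITSM tool defines its own.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class Step(Enum):
    REGISTRATION = 1
    CATEGORISATION = 2
    ASSIGNMENT = 3
    DIAGNOSIS = 4
    RESOLUTION = 5
    CLOSURE = 6


# Hypothetical routing table: (category, type, item) -> resolution group.
CTI_ROUTING = {
    ("Network", "Connectivity", "File server"): "Network Support",
    ("Messaging", "Email", "Mailbox"): "Messaging Support",
}

# Hypothetical known-error database: summary keyword -> documented workaround.
KEDB = {"mailbox full": "Archive old mail, then retry sending."}


@dataclass
class Incident:
    summary: str
    step: Step = Step.REGISTRATION
    cti: Optional[Tuple[str, str, str]] = None
    group: Optional[str] = None
    known_workaround: Optional[str] = None
    history: List[Step] = field(default_factory=list)

    def advance(self) -> Step:
        """Move to the next step; steps cannot be skipped or repeated."""
        if self.step is Step.CLOSURE:
            raise ValueError("incident is already closed")
        self.history.append(self.step)
        self.step = Step(self.step.value + 1)
        return self.step


def register(summary: str) -> Incident:
    """Log the incident and check the KEDB for a matching known error."""
    incident = Incident(summary)
    for keyword, workaround in KEDB.items():
        if keyword in summary.lower():
            incident.known_workaround = workaround
    return incident


def categorise_and_assign(incident: Incident, cti: Tuple[str, str, str]) -> None:
    """Record the CTI, then choose the resolution group based on it."""
    if incident.step is not Step.REGISTRATION:
        raise ValueError("only a newly registered incident can be categorised")
    incident.advance()  # REGISTRATION -> CATEGORISATION
    incident.cti = cti
    incident.advance()  # CATEGORISATION -> ASSIGNMENT
    # Assumption: tickets with an unknown CTI stay with the Service Desk.
    incident.group = CTI_ROUTING.get(cti, "Service Desk")
```

For example, `register("User reports mailbox full error")` followed by `categorise_and_assign(incident, ("Messaging", "Email", "Mailbox"))` leaves the incident at the assignment step, routed to the hypothetical "Messaging Support" group, with the known workaround attached.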
Incident Management Process Flow
The incident management process flow is a clear set of steps for each action to be taken. The process flow considers three significant groups of people involved in the whole process: Service Desk, L2 support, and L3 support.
The service desk is the customer-facing group that does the primary work: logging an incident, categorizing it, triaging it to see whether it is a major incident, checking whether a vendor is involved, etc. If the incident needs further detailed analysis, the service desk assigns the ticket to L2.
The L2 group is generally a set of people who have the required skills to analyze the issue further. The L2 group could be checking the system logs, checking code, investigating recent changes, etc. The L2 group may contain systems analysts, business analysts, or programmers. If the L2 group cannot find the root cause, they assign the ticket to the L3 group.
The L3 group generally consists of people who have deep knowledge of the system or have worked on building it. They could be developers, senior analysts, architects, or DBAs who can investigate and debug at the code level.
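The hand-off between the three groups can be sketched as a simple escalation rule. This is an illustrative simplification: the `Tier` enum and the `next_tier` function are hypothetical names, and the single `needs_escalation` flag stands in for both "needs further detailed analysis" at the service desk and "cannot find the root cause" at L2.

```python
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    SERVICE_DESK = 1
    L2 = 2
    L3 = 3


def next_tier(current: Tier, needs_escalation: bool) -> Optional[Tier]:
    """Return the tier that should receive the ticket next, or None to keep it.

    A ticket stays with the current group unless that group needs to hand it
    on. L3 is the last group in this flow, so it has nowhere to escalate to.
    """
    if not needs_escalation or current is Tier.L3:
        return None
    return Tier(current + 1)
```

Under these assumptions, `next_tier(Tier.SERVICE_DESK, True)` returns `Tier.L2`, and `next_tier(Tier.L3, True)` returns `None` because L3 is the final group in the flow described above.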
Major Incident Management or P1 / P2 (Critical Incident Management)
A major incident is also known as a critical incident or Sev 1 incident. It is common in the real world to have major incidents now and then. A major incident means some or all of the critical business systems are down or not working. Major incidents can cause serious damage to an organization's reputation and can affect the overall business, which is why many organizations invest considerable amounts in maintaining incident management processes and practices.
Because of the impact and damage major incidents cause, most organizations have an incident manager who is generally on call. The role of the incident manager is essentially to take over the incident and manage it end to end. The incident manager also focuses on engaging stakeholders and getting the incident resolved as quickly as possible. It also makes sense to have a shortened workflow for major incident management. This workflow has two main groups: the service desk and the incident manager. As soon as the service desk knows that it is looking at a major incident, it gets the incident manager involved.
The incident manager then kicks off the Major Incident Management Process. Typically, the incident manager will set up a bridge call and have all the people working on the incident on the call. All the key stakeholders also join the bridge. The incident manager is also responsible for updating the stakeholders and keeping management informed of progress. The incident manager engages the L2 and L3 groups depending on the issue.
Interface with other processes / Key Process Relationships
The Incident management process interfaces with various other Service management processes as shown in the diagram above. This diagram depicts how Incident Management is operated and the interfaces associated with it.
- Problem Management - Related Problems and Known Errors.
- Configuration Management - Use of configuration records, configuration anomalies, and potential flagging of services, e.g. as ‘failed’ or equivalent.
- Change Management - Details of probable changes to resolve particular Incidents and Problems.
- Service Level Management - Incident management information regarding breaches of services.
The objectives section defines the term incident and states the objectives of incident management.
The scope section defines the scope of incident management, which includes any event that disrupts, or could disrupt, a service. It includes events that are communicated directly by users, either through the service desk or through an interface from event management to incident management tools.
Incident management activities and the lifecycle of an incident record can be summarized as:
- Detecting and recording incidents
- Classifying and prioritizing incidents and providing initial support to customers
- Investigating and diagnosing incidents, including possibly opening Requests for Change (RFCs)
- Escalating (functionally or hierarchically)
- Restoring service to normal operation after the incident is resolved
- Providing resolutions according to Service Level Agreements
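Resolving incidents according to Service Level Agreements implies checking each ticket against a resolution target. The sketch below shows one way this could look. The `RESOLUTION_TARGETS` values are hypothetical; real targets come from the agreed SLA, and this simplified version counts elapsed clock time rather than business hours.

```python
from datetime import datetime, timedelta

# Hypothetical resolution targets per priority; real values come from the SLA.
RESOLUTION_TARGETS = {
    "P1": timedelta(hours=4),
    "P2": timedelta(hours=8),
    "P3": timedelta(days=3),
}


def resolution_deadline(priority: str, opened_at: datetime) -> datetime:
    """Return the time by which the incident should be resolved."""
    return opened_at + RESOLUTION_TARGETS[priority]


def is_breached(priority: str, opened_at: datetime, now: datetime) -> bool:
    """Return True if the resolution target has already passed."""
    return now > resolution_deadline(priority, opened_at)
```

With these assumed targets, a P1 incident opened at 09:00 has a deadline of 13:00 the same day, and it counts as breached at any time after that.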
Key notes on critical incidents:
- Any Incident that results in significant Business disruption will be called a Major Incident
- Major incidents require shorter resolution timescales and greater urgency due to their impact on the Business.
- The definition of what is “Major” must be agreed upon and mapped onto the overall Incident prioritization process.
- A major incident may require a Major Incident (MI) team under the leadership of the Incident Manager.
- Should not divert the attention of the Service Desk Manager.
- At times an Emergency change might be triggered to resolve a Major Incident.
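Mapping the definition of "Major" onto the overall prioritization process is often done with an impact and urgency matrix. The following is a sketch of one possible 3x3 matrix, not an agreed standard for any particular organization. The `Level` enum, the `prioritise` and `is_major` functions, and the choice to treat P1 and P2 as major (following the "P1 / P2" heading above) are assumptions for illustration.

```python
from enum import IntEnum


class Level(IntEnum):
    HIGH = 1
    MEDIUM = 2
    LOW = 3


# Assumption: P1 and P2 count as major; each organization must agree its own rule.
MAJOR_PRIORITIES = {"P1", "P2"}


def prioritise(impact: Level, urgency: Level) -> str:
    """Derive a priority from impact and urgency.

    Adding the two levels reproduces a simple 3x3 matrix:
    high/high -> P1, high/medium -> P2, ..., low/low -> P5.
    """
    return f"P{impact + urgency - 1}"


def is_major(priority: str) -> bool:
    """Return True if the priority falls under the agreed 'major' definition."""
    return priority in MAJOR_PRIORITIES
```

For instance, an incident with high impact and high urgency maps to P1 and is flagged as major, while one with medium impact and medium urgency maps to P3 and follows the normal process flow.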
Key skills for incident management staff
- Good Communication & Analytical skills
- Ability to work under pressure
- Ability to be collaborative
- Quick decision-making capabilities
- Excellent customer handling skills
- Subject knowledge
- Detail-focused
- Patient and persistent
- ITIL Awareness