Critical Incident Management: Incident and Problem Management

Incident and problem management processes are intended to handle problems that are raised through the service desk as well as responses to major incidents and problems, restoration of IT services, and resolution of the root cause of any issue. Other subprocesses involved include incident and problem escalation as well as root cause analysis.

Incident Management

The objective of incident management is to restore operations as soon as possible, whereas the objective of problem management is to minimize the adverse impact of incidents and problems caused by errors within the IT infrastructure and to prevent recurrence of incidents related to these errors. The Information Technology Infrastructure Library (ITIL) defines an incident as “any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of that service.” In practice, service is often restored to the customer through a workaround rather than through a permanent resolution.

Problem Management

Problem management is a process that is used to report, log, correct, track, and resolve problems within the hardware, software, network, telecommunications, and computing environment of an organization. Problems can be anything from a customer being unable to print a report to a line connecting the computer to the controller going down (dropping). Problem management provides the framework to open, transfer, escalate, close, and report problems. It establishes procedures and standards for handling customer problems. The ITIL definition of a problem is “a problem is a condition often identified as a result of multiple incidents that exhibit common symptoms.” Problems can also be identified from a single significant incident, indicative of a single error, for which the cause is unknown, but for which the impact is significant. Problem management differs from incident management in that its main goal is the detection of the underlying causes of an incident and their subsequent resolution and prevention.
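The idea that a problem is often identified from multiple incidents with common symptoms can be sketched in a few lines of Python. This is a minimal illustration only: the `problem_candidates` function, the `(incident_id, symptom)` record shape, and the threshold of three incidents are assumptions made for the example, not part of ITIL or of any particular service desk tool.

```python
from collections import defaultdict


def problem_candidates(incidents, min_incidents=3):
    """Group incidents by symptom and return symptoms seen often enough
    to be worth opening a problem record.

    incidents: iterable of (incident_id, symptom) pairs (hypothetical shape).
    min_incidents: illustrative threshold, an assumption of this sketch.
    """
    by_symptom = defaultdict(list)
    for incident_id, symptom in incidents:
        # Normalize the symptom text so trivial differences do not split groups.
        by_symptom[symptom.strip().lower()].append(incident_id)
    return {
        symptom: ids
        for symptom, ids in by_symptom.items()
        if len(ids) >= min_incidents
    }


sample = [
    ("INC-1", "Cannot print report"),
    ("INC-2", "cannot print report "),
    ("INC-3", "Controller line dropped"),
    ("INC-4", "Cannot print report"),
]
candidates = problem_candidates(sample)
```

A single significant incident can still justify a problem record on its own, so a count-based grouping like this would only supplement, not replace, a severity-based review.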

Roles and Responsibilities

Clearly defined roles and responsibilities are critical in problem management to make sure that problems are reported and routed to individuals with the ability to resolve them, and that the resolution or workaround is communicated to the end user.

Gatekeeper. This role coordinates the consolidated reporting, tracking of service interruptions, problem notification and escalations, and problem coordination and facilitation.
Reporter. This role is responsible for documenting service interruptions, trending analysis, problem resolution, root cause analysis, and proactive problem avoidance.
Change management. This role is responsible for documenting and facilitating a structured change management process, including tracking, scheduling change meetings, and reporting problems caused as a result of change.
Site coordinator. This role is responsible for problem coordination at key service locations, quickly reporting problems, analysis, escalating high-impact problems, resolution, and communicating with the site.
Crisis coordinator. This role is responsible for coordinating the identification and resolution of high-impact problems.

Procedures

Effective problem management procedures are vital to the long-term control over the performance of an IT organization. At most installations, these procedures have been developed piecemeal, as the need for recognizing and resolving specific problems in the organization has arisen. In the early stages of growth, this approach works well, but as the organization grows, this piecemeal approach limits its ability to identify and solve problems effectively.

Problem management procedures should include audit trails for problems and their solutions, timely resolution, prioritization, escalation procedures, incident reports, accessibility to configuration information, coordination with change management, and a definition of any dependencies on outside services.

The problem management procedures should ensure that all unexpected events (errors, problems, etc.) are recorded, analyzed, and resolved in a timely manner. Incident reports should be established in the case of significant problems.

Escalation procedures ensure that problems are resolved in the most timely and efficient way possible. Escalation procedures include prioritizing problems based on the impact severity as well as the activation of a business continuity plan when necessary.

Problems should be traceable from the incident to the source cause (e.g., a new software release or an emergency change). The problem management process should be closely associated with change management.

Problem Severity

In today’s complex environments with high transaction volumes, it is inevitable that problems will occur. The cost of resolving problems must be weighed against the benefit. Thus, a system is needed to identify the severity of problems to ensure that the problems with the greatest impact are resolved first. Impact definitions will depend on the organization, but, in general, the following areas are to be considered:

Number of users impacted (as a percentage of total users)
Critical nature of the application (e.g., online banking)
Regulatory/compliance issues
Length of outage
Dependency on system (no workaround)
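The impact areas listed above could be combined into a severity level along the following lines. This is a rough sketch under stated assumptions: the `ProblemImpact` fields, the point values, and the thresholds are all hypothetical, and each organization would substitute its own impact definitions.

```python
from dataclasses import dataclass


@dataclass
class ProblemImpact:
    users_affected_pct: float   # share of total users impacted, 0-100
    critical_application: bool  # e.g., online banking
    regulatory_issue: bool      # regulatory/compliance exposure
    outage_minutes: int         # length of the outage so far
    workaround_available: bool  # False means full dependency on the system


def severity(impact: ProblemImpact) -> int:
    """Return a severity from 1 (highest) to 4 (lowest).

    All point values and cutoffs below are illustrative assumptions.
    """
    score = 0
    if impact.users_affected_pct >= 50:
        score += 3
    elif impact.users_affected_pct >= 10:
        score += 2
    elif impact.users_affected_pct > 0:
        score += 1
    if impact.critical_application:
        score += 3
    if impact.regulatory_issue:
        score += 2
    if impact.outage_minutes >= 60:
        score += 2
    elif impact.outage_minutes >= 15:
        score += 1
    if not impact.workaround_available:
        score += 2

    # Map the accumulated score onto four severity levels.
    if score >= 9:
        return 1
    if score >= 6:
        return 2
    if score >= 3:
        return 3
    return 4


banking_outage = ProblemImpact(80, True, True, 90, False)
printer_issue = ProblemImpact(0.5, False, False, 20, True)
```

With these assumed weights, the banking outage lands at severity 1 and the isolated printer issue at severity 4, so the former would be worked first.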

Problem Escalation

The service desk will not be able to resolve all problems. Some problems will need to be escalated due to the severity or complexity of the issue. A problem escalation process is needed to ensure high-impact issues are routed to the appropriate groups for resolution and communicated to impacted groups.
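One way to make such an escalation process explicit is a small policy table keyed by severity. The sketch below is illustrative only: the time limits are invented, and the choice of escalation targets (the crisis coordinator, the site coordinator, and a technical support group) is an assumption about how the roles described earlier might be used.

```python
from datetime import timedelta

# Hypothetical policy: severity -> (time allowed at the service desk, escalation target).
ESCALATION_POLICY = {
    1: (timedelta(minutes=0), "crisis coordinator"),
    2: (timedelta(minutes=30), "site coordinator"),
    3: (timedelta(hours=4), "technical support group"),
    4: (timedelta(hours=24), "technical support group"),
}


def escalation_target(severity: int, open_for: timedelta) -> str:
    """Return who should own the problem, given its severity and how long it has been open."""
    limit, target = ESCALATION_POLICY[severity]
    # The service desk keeps the problem until the time limit for its severity is reached.
    return target if open_for >= limit else "service desk"
```

For example, `escalation_target(2, timedelta(minutes=45))` routes a severity 2 problem to the site coordinator, while a severity 3 problem that has been open for one hour stays with the service desk.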

Root Cause Analysis

For critical and major problems, a full problem review should be undertaken to ensure that the root cause has been understood and appropriate mitigating actions taken. The results of such a review should be communicated to key organization contacts.

Service Improvement Programs

Processes should be implemented to identify those areas of the organization that are most impacted by IT problems (beyond those affected by specific severity issues that are resolved in isolation). Specific service improvement programs should be investigated and jointly developed, with priorities agreed among IT, the relationship manager, and key organization contacts.

Tools

Problem management is a service delivery process that focuses on proactive outage prevention and standardized diagnostic and postrecovery processes. An efficient problem management process flow includes infrastructure and application reporting, communication, tracking, root cause analysis, and proactive trending analysis, with the ultimate goal of problem avoidance. Accomplishing this requires a common set of tools integrated with asset management, change management, and the service desk.

Problem Reporting

A problem reporting process identifies and collects problems for both the technical and application system environments. It monitors the resolution of these problems, in terms of initial and long-term response, and reports on the impact the problem has had on the user community. During the process:

A problem is identified and reported by a user
The report is recorded in a problem database
Technical personnel are consulted if a problem requires immediate attention, in which case an emergency resolution may be applied
The problem is assigned to the technical group that is responsible for its long-term resolution
The cause of the problem is determined, and its full impact is evaluated
The problem is resolved, and documentation is stored in the problem database
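The reporting steps above can be modeled as a simple status workflow for a problem record. This is a minimal sketch, not a description of any real problem database: the `Status` names, the `ALLOWED` transition table, and the `ProblemRecord` class are all hypothetical.

```python
from enum import Enum


class Status(Enum):
    REPORTED = "reported by a user"
    RECORDED = "recorded in the problem database"
    EMERGENCY_FIX = "emergency resolution applied"
    ASSIGNED = "assigned to the responsible technical group"
    ANALYZED = "cause determined and impact evaluated"
    RESOLVED = "resolved and documented"


# Allowed next steps; the emergency resolution is optional.
ALLOWED = {
    Status.REPORTED: {Status.RECORDED},
    Status.RECORDED: {Status.EMERGENCY_FIX, Status.ASSIGNED},
    Status.EMERGENCY_FIX: {Status.ASSIGNED},
    Status.ASSIGNED: {Status.ANALYZED},
    Status.ANALYZED: {Status.RESOLVED},
    Status.RESOLVED: set(),
}


class ProblemRecord:
    def __init__(self, description: str):
        self.description = description
        self.status = Status.REPORTED
        self.history = [Status.REPORTED]  # audit trail of status changes

    def advance(self, new_status: Status) -> None:
        """Move to the next status, rejecting steps that skip the workflow."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(
                f"cannot move from {self.status.name} to {new_status.name}"
            )
        self.status = new_status
        self.history.append(new_status)


record = ProblemRecord("Users cannot print reports")
for step in (Status.RECORDED, Status.ASSIGNED, Status.ANALYZED, Status.RESOLVED):
    record.advance(step)
```

Keeping the history alongside the current status gives the kind of audit trail that problem management procedures call for, and rejecting out-of-order transitions prevents a problem from being closed before its cause has been determined.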

Although individual problems are managed based on severity, daily summary reports are needed for IT organizations to identify issues that impact operational availability. Problems should be collected and combined into a single daily report and reviewed by a team of representatives from all areas of IT (e.g., operations, security). During a daily service meeting, problems are reviewed for root cause, permanent resolution, customer impact, and proactive outage prevention. Follow-up action responses are assigned to the manager who supports the platform or application.

Daily problem reporting can be aggregated and summarized on a weekly or monthly basis for management reporting. Reporting should include the number of incidents and problems, their resolution, and trends for key systems and applications.
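As a rough illustration of rolling daily entries up into a weekly summary, the sketch below groups entries by ISO week. The entry shape `(day, system, resolved)`, the sample data, and the `weekly_summary` function are assumptions made for the example.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical daily report entries: (day, system, resolved?)
daily_entries = [
    (date(2024, 1, 1), "online banking", True),
    (date(2024, 1, 2), "online banking", False),
    (date(2024, 1, 3), "print server", True),
    (date(2024, 1, 9), "online banking", True),
]


def weekly_summary(entries):
    """Group entries by ISO (year, week) and count totals, resolved items,
    and problems per system."""
    summary = defaultdict(
        lambda: {"total": 0, "resolved": 0, "by_system": Counter()}
    )
    for day, system, resolved in entries:
        iso = day.isocalendar()
        week = (iso[0], iso[1])  # (ISO year, ISO week number)
        bucket = summary[week]
        bucket["total"] += 1
        bucket["resolved"] += int(resolved)
        bucket["by_system"][system] += 1
    return dict(summary)


report = weekly_summary(daily_entries)
```

The per-system counts make it easy to spot which key systems or applications account for most problems in a given week, which is the trend information management reporting typically needs.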

Availability reporting is based on outages and incidents reported in the daily report. Such reports include information such as the duration of the outage or incident and its impact on users. The availability report is used for measuring performance against agreed SLAs. This information can be used to communicate service availability and incidents to both business and IT senior management.
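A common way to express availability for comparison against an SLA is the percentage of scheduled service time that was not lost to outages. The sketch below assumes that definition; the function names, the 30-day service window, and the 99.9% target are illustrative, and an actual SLA may define availability differently.

```python
from typing import Sequence


def availability_pct(scheduled_minutes: int, outage_minutes: Sequence[int]) -> float:
    """Availability as the percentage of scheduled time not lost to outages."""
    downtime = sum(outage_minutes)
    return 100.0 * (scheduled_minutes - downtime) / scheduled_minutes


def meets_sla(
    scheduled_minutes: int, outage_minutes: Sequence[int], sla_target_pct: float
) -> bool:
    """True when measured availability is at or above the agreed target."""
    return availability_pct(scheduled_minutes, outage_minutes) >= sla_target_pct


# Assumed example: a 30-day month of round-the-clock service.
MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes
within_target = meets_sla(MONTH_MINUTES, [30, 13], 99.9)
outside_target = meets_sla(MONTH_MINUTES, [30, 20], 99.9)
```

Under these assumptions a 99.9% target allows about 43 minutes of downtime per month, so two outages totaling 43 minutes stay within the target while outages totaling 50 minutes do not.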

When problem management is basically reactive (waiting for a problem to develop and then fixing it), the IT organization creates a perception of poor performance in its user community. At some point in its growth, the organization is best served by developing procedures that allow it to anticipate problems. Having a reliable problem management system will allow the organization to anticipate, report, track, and solve problems in a timely and effective manner.

