Critical Incident Management: 9/25/11

Abacus Corporation is a global manufacturer and distributor of electronic abacuses that employ LCD technology and a special gesture interface that enables calculations to be performed much faster than with a conventional calculator. By all measures, Abacus resembles a rapidly growing but small electronics company holding a few key patents.

Abacus has manufacturing facilities in Nebraska and Alabama. Syllog, a business partner based in Germany, where most of the exotic devices are purchased, handles distribution. Syllog is a distributor of multiple unique electronic devices mostly to Asian customers and niche retailers in the United States and the United Kingdom. Most of the devices are provided by only three manufacturers with which Syllog maintains close relationships in business and infrastructure. About 40 percent of Syllog’s infrastructure was co-funded by Abacus and serves Abacus directly.

At Abacus, the IT infrastructure is very mature with solid ITIL-based change management and incident management tools. All business partners are required to meet or exceed Abacus’ policies and standards. They have implemented a “service desk” model with a few essential modules from the ITIL framework. This service desk is the central point for receiving and managing incidents and escalating changes. Since Abacus is a manufacturing operation with unpredictable order volume, all production is performed on a just-in-time basis. That is to say, they don’t design and manufacture their products until they have an order. Furthermore, they are committed to delivering the products within 10 calendar days of order receipt. So, there is little tolerance for downtime.

Sales-related IT operations at Abacus are managed locally with market-specific applications. The varying languages, cultures, and unique business relationships in each market require equally varying hardware, software, and applications. Common operating system and underlying utility software is provided and managed by the global operations. However, the applications and non-ubiquitous software are handled locally. To keep this arrangement, specific service levels have been established between global and local operations groups.

Naturally, management has decided that the next step is to implement a full VM initiative throughout the company. There is strong connectivity to all the sites, including business partners who have also agreed to participate. Abacus uses primarily Microsoft software on desktops and Linux^® on servers, with certain offices using a few implementations of Solaris™.

In addition to cleaning up vulnerabilities, Abacus wants to verify compliance and patch status in as many parts of the extended enterprise as possible. They have mature internal processes but few systems support resources. The ratio of support engineers and technicians is about 300 to 1. Automation is a key factor for Abacus. For example, there are about 20 standard systems maintenance and data collection shell scripts that run weekly on Linux systems, so scanning for vulnerabilities by an automated system rather than relying purely on internal processes is essential. The head of the systems engineering group, Carl, is given responsibility for implementation.

1 Events

November 2: With full backing of senior management, Carl has assembled a small team representing the systems support team, desktop engineering, and the director of networks to select a tool and modify existing processes. A budget of $350,000 is approved for the project.
November 23: On time and within budget, the team has selected a combination of two tools that appear to work well with all systems, including network devices for vulnerability scanning of desktops and agents on servers. The original idea was to have the agents run on all systems but the impact to local operations was too great to install and maintain yet another agent on an already-crowded desktop. Since the agents seemed to have a low impact on server operations and the reporting provided unique features such as integrated patching, the thinking is that automated patching will help with workload. The compliance reports are exactly what are needed to keep managers focused on achieving the desired results. Total cost: $330,000 installed.
November 30: Acquisition is completed, the software is installed on a server, and hardware devices are deployed throughout the offices as needed. With the consent of the management of the business partners, certain locations that could have experienced problems scanning over the WAN connections have also installed scanners, but have complained about the high cost for only a few hosts. To adapt, some of the server agents have been tasked with scanning local desktops without agents.
December 6: A freeze on all non-critical change control items has gone into effect due to the holiday period when many key employees will be on vacation. This lasts for 30 days. Not all of the network scanners have been installed.
January 15: Everyone is back from vacation and Carl has put together a communication plan to keep all key managers in the loop on vulnerability scanning. He has created the initial user names with passwords and privileges representing the six IT directors in key offices. They will be the primary users of the system. One of the agreed-upon process changes is that the IT directors will log in to the system to see the status of vulnerabilities in their area of the company. Access to the vulnerability information for each of these areas is defined by IP address range.
February 2: Initial scans have been completed and IT directors are automatically informed on the results. They are very satisfied with the quality, content, and accuracy of the reports. It appears to be another technology and process triumph at Abacus. Carl has turned over the system to the production systems support group, which also had members on the VM system development team. Some of the IT managers have complained that they have to go into two separate systems to see all of their hosts. The system that operates the agents is giving them server vulnerability information and the network scanning system provides desktop information. Carl begins a discussion with vendors on the best approach.
February 10: The IT director in Nebraska asks that all nine of his desktop support group be given access to the system so they can pull reports and perform the remediation. The system administrator provides a list of information items required to be completed in a form for setting up the users. The form completion, return, data entry, and password verification/reset takes about two days to complete for both VM systems.
February 12: The business partner, Syllog, complains about getting strange scan results. It appears that the names of the desktops on Syllog’s network are incorrect and instead are showing hosts on Abacus’ server farm. Checking with the IT manager at HQ, it appears that the Syllog manager is in fact getting scan results from both networks. The cause turns out to be due to the fact that they use the same IP addresses. The vulnerability scanner is reporting correctly to the server but there is overlap with the IPs.
February 14: The IT manager in another office requests similar access for all of this desktop support team, a total of six people. They must have access to perform their remediation as well. With more users, detailed procedures have to be developed for the two systems operating in combination and separately.
February 18: A technical solution is developed to rely on computer name, not IP address, to identify the host on a report. All of the network definitions and privileges have to be updated to reflect the change. Separately, an initiative is started to integrate the two vulnerability systems into a single reporting infrastructure. This is done by a single systems engineer with extensive experience in Perl programming.
February 23: The remaining IT directors in other locations, six locations in all, also request that their support people be permitted to have access to the VM system. Two of the directors want to divide the responsibility for their systems between two different support groups: servers and desktops. This is because of the specialized skill required in remediation. This makes for the addition of a total of 32 users to the systems.
February 25: After three days of work for one person, the changes to the network definitions and permissions are complete. There are now 47 users of the systems. Some server managers complain that the patches that are installed are not thoroughly tested and need to be scheduled more carefully to not affect production operations.
March 1: The VM team is informed that a user in Nebraska resigned two weeks earlier and his access should be revoked. Following the user-termination process, the user’s access is revoked. A replacement user has not been found. Therefore, only the IT director in Nebraska can run those reports.
March 4: Initial reports show a significant lag between the time when vulnerabilities are reported and the time they are finally fixed. After about a day of calls, the VM administrators determine that the delay is caused by the time between finding the vulnerabilities on the report and entering them into the incident and change management systems. The process that was defined in the beginning works fine but requires up to three weeks to enact remediation. Carl has a meeting with the compliance director and the information security director to discuss the current performance of the new system and the slow remediation issue. The compliance director points out that the maximum allowed time for remediating a critical vulnerability is seven calendar days. According to the reports in the system, this is taking two to three weeks.
March 5: The Microsoft-based systems are configured with credentials using Active Directory^® to perform in-depth scans. In effect, these credentials allow the scanner to log in to the system to check critical patches and configuration items. Management reports have been consolidated from the two systems into a single report that is run on demand through a Web interface. The system engineer begins work on detailed remediation reports. However, many of the data elements from the two systems do not directly match. One uses CVE (Common Vulnerabilities and Errors) codes to identify vulnerabilities and the other uses Bugtraq identifiers.
March 10: Carl has spent the last week trying to perfect the current process for reporting and remediation and has only gained about one day of time on average. The central problem seems to be that those entering the information into the incident and change management systems have other responsibilities, including performing the actual fixes. Issuing patches through the patch management system can fix some items. These changes happen quickly with little problem that can be addressed within a day or two. However, about 30 percent of the vulnerabilities require manual intervention. Furthermore, the agent-based VM system seems to be designed to use its own internal change management system with no ability to automatically export changes and import change release status.
March 11: The UNIX^®-based systems are not yet being scanned. Each one of then has to be set up manually with the appropriate credentials for in-depth inspection by the new system. This is a time-consuming process that requires numerous change management events to be created. To save time and work, some changes are bunched together in the system. But they can only combine those events for systems that have the same configuration, that is, virtually identical systems such as in a load-balanced configuration.
March 12: After careful consideration and discussion with his boss, Carl decides that the best course of action is to bring in a special developer to interface the patch, change, and incident management systems with the new agent-based VM system. However, the system engineer who has been coding Perl reports has no time to work on the system further, given a hectic change schedule. The engineer turns over all the code to Carl.
March 20: An initial assessment with the developer and the vendor’s support team results in an estimate totaling $152,000. There are three primary reasons for this: First, the system is designed to generate only SNMP traps when a vulnerability is found. None of the change or incident management systems are compatible with this protocol. A custom interface will either have to be written into the VM system or into the internal systems. Second, the workflow from incident management to change management allows for automatic generation of change events only when an incident is entered manually. An automated interface will have to be developed to put entries into both systems and link them together using information from the incident management database. Furthermore, the process is different for vulnerabilities than for other incidents because there must be a verification scan before the change can be closed out. Finally, the VM system uses CVE and BugTraq numbers to identify what patch needs to be applied to a system. The patch management system uses a proprietary set of codes that is more specific about which patch is required. Additional data would have to be developed to properly match these codes and identify where they do not properly align.
March 23: Total estimated costs of the system in a fully functional state are now $482,000 (initial $330K + $152K). Senior management senses that this might turn out to be a money pit and limits total expenditure to what was initially requested ($350,000). For the money that is left ($20,000), the developer can send the required XML-formatted messages to the incident management system to save some data entry time. After that, Carl’s budget is depleted.
Epilog: For the remaining year, VM system users are pleased and yet confused about the incomplete process involved in remediation. Considerable ongoing effort is put into remediating and closing out incidents. Users still have to create change requests, run verification scans manually, and then close out changes. Carl continues to believe that senior management is pennywise but pound-foolish. Some wonder why the company didn’t just go with a single system and avoid all the confusion.

2 Analysis

Events seemed to unfold rather smoothly at the start. This was clearly a very mature IT operation with strong process management and supporting systems. Project management seemed very structured. There was good executive support for the initiative and a budget was set and agreed upon. Carl, the project manager, involved all the right parties to get mutual ownership. This last factor is what allowed the systems to stay in operation even after shortcomings became apparent.

First, let’s address the issue of process, so that it is clearly understood how it works at the Abacus Corporation. Although they are very disciplined and efficient, IT seems to suffer from this process.

An incident, in the ITIL framework definition, is some event that disrupts the agreed-upon service level and underlying business activity. Service levels are set for vulnerabilities and remediation time frames, and a newly discovered vulnerability will result in a potential failure to meet SLAs if not corrected in a timely fashion. When a vulnerability is discovered by Abacus’ new VM system, an incident must be generated in the incident management system to track the resolution. Since Abacus runs a lean IT shop, the service desk functions are performed by a variety of IT staff, depending on the type of event to be handled. So, the service desk is a very dynamic entity with the support of software-based routing to the appropriate parties. See Figure 2.2 to follow this process.

Vulnerability information is assessed by the appropriate individuals depending on the network and host type affected. If a vulnerability is critical, an incident is created to track the vulnerability to resolution. Non-critical vulnerabilities are put into a scheduled change process for non-urgent changes.
When a resolution involves a change that is complex, and it is determined that to implement the change may impact other functions, a change is initiated in the change management system and a change ticket is generated. This allows affected parties to be included in the change process with appropriate notification of potential impacts. It also informs the manager of the change of the technical configuration elements of the target.
Once a patch is applied or a configuration updated to remediate the vulnerability, a change ticket is conditionally closed. The change ticket is closed with the caveat that the same vulnerability must not appear in the next vulnerability scan.
A vulnerability scan is run again, either as a part of the regular schedule or manually upon request. The scan performed on request is typically faster and only targets the systems affected by the change.
The incident ticket will be closed after the change ticket closes. This takes place automatically between the change and incident management systems, but only if the incident management system automatically produced the change ticket.

Figure 1: Abacus Corporation vulnerability service desk process.

To the uninitiated, this may seem like a cumbersome process, but the staff at Abacus is very disciplined and well-trained. The process works exceptionally well at providing the critical production systems, typically uninterrupted and predictable service. So, in this case the process seems to be well-managed. However, there are several deficiencies in the systems integration process.

One obvious problem is the continuous interplay between the incident management system and the change management system. For every critical vulnerability, IT personnel in a service desk role have to create an incident for tracking purposes. Then, if it is determined that a significant change is required, a change ticket has to be created to notify others who might be affected by the change.

An example of a vulnerability that would be well-managed by this process would be if a Microsoft SQL server system were found to have a weak password on a commonly used user ID. That password weakness would be a vulnerability. If that database server were accessible by a large number of people, possibly even from outside the company, this could be a very serious attack vector. To fix this, the password would have to be changed, but doing so might break several applications that rely on that password. So, a change ticket is created and the application owners are notified. Once the application owners coordinate the change to affected systems, the change can be completed.

Now, once a change is completed, the change ticket is closed. At Abacus, the ticket is closed but not the incident. First, a scan must be performed on the system to verify the remediation success. In this example, the strength of the password would be tested. If the vulnerability is no longer present, then the incident is manually closed.

So, we have now performed four manual tasks that could have been automated. Follow this procedure in Figure 1. The boxes are shaded to indicate which steps are manual and which are automated. There is little to no automation in this diagram. The diagram has numbers, which correspond to the following:

1. A vulnerability manager reviews a report to identify critical vulnerabilities in hosts. This activity is ideally suited to an automated system. Vulnerabilities are typically well-known and evaluated by experts around the world. An automated system has this assessment built in and is able to take action based on that information.
2, 3. If a critical vulnerability is found, the manager opens an incident ticket and assigns it to the owner of the system.
4, 5. After remediating, the change ticket is closed and the vulnerability manager rescans the host in question.
6, 7. The report is reviewed to determine whether the vulnerability still exists. This is essentially a repeat of step 1, which again can be easily automated.

The creation of an incident ticket is a simple tracking and interface activity that can be performed by a machine. In step 3, the rescan activity can be automated by interfacing the change management system with the VM system to allow for notification that a change was complete, thereby initiating a follow-up scan. Alternatively, the action may take place manually, depending on the process used. If the process called for waiting until the next scheduled scan, then step 4 is not required. If a manual scan is called for in the process, then a rescan may be necessary. One refinement of the rescan process is a limitation or parameter applied to the target. For example, a particular host or network of hosts may have a constraint that only allows them to be scanned at night in case service is affected. This would prevent any potential outage during business hours. Therefore, once the change is complete, the rescan will only take place later that night and not immediately.

Now, let’s look at the process from an automated perspective. Refer to Figure 1 again. The following process closely resembles the earlier process. However, in the diagram I have indicated some steps with double boxes to indicate where automation can be performed (Version 2 Automatic). In each automated step, the process is greatly accelerated to require only seconds to complete. The following numbers are also indicated in the diagram:

Critical vulnerabilities cause the VM system to automatically create an incident ticket and assign it to the individual specified for the network where the host resides. In this case, this individual is known as the incident manager. The incident ID is captured by the VM system.
The incident manager reviews the required change and determines whether it requires significant work or can affect other systems.
The incident manager goes into the incident management system to flag the incident as a required change, which is instantiated in the change management system using the existing interface between the two systems.
The change is performed as planned by an engineer, possibly the incident manager.
The change activity is closed out via the incident management system user interface by the engineer who performed the work. There is no way to automate this action but it requires little effort. In fact, some change management systems have the ability to listen for e-mail message replies to update status. This update automatically triggers notification to the incident management system to tentatively close the incident.
The incident management system then sends a confirmation message to the VM system using the incident ID. Indexed using the incident ID, another scan on the single host is initiated, checking for the specific vulnerability and any others.
If the vulnerability is not present on the follow-up scan, a confirmation is sent to the incident management system to close the incident. If the vulnerability is still present, the incident confirmation is rejected, causing another notification to the incident manager. Steps 3 through 6 are repeated.

Another problem with the implementation is the selection of two separate products that were never designed to work together. Although some products on the market can perform both agent-based and network-based vulnerability assessments, the previous scenario is a common one. This is often the case when two seemingly ideal products lack the key compatibility to “hit a home run” in the VM game.

As a result, users were initially forced to work with two different systems. At first, the agent-based system scanned only servers and the host-based system scanned only desktops. This logical separation fit the company management model of local versus global separation of responsibilities. But, when the agent-based system was used to scan desktops where a network-based scanner was not financially feasible, the division of responsibilities and system usability broke down.

An enthusiastic attempt to rescue the effort was made by a knowledgeable system engineer. However, the low hanging fruits were the management reports. Since no system interaction was necessary, data gathering and normalization were simpler. There were still issues with data structures and standard values across systems. These are all the same problems that would be found in any other application. Later, we will see where the industry is making strides towards avoiding these problems.

In addition to the automation surrounding identification, remediation, and unified reporting, other systems can be integrated to great advantage.

On February 10, nine user IDs are requested for addition to the system. At this point, the system administrator should perhaps be wishing that this was an easier task. It has been time-consuming to collect the data and enter it into the system. Since Abacus is a Microsoft shop, why not integrate the user identification and authentication with Active Directory? Or even use the almost-ubiquitous RADIUS protocol. This piece of system integration is essential for most enterprises today that employ some standard type of authentication mechanism. We can also see that several other users have to be added to the system as well on February 13, 14, and 25. There were eventually almost 50 users to set up. When one user left the company, Carl’s team only found out two weeks later that they could remove the user’s access. It is more likely that the user assigned to an Active Directory group had already been removed or disabled.

An example of how a directory structure might align with a VM system is shown in Figure 2. There are objects and actions in the VM system. A group or user can be a member or a role, which includes various combinations of object and actions. For example, an individual with the administrator role would be responsible for maintaining all areas of the system. Therefore, he would be given access to all objects and all actions. However, the previously mentioned engineer would only have access to reporting capabilities in a particular office. So, that group’s permitted action would be “Report” and the permitted objects would be those “Networks” that are in the local office.

Figure 2: Vulnerability system roles aligned with directory structures.

Each of the groups or users could be assigned to one or more roles and networks to give them the capabilities required as shown in Figure 3. The power of this arrangement comes from the fact that when a user changes to another group, the user will assume the roles of that group and not the one from which he came. For example, if an engineer moved from the IT department in California to the Desktops group in Nebraska, that engineer would then be able to have access to vulnerability information only for his new position. Vulnerability managers would similarly only be able to perform scans on the assigned networks associated with their group.

Figure 3: Directory structure and VM system roles.

This kind of integration seems obvious now, but at the beginning, this many users might have been inconceivable to Carl or the rest of the team. This also sends us back to our earlier discussion about how vulnerabilities get created. This scenario is similar to one in which the original system designers fail to take into account all the ways a system may be used. It is a requirements gathering activity with too many assumptions. Carl’s design and selection team should have worked out precisely who would use the system, how many users there would be, and what types of activities were likely to take place with those users. There would be terminations, new hires, analysts reviewing reports, and perhaps others who change the scanning parameters or schedules.

Critical Incident Management

The VM Program and Technology Development

Case Study: Technology Integration Challenge

1 Events

2 Analysis

Popular Posts

Search This Blog

Blog Archive

Total Pageviews