Critical Incident Management: Case Study

Getting the Organization Behind You

Acme has 15,000 workstations and 130 servers in eight offices worldwide. The management structure is not overly complex and involves at most three layers beginning with the CEO. Manufacturing facilities are in China, with a few key parts-fabrication facilities in North America. The company has modest growth, with mostly manufacturing and sales personnel being added. Engineering is a closely held group headquartered in North America.

IT operations are highly distributed and managed locally by IT managers who were hired by local office managers. Although the local office managers report to a global operations director, the local IT managers have no direct connection to the global IT operations. The global IT budget is modest but sufficient, and usually excludes training because the staff is very capable of learning most technologies and processes themselves. Local IT budgets are set by local office managers with the recommendation of the local IT manager. Global IT has overall responsibility for network infrastructure, including WAN and LAN configurations. PCs and servers in each local office are managed by the local IT group but are required to adhere to global standards. This structure puts global IT in the position of providing network and security services to the local offices but leaves local IT more responsive to local business requirements, which consist mostly of sales and billing functions.

Last year, an IT employee who was angry that he received no bonus, although his peers did, decided to take revenge. He knew that most of the machines in the company were not patched. Because his computer was connected to the same network, he figured out that he could modify the Sasser worm payload to attack specific hosts. He targeted the workstations of employees whom he did not like and who had received bonuses.

Unfortunately for Acme, the worm kept going and left several systems disabled. In fact, the worm jumped to several critical servers. Later, forensic investigations revealed the source of the attack. The machines took weeks to patch and restore to service. The cost to the company in productivity and lost revenue was enormous. The employee was terminated.

Acme instituted a comprehensive network monitoring system. They also purchased and implemented a patch management system. An IT employee volunteered to manage the system in addition to his regular job as an e-mail administrator.

First, let’s look at the organization chart for Acme in Figure 1. It may seem like an unusual structure, but it has served the company well for years. With emerging technologies and greater connection to business partners through the Internet, roles such as technology strategy director and a separate IT operations director make sense. Security has recently become a concern, and a small organization has been built to address it. Ward, the security and risk manager, used to manage risk in business ventures and is a technology enthusiast. He sees this as a lateral career move. He has set up an intrusion protection system (IPS) at HQ and is knee-deep in the technology on a daily basis.

Figure 1: Acme organization chart.

Harold, a long-trusted employee, was asked to head up a vulnerability management (VM) program to prevent a repeat of the earlier incident. He reports to Ward. Harold has thoroughly researched all of the available VM tools on the market, talked to the desktop and server administrators, and finally selected a tool. Devices were deployed in all eight locations to scan for vulnerabilities. Scanning started on May 3. The following is a diary of the events that took place beginning that day.

1 Events

May 3: Harold conducted the initial vulnerability scans of the San Francisco office. There appear to be more hosts than he expected: the location has only 300 employees, but the scanner reports 4094 hosts. A tech support call is placed to the vendor, who reviews the configuration and runs some diagnostics and a test scan of a few hosts. The vendor finds nothing wrong and suggests that Harold check the network configuration. Perhaps the routing is sending the scan to another office.
May 4: Harold suspects that there is something wrong with the San Francisco scanner but is not in a position to argue with the support team. Perhaps the scanners have incorrect default routes. But the configuration matches the others. This problem did not show up during product evaluation.
Harold has been informed by the global messaging manager that he is not to scan any of the e-mail servers until the manager feels confident that scanning will not disrupt business. The messaging manager has a service level agreement (SLA) to maintain. Harold is not happy about this because several critical servers are messaging servers. Some of those servers face the Internet and are exposed to higher threat levels. He escalates the situation to his boss, who tells him that he cannot argue the point because he is new in the role and is still trying to build credibility among the rest of the technology managers.
May 7: Scans of all other offices seem normal. The IT managers in those locations have received the initial reports of vulnerabilities.
May 12: After the first week of scanning, the total number of hosts is a little high but can probably be accounted for. Harold is conducting follow-up calls with all the managers.
May 15: San Francisco continues to have scanning problems and still shows 4094 hosts. Chicago is now showing the same number of hosts. A lengthy review of the scan results shows that hosts are being found on every IP address scanned.
May 16: Overall, host average vulnerability scores have declined dramatically. This seems to be good news except that the top scores remain unchanged. In fact, some of the hosts have gotten worse. Harold e-mails the IT manager in New York to find out the status of remediation.
May 20: Further research shows that the scanners are picking up more hosts at each location, which is what has driven down the average score. Every location company-wide has 4094 hosts. The scanners have exceeded their licensed host capacity. Furthermore, the New York IT manager has not responded to his e-mail, so Harold gives him a call. The manager explains that he is in the middle of a major deployment, which involves minor design changes to some of the New York network. He says that once things settle from the deployment, he will have a look.
May 31: Working with technical support, Harold has discovered that something in the network is responding to every IP address and that no host really exists. The work-around for this is to manually enter all of the active host addresses. This is an impractical solution since many of the IP addresses are dynamically allocated. Harold will have to find out what is responding to the device discovery probes.
June 5: None of the vulnerabilities reported have been remediated. Harold consults his manager, Ward, who suggests setting up a conference call with the local IT managers. The earliest he can get a one-hour call is in a week.
June 12: The conference call has only five of eight required participants. The IT managers say that they have no resources to dedicate to remediation but they will try handling the highest priority hosts once per week, if workload permits. Addressing Ward on the call, one of the IT managers tells him that he should deploy only one new technology at a time instead of in parallel so they can assess the overall impact before the next deployment. The managers also complain that they were not informed the system would be deployed and are concerned that the scanning is affecting their network performance. Ward agrees to have scans conducted only at night. The Asia Pacific production manager is also on the call to complain that the scanning may have caused one of his critical servers to fail. Since Asia’s daytime is nighttime in the United States, he does not want it scanned until Harold can prove the scan doesn’t affect the system.
June 16: Some of the worst hosts in two locations have been remediated. The rest of IT spent the weekend cleaning up a new infection introduced by a user who inserted an infected USB key. Many of the networks are still showing 4094 hosts even when the scans take place when most of the desktop computers are turned off.
June 23: Harold is buried in tracking down scanning problems and following up on remediation activities. He discovers that the new IPS, which is built into the firewall software, is causing the scans to show a host on every IP address. In an experiment, he turns off the prevention functionality and performs a scan. It works perfectly. Ward, who hears that his recently deployed IPS was turned off in one location, verbally reprimands him for doing this without discussing it with him. He tells Harold to turn it back on and change his scans to focus only on servers for which he can get a static IP and the system owner approves. Harold successfully showed the Asia Pacific production manager that the scans were harmless to his server. The manager allows the host to be scanned and agrees to get critical vulnerabilities remediated in a week.
June 25: The reports coming out of the vulnerability system are not encouraging. For the hosts that are not phantoms created by the IPS, the scores have improved very little. The trend lines that show the change in vulnerability are upward, but no new vulnerabilities have been found. Harold is puzzled by this and contacts the product support line to report a possible bug. They explain that this is normal behavior.
June 30: The new vulnerability scanning system is scanning about 25 active hosts company-wide. The cost per host is about $400, far above what Harold expected once economies of scale took effect. About 17 of the hosts are getting remediated.
July 18: Frustrated, Harold resigns to find work in an organization that “takes security more seriously.”
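
The May 16 and May 20 entries describe average scores falling while the top scores stayed the same. The arithmetic behind that can be sketched as follows. This is a minimal illustration, not the scoring method of any particular scanner: the per-host scores are invented (the diary gives none), and it assumes that each phantom address is reported as a host with a score of zero.

```python
from statistics import mean


def summarize(scores):
    """Return (average, maximum) for a list of per-host vulnerability scores."""
    return mean(scores), max(scores)


# Hypothetical scores for the 300 real hosts in one office (illustrative values).
real_hosts = [80] * 30 + [40] * 120 + [10] * 150

# Assumption: every phantom address shows up as a host with no findings (score 0),
# bringing the total to the 4094 hosts the scanner reported.
phantom_hosts = [0] * (4094 - len(real_hosts))

avg_real, max_real = summarize(real_hosts)
avg_all, max_all = summarize(real_hosts + phantom_hosts)

print(f"real hosts only: average {avg_real:.1f}, top score {max_real}")
print(f"with phantoms:   average {avg_all:.1f}, top score {max_all}")
```

Under these assumptions the average drops sharply even though nothing was remediated, while the maximum is untouched. A falling average paired with unchanged top scores is therefore a cue to check the host count before celebrating.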

So, what happened to Harold and the Acme Company? Did Acme need more people? Not likely. Did Harold select the wrong product? Probably not. He started out an optimist, with intentions of doing a thorough job, but problems quickly arose that created more work, and little got remediated. The effectiveness of the system and Harold came into question amid waning internal support.

September 5: The coup de grâce. In a remote part of the company, an IT employee who is to be terminated decides to go out with a bang. Knowing the state of key server systems based on a flawed standard, he writes a Perl script that employs commonly used administrator passwords to damage dozens of systems. IT managers worldwide are embarrassed and frustrated. They remember that someone was performing the VM function. Too late.

2 Analysis

So, what is required to be successful in a VM program? The Acme example is filled with mistakes from the beginning. Let’s look at what went wrong:

Process failure: Two problems were related to process. First, there was no formal and complete project plan with the support of senior management. Harold’s first step was to research technology products. Second, Harold failed to focus on process. VM is a process, not a technology solution.
Preparation: The behavior of the technology was certainly a big problem. Several things didn’t go well. An IPS was deployed at the same time but Harold seemed not to know that. Also, the reports coming out of the system were not well-understood. Those that were sent to IT managers were apparently ignored. Harold probably thought either they just didn’t care or they didn’t understand the reports.
Inclusion: The IT managers were clearly not included in many of the planning activities and had little input into the project. If you look at the organization chart (Figure 1), you can see that the IT managers report to the IT operations director and are not accountable to the same organization as Harold until you reach the CIO level. They were understandably defensive about their networks. When confronted with a new technology about which they knew nothing, their reaction was to protect their turf against an invasive tool over which they had no control or input.
Ownership: And what about the conference calls that Harold and Ward set up with the IT managers? From Harold’s point of view, the low attendance may have seemed to reflect the managers’ level of interest. But the natural reaction of someone who is busy deploying new technology (in this case, the IPS) and making a major change to the design of a network is to focus on the items for which they are accountable and shun those that generate more work. The IPS was part of the firewall software, and therefore considered to be a configuration change to an established technology. Also, the vulnerability scanner spat out reports that demanded administrators do more work. Ignoring the reports would delay possible accountability and impact on the computing environment.

Here is how Harold might have done things differently and achieved a better result:

Ownership: Get direct senior management support for the entire initiative. Make sure that management understands the commitment required to remediate and monitor. If you don’t lead with this important step, expectations are not set and the required impact on operations is unclear. Harold did not have strong backing to get the attention of local IT managers. Since the IT managers were never included in the planning, design, and selection of the technology, they did not have the opportunity to test and validate the technology in the environment. Also, when remediation wasn’t getting done, Harold had nowhere to escalate. Furthermore, senior managers can facilitate communication with key participants in the program. Ideally, the senior manager communicating the importance of the VM initiative should be a global CIO, CSO, or CISO.
Related to senior management commitment, highlight the requirement that successful remediation is a key job performance indicator. In the previous example, Harold did not have the interest of the local IT managers. They seldom fixed anything because it seemed to them that Harold was whining about how broken their systems were, which created more work for which they would get no recognition. There was simply no incentive. Harold was never perceived as a key supplier of solutions to crucial IT challenges. On the other hand, if the IT managers were told that they were being measured on their remediation performance, then they would be more likely to comply. In fact, they would quite possibly be proactive. To add a motivational carrot to the stick, a competition for the lowest average score might have driven excellence into the process. Additional bonus budget would go to the manager with the lowest average score, or perhaps the greatest improvement.
Preparation: Test! Test! Test! Any system connected to a network with broad-reaching and possibly invasive functionality should be completely tested with all components touched in the environment. Ultimately, a list of criteria for systems that can be scanned should be developed. A list of adverse effects on systems should be well-documented, and the systems to be excluded from scanning should be identified.
Process: Change management must be carefully considered prior to any implementation. As it happens, in Harold’s company, a new firewall system with a built-in IPS was in the implementation phase. When scans began, San Francisco had already deployed a new firewall. This firewall answered discovery scans on every IP address in the range allocated to that location. (More about this phenomenon later in this book.) As Harold continued his scans, the firewalls continued to be rolled out. Eventually, every network was giving him bogus results. Had he participated in the change management process, he would have known that firewalls were being deployed, and he could have tested the effects in advance. Harold could have also correlated the change in behavior of the scans with the sequence of firewall installation.
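
The symptom in the diary, a host found on every IP address scanned, can be checked for mechanically. The sketch below is one way to do it, under stated assumptions: the address range is hypothetical, and the case study never gives the subnet size. The figure 4094 does happen to equal the number of usable addresses in an IPv4 /20 (4096 minus the network and broadcast addresses), which would be consistent with each location being allocated a range of that size. The function names and the 95 percent threshold are illustrative choices, not part of any scanner product.

```python
import ipaddress


def usable_hosts(cidr):
    """Count assignable IPv4 addresses, excluding network and broadcast.

    Simplification: /31 and /32 networks are treated as having no usable hosts.
    """
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - 2, 0)


def looks_saturated(discovered, cidr, threshold=0.95):
    """Flag a scan whose discovered-host count is implausibly close to capacity."""
    capacity = usable_hosts(cidr)
    return capacity > 0 and discovered / capacity >= threshold


# Hypothetical office range; the real allocation is not stated in the case study.
office_range = "10.20.0.0/20"

print(usable_hosts(office_range))           # 4094
print(looks_saturated(4094, office_range))  # True: something answers for every address
print(looks_saturated(300, office_range))   # False: plausible for a 300-person office
```

A check like this, run after each scan, would have flagged San Francisco on May 3 and each later office as its firewall was installed, giving Harold the correlation with the rollout sequence much earlier.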
