Case Study: VM Program Failure


Getting the Organization Behind You

Acme has 15,000 workstations and 130 servers in eight offices worldwide. The management structure is not overly complex, involving at most three layers beginning with the CEO. Manufacturing facilities are in China, with a few key parts-fabrication facilities in North America. The company has modest growth, adding mostly manufacturing and sales personnel. Engineering is a closely held group headquartered in North America.
IT operations are highly distributed and managed locally by IT managers who were hired by the local office managers. Although the local office managers report to a global operations director, the local IT managers have no direct connection to global IT operations. The global IT budget is modest but sufficient, and usually excludes training because the staff are quite capable of learning most technologies and processes on their own. Local IT budgets are set by the local office managers on the recommendation of the local IT manager. Global IT has overall responsibility for network infrastructure, including WAN and LAN configurations. PCs and servers in each local office are managed by the local IT group but are required to adhere to global standards. This structure puts global IT in the position of providing network and security services to the local offices but leaves local IT more responsive to local business requirements, which consist mostly of sales and billing functions.
Last year, an IT employee who was angry that he had received no bonus (although his peers did) decided to take revenge. He knew that most of the machines in the company were not patched. Since his computer was connected to the same network, he realized that he could modify the Sasser worm payload to attack specific hosts. He targeted the workstations of employees whom he disliked and who had received bonuses.
Unfortunately for Acme, the worm kept going and disabled several systems; in fact, it jumped to several critical servers. Forensic investigation later revealed the source of the attack. The machines took weeks to patch and restore to service, and the cost to the company in lost productivity and revenue was enormous. The employee was terminated.
Acme instituted a comprehensive network monitoring system. It also purchased and implemented a patch management system, which an IT employee volunteered to manage in addition to his regular job as an e-mail administrator.
First, let’s look at the organization chart for Acme in Figure 1. It may seem like an unusual structure, but it has served the company well for years. With emerging technologies and greater connection to business partners through the Internet, roles such as a technology strategy director and a separate IT operations director make sense. Security has recently become a concern, and a small organization has been built to address it. Ward, the security and risk manager, used to manage risk in business ventures and is a technology enthusiast. He sees this as a lateral career move; he has set up an intrusion prevention system (IPS) at headquarters and is knee-deep in the technology on a daily basis.

 
Figure 1: Acme organization chart.
Harold, a long-trusted employee, was asked to head up a VM program to prevent a recurrence of such incidents. He reports to Ward. Harold thoroughly researched the available VM tools on the market, talked to the desktop and server administrators, and finally selected a tool. Scanning devices were deployed in all eight locations to check for vulnerabilities. Scanning started on May 3. The following is a diary of the events that took place beginning that day.

Events

  • May 3: Harold conducts the initial vulnerability scans of the San Francisco office. There appear to be far more hosts than expected: the scanner reports 4094 hosts, yet there are only 300 employees in that location. A tech support call is placed to the vendor, who reviews the configuration, runs some diagnostics, and performs a test scan of a few hosts. The vendor finds nothing wrong and suggests that Harold check the network configuration; perhaps the routing is sending the scan to another office.
  • May 4: Harold suspects that something is wrong with the San Francisco scanner but is not in a position to argue with the support team. Perhaps the scanners have incorrect default routes. But the configuration matches that of the other scanners, and the problem never appeared during product evaluation.
  • The global messaging manager informs Harold that he is not to scan any of the e-mail servers until the manager is confident that scanning will not disrupt business; he has a service level agreement (SLA) to maintain. Harold is not happy about this because several critical servers are messaging servers, and some of them face the Internet and higher threat levels. He escalates the situation to his boss, who tells him that he cannot argue the point because he is new in the role and is still trying to build credibility among the rest of the technology managers.
  • May 7: Scans of all other offices seem normal. The IT managers in those locations have received the initial reports of vulnerabilities.
  • May 12: After the first week of scanning, the total number of hosts is a little high but can probably be accounted for. Harold is conducting follow-up calls with all the managers.
  • May 15: San Francisco continues to have scanning problems and still shows 4094 hosts. Chicago is now showing the same number of hosts. A lengthy review of the scan results shows that hosts are being found on every IP address scanned.
  • May 16: Overall, average vulnerability scores per host have declined dramatically. This seems like good news, except that the top scores remain unchanged; in fact, some hosts have gotten worse. Harold e-mails the IT manager in New York to find out the status of remediation.
  • May 20: Further research shows that the scanners are picking up more hosts at each location; the flood of extra hosts with few or no findings is what has driven down the average score. Every location company-wide now shows 4094 hosts, and the scanners have exceeded their licensed host capacity. Furthermore, the New York IT manager has not responded to his e-mail, so Harold gives him a call. The manager explains that he is in the middle of a major deployment, which involves minor design changes to some of the New York network. He says that once things settle from the deployment, he will have a look.
  • May 31: Working with technical support, Harold has discovered that something in the network is responding to every IP address and that no host really exists. The work-around for this is to manually enter all of the active host addresses. This is an impractical solution since many of the IP addresses are dynamically allocated. Harold will have to find out what is responding to the device discovery probes.
  • June 5: None of the vulnerabilities reported have been remediated. Harold consults his manager, Ward, who suggests setting up a conference call with the local IT managers. The earliest he can get a one-hour call is in a week.
  • June 12: The conference call has only five of eight required participants. The IT managers say that they have no resources to dedicate to remediation but they will try handling the highest priority hosts once per week, if workload permits. Addressing Ward on the call, one of the IT managers tells him that he should deploy only one new technology at a time instead of in parallel so they can assess the overall impact before the next deployment. The managers also complain that they were not informed the system would be deployed and are concerned that the scanning is affecting their network performance. Ward agrees to have scans conducted only at night. The Asia Pacific production manager is also on the call to complain that the scanning may have caused one of his critical servers to fail. Since Asia’s daytime is nighttime in the United States, he does not want it scanned until Harold can prove the scan doesn’t affect the system.
  • June 16: Some of the worst hosts in two locations have been remediated. The rest of IT spent the weekend cleaning up a new infection introduced by a user who inserted an infected USB key. Many of the networks are still showing 4094 hosts, even when the scans run at night while most of the desktop computers are turned off.
  • June 23: Harold is buried in tracking down scanning problems and following up on remediation activities. He discovers that the new IPS, which is built into the firewall software, is causing the scans to show a host on every IP address. In an experiment, he turns off the prevention functionality and performs a scan. It works perfectly. Ward, who hears that his recently deployed IPS was turned off in one location, verbally reprimands Harold for doing this without discussing it with him. He tells Harold to turn it back on and to scan only servers that have static IP addresses and whose owners approve. Harold successfully demonstrates to the Asia Pacific production manager that the scans are harmless to his server. The manager allows the host to be scanned and agrees to have critical vulnerabilities remediated within a week.
  • June 25: The reports coming out of the vulnerability system are not encouraging. For the hosts that are not phantoms created by the IPS, the scores have improved very little. The trend lines showing the change in vulnerability are upward even though no new vulnerabilities have been found. Harold is puzzled by this and contacts the product support line to report a possible bug. They explain that this is normal behavior: scores rise as known vulnerabilities age without being remediated.
  • June 30: The new vulnerability scanning system is scanning only about 25 active hosts company-wide. With the system’s cost spread over so few hosts, the cost per host is about $400, far from the economies of scale Harold expected. Only about 17 of those hosts are being remediated.
  • July 18: Frustrated, Harold resigns to find work in an organization that “takes security more seriously.”
  • September 5: The coup de grâce. In a remote part of the company, an IT employee who is about to be terminated decides to go out with a bang. Knowing the state of key server systems, which were built to a flawed standard, he writes a Perl script that uses commonly chosen administrator passwords to damage dozens of systems. IT managers worldwide are embarrassed and frustrated. They remember that someone was performing the VM function. Too late.
So, what happened to Harold and the Acme Company? Did Acme need more people? Not likely. Did Harold select the wrong product? Probably not. He started out an optimist, intending to do a thorough job, but problems quickly arose that created more work, and little got remediated. The effectiveness of both the system and Harold came into question amid waning internal support.

Analysis

So, what is required for a VM program to succeed? This example is filled with mistakes from the beginning. Let’s look at what went wrong:
  • Process failure: Two problems were related to process. First, there was no formal, complete project plan backed by senior management; Harold’s first step was to research technology products. Second, Harold failed to focus on process. VM is a process, not a technology solution.
  • Preparation: The behavior of the technology was certainly a big problem, and several things did not go well. An IPS was deployed at the same time, but Harold seemed not to know that. Also, the reports coming out of the system were not well understood; those sent to IT managers were apparently ignored. Harold probably thought either that they just didn’t care or that they didn’t understand the reports.
  • Inclusion: The IT managers were clearly not included in many of the planning activities and had little input into the project. If you look at the organization chart (Figure 1), you can see that the IT managers report to the IT operations director and are not accountable to the same organization as Harold until you reach the CIO level. They were understandably defensive about their networks. When confronted with a new technology about which they knew nothing, they reacted by protecting their turf against an invasive tool over which they had no control or input.
  • Ownership: And what about the conference calls that Harold and Ward set up with the IT managers? Did their low attendance rate reflect their level of interest? From Harold’s point of view, it may have seemed that way. But the natural reaction of someone who is busy deploying new technology (in this case, the IPS) and conducting a major change to a network design is to focus on the items for which he is accountable and shun those that generate more work. The IPS was part of the firewall software and therefore considered a configuration change to an established technology, whereas the vulnerability scanner spat out reports demanding that administrators do more work. Ignoring the reports would delay accountability and impact on the computing environment.
Here is how Harold might have done things differently and achieved a better result:
  • Ownership: Get direct senior management support for the entire initiative. Make sure that management understands the commitment required to remediate and monitor. If you don’t lead with this important step, expectations are not set and the required impact on operations is unclear. Harold did not have strong backing to get the attention of local IT managers. Since the IT managers were never included in the planning, design, and selection of the technology, they did not have the opportunity to test and validate the technology in the environment. Also, when remediation wasn’t getting done, Harold had nowhere to escalate. Furthermore, senior managers can facilitate communication with key participants in the program. Ideally, the senior manager communicating the importance of the VM initiative should be a global CIO, CSO, or CISO.
    Related to senior management commitment, highlight the requirement that successful remediation is a key job performance indicator. In the previous example, Harold did not have the interest of the local IT managers. They seldom fixed anything because it seemed to them that Harold was whining about how broken their systems were, which created more work for which they would get no recognition. There was simply no incentive. Harold was never received as a key supplier of solutions to crucial IT challenges. On the other hand, if the IT managers were told that they were being measured on their remediation performance, then they would be more likely to comply. In fact, they would quite possibly be proactive. To add a motivational carrot to the stick, a competition for lowest average score might have driven excellence into the process. To the individual manager with the lowest average score or perhaps the greatest improvement would go additional bonus budget.
  • Preparation: Test! Test! Test! Any system connected to a network with broad-reaching and possibly invasive functionality should be completely tested against every component it touches in the environment. Ultimately, a list of criteria for systems that can be scanned should be developed, adverse effects on systems should be well documented, and candidates for exclusion from scanning identified.
  • Process: Change management must be carefully considered prior to any implementation. As it happens, in Harold’s company, a new firewall system with a built-in IPS was in the implementation phase. When scans began, San Francisco had already deployed a new firewall, and this firewall answered discovery scans on every IP address in the range allocated to that location. (More about this phenomenon later in this book.) As Harold continued his scans, the firewalls continued to be rolled out, and eventually every network was giving him bogus results. Had he participated in the change management process, he would have known that firewalls were being deployed, and he could have tested the effects in advance. He could also have correlated the change in the scanners’ behavior with the sequence of firewall installations. A simple sanity check on discovery results, as sketched below, would have flagged the problem on day one.
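The telltale sign was hiding in the numbers: 4094 is exactly the usable capacity of a /20 network (4096 addresses minus the network and broadcast addresses), so every address in the range was answering. The following is a minimal sketch in Python of a discovery sweep that checks its own results; it is not any particular scanner’s method, the subnet and threshold are illustrative assumptions, and the ping flags assume a Linux system:

import ipaddress
import subprocess

def host_responds(ip: str, timeout_s: int = 1) -> bool:
    # One ICMP echo request via the system ping command (Linux flag syntax).
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def discover(cidr: str, suspicion_threshold: float = 0.95) -> list[str]:
    # Sweep every usable address in the range.
    network = ipaddress.ip_network(cidr)
    live = [str(ip) for ip in network.hosts() if host_responds(str(ip))]
    # Sanity check: a response from nearly 100 percent of a large range is
    # almost never genuine. 4094 live hosts in a /20 means every usable
    # address answered, which points at a middlebox (such as an in-line
    # IPS or a proxy-ARP device) replying on behalf of unused IPs.
    usable = max(network.num_addresses - 2, 1)
    if len(live) / usable >= suspicion_threshold:
        print(f"WARNING: {len(live)} of {usable} addresses in {cidr} "
              "responded; results are probably phantoms, not real hosts.")
    return live

if __name__ == "__main__":
    discover("10.20.0.0/24")  # illustrative subnet

Commercial scanners record the same discovery totals; the point is that someone must look at them and correlate anomalies with the change management calendar, as suggested above.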

Rationale for a VM Program



So, why do we undertake a VM program? There are several good reasons; some are technical and others simply make good business sense. Choose what best fits your situation.

Overexposed Network

In the context of IT security, businesses have two kinds of objectives: mission and compulsory. A mission objective is directly related to producing revenue or enhancing profits. Compulsory objectives are those that must be achieved as a matter of prudence or regulation; insurance is one example, since companies purchase liability insurance to reduce their exposure to loss. Network security is another compulsory business objective, and many organizations place a higher priority on meeting mission objectives than compulsory ones. Too often, network-based defenses alone are inadequate to stop a well-designed attack. Naturally, some companies may perform a risk analysis and determine from it how much to spend on network security. Sometimes the size and complexity of the network make defenses that cost-effectively address the risk impractical; in other cases, the company simply chooses to accept the risk.
If your company has strong network defenses, then perhaps there is a lot of focus on detecting and preventing an attack. However, blocking in the network or on the perimeter is not enough. It is unlikely that the defenses are completely reliable and foolproof, or that they address all potential attacks from every vector. Any security professional knows that the insider threat is as great as the external one, yet most defenses are not directed toward insider threats. At present, it is not financially practical to put intrusion prevention, anti-virus, content filtering, traffic analysis, and application behavior analysis on every single port of a network of 50,000 nodes. Even if one could do so in a financially sound way, additional, redundant layers of defense would still be needed, because some of those network-based defenses have weaknesses that can be exploited and may even be turned against the organization. For example, there are several well-known methods for evading intrusion detection systems; encryption is a simple one, and applications that use encryption are very effective at concealing attacks.
The only way to address these weaknesses is a basic defense-in-depth strategy that removes single points of failure. Most network security strategies rely on perimeter and/or bolt-on defenses which, if they fail, leave a vulnerable host wide open to exploitation. This is overexposure at its worst: a perception of security but no real security, unless additional layers are added, one of which must be the host itself, hardened against attack by removing the vulnerabilities.

No Standard for Secure Systems Configuration

Large companies typically develop one or more standard configurations for systems connected to a network, including standards for desktop and server operating systems, network devices, and even printer configurations. These standards often have security practices built in. When such standards are absent, more vulnerabilities are likely to exist than when they are present; the mere fact that a standard exists suggests that some care and thought is being given to the state of a device.
However, even where standards exist, configurations can age. Once a standard is established, it is difficult to change it and bring all hosts into compliance. Even if a patch management system is in place, configuration weaknesses cannot be fully addressed by patches, and in most cases patch management systems will not find everything requiring remediation. Patch management is no substitute for VM.
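To illustrate the gap, here is a minimal sketch in Python of the kind of configuration audit a VM process performs and a patch tool does not; the baseline settings and host data are hypothetical, not drawn from any published standard:

# A secure-configuration baseline: no patch exists for a weak setting,
# so a patch management system would report this host as fully compliant.
BASELINE = {
    "ssh_password_auth": "no",   # require key-based authentication
    "telnet_enabled": "no",      # legacy cleartext service must be off
    "min_tls_version": "1.2",
}

def audit(hostname: str, observed: dict[str, str]) -> list[str]:
    # Report every setting that drifts from the baseline.
    return [
        f"{hostname}: {key} is {observed.get(key, 'unset')}, "
        f"baseline requires {expected}"
        for key, expected in BASELINE.items()
        if observed.get(key) != expected
    ]

# Example: a host with every patch applied but two weak settings.
for finding in audit("ny-mail-01", {
    "ssh_password_auth": "yes",
    "telnet_enabled": "no",
    "min_tls_version": "1.0",
}):
    print(finding)

Real scanners express such checks as policies or signatures, but the principle is the same: compliance with a configuration standard must be verified independently of patch level.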
The negative side of standardization is the ubiquity of vulnerabilities. If a standard configuration is deployed globally and has a vulnerability, then the vulnerability is everywhere. If not detected and remediated quickly, it can lead to serious security problems.

Risk of Major Financial Loss

When the risk of a breach is high, management’s concerns naturally turn to the impact of that risk being realized. With increasing government regulation and an aggressive tort system, the potential for financial loss is considerable; these losses can come from litigation and/or civil penalties. Imagine losing a client’s confidential data due to failure to remediate a critical, published vulnerability. I can see the tort lawyers circling!
California Civil Code § 1798.84 specifies considerable penalties for companies doing business with California residents that fail to notify the victims of a security breach within a certain period of time. The damage from such a disclosure could be large, but so could the civil penalties. The legislation is only one example of a growing trend toward punishing companies responsible for data breaches, regardless of where those companies are physically located, because they do business with local residents.

Loss of Revenue

A more direct loss, that of revenue or potential revenue, is of major concern to any business. When a client is lost, the business suffers not only the loss of revenue but also damage to its reputation, and recovering from reputational damage is far harder than recovering from most other kinds of loss. Customer confidence must be re-earned to the point that it overcomes the bad taste of the past, and this applies to future customers as well: it is generally much more difficult and expensive to win new customers against the headwind of a highly publicized security incident.
Oddly, it is beginning to appear that consumers are easier to appease than corporate customers; for whatever reason, consumers tend to forget or ignore a company’s security breaches more readily than business customers do. Consumers do, however, gradually become more wary of conducting transactions, and that erosion of trust ultimately leads to more expensive government regulation. Businesses can do everyone a favor by being more diligent in managing vulnerabilities.

Lost Productivity

When systems are compromised, they often become unusable for a period of time. If these are critical systems, significant productivity is lost as employees cannot perform their jobs. Harder to measure than employees simply sitting idle are the cases where employees can continue working but take much longer to complete a task. This workaround environment can be difficult to start and equally difficult to stop once the affected systems are returned to service; even if current records are available at the conclusion of an incident, the data must be resynchronized with the now-operational system.
It is also often the case that many time-consuming activities must take place before a system is returned to service: it must be analyzed for the cause of the failure, rebuilt, patched, reviewed for additional security measures, and closely monitored for a secondary attack. To add to this, the IT employees who must clean up the mess are taken away from other activities that could be more directly focused on producing income, reducing costs, and enhancing productivity elsewhere. That, in turn, extends to the opportunity cost of not meeting a market need in a timely manner. After all, there are no IT cleanup crews just sitting around waiting for something to happen; existing resources must be redirected to put out fires.
