Critical Incident Management: 2/24/08

Computer networks such as the Internet, which have no central administrative control or unified security policy, are called open-ended networks. Because of their open-ended nature, there is no realistic way to determine how many nodes are attached to the network. Regardless of the best efforts of information security officers, no degree of hardening can make a computer system connected to an open-ended network invulnerable to attack. However, systems designed with the goal of delivering profitable services while maintaining properties such as confidentiality, integrity, and availability go a long way toward contributing to an organization's survivability in the face of disasters.

Today's large-scale networks are highly distributed in an effort to improve efficiency and effectiveness by permitting high levels of integration. These levels of integration, while providing great strength of communication between networks, also carry elevated risks associated with unauthorized intrusion and compromise. These risks can be somewhat mitigated by implementing survivability in an organization's systems. Survivability incorporates risk management, fault tolerance, performance testing, and auditing.

Survivability can be defined as the capability of a system attached to an open-ended network to continue to deliver profitable services in the presence of accidents, attacks, or system failures.

The terms accidents, attacks, and failures are meant to include all potentially damaging events. Attacks include intrusions, viruses, worms, Trojan horses, and denial-of-service attacks. A system made overly restrictive in response to attack threats may significantly reduce its own functionality while directing excessive resources to protecting and monitoring its assets.

Failures and accidents are risks caused by deficiencies in the system itself, or in an external item on which the system depends. Failures may be attributable to design errors, human errors, hardware failures, coding errors, or corrupted data.

Accidents are usually described as random events, including naturally occurring disasters such as floods, blizzards, and earthquakes.

For a system to achieve high levels of survivability, it must react to and recover from damaging events while continuing to deliver efficient and effective services. In fact, reaction and recovery must be at acceptable levels whether or not the cause of the damaging event is ascertained. Central to survivability is the notion that the system is sufficiently redundant that, even if significant portions of it were damaged or destroyed, it would continue to meet demands.
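The redundancy idea above can be illustrated with a minimal failover sketch. It is only an illustration under stated assumptions: the replica functions (`damaged_node`, `healthy_node`) and the `call_with_failover` helper are hypothetical names, and a real system would add timeouts, health checks, and logging. Note that the caller keeps getting service without ever learning why a replica failed.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no redundant replica could serve the request."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each redundant replica in turn and return the first successful reply."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except Exception as exc:
            # The cause is recorded but not diagnosed; service continues regardless.
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed")


def damaged_node(request: str) -> str:
    """Hypothetical replica that has been damaged or destroyed."""
    raise ConnectionError("node unreachable")


def healthy_node(request: str) -> str:
    """Hypothetical replica that is still functioning."""
    return f"served: {request}"


if __name__ == "__main__":
    # Two of three replicas are down, yet the request is still served.
    print(call_with_failover([damaged_node, damaged_node, healthy_node], "balance"))
```

The service only becomes unavailable when every replica fails, which is the point at which `AllReplicasFailed` is raised.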

For example, a survivable financial system maintains confidentiality, integrity, and availability of critical information when nodes or communication systems are not functioning as a result of harmful events. This financial system is survivable owing to its robust design. It recovers and delivers critical services in a timely manner in the face of disaster. The hallmarks of a survivable system are the identification of critical services, the identification of the essential components that support them within the system, and the ability to deliver these services in spite of harmful events. These are some of the key elements of survivable systems connected to open-ended networks:

Resistance to attacks. Strategies include strong user authentication and verification; configuration management; change controls; upgrade and patching policies; audit, antivirus, and e-mail policies; partitioned sub-networks; firewalls; proxy services; network address translation services; redundant copies of data backups and critical services; and well-developed risk-management programs.

Recognition of system attacks. Strategies include detecting intrusion attacks and understanding the current state of the system well enough to evaluate the extent of damage effectively.

Creation of event and transaction logs. These logs must document the external and internal activities taking place on the network. The details contained in these logs can go a long way toward saving your system administration and legal bacon. Many experienced administrators strongly suggest that logs be maintained on Write Once, Read Many (WORM) media. Such media prevent a malicious person from deleting the record of his or her harmful activities after the fact.

Recognition of intrusion attack patterns. Strategies include virus scans, systems vulnerability scans, internal integrity checking, logging, audits, system monitoring, and network monitoring.

Recovery of full or critical services. Recovery is based on critical asset prioritization, recovery, and business resumption.

Development and implementation of restoration strategies. These strategies cover restoring compromised data and critical functionality, limiting the extent of damage, maintaining or resuming critical services, and eventually restoring all services as time and resources allow.

Restoration of critical data and applications. Strategies include use of alternative services, use of redundant components with the same or a similar interface, operational procedures to restore the system configuration state, containment and isolation of damage, and a practiced ability to operate critical services with reduced resources.
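The event and transaction logs described above are only useful if tampering with them can be noticed. Where true WORM media are not available, one software approximation is a hash-chained log, in which each entry commits to the one before it. The sketch below is a minimal illustration, not a substitute for WORM storage; the function names (`append_entry`, `verify_log`) and the sample events are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    """Hash the previous entry's hash together with this event's content."""
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def append_entry(log: list, event: dict) -> None:
    """Append an event, chaining it to the hash of the preceding entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)})


def verify_log(log: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


if __name__ == "__main__":
    log = []
    append_entry(log, {"user": "alice", "action": "login"})
    append_entry(log, {"user": "alice", "action": "delete", "target": "ledger.db"})
    print(verify_log(log))               # True: chain is intact
    log[1]["event"]["action"] = "read"   # an intruder rewrites history
    print(verify_log(log))               # False: the stored hash no longer matches
```

A chain like this detects edits to existing entries, but it does not by itself detect removal of the most recent entries; for that, the latest hash would need to be kept somewhere the intruder cannot reach, such as the write-once media the administrators recommend.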

Risk management planning requires that risk management decisions and financial trade-offs be made by senior managers, with the guidance and recommendations of technical experts in application and data domains, security, and software engineering. System survivability depends at least as much on risk management development and implementation as it does on the technical abilities of the organization's employees. The role of experts in security and technical issues is to provide senior managers with the information necessary to make informed risk management decisions.

In the design of new systems or the refitting of older ones, survivability imposes structure on all phases of the system and software development process. At the requirement and specification levels, critical assets must be identified. Requirements for damage resistance, recognition, recovery, and resumption should be specifically addressed. System architectures should address survivability equally with other performance properties such as capacity, reliability, and maintainability. In the selection of commercial off-the-shelf software, solutions should be chosen with survivability as one of the highest priorities.

Software solution design and implementation should include techniques for containment and isolation, replication, restoration, and migration of critical assets. Survivability solutions must be integrated into both new and existing systems to help avoid system failure due to attacks, accidents, or natural disasters.
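Identifying critical services and the components that support them also suggests an order in which to restore things after an incident. The following is a rough sketch of that idea, assuming a hypothetical dependency map (`DEPENDENCIES`) and a hypothetical set of critical services; it uses Python's standard `graphlib` module (Python 3.9 or later) to restore supporting components before the services that rely on them.

```python
from graphlib import TopologicalSorter

# Hypothetical components: each name maps to the components it depends on.
DEPENDENCIES = {
    "payments-api": {"database", "auth"},
    "auth": {"database"},
    "database": {"storage"},
    "storage": set(),
    "reporting": {"database"},  # not critical, so it is restored later
}

CRITICAL_SERVICES = {"payments-api"}


def required_components(services, dependencies):
    """Return the critical services plus everything they transitively depend on."""
    needed, stack = set(), list(services)
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(dependencies.get(name, ()))
    return needed


def restoration_order(services, dependencies):
    """List the required components so that dependencies come before dependents."""
    needed = required_components(services, dependencies)
    graph = {name: dependencies.get(name, set()) & needed for name in needed}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(restoration_order(CRITICAL_SERVICES, DEPENDENCIES))
    # ['storage', 'database', 'auth', 'payments-api']
```

Non-critical components such as the hypothetical `reporting` service are left out of the first pass and can be restored as time and resources allow.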

Connecting to the Internet: Policies and Procedures of Survivability
