A Primer on Incident Response for Healthcare
Andrew T. Robinson
February 5, 2002
What Can Go Wrong, Will
A wise man named Murphy once said, “If something can go wrong, it probably will.” When things go wrong with your information security implementation, the result is euphemistically called an “incident.” As has been demonstrated many times in the past year, new hacking exploits (like the UNICODE vulnerabilities of Microsoft’s Internet Information Server) and viruses and worms (like CODE RED, a worm that compromised nearly 350,000 Microsoft Internet Information Servers) come out faster than you can patch your system or update your intrusion and virus detection signatures. The window of vulnerability for each exploit may range from hours to months, and any Internet-connected organization is subject to such vulnerabilities.
No matter how much time, money, and technology you throw at your information security implementation, Mr. Murphy will eventually get the last laugh.
An incident is any condition or activity that compromises, or may lead to the compromise of, the security (confidentiality, integrity, or availability) of your information system. Incidents include everything from routine “script kiddie” probes of your network to attacks by organized computer criminals and even hostile nation states. In this article I specifically address intentional acts that violate your security policy or applicable laws and regulations, but these same steps may be applied to any form of incident response, including accidents and natural disasters.
How you respond to an incident, cleverly enough, is called incident response. Incident response may be divided into four phases:
- Detection
- Response
- Recovery
- Analysis
1. Detection
Detection must come before any of the other phases. The earlier you detect an incident, the easier it will be to assess, recover from, and analyze. Ideally, you will detect the incident before it results in a compromise of your information security. Malicious compromises usually do not occur spontaneously; they are preceded by hours, days, or weeks of preparatory intelligence gathering and target-softening, and those activities can be detected.
Without an effective incident response plan, you are more likely to detect the incident after real damage has already been done. At that point, response and recovery become much more difficult, and you are essentially in a disaster recovery mode. Your goal with incident response is to avoid disasters through early detection—but you still have to be able to deal with disasters.
Detection may be accomplished through the use of intrusion detection systems (IDSs), system monitoring, audit trail analysis, or a user's perception that something is "not quite right." Too often the human element of detection is overlooked in favor of technological solutions. It is important to recognize that automated detection systems have significant limitations. A diligent system administrator who reviews system logs and audit trails on a daily basis is often more effective than an IDS—but both human and automated elements play an important role in contingency management.
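As a rough illustration of the audit trail analysis described above, the following Python sketch counts failed logins per source address and flags sources that cross a threshold. The log format, the regular expression, the function name, and the threshold of five are all assumptions made for this example; real system logs will differ and need their own patterns.

```python
import re
from collections import Counter

# Hypothetical log format, assumed for illustration only:
#   "2002-02-05 03:14:07 FAILED LOGIN user=root src=10.0.0.5"
FAILED_LOGIN = re.compile(r"FAILED LOGIN user=(\S+) src=(\S+)")

def suspicious_sources(log_lines, threshold=5):
    """Return, in sorted order, sources with at least `threshold` failed logins."""
    counts = Counter()
    for line in log_lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(2)] += 1  # group 2 is the source address
    return sorted(src for src, n in counts.items() if n >= threshold)

# Six failures from one address, one failure from another.
sample = [
    "2002-02-05 03:14:%02d FAILED LOGIN user=root src=10.0.0.5" % i
    for i in range(6)
] + ["2002-02-05 09:00:00 FAILED LOGIN user=alice src=10.0.0.9"]

print(suspicious_sources(sample))  # ['10.0.0.5']
```

A script like this supplements the administrator's daily review rather than replacing it: it only finds the pattern it was told to look for.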
2. Response
Once you have detected the incident, you must respond to it. Your response will consist of one or more cycles of assessment and containment.
- Assessment is the determination of the nature and extent (severity) of the incident, and the mobilization of the resources necessary to mitigate the threat posed by the incident.
- Containment is the deployment of resources mobilized during assessment to reduce or eliminate any ongoing damage from the incident.
The number of assessment and containment cycles, and the nature of the containment activities, depend on the severity of the incident. If it's a routine probe of your network (even small sites may experience hundreds or thousands of these a day), your response will be limited. If you find out that one of your systems is "owned" by a threat agent, you may have hours or days of cleanup and analysis ahead of you. Your plan for responding to an incident must be flexible because you cannot come up with a set of rote steps for every possible contingency. This implies that everyone in the organization must be trained on how to respond, even if that response is to notify someone else.
While effective assessment and containment are central to the response phase, two other vital elements are communication and documentation.
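The idea of repeated assessment and containment cycles can be sketched as a simple loop. This is a toy model under stated assumptions, not a prescribed procedure: `run_response`, `assess`, `contain`, and the `max_cycles` limit are hypothetical names invented for this example, and in practice both steps are carried out by people, not functions.

```python
def run_response(assess, contain, max_cycles=10):
    """Alternate assessment and containment until nothing is left to contain.

    assess() returns a list of open problems (empty once contained);
    contain(problem) acts on one of them.
    Returns the number of containment cycles that were needed.
    """
    for cycle in range(1, max_cycles + 1):
        problems = assess()
        if not problems:
            return cycle - 1
        for problem in problems:
            contain(problem)
    raise RuntimeError("not contained after %d cycles; escalate" % max_cycles)

# Toy scenario: containing the first problem reveals a second one.
open_problems = ["web server compromised"]
revealed_later = ["back door on database host"]

def assess():
    return list(open_problems)

def contain(problem):
    open_problems.remove(problem)
    if revealed_later:
        open_problems.append(revealed_later.pop())

print(run_response(assess, contain))  # 2
```

A routine probe corresponds to the case where the first assessment finds nothing to contain, so the loop ends immediately with zero cycles.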
- Communication ensures that the people who need to know are informed quickly and efficiently. For a routine probe, the system manager may not need to inform anyone else. For an active compromise, parties may include law enforcement, management, legal, and public relations (the latter to decide how to treat the incident in the media—in some cases the compromise may become public knowledge without any action on your part, in which case PR will be needed for damage control).
- Documentation ensures that all aspects of the contingency management process related to this incident are written down somewhere ("written" in this context includes electronic media such as electronic mail). This includes a description of the incident, how it was detected, and what response and recovery steps were taken. Documentation should occur immediately upon detection and should be updated throughout the process until the incident is considered closed. Complete and detailed documentation is particularly vital if you plan to take legal action against a threat agent—but it's vital in any case if you expect to learn anything from the incident and reduce your risk of a recurrence.
Without communication and documentation, your incident response is likely to be ineffective. You may detect and respond to an incident on a system you are responsible for, but if you do not let others know of the incident, effects on another system may go unnoticed. Without documentation, others will not be aware of the incident and the lessons learned from dealing with the incident will be lost (organizational memories tend to be short without written documentation).
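One way to keep documentation current from detection through closure is a running, timestamped record. The sketch below is a minimal illustration; the class name `IncidentRecord`, its fields, and the phase labels are assumptions chosen for this example, not a standard format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class IncidentRecord:
    """Running record of one incident (hypothetical structure)."""
    description: str
    detected_by: str
    entries: list = field(default_factory=list)
    closed: bool = False

    def log(self, phase, note, when=None):
        """Append a timestamped note; refuse changes once the incident is closed."""
        if self.closed:
            raise ValueError("incident is closed")
        when = when or datetime.now(timezone.utc)
        self.entries.append((when, phase, note))

    def close(self, note):
        """Record a final note and mark the incident closed."""
        self.log("closed", note)
        self.closed = True

record = IncidentRecord(
    "Unexpected administrator login on web server",
    detected_by="daily log review",
)
record.log("response", "Disconnected server from the network")
record.log("recovery", "Reinstalled operating system and applications")
record.close("Lessons-learned meeting held")
print(len(record.entries), record.closed)  # 3 True
```

Whether the record lives in a script, a ticketing system, or an e-mail thread matters less than starting it at detection and updating it at every step.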
3. Recovery
If your response is effective, the immediate threat should be reduced or eliminated. Now you must recover from any damage that occurred during the incident. If the incident was a routine probe, there may be no damage and no recovery will be necessary. If the incident involved any level of compromise, you will have to perform some recovery steps. Not all damage will be done by the threat agent. In the process of responding to the incident, you may cause damage to your own system. For example, you may disconnect your Internet connection or disable a system that was under attack. Recovery means restoring any functionality that was lost, including functionality you disabled as part of your response.
If a system was compromised by a threat agent, you will have to perform forensic analysis to determine whether there is an ongoing risk. Often, threat agents leave behind “back doors” and other malicious software on systems they have compromised. Some of these are very sophisticated and difficult to detect. The very act of powering down a system so you can analyze its hard drive may trigger malicious software that destroys data.
When developing your incident response plan (the middle of an incident is too late to start), you need to consider whether you are more concerned about recovering as quickly as possible (recovery priority), or whether you want to pursue legal action against a threat agent (evidence priority).
- Evidence priority — Every action you take on an affected system degrades the value of electronic evidence. If there is any chance you will pursue legal action against a threat agent, you should isolate the compromised system, without logging in to it, by disconnecting it from the network. You should not tamper with the system in any way, including powering it down, until you have consulted with law enforcement and/or an organization specializing in computer forensic analysis. A forensic analysis may take the affected system offline for weeks, months, or years—so your business continuity plan (BCP) will need to include provisions for restoring that functionality on separate, known-good hardware and software.
- Recovery priority: If your goal is to get the system back in production right away, you will need to perform a "known good" installation. This will destroy evidence, but you should never put a compromised system back in production without first reinstalling the operating system and applications. If you have good backup and recovery procedures, you can restore an image of the system from backup media. However, some compromises are active for a long time before they are detected (sometimes years), so you may have backed up a system that was already compromised. Restoring such a backup will only leave you vulnerable. To be sure, you may want to install the system from scratch, including the operating system and applications, preferably on different hardware.
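The backup concern under recovery priority can be made concrete with a small sketch: given the dates of available backups and an estimate of when the compromise began, choose the newest backup that predates it, or fall back to installing from scratch. The function name and the sample dates are hypothetical, and the estimated compromise date is itself an assumption that forensic analysis may later revise.

```python
from datetime import date

def pick_restore_point(backup_dates, earliest_compromise):
    """Return the newest backup made before the estimated compromise date.

    Returns None when every backup may already be compromised, in which
    case the system should be installed from scratch.
    """
    safe = [d for d in backup_dates if d < earliest_compromise]
    return max(safe) if safe else None

# Hypothetical weekly backups.
backups = [date(2002, 1, 6), date(2002, 1, 13), date(2002, 1, 20), date(2002, 1, 27)]

print(pick_restore_point(backups, date(2002, 1, 15)))  # 2002-01-13
print(pick_restore_point(backups, date(2002, 1, 1)))   # None
```

When the start of the compromise cannot be estimated with confidence, treating every backup as suspect is the safer choice.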
Communication and documentation are as important to recovery as to response. Frequently, the effects of a compromise are more extensive than they appear at first glance. By making sure that all interested parties are "in the loop" and that the documentation for the incident remains up to date, you have a much better chance of identifying and recovering from all the effects of the incident.
4. Analysis
Once you have recovered, the final step of the contingency management process is analysis. The analysis step does not need to be performed immediately—especially if you have spent the preceding 24 or 48 hours recovering from an incident—but it does need to be performed. Analysis means getting all the people who were involved in incident detection, response, and recovery together to discuss what happened and how your contingency management process can be improved. Bottlenecks and delays in the process can be identified and reduced or eliminated. Lessons learned are applied to future incidents. You will never eliminate incidents, but you can become very good at responding to them.
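If the incident documentation includes timestamps, finding bottlenecks and delays can start with simple arithmetic on the time between milestones. The sketch below assumes a hypothetical list of milestone names and times; the function names and the sample timeline are invented for illustration.

```python
from datetime import datetime

def phase_durations(milestones):
    """Given (name, time) pairs in chronological order, return the
    elapsed time between each consecutive pair of milestones."""
    return [
        ("%s -> %s" % (a, b), tb - ta)
        for (a, ta), (b, tb) in zip(milestones, milestones[1:])
    ]

def slowest_step(milestones):
    """Return the (label, elapsed) pair with the longest delay."""
    return max(phase_durations(milestones), key=lambda item: item[1])

# Hypothetical timeline taken from an incident's documentation.
timeline = [
    ("occurred", datetime(2002, 2, 1, 8, 0)),
    ("detected", datetime(2002, 2, 3, 8, 0)),
    ("contained", datetime(2002, 2, 3, 14, 0)),
    ("recovered", datetime(2002, 2, 4, 14, 0)),
]

label, elapsed = slowest_step(timeline)
print(label, elapsed.total_seconds() / 3600)  # occurred -> detected 48.0
```

In this made-up timeline the longest delay is in detection, which would point the review meeting toward monitoring and log review rather than recovery procedures.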
Conclusion
A good incident response program does not require a lot of money. Human factors are often more important than technology, and the more you can use all your personnel as force multipliers—particularly for the detection phase—the more effective your incident response will be. Also remember that you will never have a perfect incident response plan. Do not spend a lot of time and treasure trying to get it right the first time. Do the best you can, and use the analysis phase to refine your plan over time.