Tuesday, August 21, 2007

Security: Logging and Auditing Aid Recovery

The final installment in a series on how to develop secure applications focuses on a category that is not appreciated until things go wrong

Often ignored, but then missed in hindsight, this category of issues is one of the reasons being proactive can make an enormous difference to software security. Logging critical user activity can help track user actions and provide non-repudiation. It can also give early warning of attacks and provide information that helps you recover from them, as well as prevent them in the future. Last, but not least, it can help in debugging software problems in general. However, all too often logging is treated as optional at best. This column will cover the essentials of logging from a security perspective. It will present the tools, techniques, strategies, and processes involved in efficiently and effectively logging data. Bearing in mind that logs are only useful when they are monitored, this article will also cover the need for auditing and the processes around it.

This is the last in the series of articles that cover the security frame—a framework to evaluate the security of your applications as well as to build security into those very same applications. In previous editions of this column we have covered configuration management, data protection in storage and transit, authentication and authorization, user and session management, data validation, and error handling and exception management. As we wrap up this series, we discuss—last, but by no means least—logging and auditing. It’s a category whose worth tends not to be appreciated until things go wrong. In other words, when things are going smoothly, no one really cares about logging and auditing—so much so that it is often a challenge to convince development teams of the importance of investing in a robust logging and auditing strategy from the very beginning. However, when things do go wrong (and they inevitably do), the lack of logging and auditing can result in significant difficulties in dealing with the failure, whether from a security perspective or otherwise.

Logging

To begin with: Why log?

In my experience with countless developers and development teams, most often this is not a question that causes people to think twice. It is safe to say that in most cases, development teams see the value in engaging in logging. The problem, however, arises in the implementation. Logging is not enforced as a requirement, so it is seen as a nice-to-have, as opposed to a must-have. When users complain about performance, the first thing that is cut is the logging capability. After all, as developers and development teams, we are prone to think, “Nothing can possibly go wrong, and hence we would rarely if ever need to perform any logging. And if it’s going to save us a few milliseconds, why not just get rid of it entirely?”

Indeed, why not? Here are a few reasons why not:

Survivability and securability. You must consider the possibility of failure seriously. This means considering what happens if your application fails. How will you maintain at least the services that are critical, while you work to recover from the failure? Logs can be the key to recovery and to determining what went wrong, allowing you to build strategies to prevent such a failure in the future.

Bug fixing. This one goes beyond the realm of security. Often the nastiest of bugs—timing issues or race conditions, for instance—will show up only in production environments as the application is exercised in a real-world, possibly multi-user, environment. Without logs, tracking down and debugging these one-off problems can be hard, if not impossible, since they are hard to reproduce. Hence, your best bet as a developer is to have detailed logs that can then be used to reconstruct the problem. This can also be useful for bugs in general that are discovered only once the software is deployed or in production on customer sites where installing a debugger and attaching to the offending process might not be possible. All in all, logs can be an effective debugging tool, especially once an application is live.

Health and performance monitoring. A common strategy, especially in large and long-running enterprise applications, is to use the logs as a mechanism to show activity and progress. This is especially true for applications that are in more of a batch processing than an interactive mode. For these applications, log-monitoring software can be used to detect “heartbeats” in an automated manner from the logs and to report progress and activity in, for instance, an enterprise application dashboard.

Compliance. While this reason is often misused as a stick to sell everything from firewalls to security software, the fact is that a number of recent regulations, such as the Gramm-Leach-Bliley Act (GLBA), the Payment Card Industry (PCI) Standard, and the California SB-1386 bill, require some level of audit trails. Companies not providing such audit trails throughout their IT infrastructure can be found in violation and be subject to fines and other punishments. Each of these regulations has specific audit trail requirements and, while we will not delve into the specifics of these here, the best practices we are describing will ensure compliance with them.

Accountability and non-repudiation. The need to demonstrate compliance is only part of the need for audit trails, which provide accountability and non-repudiation. These are intended to help associate specific actions with the users who performed or triggered the actions. Without accountability and non-repudiation, the security value of logs is questionable and, in some cases, the financial impact felt if they are lacking can be enormous. Perhaps the best example of this is in online banking or stock trading applications.

Forensic value. Well-maintained and handled audit trails can be extremely valuable when prosecuting the perpetrators of intrusions within a company’s IT infrastructure. Here again it is not only necessary to maintain the logs but, in order to retain their evidentiary value, they must also be handled appropriately, especially after an intrusion has taken place and with regard to aspects such as chain of custody. Even if there is no desire to prosecute, logs can be invaluable in investigating an incident to determine exactly what took place, how and when it occurred, and “what was taken,” so to speak. Such information can also be useful to prevent future attacks by helping isolate security holes that might have been exploited. Not only that, but when logs across different servers (the application itself, the Web and database servers) and hardware (routers and firewalls) are correlated, they can provide a detailed anatomy of the attack that can provide invaluable lessons in defending the system in the future.

Psychological value. Finally, often because of all the reasons mentioned above (especially the last two), an effective and efficient logging subsystem that cannot be easily compromised can act to discourage attackers who are concerned that their attack will be detected or that they will leave a trail that can be tracked back to them. This is especially true for insider attacks, where such audit trails are likely to be very valuable to the investigators and even more damning to the perpetrator.

By now, we hope you are convinced of the value of logs and audit trails and of the reasons they should not be the first feature to be chopped or turned off in the quest to come in under budget or on schedule, or post-production to improve performance. Although these might be legitimate business decisions, it is also important to understand the impact that turning off or disabling logging has on the threat model of the application.

Also by now, all this talk of logging has probably resulted in a number of questions that we will attempt to answer in the next few sections.

What Should I Log?

To answer this question, it is best to think about two types of information: meta information, which provides data for context, and event-specific information, which provides details that explain why the event warranted logging. The meta information should tell you when the event took place, who performed the event, and where the event was triggered from. With this in mind, meta information must include, at a minimum:

Date and time of the event. Without time information, a log can often be meaningless; time information makes it possible to backtrack and determine when an attack or compromise might have begun. To be most effective it is best to make sure that some level of time synchronization exists among the different servers, applications and hardware that will perform logging. This will allow for end-to-end log analysis.

User/originator information. It is critical to store information about who triggered the event. Like date and time information, these details are crucial to the usefulness of the logs from an audit-trail perspective—otherwise they provide little to no accountability, and non-repudiation cannot be achieved unless the action and event are tied back to a specific individual. However, it is important to note that the user running a server may not necessarily be the user performing the action. This can happen, for instance, if the server impersonates a higher-privileged user (e.g. LocalSystem). In such cases, it is critical to log not only the current user ID but also the true user ID. There might also be a need, if appropriate, to store the IP address of the user. This is especially useful for intranet-based applications. If applications are Internet-accessible, it is important to remember that the IP address obtained may not be the true IP address, especially when network mechanisms such as NAT (Network Address Translation) are in use.

Miscellaneous information. Based on the needs of specific applications, it might also be useful to log programmatic information, such as the caller of the function performing the logging and the values of parameters passed to the function. It is also tremendously advantageous to have source code references to aid in debugging: for instance, the name of the source file and a line number. Most programming languages have macros or functions that can provide that information, and thus it is fairly easily accessed. Finally, depending on whether the log file is shared across multiple applications and processes, it might also be necessary to log the application name, process ID and, potentially, even thread ID.
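As a concrete illustration of the meta information described above, the sketch below (in C#) assembles a single log entry carrying a UTC timestamp, the event name, the true user, the client IP address, process and thread IDs, and a source-code reference taken from the call stack. The AuditLog class and the pipe-delimited layout are assumptions made for this example, not a prescribed format.

using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper that captures the meta information discussed above.
static class AuditLog
{
    public static string BuildEntry(string eventName, string userId, string clientIp, string details)
    {
        // Capture the caller's source file and line number (requires PDB symbols).
        StackFrame caller = new StackFrame(1, true);

        return string.Format(
            "{0:u}|{1}|{2}|{3}|pid={4}|tid={5}|{6}:{7}|{8}",
            DateTime.UtcNow,                        // date and time, in UTC to ease cross-server correlation
            eventName,                              // what happened
            userId,                                 // who triggered it (the true user, not the service account)
            clientIp,                               // where it came from (may be a NAT address)
            Process.GetCurrentProcess().Id,         // disambiguates shared log files
            Thread.CurrentThread.ManagedThreadId,
            caller.GetFileName(),                   // source reference to aid debugging
            caller.GetFileLineNumber(),
            details);                               // event-specific information
    }
}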

Now, with regard to actual events, some critical events must be logged. (See Table 1.) It is best to view these events within the context of the security frame, which should be familiar to you by now.

What Should I NOT Log?

A number of the events mentioned above can deal with sensitive information. Now that we have said that we should be logging the occurrence of these events, the question is: How much information do we log? The general rule of thumb is that any information intended to be kept confidential should never be logged—not in its clear text form, not even in its encrypted form. This includes all sensitive data such as passwords and private information (e.g. Social Security numbers or credit card information). Further, to control the size and overhead of logging, avoid logging entire database tables or record sets. If it is absolutely necessary, development teams may consider logging the queries, the size of the record set, and whether the access was successful or denied. Similarly, it is best to simply log a reference to the code, rather than logging actual source-code chunks. A filename and line number should be all the information that is necessary to provide developers with an indication of where in the source code the event occurred.
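To make this rule concrete, one option is to pass every message through a masking step before it reaches any log sink. The sketch below is purely illustrative; the regular expressions are assumed patterns for card numbers and U.S. Social Security numbers and would need to reflect the sensitive fields your application actually handles.

using System.Text.RegularExpressions;

// Hypothetical masking step applied to a message before it is logged.
static class LogRedactor
{
    private static readonly Regex CardNumber = new Regex(@"\b\d{13,16}\b");
    private static readonly Regex Ssn = new Regex(@"\b\d{3}-\d{2}-\d{4}\b");

    public static string Redact(string message)
    {
        message = CardNumber.Replace(message, "****");       // mask anything that looks like a card number
        message = Ssn.Replace(message, "***-**-****");       // mask anything that looks like an SSN
        return message;
    }
}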

Where Should I Log to?

There are quite a few options in terms of where the logs should be written. The basic requirement for a log location is that it be securable. This implies that it has adequate access control to prevent unauthorized tampering with the log file from outside the application: for instance, by directly editing the text file in a regular text editor. With this in mind, the general recommendation is that the log files be placed on a different and dedicated log server—possibly on a separate VLAN. From a security perspective, the advantage of doing this is that even if the attacker successfully compromises the application server, he or she would still need to get past another barrier to compromise the logs and/or delete traces of the malicious activity. All updates to the remote log server must then be performed over a secure channel to prevent tampering. The authors have even run into cases where the security demands are so high that logs are written directly to write-once media, such as DVDs.

For many applications, however, an elaborate setup such as the one described above may not be practical. If the threat model does not demand such an approach, there are single-machine alternatives as well that can be effectively secured. The operating system itself provides logging options. For instance, the NT Event Log on Microsoft Windows and syslog on Unix flavors can be used through well-defined APIs. There are, however, a few caveats to bear in mind when using operating-system-based logging. First, such logs are a shared resource across all the applications and operating system components. This implies that these are not meant to be used for extensive logging. Second, and in many ways related, the size of such log files is often controlled by the operating system, with little granular control. Hence, it is quite possible that when you do go into the logs to check on activity, the log entries for your application have been replaced by those from some other application. It does, however, remain an efficient logging mechanism, especially when logging limited information of a highly critical nature or, for instance, logging information about an application’s logging subsystem itself.
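For the limited, highly critical entries just described, writing to the NT Event Log from .NET is straightforward through the System.Diagnostics.EventLog class, as the sketch below shows. The source name "MyApp" is a placeholder, and creating an event source requires administrative rights, so it is normally done once at install time rather than at run time.

using System.Diagnostics;

static class SecurityEventLogger
{
    private const string Source = "MyApp"; // placeholder event source name

    public static void LogFailureAudit(string message)
    {
        // Registering the source needs admin privileges; do it at install time.
        if (!EventLog.SourceExists(Source))
        {
            EventLog.CreateEventSource(Source, "Application");
        }
        EventLog.WriteEntry(Source, message, EventLogEntryType.FailureAudit);
    }
}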

Besides the operating system, most other parts of the infrastructure do have some level of logging capabilities. Most hardware (routers and firewalls), Web servers, application servers and database servers have logging capabilities that are configurable to determine how much and what type of information will be logged. Database transaction logs, in fact, can offer a level of detail that allows them to be replayed in the case of database failures to repopulate the missing data into a fresh database. The problem with this level of logging in general, however, is that it lacks the application context. A Web server log might be able to tell you the specific HTTP return code in response to a request. However, it will struggle to tell you specifically what business object was passed in, how it was processed, and what the business response was. To obtain that level of detail, developers are most often required to create their own log entries, whether that is accomplished by writing to one of the logs already described or to a custom log file reserved for just this application. Custom logs can go a long way in eliminating false positives and saving time by providing that level of detail.

How to Log?

In their most basic form, logs are typically stored on the file system or in a database. Hence, developers have the option of using raw platform APIs for creating and writing to such logs. However, as one might expect, this can be inefficient from both a performance and a productivity standpoint. There are, therefore, a number of more elegant ways to log data from within an application.

First, most programming frameworks and operating systems provide some level of access to at least the operating system logs. For example, the syslog API on Unix systems and the System.Diagnostics.EventLog class in .NET provide access to /var/log/messages and the NT Event Log, respectively. As mentioned above, such logging capabilities come with a number of caveats but nevertheless represent an easily accessible option. With .NET 2.0, an important new feature was health monitoring, which provides you with the rich capabilities of a built-in logging subsystem but eliminates some of the traditional bottlenecks associated with such systems. Health monitoring is tremendously configurable, even allowing you to define parameters such as thresholds for when logging and alerting should start and when they should stop.

Third-party libraries such as the log4* family and the .NET Enterprise Library provide another option. These libraries provide full-function logging capabilities and are tremendously configurable. Further, they can easily integrate with different application types, including thick clients, Web applications, services and even controls. Especially in the case of the Apache Logging Project, logging APIs are available in a wide variety of languages, from Java and .NET to C++ and PHP. These logging APIs also allow for a variety of log sinks, including the more traditional file system, database or syslog as well as message queues and system management software solutions. One important feature that most third-party logging solutions support is the notion of log levels. Log levels, typically Informational, Debug, Warning, Error or Fatal, can help control the level and volume of information that is logged. Production systems should by default only log Warnings and higher, unless a problem is being debugged in production.
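As a brief sketch of what this looks like with one member of the log4* family (log4net, in this case), the class below logs at several levels; with the production configuration set to Warning or higher, the Debug and Informational calls become inexpensive no-ops. TransferService and its method are hypothetical names used only for illustration, and the logger is assumed to have been configured at application startup (for example, via XmlConfigurator.Configure()).

using System;
using log4net;

class TransferService
{
    private static readonly ILog Log = LogManager.GetLogger(typeof(TransferService));

    public void Transfer(string fromAccount, string toAccount, decimal amount)
    {
        Log.Debug("Entering Transfer");                       // suppressed when the level is Warn or higher
        Log.InfoFormat("Transfer of {0} requested", amount);  // routine activity
        try
        {
            // ... business logic ...
        }
        catch (Exception ex)
        {
            Log.Error("Transfer failed", ex);                 // always logged in production
            throw;
        }
    }
}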

Creating custom loggers used to be fairly common, especially before rich third-party libraries were available. Custom log libraries would essentially implement many of the features now available in the third-party libraries. In most cases, we strongly discourage development teams from building their own custom logging implementations. At most, a team might choose to extend one of the existing loggers to, for instance, support a new log sink, such as a custom, mainframe-based logging protocol.
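If such an extension were ever warranted, it could look something like the following sketch, which plugs a new sink into log4net by deriving from AppenderSkeleton instead of reinventing levels, layouts, and configuration. SendToMainframe is a hypothetical transport standing in for whatever legacy protocol the sink requires.

using log4net.Appender;
using log4net.Core;

// Hedged sketch: extend an existing logger with a custom sink rather than
// building a logging framework from scratch.
public class MainframeAppender : AppenderSkeleton
{
    protected override void Append(LoggingEvent loggingEvent)
    {
        // RenderLoggingEvent applies the configured layout to the event.
        string line = RenderLoggingEvent(loggingEvent);
        SendToMainframe(line); // hypothetical delivery to the legacy log sink
    }

    private void SendToMainframe(string line)
    {
        // Protocol-specific delivery would go here.
    }
}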

Log Concerns

While logging is critical to the security of the application, a few considerations must be kept in mind in order to create a secure logging implementation. It is therefore critical, when creating the threat model for the system, to consider the threats to the logging subsystem. Common threats against the logging component include:

Denial of service. The logging subsystem should be implemented to account for disk space utilization. Since the logger will be typically saving all the logs to some persistent data store, it is critical to ensure that the logger is not responsible for exhausting the disk space on that data store, which in turn could result in the logger or potentially the entire application crashing, or a denial of service. The logging subsystem must implement disk space throttling to ensure that disk space utilization never exceeds a fixed quota. This can be implemented in a number of different ways. For instance, consider the use of log rotation, wherein logs are archived to a different location when they reach the maximum size or after a specific time period. A simpler scheme would be just to zero out the log and start again each time the quota is reached.
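One way to realize the log-rotation approach is sketched below using log4net’s RollingFileAppender configured programmatically: cap both the size of the active log and the number of archived files so that total disk usage stays within a known bound. The file path and sizes are illustrative assumptions, not recommendations.

using log4net.Appender;
using log4net.Config;
using log4net.Layout;

// Size-based rotation: the active log rolls at 10 MB and at most five
// archives are kept, bounding worst-case disk usage at roughly 60 MB.
RollingFileAppender appender = new RollingFileAppender();
appender.File = "logs/app.log";
appender.AppendToFile = true;
appender.RollingStyle = RollingFileAppender.RollingMode.Size;
appender.MaximumFileSize = "10MB";
appender.MaxSizeRollBackups = 5;

PatternLayout layout = new PatternLayout();
layout.ConversionPattern = "%utcdate [%thread] %-5level %logger - %message%newline";
layout.ActivateOptions();
appender.Layout = layout;
appender.ActivateOptions();

BasicConfigurator.Configure(appender);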

Log wiping. All log files should have strong access controls defined on them so that they cannot be deleted by an attacker. As one would expect, wiping out any traces of his or her malicious activity is a typical step in the attacker’s methodology. Storing the log files on a separate and hardened log server, as mentioned above, is therefore regarded as a good practice. However, even something like privilege separation, where the application identity does not by default have write access to the log file, can act as a risk mitigation.

Log bypass. Attackers will often try to bypass the logger, again as part of being stealthy. Typical log-bypass attacks attempt to flood the log so that the maximum log quota is reached. A number of systems will at that point stop logging entirely, thus allowing all future actions to go through without ever being logged. This would allow an attacker to perform malicious activity without that activity ever showing up in the audit trail. The same situation can arise, as described above, if the log service itself crashes. It is therefore useful to have a watchdog timer that automatically restarts the logging service as soon as it detects that the logger is not available.
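The watchdog itself can be simple. The sketch below assumes the logger runs as a Windows service with the hypothetical name "CentralLogService" and restarts it whenever a periodic check finds it stopped; the fact that it stopped at all is worth alerting on, since an attacker may have stopped it deliberately.

using System.ServiceProcess;
using System.Timers;

class LoggerWatchdog
{
    private readonly Timer _timer = new Timer(30000); // check every 30 seconds

    public void Start()
    {
        _timer.Elapsed += delegate
        {
            using (ServiceController logger = new ServiceController("CentralLogService"))
            {
                if (logger.Status == ServiceControllerStatus.Stopped)
                {
                    // Restart the logging service and, ideally, raise an alert.
                    logger.Start();
                }
            }
        };
        _timer.Start();
    }
}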

Log tampering. Attackers may also attempt to tamper with the contents of the log files, either to wipe out their activities or to create false or malicious log entries. Whereas log bypass, discussed above, prevents entries from being written at all, log tampering is typically performed by injecting malicious meta characters, such as carriage returns and line feeds or cross-site scripting characters, especially when the log is viewable in a browser as part of a Web application. This last threat can result in critical vulnerabilities, especially because the log files are typically viewed by administrative or power users.
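A common mitigation, sketched below, is to neutralize these meta characters in any attacker-influenced value before it is written; if logs are rendered in a browser, proper output encoding at display time is still required. LogSanitizer is a hypothetical helper name.

static class LogSanitizer
{
    // Neutralize log-injection meta characters in untrusted input before
    // the value is written to the log.
    public static string Clean(string input)
    {
        if (input == null)
        {
            return string.Empty;
        }
        return input.Replace("\r", "\\r")   // carriage return
                    .Replace("\n", "\\n")   // line feed: prevents forged log entries
                    .Replace("<", "&lt;")   // blunt the impact of script injection
                    .Replace(">", "&gt;");
    }
}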

Besides the threats described above, two other issues are critical—log overhead and log overload. Log overhead is unavoidable, and an effort must be made to minimize it. It represents the performance penalty paid as a result of using the logging subsystem. Many different strategies exist to optimize the logger. For instance, bear in mind that overheads tend to be more significant for smaller operations; hence it is important to avoid opening and closing file system handles or database connections for every log operation. Along similar lines, it is vital to batch log operations together. A caching strategy can help implement such batching. As with any delayed disk write, the risk always exists that the system might crash or be powered off before the cache has been flushed to disk. Hence the cache size and the buffering time must be carefully tuned for performance as well as security. Another common strategy is to make use of threads for performing the heavy disk operations asynchronously while letting the application continue to make progress. Finally, log levels can also be used to control the volume of information that will be written to the persistent data stores.
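The sketch below combines several of these strategies: application threads only enqueue a string, a single background task drains the queue and writes in batches, and the bounded buffer keeps memory in check. The class name, file path, and batch size are assumptions for illustration; the batch size in particular embodies the trade-off noted above between performance and how much data a crash could lose.

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Text;
using System.Threading.Tasks;

public class BufferedLogger : IDisposable
{
    private readonly BlockingCollection<string> _queue =
        new BlockingCollection<string>(10000);   // bounded: Add blocks rather than exhausting memory
    private readonly Task _writer;
    private readonly string _path;

    public BufferedLogger(string path)
    {
        _path = path;
        _writer = Task.Run(() => WriteLoop());   // heavy disk work happens off the application threads
    }

    public void Log(string entry)
    {
        _queue.Add(entry);                       // callers pay only for an in-memory enqueue
    }

    private void WriteLoop()
    {
        StringBuilder batch = new StringBuilder();
        foreach (string entry in _queue.GetConsumingEnumerable())
        {
            batch.AppendLine(entry);
            // Flush in batches; a crash loses at most the unflushed entries.
            if (batch.Length > 8192 || _queue.Count == 0)
            {
                File.AppendAllText(_path, batch.ToString());
                batch.Length = 0;
            }
        }
    }

    public void Dispose()
    {
        _queue.CompleteAdding();                 // stop accepting entries
        _writer.Wait();                          // drain and flush what remains
    }
}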

Logging best practices are useful to define. (See Table 2.) Some of these best practices are critical for the security of the logging subsystem or the application in general, while others are defense-in-depth strategies that will enhance the security of the system as a whole.

Auditing

One of the most important uses of log files from a security perspective is in forming the audit trail. An audit trail represents a record of a user’s activity as he or she uses the system. Consider the scenario where a user logs into his or her online banking account and transfers $100 from one account to another. An audit trail must be designed in this case to make it hard for that user to deny, after the fact, that he or she performed the transaction. This is just a simple example; in reality, there are many other events that should and would be logged in this case. In fact, for most systems we would like to go further and maintain an audit trail of when the system is restarted or when users are added and deleted—essentially all of the events mentioned in the What Should I Log? section above.

Essentially, an audit trail is intended to provide for accountability and non-repudiation, and both of these, as mentioned above, are valuable for their evidentiary value, among other things. Besides this, however, audit trails are also useful in identifying which parts of your system are most frequently used, for instance, or where the bottlenecks lie. Metrics can be gathered from the production system and then analyzed and used to optimize performance by tuning system parameters such as cache sizes and timeouts. For instance, one common argument against short and secure session timeouts (as described in our article on User and Session Management) is that most users will complain about having to log in a second time. An audit trail can be a good source of empirical data that shows whether this is indeed the case. For instance, with a particular setting for session timeouts, do most users time out, or do they explicitly log out? If the latter is true, perhaps the session timeouts can be tightened.

Audit trails are also often required as part of a compliance requirement. Two examples that come to mind are the Gramm-Leach-Bliley Act and California’s State Bill 1386. Having a strong audit trail is considered part of due diligence in maintaining the security of the assets for which the application is responsible. This can often save an organization from large fines and audits in the event that it does get compromised. An audit trail can be used to prove that the organization did everything reasonable in protecting itself and its customers.

As mentioned above, however, an audit trail is only valuable if it is reviewed periodically. As an organization or a team, it is therefore critical to define roles and responsibilities and a workflow for the various types of events—especially the ones that are significant from a security viewpoint. Additionally, thresholds should be defined and tuned over time as the team learns more about the system in production. For instance, consider repeated failed logins within a short time span. This is most likely a brute-force attack at work, and the operations staff might want to take remedial actions, such as investigating the origins of the attack, or perhaps warning users whose accounts seem as if they might have been compromised. Obviously such a system would need to account for the fact that users will periodically type in the wrong password, and the thresholds must therefore be set to avoid expensive false positives. The health monitoring feature introduced in .NET 2.0 can be extremely effective in helping to define such thresholds as well as in extending the basic events available by default to support application-specific events.
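A minimal sketch of such a threshold check appears below: it flags an account once more than five failed logins occur within a five-minute sliding window. The class name and the specific numbers are assumptions for illustration; in practice they would be tuned against real audit data, and the structure would need to be made thread-safe.

using System;
using System.Collections.Generic;

public class FailedLoginMonitor
{
    private static readonly TimeSpan Window = TimeSpan.FromMinutes(5);
    private const int Threshold = 5;

    private readonly Dictionary<string, Queue<DateTime>> _failures =
        new Dictionary<string, Queue<DateTime>>();

    // Returns true when the account has crossed the brute-force threshold.
    public bool RecordFailureAndCheck(string userId)
    {
        Queue<DateTime> attempts;
        if (!_failures.TryGetValue(userId, out attempts))
        {
            attempts = new Queue<DateTime>();
            _failures[userId] = attempts;
        }

        DateTime now = DateTime.UtcNow;
        attempts.Enqueue(now);

        // Drop attempts that have aged out of the window.
        while (attempts.Count > 0 && now - attempts.Peek() > Window)
        {
            attempts.Dequeue();
        }

        return attempts.Count > Threshold;
    }
}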

Log monitoring, as one would expect, can be done in two ways: manual or automated. Manual approaches have the advantage of high accuracy and focus. They also tend to be expensive because of the sheer cost in man-hours: most real-world applications generate thousands of log entries every hour, so the volume of data to be sifted through can become overwhelming, leaving staff with the task of finding the proverbial needle in a haystack. Techniques such as using multiple log files and log levels, both of which were discussed above, can be used to make this process more efficient. However, as with most human-driven tasks, it carries an inherent cost.

Automated analysis is becoming increasingly common, and the tools in this space are maturing. The major area of research for the tool makers centers on the elimination of false positives and false negatives. A false positive occurs when an event or alarm is triggered that, in reality, turns out to be just an error in the heuristics of the log-monitoring software. False positives are like the boy who cried wolf: They train the recipient of the alarms to ignore them until finally a true alarm does occur, with predictable consequences. False negatives, on the other hand, can in some ways be even more dangerous, since they prevent the organization from knowing when true events have taken place and let attacks go unnoticed, thus defeating one of the main reasons for an audit trail and log monitoring.

However, automated log analysis does have its advantages. For instance, most of the commercial systems available today have the ability to integrate with systems management software and can thus provide features such as callout trees, wherein an operational person can be called or paged in an automated manner when some threshold is reached. Further, such software also has the ability to define rich workflow scenarios that can take into account, for instance, average response times versus expected response times, staff being on vacation or otherwise inaccessible, and escalation paths.

In most cases, the automated systems described above tend to be real-time and use one of two approaches: They either attempt to detect attacks and other significant events through the use of signatures, or they do so by looking for patterns and anti-patterns (anomaly detection). Signature-based detection is obviously dependent on the signatures being updated and available from the vendor. Furthermore, it implicitly relies on the vendor’s capability to turn around signatures quickly and effectively to minimize both false positives and negatives. For instance, until a few years back it was relatively easy to bypass a popular intrusion detection system by using ' or 2 > 1;-- as opposed to the canonical ' or 1=1;--. This represents the inherent weakness of signature-based systems. Anomaly systems, on the other hand, typically require a little more hand-holding, especially during the initial deployment. The aim here is to train the software to identify regular usage patterns and therefore to identify deviation from such patterns. When such deviation occurs, the monitoring system can trigger an alarm that warns the operational staff about a potential problem. Obviously, the false positive rates on such systems tend to be fairly high initially; however, once they are in full production mode, anomaly detection systems can be quite effective.
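A toy example makes the brittleness concrete: the hypothetical signature below matches the canonical probe but not the trivially rewritten variant, which is exactly the kind of gap vendors must keep closing.

using System;
using System.Text.RegularExpressions;

// A naive SQL-injection signature and a variant that slips past it.
Regex naiveSignature = new Regex(@"'\s*or\s+1\s*=\s*1", RegexOptions.IgnoreCase);

Console.WriteLine(naiveSignature.IsMatch("' or 1=1;--"));   // True: detected
Console.WriteLine(naiveSignature.IsMatch("' or 2 > 1;--")); // False: bypassed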

In practice, a commonly used third option is a semi-automated approach. Typically this involves tasking a specific individual with performing log analysis, but doing so with tools such as log parsers and analyzers that can convert the raw log into a form that is more easily readable by humans and that allows for the use of post-processing techniques such as trend analysis. Another approach that falls into this category, but requires some effort from the development team, is to build a custom, lightweight “intrusion detection system” for the application. (The authors created one of these as a proof of concept, called Validator.NET.) Such subsystems tend to be more effective, since they are far more intertwined with the context and business logic of the underlying application than an external system or piece of software would be. This approach is much less likely to result in false positives or negatives. In fact, development teams and organizations can extend the Validator.NET project with minimal effort, primarily in a declarative and configuration-driven fashion. Implementing such a subsystem for individual applications can be done quickly and efficiently.

Conclusion

Logging and auditing are another category of issues that you don’t miss until you need them. Unfortunately they are also a category of issues that are hard to introduce in hindsight and thus must be considered from day one while thinking about system requirements and design. In our experience, while most applications do perform some basic logging, they often don’t consider its audit-trail aspects and rarely if ever engage in any kind of log monitoring, leaving the logs as purely a debugging tool rather than an aid that can help in thwarting attacks and preventing future ones.

Summary

In this, the last of our articles on the security frame, we have covered the important category of logging and auditing. It is important to consider the key benefits that logging, followed by auditing, brings to the table. Under this wide umbrella it is also important to bear in mind, from a security perspective, which events must be logged and which data elements should never be logged. Further, as you design and build out your applications, consider the various options available for performing these critical security functions, both in terms of logging frameworks and log sinks and in terms of auditing strategies and tools. Finally, a great deal of knowledge has been gained over the years and distilled into best practices for tuning and optimizing logging subsystems—not only to improve performance and decrease overhead, but also to reduce and eliminate false positives and false negatives.
