The following post on SOC metrics is adapted from the book, “Elements of Security Operations,” a guide to building and optimizing effective and scalable security operations. Download a free copy today.
Some metrics that security operations centers (SOCs) widely use to evaluate their performance have the potential to drive poor behavior.
One example is mean time to resolution (MTTR). This is a fine metric when used in a network operations center (where uptime is key) but it can be detrimental when used in a SOC. Holding security analysts accountable for MTTR incentivizes them to rush to close incidents rather than rewarding full investigations that feed learning back into the controls to prevent future attacks. Similarly, ranking performance by number of incidents handled may lead to analysts “cherry picking” incidents that they know are fast to resolve. This will not produce better outcomes or reduced risk for the business.
Another poor metric is the number of firewall rules deployed. An organization can have 10,000 firewall rules in place, but if the first rule matches all traffic (e.g., any-any), then the rules after it never take effect and are useless. Counting the number of data feeds into a security information and event management (SIEM) platform has the same flaw. If there are 15 data feeds into a SIEM but only one use case, then the data feeds aren’t being utilized and are a potentially expensive waste.
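To make the firewall example concrete, here is a minimal sketch that assumes first-match rule evaluation, which is common but not universal. The `Rule` class, the rule names and the sample packets are all hypothetical and do not model any specific firewall product. It shows that a rule count says nothing about how many rules ever take effect.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Rule:
    name: str
    source: str       # CIDR; "0.0.0.0/0" stands in for "any"
    destination: str  # CIDR; "0.0.0.0/0" stands in for "any"
    action: str       # "allow" or "deny"


def first_match(rules, src, dst):
    """Return the first rule whose source and destination both cover the packet."""
    for rule in rules:
        if (ip_address(src) in ip_network(rule.source)
                and ip_address(dst) in ip_network(rule.destination)):
            return rule
    return None


def rules_hit(rules, packets):
    """Names of rules that matched at least one packet under first-match evaluation."""
    hit = set()
    for src, dst in packets:
        rule = first_match(rules, src, dst)
        if rule is not None:
            hit.add(rule.name)
    return hit


# Hypothetical rule base: an any-any allow rule sits above two deny rules.
rules = [
    Rule("any-any", "0.0.0.0/0", "0.0.0.0/0", "allow"),
    Rule("block-db", "0.0.0.0/0", "10.0.5.0/24", "deny"),
    Rule("block-guest", "192.168.50.0/24", "0.0.0.0/0", "deny"),
]
packets = [
    ("192.168.50.7", "10.0.5.12"),
    ("203.0.113.9", "10.0.5.12"),
    ("10.0.1.4", "8.8.8.8"),
]

hit = rules_hit(rules, packets)
print(f"{len(rules)} rules deployed, {len(hit)} ever matched: {sorted(hit)}")
```

In this sample, every packet is allowed by the first rule, so the two deny rules never match. A count of rules that actually match traffic is closer to what the business cares about than the number of rules deployed.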
SOC Metrics That Matter
When determining good metrics for your business, always keep in mind the mission of the SOC and the value it provides to the business. The business wants confidence that the SOC can prevent attacks and that if/when a breach does occur, then the team is able to handle it quickly, limit the impact and learn from it. Good metrics should provide insight into whether the business should have confidence or not. There are two types of confidence to focus on: configuration confidence and operational confidence.
Configuration confidence is knowing that your technology is properly configured to prevent an attack, that you can automatically remediate it and/or that the proper intelligence can be gathered for analysis by a human. Example questions to answer are:
Are the security controls running? Oftentimes a “temporary” change is made to controls and is inadvertently left in place. A developer may need a specific port to be opened to perform a test and that port remains open after the test is completed, providing an access point for an attack.
How many changes are occurring outside of the change control policy? The change control policy should be followed for every change, without exception. Any deviation from the defined process should be noted, as it is relevant to the business’s confidence in the configuration of security controls.
Are the technologies in place configured to best practices? Once a technology is in place, it is rarely a “set it and forget it” situation. Care must be taken to continually evaluate the configuration against best practices. If measured adherence to best practices is low, this can drive a plan to increase it. If adherence drops over time, then it is worth investigating why.
What percent of features and capabilities are being utilized? The plethora of security technologies is overwhelming security operations. Many of these technologies are poorly utilized, which gives the business a false understanding of the actual coverage in place. It can also lead to the purchasing of duplicate features, which exacerbates the issue of too many technologies. Measuring the percentage of feature use can give the business a simple understanding of the actual value being provided by tools vs. the perceived value. For example, what percentage of network traffic is visible to analysts? Estimates state that 70-80% of traffic is encrypted. The business should know how much traffic is being analyzed in the SOC and whether SSL decryption technology is being used.
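The configuration confidence questions above can each be reduced to a simple percentage. The following is a rough sketch only: the `ConfigSnapshot` fields, the metric names and the sample numbers are hypothetical, and how each count is collected will depend on the tools in place.

```python
from dataclasses import dataclass


@dataclass
class ConfigSnapshot:
    """Hypothetical counts gathered for one reporting period."""
    changes_total: int
    changes_without_approval: int  # changes made outside change control
    features_licensed: int
    features_enabled: int
    bytes_total: int
    bytes_inspected: int           # traffic analysts can actually see


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal; 0.0 when there is nothing to measure."""
    return round(100.0 * part / whole, 1) if whole else 0.0


def configuration_confidence(s: ConfigSnapshot) -> dict:
    return {
        "out_of_policy_change_pct": pct(s.changes_without_approval, s.changes_total),
        "feature_utilization_pct": pct(s.features_enabled, s.features_licensed),
        "traffic_visibility_pct": pct(s.bytes_inspected, s.bytes_total),
    }


# Made-up sample values for illustration.
snapshot = ConfigSnapshot(
    changes_total=40,
    changes_without_approval=6,
    features_licensed=25,
    features_enabled=10,
    bytes_total=1_000_000,
    bytes_inspected=250_000,
)
report = configuration_confidence(snapshot)
for name, value in report.items():
    print(f"{name}: {value}%")
```

Tracking these values over time, rather than as a single reading, is what shows whether adherence is improving or drifting.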
Operational confidence is knowing that the right people and processes are in place to handle a breach if/when it occurs. Example questions to answer are:
How many events are analysts handling per hour? This is known as events per analyst hour (EPAH). A reasonable EPAH is 8-13. If the EPAH is too high, say 100, then this indicates that analysts are overwhelmed. They will rush investigations, ignore events and not be set up to properly protect the business. Also, note that it is important to measure per hour and not per day: an analyst’s tasks should shift throughout the day and shift lengths can vary, both of which skew a per-day number. This metric should not be gathered to compare employees but rather to show the effectiveness of an entire security operations organization.
Are there repeat incidents flowing into the SOC? If threats are properly investigated, then the outcome should feed back into a centralized set of controls that synchronize your various tools for future protection. Repeat incidents flowing into a SOC indicate a failure in this feedback and sync of controls.
Is the SOC handling alerts for known threats? This also indicates a failure in the controls because all known threats should be blocked prior to affecting the business and being passed to the SOC to investigate.
How often are there deviations in SOC procedures? This metric can indicate the need for employee training on the procedures. It may also illuminate out-of-date procedures that need to be updated.
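Two of the operational questions above lend themselves to simple arithmetic. The sketch below computes EPAH for a whole team and a repeat incident rate. The 8-13 range comes from the text; the function names, the "signature" used to recognize a repeat incident and the sample numbers are assumptions for illustration.

```python
from collections import Counter


def epah(events_handled: int, analyst_hours: float) -> float:
    """Events per analyst hour for the whole team, not for individuals."""
    if analyst_hours <= 0:
        raise ValueError("analyst_hours must be positive")
    return events_handled / analyst_hours


def epah_status(value: float, low: float = 8.0, high: float = 13.0) -> str:
    """Compare EPAH with the 8-13 range quoted in the text."""
    if value > high:
        return "above range"
    if value < low:
        return "below range"
    return "within range"


def repeat_incident_rate(incident_signatures) -> float:
    """Fraction of incidents whose signature had already been seen before."""
    if not incident_signatures:
        return 0.0
    counts = Counter(incident_signatures)
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(incident_signatures)


# Hypothetical day: 5 analysts spend 8 hours each on event handling.
analyst_hours = 5 * 8
normal = epah(420, analyst_hours)
overloaded = epah(4000, analyst_hours)
print(normal, epah_status(normal))
print(overloaded, epah_status(overloaded))

# Hypothetical signatures: one label per incident, equal labels mean a repeat.
signatures = ["phish-A", "malware-B", "phish-A", "phish-A", "scan-C"]
print(repeat_incident_rate(signatures))
```

Because the inputs are team totals, the result describes the security operations organization as a whole and avoids ranking individual analysts.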
Metrics should be used to improve protections and provide confidence to the business that the security operations organization is executing on its mission – which requires measuring quality, not just volume. Each metric has specific and limited value; no one metric tells the whole story, but together, they can help drive continued improvement and confidence that the business is properly set up to prevent and contain a breach. To learn more best practices for building effective security operations, download a free copy of our book, “Elements of Security Operations.”