Security Analytics Tools Need Structured Data

“How come my security analytics tools don’t work?”

I have been practicing cybersecurity for more than a few years now, and over that time I have learned quite a bit about working efficiently and making the biggest impact with the least time and effort.

Less work and more results is something we all aspire to. I am, however, still practicing.

One thing I keep circling back to is a question I hear far too often as I work with customers both large and small: “How come my security analytics tool (pick a vendor) did not find that?” or “How come tool XYZ missed that?” I believe that most of the security tools we know, love, hate and use every day to dig into our data work pretty well when we implement them properly.

I’m not talking about how big the server is, nor am I talking about cloud, on-prem or hybrid. I’m also not focused on how efficient or fast a tool might be, or on its data storage techniques. Those aspects are all important when choosing any given tool, but they’re not the issue I want to tune into here, the one I see every single day on every project.

I’m talking specifically about how we feed these tools data to get the answers we need and find the needles in the haystack.

Every tool I have deployed and used was built on the assumption that the data would be structured, or at least look the same as it did when the code was written. Unfortunately, most of the log data I’ve seen is far from structured.

Sadly, the variation between log file formats from even a single vendor is staggering. One vendor I recall that made proxy servers had 14 different log file formats for the same exact product line, which is crazy to deal with in the real world of security analytics.

I’m not sure how this happened. It could’ve been a copyright issue, vendor lock-in or some other bad reason. In the end, we’ve created a monster in our own industry that makes it hard to actually do our jobs as security practitioners. Field names and values: it’s still unclear to me why these differ from product to product and why we have not all come together to force product vendors to standardize on a naming convention and log format.

A typical company ends up with around 50 security tools, all running fine, but none of the logs look the same or deliver the same results. It takes a superhuman effort to track anything across the estate and see the real underlying bad actor.

To fix this, I take a step back and consider what we’re trying to accomplish for any given project or tool. Which questions are we trying to answer? Are we using AI to find unknown behavior, or are we leveraging MITRE TTPs to drill into the data and look for things we already know?

I like getting the largest impact from the least effort. To me, that means getting data from any tool or any device into the same format so I can read it easily.

Security needs structured data, full stop.

All of the tools we use to analyze logs ship with a library of searches, techniques and content that is already tested, functional and delivers valid results. ML models are embedded in these tools as well, some of which we can’t even see, read or change, and they all rely on structured data.

We are asking thousands of questions of multiple data sources, sometimes hundreds of them, to identify bad actors. For the most part, the data is disorganized or incomplete, or whole sections of the organization or network are missing from it entirely.

This comes down to a few key things that we can resolve, but we must start at the top and work our way toward the common goal of enabling the best protection possible within any given architecture. 

I might suggest a different approach:

• Drive the security analytics project from the top and establish a cross-departmental team of data owners who are accountable for delivering the logs, fully tested, into the program. Consider a team-driven approach with data owners, data parsers, compliance, test and validation, and search testers. Create a top-down “board” that can meet for 15 minutes per week.

• Transform all of your log data into a standardized format so the team can ask the same question of many different sources of data. Normalization conventions are typically defined by the vendors, and it’s a matter of parsing, renaming and sorting all of the data to match the chosen naming convention; a minimal sketch of this kind of field mapping follows below. This can help reduce false positives and make future detections and analyses faster, smoother and less error-prone.

Based on my honest observation over many years, I believe 95% of the false positive issues and claims are due purely to bad data.
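To make the normalization point concrete, here is a minimal sketch in Python. The source names, raw field names and normalized names (src_ip, url, action, user) are all hypothetical, stand-ins for whatever naming convention your team agrees on, not any specific vendor’s or framework’s schema.

```python
import json

# Hypothetical per-source field mappings: raw vendor field name -> agreed normalized name.
FIELD_MAPS = {
    "proxy_a": {"c-ip": "src_ip", "cs-host": "url", "s-action": "action", "cs-username": "user"},
    "proxy_b": {"clientip": "src_ip", "request_url": "url", "verdict": "action", "username": "user"},
}

# Values need the same treatment as names: the same verdict is spelled several ways across products.
ACTION_VALUES = {"denied": "blocked", "blocked": "blocked", "observed": "allowed", "allowed": "allowed"}

def normalize(source: str, raw_event: dict) -> dict:
    """Rename fields to the agreed convention and standardize the action value."""
    field_map = FIELD_MAPS[source]
    # Unknown fields keep their original names so nothing is silently dropped.
    event = {field_map.get(key, key): value for key, value in raw_event.items()}
    if "action" in event:
        event["action"] = ACTION_VALUES.get(str(event["action"]).lower(), event["action"])
    return event

# Two differently shaped raw events describing the same activity.
event_a = {"c-ip": "10.1.2.3", "cs-host": "example.com", "s-action": "DENIED", "cs-username": "alice"}
event_b = {"clientip": "10.1.2.3", "request_url": "example.com", "verdict": "blocked", "username": "alice"}

for source, event in (("proxy_a", event_a), ("proxy_b", event_b)):
    print(json.dumps(normalize(source, event), sort_keys=True))
```

Once both feeds land in that shape, a single search on, say, src_ip plus action equal to “blocked” works across every source instead of being rewritten per vendor.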

• Test, validate and test again to make sure the logs coming in are perfectly aligned with your chosen data structure and format. Did I mention testing? Are the field names and values properly aligned? Do the searches now work? Do the same types of data have the same format? (A rough validation sketch follows the acceptance step below.)

The last test I always try to get into place is a customer acceptance phase. In many instances, the customer is the NOC/SOC or business analytics team. Have the customers test a specific data source or a combined dataset. Does it deliver the level of results they require? 
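As a rough illustration of that validation step, here is a small Python check that assumes the same hypothetical schema as the normalization sketch above; the required fields and allowed values are examples, not a standard.

```python
import ipaddress

# Illustrative schema checks against the hypothetical normalized field names used earlier.
REQUIRED_FIELDS = {"src_ip", "url", "action", "user"}
ALLOWED_ACTIONS = {"allowed", "blocked"}

def validate(event: dict) -> list[str]:
    """Return a list of problems; an empty list means the event passes."""
    problems = []
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if "src_ip" in event:
        try:
            ipaddress.ip_address(event["src_ip"])
        except ValueError:
            problems.append(f"src_ip is not a valid IP: {event['src_ip']!r}")
    if "action" in event and event["action"] not in ALLOWED_ACTIONS:
        problems.append(f"unexpected action value: {event['action']!r}")
    return problems

# Run a sample of each new feed through the checks before it goes live.
sample = [
    {"src_ip": "10.1.2.3", "url": "example.com", "action": "blocked", "user": "alice"},
    {"src_ip": "not-an-ip", "url": "example.com", "action": "DENIED"},
]
for event in sample:
    print(validate(event) or "ok")
```

Running a sample of each new feed through checks like these before it goes live catches misaligned fields and values while they are still cheap to fix.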

If you keep these three key steps or phases in mind for each project, each new device and each log file, you can keep adding new data from different sources and vendors, and at different stages, with a programmatic approach, and end up with a useful analytics engine that you can rely on for years to come.