Amazon Security Lake: Centralized Data Management for Modern DevSecOps Toolchains

By David Melamed

Updated November 5, 2024


AWS made Amazon Security Lake generally available in May 2023. Positioned as the heir to AWS CloudTrail Lake, it is a new data lake that extends many of the capabilities CloudTrail Lake offers for security management: more sources and services, plus richer analysis and transformation. When researching this service, which is steadily gaining adoption, I stumbled upon the roundup below, which provides a good comparison between the two services. In this post, I’d like to dive into Amazon Security Lake’s capabilities, explain why it is an excellent new service for powering up security engineering in AWS-based operations, and wrap up with a useful example of how to get started.

Why Do We Need Another Data Lake?

If we look at the current AWS service catalog, there are quite a number of data sources we leverage on a day-to-day basis to power our cloud operations: S3, CloudTrail, Route 53, VPC, AWS Lambda, and Security Hub, as well as third-party tooling and services. Each of these sources uses its own proprietary format and fields. Normalizing this data makes it possible to build additional capabilities on top, such as dashboarding and automation, which are becoming increasingly important for security management and visibility.

This is something we learned early on when building our own DevSecOps platform that ingests data from multiple tools and then visualizes the output in a unified dashboard. Every vendor and tool has its own syntax and proprietary data format. When looking to apply product security in a uniform way, one of the first challenges we encountered was how to normalize and align the data from several best-of-breed tools into a single schema, source and platform.


Our cloud operations today are facing the same challenge. The question is: how do we do this at scale?

This is exactly the problem the security data lake sets out to solve.

Amazon Security Lake is a unification service that ingests logs and data from myriad sources, whether native AWS services, integrated SaaS products, internal homegrown custom sources, or even on-prem systems. It normalizes these sources’ output into the OCSF (Open Cybersecurity Schema Framework) schema, which is the backbone of Amazon Security Lake, and stores the result as Parquet files in S3. Security Hub findings, for example, arrive in the unified format called ASFF (AWS Security Finding Format) and are converted to OCSF on the way in.

AWS is betting heavily on OCSF, an open source framework that AWS co-founded with Splunk and other security vendors, that is built upon Symantec’s ICD Schema, and that AWS contributes to significantly. OCSF provides a vendor-agnostic, unified schema for security data, so that events and findings from different tools can be stored, queried, and correlated in the same way.
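
To make the normalization idea more concrete, here is a minimal Python sketch of what mapping a Security Hub finding (ASFF) onto a handful of OCSF Security Finding attributes might look like. This is an illustration under assumptions, not the mapping Security Lake actually performs: the function name `asff_to_ocsf` and the choice of fields are hypothetical, and the real OCSF class carries many more attributes.

```python
from datetime import datetime

# OCSF severity_id values keyed by the ASFF Severity.Label they correspond to.
SEVERITY_IDS = {
    "INFORMATIONAL": 1,
    "LOW": 2,
    "MEDIUM": 3,
    "HIGH": 4,
    "CRITICAL": 5,
}


def asff_to_ocsf(finding: dict) -> dict:
    """Map a few ASFF fields onto OCSF Security Finding attributes (simplified)."""
    # ASFF timestamps are ISO 8601 strings; OCSF `time` is epoch milliseconds.
    created = datetime.fromisoformat(finding["CreatedAt"].replace("Z", "+00:00"))
    label = finding["Severity"]["Label"]
    return {
        "category_uid": 2,    # Findings category
        "class_uid": 2001,    # Security Finding class
        "activity_id": 1,     # Create
        "type_uid": 200101,   # class_uid * 100 + activity_id
        "time": int(created.timestamp() * 1000),
        "severity": label.capitalize(),
        "severity_id": SEVERITY_IDS.get(label, 0),  # 0 means Unknown
        "finding": {
            "uid": finding["Id"],
            "title": finding["Title"],
            "desc": finding["Description"],
        },
        "cloud": {"provider": "AWS", "region": finding.get("Region")},
    }
```

The point is less the specific fields than the shape of the work: every source needs a small, explicit mapping like this, and once it exists, everything downstream can rely on one schema.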

Getting Started: Security Data Lake in Action

Once the data is normalized and unified into the OCSF schema (which can be achieved by leveraging an ETL service like Glue), it is partitioned and stored in the Parquet format in S3, where any number of AWS services can be leveraged for additional data enrichment. These include Athena for querying the data, OpenSearch for search and visualization capabilities, and even tools like SageMaker for machine learning to detect patterns and anomalies.
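
As a rough illustration of what "partitioned" means here, the sketch below builds the kind of S3 key prefix a custom source's Parquet files would land under. The layout (`ext/<source>/region=.../accountId=.../eventDay=...`) reflects my understanding of the documented convention for custom sources, so treat it as an assumption and verify it against the current Security Lake documentation. The function name `partition_prefix` is hypothetical.

```python
from datetime import datetime, timezone


def partition_prefix(source_name: str, region: str, account_id: str,
                     event_time: datetime) -> str:
    """Build an S3 key prefix partitioned by region, account, and event day."""
    # Partition by the UTC calendar day on which the event occurred.
    event_day = event_time.astimezone(timezone.utc).strftime("%Y%m%d")
    return (
        f"ext/{source_name}/"
        f"region={region}/"
        f"accountId={account_id}/"
        f"eventDay={event_day}/"
    )
```

Because the partition values are encoded in the path, a query engine like Athena can skip whole prefixes when a query filters on region, account, or day.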

You can even bring your own analytics and BI tools for deeper analysis of the data. Because Parquet is a flexible, column-based format, a query reads only the columns it needs rather than loading entire records into memory. This keeps queries economical and makes it practical to connect analytics and BI tools as subscribers on top of the lake. (A caveat on cost: beyond what Security Lake itself charges for ingesting and normalizing data, you will pay on a consumption basis for the rest of the AWS tooling, such as S3, Glue, Athena, and SageMaker.)

Another important benefit is compliance monitoring and reporting on a global scale. Organizations with many engineering groups, accounts, and regions do not have to configure the service 50 separate times, once per account. Instead, they can configure it a single time by creating a rollup region. This means you can roll up all of your global organizational data into a single ingestion feed into your security data lake.


What is unique is that once the data is partitioned and stored in this format, it becomes easily queryable and reusable for many data enrichment purposes. Security Lake essentially makes it possible to centralize security data at scale at both the source level and the infrastructure level: from your own cloud workloads and data sources, custom and on-prem resources, and SaaS providers, across multiple regions and accounts.

As a strategic new service for AWS, it launched with 50+ out-of-the-box integrations from many security vendors, including Cisco, Palo Alto Networks, CrowdStrike, and others, to help support its adoption and applicability to real engineering stacks.

A DevSecOps application of the Security Data Lake

To understand how you can truly harness the power of Amazon Security Lake, we’d like to walk through a short example that captures what this security lake makes possible (and it is really only the tip of the iceberg).

In this example, we’ll demonstrate how to use the Security Data Lake with one of the most popular security tools, Gitleaks, for secret detection. We will use GitHub Actions to add Gitleaks to our CI/CD pipeline.

Once our CI/CD pipeline runs, it sends the findings to Security Hub, which is auto-configured to forward data to our security lake. The data is stored in an S3 bucket, and the Glue ETL service is leveraged to transform the ingested data from the ASFF format into the OCSF schema. A Glue crawler monitors the S3 bucket, and the transformed data is registered in the Glue Data Catalog, which holds the database schema. The data is then queryable via Athena to extract important information, such as secrets detected in certain workloads.

[Diagram: demo of ingesting custom security findings]

The Repo

This repo contains a simple Gitleaks example, including sample secrets to detect, to demo how Gitleaks works and how it sends its findings to Security Hub.

[Screenshot: the Gitleaks to Security Hub repo]

Configuring Gitleaks

Next, we configure Gitleaks to send the detected secrets to AWS Security Hub.

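
Since the original configuration is only available as a screenshot, here is a minimal Python sketch of what this step could look like in a CI job: run Gitleaks with a JSON report, then push findings to Security Hub in batches. The helper names are hypothetical, the findings passed to `send_to_security_hub` are assumed to already be in ASFF, and newer Gitleaks releases may prefer other subcommands over `detect`. This is not the repo's actual code.

```python
import json
import subprocess
from typing import Iterator, List


def gitleaks_command(repo_path: str, report_path: str) -> List[str]:
    """Build a Gitleaks invocation that writes a JSON report."""
    return [
        "gitleaks", "detect",
        "--source", repo_path,
        "--report-format", "json",
        "--report-path", report_path,
        "--exit-code", "0",  # do not fail the step just because leaks were found
    ]


def run_gitleaks(repo_path: str, report_path: str) -> List[dict]:
    """Run Gitleaks and return the parsed list of detected leaks."""
    subprocess.run(gitleaks_command(repo_path, report_path), check=True)
    with open(report_path) as handle:
        return json.load(handle)


def batches(items: List[dict], size: int = 100) -> Iterator[List[dict]]:
    """Yield chunks of at most `size` items (BatchImportFindings caps at 100)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def send_to_security_hub(findings: List[dict], region: str) -> int:
    """Import ASFF findings into Security Hub; return how many were rejected."""
    import boto3  # imported lazily so the helpers above work without boto3

    client = boto3.client("securityhub", region_name=region)
    failed = 0
    for batch in batches(findings):
        response = client.batch_import_findings(Findings=batch)
        failed += response["FailedCount"]
    return failed
```

In a GitHub Actions workflow, a script like this would run as a step after checkout, with AWS credentials supplied to the job.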

The Security Hub Schema

The Security Hub schema is configurable with simple Python code:

[Screenshot: Python code]
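
For readers who cannot see the screenshot, the following is a minimal sketch of how a single Gitleaks result might be turned into an ASFF finding. It is a reconstruction under assumptions rather than the repo's actual code: the function name `gitleaks_to_asff`, the `Id` scheme, the chosen `Types` value, and the decision to mark every leak as `HIGH` are all mine. The input keys (`RuleID`, `Description`, `File`, `StartLine`, `Fingerprint`) are fields from the Gitleaks JSON report.

```python
from datetime import datetime, timezone
from typing import Optional


def gitleaks_to_asff(leak: dict, account_id: str, region: str, repo: str,
                     observed_at: Optional[datetime] = None) -> dict:
    """Convert one Gitleaks report entry into an ASFF finding (simplified)."""
    observed_at = observed_at or datetime.now(timezone.utc)
    timestamp = observed_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "SchemaVersion": "2018-10-08",
        # Hypothetical Id scheme: stable per repo and leak fingerprint.
        "Id": f"gitleaks/{repo}/{leak['Fingerprint']}",
        # Default product ARN used for custom integrations in an account.
        "ProductArn": (
            f"arn:aws:securityhub:{region}:{account_id}:"
            f"product/{account_id}/default"
        ),
        "GeneratorId": f"gitleaks/{leak['RuleID']}",
        "AwsAccountId": account_id,
        "Types": ["Sensitive Data Identifications/Passwords"],
        "CreatedAt": timestamp,
        "UpdatedAt": timestamp,
        "Severity": {"Label": "HIGH"},  # assumption: treat every leak as high
        "Title": f"Secret detected: {leak['RuleID']}",
        # Deliberately omit the secret value itself from the finding.
        "Description": (
            f"{leak['Description']} in {leak['File']} "
            f"at line {leak['StartLine']}"
        ),
        "Resources": [
            {"Type": "Other", "Id": f"{repo}/{leak['File']}", "Region": region}
        ],
    }
```

Keeping the secret value out of the finding is a deliberate choice in this sketch, since Security Hub findings are visible to a much wider audience than the repository itself.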

Detected secrets in action:

[Screenshot: detected secrets]

You can then navigate to Security Hub and see the findings there:

[Screenshots: findings in Security Hub]

While useful for visualization and for confirming that our configurations are working as expected, the queries available in Security Hub are basic, and it’s not possible to enrich the data. We want to know whether this secret, in the context of our own systems, is even interesting and needs to be prioritized for remediation.

Let’s navigate to the Security Lake.

In our Security Lake, it’s possible to see all of the configured sources:

[Screenshot: configured sources in Security Lake]

Once in our Security Lake, we can search for the Athena service and find our data source.


After locating our data source, we can see all of the tables we are able to query. Each data source has its own table.

We then run our query to find high-severity secrets in a specific region.
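
The exact query is not reproduced in the text, so here is a hedged sketch of what one might look like, built as a string in Python. The database and table names are placeholders you would copy from the Athena console, and the column names (`region`, `accountid`, `severity`, `severity_id`, `finding.title`) are assumptions based on the OCSF-style Security Hub findings table. Newer table versions may name some of these differently, so check your table's schema first.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z0-9_]+")
_REGION = re.compile(r"[a-z0-9-]+")


def high_severity_query(database: str, table: str, region: str,
                        limit: int = 50) -> str:
    """Build an Athena query for high-severity findings in one region."""
    # Validate inputs, since they are interpolated directly into the SQL text.
    for name in (database, table):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    if not _REGION.fullmatch(region):
        raise ValueError(f"unexpected region: {region!r}")
    return (
        "SELECT accountid, severity, finding.title AS title\n"
        f'FROM "{database}"."{table}"\n'
        f"WHERE region = '{region}'\n"
        "  AND severity_id >= 4\n"  # 4 = High, 5 = Critical in OCSF
        f"LIMIT {int(limit)}"
    )
```

The resulting string can be pasted into the Athena query editor, or submitted programmatically through the Athena client's `start_query_execution` call in boto3.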

And we can see the resulting output:

[Screenshot: query results]

With the data sources now available in a single queryable location, cloud workload data alongside DevSecOps toolchains, it’s now possible to run complex queries on everything from IP reputation to severity. Given the many findings and alerts our tooling outputs today, we can narrow them down to the ones that matter in context and prioritize remediation.

Why Security Data Lake is Exciting

The Security Data Lake is set to help with security data format heterogeneity. With a single, unified standard, it becomes easier for developers to leverage, enrich, and build upon this data, as well as to test and launch services.

By providing a scalable solution for both the data sources and the global resource coverage, engineering organizations can apply data enrichment capabilities across services, tooling, and regions, providing greater context and correlation of security findings. Together, these simplify compliance monitoring and reporting, programmability, and automation, making for more resilient and robust DevSecOps programs.