CI/CD environments and processes are increasingly becoming a key area of focus for both hackers and — consequently — for defenders.
At Prisma Cloud, we believe that visualizing these environments and modeling the relationships between the different objects in them is a key guiding principle for anyone interested in building the necessary capabilities to protect CI/CD ecosystems.
Having spent the past year engineering a solution that represents CI/CD in graphs, I decided to take the opportunity to share some insights from that elaborate task.
Attackers are continuously focusing on hacking CI/CD systems. 2021 alone brought us — among many others — the Codecov breach, the npm coa/rc/ua-parser hijackings and the recent and highly disruptive Log4j vulnerability.
Once an attacker gains a foothold within an organization’s CI/CD ecosystem, they will look to advance their position within, en route to eventually finding and exfiltrating valuable assets. To move within the ecosystem — for example, from source control management (SCM) to continuous integration (CI), or CI to an artifact repository — the attacker will strive to perceive a graph representation of the ecosystem, looking for the relationships between the different systems.
From the attacker’s perspective, the ecosystem is not a group of isolated, siloed systems but a graph of highly interconnected ones. This contrasts with the traditional defender’s list mindset, in which systems and assets are secured individually, each in its own silo.
It’s my experience that developers and security teams share similar concerns about the role CI/CD plays in the organization’s attack surface. I vividly remember a moment at one of my previous companies when I became concerned about the inadequate visibility we had into our CI/CD ecosystem. Ultimately, we had to deploy pipelines that had access to, and credentials for, the most sensitive environments storing our customers’ most valuable data.
Even though this was the reality, I did not feel we understood the blast radius of a security breach in some of these systems, or whether, ultimately, the assets our customers were entrusting to us were safe.
At Prisma Cloud, we understand that to effectively and holistically secure CI/CD ecosystems, we must match the attacker’s perspective and think in graphs rather than lists.
Making this decision is the trivial part — modeling and engineering are where the challenge comes in.
CI/CD ecosystems are complex, typically consisting of multiple systems, where each system is often complex by its own merit. In addition, each type of system is usually offered by several different vendors — with both overlapping and distinct features and properties — and each system can often integrate with several other systems by more than one means.
Here’s a simplified overview of a modern CI/CD ecosystem for demonstration:
Source control management: A GitHub organization will consist of numerous users collaborating on multiple repositories, sending webhook events on a variety of triggers, with endless branches and commits correlating to various branch protection rules and conditions — and that’s just the tip of the iceberg.
Dependency/artifact management: Organizations typically use multiple package registries simultaneously — for example, npm/JFrog — potentially containing millions of packages each, where each package nests its own dependency tree.
Integration: Jenkins pipelines can be executed by both pull and push, and can be defined inline or by manifests located in arbitrary repositories and branches. Jenkins’ marketplace consists of more than a thousand plugins that can extend or alter pipeline behavior.
Deployment: The OCI distribution specification helps standardize the distribution of container images. However, there are multiple overlay vendors with different authentication schemes — for example, ECR, GCR, DockerHub and others.
Production: Kubernetes consists of around 10 baseline objects of varying complexity but is also highly extendable. Understanding the structure of the underlying infrastructure — whether it’s cloud or on-premises infrastructure — is important for a comprehensive view of the production environment.
The above is a partial and low-resolution overview of a basic CI/CD ecosystem, but it does demonstrate the key resource families that can be modeled in a CI/CD graph, and I’ll refer to them throughout.
Here’s a more in-depth view of some of the interesting considerations and challenges we encountered at Prisma Cloud while modeling and implementing CI/CD graphs.
Some intrasystem relationships become apparent by simple observation. For example, an SCM repository that contains an SCM branch is easily observable, and a simple GraphQL response transformation can generate the appropriate nodes and edges.
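To make the simple case concrete, here’s a minimal sketch of that kind of transformation. The response shape below is illustrative — a stand-in for a GitHub-style GraphQL reply, not the exact schema — and the node/edge dictionaries are one arbitrary choice of representation:

```python
# Sketch: turning a (hypothetical) GraphQL response for a repository and its
# branches into graph nodes and CONTAINS edges. Field names are illustrative.
def transform_repo_response(response):
    repo = response["data"]["repository"]
    repo_id = f"repo:{repo['name']}"
    nodes = [{"id": repo_id, "type": "scm_repository"}]
    edges = []
    for ref in repo["refs"]["nodes"]:
        branch_id = f"branch:{repo['name']}/{ref['name']}"
        nodes.append({"id": branch_id, "type": "scm_branch"})
        edges.append({"from": repo_id, "to": branch_id, "type": "CONTAINS"})
    return nodes, edges

sample = {
    "data": {
        "repository": {
            "name": "payments-service",
            "refs": {"nodes": [{"name": "main"}, {"name": "dev"}]},
        }
    }
}
nodes, edges = transform_repo_response(sample)
```

The transformation is essentially a reshaping of the API response — which is exactly what makes these intrasystem relationships the easy part.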
In contrast, determining the effective permissions of an SCM user against an SCM branch is a more involved effort. It requires processing the user’s permissions against the relevant SCM organization and against the relevant SCM repository, as well as the branch protection rules of the SCM repository.
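A simplified sketch of that resolution logic follows. The role names, precedence rule and branch-protection fields are all invented for illustration — real SCM permission models are considerably richer:

```python
# Sketch: resolving a user's effective permission on a branch by combining an
# org-level role, a repo-level role and branch protection rules. Role names,
# ranking and protection fields are simplified assumptions, not any vendor's model.
ROLE_RANK = {"read": 0, "write": 1, "admin": 2}

def effective_branch_permission(org_role, repo_role, branch_protection, user):
    # The stronger of the org-wide and repo-specific grants wins.
    base = max(org_role, repo_role, key=lambda r: ROLE_RANK[r])
    # Branch protection can downgrade write access for non-allowlisted users.
    restricted = branch_protection.get("restrict_push", False)
    allowlist = branch_protection.get("push_allowlist", [])
    if restricted and user not in allowlist and base == "write":
        return "read"
    return base

perm = effective_branch_permission(
    "read", "write",
    {"restrict_push": True, "push_allowlist": ["release-bot"]},
    "alice",
)
```

Even in this toy form, the answer for a given user depends on three separate inputs — which is why the result is best materialized as an edge in the graph rather than recomputed ad hoc.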
Intersystem relationships are also often complex. For example, try to accurately model, in high resolution, whether a developer — or, more likely, a DevOps engineer — in your organization can push code directly to production. What about indirectly?
There’s high correlation between the resolution of the data in the graph and the insight an organization can derive from it. The aforementioned basic overview is a low-resolution one — although it covers a lot of ground, it contains little insight.
In contrast, consider a potential subgraph of a container registry. The most basic and low-resolution subgraph consists of a single node that represents the registry as a whole. However, increasing the resolution, nodes representing the registry’s repositories emerge, followed by the images of each repository, the layers of each image in each repository, the content of each layer and so on.
With these high-resolution nodes in mind, not only are we already able to gain interesting insight into the container registry — for example, images’ SBOM — but we have also discovered opportunities to create intelligent relationships between the higher resolution nodes and nodes of different systems. These relationships include, for example, a package in a container image that is identical to a package in a JFrog repository — or even more interestingly, a package in a container image that has an identical name but different content compared to its JFrog counterpart.
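A sketch of deriving such intelligent cross-system edges, using one illustrative matching rule — same package name and version, compared by content digest. All field names here are assumptions:

```python
# Sketch: cross-system edges between packages found in a container image and
# packages in an artifact repository. Matching on (name, version) and then
# comparing content digests is an illustrative rule, not a production matcher.
def package_edges(image_pkgs, registry_pkgs):
    by_key = {(p["name"], p["version"]): p for p in registry_pkgs}
    edges = []
    for pkg in image_pkgs:
        match = by_key.get((pkg["name"], pkg["version"]))
        if match is None:
            continue
        edge_type = ("IDENTICAL_TO" if pkg["digest"] == match["digest"]
                     else "NAME_COLLISION")  # same name/version, different content
        edges.append({"from": pkg["name"], "to": match["name"], "type": edge_type})
    return edges

image_pkgs = [{"name": "lodash", "version": "4.17.21", "digest": "sha256:aaa"}]
registry_pkgs = [{"name": "lodash", "version": "4.17.21", "digest": "sha256:bbb"}]
edges = package_edges(image_pkgs, registry_pkgs)
```

In this example the digests disagree, so the derived edge flags the more interesting case: a package that looks like its JFrog counterpart but isn’t.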
Raw data that can be transformed to graph nodes and edges comes in all shapes, sizes and frequencies.
Roughly speaking, the higher resolution data will be naturally more granular and change more frequently. Following up on the container registry example, the lower resolution nodes — the actual registry and the contained repositories — will change infrequently, whereas the higher resolution nodes — the images and the underlying layers — will change frequently. And those changes will often carry greater details — for example, imagine the changes to a package-lock.json when new packages are introduced.
Additionally, the higher resolution data will often require more advanced extraction and processing methods, while the lower resolution data is often retrievable by simple REST or GraphQL requests.
Let’s circle back to the container registry example. Listing the images of some container registry repository is easily achievable via a REST request. However, deriving the full SBOM graph — with both OS-level and application-level packages — of a new image in that repository is a significantly more involved process.
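For the easy half of that comparison, here’s a sketch of listing a repository’s tags via the standard OCI distribution tag-listing endpoint (`/v2/<name>/tags/list`). The endpoint path comes from the spec, but the registry host and bearer-token handling below are illustrative — real registries vary in their auth flows:

```python
# Sketch: tag listing against an OCI-compliant registry. Split into small
# pieces so the URL construction and response parsing are testable offline.
import json
import urllib.request

def tags_url(registry, repository):
    # Standard OCI distribution spec tag-listing endpoint.
    return f"https://{registry}/v2/{repository}/tags/list"

def parse_tags(body):
    return json.loads(body).get("tags", [])

def list_tags(registry, repository, token=None):
    req = urllib.request.Request(tags_url(registry, repository))
    if token:  # simplistic bearer auth; many registries require a token exchange first
        req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req) as resp:
        return parse_tags(resp.read())
```

Contrast this handful of lines with an SBOM pipeline, which must pull layers, unpack filesystems and walk multiple package-manager formats per image.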
Besides the interesting challenge of modeling CI/CD graphs, retrieving the required data, transforming it and storing it in a way that’s performant, resilient and consistent is likewise an interesting challenge.
As previously mentioned, the relevant raw data comes in all shapes, sizes and frequencies. Some data is more appropriate to be pulled periodically, whereas other data is more efficiently retrieved via registration to push events or streams.
The synchronization of all the different data avenues is critical to ensure consumers of the graph are consuming consistent data that is not in a transient state.
It is not enough to rely on ACID stores: the interoperability between CI/CD systems is so extensive that a single change in a high-resolution node may lead to a cascade of altered intelligent edges and nodes.
The process of determining the new relationships in such an event often involves a sequence of operations that have to be performed asynchronously due to both a potentially high number of hops, as well as prolonged execution durations.
In the interim, the graph must remain in a consistent state — albeit not necessarily accurate in terms of real-time representation. How accurate your graph must be in terms of real-time representation is derived from your SLA considerations, which may lead to very different architectures.
Another interesting topic in the realm of CI/CD graphs is user-friendly nodes and edges. Although high-resolution nodes and edges are great for deriving in-depth insights, they’re not all necessarily adequate for the presentation layer, often being noisy or just not interesting for observation in many cases.
In particular, intelligent edges between high-resolution nodes meant for vector calculations are often not intuitive or user-friendly. Therefore, derivative user-friendly nodes and edges should sometimes be generated to present the insights in a user-friendly and intuitive way.
Furthermore, user interaction is another interesting challenge. Users may want to, for example, simulate how an altered configuration of a resource affects blast radius. A translation layer between the user-friendly nodes/edges and the algorithmic nodes/edges can be implemented to achieve the required bidirectionality.
Each type of CI/CD system is often offered by several different vendors. Between the various vendors, there are varying degrees of overlap in terms of features and properties. When building generic CI/CD graphs, our motivation is to be able to present a vendor-agnostic representation of the ecosystem. Otherwise, each integration of a system into our graph will be tedious.
Under ideal circumstances where the overlap is great — for example, container registries — adding integration with a new vendor will merely require a thin transformation layer, and all relevant deliverables will work without much additional effort.
Typically, systems based on standardized protocols or specifications — for example, Git, OCI, etc. — will contain substantial overlap between the different vendors and will be easier to normalize, whereas the nonstandardized ones will be more difficult.
As such, it’s important to strive for normalized nodes and edges, but to also know when to “throw in the towel” and define vendor-specific ones. Some properties and relationships are entirely vendor-specific, where attempts for generic expression are futile.
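A sketch of what such a thin transformation layer can look like: per-vendor normalizers emit a single vendor-agnostic node shape, with an escape hatch for properties that resist normalization. The field names below are assumptions for illustration (the ECR ones are modeled loosely on its image-describing response, the DockerHub ones are invented):

```python
# Sketch: per-vendor normalizers producing one vendor-agnostic container-image
# node shape. Properties with no generic expression land in vendor_specific.
def normalize_ecr(record):
    return {
        "type": "container_image",
        "repository": record["repositoryName"],
        "digest": record["imageDigest"],
        "vendor_specific": {"registry_id": record["registryId"]},
    }

def normalize_dockerhub(record):
    return {
        "type": "container_image",
        "repository": record["name"],
        "digest": record["digest"],
        "vendor_specific": {},
    }

NORMALIZERS = {"ecr": normalize_ecr, "dockerhub": normalize_dockerhub}

def to_node(vendor, record):
    return NORMALIZERS[vendor](record)
```

Everything downstream — edges, queries, visualizations — then works against the normalized shape, and integrating a new registry vendor means writing one more small normalizer.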
So we’ve modeled and engineered a graph. We took care of every one of the considerations mentioned above, but that is just the first half of our journey. Now it’s time to capitalize on the data.
Here are some prominent examples:
The lower to medium resolution nodes and edges represent the major resources in a CI/CD ecosystem, such as SCM repositories, SCM users, CI pipelines, container registry repositories, etc. Each of these nodes is potentially connected to several of the other nodes, making a tabular representation of the ecosystem somewhere between highly confusing and impossible.
For example, imagine the average medium-sized organization, maintaining a few hundred SCM repositories. Each repository is potentially accessible by various means — username or password, SSH key, access token, OAuth token and others. Repositories trigger builds off of various conditions, considering the source branch, branch protection rules and other factors.
Ultimately, understanding which humans — and which applications — have the permissions to trigger which CI pipeline — and under what conditions — relies upon complex queries and conditions that cannot be adequately calculated by means that are based on tabular data representation.
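As a toy illustration of why a graph traversal answers this more naturally than table joins, here’s a minimal sketch: a breadth-first search over typed edges from an identity to the pipelines it can reach. The edge types and the tiny example graph are invented:

```python
# Sketch: "which pipelines can this identity trigger?" as a BFS over typed
# edges. In a real graph store this is a path query; the adjacency structure
# and edge types here are illustrative.
from collections import deque

EDGES = {
    ("user:alice", "HAS_WRITE"): ["repo:payments"],
    ("token:deploy-bot", "HAS_WRITE"): ["repo:payments"],
    ("repo:payments", "TRIGGERS_ON_PUSH"): ["pipeline:payments-ci"],
}

def reachable_pipelines(identity):
    seen, out, queue = {identity}, set(), deque([identity])
    while queue:
        node = queue.popleft()
        for (src, _etype), dsts in EDGES.items():
            if src != node:
                continue
            for dst in dsts:
                if dst in seen:
                    continue
                seen.add(dst)
                if dst.startswith("pipeline:"):
                    out.add(dst)
                queue.append(dst)
    return out
```

Adding conditions — branch protection, trigger filters, token scopes — amounts to pruning edges during the walk, whereas the equivalent SQL grows a join per hop.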
Basic attack vectors become observable in higher resolution data, and the likelihood of correct identification of a vector increases with resolution and volume. The more auxiliary components and guardrails represented in the graph, the more false vectors can be eliminated and the more refined blast radiuses can be derived.
For example, imagine a subgraph of a Kubernetes cluster. Some possible nodes in such a subgraph are pods, containers, image layers, layer SBOM, etc. Additionally, let’s suppose we identified a vulnerable version of Log4j in one of the layers’ SBOM.
On the one hand, with enough volume and a high enough resolution, this can allow an attack scenario that leads all the way to the underlying cloud infrastructure — for example, via lateral movement to a privileged pod — or to a database with customer data — for example, reading a mounted secret.
Alternatively, the same scenario can lead to absolutely nowhere — e.g., the library isn’t loaded to the memory or no significant lateral movement is possible.
The latter is nonetheless a security risk, but the difference in severity from the former is significant, and it is essential to be able to make the distinction.
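One way to make that distinction algorithmically is to prune guardrail-blocked edges before computing the blast radius of the vulnerable node. A minimal sketch, with an invented graph shape and guardrail set:

```python
# Sketch: blast radius as reachability from a vulnerable node, with edges
# eliminated by guardrails (e.g. the library is never loaded, or a network
# policy blocks the hop). Graph and guardrail contents are illustrative.
def blast_radius(graph, start, guardrails):
    # graph: node -> list of neighbors; guardrails: set of blocked (src, dst) edges
    reached, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, []):
            if (node, nxt) in guardrails or nxt in reached:
                continue  # a guardrail kills this hop; skip already-visited nodes
            reached.add(nxt)
            stack.append(nxt)
    return reached

graph = {
    "pkg:log4j-core": ["container:api"],
    "container:api": ["pod:api"],
    "pod:api": ["secret:db-creds", "node:worker-1"],
}
guardrails = {("pod:api", "secret:db-creds")}
radius = blast_radius(graph, "pkg:log4j-core", guardrails)
```

With the guardrail in place the database credentials fall out of the radius; remove it, and the same vulnerable package reaches the customer-data secret — the severity gap the graph lets you quantify.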
The Log4j scenario described above is potentially disastrous. However, its graph discovery and representation are reasonably straightforward.
At Prisma Cloud, we’re engaged in researching advanced CI/CD vectors, often involving tens of high-resolution nodes with multiple different intersystem edges between the various nodes.
A nice example of what we consider an advanced vector is a vector, or rather a combination of several different vectors, that accurately describes whether a CI/CD pipeline will be executed on a pull request or a pushed commit to an SCM repository — and if so, what code will be executed in the pipeline and what sensitive information is potentially accessible to that code.
Sensitive information may indeed be exfiltrated — e.g., codecov — but it doesn’t necessarily end there. CD pipelines often have access to credentials that, unsurprisingly, facilitate continuous delivery. These pipelines will often contain access to write credentials to your container registry, your Kubernetes API and other production-related integrations.
I’ve briefly touched on simulations before, and indeed, simulating different high-resolution CI/CD graph states based on simulated modifications to a live CI/CD system is a powerful discoverability measure.
The relationship between security, DevOps and dev teams involves a continuous cycle of give and take. Applying the most comprehensive and strict security measures to every feature or component in every system might severely hinder development velocity and, consequently, go-to-market (GTM) and time-to-market (TTM) timelines.
On the other hand, disregarding security will eventually lead to the accumulation of significant security debt, which will become progressively more challenging to account for.
It is therefore imperative to find the right balance. But that’s often easier said than done. As evident, CI/CD ecosystems are complex, and understanding the full effect on the blast radius of lacking a security measure or control is difficult without at least a moderately high-resolution CI/CD graph.
By employing a high-resolution CI/CD graph, different states based on different simulated security controls can be derived algorithmically and visualized in a user-friendly presentation.
The dynamic and fast-paced nature of today’s engineering ecosystem has made managing our environments with infrastructure as code (IaC), rather than manually, inevitable. IaC is imperative to achieve maximal replicability and restorability.
However, for various reasons, exceptions are often made. Resources are not deployed via IaC, permissions are granted ad hoc, code is not templated, etc.
Machine learning for graphs can help discover patterns and anomalies in CI/CD ecosystems. Applying unsupervised clustering algorithms on graphs can uncover both lower-resolution anomalies — for example, unusual SCM user permissions, unusual SCM repository branch protection rules, etc. — as well as higher-resolution anomalies like unusual Dockerfile base images, unusual Kubernetes manifests, unusual SBOM, etc.
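As a deliberately minimal stand-in for those graph-native algorithms, here’s a statistical sketch of the lower-resolution case: flagging SCM users whose permission count deviates strongly from the population. The z-score threshold and data are illustrative; a real system would cluster on graph structure and features, not a single scalar:

```python
# Sketch: z-score outlier flagging over per-user permission counts, as a toy
# proxy for graph anomaly detection. Threshold and features are assumptions.
import statistics

def anomalous_users(perm_counts, threshold=2.0):
    mean = statistics.mean(perm_counts.values())
    stdev = statistics.pstdev(perm_counts.values())
    if stdev == 0:
        return []  # perfectly uniform population: nothing to flag
    return [u for u, c in perm_counts.items() if abs(c - mean) / stdev > threshold]

counts = {"u1": 3, "u2": 4, "u3": 3, "u4": 4, "u5": 3,
          "u6": 4, "u7": 3, "u8": 4, "mallory": 40}
flagged = anomalous_users(counts)
```

The same pattern — learn what “normal” looks like, flag the outliers — generalizes to the higher-resolution anomalies once the features come from the graph itself.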
Graph structures and algorithms have excellent synergy with CI/CD ecosystems. Dramatically improved low-resolution visibility and high-resolution blast radius discovery are just some of the possible benefits of representing CI/CD ecosystems with graphs.
However, successfully doing so at scale is a challenging yet fascinating R&D endeavor. CI/CD systems have hundreds of potential graph nodes and edges to be modeled, requiring the orchestration of multiple, diverse and often complex data sources.
Fortunately, the task is as fascinating as it is challenging. There’s virtually no limit to how complex and high-resolution your graph may be.
From exploring and modeling esoteric or extremely high-resolution data to making sure your infrastructure and application tiers support its retrieval and presentation in a way that’s scalable and efficient, CI/CD graphs are an R&D oasis.
These capture the essence of the challenges, dilemmas and thought processes that make up our day-to-day at Prisma Cloud when working on CI/CD graphs.
If you’re looking for a deeper dive into CI/CD security, make sure to read our technical guide on the Top 10 CI/CD Security Risks.