When we hear the term Policy as Code, it’s mostly associated with enforcement: applying policy at scale within organizations. Whether in the context of GitOps or of popular open source projects like Open Policy Agent (OPA), applying global policy is the end goal.

However, OPA as an open source project is unusual in its capabilities. It offers a whole set of pre-enforcement tools that provide deep insights into our systems, and these have long been overlooked. A new wave of infrastructure drift has arisen in the cloud world due to the layers of abstraction and automation. When building Firefly, we channeled the power of OPA as a policy discovery engine, not just an enforcement mechanism.

This post dives into a new way of thinking about Policy as Code when engineering highly distributed cloud systems at scale. It will demonstrate how OPA can be leveraged not just for enforcing policy, but also for learning deep insights about your systems and their resource utilization, and ultimately for customizing the policy you want to apply based on this invaluable information.

What Is OPA and How Does It Work?

If you are already familiar with Open Policy Agent, you can skip to the next part. To get everyone else on the same page, we’ll quickly run through OPA and how it’s used today in large-scale distributed systems, such as microservices or Kubernetes-based systems.

OPA is an open source project hosted by the CNCF (Cloud Native Computing Foundation) and built for cloud native systems. It combines an easy-to-learn, dev-readable language (Rego) with a policy model and API to provide a universal framework for applying policy across your stacks. This enables you to decouple your policy from your code, and apply it as often as needed, independent of code changes or deployments.

OPA is essentially a flexible engine that not only enables you to enumerate and write policies, but also has powerful search capabilities that can be applied to any JSON dataset. Beyond Rego itself, there is no separate query syntax to learn for each data source, as there would be with different databases.

Under the hood, policy enforcement is always based on a pre-existing event or input in the system: to apply a certain practice or policy across your systems, these events or inputs first have to be identified, and only then is the policy action taken. Therefore, before we decide what to do with an input (for example, allow/deny in policy terms), we first need to verify it. OPA as an engine verifies rules and policies against a dataset. The actions taken as a consequence depend on what you choose to define.
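The verify-then-decide flow described above can be sketched in plain Go. This is only an illustration of the idea, not OPA’s API: the `Input`, `Rule`, and `Decide` names and the `publicBucket` rule are hypothetical.

```go
package main

import "fmt"

// Input is the event or request being evaluated, already decoded from JSON.
type Input map[string]any

// Rule inspects an input and reports whether it matches.
type Rule func(in Input) bool

// Decide verifies the input against the rule and returns a decision.
// What is done with that decision is left entirely to the caller.
func Decide(rule Rule, in Input) string {
	if rule(in) {
		return "deny"
	}
	return "allow"
}

// publicBucket is a hypothetical rule: it matches storage buckets that are public.
func publicBucket(in Input) bool {
	public, ok := in["public"].(bool)
	return in["type"] == "bucket" && ok && public
}

func main() {
	fmt.Println(Decide(publicBucket, Input{"type": "bucket", "public": true}))
	fmt.Println(Decide(publicBucket, Input{"type": "bucket", "public": false}))
}
```

The rule only produces a decision; whether that decision blocks a deployment, raises an alert, or is simply recorded is a separate concern.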

Leveling Up Policy as Code With OPA

While building a cloud asset management tool, I have learned firsthand how important it is to understand what’s really happening in our cloud. Many times, due to mistakes or simply a lack of knowledge, resources are misconfigured.

These misconfigurations can lead to future issues, whether in the form of functional and cost consequences or, more concerning, security consequences. They range from an unattached data store that is bleeding costs to riskier errors, such as a service configured with overly permissive access, which can be a critical security threat.

OPA is built for parsing JSON, in a world where JSON is largely ubiquitous for configuring infrastructure. It can traverse hierarchies and scan attributes and properties for policy definition, and it has the added benefit of being completely external to your data source. In practice, this means that any data source that can be exported to JSON (even a very large JSON file), and in which key/value pairs can be identified, can easily be searched and parsed by OPA to extract insights about your systems and their resources.
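To make the idea of traversing a JSON hierarchy concrete, here is a minimal sketch in plain Go that walks an arbitrary decoded document and reports every path where a given key is set to true. The export, the `publicly_accessible` key, and the function names are assumptions made for illustration; with OPA, the same traversal would be expressed as a Rego rule rather than hand-written code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// Hypothetical export of cloud resources; the shape is illustrative only.
const export = `{
  "resources": [
    {"name": "orders-db", "type": "database", "settings": {"publicly_accessible": true}},
    {"name": "cache", "type": "redis", "settings": {"publicly_accessible": false}},
    {"name": "assets", "type": "bucket", "acl": {"publicly_accessible": true}}
  ]
}`

// findTrue walks a decoded JSON value and appends the path of every
// object key named key whose value is the boolean true.
func findTrue(v any, key, path string, out *[]string) {
	switch node := v.(type) {
	case map[string]any:
		for k, child := range node {
			p := path + "/" + k
			if flag, ok := child.(bool); ok && flag && k == key {
				*out = append(*out, p)
			}
			findTrue(child, key, p, out)
		}
	case []any:
		for i, child := range node {
			findTrue(child, key, fmt.Sprintf("%s/%d", path, i), out)
		}
	}
}

// flaggedPaths decodes doc and returns the sorted paths where key is true.
func flaggedPaths(doc []byte, key string) ([]string, error) {
	var root any
	if err := json.Unmarshal(doc, &root); err != nil {
		return nil, err
	}
	var out []string
	findTrue(root, key, "", &out)
	sort.Strings(out) // map iteration order is random, so sort for stable output
	return out, nil
}

func main() {
	paths, err := flaggedPaths([]byte(export), "publicly_accessible")
	if err != nil {
		panic(err)
	}
	for _, p := range paths {
		fmt.Println(p)
	}
}
```

Note that the walk does not depend on where the key sits in the hierarchy, which is what makes this style of search useful across differently shaped resources.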

One built-in capability that matters for policy enforcement, but is really a critical step before it, is dynamic classification. To extract data about resources that are misconfigured, or that are in a state that could be problematic in a cloud deployment, I simply need to search by pre-defined (i.e. existing) criteria.

To see why this is so unusual, let’s look at how the same thing could be achieved using other technology. Let’s say I want to pull datasets from two different sources (for example, an Elasticsearch cluster and a MongoDB database), combine the data, and extract a certain fine-grained insight. This would require me to first retrieve the data from each source using its own query syntax. After the data is retrieved, I’d then need to "join" it intelligently. Once the very large dataset is combined, I’d then need to unify it into a single format, just to be able to parse both data sources easily.

Now let’s consider this using OPA instead.  

By exporting the required data to JSON, I can bypass the store-specific retrieval and joins required just to get started. By expressing the queries in the dev-friendly Rego syntax, I am able to search multiple disparate datasets with a unified language, and narrow the results down to exactly the criteria I care about. This not only democratizes this kind of search, since it doesn’t require a database expert, but also makes the process significantly shorter, simpler, and much more flexible and customizable.
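As a rough illustration of what "export to JSON, then query both datasets in one place" looks like, here is a minimal sketch in plain Go. The two exports, their fields, and the function names are hypothetical, and the Go filter is only a stand-in: with OPA, the join and the threshold would be written as a Rego rule over the combined document.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// Hypothetical exports: error counts per service from one store,
// service ownership from another.
const elasticExport = `[
  {"service": "checkout", "errors": 42},
  {"service": "search", "errors": 0},
  {"service": "billing", "errors": 7}
]`

const mongoExport = `[
  {"service": "checkout", "owner": "team-payments"},
  {"service": "search", "owner": "team-discovery"},
  {"service": "billing", "owner": "team-payments"}
]`

type record map[string]any

// load decodes a JSON array of objects into generic records.
func load(doc string) ([]record, error) {
	var rows []record
	err := json.Unmarshal([]byte(doc), &rows)
	return rows, err
}

// ownersOfFailingServices joins both exports on "service" and returns sorted
// "service: owner" strings for services with more than minErrors errors.
func ownersOfFailingServices(metrics, owners []record, minErrors float64) []string {
	ownerOf := map[string]string{}
	for _, o := range owners {
		name, _ := o["service"].(string)
		owner, _ := o["owner"].(string)
		ownerOf[name] = owner
	}
	var out []string
	for _, m := range metrics {
		name, _ := m["service"].(string)
		errs, _ := m["errors"].(float64) // encoding/json decodes numbers as float64
		if errs > minErrors {
			out = append(out, name+": "+ownerOf[name])
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	metrics, err := load(elasticExport)
	if err != nil {
		panic(err)
	}
	owners, err := load(mongoExport)
	if err != nil {
		panic(err)
	}
	for _, line := range ownersOfFailingServices(metrics, owners, 5) {
		fmt.Println(line)
	}
}
```

Once both sources are plain JSON, neither store’s query language is involved any more; the only thing left to express is the criteria.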

Sometimes it’s just as important to know what you have in the first place before you decide what you can actually do with it, and who has which permissions.

OPA for Policy as Code in Action

OPA provides a great platform to write complex policies to identify many things such as anomalies, misconfigurations, or poor practices.

Below I’ll walk through real-world examples of parsing and extracting relevant datasets, with and without OPA. There are many practical use cases for this; here are just two code samples.

Example #1

Let’s assume you are the CISO of a large organization with multiple AWS accounts, and you want to find all the active IAM users in your accounts that do not have an MFA device configured, since MFA is required to comply with company security standards. We can extract the following list of users from all of the AWS accounts as JSON (either through the AWS API, which is usually a fairly complex task, or with a single command in Firefly):

As we can see, some of the users have no MFA device associated with their account.
In order to identify these users with OPA we can write a simple policy to match every user without MFA:

Now we have our dataset and our policy. With a few lines of Go, we can evaluate the policy against the dataset and get the matching assets (IAM users, in this case).
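As a minimal sketch of this matching step, the check can be written in plain Go against a hypothetical dataset. The `account_id`, `user_name`, `active`, and `mfa_devices` fields are assumptions made for illustration, not an AWS or Firefly schema, and the Go filter stands in for evaluating the Rego rule through OPA.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// User is a hypothetical, simplified shape for an exported IAM user.
type User struct {
	Account    string   `json:"account_id"`
	Name       string   `json:"user_name"`
	Active     bool     `json:"active"`
	MFADevices []string `json:"mfa_devices"`
}

// Illustrative data only; account IDs and names are made up.
const usersJSON = `[
  {"account_id": "111111111111", "user_name": "alice", "active": true,  "mfa_devices": ["alice-phone"]},
  {"account_id": "111111111111", "user_name": "bob",   "active": true,  "mfa_devices": []},
  {"account_id": "222222222222", "user_name": "carol", "active": false, "mfa_devices": []},
  {"account_id": "222222222222", "user_name": "dave",  "active": true}
]`

// activeUsersWithoutMFA returns every active user with no MFA device.
// A missing mfa_devices field is treated the same as an empty list.
func activeUsersWithoutMFA(doc []byte) ([]User, error) {
	var users []User
	if err := json.Unmarshal(doc, &users); err != nil {
		return nil, err
	}
	var matched []User
	for _, u := range users {
		if u.Active && len(u.MFADevices) == 0 {
			matched = append(matched, u)
		}
	}
	return matched, nil
}

func main() {
	matched, err := activeUsersWithoutMFA([]byte(usersJSON))
	if err != nil {
		panic(err)
	}
	for _, u := range matched {
		fmt.Printf("%s/%s has no MFA device\n", u.Account, u.Name)
	}
}
```

In this sample, the inactive user is skipped even though she has no MFA device, because the requirement only concerns active users.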

The result will of course be (a more concise version of the result above):

Example #2

As a DevOps Engineer, I would like to find all Kubernetes Deployments that use the latest image tag (which has caused problems, because it is unclear which image version is actually running in the Pod) and fix them.

Therefore, I extract a list of all of the live Deployment configurations from multiple clusters as JSON (which can be done using the Kubernetes API with a bit more complexity and work, or again with one command in Firefly):

Aside from our prod web app, we can see other Deployments that use the latest tag or have no tag at all. We will write a policy to identify the Deployments with the latest tag or without a pinned image tag (which defaults to latest):
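The check itself can be sketched in plain Go. The sample Deployments and the function names below are hypothetical, and the Go code only stands in for the Rego rule; the fields it reads (`metadata.name` and `spec.template.spec.containers[].image`) follow the standard Kubernetes Deployment layout.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Deployment holds only the fields this check needs.
type Deployment struct {
	Metadata struct {
		Name string `json:"name"`
	} `json:"metadata"`
	Spec struct {
		Template struct {
			Spec struct {
				Containers []struct {
					Image string `json:"image"`
				} `json:"containers"`
			} `json:"spec"`
		} `json:"template"`
	} `json:"spec"`
}

// Illustrative data only; names and images are made up.
const deploymentsJSON = `[
  {"metadata": {"name": "prod-web-app"},
   "spec": {"template": {"spec": {"containers": [{"image": "nginx:latest"}]}}}},
  {"metadata": {"name": "api"},
   "spec": {"template": {"spec": {"containers": [{"image": "registry.local:5000/api:1.4.2"}]}}}},
  {"metadata": {"name": "worker"},
   "spec": {"template": {"spec": {"containers": [{"image": "registry.local:5000/worker"}]}}}}
]`

// usesLatest reports whether an image reference has no pinned tag
// (which defaults to latest) or is explicitly tagged latest.
func usesLatest(image string) bool {
	if strings.Contains(image, "@") {
		return false // pinned by digest
	}
	// Only look for a tag after the last "/", so a registry port such as
	// registry.local:5000/app is not mistaken for a tag.
	name := image[strings.LastIndex(image, "/")+1:]
	i := strings.LastIndex(name, ":")
	if i < 0 {
		return true // no tag at all
	}
	return name[i+1:] == "latest"
}

// unpinnedDeployments returns the names of Deployments in which at least
// one container uses the latest tag or no tag, in input order.
func unpinnedDeployments(doc []byte) ([]string, error) {
	var deployments []Deployment
	if err := json.Unmarshal(doc, &deployments); err != nil {
		return nil, err
	}
	var names []string
	for _, d := range deployments {
		for _, c := range d.Spec.Template.Spec.Containers {
			if usesLatest(c.Image) {
				names = append(names, d.Metadata.Name)
				break
			}
		}
	}
	return names, nil
}

func main() {
	names, err := unpinnedDeployments([]byte(deploymentsJSON))
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
}
```

The one subtle case is an image hosted on a registry with a port and no tag, which a naive "contains a colon" check would wrongly treat as pinned.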

The result will, of course, be:

What this makes possible is extracting important insights, that is, finding exactly which data matches a particular query. Then we can intelligently decide how to apply the most appropriate policy. The difference in the level of complexity is striking. Suddenly, a process that formerly required seasoned experts, such as a database engineer, can be performed by anyone able to learn the Rego syntax (which is dev-readable and understandable).

This flips the current OPA paradigm entirely. In its most popular use case, the parsing and categorizing of information and data is completely abstracted from the user, who essentially just moves ahead to enforcing a global policy based on this data. OPA’s documentation actually notes very clearly that it decouples policy decision-making from policy enforcement, and yet OPA is still largely used for enforcement.

Choosing Your Policy Wisely With Code

With this new way of thinking about OPA as a unified data retrieval engine, you can choose to apply finer-grained policies based on specific anomalies, misconfigurations, changes, misuse, and much more. These are just a few examples.

By leveraging an extremely popular open source project, the barrier to entry for this critical information has been lowered, and cloud and DevOps engineers can gain a quick understanding of the state of their highly complex cloud operations. It also comes with the added benefit of an excellent and supportive community for those who are just getting started with it.

With today’s highly scalable, highly distributed operations that involve multiple stakeholders, being able to quickly identify your cloud configurations, deployed resources, and usage is gaining importance. With cloud costs spiraling out of control and the cloud attack surface growing daily, knowledge is power in this specific case. And knowing how to leverage that knowledge? Well, that would be a superpower.

Logo source: https://www.openpolicyagent.org/