Safe ACL Change through Model-based Analysis

Model-based analysis makes it possible to rigorously test and validate changes prior to deployment. This approach enabled the eBay Network Engineering staff to refactor a large, business-critical Access Control List, reducing its size by 80% without any adverse business impact.

The eBay Site Network is a core part of eBay’s business. As is best practice for a large Internet site, the first line of defense for the eBay Site Network is an Access Control List (ACL) at the border of the network. This ACL is programmed on the edge network routers that connect the eBay Site to the Internet. It protects the Site by restricting the traffic flows that are allowed to enter the eBay Site Network. The default action for the border ACL is to block network flows. Therefore, allowed network flows are referred to as “ACL exceptions,” and the majority of ACL changes consist of either adding or removing ACL exceptions.

Changes to the border ACL match the pace of eBay Site infrastructure changes. At any time, thousands of ACL entries are programmed in the border ACL, and they are necessary for the correct operation of the eBay Site. Over time, a large number of no-longer-necessary ACL entries tends to build up in the border ACL, and the organization of the ACL itself tends to become fragmented. This is a form of technical debt. The main underlying causes are: 1) unused ACL entries do not immediately affect Site operation, so there is little pressure to remove them; 2) removing ACL exceptions carries inherent risk, as doing so can block necessary network flows and reduce Site Availability; and 3) refactoring the structure of the border ACL to consolidate it carries inherent risk, as seemingly simple operations such as changing the order of ACL entries can change its behavior.

The eBay border ACL is maintained with the help of an open source tool called Capirca. Among other things, this tool provides separation between policy and definitions. If leveraged properly, this separation makes the policy smaller and easier to work with. The definitions can be generated automatically from the authoritative source of truth for network assets. When we started this project, the Capirca source files themselves carried much redundancy and fragmentation. This compounded the issues with no-longer-necessary ACL entries, and created severe technical debt.
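
The separation between policy and definitions can be illustrated with a minimal Python sketch. This is not Capirca's file format or API; the names, the dictionary layout, and the addresses (taken from documentation ranges) are all hypothetical and only show the idea that the policy refers to names while the address ranges live elsewhere.

```python
import ipaddress

# Definitions: symbolic names mapped to address ranges. In practice these
# could be generated from the source of truth for network assets.
# All names and addresses here are hypothetical.
DEFINITIONS = {
    "WEB_FRONTENDS": ["192.0.2.0/28", "198.51.100.10/32"],
    "DNS_SERVERS": ["203.0.113.53/32"],
}

# Policy: ordered terms that refer to names only, never to raw addresses.
POLICY = [
    {"name": "allow-web", "protocol": "tcp",
     "destination": "WEB_FRONTENDS", "action": "permit"},
    {"name": "allow-dns", "protocol": "udp",
     "destination": "DNS_SERVERS", "action": "permit"},
]


def expand(policy, definitions):
    """Resolve names into concrete ACL entries, one per address range."""
    entries = []
    for term in policy:
        for prefix in definitions[term["destination"]]:
            entries.append(
                (term["action"], term["protocol"], ipaddress.ip_network(prefix))
            )
    return entries


entries = expand(POLICY, DEFINITIONS)
```

In this sketch the two-term policy expands into three concrete entries. Adding a new front-end range changes only the definitions, and the policy stays the same size.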

One metric that we use to track the complexity of the border ACL is the number of ACL entries. At eBay, this metric is in the multiple thousands. The same metric is used to track the amount of change over time. One discipline developed primarily to manage complexity is Software Engineering. We observed that a network ACL and a computer program have similar properties: both require maintaining a large set of ordered instructions, and both are very sensitive to small variations, such as a mistyped character. Therefore, it made sense to adopt Software Engineering tools to manage the complexity of the border ACL. One such tool is Git, a popular Revision Control System. Figure 1 shows the trend of the border ACL Lines Of Code (LOC) metric over the last few years. Derived metrics capture the rate of change, which can be expressed as the frequency of atomic changes and the number of lines changed. In software management, these two metrics correspond to the frequency of commits and the size of commits, respectively.
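
As a rough illustration of the two derived metrics, the sketch below computes commit frequency and commit size from text in the shape that `git log --numstat` produces. The sample log, dates, and file names are made up for illustration and are not eBay data; in practice the text would come from running the command against the ACL repository.

```python
from collections import Counter

# Sample text in the shape produced by:
#   git log --numstat --date=short --format="commit %ad"
# The dates, counts, and file names are hypothetical.
SAMPLE_LOG = """commit 2020-03-02

12\t4\tpolicies/border.pol
3\t0\tdef/NETWORK.net
commit 2020-03-02

0\t250\tpolicies/border.pol
commit 2020-04-15

1\t1\tpolicies/border.pol
"""


def commit_metrics(log_text):
    """Return (commits per month, lines changed per commit)."""
    per_month = Counter()
    sizes = []
    for line in log_text.splitlines():
        if line.startswith("commit "):
            month = line.split()[1][:7]  # keep YYYY-MM
            per_month[month] += 1
            sizes.append(0)
        elif line.strip():
            added, deleted, _path = line.split("\t", 2)
            # Binary files are reported as "-"; count them as zero lines.
            sizes[-1] += sum(int(n) for n in (added, deleted) if n.isdigit())
    return per_month, sizes


frequency, sizes = commit_metrics(SAMPLE_LOG)
```

Here `frequency` holds the number of commits per month and `sizes` holds the added plus deleted lines for each commit, in log order.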

Figure 1. LOC metrics of the two components of the border ACL over time: the main policy (blue) and source of truth for network IP space ranges (orange). The large drop in complexity metrics on the right hand side reflects the work described in this article.

We have long-standing, internal processes to control changes and regularly audit the border ACL to manage the inherent risks described above. Due to the size and complexity of the ACL, such processes had become effort-intensive and less effective. This made it difficult to meet the SLA for ACL exception requests, and also made every ACL modification a more dangerous change.

Refactoring the border ACL became crucial to address this technical debt. A well-structured ACL is necessary for streamlined auditing and fast changes. The major challenge with refactoring the border ACL is to ensure that the refactored ACL is functionally equivalent to the currently deployed ACL, before applying the change to the live Site environment. Each change set could be hundreds of lines, and at this scale, manual verification (eyeballing) is ineffective and highly risky.

Further, this type of change is difficult to test in an emulation-based Staging environment. We do not have an effective way to enumerate and inject the full set of network flows that the eBay Site requires in order to operate. That would require a Staging environment as complex as the Site itself.

In a typical software environment, this kind of challenge is addressed using a comprehensive test suite that analyzes the software’s behavior prior to deployment. We had to find a similar way to validate the candidate ACL. We identified model-based analysis as the solution to this challenge. Model-based analysis builds a model of network behavior based on its configuration and then analyzes it to check behaviors in a range of scenarios. It can perform both testing (i.e., analyzing individual inputs) and verification (i.e., proving correctness for all possible inputs). A separate blog post explains different forms of network validation in detail.

We selected Batfish, an open source tool that provides model-based analysis capabilities. Using Batfish, we developed an iterative procedure to: 

  1. Model the full ACL behavior before and after changes.

  2. Calculate the space of ACL actions (allow, block) for all possible network flows.

  3. Use an algorithm to attempt to prove that the space of ACL actions is identical before and after the change.

  4. If the algorithm is unable to prove the assertion above, it produces a list of counterexamples, that is, network flows for which the ACL action differs before and after the change.

  5. Use the counterexamples to correct issues with the new ACL.

  6. Add a unit test to the ACL to prevent any regression of the issues.

  7. Commit the results of this iteration to the git feature branch. 
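
The core of steps 2 to 4 can be sketched in plain Python on a deliberately tiny flow space. This is not Batfish's API or algorithm: a real border ACL matches on the full packet header space, which cannot be enumerated flow by flow, so this brute-force comparison only illustrates what "identical action for every flow" and "counterexample" mean. The rule representation and the example ACLs are assumptions made for the sketch.

```python
from itertools import product


def evaluate(acl, flow):
    """First-match semantics with a default action of deny."""
    for action, matches in acl:
        if matches(flow):
            return action
    return "deny"


def counterexamples(before, after, flow_space):
    """Compare the action for every flow and return the ones that differ."""
    return [
        (flow, evaluate(before, flow), evaluate(after, flow))
        for flow in flow_space
        if evaluate(before, flow) != evaluate(after, flow)
    ]


# Toy flow space: (protocol, destination port). Real flows carry more fields.
FLOW_SPACE = list(product(["tcp", "udp"], range(1, 1025)))

before = [
    ("permit", lambda f: f == ("tcp", 443)),
    ("permit", lambda f: f == ("tcp", 80)),
]
# Candidate refactoring: one term instead of two, with a typo (8 for 80).
after = [
    ("permit", lambda f: f[0] == "tcp" and f[1] in (443, 8)),
]

diffs = counterexamples(before, after, FLOW_SPACE)
```

An empty `diffs` list would mean the two ACLs are equivalent on this flow space. Here it contains two entries: TCP port 8 changes from deny to permit, and TCP port 80 changes from permit to deny, which is the kind of feedback step 5 relies on.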

We describe an instance to illustrate the process described above. In one of the consolidation steps, the original ACL has the following terms: 

  1. Permit TCP traffic from any source to host A.

  2. Deny TCP traffic to subnet S, which contains A.

  3. Permit TCP traffic from any source to host B.

A candidate ACL consolidates terms 1 and 3, since they express the same policy. The resulting ACL is:

  1. Deny TCP traffic to subnet S, which contains A.

  2. Permit TCP traffic from any source to host A or B. 

When the procedure is applied to compare the original ACL (before change) with the candidate ACL (after change), step 4 reports a change in behavior. The example flow is TCP, from any source, to destination host A, and the behavior changes from “permit” to “deny”. Based on this information, the developer prepares a new candidate ACL, which looks as follows:

  1. Permit TCP traffic from any source to host A or B.

  2. Deny TCP traffic to subnet S, which contains A.

When the procedure is executed on the new ACL, step 4 reports that the behavior is now identical. Therefore, it is safe to commit this specific change set to the repository. In this simple ACL the issue would have been easy to spot. It is much harder to spot when hundreds of terms change order in a set of many thousands.
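
The example above can be reproduced with a small, self-contained Python model. The addresses are hypothetical documentation ranges, host B is assumed to lie outside subnet S (the example does not say), and only the destination is modeled because every term matches TCP from any source. This is a sketch of the reasoning, not the Batfish workflow used in the project.

```python
import ipaddress

S = ipaddress.ip_network("192.0.2.0/24")   # subnet S (hypothetical)
A = ipaddress.ip_address("192.0.2.10")     # host A, inside S
B = ipaddress.ip_address("198.51.100.20")  # host B, assumed outside S


def action(acl, dst):
    """Return the action of the first term matching a TCP flow to dst."""
    for verdict, matches in acl:
        if matches(dst):
            return verdict
    return "deny"  # default action of the border ACL


original = [
    ("permit", lambda d: d == A),
    ("deny", lambda d: d in S),
    ("permit", lambda d: d == B),
]
first_candidate = [
    ("deny", lambda d: d in S),
    ("permit", lambda d: d in (A, B)),
]
second_candidate = [
    ("permit", lambda d: d in (A, B)),
    ("deny", lambda d: d in S),
]

# One representative destination per region that the terms distinguish.
PROBES = {
    "A": A,
    "other host in S": ipaddress.ip_address("192.0.2.99"),
    "B": B,
    "elsewhere": ipaddress.ip_address("203.0.113.1"),
}


def changed(before, after):
    """Map each probe whose action differs to its (before, after) actions."""
    return {
        name: (action(before, d), action(after, d))
        for name, d in PROBES.items()
        if action(before, d) != action(after, d)
    }


def test_flow_to_a_stays_permitted():
    # Regression test in the spirit of step 6 of the procedure.
    assert action(second_candidate, A) == "permit"
```

In this model, `changed(original, first_candidate)` reports that the flow to A goes from permit to deny, while `changed(original, second_candidate)` is empty. The regression test pins the corrected behavior so that a later reordering cannot silently reintroduce the issue.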

By iterating the procedure described above, a small team was able to refactor the border ACL with an acceptable amount of risk to Site Availability. Consistent with the assessed risk, no impact to any business flow was observed throughout the project.

The beneficial effects of this refactoring include:

  1. Lowering the risk for future border ACL changes.

  2. Lowering the turnaround time for border ACL changes from months to under a week.

  3. Enabling effective auditing of the border ACL, due to its new modular structure.

  4. Freeing up many hours of time for the team to spend on Operations and Engineering.

  5. Reducing the size of the border policy by 80%.

  6. Enabling further consolidation by de-risking the process of removing no-longer-needed ACL exceptions.

We are now leveraging model-based analysis, and Batfish specifically, in other areas of Network Engineering, such as testing a new Network Product before deployment, staging Firewall changes, and automatically validating network changes.

The case study presented in this article is just one of the ways in which eBay is advancing the state of networking using modern tools and practices. Get in touch if you’d like to learn more!