SRE Case Study: Mysterious Traffic Imbalance

As an architect of a large website, I spent over a decade of my life working on all kinds of troubleshooting cases. Many of those cases were quite challenging, similar to finding a suspect in a megacity, yet quite rewarding. I ended up with many Sherlock Holmes stories to tell. What I am sharing today is the case of a mysterious traffic imbalance.

Once upon a time there was a website. I am going to call it foo.com, but the name doesn't really matter. Feel free to replace it with any name that sounds better to you.

Foo.com had two data centers, Miami and Denver, running in active-active mode for business continuity and disaster recovery. Web traffic was evenly distributed between the two data centers by round robin DNS.

If you haven't heard of round robin DNS, the way it works is quite simple. Since foo.com runs in two data centers, the foo.com name is registered with two IP addresses in the Domain Name System (DNS). The IP address in Miami is 100.100.100.100, and the IP address in Denver is 200.200.200.200. When clients browse foo.com, the first thing they do is resolve the name into IP addresses. For each name resolution request, the DNS server returns the two IP addresses in alternating order. For instance:

  • The first client asks: what are the IP addresses of foo.com? DNS answers: The IP addresses of foo.com are 100.100.100.100 and 200.200.200.200.
  • The second client asks: what are the IP addresses of foo.com? DNS answers: The IP addresses of foo.com are 200.200.200.200 and 100.100.100.100.

Each client selects the first IP address in the response, so the first client talks to 100.100.100.100 in Miami, the second client talks to 200.200.200.200 in Denver, and so on. When there are millions of clients, the end result is that Miami and Denver receive approximately the same amount of traffic.
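
To make the mechanism concrete, here is a minimal Python sketch of the behavior described above. It is illustrative only: the addresses are this article's placeholder IPs, and it assumes every client simply connects to the first record in the answer.

    # Minimal sketch of round robin DNS (illustrative placeholder IPs).
    from collections import Counter

    MIAMI, DENVER = "100.100.100.100", "200.200.200.200"

    def dns_answer(query_number: int) -> list[str]:
        """Return the two foo.com records, rotating their order per query."""
        records = [MIAMI, DENVER]
        if query_number % 2 == 1:
            records.reverse()
        return records

    # Classic client behavior: connect to the first address in the answer.
    chosen = Counter(dns_answer(n)[0] for n in range(1_000_000))
    print(chosen)  # 500,000 picks for each data center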

It had been working like this for many years until mid-2007, when the Site Reliability Engineering (SRE) team noticed that Denver was getting slightly more traffic than Miami. The discrepancy was under 1%, which wasn't significant enough to cause any impact. It just seemed strange, as it had never happened before, so the SRE team opened a case and started to monitor the traffic distribution more closely.

After several weeks of monitoring, the team observed a clear trend: Internet traffic from users was shifting to Denver slowly and consistently, from 1% to 2% to 3%. At this point, the severity level of the case was raised and more engineers were brought in to figure out the root cause.

The team identified the related components in the data flow and checked all of them.

  • They verified that the DNS systems did return the IP addresses in round robin fashion.
  • They verified that none of the major Internet service providers was experiencing a significant outage.
  • They analyzed the traffic in Denver and Miami to see if the extra traffic came from a specific Internet service provider, a specific country, or a specific URL, but nothing stood out.
  • They verified that the report-generating system was working properly, and confirmed that the report was accurate and the system wasn't missing any data.

While the troubleshooting activities were taking place, the discrepancy kept growing slowly and consistently, from 3% to 5% to 10% over several weeks. A 10% traffic imbalance wasn't a problem by itself; the website was designed to absorb a much larger discrepancy. The problem was that the reason for the discrepancy remained mysterious. Such a clear and growing pattern without a clear cause was very strange. The severity level was raised again, the team was still in the dark, and everyone started to feel the pressure.

The first ray of light arrived two months later, when one of the engineers noticed that most of the extra traffic in Denver came from IE7 (identified by the User-Agent header of the HTTP requests, in case you are curious). That version of IE7 was only available on Windows Vista at the time, and Windows Vista had been released right before the initial report of the traffic imbalance.
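
In case you are wondering how such a pattern shows up, it is usually a straightforward User-Agent tally over each data center's access logs. The sketch below is hypothetical: the combined-log-style lines and the "MSIE 7.0" matching are assumptions for illustration, not foo.com's actual tooling.

    # Hypothetical sketch: tally requests by browser family from access logs.
    from collections import Counter

    def user_agent_counts(log_lines):
        """Count IE7 vs. everything else, assuming combined log format
        where the User-Agent is the last quoted field."""
        counts = Counter()
        for line in log_lines:
            ua = line.rsplit('"', 2)[-2] if line.count('"') >= 2 else "unknown"
            counts["MSIE 7.0" if "MSIE 7.0" in ua else "other"] += 1
        return counts

    sample = [
        '1.2.3.4 - - [01/Mar/2007] "GET / HTTP/1.1" 200 512 "-" '
        '"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"',
        '5.6.7.8 - - [01/Mar/2007] "GET / HTTP/1.1" 200 512 "-" '
        '"Mozilla/5.0 (Windows NT 5.1) Firefox/2.0"',
    ]
    print(user_agent_counts(sample))  # Counter({'MSIE 7.0': 1, 'other': 1})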

So the question became: why does Windows Vista prefer Denver?

The reason was still unknown, but the team felt relieved, as they knew the rest of the troubleshooting would be easy and straightforward. Why? SRE veterans know that the most challenging phase of troubleshooting is when there is no clue. When they are troubleshooting something and feel there is no clue, it means they haven't yet collected enough data. They must keep digging wider and wider, which is time consuming and difficult to a certain extent, especially when the troubleshooting effort is under time pressure. As soon as they find a clue pointing in a certain direction, digging 100 feet deep in that direction is much easier than turning an acre of land upside down.

As in a Sherlock Holmes story, the second half is the deciphered version.

In 2003, Microsoft proposed RFC 3484 and decided to adopt it in Windows Vista. RFC 3484 defined a "longest matching prefix" method for a client machine to select the server IP address from a round robin DNS response. Taking foo.com as an example, let's say a client whose IP address is 150.150.150.150 talks to foo.com. It asks the DNS server to resolve foo.com into IP addresses. The DNS server returns two IP addresses, 100.100.100.100 and 200.200.200.200. Instead of selecting the first IP address, the client uses the following procedure to decide which foo.com IP it should connect to:

a) Convert the IP addresses from decimal to binary (e.g. 100 = 01100100, 150 = 10010110, 200 = 11001000)

  • Client IP = 150.150.150.150 = 10010110 . 10010110 . 10010110 . 10010110
  • foo IP 1  = 100.100.100.100 = 01100100 . 01100100 . 01100100 . 01100100
  • foo IP 2  = 200.200.200.200 = 11001000 . 11001000 . 11001000 . 11001000

b) From left to right, compare the binary string of the client IP with foo IP 1 and count the number of matching bits until the first non-matching bit is reached. The first bit of the client IP is "1" and the first bit of foo IP 1 is "0", so the length of the matching prefix is 0 (no matching bits at all).

c) In the same way, compare the binary string of the client IP with foo IP 2. The first bit matches (it is 1 in both the client IP and foo IP 2), but the second bit does not (it's 0 in the client IP and 1 in foo IP 2), so the length of the matching prefix is 1 (only the first bit matches).

d) foo IP 2 is selected because it has a longer matching prefix than foo IP 1 (1 vs 0).
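
To make the procedure concrete, here is a minimal Python sketch of this selection logic using the article's example addresses. It is a simplification of RFC 3484 (the real algorithm has several more rules), and the helper names are mine.

    # Simplified "longest matching prefix" selection (not the full RFC 3484).
    import ipaddress

    def common_prefix_len(a: str, b: str) -> int:
        """Number of leading bits shared by two IPv4 addresses."""
        xor = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
        return 32 - xor.bit_length()

    def pick_destination(client_ip: str, candidates: list[str]) -> str:
        """Prefer the candidate sharing the longest prefix with the client.
        On a tie, max() keeps the first candidate, i.e. the DNS answer order."""
        return max(candidates, key=lambda ip: common_prefix_len(client_ip, ip))

    client = "150.150.150.150"
    foo_ips = ["100.100.100.100", "200.200.200.200"]
    print(common_prefix_len(client, foo_ips[0]))  # 0
    print(common_prefix_len(client, foo_ips[1]))  # 1
    print(pick_destination(client, foo_ips))      # 200.200.200.200 (Denver)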

Around the same time that RFC 3484 was proposed, two other technologies were becoming popular: broadband Internet and 802.11 Wi-Fi. More and more households switched to cable or DSL and set up a wireless router for their home Internet access. Most of the wireless routers (such as Linksys or D-Link) were designed to assign addresses from the 192.168.0.0 to 192.168.255.255 private IP range to the home computers.

Those events were unrelated, until January 2007 when Windows Vista was released.

Let's see what happened when Windows Vista users connected to foo.com via their wireless routers at home:

  • Client IP = 192.168.100.100 = 11000000 . 10101000 . 01100100 . 01100100
  • foo IP 1  = 100.100.100.100 = 01100100 . 01100100 . 01100100 . 01100100
  • foo IP 2  = 200.200.200.200 = 11001000 . 11001000 . 11001000 . 11001000

Comparing the client IP with foo IP 1, the length of the matching prefix is 0. Comparing the client IP with foo IP 2, the first four bits match (1100), so the length of the matching prefix is 4. Windows Vista therefore selected foo IP 2, which was in Denver. As time went on, more and more home Wi-Fi users upgraded to Windows Vista, and the engineers at foo.com observed a growing traffic imbalance between their Denver and Miami data centers.
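
Running the same comparison for a home client behind NAT makes the skew obvious. The helper from the earlier sketch is repeated here so the snippet runs on its own:

    # The home Wi-Fi case: a 192.168.x.x client always prefers 200.200.200.200.
    import ipaddress

    def common_prefix_len(a: str, b: str) -> int:
        xor = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
        return 32 - xor.bit_length()

    client = "192.168.100.100"
    print(common_prefix_len(client, "100.100.100.100"))  # 0 -> Miami loses
    print(common_prefix_len(client, "200.200.200.200"))  # 4 -> Denver always wins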

Technically speaking, the "longest matching prefix" method may be helpful only if both the client and the server are on public IP addresses. It doesn't make any sense when the client is on a private IP address, because private IP addresses are not routable on the Internet, nor do they indicate the distance to any public IP address.

After the root cause was identified, the next step was to find and implement a solution to rebalance the traffic. Microsoft could not force its users to patch Windows Vista, so the engineers at foo.com had to look for a solution on the server side. What they did was change the Denver IP from 200.200.200.200 to 100.100.200.100, so that both the Denver IP and the Miami IP have the same matching prefix length when compared with the 192.168 home Wi-Fi addresses. As a result, the longest matching prefix method in RFC 3484 was bypassed, the round robin behavior was restored, and the traffic finally became balanced again.
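
Under the same simplified model, a quick check shows why the renumbering works: against a 192.168.x.x client, both data center IPs now tie at zero matching bits, so the longest-prefix rule no longer has a preference and the order of the DNS answer decides again.

    # After the change, both data center IPs tie against home Wi-Fi clients.
    import ipaddress

    def common_prefix_len(a: str, b: str) -> int:
        xor = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
        return 32 - xor.bit_length()

    client = "192.168.100.100"
    print(common_prefix_len(client, "100.100.100.100"))  # 0 (Miami)
    print(common_prefix_len(client, "100.100.200.100"))  # 0 (Denver, new IP)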

To make the story complete:

  • Two years later, in March 2009, Microsoft accepted that the "longest matching prefix" method was inappropriate and fixed it in Windows Server 2008 R2 and Windows 7.
  • Five years later, in September 2012, RFC 3484 was finally obsoleted and replaced by RFC 6724.

Reviewing this troubleshooting case, there are several principles we learned and have integrated into our decision-making procedures ever since. These principles are becoming more and more valuable as the cloud grows more and more complicated, which is why I was inspired to write this article.

  1. When proposing an RFC, its impact on the entire Internet community should be carefully evaluated. The Internet is a complicated system. A technology that benefits one area may have an unexpected impact on other areas.
  2. When adopting an RFC, its suitability for the current ecosystem should be carefully considered. The Internet is a fast-growing system. An RFC that was appropriate yesterday may become inappropriate today as the ecosystem evolves.
  3. Before releasing a product, its features should be thoroughly tested in the real world. The Internet is a group of systems without central governance. If an inappropriate feature gets released, fixing it can take a lot of time and effort.

Have you run into similar incidents? What are your guiding principles for adopting RFCs?