Identifying and Eliminating Extra DB Latency

I would like to share my experience troubleshooting a recent issue with Oracle databases in our datacenters. While the details in this post involve batch jobs, the lessons learned apply to any system with Oracle SQL traffic between two or more datacenters.

The problem

We noticed that the running time of a set of batch jobs had suddenly increased substantially. Some increase was understandable, since we had moved a significant percentage of the database hosts accessed by these batch jobs to another eBay datacenter. However, the latency had jumped beyond what had been expected or could be explained.

The investigation

The first step was to check our logs for the most expensive SQL statements.

For batch jobs, it is easy to distinguish intra-colo transactions from cross-colo transactions, since all batches run on boxes located in the same datacenter. I immediately found a discrepancy between the SQL times for intra-colo calls and cross-colo calls. I expected a consistent difference of 20 ms in both the minimum and average times, since 20 ms is the round-trip network latency between the two datacenters involved. In reality, the differences far exceeded 20 ms. The following table shows the statistics collected for approximately 8,500 calls in one hour:

Statistic                      Intra-Colo           Cross-Colo
Exec Count                     8600                 8490
Minimum/Maximum/Average (ms)   1.00/119.00/5.70     21.00/487.00/36.25
Total Time (ms)                49028.00             307758.00

On further investigation, I discovered an additional complication. For some queries, the time difference was consistently 20 ms, and for other queries it was consistently 40 ms.
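For illustration, the per-colo aggregation behind the table above takes only a few lines of code. The sketch below is not our actual log-analysis tooling; it assumes a hypothetical list of per-call records carrying the elapsed SQL time and a flag indicating whether the DB host is in the same colo as the batch box.

import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ColoLatencyReport {

    // Hypothetical per-call log record: elapsed SQL time in ms and whether the
    // DB host is in the same datacenter as the batch box.
    record SqlCall(double elapsedMs, boolean sameColo) {}

    public static void main(String[] args) {
        List<SqlCall> calls = List.of(
                new SqlCall(5.2, true),
                new SqlCall(4.8, true),
                new SqlCall(36.4, false),
                new SqlCall(41.9, false));

        // Bucket calls into intra-colo vs cross-colo and compute
        // count / min / max / average, mirroring the table above.
        Map<Boolean, DoubleSummaryStatistics> byColo = calls.stream()
                .collect(Collectors.partitioningBy(SqlCall::sameColo,
                        Collectors.summarizingDouble(SqlCall::elapsedMs)));

        byColo.forEach((sameColo, s) -> System.out.printf(
                "%s: count=%d min=%.2f max=%.2f avg=%.2f total=%.2f ms%n",
                sameColo ? "Intra-Colo" : "Cross-Colo",
                s.getCount(), s.getMin(), s.getMax(), s.getAverage(), s.getSum()));
    }
}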

Next, I looked into database performance. I found that the query execution time was the same on all of our database hosts, and that the issue was reproducible independent of the batch jobs, using only tnsping and the Oracle SQL client.
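While we used tnsping and the SQL client for the manual checks, the same measurement can be approximated over JDBC. The sketch below is only an illustration, with a placeholder URL and credentials, and assumes the Oracle JDBC driver is on the classpath; it times a trivial query so that the elapsed time is dominated by the network round trip rather than by query execution.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class RoundTripProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and credentials; point this at a cross-colo DB host.
        String url = "jdbc:oracle:thin:@//db-host.example.com:1521/SERVICE";

        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement("SELECT 1 FROM DUAL")) {

            // Run the query repeatedly and report min/avg elapsed time.
            // Against an idle database, the result approximates the SQL*Net
            // round-trip cost rather than the execution cost.
            long minNanos = Long.MAX_VALUE;
            long totalNanos = 0;
            int runs = 100;

            for (int i = 0; i < runs; i++) {
                long start = System.nanoTime();
                try (ResultSet rs = ps.executeQuery()) {
                    rs.next();
                }
                long elapsed = System.nanoTime() - start;
                minNanos = Math.min(minNanos, elapsed);
                totalNanos += elapsed;
            }

            System.out.printf("min=%.2f ms avg=%.2f ms%n",
                    minNanos / 1e6, totalNanos / (1e6 * runs));
        }
    }
}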

At this point, I had all but ruled out a database issue; the statistics showed the same results for all DB hosts in the datacenter. Instead, I suspected an issue related to either the network or network security.

A possible firewall capacity problem was ruled out, because a firewall's latency contribution is constant and much lower than the latency introduced by the network. When we looked at the network, ping consistently showed 20 ms, regardless of the time of day or traffic levels. Of course, in real life we do not merely exchange pings; we send actual SQL queries to the DB hosts, so the next step was a tcpdump analysis on the DB hosts.

As the root cause of our problem was still unknown, there was no guarantee that the issue would show up in a tcpdump capture taken while running queries manually. After trying twice and not getting enough information, we concluded that the only reliable way to capture the problem was to run tcpdump while the batch job was running.

All that remained was to wait for the next high-volume batch run. This time we got enough data, and we saw an answer almost immediately (saving us from having to dive even deeper and look at MTU or TCP window size as the cause of additional round trips). Here is the analysis of the packets for a single query:

No.  Time     Source  Destination  Protocol  Length  Info
1    0.000000 [IP1]   [IP2]        TNS       549     Request, Data (6), Data
2    0.000714 [IP2]   [IP1]        TNS       852     Response, Data (6), Data
3    0.021257 [IP1]   [IP2]        TNS       71      Request, Data (6), Data
4    0.021577 [IP2]   [IP1]        TNS       153     Response, Data (6), Data

Processing the query required four packets spanning about 21.6 ms. Since the round-trip time (RTT) between the datacenters is about 20 ms, roughly 10 ms of transit time elapsed before packet 1, while the request traveled from the first datacenter to the second, where the capture was taken. Another 10 ms elapsed after packet 4, on the way back to the first datacenter. The total time for the query, as seen from the first datacenter, was therefore about 41.6 ms, the equivalent of two full round trips.

We saw that the SELECT statement was in packet 1 and that the response appeared to be in packet 2. We didn’t know what packets 3 and 4 were, but their small size suggested they were some kind of acknowledgement.

The solution

At this point, we had confirmation that each query involved four network packets instead of two. The fact that every execution took two round trips indicated that more soft parsing was occurring than was necessary. The question was: why were frequently run queries being parsed over and over? Once parsed, a query should be added to the app server's statement cache. We increased the statement cache size parameter, which dictates how many parsed SQL statements can remain in the cache at a given time, from 80 to 500.
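Our change was made in the app server's connection pool settings rather than in application code, but the idea can be illustrated at the driver level. The following sketch is an assumption-laden example, not our actual configuration: it enables the Oracle JDBC driver's implicit statement cache and raises the per-connection cache size, so that repeated executions of the same SQL reuse an already-parsed statement instead of triggering another soft parse. The URL, credentials, and query are hypothetical.

import java.sql.PreparedStatement;
import java.sql.ResultSet;

import oracle.jdbc.OracleConnection;
import oracle.jdbc.pool.OracleDataSource;

public class StatementCacheExample {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; not our production configuration.
        OracleDataSource ds = new OracleDataSource();
        ds.setURL("jdbc:oracle:thin:@//db-host.example.com:1521/SERVICE");
        ds.setUser("user");
        ds.setPassword("password");

        // Turn on the driver's implicit statement cache so that statements
        // closed by the application are kept, already parsed, for reuse.
        ds.setImplicitCachingEnabled(true);

        try (OracleConnection conn = (OracleConnection) ds.getConnection()) {
            // The cache size caps how many parsed statements are retained per
            // connection; our fix was the equivalent change at the app server
            // level, raising the limit from 80 to 500.
            conn.setStatementCacheSize(500);

            String sql = "SELECT item_id FROM items WHERE seller_id = ?"; // hypothetical query
            for (int i = 0; i < 3; i++) {
                // Each prepare after the first should be served from the cache,
                // avoiding a repeated soft parse of the same SQL text.
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    ps.setLong(1, 42L);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            System.out.println(rs.getLong(1));
                        }
                    }
                }
            }
        }
    }
}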

When the tuned settings went into effect, we were able to eliminate an extra cross-colo round trip for a majority of SQL requests. We immediately saw a positive impact, observing a 30% reduction in execution time for all instances of a batch job.

This experience demonstrates how even small changes in DB settings can positively affect execution response times. And for me, it was both challenging and interesting to drive problem resolution in an area where I did not have much prior experience.