Troubleshooting a Connection Timeout Issue with tcp_tw_recycle Enabled

By: Edward Lin and Huai Jiang

Availability and stability are very important for eBay's site, especially for those applications that take high traffic and are dependent on many other applications, such as CAL (our Centralized Application Logging framework). This blog shares an issue that happened recently that impacted the availability and stability of CAL, and how we found out the root cause using tcpdump and systemtap.

CAL is eBay's Centralized Application Logging framework. The main purpose of CAL is to facilitate a centralized collection of application servers' local logs and to provide reporting on the collected data. These reports provide great insight into important areas of the eBay site—invaluable for making decisions on all aspects of business at eBay.

In the CAL infrastructure, as the following figure shows, a CAL client will try to establish a long-term TCP connection to a CAL VIP (virtual IP address). There are multiple load balancers (LBs) and VIPs for CAL, because CAL's traffic is very large; a single pair of load balancers can't handle this large of a connection. A client will randomly pick one VIP to connect with based on DNS. If a connection can't be established, for instance, if the connection timed out or was refused, the CAL client will retry and pick another VIP, until it establishes a connection successfully.

Recently, the CAL team and the operation team found CAL VIPs have some weird issues. VIPs sometimes return connection timeout errors intermittently, like the following nc (Netcat) command output, it returned connection timeouts frequently, which impacted CAL’s stability.

Troubleshooting

Here’s how we handled troubleshooting.

1. TCPDUMP in LB side

Why does a VIP return a connection timeout error sometimes? The VIP’s configuration looked fine, so we looked at the tcpdump to try to find a reason. The following Wireshark screenshots show what happened when the connection timed out. The LB sent a RST packet because of flow expiration (handshake timeout). The second screenshot shows how the SNAT (Source Network Address Translation) IP sent the SYN packets three times, but the server ignored all of them. After 15s, the LB SNAT IP sent an RST packet to close the connection.

2. Check SYN Queue and Accept Queue in Server side

But why did the server ignore the LB's SYN packets? As there was CAL traffic migration going on, the most of traffic of CAL was migrated from the previous old application to the new application. The immediate suspect was targeting the new application. The new application not able to catch up, which then caused the SYNC Queue to fill up. The following illustration shows how the the SYN Queue and the Accept Queue work in the Linux Kernel.

At first glance, the evidence seems support this assumption. However, running a probe (Systemtap Scrpit) to catch queue overflow didn't catch anything.

We rechecked the OS network statics using netstat -s. The majority of SYN to LISTEN that dropped did not happen because of socket overflow. The passive connections were rejected because of the time stamp.

'Passive connections rejected because of time stamp' is dropped by LINUX_MIB_PAWSPASSIVEREJECTED. It only happened in one place in this Linux kernel code. When the function tcp_peer_is_proven(req, dst, true) returns false, the kernel will drop the SYN packets. The kernel code snippet is below.

We then suspected that the LB SNAT plus the Linux Kernel parameter net.ipv4.tcp_tw_recycle together caused this issue.

3. How the LB SNAT works

The following illustration shows how SNAT works on the LB side. There's a Connection Table maintained in the LB's memory, which records the mapping relationship between the Client-VIP connection and the SNAT-PoolMember connection. In layer 4 mode, the LB changes only src_ip, src_port, dst_ip, dst_port, then sends it to another connection side.

Because TSval is based on the client CPU time, which is different across clients, different packets from different clients have a different TSval.

4. TCPDUMP and use systemtap script to confirm the issue

A new systemtap script was written to probe into these functions. Running tcpdump and probing on the CAL server at the same time produced matching results.

From the tcpdump side, the application running on CAL server port 1120 sends a (FIN, ACK) to LB SNAT IP. Then LB sends an ACK, its TSval is 517740536, and then the connection closed. After several seconds, the LB used the same SNAT IP trying to establish a new connection to same destination IP and port. Because the previous ACK was send at 18s and the new SYN is received at 60s, the interval is less than 60s, (u32)get_seconds() - tm->tcpm_ts_stamp < TCP_PAWS_MSL returns true (60−18<60). Then tm->tcpm_ts (517740536) and req->ts_recent (481284815) are compared. As the stap script's output, the delta is larger than TCP_PAWS_WINDOW, which is 1. So the function tcp_peer_is_proven returns false and the kernel drops this packet.

Conclusion

Because TSval is based on the client CPU time, which is different across different clients, the kernel may drop some connections unexpectedly if the traffic was through a load balancer. So net.ipv4.tcp_tw_recycle should be disabled in SNAT network. This feature has been totally removed in the kernel since Linux 4.1.

Reference

Tags: Cloud, Performance Engineering