Tess.IO is eBay’s cloud platform, with a vision of providing a world-class build, ship, and run experience for eBay’s applications at scale, while ensuring high efficiency, security, and agility for developers. Tess.IO leverages Kubernetes under the hood.
As more and more applications are deployed on Tess.IO clusters, the scalability and capacity of a cluster become more and more important. Kubernetes upstream claims it officially supports 5000 nodes, but there is a long journey to supporting that in a real production environment. To make a Tess.IO cluster production ready, we deploy additional components on each node, such as a network agent to configure the network for pods, beats, and a node-problem-detector for monitoring. All these add-on components on every node need to interact with the control plane of the cluster. Meanwhile, cloud-native customer pods add more load on the cluster by using CustomResourceDefinitions. All these factors restrict the scale of a production cluster.
This article describes how to achieve the 5000 nodes goal for the Tess.IO cluster.
Tess.IO cluster architecture
To discuss the scalability of a cluster, we must first describe the deployment architecture of a Tess cluster. To achieve the reliability goal of 99.99%, we deploy five master nodes in a Tess.IO cluster to run the Kubernetes core services (apiserver, controller manager, scheduler, etcd, and the etcd sidecar). Besides the core services, there are also Tess add-ons on each node that expose metrics, set up networks, or collect logs. All of them watch the resources they care about from the cluster control plane, which brings additional load on the Kubernetes control plane. All the IPs used by the pod network are globally routable in the eBay data center, and the network agent on each node is in charge of configuring the network on the host.
Experiments with Kubemark
First, we should figure out whether etcd/apiserver, with the current configuration and deployment of our Tess cluster, can sustain the pressure from the control plane components and the system daemons, because the connection/API load from the system daemonsets and kubelets increases with the number of nodes. This is the load from the cluster itself, so it is the baseline.
Then, we should figure out the capability to respond to customer changes on top of this built-in base API load. For example, we can create/delete 1000 pods at the same time and measure the error rate, QPS, and latency percentiles; a sketch of such a load generator follows the table below.
| API calls to create/delete pods in parallel | 50% Latency | 75% Latency | 90% Latency | 95% Latency | 99% Latency | Error Rate |
| --- | --- | --- | --- | --- | --- | --- |
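As an illustration, here is a rough sketch of such a parallel create test. This is not the actual Tess.IO test harness; it assumes a recent client-go with context-aware calls, and the namespace, pod names, and image are placeholders.

```go
package loadgen

import (
	"context"
	"fmt"
	"sync"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createPodsInParallel fires n pod creations concurrently and returns the
// per-call latencies plus the number of failed calls, from which the error
// rate, QPS, and latency percentiles can be computed.
func createPodsInParallel(client kubernetes.Interface, n int) ([]time.Duration, int) {
	var (
		wg        sync.WaitGroup
		mu        sync.Mutex
		latencies []time.Duration
		errCount  int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			pod := &v1.Pod{
				ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("load-%d", i)},
				Spec: v1.PodSpec{
					Containers: []v1.Container{{Name: "pause", Image: "k8s.gcr.io/pause:3.1"}},
				},
			}
			start := time.Now()
			_, err := client.CoreV1().Pods("scale-test").Create(context.TODO(), pod, metav1.CreateOptions{})
			elapsed := time.Since(start)

			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				errCount++
				return
			}
			latencies = append(latencies, elapsed)
		}(i)
	}
	wg.Wait()
	return latencies, errCount
}
```

A symmetric delete pass and a simple percentile calculation over the returned latencies fill in the columns of the table above.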
Besides this, we should also figure out the capability to withstand the stress from customer pods, for example, the upper bound on the number of watchers, or how many LIST requests the apiserver can bear.
| Count of watch pods | Update rate of pods (qps) | Error rate | Apiserver/etcd Memory | Apiserver/etcd CPU | 75% Update Latency | 90% Update Latency | 95% Update Latency |
| --- | --- | --- | --- | --- | --- | --- | --- |

| List pods calls/min | Update rate of pods (qps) | Error rate | Apiserver/etcd Memory | Apiserver/etcd CPU | 75% List Latency | 90% List Latency | 95% List Latency |
| --- | --- | --- | --- | --- | --- | --- | --- |
Simulated test environment
In order to simulate 5000 nodes, we spun up 5000 kubemark pods in cluster B and registered them to the apiserver of cluster A. The benefits are:
- It spares 5000 VMs in the eBay datacenter.
- It is easy to scale out to 5000 nodes with Kubernetes cluster B (using private IPs).
- It saves public IPs in the target cluster. (A public IP is globally routable in the eBay datacenter.)
- It isolates the impact between kubemark nodes and target cluster A.
With these kubemark pods, we added 5000 fake nodes to cluster A. We also need to simulate the behavior of the other per-node components: we abstract just the logic that interacts with the apiserver and build it into containers inside the kubemark pod. One kubemark pod therefore includes one fake node plus the cluster add-on daemonsets located on that node; a sketch of such a pod is shown below.
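A hypothetical layout of one kubemark pod, expressed as a Go pod spec. The container names, images, and arguments here are illustrative, not the real Tess.IO manifests; only the idea of bundling the hollow node with apiserver-facing add-on simulators is taken from the text.

```go
package kubemarksim

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// hollowNodePod sketches the layout of one kubemark pod: the hollow
// kubelet/proxy from upstream kubemark, plus containers that replay only the
// apiserver-facing behavior of the Tess per-node add-ons against cluster A.
func hollowNodePod() *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "hollow-node-"},
		Spec: v1.PodSpec{
			Containers: []v1.Container{
				// Registers one fake node in cluster A and behaves like a kubelet.
				{Name: "hollow-kubelet", Image: "kubemark:fake", Args: []string{"--morph=kubelet"}},
				{Name: "hollow-proxy", Image: "kubemark:fake", Args: []string{"--morph=proxy"}},
				// Simulated Tess add-ons: they keep only the list/watch/patch
				// traffic of the real daemonsets, without touching any host.
				{Name: "fake-network-agent", Image: "fake-network-agent:latest"},
				{Name: "fake-node-problem-detector", Image: "fake-npd:latest"},
				{Name: "fake-beats", Image: "fake-beats:latest"},
			},
		},
	}
}
```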
Issues found with 5k nodes
When running the tests described in the above section, the following issues arose.
- Failed to recover from failures on a cluster with 5k nodes and 150k pods:
  - etcd exhausts a large amount of memory (around 100GB) on each master node.
  - etcd has frequent leader elections.
  - The etcd log shows:

        2018-07-02 17:17:43.986312 W | etcdserver: apply entries took too long [52.361859051s for 1 entries]
        2018-07-02 17:17:43.986341 W | etcdserver: avoid queries with large range/delete range!

  - Apiserver containers keep restarting every few minutes.
- Pod scheduling is slow in a large cluster:
  - With 5k nodes and no pods in the cluster, it takes about 20 minutes to schedule 1000 pods, an average cost of 1~2s per pod. When there are already 30,000 pods, the average cost of scheduling one pod reaches up to 1 minute.
- Large LIST requests can bring down the cluster:
  - In a Tess.IO cluster, a pod is a very important resource that represents an application instance. The indicator of cluster scalability is therefore not only the count of nodes, but also the count of pods.
  - Pod information matters to both the cluster control plane components and customer applications, so both keep querying pods.
- With the increasing number of applications onboard, the pod count becomes very large, and LIST requests for pod resources become large range requests. A large number of concurrent LIST pod requests can exhaust the buffer window of the kube-apiserver and impact core probes.
  - Based on the test results, the error rate of LIST-all-pods requests reaches 50%, and node PATCH and node LIST requests are impacted.
- Etcd keeps changing leaders:
  - The etcd sidecar in a Tess.IO cluster takes an etcd snapshot every half an hour, and the etcd cluster changes its leader while the snapshot is being taken.
Solutions for scaling up to 5000 nodes
Basically, there is one prerequisite for a large-scale cluster: increase the maximum number of mutating in-flight requests to 1000. For 5k nodes, considering only the PATCH node requests from the kubelets (each kubelet patches its node every 10s), the average is 500 PATCH requests per second (5000/10 = 500). With 5 apiservers, each apiserver ideally gets 100 requests per second. Besides that, node-problem-detector and other core components also patch nodes at the same time. If the connections that patch nodes are not evenly distributed, if some apiservers are down, or if some long READ transactions slow down the patches, it is very easy to hit the default in-flight limit of 200 and reject requests (response code 429). The clients then retry, which brings additional load on the apiserver.
Recover from failures
On each node, there are several daemonsets that configure the network, collect metrics/logs, or report hardware information. So the kubelet is not the only component watching all the pods on the node; these daemonsets watch them, too.
Make sure daemonsets don't override ListOptions
In the default List/Watch mechanism, the first request is a LIST call. It sets ResourceVersion=0 in its ListOptions and gets the resource list from the apiserver cache instead of from etcd. We must make sure that none of these daemonsets override the ListOptions when registering customized ListFunc and WatchFunc. Otherwise, all of them bypass the apiserver cache and hit etcd directly. If an apiserver gets restarted, or a large number of daemonset-like applications are deployed at the same time, all the LIST pods requests hit etcd. This is a disaster for kube-apiserver.
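A minimal sketch of the pitfall and its fix with client-go (assuming a recent client-go with context-aware calls; the helper name and field selector are illustrative): a customized ListFunc that builds its own ListOptions drops the ResourceVersion="0" set by the reflector, so the initial LIST bypasses the apiserver watch cache. Keeping the incoming options and only adding what the daemonset needs preserves the cache hit.

```go
package nodedaemon

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// newPodListWatch builds a ListWatch for the pods on one node without
// discarding the ListOptions handed in by the reflector. The initial LIST
// therefore keeps ResourceVersion="0" and is served from the apiserver cache.
func newPodListWatch(client kubernetes.Interface, nodeName string) *cache.ListWatch {
	selector := "spec.nodeName=" + nodeName
	return &cache.ListWatch{
		ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
			// Only add the field selector; keep ResourceVersion and the rest.
			options.FieldSelector = selector
			return client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), options)
		},
		WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
			options.FieldSelector = selector
			return client.CoreV1().Pods(metav1.NamespaceAll).Watch(context.TODO(), options)
		},
	}
}

// The anti-pattern is constructing fresh ListOptions inside ListFunc, e.g.
// metav1.ListOptions{FieldSelector: selector}: the reflector's
// ResourceVersion="0" is lost and every resync LIST goes straight to etcd.
```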
Return an error before watch cache is ready
There are five daemons on each node that list/watch all the pods on that node. These components query pods with a fieldSelector (the first LIST request is sent with resourceVersion=0 and spec.nodeName=<nodeName>). Before the pod watch cache in the apiserver gets ready, the apiserver forwards these LIST requests directly to etcd. Because of the large number of pods, it takes several seconds for the pod watch cache to get ready. Also, when an apiserver restarts, all the pod watchers resend their LIST requests in parallel. All these requests hit etcd directly, which puts huge pressure on the etcd server.
The solution is to return an error before the watch cache is ready.
    $:~/go/src/k8s.io/kubernetes$ git diff staging/src/k8s.io/apiserver/pkg/storage/cacher.go
    diff --git a/staging/src/k8s.io/apiserver/pkg/storage/cacher.go b/staging/src/k8s.io/apiserver/pkg/storage/cacher.go
    index 622ca6a..8ba0222 100644
    --- a/staging/src/k8s.io/apiserver/pkg/storage/cacher.go
    +++ b/staging/src/k8s.io/apiserver/pkg/storage/cacher.go
    @@ -488,9 +490,15 @@ func (c *Cacher) List(ctx context.Context, key string, resourceVersion string, p
             return err
         }
    +    if listRV == 0 && resourceVersion == "0" && !c.ready.check() {
    +        glog.Infof("apiserver cache is not ready! just return err when resourceVersion=0\n")
    +        return errors.ServiceUnavailable("apiserver cache is not ready")
    +    }
    +
         if listRV == 0 && !c.ready.check() {
             // If Cacher is not yet initialized and we don't require any specific
             // minimal resource version, simply forward the request to storage.
             return c.storage.List(ctx, key, resourceVersion, pred, listObj)
         }
Store attributes in watch cache
After the cache is ready, all the LIST requests are sent to the apiserver again in parallel. Though the etcd cluster is now healthy, the stress moves to the apiserver. The pprof output shows that the apiserver is busy filtering the pods on the specified node out of the 150k pods.
    File: kube-apiserver
    Type: cpu
    Time: Jul 5, 2018 at 5:26am (-07)
    Duration: 30.14s, Total samples = 3.73mins (742.04%)
    Entering interactive mode (type "help" for commands, "o" for options)
    (pprof) top
    Showing nodes accounting for 164.39s, 73.51% of 223.62s total
    Dropped 935 nodes (cum <= 1.12s)
    Showing top 10 nodes out of 96
          flat  flat%   sum%        cum   cum%
        40.78s 18.24% 18.24%    181.28s 81.07%  k8s.io/kubernetes/pkg/registry/core/pod.GetAttrs
        25.50s 11.40% 29.64%    119.42s 53.40%  runtime.mapassign
        23.53s 10.52% 40.16%     23.53s 10.52%  runtime.procyield
        17.91s  8.01% 48.17%     17.91s  8.01%  runtime.memclrNoHeapPointers
        12.11s  5.42% 53.59%     12.11s  5.42%  runtime.futex
        11.35s  5.08% 58.66%     11.35s  5.08%  runtime.heapBitsForObject
           11s  4.92% 63.58%     27.96s 12.50%  runtime.scanobject
         8.84s  3.95% 67.53%      8.84s  3.95%  runtime.memmove
         7.12s  3.18% 70.72%      7.16s  3.20%  runtime.greyobject
         6.25s  2.79% 73.51%      6.25s  2.79%  runtime.memeqbody
    (pprof) tree pod.GetAttrs
    Showing nodes accounting for 176.74s, 79.04% of 223.62s total
    Dropped 69 nodes (cum <= 1.12s)
    ----------------------------------------------------------+-------------
          flat  flat%   sum%        cum   cum%   calls calls% + context
    ----------------------------------------------------------+-------------
                                           181.28s   100% |   k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage.(*SelectionPredicate).Matches
        40.78s 18.24% 18.24%    181.28s 81.07%            | k8s.io/kubernetes/pkg/registry/core/pod.GetAttrs
                                           119.07s 65.68% |   runtime.mapassign
                                            19.35s 10.67% |   runtime.makemap
                                             2.07s  1.14% |   runtime.writebarrierptr
    ----------------------------------------------------------+-------------
                                           119.07s   100% |   k8s.io/kubernetes/pkg/registry/core/pod.GetAttrs
        25.47s 11.39% 29.63%    119.07s 53.25%            | runtime.mapassign
                                            78.31s 65.77% |   runtime.newarray
                                            12.77s 10.72% |   runtime.typedmemmove
                                             1.85s  1.55% |   runtime.aeshashbody
                                             0.24s   0.2% |   runtime.writebarrierptr
    ----------------------------------------------------------+-------------
GetAttrs extracts the fields and labels from an object and puts them into a Go map, and mapassign is a time-consuming operation. In versions before release 1.10, this function is invoked inside the filter loop, so with 150k pods, each LIST request with a fieldSelector/labelSelector calls it 150k times. The solution is to move GetAttrs to a unified place: when an object from etcd is processed into the apiserver cache, its attributes are extracted once and stored in the cache together with the object.
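A minimal sketch of the idea, not the exact upstream patch: compute the label and field sets once, when an event from etcd is written into the watch cache, so that a filtered LIST only performs selector matches against precomputed sets. The type and function names here are illustrative.

```go
package watchcache

import (
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/runtime"
)

// attrFunc stands in for pod.GetAttrs: it extracts the label and field sets
// from an object.
type attrFunc func(obj runtime.Object) (labels.Set, fields.Set, error)

// cachedEntry is a hypothetical watch-cache element that keeps the extracted
// labels and fields next to the object, so GetAttrs runs once per update
// instead of once per object per filtered LIST.
type cachedEntry struct {
	object runtime.Object
	labels labels.Set
	fields fields.Set
}

// newCachedEntry runs on the watch path, when an event from etcd is written
// into the apiserver cache.
func newCachedEntry(obj runtime.Object, getAttrs attrFunc) (*cachedEntry, error) {
	l, f, err := getAttrs(obj)
	if err != nil {
		return nil, err
	}
	return &cachedEntry{object: obj, labels: l, fields: f}, nil
}

// matches runs on the LIST path: it is now just a selector match over the
// precomputed sets, with no per-object map allocation.
func matches(e *cachedEntry, labelSel labels.Selector, fieldSel fields.Selector) bool {
	return labelSel.Matches(e.labels) && fieldSel.Matches(e.fields)
}
```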
After all these changes, the apiserver and etcd can tolerate a spike of large queries when the apiserver restarts or clients resync their list/watch.
Improve scheduler in large clusters
A pod in the following state drags down scheduling performance, even if there is only one such pod in the cluster:

- The pod is in Terminating state.
- The node where the pod was located is gone.
It is inevitable to have this kind of pod in a large cluster in a real production environment, so it is important to mitigate the situation.
Why did performance issues show up in this case?
There is one predicate checked when scheduling any pod: whether placing the pod onto a node would break any anti-affinity terms indicated by the existing pods. In the normal case, this check finds all the pods in the cluster that match the affinity/anti-affinity terms of the pod being scheduled once, stores them in predicate metadata, and then only checks the topology of those matching pods for each node. But when one or more pods in the cluster enter the state above, the scheduler does not create the predicate metadata and instead executes a slow path that invokes the FilteredList() function when scheduling any pod. This slow path goes through all existing pods while traversing each node, so its latency grows with the count of nodes and pods.
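A simplified illustration, not the scheduler's actual code, of why the slow path is so much more expensive: with predicate metadata, the matching pods are computed once per scheduling cycle; without it, every node check walks every pod in the cluster. The function names below are made up for illustration.

```go
package schedulerpath

import v1 "k8s.io/api/core/v1"

// fastPath mirrors the normal case: scan all pods once per pod being
// scheduled, keep the matching pods grouped by node in "metadata", and let
// each node predicate consult only its own entry.
func fastPath(pods []*v1.Pod, matches func(*v1.Pod) bool) map[string][]*v1.Pod {
	byNode := map[string][]*v1.Pod{}
	for _, p := range pods { // one pass over ~150k pods
		if matches(p) {
			byNode[p.Spec.NodeName] = append(byNode[p.Spec.NodeName], p)
		}
	}
	return byNode
}

// slowPath mirrors what happens when the predicate metadata is missing: the
// full pod list is re-scanned for every one of the ~5k nodes, for every pod
// being scheduled.
func slowPath(pods []*v1.Pod, node string, matches func(*v1.Pod) bool) []*v1.Pod {
	var out []*v1.Pod
	for _, p := range pods {
		if p.Spec.NodeName == node && matches(p) {
			out = append(out, p)
		}
	}
	return out
}
```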
How to mitigate this situation?
- Create the predicate metadata anyway, even if the node spec for some pod cannot be found, so that the slow path is bypassed.
- Enhance the slow path, the FilteredList() function (based on Kubernetes 1.9, which is used for the above test); see the sketch after this list:
  - Change Mutex to RWMutex.
  - Remove the string concatenation from a function that is invoked many times.
  - Avoid expensive array growth.
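A minimal sketch, not the actual Kubernetes 1.9 patch, of the kind of changes listed above for the slow path: take a read lock instead of an exclusive lock, avoid rebuilding string keys inside the hot loop, and preallocate the result slice so it does not grow repeatedly. The surrounding cache type is simplified.

```go
package schedulercache

import (
	"sync"

	v1 "k8s.io/api/core/v1"
)

// podCache is a simplified stand-in for the scheduler's cached pod list.
type podCache struct {
	mu   sync.RWMutex
	pods map[string]*v1.Pod // keyed by "namespace/name", built once on add/update
}

// FilteredList returns the cached pods accepted by filter.
func (c *podCache) FilteredList(filter func(*v1.Pod) bool) []*v1.Pod {
	// RWMutex: concurrent FilteredList calls no longer serialize on a write lock.
	c.mu.RLock()
	defer c.mu.RUnlock()

	// Preallocate for the common case where most pods pass the filter, so the
	// result slice is not grown (and copied) over and over.
	out := make([]*v1.Pod, 0, len(c.pods))
	for _, p := range c.pods {
		// The map key is not rebuilt here: no "+" or fmt.Sprintf in the hot loop.
		if filter(p) {
			out = append(out, p)
		}
	}
	return out
}
```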
Results?
| Prerequisite | |
| --- | --- |
| Nodes | 5000 |
| Pods | 170k existing in the cluster |
| Additional | There are some pods in Terminating status whose node has already been deleted |
| Case | Scale up 1000 pods |
| Previous result | It costs 7 minutes to schedule one pod |
| Result with enhancement | 15.2 pods per second are scheduled |
    Aug 16 17:26:32.830: INFO: [Result:Performance] {
      "version": "v1",
      "dataItems": [
        {
          "data": {
            "pod_anti_affinity_average": 33.55688359,
            "pod_anti_affinity_perc50": 33.593305395,
            "pod_anti_affinity_perc90": 60.215760331,
            "pod_anti_affinity_perc99": 65.001977014
          },
          "unit": "s"
        },
        {
          "data": {
            "pod_anti_affinity_throughput": 15.201624640177682
          },
          "unit": "pods/s"
        }
      ],
      "labels": {
        "group": "scheduler",
        "name": "Latency of 1000 pods with pod anti-affinity to be scheduled"
      }
    }
Separate pods into different etcd instances
In view of:

- Pods and nodes are the two largest resources in a Tess.IO cluster.
- PATCH node is the most frequent request.
- In an etcd cluster, there is a global lock.

these two resources, pods and nodes, race each other for the global lock, which causes low throughput.
Separating pod resources into a dedicated etcd cluster helps the cluster scale. During the scale test before the resources were separated, when there were large volumes of parallel LIST pods requests (hitting etcd), node patching was affected. If a node cannot be patched successfully for a long time, the node becomes NotReady, which is dangerous.
Add a rate limit for large LIST requests
In the test cluster, kube-apiserver/etcd can withstand 1000 concurrent requests for pod creation/deletion. After all the core components had their SharedInformer/ListWatch usage refactored, the control plane can survive 5 apiservers restarting at the same time (all the clients start sending requests, which hit the apiserver cache instead of bypassing it).
The key is to set up a rate limit for large LIST queries, which bypass the apiserver cache, to avoid putting too much load on etcd.
We implemented a throughput-based rate limit: it records and refreshes the hot, heavy LIST request patterns and their costs (bytes received from etcd and bytes sent out from the apiserver to the client) in real time, predicts the cost of new incoming LIST requests from this pattern cache, and enforces the rate limit through a customized quota pool.
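A minimal sketch of such a cost-based limiter, not the actual Tess.IO implementation: remember the observed cost per request pattern, predict the cost of a new request from that cache, and admit it only if a shared byte-budget quota pool can cover it. All names and numbers below are illustrative.

```go
package listlimiter

import (
	"sync"
	"time"
)

// listRateLimiter admits expensive LIST requests against a shared byte budget.
type listRateLimiter struct {
	mu       sync.Mutex
	costs    map[string]int64 // request pattern -> last observed cost in bytes
	quota    int64            // remaining budget in the current window
	capacity int64            // budget restored every refill interval
}

func newListRateLimiter(capacity int64, refill time.Duration) *listRateLimiter {
	l := &listRateLimiter{costs: make(map[string]int64), quota: capacity, capacity: capacity}
	go func() {
		for range time.Tick(refill) {
			l.mu.Lock()
			l.quota = l.capacity
			l.mu.Unlock()
		}
	}()
	return l
}

// Admit predicts the cost of a request (e.g. "list all pods, label selector X")
// from the pattern cache and only lets it through if the pool can cover it.
// A rejected request should be answered with 429 so the client backs off.
func (l *listRateLimiter) Admit(pattern string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	predicted, ok := l.costs[pattern]
	if !ok {
		predicted = 1 << 20 // unknown pattern: assume 1MiB until observed
	}
	if predicted > l.quota {
		return false
	}
	l.quota -= predicted
	return true
}

// Observe refreshes the pattern cache with the real cost (bytes read from etcd
// plus bytes sent to the client) once the request has completed.
func (l *listRateLimiter) Observe(pattern string, bytes int64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.costs[pattern] = bytes
}
```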
Write backups to a disk separate from the etcd data
    -bash-4.2# salt "kubernetes-master*" cmd.run "grep 'sync duration' /var/log/etcd.log | tail -n 1"
    kubernetes-master-52-4.stratus.slc.ebay.com:
        2018-08-27 07:37:16.301942 W | wal: sync duration of 5.095883138s, expected less than 1s
    kubernetes-master-52-3.stratus.slc.ebay.com:
        2018-08-27 07:31:22.371713 W | wal: sync duration of 1.626935596s, expected less than 1s
    kubernetes-master-52-2.stratus.slc.ebay.com:
        2018-08-27 07:33:04.333038 W | wal: sync duration of 2.838373071s, expected less than 1s
    kubernetes-master-52-5.stratus.slc.ebay.com:
        2018-08-27 07:39:32.885600 W | wal: sync duration of 2.032394797s, expected less than 1s
    kubernetes-master-52-1.stratus.slc.ebay.com:
        2018-08-27 07:32:56.627090 W | wal: sync duration of 2.439381978s, expected less than 1s
Etcd persists raft proposals to its WAL; as we can see in the etcd.log above, the WAL sync duration is sometimes larger than 1s. The upshot is that etcd may miss a heartbeat or fail to send one out, and then the etcd followers start a new leader election.
    ERROR: logging before flag.Parse: I0827 07:31:59.293344       7 etcd-backup.go:775] Run backup command: 'snapshot save --endpoints=[https://kubernetes-master-52-1.52.tess.io:4001] --cacert=/etc/ssl/kubernetes/ca.crt --cert=/etc/ssl/kubernetes/etcd-client.crt --key=/etc/ssl/kubernetes/etcd-client.key /var/etcd/backup/1535355119293143013/1535355119293143013.snapshot.db'
    ERROR: logging before flag.Parse: I0827 07:32:56.903282       7 etcd-backup.go:781] Done etcdctl backup to /var/etcd/backup/1535355119293143013: Snapshot saved at /var/etcd/backup/1535355119293143013/1535355119293143013.snapshot.db
All the timestamps of the "wal sync duration" warnings match the timestamps of the etcd backups.
We checked the iostat of sdb, which is the etcd data disk. Most of the time, the write throughput is 20MB/s~60MB/s. While a snapshot is being taken, the write throughput goes up to 500MB/s and the IO utilization reaches 100%.
    -bash-4.2# iostat -xz 1 sdb
    ......
    Device: rrqm/s wrqm/s   r/s     w/s  rkB/s     wkB/s avgrq-sz avgqu-sz  await r_await w_await svctm  %util
    sdb       0.00   0.00  0.00 3263.00   0.00  30612.00    18.76     0.71   0.22    0.00    0.22  0.11  35.00
    ......
    Device: rrqm/s wrqm/s   r/s     w/s  rkB/s     wkB/s avgrq-sz avgqu-sz  await r_await w_await svctm  %util
    sdb       0.00   0.00  0.00  982.00   0.00 502784.00  1024.00   160.84 134.37    0.00  134.37  1.02 100.10

    Device: rrqm/s wrqm/s   r/s     w/s  rkB/s     wkB/s avgrq-sz avgqu-sz  await r_await w_await svctm  %util
    sdb       0.00   0.00  0.00  981.00   0.00 502272.00  1024.00   160.54 134.10    0.00  134.10  1.02 100.00
    ......
    Device: rrqm/s wrqm/s   r/s     w/s  rkB/s     wkB/s avgrq-sz avgqu-sz  await r_await w_await svctm  %util
    sdb       0.00   2.00  0.00 2773.00   0.00  29536.00    21.30     0.86   0.31    0.00    0.31  0.10  27.60
We suspect this is caused by flushing the etcd backup file to disk. Currently, the etcd backup shares the same disk with the etcd data. The snapshot size is 4GB in the test, and the etcd backup occupies all the disk IO and impacts regular etcd operations such as writing the WAL.
The kernel version used by the Tess.IO cluster is 3.10.0-862.3.2.el7.x86_64, and the default IO scheduler is deadline instead of cfq, which allocates time slices for each queue to access the disk. The deadline IO scheduler aggravates the situation in which the disk IO is monopolized by flushing the backup.
    -bash-4.2# cat /sys/block/sdb/queue/scheduler
    noop [deadline] cfq
To verify this, we tested taking a snapshot while writing the backup file to a disk separate from the etcd data. After this change, there are no more spikes in WAL sync duration while taking a snapshot, and leader elections no longer happen.
Summary
Kubernetes currently claims support for 5,000 nodes in one cluster. However, a large number of existing resources (such as pods), different deployment topologies, and different patterns of API load in a Kubernetes cluster may restrict the cluster size.
After the tuning and fixes described above, a Tess.IO cluster makes these claims true in a real production environment.