Analyzing the CNI performance benchmark

First of all, huge shout-out to Alexis Ducastel for putting together a great CNI benchmark comparison. To be honest, there was definitely a moment of panic when we saw the article pop up. Did we just miss a major performance regression?

This blog post documents the investigation we have done so far of what looked like a performance regression of HTTP/FTP traffic over pure TCP.

Alexis was super quick to share the scripts that he used to collect the benchmark numbers. This not only allowed for a quick verification but also lets us integrate them into our CI tests and run them alongside the existing benchmarks for better coverage.

Summary

The external etcd is no longer a requirement. With 1.3, we started providing an etcd-operator to have Cilium maintain its own etcd. We are still investigating a CRD-based replacement as well but are still seeing a significant scalability difference when running >2000 pods or >100 nodes. We leverage atomic transactions supported by etcd to avoid distributed locks and it heavily benefits the scalability of policy enforcement.
The article shows lower HTTP and FTP throughput compared to TCP. We feared this to be a BPF-related performance regression, but it turned out to be caused by how curl measures download speed in combination with Cilium's optimistic pod scheduling strategy, which treats CNI requests mostly asynchronously. This allows a pod to start running before the networking has been set up while ensuring that policy is always enforced correctly, which reduces the time needed to schedule a pod. The consequence is that the TCP connection started by curl is established later. Unfortunately, curl measures the time of the transfer from the initial connection attempt (unlike iperf3 and netperf) and thus adds the time spent waiting for the connection to be established to the total transfer time. This leads to a lower reported bitrate.

The latest development tree already had this behavior changed but the change was not released yet. Cilium versions 1.2.6, 1.3.1 and 1.4.0 will include the commit and change this default setting to block the CNI call until all network plumbing has been completed. We will make the functionality of asynchronous network bootstrapping available to advanced users via an option.
The article shows minor differences between raw TCP throughput of individual CNI plugins. The benchmarks of all TCP/UDP tests are limited by the speed of the 10Gbit/s network connectivity except for the encrypted test via Weave. There is a difference in theoretical payload maximum depending on whether a CNI plugin uses encapsulation or direct routing mode. In the article, Alexis is running Cilium in encapsulation mode which will lower the maximum achievable payload rate slightly because more of the available network bandwidth is used for network headers instead of payload. Multiple CNI plugins can be put in either encapsulation or direct routing mode and I would expect the numbers to change slightly accordingly. Update: According to our friends working on Weave, the Weave performance in encryption mode is bad because the MTU is misconfigured and does not take into account the ESP headers. This causes fragmentation which lowers the throughput significantly. If the MTU is configured to be 8196 to account for VXLAN + ESP, the reported throughput will be a lot better.

As a side note: While testing, we observed that the kernel being used by Alexis (4.15) has a known performance regression which is fixed by this upstream kernel commit. It's observable by comparing the number of TCP retransmissions required to transfer the data. This bug is not Cilium-specific, though: all CNI plugins that require a network packet to cross a network namespace boundary will suffer from it. The effect is not big enough to have a meaningful impact in this 9000 MTU benchmark test but adds some variance to the test results.
The memory consumption is due to aggressively pre-allocating BPF maps in favour of providing the lowest possible average latency. There is a PR under discussion to make the default pre-allocation strategy less aggressive. Another PR is out to disable IPv6 by default to save additional memory.
The only number we could not directly reproduce is the 5.5% CPU consumption. In our tests, iperf3 or curl consume significantly more CPU on their own and the CPU usage of Cilium and the kernel is insignificant compared to everything else.
The frequently requested transparent encryption will be added in Cilium 1.4. The work has been underway for a while and will be merged in the next couple of weeks.

TCP and UDP benchmarks

The first tests performed are TCP and UDP benchmarks using iperf3 by transmitting random data as quickly as possible for 2 minutes. The tests are run in a bare metal setup connected by 10Gbit/s hardware. As for all tests, an MTU of 9000 bytes is used. The MTU defines the maximum size of a single packet on the wire. By setting this to the largest possible number that the network hardware supports, the ratio between data payload and network packet headers can be optimized. This means that for a given available network bandwidth, more of it can be used for actual data. Large MTUs (jumbo frames) work well within datacenters but are much harder to achieve on the Internet, where the MTU is typically 1500 or below. Large MTUs put the least stress on the networking layer and more stress on the application; typically the network hardware or the memory bus becomes the bottleneck.
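To make the header-to-payload ratio concrete, here is a rough back-of-the-envelope sketch. It assumes plain IPv4 and TCP headers without options and the usual per-frame Ethernet overhead on the wire; the function name payload_ratio is ours and purely illustrative, and none of this is part of the benchmark scripts.

```python
# Assumed per-frame overheads in bytes (standard Ethernet, IPv4, TCP, no options).
ETH_WIRE_OVERHEAD = 8 + 14 + 4 + 12  # preamble+SFD, Ethernet header, FCS, inter-frame gap
IP_TCP_HEADERS = 20 + 20             # IPv4 header + TCP header


def payload_ratio(mtu: int) -> float:
    """Fraction of on-wire bytes that carry TCP payload for full-sized frames."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETH_WIRE_OVERHEAD)


for mtu in (1500, 9000):
    ratio = payload_ratio(mtu)
    # Scale the ratio to a 10 Gbit/s link to get an upper bound on goodput.
    print(f"MTU {mtu}: {ratio:.2%} payload, at most {10_000 * ratio:.0f} Mbit/s of data")
```

Under these assumptions roughly 95% of a 10Gbit/s link carries payload at an MTU of 1500, and roughly 99% at an MTU of 9000, which is why a jumbo-frame benchmark sits so close to wire speed.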

Let's look at the numbers collected:

It would be quite surprising if any CNI plugin could not achieve network wire speed in this scenario. Why? Let's look at the CPU utilization of the sending machine while 10Gbit/s are being transmitted with Cilium.

top - 14:25:54 up 19:33,  1 user,  load average: 1.16, 0.61, 0.30
Tasks: 419 total,   2 running, 223 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.2 us,  0.1 sy,  0.0 ni, 98.4 id,  0.0 wa,  0.0 hi,  1.3 si,  0.0 st
KiB Mem : 61822704 total, 56957304 free,  1374384 used,  3491016 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 59416744 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
127440 root      20   0    3312    948    684 R  13.7  0.0   0:03.97 iperf3         <-- Benchmark app
  7167 root      20   0 5174800  93288  50600 S   3.9  0.2  34:13.57 kubelet
    65 root      20   0       0      0      0 S   2.0  0.0   0:26.06 ksoftirqd/9    <-- BPF in kernel
  6373 root      20   0 6644496  70412  28976 S   2.0  0.1  16:31.18 dockerd
 17895 root      20   0   54416  28384  21564 S   2.0  0.0   2:30.99 aws-k8s-agent
 94338 root      20   0   10.1g  55652  17712 S   2.0  0.1  28:31.99 etcd
     1 root      20   0  191540   5984   4148 S   0.0  0.0   0:05.45 systemd
     2 root      20   0       0      0      0 S   0.0  0.0   0:00.07 kthreadd

As expected, the networking layer is not really consuming any significant CPU. Overall, the system is 98.4% idle and the majority of the CPU being consumed is by the app which produces the network traffic. This is pretty much expected: Linux has been optimized to run as a server hosting applications for decades.

The picture looks similar on the receiving side:

top - 14:32:52 up 19:40,  1 user,  load average: 0.46, 0.51, 0.39
Tasks: 424 total,   1 running, 233 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us,  0.9 sy,  0.0 ni, 97.7 id,  0.0 wa,  0.0 hi,  1.0 si,  0.0 st
KiB Mem : 61822704 total, 45532640 free,  1472912 used, 14817152 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 59248884 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 72931 root      20   0    3436   1068    612 S  58.0  0.0   4:13.87 iperf3
 71028 root      20   0   10.1g  58400  18116 S   6.0  0.1  24:18.46 etcd
 70274 root      20   0   10.1g  56560  17456 S   4.0  0.1  24:53.08 etcd
  6977 root      20   0 5172432  92316  50740 S   2.0  0.1  40:14.05 kubelet
 84317 root      20   0   50276  25616  18704 S   2.0  0.0   4:33.68 kube-dns
 87137 root      20   0  169296   4596   3732 R   2.0  0.0   0:00.09 top
     1 root      20   0  191532   6044   4168 S   0.0  0.0   0:04.39 systemd
     2 root      20   0       0      0      0 S   0.0  0.0   0:00.07 kthreadd
     4 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/0:0H

The system is only 97.7% idle and 58% of a single core is consumed by the benchmark app.

So are there differences between CNI plugins for throughput at an MTU of 9000 at all? Yes, different CNI plugins use different architectures in connecting pods to the networks. Cilium defaults to an encapsulation based routing mechanism. The advantage of this mode is the ability to pretty much run on any infrastructure without changes. The disadvantage of this mode is a slight overhead in theoretical maximum throughput as each packet is encapsulated into additional network headers. For this reason, Cilium also supports direct routing mode with integration into any routing daemon including kube-router. So if you care about maximising the throughput and have the expertise to configure your network to route pod IPs, feel free to configure direct routing mode by running Cilium with the option --tunnel=disabled.
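The size of that encapsulation overhead can be estimated with a small sketch. It assumes VXLAN over an IPv4 underlay (50 bytes of extra headers per packet), TCP/IPv4 without options inside the pod, and full-sized frames; the names below are illustrative only and the result is a theoretical ceiling, not a measurement.

```python
WIRE_OVERHEAD = 38                # preamble, Ethernet header, FCS, inter-frame gap
IP_TCP_HEADERS = 40               # inner IPv4 + TCP headers, no options
VXLAN_OVERHEAD = 20 + 8 + 8 + 14  # outer IPv4, UDP, VXLAN header, inner Ethernet


def payload_ratio(link_mtu: int, encap_overhead: int = 0) -> float:
    """Fraction of on-wire bytes carrying TCP payload; encapsulation shrinks the pod MTU."""
    pod_mtu = link_mtu - encap_overhead
    return (pod_mtu - IP_TCP_HEADERS) / (link_mtu + WIRE_OVERHEAD)


for link_mtu in (1500, 9000):
    direct = payload_ratio(link_mtu)
    tunnel = payload_ratio(link_mtu, VXLAN_OVERHEAD)
    print(f"MTU {link_mtu}: direct {direct:.2%}, encapsulated {tunnel:.2%}, "
          f"difference {(direct - tunnel) * 100:.2f} points")
```

With these assumptions the gap is a bit over half a percentage point at an MTU of 9000 and a little over three points at 1500, so the difference between the two modes matters much more on networks without jumbo frames.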

HTTP and FTP benchmark

This is where it gets really interesting! The benchmark shows a reduced throughput for both HTTP and FTP for Cilium. This is very unexpected, in particular because there is no HTTP level network policy in effect and thus HTTP/FTP performance should match TCP.

Alexis's benchmark runs nginx in a Kubernetes pod with a mounted volume that contains a 10G file of random data and then fetches that file via curl to measure the duration of the download.

The initial step was to run this benchmark and just run top on the side:

top - 15:06:15 up 20:13,  1 user,  load average: 0.51, 0.36, 0.26
Tasks: 502 total,   2 running, 229 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.2 us,  0.6 sy,  0.0 ni, 97.9 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
KiB Mem : 61822704 total, 56942360 free,  1376780 used,  3503564 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 59411592 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 65922 root      20   0   14096   3220   2804 S  18.3  0.0   0:00.55 curl
  7167 root      20   0 5175856  93868  50600 S   4.3  0.2  35:25.54 kubelet
  6373 root      20   0 6644496  71252  29040 S   2.3  0.1  17:03.42 dockerd
 57396 root      20   0 3725028 124768  37952 S   2.0  0.2   0:04.29 cilium-agent
 94338 root      20   0   10.1g  57432  17864 S   2.0  0.1  29:23.63 etcd
     1 root      20   0  191540   5984   4148 S   0.3  0.0   0:05.97 systemd
  6631 root      20   0 3312864  29336   7580 S   0.3  0.0   4:32.37 docker-containe

Nothing unexpected at all. The majority of the CPU is consumed by curl and, on the other node, by nginx. What is going on? Let's try with the latest master version of Cilium and compare it to 1.3.0 that Alexis is using.

  Run |       1.3.0 |      master
------|-------------|------------
    1 | 4539 Mbit/s | 4814 Mbit/s
    2 | 4545 Mbit/s | 4814 Mbit/s
    3 | 4543 Mbit/s | 4814 Mbit/s
    4 | 4542 Mbit/s | 4815 Mbit/s
    5 | 4547 Mbit/s | 4815 Mbit/s
  Avg | 4543 Mbit/s | 4814 Mbit/s

Bingo! Did we unknowingly fix a performance regression that only applies to HTTP and FTP? To understand what is going on, let's look at what the benchmark script does specifically:

function bench_kubectl {
        kubectl run --restart=Never --rm \
                --overrides='{"apiVersion":"v1","spec":{"nodeSelector":{"kubernetes.io/hostname":"XXX"}}}' $@
}

RES_HTTP=$(bench_kubectl bench -it --image=infrabuilder/netbench:client \
         -- curl -o /dev/null -skw "%{speed_download}" http://$IP/10G.dat 2>/dev/null| sed 's/\..*//' )

The script uses kubectl run which launches curl as soon as the container has started and then outputs the download speed. Why does this matter?

When a pod launches, kubelet invokes the CNI layer to request the CNI plugin to set up the networking for the pod. This call blocks the scheduling of the pod. CNI plugins that support network policy have to ensure that an eventual security policy is correctly enforced from the first possible packet that a pod could transmit or receive. This operation takes some time, and any added time will lower the speed at which pods can be scaled up at the cluster level. Early on, we discussed the pros and cons of:

Blocking the CNI request until the pod's labels have been fetched and all networking plumbing is done, to ensure that the first possible packet encounters a fully configured network and no packet is ever dropped.
Optimistically putting just enough plumbing in place to allow for TCP retransmissions, returning from the CNI request, and then fetching the Kubernetes pod labels and completing the networking in the background. Because of eventual network policy that might apply, all traffic is dropped until the pod labels have been fetched. This mode makes use of the nature of TCP to retransmit for a while until networking is fully up and leads to better pod scalability.
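To illustrate what relying on TCP retransmissions means for the second option, here is a minimal sketch. It assumes the early SYNs are silently dropped, a 1 second initial retransmission timeout that doubles on every retry, and 6 SYN retries (common Linux defaults); round-trip time is ignored, and the function name connect_time is made up for this example.

```python
from typing import Optional


def connect_time(network_ready_at: float,
                 initial_rto: float = 1.0,
                 syn_retries: int = 6) -> Optional[float]:
    """Time of the first SYN sent once the network is ready, or None if retries run out.

    SYNs are sent at t = 0, 1, 3, 7, ... seconds (exponential backoff).
    """
    t = 0.0
    rto = initial_rto
    for _ in range(syn_retries + 1):  # the initial SYN plus the retries
        if t >= network_ready_at:
            return t
        t += rto
        rto *= 2
    return None


for ready in (0.0, 0.3, 1.2):
    print(f"network ready after {ready:.1f}s -> SYN gets through at {connect_time(ready)}s")
```

The point of the sketch is that the connection does not start the moment the network is ready. It starts at the next retransmission, so even a short setup window can cost the client a full backoff step.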

We have chosen the faster pod scaling so far but, based on user feedback, are switching to the blocking mode going forward. The master tree already contains PR #6299 to change this default, and the change is scheduled to be released in 1.4.0, 1.3.1, and 1.2.6.

Due to how curl measures the download speed, it will start measuring the total time of the download when making the initial connection attempt. This means that the total time includes the period in which networking was still blocked. This reduces the reported throughput rate. iperf3, on the other hand, measures the actual transmitted data rate in intervals and adds it up.
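The effect on the reported number can be sketched with simple arithmetic. The sketch below assumes the "10G" file is 10 GiB and that the table values are in units of 10^6 bit/s; the helper names are ours, and the resulting wait time is an estimate derived from the averages above, not something we measured.

```python
FILE_BITS = 10 * 2**30 * 8  # assumption: "10G" means 10 GiB


def reported_mbit(transfer_s: float, wait_s: float = 0.0) -> float:
    """Average rate over the whole wall-clock time, including any connection wait."""
    return FILE_BITS / (wait_s + transfer_s) / 1e6


def implied_wait_s(rate_without_wait: float, rate_with_wait: float) -> float:
    """Extra wall-clock time that turns one average rate (Mbit/s) into the other."""
    return FILE_BITS / (rate_with_wait * 1e6) - FILE_BITS / (rate_without_wait * 1e6)


transfer_s = FILE_BITS / (4814 * 1e6)   # master: download time with no waiting
wait_s = implied_wait_s(4814, 4543)     # extra time needed to explain the 1.3.0 figure
print(f"transfer alone: {transfer_s:.2f}s, implied wait: {wait_s:.2f}s")
print(f"rate including the wait: {reported_mbit(transfer_s, wait_s):.0f} Mbit/s")
```

Under these assumptions a wait of roughly one second before the connection is established is enough to account for the whole gap between 4814 and 4543 Mbit/s, even though the data moves at the same speed once the transfer starts.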

Memory consumption

The memory consumption as reported in the benchmark is real but work is underway to provide better defaults while allowing large scale users to continue running aggressive memory pre-allocation settings.

The reasons for this are:

A very aggressive pre-allocation strategy for every datapath-related kernel data structure, including the state required to maintain connection tracking. This ensures that no memory allocations, or as few as possible, are required when performing work for individual network packets. This is done to keep latency low and stable. PR #6357 is being worked on to disable pre-allocation by default, with a --preallocate-bpf-maps option for users who want to continue optimizing for latency.
Enabling IPv6 by default. This leads to data structures being allocated for IPv6 which most users don't use yet. PR #6082 is underway to change this default.

We have definitely optimized for a low latency datapath so far while knowingly sacrificing memory. The 1.4.0 release will provide more control over this with defaults as decided by the community.

Another aspect is that the project is moving very quickly right now. We dedicate a percentage of time to resource optimization for every release and memory consumption has been coming down consistently with each release. We definitely don't claim to have fully optimized this yet. We believe that BPF gives us the flexibility and efficiency to strike the right balance between serving low latency requests that benefit from heavily pre-allocated memory tables and high throughput environments which do not require this at all.

CPU utilization

The last piece to look at is the CPU consumption. One thing to note here is that the article includes a graph that shows the overhead in permil instead of percent and the graph does not scale up to 100%. To put everything a bit into perspective, the graph on the right shows the same numbers at a scale of 100%.

Let's try a different workload type to trigger more CPU consumption by the networking layer. We are running wrk in a pod on one Kubernetes node configured to make as many HTTP requests as possible with 12 threads and 400 parallel connections to nginx running on another Kubernetes node.

wrk -t12 -c400 -d30s http://10.163.228.10:80/index.html
Running 30s test @ http://10.163.228.10:80/index.html
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    50.04ms  117.73ms   2.00s    87.14%
    Req/Sec     3.43k     1.61k   13.60k    75.93%
  1214934 requests in 30.10s, 271.07MB read
  Socket errors: connect 0, read 0, write 0, timeout 53
Requests/sec:  40362.94
Transfer/sec:      9.01MB

wrk is able to send roughly 40,000 requests per second. Let's look at the CPU consumption on both the sender and receiver:

top - 03:03:12 up 1 day,  8:36,  3 users,  load average: 1.16, 0.70, 0.37
Tasks: 510 total,   1 running, 259 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.7 us,  1.9 sy,  0.0 ni, 91.7 id,  0.0 wa,  0.0 hi,  5.7 si,  0.0 st
KiB Mem : 26403225+total, 25511452+free,  2000124 used,  6917612 buff/cache
KiB Swap:  1995772 total,  1995772 free,        0 used. 26021649+avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 4635 root      20   0  106524   6288   1728 S 282.5  0.0   0:29.02 wrk
  132 root      20   0       0      0      0 S   1.3  0.0   0:26.84 ksoftirqd/20
    8 root      20   0       0      0      0 S   1.0  0.0   0:20.08 ksoftirqd/0
   24 root      20   0       0      0      0 S   1.0  0.0   0:07.93 ksoftirqd/2
   48 root      20   0       0      0      0 S   1.0  0.0   0:07.11 ksoftirqd/6
   60 root      20   0       0      0      0 S   1.0  0.0   0:11.13 ksoftirqd/8
   72 root      20   0       0      0      0 S   1.0  0.0   0:06.87 ksoftirqd/10
   84 root      20   0       0      0      0 S   1.0  0.0   0:20.00 ksoftirqd/12
   96 root      20   0       0      0      0 S   1.0  0.0   0:07.65 ksoftirqd/14
  108 root      20   0       0      0      0 S   1.0  0.0   0:03.32 ksoftirqd/16

The majority of the CPU is consumed by wrk itself which is consuming roughly 3 cores in total spread over 12 threads.
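To compare the application against the kernel threads in a listing like the one above, the per-process rows can be summed by command name. The sketch below copies the visible sender rows and assumes the default top column order with %CPU in the ninth column; it only covers the rows shown, not the whole system, and the helper name cpu_by_command is ours.

```python
# Process rows copied from the sender-side top output above.
TOP_ROWS = """\
 4635 root      20   0  106524   6288   1728 S 282.5  0.0   0:29.02 wrk
  132 root      20   0       0      0      0 S   1.3  0.0   0:26.84 ksoftirqd/20
    8 root      20   0       0      0      0 S   1.0  0.0   0:20.08 ksoftirqd/0
   24 root      20   0       0      0      0 S   1.0  0.0   0:07.93 ksoftirqd/2
   48 root      20   0       0      0      0 S   1.0  0.0   0:07.11 ksoftirqd/6
   60 root      20   0       0      0      0 S   1.0  0.0   0:11.13 ksoftirqd/8
   72 root      20   0       0      0      0 S   1.0  0.0   0:06.87 ksoftirqd/10
   84 root      20   0       0      0      0 S   1.0  0.0   0:20.00 ksoftirqd/12
   96 root      20   0       0      0      0 S   1.0  0.0   0:07.65 ksoftirqd/14
  108 root      20   0       0      0      0 S   1.0  0.0   0:03.32 ksoftirqd/16
"""


def cpu_by_command(rows: str) -> dict:
    """Sum the %CPU column per command, folding ksoftirqd/N threads together."""
    totals = {}
    for line in rows.splitlines():
        fields = line.split()
        if len(fields) < 12:
            continue  # skip blank or truncated lines
        command = fields[11].split("/")[0]
        totals[command] = totals.get(command, 0.0) + float(fields[8])
    return totals


totals = cpu_by_command(TOP_ROWS)
for command, cpu in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{command:<10} {cpu:6.1f}% of one core")
```

Summed this way, the listed ksoftirqd threads together use well under a tenth of a core, while wrk alone uses close to three.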

On the receiving side:

top - 03:02:07 up 1 day,  8:36,  2 users,  load average: 0.47, 0.37, 0.28
Tasks: 512 total,   2 running, 263 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.5 us,  0.9 sy,  0.0 ni, 96.3 id,  0.0 wa,  0.0 hi,  2.3 si,  0.0 st
KiB Mem : 26403224+total, 23441790+free,  2124056 used, 27490276 buff/cache
KiB Swap:  1995772 total,  1995772 free,        0 used. 26002094+avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
41591 systemd+  20   0   33120   3336   1964 R  99.7  0.0   0:37.05 nginx
11463 root      20   0 10.034g  22268  12872 S   1.0  0.0   1:02.30 etcd
    9 root      20   0       0      0      0 I   0.3  0.0   0:11.23 rcu_sched
   72 root      20   0       0      0      0 S   0.3  0.0   0:06.98 ksoftirqd/10
  132 root      20   0       0      0      0 S   0.3  0.0   0:12.50 ksoftirqd/20
  168 root      20   0       0      0      0 S   0.3  0.0   0:10.03 ksoftirqd/26
37246 root      20   0       0      0      0 I   0.3  0.0   0:00.26 kworker/u97:0

The receiving side reveals the bottleneck, which is nginx running as a single pod. You can also see various ksoftirqd threads, the kernel software interrupt threads that handle the network receive load. No significant CPU resources are being spent on the networking stack itself.

Conclusion

This is where we are right now with our investigation. We did not find a generic performance regression. The HTTP and FTP performance difference has been tracked down to asynchronous handling of the CNI API, which will change in 1.3.1, 1.2.6 and 1.4.0. Release 1.4.0 will also make the aggressive pre-allocation optional and disable IPv6 by default to optimize the memory footprint.

A big thank-you again to Alexis Ducastel for sharing the results. Feedback like this helps everybody improve.
