All about performance: etcd 3.5 looks to make Kubernetes users happy

Almost two years after the team behind distributed key-value store etcd released version 3.4 of the project, etcd 3.5 is now ready for general consumption.

Since the last major release, the etcd team has seen more mission-critical systems being built on top of the project, which informed a large share of the enhancements in etcd 3.5. Performance in particular seems to have been high on the team’s agenda: etcd is often used together with the container orchestrator Kubernetes, a setup in which shortcomings seem to have become most apparent lately.

The etcd issue tracker has, for example, continued to see users grapple with slow reads, excessive memory allocation and out-of-memory crashes. In response, the etcd team investigated the project’s heap profile, which uncovered some logger inefficiencies. With these findings in hand, the team optimised protocol message size operations and reduced memory consumption by “up to 50 per cent during peak usage” in certain scenarios.

To increase throughput for transactions with update operations, etcd 3.5 has gone back to sharing the buffer between reads and writes. It also caches the transaction buffer to avoid unnecessary copy operations, which particularly benefits transactions with high read ratios.
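The idea of caching a buffer to avoid repeated copies can be illustrated with a toy model. This is only a sketch of the general technique, not etcd's actual implementation: the names WriteBuffer, ReadBufferCache and the version counter are invented for illustration, and readers are assumed to treat the returned snapshot as read-only.

```python
"""Toy model of caching a read-side copy of a write buffer (not etcd's code)."""


class WriteBuffer:
    """Pending writes plus a version counter that changes on every write."""

    def __init__(self):
        self.data = {}
        self.version = 0

    def put(self, key, value):
        self.data[key] = value
        self.version += 1


class ReadBufferCache:
    """Hands out a snapshot of the write buffer, copying only when it changed."""

    def __init__(self, write_buffer):
        self._wb = write_buffer
        self._snapshot = None
        self._snapshot_version = -1
        self.copies = 0  # counts how often a real copy was needed

    def snapshot(self):
        # Copy only if nothing is cached yet or a write happened since the last copy.
        if self._snapshot is None or self._snapshot_version != self._wb.version:
            self._snapshot = dict(self._wb.data)
            self._snapshot_version = self._wb.version
            self.copies += 1
        return self._snapshot


wb = WriteBuffer()
cache = ReadBufferCache(wb)

wb.put("a", 1)
for _ in range(1000):  # read-heavy phase: one copy serves all reads
    assert cache.snapshot()["a"] == 1

wb.put("b", 2)  # a write invalidates the cached snapshot
assert cache.snapshot() == {"a": 1, "b": 2}

print(cache.copies)  # 2 copies for 1001 reads
```

In this model the saving grows with the read ratio: the more reads fall between two writes, the more copies are skipped, which matches the kind of workload the article says benefited most.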

Keeping in mind that etcd is often part of a more complex setup, the key-value store now integrates with OpenTelemetry to help enterprises identify the source of a problem within their system. The addition should allow users to trace calls across a chain that spans multiple external components. The team also deprecated capnslog and replaced it with zap, a reflection-free, zero-allocation logger, implemented log rotation, and taught the project to emit more detailed tracing information for expensive requests.

In terms of bug fixes, committers were able to get rid of some memory leaks caused by lease objects piling up, deadlock issues in the mvcc storage layer, and a defragmentation problem causing crashes. In addition, etcd server restarts tended to take a very long time because of some redundant operations in the backend, which have been taken care of in the v3.5 release.

Codebase and beyond

Under the hood, etcd adopted the module approach of Go 1.16, which brought about some code refactoring and left the codebase reduced by half. The command-line interface includes a new admin tool, etcdutl, with subcommands snapshot and defrag that replace etcdctl snapshot and etcdctl defrag to allow for better dependency tree isolation.
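As a rough sketch of what the new tool looks like in use, the helpers below build etcdutl command lines for offline defragmentation and snapshot handling. The helper function names and all file paths are hypothetical; the subcommands and the --data-dir flag follow the etcdutl documentation, but should be checked against the installed version before use. Nothing is executed here, the commands are only assembled and printed.

```python
import shlex


def etcdutl_defrag(data_dir):
    """Argv for an offline defragmentation of an etcd data directory."""
    return ["etcdutl", "defrag", "--data-dir", data_dir]


def etcdutl_snapshot_status(snapshot_file):
    """Argv for inspecting a snapshot file."""
    return ["etcdutl", "snapshot", "status", snapshot_file]


def etcdutl_snapshot_restore(snapshot_file, data_dir):
    """Argv for restoring a snapshot file into a new data directory."""
    return ["etcdutl", "snapshot", "restore", snapshot_file, "--data-dir", data_dir]


# Hypothetical paths, for illustration only.
commands = [
    etcdutl_defrag("/var/lib/etcd"),
    etcdutl_snapshot_status("backup.db"),
    etcdutl_snapshot_restore("backup.db", "/var/lib/etcd-restored"),
]

for argv in commands:
    print(shlex.join(argv))
```

Each list could be handed to subprocess.run on a machine where etcdutl is installed. Because the tool works directly on files, these operations do not need a running etcd server.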

Code changes aside, the project now officially supports ARM64 platforms, and the team decided on a new security release process to ensure critical issues are handled responsibly. Since the last release, the project has also started running automated tests with static analysis tools, and it improved test quality by reducing the runtime of unit tests, simplifying the test data cleanup, and closing gRPC servers after tests.

With all of these things taken care of, the etcd team shared a quick look at what’s to come. The roadmap includes work items such as revisiting the etcd throttle feature, support for range streams to make the project more stable, and the development of an automated release system.