Why Kubernetes? Too many issues with the supposedly simpler AWS Elastic Container Service, says Figma

By Tim Anderson

August 13, 2024

Figma software engineering manager Ian VonSeggern has described the rationale for and outcome of the company's migration from Amazon ECS (Elastic Container Service) to Amazon EKS (Elastic Kubernetes Service). Figma runs a service for collaborative design and prototyping of web applications.

Although Kubernetes is often perceived as a platform for microservices, VonSeggern said that “we’re not a microservices company, and we don’t plan to become one.”

Why migrate from ECS? ECS is an orchestration service for containerized applications and potentially simpler to manage than Kubernetes. According to AWS, “using Amazon ECS decreases the number of decisions customers must make around compute, network, and security configurations, without sacrificing scale or features.”

Figma found, though, that ECS was causing rather than reducing complexity, particularly as it introduced services that are expected to run on Kubernetes. One example was etcd, a distributed configuration data store that normally runs on Kubernetes and uses a Kubernetes feature called StatefulSets. Figma wanted the benefits of etcd while running on ECS, so the team wrote custom code, which proved “fragile and hard to maintain.” Similarly, teams at Figma wanted to use Helm charts for deploying standard packages, but Helm charts also expect to find Kubernetes.

In other words, Figma found itself swimming against the stream by not using what has become an industry standard platform. “Running on ECS meant we were missing out on all the open source technology in the Cloud Native Computing Foundation (CNCF) ecosystem,” said VonSeggern, as the team also looked enviously at projects including KEDA for autoscaling and Envoy as a service mesh.

Figma believes that Kubernetes and its ecosystem will receive “significantly more investment than Amazon will be able to put into ECS.”

Other factors mentioned are avoiding lock-in and the relative ease of hiring people with Kubernetes expertise compared with ECS expertise.

Figma took a year or so to migrate most of its services to EKS. The company chose to run three separate EKS clusters to increase resiliency. The main causes of failure, VonSeggern implies, are bugs and operator errors, and he says that the three-cluster setup has reduced the “blast radius” of such issues on multiple occasions. One such occasion was when “an operator accidentally performed an action which destroyed and recreated CoreDNS on one of our production clusters.”

Can EKS be cheaper to run than ECS? VonSeggern does not give figures, but said that there are cost efficiencies with EKS. “For our ECS on EC2 services, we simply over-provisioned our services so we had enough machines to surge up during a deploy,” he wrote, whereas on EKS the open source Karpenter project is used to scale the number of nodes (the VMs used by EKS) dynamically. Figma has also focused on migrating its most expensive service to ARM-based Graviton processors, which are more efficient, though Graviton VMs are also available for ECS.
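As a rough illustration of why permanent over-provisioning costs more than scaling nodes on demand, the following Python sketch compares node-hours under the two approaches. All figures (100 steady-state nodes, a 2x surge during deploys, two deploy-hours per day) and both function names are hypothetical, chosen only for the arithmetic; VonSeggern gives no such numbers, and this is not a model of how Karpenter itself makes decisions.

```python
def static_node_hours(steady_nodes: int, surge_factor: float, hours: float) -> float:
    """Node-hours when the fleet is permanently sized for deploy-time surge."""
    return steady_nodes * surge_factor * hours


def dynamic_node_hours(
    steady_nodes: int, surge_factor: float, hours: float, deploy_hours: float
) -> float:
    """Node-hours when surge capacity exists only while deploys are running."""
    if deploy_hours > hours:
        raise ValueError("deploy_hours cannot exceed hours")
    surge = steady_nodes * surge_factor * deploy_hours
    steady = steady_nodes * (hours - deploy_hours)
    return surge + steady


if __name__ == "__main__":
    # Hypothetical figures, not taken from Figma's account.
    static = static_node_hours(100, 2.0, 24)
    dynamic = dynamic_node_hours(100, 2.0, 24, 2)
    print(f"static: {static:.0f} node-hours/day")
    print(f"dynamic: {dynamic:.0f} node-hours/day")
```

Under these assumed figures the always-surged fleet uses 4,800 node-hours a day against 2,600 for the dynamically scaled one. The real saving depends on how often deploys run and how quickly nodes can be added and removed.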

Another example, introduced after rather than during the migration, was Datadog’s open source Vector, a Kubernetes sidecar project built in Rust, for processing and forwarding logs. This was more streamlined than the previous approach, which wrote logs to the AWS CloudWatch service, then processed them on the serverless Lambda platform, and then wrote them to the third-party Datadog and Snowflake analytics services. “The intermediate storage on Cloudwatch was getting expensive,” said VonSeggern.
