Fly.io CEO says reliability ‘not great’ as platform suffers scaling issues

Fly.io CEO says reliability ‘not great’ as platform suffers scaling issues

Fly.io, a platform for easy deployment of container-based applications, is suffering reliability problems following a rapid influx of users.

“We’ve pushed the platform past what it was originally built to do,” posted CEO Kurt Mackey, who co-founded the company in 2016.

Last year Salesforce-owned Heroku, a pioneer in easy application deployment, decided to end free plans, following abuse of its platform. One of the outcomes was heightened interest in other options, including Fly.io.

Mackey listed the core pieces that go to make up the Fly.io platform, including HashiCorp Nomad for scheduling container start-up, gateways based on the WireGuard VPN tunnel which connect Fly.io to private networks, a global Docker registry, proxies for routing traffic to app instances, and more. “These have all failed in unique and surprising ways,” he said.

One major issue concerns service discovery, which informs proxies where to route requests. This originally used HashiCorp Consul, but the company discovered that Consul was unsuitable for a “global service discovery role” because of its centralized design, resulting in stale data. A new project called Corrosion was developed internally, based on a peer-to-peer design called a gossip protocol, but “gossip-based consistency is a difficult problem,” wrote Mackey, and Corrosion was not mature software, resulting in corrupted global service discovery state.

Other problems included failures with secrets storage, Postgres database issues, capacity shortages, and irritating developers with vague posts on the Fly status page. “We’re in an awkward phase where the company isn’t quite mature enough to support the infrastructure we need to deliver a good developer UX,” Mackey concluded.

A customer responded saying, “I’m now very glad I didn’t migrate our production systems,” and asking for “better error handling / debuggability /observability into the flowi.io system.” Another said that “the reliability has been a pretty sore spot for me so far. Hard to think of any other service (compute or not) I’ve used that has had similar spottiness.”

Despite that many customers are supportive, praising the transparency and commenting for example that the platform is “one of the most well designed and tasteful cloud offerings to date.”

The popularity of Fly.io demonstrates the appetite among developers for straightforward application deployment. Developers do not even need to build a container. “fly launch can scan your source code and configure your project for deployment on Fly.io, and our builders will build your app’s Docker image when you deploy,” say the docs.

It is also apparent that Heroku’s changes have cost it substantial momentum. “There is a huge loss of confidence in heroku’s future going on among heroku’s customers right now,” observed one developer, complaining about lack of investment in “edge functionality”, the ability to deploy automatically close to end-users in order to reduce latency.

A question to ponder is why the big cloud providers, specifically AWS, Microsoft Azure, and Google Cloud Platform, have not come up with a service as compelling for developers. Although these hyperscale platforms can easily scale and have better reliability, they are more expensive to run and more complex to configure.

Another issue is whether the provision of a generous free tier is counter-productive, attracting new customers but also creating problems for paying customers.

“If we don’t improve, our company ceases to exist,” wrote Mackey.