Attendees at Amazon Web Services’ re:Invent conference, under way in Las Vegas this week, were encouraged to consider cell-based architecture for resilience at scale, and to use chaos engineering to verify that resilience.
Resilience, explained principal developer advocate Seth Eliot in a session on the subject, is “the ability to design an application so it either avoids completely or mitigates the faults and load spikes you’re going to find in production … resilience is a continuous operation, not just something you build in.”
In the early days of AWS, some speculated that it was a way of utilizing spare capacity from Amazon.com – the retail site – though Bezos’s troops denied that was ever the case. Now, though, Amazon.com is an internal customer.
“Amazon runs on AWS, you can see DynamoDB, you can see Aurora [relational database service], you can see EBS [cloud block storage], you can see all kinds of AWS services,” explained Eliot. Then he showed a slide that looked nothing like an architecture diagram, but more like a brain scan. “Each of those dots is a service or microservice,” he noted. “And the lines between them are the lines of dependency. There are tens of thousands of services there.”
An Amazon.com web page, he explained, “is making hundreds of calls in parallel very quickly, to render this page.” Every element is contained in a widget, to enable graceful degradation if one of the back-end services fails. This is possible because each widget is independent of the others.
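The widget approach can be illustrated with a minimal Python sketch. It is not Amazon's implementation: the `fetch_widget` back-end call, the widget names, and the timeout value are all hypothetical, chosen only to show parallel calls in which a failed widget is dropped rather than breaking the page.

```python
import asyncio


async def fetch_widget(name, fail=False):
    """Hypothetical back-end call; raises to simulate a failed service."""
    await asyncio.sleep(0)
    if fail:
        raise RuntimeError(f"{name} service unavailable")
    return f"<div>{name}</div>"


async def render_page(widgets, timeout=0.5):
    """Call every widget in parallel; a failed or slow widget renders as nothing."""

    async def safe(name, fail):
        try:
            return await asyncio.wait_for(fetch_widget(name, fail), timeout)
        except Exception:
            return ""  # degrade gracefully: omit this widget only

    parts = await asyncio.gather(*(safe(name, fail) for name, fail in widgets))
    return "".join(parts)


def render(widgets):
    """Synchronous entry point for the sketch."""
    return asyncio.run(render_page(widgets))
```

Calling `render([("price", False), ("reviews", True), ("cart", False)])` returns the price and cart widgets and silently omits the failed reviews widget, because each call is wrapped independently.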
Eliot cautioned that developers should be concerned about scalability and resilience, even with small applications. “Even if your application is small scale today, it might not be that way tomorrow; and the principles about resilience apply at any scale.”
The major focus of this session was on cell-based architecture as a route to resiliency. According to AWS documentation, “A cell-based architecture uses multiple isolated instances of a workload, where each instance is known as a cell.” The idea is that if one cell fails, the others are unaffected.
How is cell-based architecture distinct from using microservices? “Microservices means that you’re dividing your business logic into multiple smaller services that do just one thing and which talk to each other. Cell-based means that the service itself, whether a microservice or not, can divide into cells, and cells all run the same stack – it’s a way of providing a fault boundary and also a scalability mechanism by which we can add cells to scale out.”
“It’s really important when you are designing the cells that you check the health of the cells, because you don’t want to route a request to a cell that’s underperforming,” stressed Tulip Gupta, senior solutions architect. She suggested using Route 53 health checks to check cell health, as well as CloudWatch monitoring. According to Gupta, with these measures Amazon’s Prime Video service achieved 99.9996 percent availability.
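A health-aware cell router might look something like the following Python sketch. The `CellRouter` class, the cell names, and the hashing scheme are assumptions for illustration, not anything shown in the session. In a real deployment the health flag would come from something like a Route 53 health check rather than a manual call, and failing over to a neighbouring cell assumes that cell can actually serve the request, which may not hold if each cell owns its own data.

```python
import hashlib


class CellRouter:
    """Hypothetical router: maps a customer to a cell and skips unhealthy cells."""

    def __init__(self, cells):
        self.cells = list(cells)
        self.healthy = {cell: True for cell in self.cells}

    def mark(self, cell, healthy):
        """Record the latest health-check result for a cell."""
        self.healthy[cell] = healthy

    def route(self, customer_id):
        """Return the customer's home cell, or the next healthy one after it."""
        digest = hashlib.sha256(customer_id.encode()).digest()
        start = int.from_bytes(digest[:8], "big") % len(self.cells)
        for offset in range(len(self.cells)):
            cell = self.cells[(start + offset) % len(self.cells)]
            if self.healthy[cell]:
                return cell
        raise RuntimeError("no healthy cell available")
```

The same customer always lands on the same cell while it is healthy, which keeps the blast radius of a failing cell limited to the customers mapped to it.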
Another topic discussed was chaos engineering, which in the AWS ecosystem is on offer via FIS (Fault Injection Service). FIS can introduce faults including making an EC2 (Elastic Compute Cloud) instance unavailable, stopping an RDS (Relational Database Service) database, raising an alarm in CloudWatch, disrupting the network, and causing high CPU or memory usage.
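The principle behind such an experiment can be sketched locally in Python, without touching AWS at all. Nothing below uses the FIS API: `FaultInjector`, `call_with_retry`, and `run_experiment` are hypothetical helpers that inject failures into a stand-in dependency and measure whether a simple retry policy keeps the success rate up.

```python
import random


class FaultInjector:
    """Hypothetical wrapper that makes a fraction of calls fail."""

    def __init__(self, func, failure_rate, seed=0):
        self.func = func
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so the experiment is repeatable

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return self.func(*args, **kwargs)


def call_with_retry(func, attempts=3):
    """The resilience mechanism under test: retry on connection errors."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except ConnectionError as err:
            last_error = err
    raise last_error


def run_experiment(failure_rate, requests=1000):
    """Inject faults at the given rate and return the observed success ratio."""
    flaky = FaultInjector(lambda: "ok", failure_rate)
    successes = 0
    for _ in range(requests):
        try:
            if call_with_retry(flaky) == "ok":
                successes += 1
        except ConnectionError:
            pass  # request failed even after retries
    return successes / requests
```

Comparing `run_experiment(0.2)` against a target success ratio is the kind of hypothesis check chaos engineering formalises: state the expected steady state, inject the fault, and see whether the system holds.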
Avinash Kolluri, another senior solutions architect, talking about a different internal case study, explained that using FIS “gave us good confidence in our infrastructure and workload being operated in multiple availability zones.”