Interview: Why AWS prefers VMs for code isolation, and tips on developing for Lambda

At last week’s re:Invent in Las Vegas, DevClass caught up with Anthony Liguori, AWS Distinguished Engineer, to talk about virtualization, the newly launched SnapStart, and best practice when building serverless applications with Lambda.

Liguori is a virtualization specialist who was project leader for QEMU for nearly six years while at IBM, and now works on Amazon’s EC2 (Elastic Compute Cloud), where he leads virtualization development and the design and implementation of Nitro, the hardware that sits underneath EC2 instances providing virtualization, networking and storage I/O. Another of his projects is Firecracker, a lightweight virtual machine monitor.

AWS Distinguished Engineer Anthony Liguori at re:Invent

Virtualization is at the heart of the AWS platform, not only for EC2 but also for Lambda, its serverless functions service. A Lambda execution environment is a lightweight virtual machine. What does Liguori think about other approaches to function isolation, such as containers, or even Isolates in the V8 JavaScript engine?

“At AWS, all customer multi-tenancy isolation is done through virtual machines,” he told DevClass. “That’s something we feel very strongly about, and it’s unique to AWS; not everybody follows that model. We prefer virtual machines for two reasons. The hardware virtualization boundary is smaller, so it’s easier to think about from a security point of view. The other reason is that it’s a proven technology. Security is the most important thing we do, and for any other technology, we would have to be confident in the security characteristics before we ever used it for tenancy purposes,” Liguori added.

“We’ve actually gone in the opposite direction and we also offer metal instances, because we’ve moved the security architecture from relying on the hypervisor to baking it into the underlying hardware. Metal instances allow you to run directly on the bare metal processor.”

This is supported on Graviton Arm-based processors as well as on Intel and AMD, Liguori said.

AWS said at re:Invent that it has a “bias towards serverless” even with its own software development. How has that impacted the engineering side?

“We launched Nitro in November 2017 and almost immediately started working on Firecracker, and part of the reason is that with the Nitro system we’ve moved a lot of the virtualization capabilities into the underlying hardware and that gives you good performance, efficiency, strong resource isolation. At the other end of the spectrum with Firecracker we favour flexibility, launch time, things of that nature. At re:Invent we launched SnapStart, which is a snapshot-based mode for Lambda. Even though Firecracker can start an instance in under 100 milliseconds, often customers have a lot of initialization code. They might be running Java and the JVM takes a long time to initialize.

“So even if the underlying serverless mechanism has a very fast startup time, the customer’s application might still see cold starts. With SnapStart we let the function fully initialize and then snapshot it. That snapshot can be restored down the road, and no matter what you’re doing in your initialization code, you’re always going to get effectively free cold starts.”

Why is the feature currently limited to Java functions? “One of the things that made that feature hard to implement is entropy. If you just snapshot a virtual machine, one of the things you’re also doing is snapshotting the random number generator in that virtual machine. True randomness is important to cryptography, so this is one of the reasons why today we’re only supporting Corretto [the AWS JVM], because we’ve made some modifications to the JVM to manage this,” said Liguori.

“We’ve been engaged with various communities, particularly the Linux kernel, trying to get better capabilities to manage this entropy problem,” he added.

What sort of work is involved? The issue is that computers use true entropy to seed a random number generator, then “once it’s seeded, you use a pseudo-random number generator to generate lots of values quickly. If you take that seed and duplicate it, a pseudo-random number generator in two functions will generate the same random numbers. They’re not random any more,” Liguori explained.
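The duplicated-seed problem Liguori describes is easy to demonstrate in miniature. The sketch below is purely illustrative Python (no Lambda runtime involved): it seeds two pseudo-random generators identically, as a naive VM snapshot effectively would, and shows they emit the same “random” stream until one is re-seeded from the operating system’s entropy pool.

```python
import os
import random

# Two generators seeded with the same value stand in for two execution
# environments restored from the same snapshot.
seed = 1234
a = random.Random(seed)
b = random.Random(seed)

# Both "clones" produce an identical stream: the values are no longer
# unpredictable relative to each other.
assert [a.random() for _ in range(5)] == [b.random() for _ in range(5)]

# Re-seeding one clone from true OS entropy breaks the correlation,
# which is the fix Liguori goes on to describe.
b.seed(os.urandom(16))
assert [a.random() for _ in range(5)] != [b.random() for _ in range(5)]
print("clones diverge after re-seeding")
```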

“The solution is to signal to the operating system and all the software that if you have a pseudo-random number generator you have to re-seed it. It’s not all that complicated to do, but providing these APIs takes some time working with communities,” Liguori continued. The hope is that further runtimes will support SnapStart relatively quickly. “It’s really just a matter of working with the various communities,” he said.

Another new feature introduced at re:Invent is the fifth generation of Nitro. “Today we’ve built over 20 million Nitro cards,” said Liguori. The latest cards now support 200 Gb/s networking, he added.

What should developers bear in mind in order to optimize applications running on EC2 or Lambda? “Right now the most important thing is to think about having your application support multiple architectures,” said Liguori. “We’ve seen this with so many customers. With Graviton, often all you have to do is recompile your application and then save 40 percent on your infrastructure costs,” he said.

Most libraries are now supplied as source code and can just be recompiled, Liguori suggested. What is the blocker then?

“Often it’s the build infrastructure, they’ve only ever tested on x86, or maybe their pipelines make assumptions about x86,” Liguori told us. “There are some edge cases. There is a difference in memory model. Intel has something called Total Store Ordering which means you don’t have to think about a lot of multi-threading problems. If you’re writing normal code using mutexes and locks and higher level primitives, you don’t have to worry about this, it all works the same on Arm. If you have custom assembly code or are doing unusual things, that is something where you need to appreciate the difference between Arm and x86,” he said.

What about the size of the code that goes into a Lambda instance, what is optimal?

“I don’t think the size of the code is all that important,” said Liguori, particularly with SnapStart. “The thing that most customers should think about is how much work is an individual function doing and how many things does it have to call? If each function is making many external service calls, and then something else starts calling that function a lot, you can get a combinatorial explosion where you end up having a much higher request rate than you would expect. So being very mindful about what services a function is calling, passing data around versus re-fetching it, those are the kinds of things it is pretty helpful to do when building serverless applications with Lambda,” he said.
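The “combinatorial explosion” Liguori warns about is just multiplication of fan-out across call depth. This back-of-the-envelope Python sketch uses made-up numbers to show how quickly downstream request rates grow when each function makes several service calls and functions call each other in a chain:

```python
# Hypothetical numbers, for illustration only.
invocations_per_sec = 100   # rate at which the front-door function is invoked
calls_per_function = 5      # downstream calls each function makes
chain_depth = 3             # functions calling functions calling functions

# Each level of the chain multiplies the rate by the fan-out, so services
# at the bottom see 100 * 5**3 = 12,500 requests per second, far more than
# the 100/sec the front door receives.
downstream_rate = invocations_per_sec * calls_per_function ** chain_depth
print(downstream_rate)  # 12500
```

Keeping fan-out small, and passing data between functions rather than re-fetching it at every hop, keeps that multiplier in check.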