Fermyon previews new spin on Serverless AI via Wasm

Fermyon, a specialist in WebAssembly (Wasm) microservices, has introduced a new serverless AI platform in association with Kubernetes hosting service Civo.

The service uses Meta’s Large Language Models (LLMs) Llama 2 and Code Llama (a model specialized for coding), which are open source and free to use. “Using WebAssembly to run workloads, we can assign a fraction of a GPU to a user application just in time to execute an AI operation,” said CTO and co-founder Radu Matei in a post today.

The service was announced at the Civo Navigate event in London. Civo provides the GPU compute service which underlies Fermyon Serverless AI, when running on Fermyon Cloud. The company claims that Serverless AI, in private beta, makes AI apps affordable because it avoids the expense of “access to GPUs at $32/instance-hour and upwards.”

Other on-demand AI services exist but do not perform as well, the company said, because of slow start-up times, whereas “Fermyon Serverless AI has solved this problem by offering 50 millisecond cold start times.” 

Fermyon’s approach rests on the efficiency of sandboxed Wasm code versus containers or VMs – a similar approach to Cloudflare Workers, which relies on V8 Isolates (V8 being the JavaScript engine behind Google Chrome and Node.js). The downside is that this kind of sandboxing may be less secure than that offered by VMs.

Serverless AI will be a new component of the open source Spin project, a platform for Wasm microservices. Spin can run locally on a developer machine and be deployed to Fermyon’s own cloud hosting platform or elsewhere. Supported languages include Rust (the primary language), TypeScript, Python, TinyGo and C#. TinyGo is a Go compiler with Wasm and WASI (WebAssembly System Interface) support, allowing compiled code to run outside the browser, which is why Spin can support it. Note also that Go itself now has an experimental WASI port.
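
For context, a Spin application is built from components such as the minimal HTTP handler below. This is a sketch in Rust, assuming a recent version of the `spin_sdk` crate; the exact builder methods and return types vary between SDK releases.

```rust
use spin_sdk::http::{IntoResponse, Request, Response};
use spin_sdk::http_component;

/// A minimal Spin HTTP component. Spin compiles this to Wasm and routes
/// incoming requests to it according to the application's spin.toml.
#[http_component]
fn handle_request(_req: Request) -> anyhow::Result<impl IntoResponse> {
    Ok(Response::builder()
        .status(200)
        .header("content-type", "text/plain")
        .body("Hello from a Spin Wasm component")
        .build())
}
```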

There are some limitations in the preview: users are limited to 75 inferencing requests and 200 embedding requests per hour, where embeddings are a way of persisting text data as a vector of numbers so that it can later be searched by similarity.
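
As a rough illustration of the embedding workflow, the sketch below assumes the `spin_sdk::llm::generate_embeddings` helper described in Fermyon’s Serverless AI documentation; the embedding model name and exact signatures are assumptions that may differ by SDK version.

```rust
use spin_sdk::llm;

/// Turn a batch of sentences into embedding vectors (one Vec<f32> per
/// sentence), which can then be stored and compared by similarity.
/// Assumes the SDK's generate_embeddings helper and AllMiniLmL6V2 model.
fn embed_sentences(sentences: &[String]) -> anyhow::Result<Vec<Vec<f32>>> {
    let result = llm::generate_embeddings(llm::EmbeddingModel::AllMiniLmL6V2, sentences)?;
    Ok(result.embeddings)
}
```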

Matei said that developers using the Serverless AI preview will be able to execute inferencing on the Llama 2 and Code Llama LLMs, generate sentence embeddings and store, search and retrieve them, cache responses in a built-in key/value database, and run “entire full stack serverless applications” using the service alongside other existing features of the Fermyon platform.
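
The snippet below sketches how inferencing and the key/value store might be combined, assuming the `spin_sdk::llm::infer` and `spin_sdk::key_value::Store` APIs from Fermyon’s documentation at the time; model names and error handling are simplified assumptions rather than a definitive implementation.

```rust
use spin_sdk::{key_value::Store, llm};

/// Answer a prompt with Llama 2, caching responses in Spin's built-in
/// key/value store so that repeat prompts skip inference entirely.
fn cached_answer(prompt: &str) -> anyhow::Result<String> {
    let store = Store::open_default()?;

    // Return the cached completion if this prompt has been answered before.
    if let Some(bytes) = store.get(prompt)? {
        return Ok(String::from_utf8(bytes)?);
    }

    // Otherwise run inference against the hosted Llama 2 chat model...
    let result = llm::infer(llm::InferencingModel::Llama2Chat, prompt)?;

    // ...and cache the generated text for next time.
    store.set(prompt, result.text.as_bytes())?;
    Ok(result.text)
}
```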

Developers can also run Serverless AI locally but should expect slower performance. Matei quoted delays of 20-30 seconds on an Apple M1 laptop, compared to 750 milliseconds using the cloud service, including the cold start time for the serverless endpoint.

Possible uses include text processing and summarization, chatbots, and generating code from natural language input.