ONNX runtime sneaks into the auditory realm, gets more mobile friendly

Microsoft’s inference and training accelerator for machine learning, ONNX runtime, is now available in version 1.7, promising reduced binary sizes while also making a foray into audio.

ONNX runtime makes use of the computation graph format described in ONNX, the open standard for machine learning interoperability, and looks to reduce training time for large models, improve inference, and facilitate cross-platform deployments. In the last couple of months, the team seems to have been quite focussed on performance, improving quantisation mechanisms such as depthwise Conv, QuantizeLinear, and Conv fusion, and reducing the memory needed when running the Longformer (long-document transformer) attention mechanism on CUDA.
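For those who would rather see it than read about it, post-training quantisation is exposed through the runtime's Python tooling; a minimal sketch of dynamically quantising a model, with the file paths as placeholders:

```python
# Minimal sketch: dynamic (post-training) quantisation via ONNX runtime's
# Python quantisation tooling. "model.onnx" and "model.quant.onnx" are
# placeholder paths, not files from the release.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model.onnx",         # original float32 model
    model_output="model.quant.onnx",  # quantised output model
    weight_type=QuantType.QUInt8,     # quantise weights to unsigned 8-bit ints
)
```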

There’s now also support for the QuantizeLinear-DequantizeLinear (QDQ) format, as well as quantisation for Pad, Split, and MaxPool with channels-last layout. Changes in the Python optimiser integrated in the project allow the use of operator fusion on BART models to get performance up. And just so you don’t have to take Microsoft’s word for it, ONNX runtime now also includes a CPU profiling tool that lets you get a better idea of how different transformer models are performing.
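As a rough illustration of the mechanics, here is the runtime's general session-level profiler, which writes per-operator timings to a JSON trace. Note this is not necessarily the new transformer-specific tool, and the model path is a placeholder:

```python
# Minimal sketch of ONNX runtime's session-level profiler. "model.onnx"
# is a placeholder path; shapes are model-specific.
import numpy as np
import onnxruntime as ort

options = ort.SessionOptions()
options.enable_profiling = True  # write per-operator timings to a JSON trace

session = ort.InferenceSession("model.onnx", options)

# Build a dummy input matching the model's first input; symbolic
# dimensions (strings) are replaced with 1 for this sketch.
inp = session.get_inputs()[0]
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
session.run(None, {inp.name: np.random.rand(*shape).astype(np.float32)})

trace_file = session.end_profiling()  # returns the path of the JSON trace
print(trace_file)
```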

Since the deployment of machine learning models on mobile remains something of a challenge, the ONNX runtime team has added a build option that lets the operator kernels support only those types actually used by a given model. This promises a “25-33% reduction in binary size contribution from the kernel implementations”, though the creators also point out that how much can be gained depends on the model in question.
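The reduced builds work from a configuration describing which operators a model requires. As a rough illustration of where that information comes from, the operator types a model pulls in can be enumerated with the onnx Python package; a minimal sketch, with the model path as a placeholder:

```python
# Minimal sketch: list the operator types a model actually uses, which is
# the information a reduced-kernel build is based on. "model.onnx" is a
# placeholder path.
import onnx

model = onnx.load("model.onnx")
op_types = sorted({node.op_type for node in model.graph.node})
print(op_types)  # e.g. ['Add', 'Conv', 'MaxPool', 'Relu', ...]
```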

Speaking of gains, researchers looking to use machine learning in audio-related use cases could soon get more out of the project, since it now comes with first iterations of some audio operators. These include Fourier transforms (DFT, IDFT, STFT), various window functions (Hann, Hamming, Blackman), and a MelWeightMatrix operator. To give them a go, the project has to be built with the ms_experimental build flag enabled.
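For a sense of what these operators compute in-graph, here is roughly the same pipeline in plain NumPy, a Hann-windowed STFT; the signal, frame length, and hop size are arbitrary example values:

```python
# Rough NumPy stand-in for the experimental STFT + Hann window operators:
# split a signal into overlapping frames, apply a Hann window, take the DFT.
import numpy as np

signal = np.random.randn(16000)  # placeholder: 1 s of audio at 16 kHz
frame_len, hop = 512, 128        # arbitrary example values

window = np.hanning(frame_len)   # Hann window: 0.5 - 0.5*cos(2*pi*n/(N-1))
frames = np.stack([
    signal[i:i + frame_len] * window
    for i in range(0, len(signal) - frame_len + 1, hop)
])
stft = np.fft.rfft(frames, axis=1)  # one spectrum per frame
print(stft.shape)                   # (number of frames, frame_len // 2 + 1)
```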

Developers who have been using ONNX runtime with OpenMP need to check they’re downloading the right version of the project, since it is now built without the API by default. Builds including OpenMP can be identified by a corresponding suffix (onnxruntime-openmp, Microsoft.ML.OnnxRuntime.OpenMP) and are available separately on PyPI and NuGet.
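With the OpenMP-free default builds, thread counts are set through the runtime's own session options rather than OpenMP environment variables; a minimal sketch, with the model path as a placeholder:

```python
# Minimal sketch: without OpenMP, threading is configured through the
# runtime's session options rather than OMP_NUM_THREADS and friends.
import onnxruntime as ort

options = ort.SessionOptions()
options.intra_op_num_threads = 4  # threads used within a single operator
options.inter_op_num_threads = 1  # threads used across independent operators

session = ort.InferenceSession("model.onnx", options)  # placeholder path
```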

The GPU package is currently missing from NuGet due to size restrictions, something Microsoft looks to fix in upcoming releases. Teams interested in ARM32/64 Windows builds can find those in the CPU NuGet and zip packages starting with this release.

In terms of dependencies, it’s important to note that Python 3.5 support has been removed in v1.7 of the runtime, though it has learned to work with versions 3.8 and 3.9. Dependencies on gemmlowp and build configs for MKLML, openblas, and jemalloc have been binned as well. Meanwhile, the GPU build is now created using CUDA 11, OpenVINO has been updated to v2021.2, TensorRT to 7.2, and DirectML to 1.4.2, which is meant to help with performance and stability.
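Given the jump to CUDA 11, it can be worth checking which execution providers an installed package actually exposes; a minimal sketch:

```python
# Minimal sketch: check which execution providers (CPU, CUDA, TensorRT, ...)
# the installed onnxruntime package was built with.
import onnxruntime as ort

print(ort.__version__)                # e.g. "1.7.0"
print(ort.get_available_providers())  # e.g. ['CUDAExecutionProvider',
                                      #       'CPUExecutionProvider']
```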