Facebook-driven machine learning framework PyTorch has reached version 1.10 and now comes packed with 3400 additional contributions meant to stabilise distributed training, simplify the move from research to production, and make neural network runs on Android more efficient.
The team behind the TensorFlow competitor was especially keen to highlight the integration of CUDA Graph APIs in v1.10, as this is meant to reduce the CPU overheads for workloads using Nvidia’s approach to parallel computing. This wasn’t the only effort to improve performance, however: PyTorch Profiler, for instance, was modified to show active memory allocations during a program run and provide more recommendations for model optimisation.
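For readers who have not used the profiler's memory tracking, the following is a minimal sketch of how per-operator memory statistics can be collected on the CPU. The toy model and tensor sizes are made up for illustration, and the snippet uses the long-standing profile_memory option rather than the new allocation view, which lives in the profiler's TensorBoard plugin.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Hypothetical toy model and input, chosen only to generate some allocations.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
)
inputs = torch.randn(64, 256)

# profile_memory=True asks the profiler to record tensor allocations per operator.
with profile(
    activities=[ProfilerActivity.CPU],
    profile_memory=True,
    record_shapes=True,
) as prof:
    with torch.no_grad():
        model(inputs)

# Summarise the recorded events, sorted by memory allocated by each operator itself.
events = prof.key_averages()
report = events.table(sort_by="self_cpu_memory_usage", row_limit=5)
print(report)
```

The printed table lists the operators that allocated the most memory, which is usually a reasonable starting point before opening the richer timeline view.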
A JIT compiler for CPUs that can merge torch library call sequences together is meant to help speed things up as well, although similar functionality was available for GPUs before.
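To give an idea of the kind of code that can benefit, here is a small sketch of a chain of pointwise operations compiled with TorchScript. The function name and input size are invented for this example, and whether the operations are actually fused into one kernel on a CPU depends on how PyTorch was built and configured, so the snippet only checks that the compiled version matches eager execution.

```python
import torch


def scaled_tanh(x: torch.Tensor) -> torch.Tensor:
    # Several element-wise ops in a row: a typical candidate for fusion.
    return 0.5 * x * (1.0 + torch.tanh(x))


# Compile the function with the TorchScript JIT.
scripted = torch.jit.script(scaled_tanh)

x = torch.randn(1024)
eager_out = scaled_tanh(x)

# The JIT profiles the first calls before optimising, so run it a few times.
for _ in range(3):
    scripted_out = scripted(x)

# Inspect the TorchScript graph that the JIT works on.
print(scripted.graph)
```

Without fusion, each operation reads and writes a full intermediate tensor; a fused kernel can make one pass over the data instead.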
After some polishing, the torch.fx toolkit for simplifying the implementation of program transforms and the torch.special collection of error functions, information theory functions, and statistical functions were both promoted to stable with the release. The distributed training component saw a couple of features maturing as well, meaning that the functionalities for operating modules on remote workers in the same manner as local ones and for overriding DDP gradient synchronization are now out of beta.
The framework’s support for working with Android’s Neural Networks API will meanwhile stay in preview for a little while longer, but now contains more op coverage, support for load-time flexible shapes, and the ability to run models on the host for testing.
Apart from those internal enhancements, the PyTorch team also tackled the often-cited problem of getting machine learning applications into production. The result is a new SDK called TorchX, which promises components that encode MLOps best practices and easy access to functionalities for distributed training and hyperparameter optimisation.
The included built-ins cover capabilities ranging from data loading and validation modules to serving, quality monitoring, and metadata management, and can be converted into pipeline stages through an adapter. TorchX is said to work with local schedulers, Kubernetes and Slurm, as well as Kubeflow Pipelines.
Speaking of pipelines, the TorchAudio library is now available in v0.10 and offers the option of building text-to-speech pipelines with vocoder implementations. It also comes fitted with a new transformer architecture for automatic speech recognition in low-latency streaming scenarios, and adds the HuBERT model architecture with pre-trained weights as well as support for differentiable Minimum Variance Distortionless Response beamforming on multi-channel audio.
Computer vision practitioners might be interested in the TorchVision library, which in its current release comes with 22 pre-trained weights for classification variants of the RegNet and EfficientNet architectures. Other additions are a new feature extraction method based on the torch.fx toolkit, and the simpler automatic data augmentation techniques RandAugment and TrivialAugment.