PyTorch lights up version 1.6, follows competition down the profiling route


Just one day after TensorFlow hit version 2.3, Facebook’s challenger project PyTorch was updated to 1.6, sporting support for automatic mixed precision training and a changed classification scheme for new features.

New features now fall into one of three categories: stable, beta, or prototype. Beta corresponds to what had been known as experimental features, meaning there is a proven added value, but the API could still change or there are performance or coverage issues yet to tackle. Examples of features in this category include custom C++ classes, named tensors, and PyTorch Mobile.

Prototypes are meant for getting “high bandwidth” feedback on the utility of a proposed new feature in order to either commit to getting it to beta or let it fall by the wayside. Prototypes aren’t part of the binaries and are only available to those building from source, using nightlies, or setting the associated compiler flag, which is why a couple of neat additions such as a profiler for distributed training or graph mode quantisation are a bit trickier to access.

Interestingly enough, the feature focus of PyTorch 1.6 is similar to that of the latest TensorFlow (TF) release, adding new profiling tools and performance improvements to the project. The PyTorch team even implemented a memory profiler, though it doesn’t seem to come with the fancy kind of visualisation TF decided to go with. That being said, the profiler is marked as beta only, so someone could still take it upon themselves to implement additional representations.
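For a rough idea of what the memory profiler looks like in use, here is a minimal sketch. It assumes the `profile_memory` switch on `torch.autograd.profiler.profile`, and the tiny linear model and random input are made up purely for illustration.

```python
import torch
from torch.autograd import profiler

# Hypothetical toy workload, chosen only to give the profiler something to record.
model = torch.nn.Linear(128, 64)
inputs = torch.randn(32, 128)

# profile_memory=True asks the profiler to track tensor memory
# allocated and released by each operator.
with profiler.profile(profile_memory=True, record_shapes=True) as prof:
    model(inputs)

# Aggregate events by operator and sort by the memory each operator allocated itself.
averages = prof.key_averages()
table = averages.table(sort_by="self_cpu_memory_usage", row_limit=5)
print(table)
```

The output is a plain text table rather than a graphical view, which matches the lack of visualisation noted above.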

PyTorch 1.6 also packs a couple of other betas, such as a new backend for the RPC module, which was introduced in version 1.4 for multi-machine model training. The new backend uses the TensorPipe library, which offers pairwise and asynchronous communication, opening RPC up to client-server scenarios as well as model and pipeline parallel training.

Starting with v1.6, RPC also includes asynchronous user functions to keep the system from running into performance issues due to RPC threads waiting for a return from user-defined functions. The module has also learned to work together with DDP, a module responsible for full sync data parallel training of models. This is meant to help users experiment with mixtures of distributed and data parallel approaches, which can be useful for models with sparse and dense parts or for scenarios in which execution is pipelined across multiple machines for speedup.

In other performance news, the automatic mixed precision feature proposed in August 2019 has been marked stable in PyTorch 1.6. It is mainly of interest to users of Nvidia’s CUDA architecture, since one of its main tasks is to automatically cast CUDA operations to FP16 or FP32 during training, depending on what is best for an operation.
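A training step using automatic mixed precision might look roughly like the sketch below. It relies on `torch.cuda.amp.autocast` and `torch.cuda.amp.GradScaler`; the model, data, and hyperparameters are invented for illustration, and the `use_cuda` fallback is an assumption added so the snippet also runs, with casting disabled, on machines without a GPU.

```python
import math

import torch
from torch import nn

# Assumption: fall back to CPU with mixed precision switched off when no GPU is present.
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Hypothetical toy model and optimizer.
model = nn.Linear(16, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# GradScaler scales the loss to avoid FP16 gradient underflow; it is a no-op when disabled.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)


def train_step(inputs, targets):
    optimizer.zero_grad()
    # Inside autocast, eligible CUDA ops run in FP16 while others stay in FP32.
    with torch.cuda.amp.autocast(enabled=use_cuda):
        loss = nn.functional.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then runs optimizer.step()
    scaler.update()                # adjusts the scale factor for the next iteration
    return loss.item()


torch.manual_seed(0)
x = torch.randn(8, 16, device=device)
y = torch.randn(8, 4, device=device)
losses = [train_step(x, y) for _ in range(5)]
print(losses)
```

Only the forward pass and loss computation sit inside the `autocast` block; the backward pass and optimizer step happen outside it.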

Moreover, the library’s scripting language TorchScript has seen the introduction of torch.jit.fork and torch.jit.wait for executing programs written in the language in parallel. Use cases include running bidirectional components of recurrent nets or ensemble models in parallel.

The PyTorch team also used its 1.6 announcement to let users know that Windows builds and binaries of the library are now maintained by Microsoft, who’ll also jump in and help developers in the various discussion channels. Last year the company ceased work on its own deep learning toolkit CNTK, stating it would instead focus on committing to other open-source projects. 

More details on the new release, along with information about the large number of bug fixes included, can be found in the release notes.