DeepSpeed and T-NLG – Microsoft’s way of giving computational complexity ZeRO thought

In a rush to blast through the restrictions set by current-day hardware, Microsoft has introduced a new generative language model to a select audience and shared a bit of the tech behind it with the open source community.

The new apple of the company’s AI eye is called T-NLG, which is short for Turing Natural Language Generation and part of Microsoft’s Project Turing. Why the project was named after the British luminary can be guessed. Its official aim, however, is to “scale deep learning efforts at Microsoft to solve customer and business problems across various products, starting with Search”.

The base for improvement isn’t any old search, though, but Google’s web search, since the use case, getting a direct answer to a question posed, is something neither Ecosia nor DuckDuckGo tackles in the way presented in the Turing NLG blog post.

If question answering isn’t your cuppa, there’s plenty more to do with T-NLG. “T-NLG is a Transformer-based generative language model, which means it can generate words to complete open-ended textual tasks. In addition to completing an unfinished sentence, it can generate direct answers to questions and summaries of input documents.”

To get there, Microsoft’s approach includes training 17 billion parameters – just for comparison, Facebook’s natural language processing project RoBERTa used “only” around 355 million params. This comes at a price though, as Corby Rosset, applied scientist at Microsoft, points out in the project’s introduction. “Large models offer significant accuracy gains, but training billions to trillions of parameters frequently runs up against fundamental hardware limitations.”

A workaround for this unsurprising fact was implemented in the form of DeepSpeed, which is now openly available under an MIT license. The project’s repository touts it as a way to “train DL models with over a hundred billion parameters on current generation of GPU clusters, while achieving over 5x improvement in system performance compared to the state-of-art”.

DeepSpeed is meant to be used in concert with PyTorch, which might make it appealing to those working with the deep learning library. It also comes with the much highlighted ZeRO optimiser as one of its core features. ZeRO promises to “greatly reduce the resources needed for model and data parallelism while massively increasing the number of parameters that can be trained”.

This is realised by partitioning the model states (optimiser states, gradients and parameters) across processes, which supposedly saves a lot of memory when compared to data parallelism approaches that replicate those states in full on every process. For better scalability it can also be combined with model parallelism approaches, while DeepSpeed’s support for advanced hyperparameter tuning and large batch size optimisers helps with effectiveness.
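
To get a feel for why partitioning matters, here is a rough back-of-the-envelope sketch in Python. It is not DeepSpeed code and calls no DeepSpeed API. It assumes mixed-precision training with Adam at roughly 16 bytes of model state per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for the fp32 optimiser states), and it assumes the idealised case in which all of those states are split evenly across GPUs. The function name and the cluster size of 256 GPUs are made up for illustration, and activations, buffers and fragmentation are ignored.

```python
# Assumed bytes per parameter for mixed-precision Adam training.
BYTES_FP16_WEIGHTS = 2
BYTES_FP16_GRADIENTS = 2
BYTES_FP32_OPTIMIZER = 12  # fp32 weight copy + momentum + variance, 4 bytes each
BYTES_PER_PARAM = BYTES_FP16_WEIGHTS + BYTES_FP16_GRADIENTS + BYTES_FP32_OPTIMIZER


def model_state_bytes_per_gpu(num_params: int, num_gpus: int, partitioned: bool) -> float:
    """Estimate the model-state memory each GPU has to hold.

    partitioned=False: plain data parallelism, every GPU keeps a full replica.
    partitioned=True:  idealised partitioning, every GPU keeps 1/num_gpus of the states.
    """
    total = num_params * BYTES_PER_PARAM
    return total / num_gpus if partitioned else float(total)


if __name__ == "__main__":
    params = 17_000_000_000  # T-NLG's reported parameter count
    gpus = 256               # hypothetical cluster size

    replicated = model_state_bytes_per_gpu(params, gpus, partitioned=False)
    sharded = model_state_bytes_per_gpu(params, gpus, partitioned=True)

    print(f"replicated:  {replicated / 1e9:.1f} GB per GPU")
    print(f"partitioned: {sharded / 1e9:.2f} GB per GPU")
```

Under these assumptions a 17 billion parameter model needs about 272 GB of model state per GPU when everything is replicated, which is far more than any single current GPU offers, and a little over 1 GB per GPU when the states are spread over 256 devices. Real-world numbers will differ depending on which states are actually partitioned and on everything this sketch leaves out.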

Unlike DeepSpeed, however, T-NLG isn’t quite ready for public consumption, since Microsoft has only released a private demo “to a small set of users within the academic community for initial testing and feedback”. 

Maybe some of this feedback could address the explainability of the system’s output, or even go deeper into the energy needed for computing models as complex as this, which has become quite a discussion point as some devs grow more aware of tech’s contributions to climate change.

But it probably won’t, and Bing will just start answering your question about the 2020 Oscar winners with people’s names instead of presenting you with some Mirror article.