Microsoft’s Bing team reshape Google’s BERT in their own Azure-powered image


Researchers at Microsoft’s Bing organisation have open sourced a brace of recipes for pre-training and fine-tuning BERT, the NLP model which Google itself open sourced just last November.

Google describes BERT as “the first deeply bidirectional, unsupervised language representation, pre-trained only using a plain text corpus” – the corpus in question being Wikipedia.

Wikipedia’s collective knowledge may be vast, researchers at Microsoft said, and “the broad applicability of BERT means that most developers and data scientists are able to use a pre-trained variant of BERT rather than building a new version from the ground up with new data.”

However, they continued, “it will not deliver best-in-class accuracy when crossing over to a new problem space.” For example, they suggest, a model for analysing medical notes needs a deep understanding of the medical domain, while processing legal documents needs training on, yes, legal documents.


Fine-tuning the model is not enough, they reason, and pre-training is in order. In addition, “users will need to change the model architecture, training data, cost function, tasks, and optimization routines. All these changes need to be explored at large parameter and training data sizes.”

The changes are “quite substantial”, given that BERT-large has 340 million parameters and was trained on 2.5 billion words from Wikipedia and 800 million from BookCorpus. Microsoft, unsurprisingly, chose to do this using its own Azure Machine Learning service.
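To put those figures in perspective, here is a rough back-of-envelope sketch in Python. The parameter and word counts are the ones quoted above; the 32-bit precision and the Adam-style optimiser (four stored values per parameter) are our assumptions for illustration, not details confirmed by Microsoft.

```python
# Back-of-envelope sizing from the figures quoted in the article.
PARAMS = 340_000_000             # BERT-large parameters
WIKIPEDIA_WORDS = 2_500_000_000  # Wikipedia words
BOOKCORPUS_WORDS = 800_000_000   # BookCorpus words

BYTES_PER_FP32 = 4               # assumption: 32-bit floats throughout
# Assumption: an Adam-style optimiser, which stores weights, gradients
# and two moment estimates, i.e. four values per parameter.
VALUES_PER_PARAM = 4


def gigabytes(n_bytes: int) -> float:
    """Convert a byte count to decimal gigabytes."""
    return n_bytes / 1e9


weights_gb = gigabytes(PARAMS * BYTES_PER_FP32)
training_state_gb = gigabytes(PARAMS * BYTES_PER_FP32 * VALUES_PER_PARAM)
corpus_words = WIKIPEDIA_WORDS + BOOKCORPUS_WORDS

print(f"weights alone: {weights_gb:.2f} GB")
print(f"weights + gradients + optimiser state: {training_state_gb:.2f} GB")
print(f"pre-training corpus: {corpus_words / 1e9:.1f} billion words")
```

Under those assumptions the weights alone come to about 1.36 GB and the training state to about 5.44 GB, before counting activations, which typically take up considerably more memory during training. The combined corpus is 3.3 billion words.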

“To get the training to converge to the same quality as the original BERT release on GPUs was non-trivial,” wrote Saurabh Tiwary, Applied Science Manager at Bing. “To pre-train BERT we needed massive computation and memory, which means we had to distribute the computation across multiple GPUs. However, doing that in a cost effective and efficient way with predictable behaviors in terms of convergence and quality of the final resulting model was quite challenging.”

The result is two recipes for pre-training and fine-tuning BERT using Azure’s Machine Learning service. The GitHub repo for the work includes a PyTorch Pretrained BERT package from Hugging Face, and also includes data preprocessing code which can be used on “Wikipedia corpus or other datasets for pretraining.” Raw and preprocessed English Wikipedia datasets and pre-trained models are also provided.

It also hosts an Azure Machine Learning service Jupyter notebook to launch pre-training, though the code, data, scripts and tooling can run in “any other training environment.” 
