Google’s Bert slashes NLP tuning times, gets open-sourced

Google has promised to slash the time you need to train a question-answering system to as little as 30 minutes by open-sourcing its pre-training model, Bert.

Bert stands for Bidirectional Encoder Representations from Transformers, as this blog from Google research scientists Jacob Devlin and Ming-Wei Chang explains. They claim that Bert is “the first deeply bidirectional, unsupervised language representation, pre-trained only using a plain text corpus.” You’ll be pleased to know the plain text corpus in question is Wikipedia.

They also explain how pre-training helps close the data gap in training NLP models, by generating a model based on “the enormous amount of unannotated text on the web”, which can then be fine-tuned on small data tasks, resulting in big improvements in accuracy.

The search giant compares its model to both context-free and contextual methods of pre-training. In the former, such as Word2Vec, a single word embedding representation is generated for each word – which creates problems where a single word has multiple meanings, for example, bank (or maybe, transformer). Contextual models use the other words in a sentence to help generate a representation, but to date these have usually been unidirectional models, with the context based only on the preceding words.
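To make the distinction concrete, here is a toy Python sketch – ours, not Google's, with an invented vocabulary and random vectors – showing why a context-free lookup table cannot tell the two senses of "bank" apart:

```python
import numpy as np

# Toy context-free embedding table: one fixed vector per word.
rng = np.random.default_rng(0)
vocab = ["bank", "river", "deposit", "account", "flooded"]
embedding_table = {word: rng.standard_normal(4) for word in vocab}

def context_free_embed(sentence):
    """Return one fixed vector per word, regardless of the surrounding words."""
    return [embedding_table[w] for w in sentence]

s1 = ["bank", "account", "deposit"]   # financial sense of "bank"
s2 = ["river", "bank", "flooded"]     # riverside sense of "bank"

print(np.allclose(context_free_embed(s1)[0], context_free_embed(s2)[1]))
# True - the table hands back the same vector for both senses, which is
# exactly the ambiguity a contextual model like Bert is built to resolve.
```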

However, as Google puts it, Bert takes into account both preceding and following words, “starting from the very bottom of a deep neural network, making it deeply bidirectional.”

Which begs the question: why hasn’t this been done before? The paper’s authors say “it is not possible to train bidirectional models by simply conditioning each word on its previous and next words, since this would allow the word that’s being predicted to indirectly ‘see itself’ in a multi-layer model.”

They got round this with “the straightforward technique of masking out some of the words in the input and then condition each word bidirectionally to predict the masked words”.
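The masking step itself is simple to picture. The sketch below is a simplified illustration of the idea, not Google's training code – the paper masks roughly 15 per cent of input tokens, and its full recipe also sometimes swaps in random words rather than the [MASK] symbol, which is omitted here:

```python
import random

MASK_TOKEN = "[MASK]"
MASK_PROB = 0.15   # the paper masks roughly 15 per cent of input tokens

def mask_tokens(tokens, mask_prob=MASK_PROB, seed=None):
    """Replace a random subset of tokens with [MASK] and record the original
    words the model must learn to predict from both left and right context."""
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}   # position -> original word (the training labels)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = MASK_TOKEN
            targets[i] = tok
    return masked, targets

tokens = "the man went to the bank to deposit his pay".split()
masked, targets = mask_tokens(tokens, seed=1)
print(masked)    # the input the model actually sees
print(targets)   # the masked positions and words it must recover
```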

In addition, they said, “Cloud TPUs gave us the freedom to quickly experiment, debug and tweak our models, which was critical in allowing us to move beyond existing pre-training techniques.”

They said that “on SQuAD v1.1, BERT achieves 93.2 per cent F1 score (a measure of accuracy), surpassing the previous state-of-the-art score of 91.6 per cent and human-level score of 91.2 per cent”.
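For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall; SQuAD computes it over the tokens of the predicted answer versus the reference answer. The one-liner below is a generic definition, not the official SQuAD evaluation script:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

print(f1(0.932, 0.932))  # 0.932 - F1 equals precision and recall when they match
```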

More compellingly, perhaps, they contend that using Bert will allow “anyone to train their own state-of-the-art question answering system…in about 30 minutes on a single cloud TPU, or in a few hours using a single GPU.”
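Part of why fine-tuning is so quick is that the question-answering head bolted onto Bert for SQuAD is tiny – in the paper it amounts to a learned start vector and end vector scored against every token. The numpy sketch below illustrates that idea with random stand-in numbers; the real work is done by the TensorFlow code Google has released and a pre-trained checkpoint:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden = 384, 768                              # typical BERT-Base shapes
token_reprs = rng.standard_normal((seq_len, hidden))    # stand-in for Bert's per-token output

# The QA head: one vector scores each token as the answer's start,
# another scores it as the answer's end.
w_start = rng.standard_normal(hidden)
w_end = rng.standard_normal(hidden)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

start_probs = softmax(token_reprs @ w_start)
end_probs = softmax(token_reprs @ w_end)

start = int(np.argmax(start_probs))
end = start + int(np.argmax(end_probs[start:]))   # end must not precede start
print(f"predicted answer span: tokens {start} to {end}")
```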

Right now, Bert is English only, but other languages should be available in the near future.