Codification sneaks into the NLP space, as spaCy lib hits third major release

spaCy 3.0

Natural Language Processing library spaCy 3.0 has been pushed into the open, bringing a slew of features worthy of a major release.

spaCy was first presented in 2015 as a tool for production-grade text processing in Python and Cython, mainly targeting smaller companies interested in NLP. Taking advantage of the latest research in the field, the MIT-licensed open source library currently promises support for tokenization and training for 60+ languages.

For the new release, the spaCy team seems to have focussed on making the processes involved in building an NLP application easier to automate by introducing an anything-as-code approach. Developers are now free to define their training runs in a configuration file, while ops can use a new projects system for describing build and deployment workflows for spaCy pipelines in a similar manner.
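As a sketch of what that looks like, a v3 training config groups settings into sections; the field names below follow the documented layout, but treat the specific values as illustrative placeholders:

```ini
; Illustrative excerpt of a spaCy v3 config.cfg; values are placeholders.
[paths]
train = "corpus/train.spacy"
dev = "corpus/dev.spacy"

[nlp]
lang = "en"
pipeline = ["tok2vec", "ner"]

[training]
dropout = 0.1
max_epochs = 20

[training.optimizer]
@optimizers = "Adam.v1"
```

A run then boils down to `python -m spacy train config.cfg --output ./output`, with every hyperparameter recorded in the file rather than scattered across command-line flags.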

Having those processes available as code makes it easier to version experiments and track changes, while also providing a way to share established workflows between teams and ensure that a series of actions can be rerun reliably. The spaCy team highlights this in the training context: since the config file includes all settings and hyperparameters, no defaults stay hidden, which makes it harder to accidentally get them wrong when repeating a training run.

The config system also encourages developers to customise models and combine different implementations of an algorithm via wrappers around popular frameworks such as PyTorch and TensorFlow. To get started, those interested in the training part can make use of a new widget, while those more on the operational side can give spaCy's repository a closer look, as it contains a variety of templates for different tasks and workflows which can be cloned and adjusted.
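The project templates centre on a `project.yml` file that names individual commands and chains them into workflows; a minimal sketch (command names, paths, and the pipeline itself are invented here for illustration) might look like this:

```yaml
# Hypothetical project.yml sketch: named commands plus a workflow chaining them.
title: "Example NER pipeline"

commands:
  - name: "train"
    help: "Train the pipeline from config.cfg"
    script:
      - "python -m spacy train config.cfg --output training/"
  - name: "evaluate"
    help: "Evaluate the trained pipeline on the dev set"
    script:
      - "python -m spacy evaluate training/model-best corpus/dev.spacy"

workflows:
  all:
    - "train"
    - "evaluate"
```

With that in place, `python -m spacy project run all` executes the steps in order, giving ops a single, versionable description of the build.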

Pipeline builders should also check the documentation to get an idea of the various newly added trainable pipeline components, which cover things like sentence segmentation and rule-based and lookup lemmatization, along with a base class for implementing custom pipeline elements.
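As a small illustration of that component API (assuming a spaCy 3.x install; the component name `debug_passthrough` is invented for this sketch), built-in components are added by string name and custom function components are registered with a decorator:

```python
import spacy
from spacy.language import Language

# Register a custom function component under an invented name; in v3,
# plain function components use the @Language.component decorator.
@Language.component("debug_passthrough")
def debug_passthrough(doc):
    # Do nothing interesting; just pass the Doc through unchanged.
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")        # built-in rule-based sentence segmenter
nlp.add_pipe("debug_passthrough")  # custom component, added by string name

doc = nlp("This is one sentence. Here is another.")
print([sent.text for sent in doc.sents])
```

Because components are referenced by name, the same pipeline layout can also be declared in the config file instead of in code.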

To improve accuracy, the 3.0 release is the first to feature transformer-based pipelines, which let developers use a pre-trained transformer, like those available for PyTorch, to train their own pipelines or implement multi-task learning by sharing a transformer between components. Parallel training has also become an option thanks to a new extension package called spacy-ray, which lets teams use the Ray framework for building distributed applications to train their models on several remote machines and speed up the process.
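A transformer-backed pipeline is wired up in the same config file as everything else; a rough sketch follows (the registered architecture name and the model chosen here are examples and may differ between spacy-transformers versions):

```ini
; Illustrative fragment: a transformer component shared by the pipeline.
[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "roberta-base"
```

Downstream components can then listen to the shared transformer's output, which is what makes the multi-task setup described above possible.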

Other than that, the spaCy team has been busy retraining existing model families and pipelines so that they can yield better results, adding a new data structure for storing overlapping spans and a new module to match patterns within the dependency parse.
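The new container for overlapping spans is exposed as `doc.spans` in spaCy 3; a minimal sketch (assuming a 3.x install, with the group name `"phrases"` chosen arbitrarily):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("The quick brown fox jumps")

# doc.spans holds named groups of spans that may overlap, unlike doc.ents.
# Assigning a list of Span objects stores them as a span group.
doc.spans["phrases"] = [doc[0:3], doc[2:4]]  # "The quick brown" and "brown fox" overlap

print([span.text for span in doc.spans["phrases"]])
```

This is useful for annotations like noun phrases or candidate entities, where overlapping matches are legitimate rather than an error.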

As this is a major release, there are some breaking changes, such as the Python dependency update to 3.6+ and a number of API adjustments, that need to be looked into before making the switch. Teams who have been using spaCy 2 can find an extensive list of what to consider when planning their upgrade in the project's documentation.

Commercial migration support for spaCy pipelines has been introduced to help with the process, so organisations with some cash to spend can learn how to make the most of the capabilities added in v3.0 straight from the folks who implemented them.