Codification sneaks into the NLP space, as spaCy lib hits third major release

Natural Language Processing library spaCy 3.0 has been pushed into the open, bringing a slew of features worthy of a major release.

spaCy was first presented in 2015 as a tool for production-grade text processing in Python and Cython, mainly targeting smaller companies interested in NLP. Taking advantage of the latest research in the field, the MIT-licensed open source library currently promises support for tokenization and training for 60+ languages.

For the new release, the spaCy team seems to have focussed on making the processes involved in building an NLP application easier to automate by introducing an "anything as code" approach. Developers are now free to define their training runs in a configuration file, while ops can use a new projects system for describing build and deployment workflows for spaCy pipelines in a similar manner.
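
For a sense of what the projects system looks like in practice, below is a trimmed-down sketch of a project file; the command names, scripts and paths are illustrative placeholders, and the real templates shipped by the spaCy team carry considerably more detail.

    # project.yml (sketch) -- names, scripts and paths are placeholders
    title: "Example NER pipeline"
    commands:
      - name: "convert"
        help: "Convert raw annotations into spaCy's binary training format"
        script:
          - "python scripts/convert.py assets/train.json corpus/train.spacy"
      - name: "train"
        help: "Train the pipeline from the config file"
        script:
          - "python -m spacy train configs/config.cfg --output training/"
    workflows:
      all:
        - convert
        - train

A workflow defined this way can then be executed end to end with python -m spacy project run all.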

Having those processes available as code makes it easier to version experiments and track changes, while also providing a way of sharing established workflows between teams and ensuring a series of actions can easily be rerun. The spaCy team highlights this in the training context: since the config file includes all settings and hyperparameters, defaults are explicit and it's harder to accidentally get them wrong when repeating a training run.
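
As a rough illustration of the format, the excerpt below shows the kind of sections a v3 training config contains; a full config generated with spacy init config includes many more blocks and settings, and the values here are placeholders.

    # config.cfg (excerpt, illustrative values only)
    [paths]
    train = "corpus/train.spacy"
    dev = "corpus/dev.spacy"

    [nlp]
    lang = "en"
    pipeline = ["tok2vec", "ner"]

    [training]
    max_epochs = 20
    dropout = 0.1

    [training.optimizer]
    @optimizers = "Adam.v1"

A run then boils down to something like python -m spacy train config.cfg --output ./output --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy, with individual settings overridable from the command line.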

It also encourages developers to customise models and combine different implementations of an algorithm via wrappers around popular frameworks such as PyTorch and TensorFlow. To get started, those interested in the training part can make use of a new widget. Those more on the operational side can give spaCy's repo a closer look, as it contains a variety of templates for different tasks and workflows that can be cloned and adjusted.
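
The wrapper route goes through Thinc, spaCy's machine learning layer. As a minimal sketch (the layer sizes are arbitrary, and hooking the result into a pipeline component still means registering it with spaCy's architecture registry), a PyTorch module can be dropped into a Thinc model like this:

    # Minimal sketch: wrapping a PyTorch module as a Thinc model (sizes are placeholders)
    import torch.nn
    from thinc.api import PyTorchWrapper, Softmax, chain

    torch_layer = torch.nn.Sequential(
        torch.nn.Linear(300, 128),
        torch.nn.ReLU(),
    )
    # chain() composes layers; the wrapped PyTorch block feeds a Thinc softmax output layer
    model = chain(PyTorchWrapper(torch_layer), Softmax(nO=5, nI=128))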

Pipeline builders should also check the documentation to get an idea of the various newly added trainable pipeline components for things like sentence segmentation and rule-based and lookup lemmatization, as well as a base class for implementing custom pipeline elements.
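
A hedged sketch of what that looks like in code follows; the built-in component names match spaCy's documentation, while the custom component is made up for illustration and the lookup lemmatizer assumes the spacy-lookups-data package is installed.

    import spacy
    from spacy.language import Language

    nlp = spacy.blank("en")

    # Built-in lemmatizer component, here in lookup mode
    # (requires the spacy-lookups-data package for the tables)
    nlp.add_pipe("lemmatizer", config={"mode": "lookup"})

    # Custom stateless components are registered by name and added to the pipeline
    @Language.component("token_counter")
    def token_counter(doc):
        print(f"{len(doc)} tokens")
        return doc

    nlp.add_pipe("token_counter", last=True)
    nlp.initialize()  # loads the lemmatizer's lookup tables
    doc = nlp("The cats were sitting on the mats")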

To improve accuracy, the 3.0 release is the first to feature transformer-based pipelines, which let developers use a pre-trained transformer, like those available for PyTorch, to train their own pipelines or implement multi-task learning by sharing transformers between components. Parallel training has also become an option thanks to a new extension package called spacy-ray. With it, teams can use the Ray framework for building distributed applications to train their models on several remote machines and speed up the process.
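
In practice, picking up one of the new transformer-based pipelines can be as simple as the sketch below, assuming the spacy-transformers extension and the en_core_web_trf package are installed; spacy-ray, for its part, plugs a ray subcommand into the CLI so a training run defined in a config file can be spread across workers.

    # Sketch: loading a transformer-based pipeline (needs spacy-transformers
    # and the en_core_web_trf package to be installed)
    import spacy

    nlp = spacy.load("en_core_web_trf")
    doc = nlp("spaCy 3.0 adds transformer-based pipelines.")
    print([(ent.text, ent.label_) for ent in doc.ents])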

Other than that, the spaCy team has been busy retraining existing model families and pipelines so that they yield better results, and has added a new data structure for storing overlapping spans as well as a new module for matching patterns within the dependency parse.
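
Both additions are straightforward to try out; in the sketch below the span group key and the dependency pattern are made-up examples, with Doc.spans holding the overlapping spans and DependencyMatcher doing the parse-based matching.

    import spacy
    from spacy.matcher import DependencyMatcher

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The quick brown fox jumps over the lazy dog")

    # Overlapping spans can now be stored side by side under a named group
    doc.spans["events"] = [doc[0:4], doc[3:5]]

    # DependencyMatcher patterns are anchored on the dependency parse:
    # find a verb and a token that is its nominal subject
    matcher = DependencyMatcher(nlp.vocab)
    pattern = [
        {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
        {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "subject",
         "RIGHT_ATTRS": {"DEP": "nsubj"}},
    ]
    matcher.add("SUBJECT_OF_VERB", [pattern])
    matches = matcher(doc)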

As this is a major release, there are some breaking changes, such as the Python dependency update to 3.6+ and a number of API adjustments, that need to be looked into before making the switch. Teams that have been using spaCy 2 can find an extensive list of what to consider when planning their upgrade in the project's documentation.

Commercial migration support for spaCy pipelines has been introduced to help with the process, so organisations with some cash to spend can learn how to make the most of the capabilities added in v3.0 straight from the folks who implemented them.