Molt bé: spaCy 3.1 learns Catalan, offers SpanCategorizer, and lets you tell it what’s wrong in the first place


A good half year after making its third major release available, the team behind natural language processing library spaCy is back with version 3.1 and trained pipelines for Catalan and Danish.

Additional pipelines aside, spaCy 3.1 is the first release to sport a SpanCategorizer component for labelling arbitrary and potentially overlapping spans of text. The experimental feature is meant to help with extracting spans that are neither distinct proper nouns nor self-contained expressions, and can be helpful in cases where the same stretch of text belongs to several categories at once.
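
To make the idea of overlapping spans a little more concrete, here is a minimal sketch. It assumes spaCy 3.1 or later is installed; the example sentence, the labels and the "sc" key are illustrative choices rather than anything prescribed by the release. The component is only added to a blank pipeline, not trained, so the spans are set by hand.

```python
import spacy
from spacy.tokens import Span

# Blank English pipeline; no trained model is needed for this sketch.
nlp = spacy.blank("en")

# Add the (untrained) span categorizer. "sc" is an illustrative key
# under which the spans are stored in doc.spans.
nlp.add_pipe("spancat", config={"spans_key": "sc"})

# make_doc only tokenises, so the untrained component is never run.
doc = nlp.make_doc("Welcome to the Bank of China.")

# Two spans that overlap on the token "China".
doc.spans["sc"] = [
    Span(doc, 3, 6, label="ORG"),  # "Bank of China"
    Span(doc, 5, 6, label="GPE"),  # "China"
]

overlapping = [(span.text, span.label_) for span in doc.spans["sc"]]
print(overlapping)
```

Unlike doc.ents, which does not allow entities to overlap, a span group will happily hold both annotations side by side.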

To help with annotating training data for the component, a corresponding UI and workflow are in the works for the team’s annotation tool, Prodigy. Devs wanting to give that a try can apply for the Prodigy nightly program, which already includes a preview.

spaCy’s entity recogniser, meanwhile, has been updated with a nifty new feature that lets users tell the component about incorrect span annotations by storing them in Doc.spans in the training data, under a key named in the component’s configuration. By providing these, researchers already familiar with commonly misclassified text spans can make the recognition process a little bit more accurate.
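
A rough sketch of what such negative annotations could look like follows. It assumes spaCy 3.1 or later; the key name "incorrect_spans", the sentence and the labels are made up for illustration, and the pipeline is left untrained.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

# Tell the entity recogniser which doc.spans key holds known-wrong spans.
# The key name is an arbitrary choice for this sketch.
nlp.add_pipe("ner", config={"incorrect_spans_key": "incorrect_spans"})

# Tokenise only; the untrained component is not run here.
train_doc = nlp.make_doc("Barack Obama was born in Hawaii.")

# Annotations the model should learn NOT to predict.
train_doc.spans["incorrect_spans"] = [
    Span(train_doc, 0, 2, label="ORG"),      # "Barack Obama" is not an ORG
    Span(train_doc, 5, 6, label="PRODUCT"),  # "Hawaii" is not a PRODUCT
]

negatives = [(s.text, s.label_) for s in train_doc.spans["incorrect_spans"]]
print(negatives)
```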

To make more use of predicted annotations when taking a multi-stage approach to model training, spaCy 3.1 comes with an additional configuration setting. If desired, developers can supply annotating_components in the [training] block to define which components should set annotations on the predicted docs during training, so that these can be used as features by later components.
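
As an illustration, the fragment below lets a parser annotate the predicted docs so that a tagger further down the pipeline could draw on them. It is a sketch only: the pipeline and its component names are assumptions, the fragment is far from a complete training config, and it is parsed with thinc’s Config class (the config system spaCy builds on) purely to show that the setting is a plain list of component names.

```python
from thinc.api import Config

# Illustrative fragment only; a real training config has many more
# sections. The component names are assumptions for this sketch.
CONFIG_TEXT = """
[nlp]
lang = "en"
pipeline = ["parser", "tagger"]

[training]
annotating_components = ["parser"]
"""

config = Config().from_str(CONFIG_TEXT)
annotating = config["training"]["annotating_components"]

# Every annotating component should also be part of the pipeline.
missing = [name for name in annotating if name not in config["nlp"]["pipeline"]]
print(annotating, missing)
```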

Since the advancement of natural language processing is pretty much a community effort, there is now an extension package, spacy-huggingface-hub, to upload trained and packaged spaCy pipelines to the Hugging Face Hub. Once installed, it adds a huggingface-hub push command to the spaCy CLI and takes care of generating meta information, providing the Hub with a readme and handling the package’s version control.

Smaller improvements include alpha tokenisation support for Azerbaijani, resizability for TextCatCNN and TextCatBOW architectures, part-of-speech tag-based lemmatisers for Catalan and Italian, as well as a whole list of bug fixes. 

The collection of spaCy resources, Universe, also grew in recent months and now offers a wrapper to use spaCy in Ruby, a toolkit for weak supervision for NLP tasks, a Python rule processing engine, and a tool to connect vowpal-wabbit and scikit-learn models to spaCy, amongst other things.

Though 3.1 is not a major version update, developers should give the upgrade notes a quick read before updating their installations. For one, the spaCy team warns that pipelines trained with 3.0 may perform slightly differently under 3.1. Users who see degradations are advised to retrain them with the current release.

The new version also requires users to include the source model’s vectors in [initialize.vectors] when sourcing pipeline components that require static vectors. In addition, it now emits Python warnings where it previously used logger warnings, so a bit of reorganising is necessary for anyone who manages warnings in their own code. Details can be found in the project’s repository.
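
For those who need to adjust their warning handling, the standard library’s warnings module is the place to start. The sketch below uses no spaCy code at all: noisy_step and the code W999 are hypothetical stand-ins for a pipeline call that emits a warning, and the filters match on the start of the message.

```python
import warnings


def noisy_step():
    # Hypothetical stand-in for a library call that emits a warning.
    warnings.warn("[W999] hypothetical pipeline warning", UserWarning)
    return "done"


# Silence one specific warning by matching the start of its message.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", message=r"\[W999\]")
    result = noisy_step()
    silenced = len(caught) == 0

# Or escalate the same warning to an exception, e.g. in a test suite.
with warnings.catch_warnings():
    warnings.filterwarnings("error", message=r"\[W999\]")
    try:
        noisy_step()
        escalated = False
    except UserWarning:
        escalated = True

print(result, silenced, escalated)
```

Filters added later take precedence over earlier ones, which is why the targeted "ignore" rule wins over the blanket "always" in the first block.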