TensorFlow gets text preprocessing library for wordy models

TensorFlow's next chapter

TensorFlow Text is a newly launched library that is meant to help machine learning practitioners working with text preprocess their data without having to leave the TensorFlow graph.

To build models from text, the raw data usually has to be prepared before any further operations can happen. Preprocessing steps such as breaking a string of text up into tokens like words and numbers, or simply normalising Unicode, are normally not done inside a graph, which is what TensorFlow uses for computation (unless it is run in eager mode for evaluation).

This isn't bad per se, but it can lead to a skew between the preprocessing done at training time and at serving time, which is something TF.Text is meant to mitigate with the operations it includes. The latter range from functions to transcode strings to UTF-8, to those that fold cases and normalise Unicode input, plus three tokenization approaches.
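To illustrate what case folding and Unicode normalisation do, here is a minimal sketch using only the Python standard library's `unicodedata` module; it mirrors the kind of cleanup described above but does not call the TF.Text ops themselves.

```python
import unicodedata

def normalize_text(s: str) -> str:
    # Apply Unicode NFKC normalisation (e.g. fullwidth -> ASCII forms),
    # then case-fold (a more aggressive lowercasing, e.g. German ß -> ss).
    # Sketch only -- TF.Text performs comparable steps inside the graph.
    return unicodedata.normalize("NFKC", s).casefold()

print(normalize_text("Straße"))        # case folding expands ß to ss
print(normalize_text("Ｈｅｌｌｏ"))    # NFKC maps fullwidth letters to ASCII
```

The point of having such operations as graph ops is that the exact same normalisation runs during training and serving.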

While the first splits a string as soon as it finds an ICU-defined whitespace character such as a tab or a space, the others break a string up where the Unicode script of the characters changes, or split recognised words further into sub-word units.
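The whitespace-based approach can be sketched in plain Python; note that this is a conceptual stand-in, not the library's own tokenizer, and that `str.split()` with no argument already treats all Unicode whitespace as a delimiter.

```python
def whitespace_tokenize(text: str) -> list[str]:
    # Split on any run of Unicode whitespace (space, tab, newline, NBSP, ...),
    # roughly what a whitespace tokenizer does. The TF.Text version would
    # instead emit a RaggedTensor and run as an op inside the graph.
    return text.split()

print(whitespace_tokenize("Everything not saved\twill be lost."))
# → ['Everything', 'not', 'saved', 'will', 'be', 'lost.']
```

Note that punctuation stays attached to the neighbouring word, which is exactly why the other, more linguistically aware tokenization approaches exist.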

To tempt those working with text even more, the tool also offers functionality that can assist in sequence modeling by finding patterns such as word capitalization or punctuation in input data. Examples showing how to make the most of the library can be found in the project's repository.
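A capitalization pattern of the kind mentioned above is often used as a shape feature in sequence labelling. The following is a hypothetical helper sketching the idea in plain Python; it is not a TF.Text function.

```python
import string

def case_pattern(token: str) -> str:
    # Map a token to a coarse shape feature for sequence modeling
    # (hypothetical helper, not part of the TF.Text API).
    if token.isupper():
        return "UPPER"   # e.g. acronyms like "NLP"
    if token.istitle():
        return "TITLE"   # e.g. sentence-initial "The"
    if token.islower():
        return "LOWER"
    if all(ch in string.punctuation for ch in token):
        return "PUNCT"
    return "MIXED"       # e.g. camel-cased names like "TensorFlow"

print([case_pattern(t) for t in ["TensorFlow", "text", "NLP", "!"]])
# → ['MIXED', 'LOWER', 'UPPER', 'PUNCT']
```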

TF.Text is compatible with TensorFlow's eager mode as well as its graph mode and is meant to be used with the upcoming v2.0 of the numerical computation library, a beta of which was recently released.
Since TensorFlow Text is a Python library, it can be easily installed via the pip package management system (pip install -U tensorflow-text). Its sources are available on GitHub, where the project is hosted under the Apache License 2.0.