TensorFlow gets text preprocessing library for wordy models

TensorFlow Text is a newly launched library meant to help machine learning practitioners working with text preprocess their data without having to leave the TensorFlow graph.

To build models from text, the raw data usually has to be prepared before any further operations can happen. Preprocessing steps such as breaking a string of text up into tokens like words and numbers, or simply normalising Unicode, are normally performed outside the graph that TensorFlow builds when it isn't running in eager mode.

This isn't bad per se, but it can lead to a skew between the preprocessing a model sees during training and the preprocessing available when it is served, which is something TF.Text is meant to mitigate with the operations it includes. These range from functions that transcode strings to UTF-8, fold cases, and normalise Unicode input, to three tokenization approaches.
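The normalisation helpers are regular TensorFlow ops, so they can sit directly in a graph alongside the rest of a model. A minimal sketch, assuming the case_fold_utf8 and normalize_utf8 functions shipped with the library:

    import tensorflow as tf
    import tensorflow_text as text

    docs = tf.constant([u'Heiße Suppe'.encode('utf-8'),
                        b'THE Quick Brown Fox'])

    # Case folding lower-cases text in a Unicode-aware way,
    # e.g. the German eszett becomes 'ss'.
    folded = text.case_fold_utf8(docs)

    # Normalise to the NFKC Unicode normalisation form.
    normalised = text.normalize_utf8(docs, 'NFKC')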

While the first of these, the WhitespaceTokenizer, splits a string wherever it finds an ICU-defined whitespace character such as a tab or a space, the UnicodeScriptTokenizer additionally breaks a string up at Unicode script boundaries, and the WordpieceTokenizer splits tokens further into the subword units used by many current language models.
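In use, the tokenizers return a RaggedTensor, since each input string yields a different number of tokens. A short sketch of the first two, based on the classes documented for the release:

    import tensorflow_text as text

    sentences = ['everything not saved will be lost.']

    # Split only on whitespace: the trailing '.' stays attached.
    ws_tokens = text.WhitespaceTokenizer().tokenize(sentences)
    # [['everything', 'not', 'saved', 'will', 'be', 'lost.']]

    # Split on script boundaries too: punctuation becomes its own token.
    us_tokens = text.UnicodeScriptTokenizer().tokenize(sentences)
    # [['everything', 'not', 'saved', 'will', 'be', 'lost', '.']]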

To tempt those working with text even more, the library also offers functions that can assist in sequence modeling by detecting patterns such as word capitalization or punctuation in input data. Examples for making the most of the library can be found in the project's repository.
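These pattern checks are exposed through a wordshape helper that maps each token to a boolean feature. A minimal sketch, assuming the WordShape enum names from the library's documentation:

    import tensorflow_text as text

    tokens = text.WhitespaceTokenizer().tokenize(
        ['Everything not saved will be lost.'])

    # Does each token start with a capital letter?
    has_title = text.wordshape(tokens, text.WordShape.HAS_TITLE_CASE)

    # Does each token contain punctuation or symbols?
    has_punct = text.wordshape(tokens,
                               text.WordShape.HAS_SOME_PUNCT_OR_SYMBOL)

The resulting boolean tensors line up token-for-token with the input, so they can be fed into a sequence model alongside other features.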

TF.Text is compatible with TensorFlow's eager mode as well as its graph mode and is meant to be used with the upcoming v2.0 of the numerical computation library, a beta of which was recently released.
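In practice, that means the same preprocessing code can run eagerly for experimentation or be traced into a graph, for instance by wrapping it in tf.function; a small sketch:

    import tensorflow as tf
    import tensorflow_text as text

    @tf.function
    def preprocess(strings):
        # The tokenizer is an ordinary TensorFlow op, so it is traced
        # into the graph like any other part of the model.
        return text.WhitespaceTokenizer().tokenize(strings)

    print(preprocess(tf.constant(['eager or graph, same code'])))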
Since TensorFlow Text is a Python library, it can be easily installed via the pip package management system (pip install -U tensorflow-text). Its sources are available on GitHub, where the project is hosted under the Apache License 2.0.