Google intros helpful TensorFlow Recorder, but warns you’ll need to cough up for ‘huge datasets’

Just a couple of days after pushing out the latest TensorFlow release, Google has open sourced a new tool for the machine learning framework that aims to push its record format forward.

TensorFlow Recorder is available on GitHub under the Apache License 2.0 and is meant to help create TFRecords from “images and labels in Pandas DataFrames or CSV files”.

According to Google Cloud AI engineers Mike Bernico and Carlos Ezequiel, the project has become necessary in a computer vision context, where data loading can take quite a while when not formatted properly. As a consequence, resources aren’t used as efficiently as they could be, making an already time-consuming process even lengthier.

When using TensorFlow to build models for these kinds of use cases, the project’s record format is one way to work around this bottleneck. It can be combined with approaches like prefetching, which fetches data for upcoming processing steps before it is needed, and interleaving, which reads from several sources in parallel, to reduce latency.
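To make the prefetching idea concrete, here is a minimal sketch in plain Python rather than TensorFlow's own input pipeline API. A background thread loads records into a small buffer while the consumer is still busy with earlier ones. The names `prefetch`, `load_records` and `consume` are hypothetical stand-ins for illustration, not part of TensorFlow or TensorFlow Recorder.

```python
import queue
import threading
import time


def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable`, loading up to `buffer_size` items ahead
    of the consumer in a background thread."""
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def producer():
        try:
            for item in iterable:
                buffer.put(item)  # blocks while the buffer is full
        finally:
            buffer.put(sentinel)  # always signal completion to the consumer

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item


def load_records(count, delay=0.01):
    """Hypothetical stand-in for slow record loading (disk or network)."""
    for index in range(count):
        time.sleep(delay)
        yield index


def consume(records, delay=0.01):
    """Hypothetical stand-in for a training step that takes some time."""
    results = []
    for record in records:
        time.sleep(delay)
        results.append(record * 2)
    return results


# Loading and consuming now overlap instead of running strictly in turn.
doubled = consume(prefetch(load_records(5)))
```

The buffer size bounds how far ahead the loader may run, so memory use stays limited while the consumer rarely has to wait for the next record.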

To get there, the raw data has to be converted, which requires some work not everyone is willing to put in. This is where Bernico and Ezequiel hope TensorFlow Recorder will come in, providing users with a comparatively easy way to go from image/label sets to TFRecords with only a little additional code.

However, for now the tool will be most useful to those already familiar with Google’s portfolio, since Recorder expects the data to come in an image CSV format similar to the one AutoML Vision prefers. The team “hopes” to extend format support in the future, but since the project is open source now, this feels more like a call to users to maybe do their bit to add Pandas DataFrame conversion to the mix.
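As a rough illustration of what such an image CSV could look like, the sketch below parses a small sample and groups the rows by dataset split. The column names `split`, `image_uri` and `label`, the split values and the bucket paths are all assumptions made for this example, so check the project's documentation for the layout it actually expects.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample: column names and bucket paths are assumptions.
SAMPLE_CSV = """split,image_uri,label
TRAIN,gs://example-bucket/images/cat_001.jpg,cat
TRAIN,gs://example-bucket/images/dog_001.jpg,dog
VALIDATION,gs://example-bucket/images/cat_002.jpg,cat
TEST,gs://example-bucket/images/dog_002.jpg,dog
"""

REQUIRED_COLUMNS = ("split", "image_uri", "label")


def group_by_split(csv_text):
    """Parse image CSV text and return {split: [(image_uri, label), ...]}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Fail early if the header lacks any of the assumed columns.
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    groups = defaultdict(list)
    for row in reader:
        groups[row["split"]].append((row["image_uri"], row["label"]))
    return dict(groups)


splits = group_by_split(SAMPLE_CSV)
print({name: len(rows) for name, rows in splits.items()})
```

Validating the header up front surfaces a wrongly shaped file before any conversion work starts.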

Another caveat is that, according to its creators, the project as is wouldn’t scale to “huge datasets” of millions of images. Since those datasets can indeed be necessary for more complex computer vision tasks, TensorFlow Recorder can be connected to Google Cloud Dataflow, which should be better able to handle large amounts of data.

Of course, having this option in place is very helpful, but it again pushes users in the direction of one of Google’s commercial offerings, which seem to have become more and more present in the open source project of late. Other examples of this development are the continuing focus on TPU integration for speed improvements and some packages making their way into Google Cloud Storage – something users should at least be aware of.