Thanks to LinkedIn and its newly open-sourced TonY, machine learning aficionados now have another way to run TensorFlow on Apache’s big data platform Hadoop. Distributed machine learning, here we come.
Apparently, LinkedIn engineers use deep neural networks to implement features like feeds and so-called smart replies for the platform’s messaging function. The datasets used in the process are stored in a Hadoop-based platform, however, and tend to be quite massive. To process them, the AI team uses distributed training with TensorFlow, Google’s machine learning library, as some members describe in a recent blog entry. Since orchestration proved to be a bit tricky, LinkedIn started to work on TonY, short for TensorFlow on YARN, though of course only after taking a stroll through the ecosystem to look at existing options.
It is made up of a client, an ApplicationMaster and a TaskExecutor, and is supposed to handle tasks such as resource negotiation and the setup of container environments for TensorFlow jobs on Hadoop. Users of the project start by submitting three things to the client: their code for TensorFlow model training, submission arguments, and their Python virtual environment including the TensorFlow dependency.
The client then sets up an ApplicationMaster, which it submits to the YARN cluster. Negotiations about the resources are done by the ApplicationMaster based on the user’s requirements. It also spawns TaskExecutors on allocated nodes, which in turn launch the training code.
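The hand-off described above can be pictured with a small toy model. This is only a sketch of the flow as the article describes it, not TonY's actual code or API (TonY itself is not written this way); the names `JobSpec`, `submit`, `negotiate` and the list of free nodes are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class JobSpec:
    """What the user hands to the client (hypothetical fields)."""
    script: str   # training code
    venv: str     # packaged Python virtual environment
    workers: int  # number of worker tasks requested


@dataclass
class TaskExecutor:
    """Runs on one allocated node and launches the training code."""
    node: str
    task_index: int

    def launch(self, spec):
        # A real executor would start a process; here we only record the plan.
        return (self.node, self.task_index, spec.script)


@dataclass
class ApplicationMaster:
    """Negotiates resources and spawns executors on the allocated nodes."""
    spec: JobSpec
    executors: list = field(default_factory=list)

    def negotiate(self, free_nodes):
        # Stand-in for talking to the YARN scheduler.
        if len(free_nodes) < self.spec.workers:
            raise RuntimeError("not enough free nodes for the request")
        return free_nodes[: self.spec.workers]

    def run(self, free_nodes):
        nodes = self.negotiate(free_nodes)
        self.executors = [TaskExecutor(n, i) for i, n in enumerate(nodes)]
        return [e.launch(self.spec) for e in self.executors]


def submit(spec, free_nodes):
    """Client side: set up an ApplicationMaster and hand the job over."""
    return ApplicationMaster(spec).run(free_nodes)
```

Calling `submit(JobSpec("train.py", "venv.zip", workers=2), ["node-a", "node-b", "node-c"])` in this toy model yields one launch record per allocated node, mirroring the client, ApplicationMaster, TaskExecutor chain.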
While it runs, TonY periodically checks whether both the TaskExecutors and the ApplicationMaster are still alive. If a worker stops sending heartbeats or an ApplicationMaster times out, the project will restart the application and resume training from checkpoints. This behaviour is meant to make TonY more fault tolerant than competing projects. Since the latter were also lacking in functions to work with GPUs, GPU scheduling is another one of TensorFlow on YARN’s features. To facilitate debugging and optimising, it also supports TensorBoard.
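To make the liveness check a little more concrete, here is a minimal sketch of heartbeat-based failure detection in general. It is not taken from TonY; the class `HeartbeatMonitor`, the function `next_action`, the timeout value and the checkpoint label are all assumptions made for illustration.

```python
import time


class HeartbeatMonitor:
    """Remembers when each task last reported in and flags silent ones."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock      # injectable so the logic can be tested
        self.last_seen = {}

    def register(self, task_id):
        self.last_seen[task_id] = self.clock()

    def heartbeat(self, task_id):
        self.last_seen[task_id] = self.clock()

    def expired(self):
        """Return the tasks whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            task for task, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


def next_action(monitor, latest_checkpoint):
    """Decide whether to keep going or restart from the latest checkpoint."""
    if monitor.expired():
        return ("restart", latest_checkpoint)
    return ("continue", None)
```

In this sketch a supervisor would call `next_action` on a timer; as soon as any registered task has been silent for longer than `timeout_s`, the whole job is restarted from the most recent checkpoint, which is the behaviour the article attributes to TonY.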
TensorFlow on YARN is available on GitHub under a BSD 2-Clause License. LinkedIn and GitHub both belong to Microsoft, while TensorFlow was initially developed at Google.