Google’s AI department has introduced GPipe, a new library for distributed machine learning, to the neural network building framework Lingvo. It uses pipeline parallelism to scale up the training of deep neural networks, with the aim of producing more accurate models.
According to Google AI software engineer Yanping Huang, “current state-of-the-art image models have already reached the available memory found on Cloud TPUv2s” so researchers can’t solely rely on hardware to move their models forward. They need a more efficient way of designing their training systems instead, and parallelising pipelines is one way to go.
Training a complex artificial neural network can be sped up either by splitting the training data across more machines or by using graphics or tensor processing units as training accelerators. Since TPUs are a Google product, making the most of the second approach seems like the reasonable way to go in-house, and GPipe helps with exactly that.
While a naive model parallelism strategy can lead to only one accelerator being active at a time, because of the sequential nature of neural networks, GPipe partitions a model across accelerators and splits batches of training examples into smaller units, so that the execution can be pipelined across them. This lets accelerators work in parallel and speeds up the process.
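To illustrate the idea, here is a minimal sketch in plain Python of how such a pipelined schedule could look for the forward pass. It is not GPipe’s or Lingvo’s actual API: the function names `pipeline_schedule` and `naive_steps` are made up for this example, and the sketch assumes every partition needs one time step per micro-batch, ignoring the backward pass and communication costs.

```python
def pipeline_schedule(num_partitions, num_micro_batches):
    """Return one dict per time step, mapping partition index -> micro-batch index.

    Hypothetical helper: assumes each partition spends exactly one time step
    on each micro-batch and only models the forward pass.
    """
    total_steps = num_partitions + num_micro_batches - 1
    schedule = []
    for step in range(total_steps):
        active = {}
        for partition in range(num_partitions):
            # Micro-batch m reaches partition p at time step m + p.
            micro_batch = step - partition
            if 0 <= micro_batch < num_micro_batches:
                active[partition] = micro_batch
        schedule.append(active)
    return schedule


def naive_steps(num_partitions, num_micro_batches):
    """Time steps needed when only one partition is active at a time."""
    return num_partitions * num_micro_batches


if __name__ == "__main__":
    partitions, micro_batches = 4, 8
    for step, active in enumerate(pipeline_schedule(partitions, micro_batches)):
        print(f"step {step}: {active}")
    print("naive steps:", naive_steps(partitions, micro_batches))
    print("pipelined steps:", len(pipeline_schedule(partitions, micro_batches)))
```

Under these assumptions, the accelerators sit idle only while the pipeline fills up and drains, so the idle share shrinks as the number of micro-batches grows relative to the number of partitions.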
To make sure the partitioning doesn’t affect the model quality, the library also makes use of synchronous stochastic gradient descent, accumulating gradients across the smaller units of each batch, the so-called micro-batches.
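Why accumulating gradients keeps the result independent of the split can be shown with a small sketch in plain Python. It is an illustration only, not GPipe code: the toy model y ≈ w · x with a mean squared error loss, the function names `mse_grad` and `accumulated_grad`, and the requirement that the batch divides evenly into micro-batches are all assumptions made for this example.

```python
def mse_grad(w, xs, ys):
    """Gradient of the mean squared error of y ≈ w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, num_micro_batches):
    """Same gradient, computed micro-batch by micro-batch and accumulated.

    Hypothetical helper: assumes the batch splits evenly into micro-batches.
    """
    if len(xs) % num_micro_batches != 0:
        raise ValueError("batch size must be divisible by num_micro_batches")
    size = len(xs) // num_micro_batches
    total = 0.0
    for i in range(num_micro_batches):
        lo, hi = i * size, (i + 1) * size
        # Weight each micro-batch mean by its size so the sum covers all examples.
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) * size
    # A single update would be applied with this gradient after the whole batch.
    return total / len(xs)


if __name__ == "__main__":
    xs = [float(i) for i in range(1, 9)]
    ys = [2.0 * x + 0.5 for x in xs]
    w = 0.3
    print("full batch: ", mse_grad(w, xs, ys))
    print("accumulated:", accumulated_grad(w, xs, ys, num_micro_batches=4))
```

Up to floating-point rounding, both calls return the same gradient, which is why a single synchronous update per batch gives the same step no matter how many micro-batches the batch was split into.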
In experiments by Google AI, the researchers measured an almost linear speedup when training one of their state-of-the-art AmoebaNet models, with an impressive 557 million model parameters and an input image size of 480×480 pixels, on Google Cloud TPUv2s.
Each Cloud TPUv2 includes eight accelerator cores, each of which has 8GB of memory and, without GPipe, can apparently train up to 82 million parameters. During their tests with GPipe, the team is said to have raised that number to 318 million parameters per core, so that AmoebaNet was ultimately able to incorporate 1.8 billion parameters on a TPUv2.