Google AI researchers have looked into ways of making reinforcement learning scale better and improve computational efficiency. The result is called SEED RL and can now be explored via GitHub.
SEED stands for scalable, efficient, deep reinforcement learning and describes a “modern RL agent that scales well, is flexible and efficiently utilises available resources”. In their research paper on the project, Lasse Espeholt and his colleagues cite the ability to train agents on millions of frames per second and the lower cost of experiments as the approach's key benefits, potentially opening RL up to a wider audience.
Reinforcement learning is a very use-case specific approach in which agents learn about their environment through exploration and optimise their actions to maximise the rewards they collect.
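To make that loop concrete, here is a minimal sketch using tabular Q-learning on a hypothetical five-state corridor; the environment, hyperparameters and all names are our own illustration, not taken from the paper:

```python
import random

# Toy environment (our own, not from the paper): a 5-state corridor where
# only stepping off the far right end pays a reward.
N_STATES, ACTIONS = 5, [0, 1]  # action 0 = left, 1 = right

def step(state, action):
    """Hypothetical environment dynamics: reward 1.0 for leaving the last state."""
    if action == 1:
        if state == N_STATES - 1:
            return 0, 1.0, True          # next state, reward, episode done
        return state + 1, 0.0, False
    return max(state - 1, 0), 0.0, False

# Tabular Q-learning: explore with probability eps, otherwise act greedily,
# and nudge each value estimate toward the observed reward plus lookahead.
random.seed(0)
q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, eps = 0.1, 0.95, 0.1

def greedy(s):
    """Pick the highest-valued action, breaking ties at random."""
    best = max(q[s])
    return random.choice([a for a in ACTIONS if q[s][a] == best])

state = 0
for _ in range(5000):
    action = random.choice(ACTIONS) if random.random() < eps else greedy(state)
    nxt, reward, done = step(state, action)
    target = reward if done else reward + gamma * max(q[nxt])
    q[state][action] += alpha * (target - q[state][action])
    state = 0 if done else nxt

print(q)  # after training, the "right" column should dominate
```

The agent starts out knowing nothing, stumbles into the reward through exploration, and gradually propagates that value back along the corridor, which is exactly the trial-and-error dynamic described above.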
However, since the method needs quite a lot of data to produce good results, distributed learning combined with accelerators such as GPUs can help generate and process that data within a reasonable time and budget.
Architectures following a similar approach include the distributed agent IMPALA, which, compared to SEED RL, supposedly has a number of drawbacks. For example, it keeps sending parameters and intermediate model states between actors and learners, which can quickly become a bottleneck. It also sticks to CPUs when applying model knowledge to a problem (inference), which isn't the most performant option for complex models and, according to Espeholt et al., doesn't utilise machine resources optimally.
SEED RL addresses these issues by having a central learner perform neural network inference on GPUs and TPUs, the number of which can be scaled to match demand. The system also includes a batching layer that collects data from multiple actors for added efficiency. Since the model parameters and state stay local to the learner, data transfer is less of an issue, while observations are streamed over a low-latency network based on gRPC to keep things running smoothly.
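That division of labour can be sketched roughly as follows, with plain Python queues standing in for the gRPC streams and a random linear layer standing in for the policy network; all names here are illustrative and not taken from the SEED RL repository:

```python
import queue
import random
import threading

OBS_DIM, N_ACTIONS, N_ACTORS = 4, 3, 8
random.seed(0)
# Stand-in for the policy network: a random linear layer held ONLY by the
# learner -- actors never see the parameters.
W = [[random.gauss(0, 1) for _ in range(N_ACTIONS)] for _ in range(OBS_DIM)]

requests = queue.Queue()                             # observations: actor -> learner
replies = [queue.Queue() for _ in range(N_ACTORS)]   # actions: learner -> actor

def forward(obs):
    """One inference step: logits of the linear policy, then greedy argmax."""
    logits = [sum(o * w for o, w in zip(obs, col)) for col in zip(*W)]
    return logits.index(max(logits))

def learner():
    # Central inference: gather one observation per actor into a batch and
    # evaluate the model for all of them (in SEED RL this batched forward
    # pass is what runs on the GPU/TPU).
    batch = [requests.get() for _ in range(N_ACTORS)]
    for actor_id, obs in batch:
        replies[actor_id].put(forward(obs))

t = threading.Thread(target=learner)
t.start()
observations = [[random.gauss(0, 1) for _ in range(OBS_DIM)] for _ in range(N_ACTORS)]
for i, obs in enumerate(observations):
    requests.put((i, obs))   # actors hold no model state; they just stream observations
actions = [replies[i].get() for i in range(N_ACTORS)]
t.join()
print(actions)
```

Because only observations and actions cross the actor-learner boundary, the expensive traffic of full parameter sets that the IMPALA comparison criticises never occurs in this setup.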
The SEED RL implementation is based on the TensorFlow 2 API and can be found on GitHub. It uses the policy-gradient-based V-trace algorithm to predict action distributions to sample actions from, and the Q-learning method R2D2 to select an action based on its predicted value.
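The two action-selection styles can be illustrated with a small, hedged sketch in our own toy code, not the SEED RL implementation: a policy-gradient agent (the V-trace side) samples from a predicted probability distribution over actions, while a Q-learning agent (the R2D2 side) acts greedily on predicted action values.

```python
import math
import random

def softmax(logits):
    """Turn raw policy logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(policy_logits, rng):
    """Policy-gradient style (V-trace side): sample from the distribution."""
    probs = softmax(policy_logits)
    r, cum = rng.random(), 0.0
    for action, p in enumerate(probs):
        cum += p
        if r < cum:
            return action
    return len(probs) - 1

def greedy_action(q_values):
    """Q-learning style (R2D2 side): take the highest predicted value."""
    return q_values.index(max(q_values))

rng = random.Random(0)
print(sample_action([2.0, 0.5, -1.0], rng))  # stochastic choice, biased to action 0
print(greedy_action([0.1, 0.9, 0.4]))        # always action 1
```

The stochastic sampling keeps policy-gradient agents exploring, while the greedy rule reflects how value-based methods exploit their learned estimates.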
Though their results have to be taken with a grain of salt, as is advised for all research, first benchmarks promise a significant increase in the number of computable frames per second compared to IMPALA in cases where accelerators are an option. Costs are also expected to drop in certain scenarios, since inference is said to be cheaper with SEED than with IMPALA's CPU-heavy approach. More details are available on the Google AI blog.