Facebook’s research team has just released PyTorch-BigGraph (PBG), giving anyone wondering how to quickly process graph-structured data for machine learning a leg-up – and promoting PyTorch against its TensorFlow rival in the process.
PBG is an optimised system for graph embeddings: it creates vector representations of graph-structured data, which are generally easier to work with than the graphs themselves. Such embeddings have been shown to be useful for tasks like recommendation and link prediction.
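To see why such vectors are convenient, consider that once entities are embedded, recommendation reduces to nearest-neighbour search in vector space. The snippet below is a minimal sketch of that idea using hypothetical toy vectors (the names and values are invented for illustration and are not PBG output):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def recommend(query, embeddings, k=2):
    """Return the k entities whose embeddings lie closest to the query's."""
    ranked = sorted(embeddings, key=lambda name: cosine(embeddings[query], embeddings[name]), reverse=True)
    return [name for name in ranked if name != query][:k]

# Hypothetical 2-d embeddings of a user and three films:
emb = {"user": [1.0, 0.0], "film_a": [0.9, 0.1], "film_b": [0.0, 1.0], "film_c": [0.8, 0.3]}
print(recommend("user", emb))
```

Real embeddings have hundreds of dimensions and millions of entities, so production systems use approximate nearest-neighbour indexes rather than a full sort, but the principle is the same.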
PBG is meant for exceptionally large and complex graphs spanning billions of nodes and trillions of edges, as are common in Facebook’s own graph data but also found in other web services with networking features, such as YouTube or Twitter.
A common problem in this area is scaling – especially providing the memory needed to make the most of such amounts of data. PBG is able to partition graphs so that large embeddings can be trained on a single machine or in a distributed environment without loading everything into memory.
It also makes use of multi-threading and batched negative sampling, which improve memory efficiency further and help with speed, a second bottleneck often encountered.
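The intuition behind batched negative sampling can be sketched in a few lines: instead of drawing fresh "corrupted" edges for every positive edge, one shared pool of sampled entities is reused as negatives across the whole batch, so sampling cost grows with the pool size rather than with batch size times negatives-per-edge. The function below is an illustrative sketch under that assumption; the names and sizes are invented and do not reflect PBG's actual API:

```python
import random

def batched_negatives(batch, entities, num_negs=3, seed=0):
    """Pair every positive (source, target) edge in the batch with every
    entity in one shared pool of sampled negatives, instead of sampling
    num_negs fresh negatives per edge.  Illustrative only."""
    rng = random.Random(seed)
    pool = rng.sample(entities, num_negs)  # one shared pool for the whole batch
    # Corrupt each positive edge by swapping its target for each pooled entity.
    return [(source, neg) for (source, _target) in batch for neg in pool]
```

In a real system these corrupted edges would be scored in a single batched matrix operation, which is where the memory and speed savings come from.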
According to the team, the quality of the generated embeddings should be comparable to that of other embedding systems, while requiring less training time. Training consists of ingesting a list of edges, each described by a source, a target, and – where available – a relationship. The output is a feature vector for every entity, with adjacent entities placed as close together as possible and unconnected ones pushed apart, which leads to a clustering of sorts.
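The pull-together/push-apart idea can be demonstrated with a deliberately tiny sketch. The code below is a toy gradient scheme on an edge list, not PBG's actual algorithm (which runs on PyTorch with partitioning and batched negatives at a vastly larger scale); all function names and hyperparameters are invented for illustration:

```python
import random

def train_embeddings(edges, num_nodes, dim=8, epochs=50, lr=0.05, margin=1.0):
    """Toy illustration of graph embedding training: endpoints of each edge
    are pulled together, and an unconnected node is pushed away from each
    source while it sits inside the margin.  Not PBG's actual algorithm."""
    rng = random.Random(0)
    emb = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_nodes)]
    neighbors = {u: set() for u in range(num_nodes)}
    for s, t in edges:
        neighbors[s].add(t)
        neighbors[t].add(s)
    for _ in range(epochs):
        for s, t in edges:
            # Pull the endpoints of each edge together.
            for d in range(dim):
                delta = emb[s][d] - emb[t][d]
                emb[s][d] -= lr * delta
                emb[t][d] += lr * delta
            # Push one unconnected node away from s (a "negative sample").
            negs = [n for n in range(num_nodes) if n != s and n not in neighbors[s]]
            if negs:
                n = rng.choice(negs)
                dist2 = sum((emb[s][d] - emb[n][d]) ** 2 for d in range(dim))
                if dist2 < margin:  # only push while inside the margin
                    for d in range(dim):
                        delta = emb[s][d] - emb[n][d]
                        emb[s][d] += lr * delta
                        emb[n][d] -= lr * delta
    return emb
```

Run on a graph made of two disconnected triangles, the two triangles end up as two well-separated clusters in the embedding space – the "clustering of sorts" described above.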
PyTorch-BigGraph relies on Python and the Facebook-maintained PyTorch, as well as a few other libraries, and is BSD-licensed. To get started, the GitHub repository contains example scripts and pretrained embeddings. Facebook’s AI team hopes that open-sourcing the system will encourage other companies to release large graph datasets and thereby facilitate research in the area.
More information on the model and the math behind the approach can be found in the associated paper (PDF), which was just presented at SysML, the Conference on Systems and Machine Learning, in Stanford, California.