With PyTorch-BigGraph, anyone can take a large graph and produce high-quality embeddings with the help of a single machine or multiple machines in parallel. PBG is written in PyTorch, allowing researchers and engineers to easily swap in their own loss functions, models, and other components. Other than that, PBG can also compute the gradients and is automatically scalable.
Facebook AI team states that the standard graph embedding methods don’t scale well and are not able to operate on large graphs consisting of billions of nodes and edges. Also, many graphs exceed the memory capacity of commodity servers creating problems for the embedding systems. But, PBG helps prevent that issue.
PBG performs block partitioning of the graph that helps overcome the memory limitations of graph embeddings. Also, nodes are randomly divided into P partitions ensuring the two partitions fit easily in memory. The edges are then further divided into P2 buckets depending on their source and the destination node. After this partitioning, training can be performed on one bucket at a time.
PBG offers two different ways to train embeddings of partitioned graph data, namely, single machine and distributed training.
- In a single-machine training, embeddings and edges are swapped out in case they are not being used.
- In distributed training, PBG uses PyTorch parallelization primitives and embeddings are distributed across the memory of multiple machines.
Facebook AI team also made several modifications to the standard negative sampling, which is necessary for large graphs. “We took advantage of the linearity of the functional form to reuse a single batch of N random nodes to produce corrupted negative samples for N training edges..this allows us to train on many negative examples per true edge at a little computational cost”, says the Facebook AI team.
To produce embeddings useful in different downstream tasks, Facebook AI team found an effective approach that involves corrupting edges with a mix of 50 percent nodes sampled uniformly from the nodes, and with 50 percent nodes sampled based on their number of edges.
Apart from that, to analyze PBG’s performance, Facebook AI used the publicly available Freebase knowledge graph comprising more than 120 million nodes and 2.7 billion edges. A smaller subset of the Freebase graph, known as FB15k. was also used. As a result, PBG performed comparably to other state-of-the-art embedding methods for the FB15k data set.
PBG was also used to train embeddings for the full Freebase graph where PBG’s partitioning scheme reduced both memory usage and training time. PBG embeddings were also evaluated for several publicly available social graph data sets and it was found that PBG outperformed all the competing methods.
“We..hope that PBG will be a useful tool for smaller companies and organizations that may have large graph data sets but not the tools to apply this data to their ML applications. We hope that this encourages practitioners to release and experiment with even larger data sets”, states the Facebook AI team.
For more information, check out the official Facebook AI blog.