Vectorizing Graph Neural Networks (2020) (moderndescartes.com)
74 points by brilee on July 3, 2023 | hide | past | favorite | 15 comments


Yes, people working on graph-based ML quickly realize that the underlying data structures used by most originally-academic libraries (NetworkX, PyG, etc.) are bad.

I wrote about this before [1] and based a node embedding library around the concept [2].

The NetworkX-style graphs are laid out as a bunch of items on the heap with pointers to each other. That works at extreme scales, when everything lives in a cluster's RAM and you don't mind paying the latency cost of fetch operations. But it makes little sense for graphs with < 5B nodes, to be honest.

Here's the dream (remains to be implemented by someone):

Laying out the graph as a CSR sparse matrix makes way more sense because of data locality. You have an array of edges per node, an index-pointer array, then one matrix with a row per edge for edge data and one matrix with a row per node for node data. Ideally you back the entire thing with Apache Arrow memory to ease access from other libraries/languages.

At larger scales, you could just leave the CSR array data on NVMe drives, and you'd still operate at 500 MB/s random query throughput with hand-coded access, ~150 MB/s with mmap.
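To make the layout concrete, here's a toy sketch of it using scipy's csr_matrix (the graph and values are just illustrative, not anything from the library above):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy directed graph: 4 nodes, edges as (src, dst, weight) triples.
edges = np.array([(0, 1, 0.5), (0, 2, 1.0), (1, 2, 2.0), (2, 0, 0.25), (2, 3, 1.5)])
src, dst, w = edges[:, 0].astype(int), edges[:, 1].astype(int), edges[:, 2]

g = csr_matrix((w, (src, dst)), shape=(4, 4))

# The flat arrays described above:
print(g.indptr)   # index-pointer array, len = n_nodes + 1
print(g.indices)  # destination node per edge, contiguous per source node
print(g.data)     # edge data, one entry per edge (a row per edge in general)

# All of node 2's edges come out as one contiguous slice -- no pointer chasing:
lo, hi = g.indptr[2], g.indptr[3]
print(g.indices[lo:hi], g.data[lo:hi])  # neighbors [0, 3], weights [0.25, 1.5]
```

Node data would just be a separate dense array/matrix indexed by node id alongside this.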

[1] https://www.singlelunch.com/2019/08/01/700x-faster-node2vec-...

[2] https://github.com/VHRanger/nodevectors


I've seen this done in practice, holding the entire graph in memory (millions of vertices, billions of edges). It was pretty darn fast.

A question I have wrt memory access for CSR: is the access pattern suboptimal still? While there is less pointer chasing, I believe you are not guaranteed for the edge data of adjacent nodes to be adjacent in memory, thus there is still potential for many cache misses.

Another open question is whether you can further compress the CSR representation. For float-based edge data I think quantization works well, and I believe you can further compress the ROW/COL indices (delta encoding for the former). Compression potentially helps here because you can fit more in cache and can shove more edges into the CPU.


> I believe you are not guaranteed for the edge data of adjacent nodes to be adjacent in memory

The edge data of a particular node is contiguous, but yes, the edge data of a collection of nodes is not contiguous. You can reorder (permute) the graph for some metric as a preprocessing step so that you get better locality for your average access pattern. This only works for static graphs though.
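In scipy terms, applying such a reordering is just a symmetric permutation of the CSR matrix (sketch with a random matrix standing in for a real graph; the permutation here is random, but in practice it would come from RCM, a 1d embedding, etc.):

```python
import numpy as np
from scipy.sparse import random as sparse_random

rng = np.random.default_rng(0)
adj = sparse_random(100, 100, density=0.05, format="csr", random_state=rng)

# perm: new node i corresponds to old node perm[i].
perm = rng.permutation(100)

# Permute rows and columns together so edges stay consistent:
# adj_perm[i, j] == adj[perm[i], perm[j]].
adj_perm = adj[perm][:, perm]
print(adj_perm.nnz == adj.nnz)  # same edges, new memory order
```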

> For float-based edge data I think quantization works well, and I believe you can further compress the ROW/COL indices

Yes, index compression is pretty well studied and understood. The challenge is mostly getting both a good compression ratio and high decompression performance. There are a couple of works I'm aware of that do this for GPUs. This repo by Mo Sha et al. (https://github.com/desert0616/GCGT) is pretty good, and I also did some work in this space (https://github.com/pgera/efg).
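The basic idea behind most of these schemes is gap (delta) encoding of the sorted neighbor lists: for a well-ordered graph the gaps are small, so they pack into few bits with a varint or bit-packing code on top. A minimal numpy sketch of just the delta step, on a made-up neighbor list:

```python
import numpy as np

# Sorted CSR column indices (neighbor list) of one node.
indices = np.array([3, 7, 8, 15, 16, 17, 40], dtype=np.int64)

# Gap encoding: first value, then successive differences.
deltas = np.diff(indices, prepend=0)
print(deltas)  # [ 3  4  1  7  1  1 23] -- small gaps, cheap to varint/bit-pack

# Lossless decode with a prefix sum.
decoded = np.cumsum(deltas)
assert np.array_equal(decoded, indices)
```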


Oh, one more question (checked out your repo): why use GPUs here vs CPUs? Is it the possibility of more parallelism when traversing the graph?


Yes, you can get good performance on GPUs due to the parallelism and memory bandwidth. On a new GPU like H100, I believe you can do ~ 50-100 GTEPS (billions of traversed edges per sec) in a BFS. I'm not sure where the state of the art on CPUs is at, but you can certainly do efficient implementations on CPUs too. This paper is a few years old (https://people.csail.mit.edu/jshun/spaa2018.pdf), but has some numbers.
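For reference, the traversal being benchmarked there is basically a level-synchronous BFS over the raw CSR arrays. A naive single-threaded sketch in Python (the fast implementations parallelize the frontier expansion, but the structure is the same):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Small directed example graph in CSR form.
row = [0, 0, 1, 2, 2, 3]
col = [1, 2, 3, 3, 4, 4]
g = csr_matrix((np.ones(6), (row, col)), shape=(5, 5))

def bfs_levels(g, source):
    """Level-synchronous BFS using only the indptr/indices arrays."""
    n = g.shape[0]
    level = np.full(n, -1, dtype=np.int64)
    level[source] = 0
    frontier = np.array([source])
    d = 0
    while frontier.size:
        # Gather all out-neighbors of the current frontier...
        nbrs = np.concatenate(
            [g.indices[g.indptr[u]:g.indptr[u + 1]] for u in frontier]
        )
        # ...and keep only the unvisited ones as the next frontier.
        nbrs = np.unique(nbrs)
        nxt = nbrs[level[nbrs] == -1]
        d += 1
        level[nxt] = d
        frontier = nxt
    return level

print(bfs_levels(g, 0))  # [0 1 1 2 2]
```

GTEPS numbers count how many of those `indices[indptr[u]:indptr[u+1]]` edge inspections per second an implementation sustains.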

For CPU-based index compression/decompression, the WebGraph framework is quite mature and widely used, and there's also Ligra+ that does it.


Very cool, will digest all this! For my use-case CPU + lots of RAM has been fast enough, and there's a balance between per-thread latency and throughput (serving data concurrently). I'm definitely interested in compressing down the graph data further to see if I can drop latency further, will see if I can adapt some of this. Super neat to see this on GPUs, will check that out too.

Also a fan of https://github.com/frankmcsherry/COST if you've seen it before as well!


Awesome, thanks! Hacking on something in this space atm ;D


> A question I have wrt memory access for CSR: is the access pattern suboptimal still? While there is less pointer chasing, I believe you are not guaranteed for the edge data of adjacent nodes to be adjacent in memory, thus there is still potential for many cache misses.

Yes, you still get cache misses, but far fewer.

You can also lay the graph array order out to minimize cache misses. This is something I've been looking into, but haven't implemented personally.

For example, the Eytzinger layout makes binary search much faster than sorted array order [1]. Similarly for graphs, depending on the use case, there's Reverse Cuthill-McKee:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.s...

You can also just embed the graph into 1d, then take the sorted (or Eytzinger) order of the 1d embedding values, which should minimize cache misses for the average traversal.
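The RCM version is already in scipy, so trying it is cheap. A sketch on a toy symmetric graph (a path given in scrambled node order, so the bandwidth starts out bad):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

# Undirected path 0-4-2-3-1, stored symmetrically.
edges = [(0, 4), (4, 2), (2, 3), (3, 1)]
row = [u for u, v in edges] + [v for u, v in edges]
col = [v for u, v in edges] + [u for u, v in edges]
g = csr_matrix((np.ones(len(row)), (row, col)), shape=(5, 5))

perm = reverse_cuthill_mckee(g, symmetric_mode=True)
g_rcm = g[perm][:, perm]

def bandwidth(m):
    """Max |row - col| over nonzeros: a proxy for how spread out edges are."""
    r, c = m.nonzero()
    return int(np.abs(r - c).max())

print(bandwidth(g), bandwidth(g_rcm))  # reordering shrinks the bandwidth
```

Lower bandwidth means a node's neighbors sit closer to it in the array order, which is the locality you want on traversal.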

> Another open question is whether you can further compress the CSR representation. For float-based edge data I think quantization works well, and I believe you can further compress the ROW/COL indices (delta encoding for the former). Compression potentially helps here because you can fit more in cache and can shove more edges into the CPU.

Yes, that's fairly well studied.

[1] https://algorithmica.org/en/eytzinger


> You can also lay the graph array order out to minimize cache misses. This is something I've been looking into, but haven't implemented personally.

The issue with RCM is that it only works for undirected graphs, from what I remember. I'm less familiar with GNNs, but if you are trying to optimize for arbitrary BFS type traversals, an easy trick is to just do a few traversals from random sources and average the results. This can be used to reorder the graph and works pretty well in practice.


Right, RCM assumes a symmetric matrix.

Doing a single pass of a 1d embedding algo like GGVec [1] (note: I wrote the algo, but it's effectively just the GloVe algo applied to general graphs) or ProNE (if you have enough RAM) is also computationally very cheap.

[1] https://github.com/VHRanger/nodevectors


You might be interested in duckdb-pgq [1], which is working on graph query support in DuckDB. There are some papers online about it as well if you're interested.

1: https://github.com/cwida/duckdb-pgq


PyG actually supports sparse adjacency matrices, but I remember it being a pain to get right the last time I tried it.


Isn't it the default? At least I have the feeling it was in pre-PyG days.


Does the design differ for training vs inference/prediction?


Depends on the model application?

If you have to carry around the graph for inference, then you want it compact.

If you just fold the graph into some model, then it doesn't really matter by inference time.



