Last year, GraphChi, a spin-off of GraphLab (a distributed, high-performance framework for graph computation), did something remarkable.
GraphChi outperformed a 1,636-node Hadoop cluster processing a Twitter graph (a 2010 dataset with 1.5 billion edges) -- using a single Mac Mini. The task was triangle counting: the Hadoop cluster required over seven hours, while GraphChi on the Mac Mini did it in one hour! How is that possible? The answer lies in ruthless optimization of the algorithm and the benefits of a single machine versus the overhead of a generic cluster setup.
The first advantage of GraphChi is that, by not dealing with distributed processing, it can simplify many assumptions and the algorithms built on them. Armed with this knowledge and an understanding of the general strengths and weaknesses of single-machine performance, the processing steps can be designed. A single computer usually has two relevant characteristics: first, a large graph problem does not fit into its fast RAM (Random Access Memory); second, it has a disk large enough to hold the data.
Traditional disks perform well only on sequential reads, not on random reads. Modern computers may come with solid-state disks, which offer faster random reads and writes, though they are still significantly slower than RAM. Consequently, any algorithm aiming to solve graph problems on single-machine commodity hardware has to utilize the disk while minimizing random access to the data.
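To make the sequential-versus-random gap concrete, here is a hypothetical micro-benchmark (entirely my own sketch, not from GraphChi): it reads the same file once sequentially and once in shuffled 4 KB chunks. On a rotational disk with a cold cache the difference is dramatic; with a warm page cache or an SSD it shrinks, which is exactly the hardware dependence described here.

```python
import os
import random
import tempfile
import time

# Hypothetical micro-benchmark: compare reading one 16 MB file
# sequentially vs. in randomly ordered 4 KB chunks.
CHUNK = 4096
N_CHUNKS = 4096  # 16 MB total

path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(os.urandom(CHUNK * N_CHUNKS))

def read_chunks(offsets):
    """Read CHUNK bytes at each offset; return total bytes read."""
    total = 0
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            total += len(f.read(CHUNK))
    return total

sequential = [i * CHUNK for i in range(N_CHUNKS)]
shuffled = sequential[:]
random.shuffle(shuffled)

t0 = time.perf_counter()
seq_bytes = read_chunks(sequential)
t1 = time.perf_counter()
rand_bytes = read_chunks(shuffled)
t2 = time.perf_counter()
print(f"sequential: {t1 - t0:.3f}s  random: {t2 - t1:.3f}s")
```

Both passes read exactly the same bytes; only the access pattern differs, which is the variable GraphChi's design controls for.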
Divide and conquer
Aapo Kyrola, a PhD candidate at Carnegie Mellon University, took this knowledge to improve GraphLab, a distributed graph computation framework. His idea was to divide the graph into disjoint shards, each of which fits into the machine's RAM. The shards can then be processed in parallel in memory. Updates that have to be written to other shards are subsequently applied in a sequential pass. This minimizes random operations on disk and exploits the machine's RAM and parallel processing abilities.
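The sharding idea can be sketched in a few lines, assuming edges are given as (source, destination) pairs (the function and variable names here are mine, not GraphChi's):

```python
# Sketch of GraphChi-style sharding: partition the edges into disjoint
# shards by destination-vertex interval, so each shard fits in RAM, and
# keep every shard sorted by source vertex (which the PSW algorithm
# described below relies on).
def make_shards(edges, num_vertices, num_shards):
    interval = (num_vertices + num_shards - 1) // num_shards
    shards = [[] for _ in range(num_shards)]
    for src, dst in edges:
        shards[dst // interval].append((src, dst))
    for shard in shards:
        shard.sort()  # sorts by source vertex first
    return shards
```

For example, with `num_vertices=6` and `num_shards=3`, the edge `(5, 1)` lands in shard 0, because its destination vertex 1 falls into the first interval [0, 2).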
Parallel Sliding Window processing of split graph (source)
Aapo invented the PSW (Parallel Sliding Window) algorithm to achieve the key performance improvement: (nearly exclusively) sequential disk behaviour. PSW sorts the edges within each shard by their source vertex. This means that each shard is itself split into blocks of edges relevant to the remaining shards.
For example, at interval 1 (see the figure above), shard 1 is processed in memory. It contains the edges whose destination vertices fall into the first interval. The out-edges of those same vertices form a contiguous block of the source-sorted edges in each of the remaining shards and can be read sequentially. All updates are computed and stored in place in memory for shard 1 and then written back sequentially to the remaining shards, updating the blocks previously read. At the end, the in-memory updated version of shard 1 is written sequentially to disk. At interval 2, shard 2 is loaded and the same process is applied again to the remaining shards.
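The interval processing above can be modelled as a short Python sketch. This is a toy, fully in-memory version (the names and the `update` rule are my assumptions, not GraphChi's actual API): each shard holds `(src, dst, value)` edges for its destination interval, sorted by source, and one pass visits every interval in turn.

```python
# Toy model of one Parallel Sliding Window pass. shards[p] holds the
# (src, dst, value) edges whose destination lies in interval p, sorted
# by source vertex. `update` maps a vertex's in-edge values to its new
# value, which is then written onto that vertex's out-edges.
def psw_pass(shards, interval_size, update):
    for p in range(len(shards)):
        lo, hi = p * interval_size, (p + 1) * interval_size
        # 1. Load shard p (the "memory shard"): all in-edges of interval p.
        in_edges = {}
        for src, dst, val in shards[p]:
            in_edges.setdefault(dst, []).append(val)
        # 2. Run the update function on every vertex in the interval.
        new_vals = {v: update(vals) for v, vals in in_edges.items()}
        # 3. Write the new values onto the out-edges of interval p.
        #    Because every shard is sorted by source, these edges form
        #    one contiguous block per shard (the sliding window); on
        #    disk this is a sequential rewrite, not random access.
        for q in range(len(shards)):
            shards[q] = [
                (s, d, new_vals[s]) if lo <= s < hi and s in new_vals else (s, d, v)
                for s, d, v in shards[q]
            ]
```

With `update=sum`, for instance, one pass propagates each vertex's summed in-edge values onto its out-edges. The real algorithm streams the blocks from disk instead of rewriting whole shards, but the access pattern is the same.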
This approach suits the architecture of modern commodity computers very well, as performance tests in the original paper illustrate. For example, striping data across several disks, or using an SSD instead of a rotational disk, improves performance, but rarely more than two-fold, because the algorithm has optimized away the need for high permanent-storage performance. Even increasing the number of shards has little influence on GraphChi's throughput, promising reliable performance on even larger graphs. Impressively, another sign of the algorithm's efficiency is that moving the computation from disk entirely into memory improves computation time only by a factor of 1.1 to 2.5 over SSD.
GraphChi demonstrates paradigm-shifting performance gains compared to generic alternatives like Hadoop and Spark, and holds up even against GraphLab and PowerGraph, which are highly optimized for graph computing. The latter is an optimized parallel distributed approach: it can solve the Twitter triangle-count problem in 1.5 minutes. However, it employs 64 nodes with eight cores each, for a total of 512 cores. That is, roughly, a performance improvement of factor 40, at the cost of 256 times the compute power (in cores).
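The trade-off in those numbers is simple arithmetic (assuming, as the factor of 256 implies, that the Mac Mini contributes 2 cores):

```python
# Back-of-the-envelope comparison from the numbers above.
graphchi_minutes, powergraph_minutes = 60, 1.5
graphchi_cores, powergraph_cores = 2, 64 * 8  # Mac Mini vs. 64 eight-core nodes

speedup = graphchi_minutes / powergraph_minutes  # 40x faster
core_ratio = powergraph_cores / graphchi_cores   # at 256x the cores
print(f"{speedup:.0f}x faster using {core_ratio:.0f}x the cores")
```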
While there are various ways of comparing these two very different approaches and architectural requirements, the takeaway is that a roughly one-order-of-magnitude performance improvement required a more-than-two-orders-of-magnitude increase in computing resources. And as mentioned at the beginning of the article, Hadoop as a generic framework performs very poorly on this task.
GraphLab Inc., a spin-off of this research, received $6.75 million in venture funding to develop products from the GraphLab algorithms. I had the chance to try out their beta program, a cloud- and web-based solution focused on large-scale machine learning problems, and I was impressed (watch this space for a future article on this). In the meantime, you can download and compile GraphChi and try it out yourself.
Re: GraphChi wow I agree -- really interested in the hardware points. I didn't realize that hardware improvements were being algorithmically optimized away so easily; makes me want to focus on software optimizations more!
Re: GraphChi Indeed. It is worthwhile to remember that Hadoop is a multi-purpose platform, and some problems are solvable (and worth solving) with specifically engineered solutions. At the same time, I expect that with YARN, Hadoop will increasingly become a platform, and sometimes easy/available beats fast. "Right tool for the job" is only true with the full context.
User Rank: Blogger 10/30/2013 | 5:55:08 AM
Re: GraphChi The current poll notwithstanding, it's great that we are closing out the Hadoop conversation. GraphChi is the tip of the iceberg when it comes to the processing power behind these analytics... we need this conversation to turn away from Hadoop more decisively -- it's an option, that's all.