Now that the hype is dying down, Hadoop is increasingly challenged as slow and limited in its application. Marketing departments wildly exaggerated Hadoop's ability.
Surprisingly, it's about to prove them somewhat right.
Leaps in accessibility, but not performance
A few years ago, writing software for Hadoop was slow and cumbersome. You had to write Java software, and you had to understand how to fit it into the MapReduce framework. Luckily, this is the past. Today, we can extract, transform, and load data with a plethora of tools, making development much easier and faster.
On the lower level, Crunch makes Java-based MapReduce programming and testing easier. Pig lets you write data transformations in a high-level data flow language, abstracting away low-level programming scaffolding. Hive brings SQL-like abilities to Hadoop, opening the data up to a large number of people. Alternative frameworks have emerged as well, especially around Python: Hadoop Streaming, mrjob, dumbo, hadoopy, and pydoop, to name a few.
The accessibility and ease of programming of MapReduce have leaped forward in the last few years.
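To give a feel for how little code the Python route needs, here is a minimal sketch of a word count written in the style of Hadoop Streaming, where mappers and reducers exchange tab-separated text records. The function names (mapper, reducer, run_local) are made up for illustration, and run_local only imitates the map, sort, and reduce phases in a single process; it is not how a real cluster job is submitted.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts per word; assumes the records arrive sorted by key."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(int(count) for _, count in group)


def run_local(lines):
    """Hypothetical stand-in for the framework: map, sort (the shuffle), reduce."""
    return dict(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    print(run_local(["Hadoop is slow", "Hadoop is changing"]))
```

In a real streaming job the mapper and reducer would be separate scripts reading from standard input, and Hadoop would do the sorting between them.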
Projects like HBase, HCatalog, Zookeeper, Mahout, and others expanded the Hadoop ecosystem, catering to a wide variety of use-cases to fully utilize Hadoop's HDFS and MapReduce core. That core, however, did not change significantly, and has become the bottleneck for new use-cases around cluster computation and resource management. It's at the center of the criticisms of Hadoop being slow and limited.
Bypassing MapReduce
There are some projects around the corner that will change this, and open Hadoop to new applications. Most importantly, Hadoop will become much faster and more flexible.
The change is coming from two directions. First, two data querying services building on top of Hadoop but bypassing MapReduce have emerged: Impala (in beta and built by Cloudera), and Drill (in development and supported by MapR). They draw their inspiration from Google's Dremel paper and the subsequent Google BigQuery service.
BigQuery delivers interactive querying of billions of rows of data, often returning results faster than current Hadoop clusters can fire up the first of many MapReduce jobs resulting from an equivalent Hive query.
The speed is achieved by localizing query execution on, or close to, the cluster nodes as much as possible, exchanging as little data as late as possible between nodes, while accessing the data on the nodes directly without incurring MapReduce job overheads. Column-oriented storage is introduced to reduce the data loaded for a query. This can often greatly reduce data loading, since few queries in practice need to scan full tables or full rows.
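The effect of column-oriented storage is easy to see in a toy example. The sketch below is only an illustration with made-up data and function names, not how Impala, Drill, or Dremel lay out data on disk. It sums one field over a row layout and over a column layout and counts how many values each approach has to touch.

```python
# Made-up sample data: three records with three fields each.
ROWS = [
    {"user": "a", "country": "UK", "amount": 10.0},
    {"user": "b", "country": "DE", "amount": 5.5},
    {"user": "c", "country": "UK", "amount": 4.5},
]


def to_columns(rows):
    """Pivot row records into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_store(rows, field):
    """Row layout: every field of every row is loaded to read a single one."""
    total, values_read = 0.0, 0
    for row in rows:
        values_read += len(row)
        total += row[field]
    return total, values_read


def sum_column_store(columns, field):
    """Column layout: only the requested column is loaded."""
    column = columns[field]
    return sum(column), len(column)


if __name__ == "__main__":
    print(sum_row_store(ROWS, "amount"))
    print(sum_column_store(to_columns(ROWS), "amount"))
```

Both calls return the same total, but the row layout reads nine values where the column layout reads three. On wide tables with dozens of columns the gap grows accordingly.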
Localizing the execution takes advantage of local caches and fast memory, reducing the slow and expensive network interactions.
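A rough sketch of what localized execution means: each node aggregates its own data first, and only the small partial results cross the network to be merged. The node names, data, and functions below are hypothetical, and everything runs in one process, so this shows the idea rather than an actual distributed query engine.

```python
# Hypothetical layout: each list is the data block stored on one node.
NODE_BLOCKS = {
    "node1": [("UK", 10.0), ("DE", 5.5), ("UK", 1.0)],
    "node2": [("UK", 4.5), ("FR", 2.0)],
}


def local_aggregate(block):
    """Runs on the node that holds the block: sum amounts per country."""
    partial = {}
    for country, amount in block:
        partial[country] = partial.get(country, 0.0) + amount
    return partial


def merge(partials):
    """Runs on the coordinator: combine the small per-node results."""
    total = {}
    for partial in partials:
        for country, amount in partial.items():
            total[country] = total.get(country, 0.0) + amount
    return total


def run_query(node_blocks):
    """Aggregate locally first, then exchange only the partial sums."""
    partials = [local_aggregate(block) for block in node_blocks.values()]
    return merge(partials)


if __name__ == "__main__":
    print(run_query(NODE_BLOCKS))
```

The raw rows never leave their node here. What gets exchanged is one number per country per node, which is the "as little data as late as possible" principle in miniature.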
I had a chance to chat with a Google engineer involved with BigQuery at a conference last year. He divulged that BigQuery works not only because of the architectural changes, but also because of specialized hardware -- super fast, specialized, tightly-coupled network switches between a very large number of nodes. Consequently, BigQuery performance is unlikely to be matched by small to medium-sized companies for a long time, despite the achievements made with Impala and Drill.
I wouldn't be surprised to see large IaaS providers like Amazon fill the gap in a year or two with an Impala/Drill service on optimized hardware, following the successful Elastic MapReduce service.
The second development has been an overall reinvention of Hadoop, which is covered in Hadoop 2.0 & Beyond: Reinventing Hadoop. In the meantime, bypassing MapReduce is the first important step to making Hadoop as useful as possible.
— Christian Prokopp, Data Scientist, Rangespan