Now that the hype is dying down, Hadoop is increasingly criticized as slow and limited in its applications. Marketing departments wildly exaggerated Hadoop's abilities.
Surprisingly, it's about to prove them somewhat right.
Leaps in accessibility, but not performance
A few years ago, writing software for Hadoop was slow and cumbersome. You had to write Java code, and you had to understand how to fit it into the MapReduce framework. Luckily, that is in the past. Today we can extract, transform, and load data with a plethora of tools, making development much easier and faster.
At the lower level, Crunch makes Java-based MapReduce programming and testing easier. Pig lets you write data transformations in a high-level data-flow language, abstracting away the low-level programming scaffolding. Hive brings SQL-like abilities to Hadoop, opening the data up to a much larger audience. Alternative frameworks have also emerged, especially around Python: Hadoop Streaming, mrjob, dumbo, hadoopy, and pydoop, to name a few.
The accessibility and ease of programming of MapReduce have leaped forward in the last few years.
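As a quick illustration of how low the barrier has become, here is a minimal word-count job for Hadoop Streaming written in plain Python. It is a sketch, not production code; the file names mapper.py and reducer.py and the word-count task itself are just placeholders for whatever transformation you need.

    #!/usr/bin/env python
    # mapper.py -- reads raw text lines from stdin and emits one
    # tab-separated (word, 1) pair per word; Hadoop Streaming takes
    # care of the shuffle and sort between the two scripts.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py -- the mapper output arrives grouped and sorted by key,
    # so a running counter per word is enough to aggregate the counts.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

Both scripts run unchanged on a laptop (cat input.txt | python mapper.py | sort | python reducer.py) and on a cluster via the hadoop-streaming jar, with -mapper and -reducer pointing at them. No Java, and no MapReduce API in sight.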
Projects like HBase, HCatalog, Zookeeper, Mahout, and others expanded the Hadoop ecosystem, catering to a wide variety of use-cases to fully utilize Hadoop's HDFS and MapReduce core. That core, however, did not change significantly, and has become the bottleneck for new use-cases around cluster computation and resource management. It's at the center of the criticisms of Hadoop being slow and limited.
There are some projects around the corner that will change this, and open Hadoop to new applications. Most importantly, Hadoop will become much faster and more flexible.
The change is coming from two directions. First, two data-querying services that build on Hadoop but bypass MapReduce have emerged: Impala (in beta, built by Cloudera) and Drill (in development, backed by MapR). They draw their inspiration from Google's Dremel paper and the subsequent Google BigQuery service.
BigQuery delivers interactive querying of billions of rows of data, often returning results faster than current Hadoop clusters can fire up the first of the many MapReduce jobs resulting from an equivalent Hive query.
The speed is achieved by localizing query execution on, or close to, the cluster nodes as much as possible, exchanging as little data between nodes as late as possible, and accessing the data on the nodes directly without incurring MapReduce job overhead. Column-oriented storage further reduces the data loaded for a query, often dramatically, since full table and row scans are rare in practice.
Localizing the execution also takes advantage of local caches and fast memory, reducing slow and expensive network interactions.
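To make the column-oriented idea concrete, here is a toy sketch in plain Python; the table schema (user_id, country, revenue, payload) is invented for illustration, and this is of course not how Impala or Dremel actually store data on disk.

    # Toy comparison of row-oriented vs. column-oriented layouts.
    # The point is which bytes a query has to touch, not performance.
    from collections import defaultdict

    # Row-oriented layout: a scan drags every column of every record along,
    # including the wide "payload" column nobody asked for.
    rows = [
        {"user_id": i, "country": "DE" if i % 2 else "US",
         "revenue": i * 0.1, "payload": "x" * 100}
        for i in range(100000)
    ]

    # Column-oriented layout: one array per column, stored separately
    # (on disk these would be separate files or blocks).
    columns = {name: [r[name] for r in rows]
               for name in ("user_id", "country", "revenue", "payload")}

    # "SELECT country, SUM(revenue) ... GROUP BY country" needs exactly two
    # columns, so only those two arrays are ever scanned; the bulky
    # "payload" column never has to be read at all.
    totals = defaultdict(float)
    for country, revenue in zip(columns["country"], columns["revenue"]):
        totals[country] += revenue
    print(totals)

In memory the difference is negligible, but once each column sits in its own file or block on HDFS, a query like this reads only a small fraction of the table, which is where much of the speedup on wide tables comes from.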
I had a chance to chat with a Google engineer involved with BigQuery at a conference last year. He divulged that BigQuery works not only because of the architectural changes, but also because of specialized hardware: super-fast, tightly coupled network switches connecting a very large number of nodes. Consequently, BigQuery's performance is unlikely to be matched by small to medium-sized companies for a long time, despite the achievements made with Impala and Drill.
I wouldn't be surprised to see large IaaS providers like Amazon fill the gap in a year or two with an Impala/Drill service on optimized hardware, following the successful Elastic MapReduce service.
The second development has been an overall reinvention of Hadoop, which is covered in Hadoop 2.0 & Beyond: Reinventing Hadoop. In the meantime, bypassing MapReduce is the first important step to making Hadoop as useful as possible.
Re: So long MapReduce? I agree, the convergence is happening. With Hive/Drill/Impala, data warehouse and HPC solutions might see real competition. If we can explore data on Hadoop and use it as a generic cluster resource manager, then smaller data warehouses may be the result. HPC may also become an increasingly niche product.
Re: So long MapReduce? @Christian, great post. I think you hit the nail on the head; this is a natural evolution.
What MR did was highlight the inherent weakness of other data processing methodologies for processing large amounts of data in an environment where inexpensive scale-out was the only way to go because a scale-up solution was infeasible/impractical/impossible.
MR has done for conventional ETL the same thing that MPP did for databases. Interestingly there seems to be a convergence on the horizon.
Re: So long MapReduce? For developers, the more generic cluster resource management that YARN provides (see the 2nd part, coming up soon) is very exciting. It gives us a framework to deploy any kind of distributed workload, e.g. simple tasks like web crawlers or workers ingesting a queue, and so on.
Re: So long MapReduce? Nice article, Christian. It's good to see Hadoop maturing. With the emergence of Impala and Drill, we will most likely see new languages, or extensions to existing languages, that make working with Hadoop easier for developers.
Re: So long MapReduce? I think it is a natural evolution now that old-school, batch-oriented MR has been well understood and utilised. People want to get to the data faster and in more ways, so breaking Hadoop up into a more flexible cluster resource control mechanism (see the 2nd part) is a good step. And if you have all these nodes with (potentially) free memory and CPU cycles hanging around, the question is how we can use them to get to the data faster. Lastly, streaming and online models and updates are a big push at the moment.
So long MapReduce? Great article, @Christian. The concept of bypassing MapReduce is sort of mind-blowing. It feels like the majority of big data querying technologies stem from this particular way of returning results.