A new MapReduce replacement drastically reduces Hadoop query times, fulfilling end user needs.
If big data is hype, then Hadoop is reality. As Robert Plant explains in Hadoop: Executives & Managers Need to Read the Warning Labels, Hadoop is not something every business can or should plug into its infrastructure. It’s not to be taken lightly; you need layers of other technologies to make it work -- and, most importantly, the right skill set to create the magic.
Making the magic happen faster
Well, fellow big data wizards, Cloudera may have dropped a spanner in our magical kingdom with its aptly named Impala. Impala is billed as a fully distributed query engine with SQL like syntax that runs on top of Hadoop Clusters, is faster, and replaces MapReduce.
The business case behind Impala was taken from the Cloudera 2012 Customer Survey (see slide 5):
- 78 percent of users need faster Hadoop queries
- 71 percent move data from RMDBS for interactive SQL
- 62 percent see value in consolidating to a single platform
It’s been validated by a number of big data experts like Big Data Republic blogger Brett Sheppard of Company Tableau.
We’ve had technology that has previously addressed similar issues including Apache Pig and Hive, developed by Yahoo and Facebook respectively and also used by the likes of LinkedIn, Twitter, and AOL. The key difference is all previous open-source attempts have been based on, and restricted by, MapReduce tasks, whereas Impala is based on the same concepts as Google’s query engine Dremel.
Beta testing
Impala looks like the real deal. As soon as we heard the news, one of our developers tried out the Beta release and reported, with wide-eyed concern surrounding his loss of magical powers, “it’s really fast; really, really fast.” It’s impossible for me to give you any exact improvements, so there are no scientific facts behind our findings; we know it could work, but it’s difficult for us to accurately benchmark without a large time commitment. The best set of test results I have seen so far was run by 37signals.
The bench test was run across five different workloads, such as "800 Mb parsed rails log -- slowest accounts." Hive Query Time and MySQL Query Time on this test were returned in 33.2 and 48.1 seconds respectively. Impala returned on this test in one second flat.
10x improvements reported
Another company that used it extensively is Pentaho, which previously used Apache Hive. It didn’t release any actual benchmarks, but BI Scorecard reported Pentaho finding a 10x query performance improvement using Impala over Hive.
You can try it out for yourself: Impala is available for download at GitHub and from the Cloudera site. However, it’s very much in the concept stage at the moment, and we think a more complete version will be released in conjunction with Hadoop’s next release.
I’d just like to finish with a little perspective: Impala won’t be the silver bullet to solve all the concerns around Hadoop’s relationship with your infrastructure and big data in general, but, it does, as I expected, show that steps are being made to bring existing (and common) skill sets closer to big data technology.
Related posts:
— Will Crandle, Co-Founder, JobsTheWord & Servicemarq