A new MapReduce replacement drastically reduces Hadoop query times, fulfilling end user needs.
If big data is hype, then Hadoop is reality. As Robert Plant explains in Hadoop: Executives & Managers Need to Read the Warning Labels, Hadoop is not something every business can or should plug into its infrastructure. It's not to be taken lightly; you need layers of other technologies to make it work -- and, most importantly, the right skill set to create the magic.
Making the magic happen faster
Well, fellow big data wizards, Cloudera may have dropped a spanner in our magical kingdom with its aptly named Impala. Impala is billed as a fully distributed query engine with SQL-like syntax that runs on top of Hadoop clusters, replaces MapReduce, and returns results faster.
71 percent move data from RDBMS for interactive SQL
62 percent see value in consolidating to a single platform
It's been validated by a number of big data experts, including Big Data Republic blogger Brett Sheppard of Tableau.
Earlier technologies have addressed similar issues, including Apache Pig and Hive, developed by Yahoo and Facebook respectively and also used by the likes of LinkedIn, Twitter, and AOL. The key difference is that all previous open-source attempts have been based on, and restricted by, MapReduce tasks, whereas Impala is based on the same concepts as Google's query engine Dremel.
Beta testing
Impala looks like the real deal. As soon as we heard the news, one of our developers tried out the beta release and reported, with wide-eyed concern surrounding his loss of magical powers, "it's really fast; really, really fast." It's impossible for me to give you any exact improvements, so there are no scientific facts behind our findings; we know it could work, but it's difficult for us to accurately benchmark without a large time commitment. The best set of test results I have seen so far was run by 37signals.
The benchmark was run across five different workloads, such as "800 Mb parsed rails log -- slowest accounts." On this test, Hive and MySQL returned results in 33.2 and 48.1 seconds respectively. Impala returned in one second flat.
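To put those figures in proportion, here is a minimal sketch that works out the speedup ratios for this one workload. It uses only the three query times quoted above; the names `query_times` and `speedup_over` are made up for illustration, and a single workload says nothing about how Impala behaves on other queries.

```python
# Query times in seconds, as quoted above for the
# "800 Mb parsed rails log -- slowest accounts" workload.
# Only these three numbers come from the article; everything else is illustrative.
query_times = {"Hive": 33.2, "MySQL": 48.1, "Impala": 1.0}


def speedup_over(baseline: str, candidate: str = "Impala") -> float:
    """Return how many times faster `candidate` was than `baseline`."""
    return query_times[baseline] / query_times[candidate]


if __name__ == "__main__":
    for engine in ("Hive", "MySQL"):
        print(f"Impala vs {engine}: {speedup_over(engine):.1f}x faster")
```

Because the Impala time is quoted as a round one second, the ratios are simply the Hive and MySQL times themselves, so treat them as rough orders of magnitude rather than precise measurements.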
10x improvements reported
Another company that used it extensively is Pentaho, which previously used Apache Hive. It didn't release any actual benchmarks, but BI Scorecard reported Pentaho finding a 10x query performance improvement using Impala over Hive.
You can try it out for yourself: Impala is available for download on GitHub and from the Cloudera site. However, it's very much in the concept stage at the moment, and we think a more complete version will be released in conjunction with Hadoop's next release.
I'd just like to finish with a little perspective: Impala won't be the silver bullet to solve all the concerns around Hadoop's relationship with your infrastructure and big data in general, but it does, as I expected, show that steps are being made to bring existing (and common) skill sets closer to big data technology.
User Rank: Blogger 11/30/2012 | 10:27:55 AM
Re: One in the chamber And at what point do we need to be ready to turn off SkyNet? Seriously though, 100% uptime... that would be an achievement in itself, but the tech needed to get us there would be the real benefit. Machines with sight... imagine then unhooking it from internal data and pointing it at the wider world of social media and content... it would be a whole new world of refining results. But gee, what power!
User Rank: Exabyte Executive 11/30/2012 | 9:25:49 AM
Re: One in the chamber That's a good question @Saul. It's probably the next step in the maturity process. At the just-concluded Raleigh CIO conference I attended a session with Helen Gu, Ph.D., from NC State, who is working on an algorithm to make virtual servers in the cloud self-correct memory leaks, which theoretically could lead to 100% uptime, so I'd guess it's a short leap of faith to enable machines to recognize links among vaguely related data lines.
User Rank: Blogger 11/30/2012 | 4:54:11 AM
Re: One in the chamber It's good to see progress being made on this level - it shows a maturity to the industry. We've improved the basics, now let's improve our improvements. Along with speed, do you think the next set of new tool developments will be along the line of machine learning and recognizing links between disparate sets of data?
User Rank: Exabyte Executive 11/29/2012 | 1:34:30 PM
One in the chamber Impala may not be a magic bullet, but it's one in the chamber. More will follow. The quest for faster searches will become a cottage industry for Hadoop. More startups and more financing for them isn't a bad thing.