Hadoop is good at tackling big data queries, but it is not always fast enough. You need to be aware of the other approaches to getting big data insight in real time.
Big data analytics with Hadoop is traditionally a batch process -- Hadoop's design relies on a sequence of long-running steps. The Hadoop client computes the input splits and submits the job; then, with the help of the jobtracker and tasktrackers, map tasks are started on these inputs. The output of the map tasks is partitioned and sorted, and only then are the reduce tasks launched on these intermediate map results. Finally, the reduce tasks emit the output. It is not fast, because it was never meant to be.
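The phases above can be sketched in plain Python as a toy word count. This is a single-process illustration only -- real Hadoop distributes each phase across a cluster -- but it shows why the model is inherently batch: each phase must finish before the next can start.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: every input record is turned into (key, value) pairs.
    mapped = [pair for record in records for pair in map_fn(record)]
    # Shuffle/sort phase: intermediate pairs are sorted and grouped by key.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: each key's values are folded into the final output.
    return {key: reduce_fn(key, [v for _, v in group])
            for key, group in groupby(mapped, key=itemgetter(0))}

def word_map(line):
    return [(word, 1) for word in line.split()]

def word_reduce(word, counts):
    return sum(counts)

result = run_mapreduce(["big data", "big deal"], word_map, word_reduce)
# result == {"big": 2, "data": 1, "deal": 1}
```

Note that the reduce step cannot begin until every map output has been sorted and grouped -- that strict ordering of phases is the latency the real-time systems below try to avoid.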
Companies buried under data face strong demand to process it in real time and support ad-hoc queries, so they can act on the results quickly. This is the area where the key players in big data started exploring new methods and came up with new ways of working with large-scale data.
Google launched a new web service called Google BigQuery in May 2012 to process massive datasets -- up to billions of rows and terabytes of data. Google BigQuery offers several ways to access the service; the simplest is the BigQuery browser tool (a Google account is required). It supports DDL-like operations (create/share/delete dataset and create/copy/export/delete table) and SELECT statements. Data can be uploaded from the browser in CSV format.
As an alternative, Google BigQuery has a Python-based command-line tool. You log in to BigQuery using OAuth2 authentication, and you can display table data and run queries.
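Under the hood, tools like the command-line client talk to BigQuery over a REST interface. As a rough sketch, the snippet below builds the kind of request a client sends to the BigQuery v2 query endpoint; the project ID is made up for illustration, and a real call would attach an OAuth2 bearer token and POST the body.

```python
import json

def build_query_request(project_id, sql):
    # Endpoint path and body layout follow the BigQuery v2 REST API's
    # synchronous query call; this only constructs the request, it does
    # not send it.
    url = ("https://www.googleapis.com/bigquery/v2/projects/"
           f"{project_id}/queries")
    body = json.dumps({"query": sql})
    return url, body

url, body = build_query_request(
    "my-project",  # hypothetical project id
    "SELECT word, COUNT(*) FROM publicdata:samples.shakespeare "
    "GROUP BY word LIMIT 10")
# An OAuth2 Authorization header would be added before POSTing body to url.
```

The SQL itself is the familiar SELECT dialect mentioned above; the service, not the client, does the heavy lifting.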
Twitter's Storm is another solution aiming to fill the gap in the big data real-time arena. Though Storm's architecture resembles Hadoop's, it takes a different approach.
Storm introduces the notions of streams, spouts, bolts, and topologies. A topology defines the nodes in the cluster, which represent the processing logic, and the links between the nodes, which define the data flow. Streams are sequences of tuples representing the data to be processed; a spout is a data source, and bolts process the data.
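These abstractions can be modeled in a few lines of Python. To be clear, this is a conceptual sketch and not the real Storm API (Storm topologies are defined in Java): a spout emits a stream of tuples, bolts transform the stream, and the "topology" is simply how they are wired together.

```python
def sentence_spout():
    # A spout is the source of a stream; here, a small generator of tuples.
    for sentence in ["storm processes streams", "streams of tuples"]:
        yield (sentence,)

def split_bolt(stream):
    # A bolt consumes tuples and emits new tuples downstream.
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def count_bolt(stream):
    # A terminal bolt that aggregates the stream into word counts.
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# The "topology": spout -> split bolt -> count bolt.
counts = count_bolt(split_bolt(sentence_spout()))
# counts == {"storm": 1, "processes": 1, "streams": 2, "of": 1, "tuples": 1}
```

The key contrast with the MapReduce sketch earlier: tuples flow through the bolts continuously as they arrive, rather than waiting for a whole phase to complete.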
The cluster has a node called Nimbus, which can be considered the equivalent of the Jobtracker in the Hadoop world, and Supervisors -- worker nodes with functions similar to the Tasktrackers in Hadoop. More details on Storm concepts can be found in the Storm Tutorial on GitHub.
Cloudera is one of the most influential players in big data and a key contributor to the Hadoop code base, with Hadoop creator Doug Cutting on its management team. In October 2012, it launched Cloudera Impala as a new component of the CDH4 suite, with the goal of bringing real-time query capabilities to Hadoop. The solution is based on Dremel -- the same concept behind Google BigQuery -- which was originally published by Google in 2010.
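Much of Dremel's speed comes from columnar storage: an analytic query that touches one field reads only that field's values, not whole records. The snippet below is a rough illustration of that storage idea only (it does not attempt Dremel's nested-record encoding), with made-up sample data.

```python
rows = [
    {"user": "a", "bytes": 120, "country": "US"},
    {"user": "b", "bytes": 300, "country": "DE"},
    {"user": "c", "bytes": 80,  "country": "US"},
]

# Row-oriented scan: every full record is visited to read a single field.
row_total = sum(r["bytes"] for r in rows)

# Column-oriented layout: each field is stored contiguously, so an
# aggregate over "bytes" never touches the other columns' data.
columns = {key: [r[key] for r in rows] for key in rows[0]}
col_total = sum(columns["bytes"])

assert row_total == col_total == 500
```

Both layouts give the same answer; the columnar one simply scans far fewer bytes per analytic query, which is where the real-time feel comes from.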
As of this writing, Cloudera Impala does not support DDL operations -- for table creation, it relies on Hive. The supported queries are a subset of the SELECT statements available in Hive, and they can be run from the Impala shell, a command-line tool again written in Python. Do you see the common pattern emerging here with the Google BigQuery shell?
Cloudera Impala is still in beta. Cloudera does not consider Impala a replacement for MapReduce, but rather a complementary approach to MapReduce and Hive. Cloudera offers a free e-learning course on Impala; if you are interested in the details, I recommend taking it.
User Rank: Exabyte Executive 12/30/2012 | 9:24:00 PM
Re: The rivals Looks like most of the promoted alternatives to Hadoop are fairly expensive, proprietary, hardware-based storage vendors. Everyone knows that Hadoop has its flaws, but the benefits outweigh the drawbacks.
Re: Outside of Hadoop From a technical standpoint, where are the speed tests that compare these processing solutions? The reason I ask: are we ever going to see a one-size-fits-all solution, or will the cloud processing market always be fragmented?
Think of the SQL ecosystem: it has been giant and vibrant for decades, with lots of lessons learned and experience to help jumpstart projects. Can and will that level of commitment and understanding happen if we live in a fragmented cloud marketplace?
User Rank: Petabyte Pathfinder 12/26/2012 | 2:56:00 AM
Re: Outside of Hadoop But some companies see Hadoop as a technology partner, not competition, and most of them have already done some great work with Hadoop (riding on Hadoop); they're doing a sort of "piggy-back thing."
User Rank: Petabyte Pathfinder 12/26/2012 | 2:43:07 AM
Re: Nice to have choices. The dynamics of today's IT environment demand that companies pay attention to how they treat and handle big data challenges. There's a lot of competition in the industry, and that competition demands something big!
User Rank: Petabyte Pathfinder 12/26/2012 | 2:31:52 AM
Re: The rivals
Hadoop is still the best and most widely adopted option for extracting value from big data, and it's well positioned to remain the dominant platform for big data management for the next five years. Hadoop is all the rage in today's tech world, and almost every tech company uses it. But challengers are coming; the closest rival is EMC's Greenplum. Greenplum is owned by one of the tech world's most prominent companies, EMC. That spells trouble for Hadoop: EMC is the overwhelming leader in data storage, and it owns VMware, the virtualization king. I believe Greenplum has the plan and the ambition to take that spot away from Hadoop. They're perfectly positioned to launch a massive salvo.