This second development is best summarised by Hortonworks' announcement of the Stinger Initiative last week, which claims it will make Hive 100 times faster by introducing and utilising new core Hadoop technologies. It could bring near-BigQuery-style performance to a wide range of Hadoop users. That would enable ad hoc, interactive querying of large datasets, something that currently requires sophisticated, large, traditional data warehouse setups. Hadoop clusters would then address a highly significant use-case in many companies and potentially become dual use.
The core parts of the Stinger roadmap are the adoption of MapReduce 2.0 in the form of Apache YARN, columnar data storage, and Apache Tez. YARN splits the current rigid JobTracker into a ResourceManager and per-application ApplicationMasters. Together they can manage resources like CPU, RAM, and network capacity flexibly across a cluster and across many applications running in parallel. For example, a CPU-intensive and a memory-intensive application could run in parallel on the cluster, utilising each node's CPU and memory optimally instead of relying on the current rigid, suboptimal map-and-reduce slot framework.
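To make the slot problem concrete, here is a toy sketch in Python. It is not YARN's scheduler or API; the node size, the task sizes, and the rule that a fixed slot must be sized for the most demanding task are all assumptions I picked for illustration. It only shows why per-task resource requests can pack a node more tightly than one-size-fits-all slots.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    cores: int    # CPU cores the task asks for (hypothetical numbers)
    mem_gb: int   # memory the task asks for, in GB (hypothetical numbers)


def concurrent_with_slots(tasks, node_cores, node_mem_gb):
    """Fixed-slot model: every slot is sized for the worst-case task."""
    slot_cores = max(t.cores for t in tasks)
    slot_mem = max(t.mem_gb for t in tasks)
    slots = min(node_cores // slot_cores, node_mem_gb // slot_mem)
    return min(slots, len(tasks))


def concurrent_with_containers(tasks, node_cores, node_mem_gb):
    """Resource-request model: place each task if its own request still fits."""
    free_cores, free_mem = node_cores, node_mem_gb
    placed = 0
    for t in tasks:
        if t.cores <= free_cores and t.mem_gb <= free_mem:
            free_cores -= t.cores
            free_mem -= t.mem_gb
            placed += 1
    return placed


# Assumed workload: two CPU-heavy and two memory-heavy tasks on one node.
TASKS = [
    Task("cpu-heavy-1", cores=3, mem_gb=2),
    Task("cpu-heavy-2", cores=3, mem_gb=2),
    Task("mem-heavy-1", cores=1, mem_gb=14),
    Task("mem-heavy-2", cores=1, mem_gb=14),
]
NODE_CORES, NODE_MEM_GB = 8, 32

print("slots:", concurrent_with_slots(TASKS, NODE_CORES, NODE_MEM_GB))            # slots: 2
print("containers:", concurrent_with_containers(TASKS, NODE_CORES, NODE_MEM_GB))  # containers: 4
```

With these made-up numbers the slot model runs two tasks at a time, while the request-based model fits all four on the same node, because the CPU-heavy and memory-heavy tasks complement each other.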
An interesting development is the release of Corona by Facebook as an alternative to YARN. Corona is proven to work reliably at the upper end of scalability needs. YARN is API-compatible with current MapReduce jobs and is already being delivered with Cloudera's latest distribution (currently for testing purposes, not production). It remains to be seen whether YARN or Corona will take over the market or whether they will split it between them.
Apache Tez highly optimises data processing applications, e.g., the output of a Hive query or Pig program, by replacing the current paradigm of modelling every application as a directed acyclic graph of multiple MapReduce jobs. Currently, each job requires IO synchronisation as well as time and resources to be started and managed. Tez can express these applications as a single job with a directed acyclic graph of map and reduce tasks in arbitrary order; for example, a reduce task can feed directly into another reduce task without a (dummy) intermediate map task, IO synchronisation, or the overhead of a new job. Lastly, ORCFile, proposed by Stinger as a columnar storage format, can dramatically speed up an application and reduce the amount of data it needs to access.
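The columnar point can be illustrated with a small in-memory sketch. This is not the ORCFile format or any Hive API; the records and field names are invented, and "values read" is just a stand-in for IO. The idea is only that a query touching one column reads far less from a column-oriented layout than from a row-oriented one.

```python
# Hypothetical records; the field names are made up for this example.
ROWS = [
    {"user": "a", "country": "UK", "bytes": 120},
    {"user": "b", "country": "DE", "bytes": 300},
    {"user": "c", "country": "UK", "bytes": 80},
    {"user": "d", "country": "FR", "bytes": 500},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per field."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def sum_from_rows(rows, field):
    """Row layout: every field of every row is read to get at one field."""
    total, values_read = 0, 0
    for row in rows:
        values_read += len(row)
        total += row[field]
    return total, values_read


def sum_from_columns(columns, field):
    """Column layout: only the requested column is read."""
    column = columns[field]
    return sum(column), len(column)


COLUMNS = to_columns(ROWS)
print(sum_from_rows(ROWS, "bytes"))        # (1000, 12)
print(sum_from_columns(COLUMNS, "bytes"))  # (1000, 4)
```

Both layouts give the same answer, but the column layout touches a third of the values here. The gap grows with the number of columns a table has and the fewer of them a query needs.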
Figure: Tez compared to traditional MapReduce
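The same comparison can be sketched in a few lines of Python. This is a simplified model, not the Tez API: I assume a linear pipeline (a degenerate DAG), and I assume that a classic MapReduce job is exactly one map stage followed by at most one reduce stage. The function and stage names are hypothetical.

```python
def plan_as_mapreduce_jobs(stages):
    """Split a pipeline of 'map'/'reduce' stages into classic MapReduce jobs.

    A reduce that does not follow a map in the same job gets a dummy
    'identity-map' in front of it, because each job must start with a map.
    """
    jobs, current = [], []
    for stage in stages:
        if stage == "map":
            if current:  # a job holds only one map stage in this model
                jobs.append(current)
            current = ["map"]
        elif stage == "reduce":
            if not current:
                current = ["identity-map"]
            current.append("reduce")
            jobs.append(current)
            current = []
        else:
            raise ValueError(f"unknown stage: {stage!r}")
    if current:
        jobs.append(current)
    return jobs


def plan_as_single_dag(stages):
    """Tez-style model: the whole pipeline is one job, stages chained directly."""
    return [list(stages)]


# Assumed query shape: one map followed by three chained reduces.
PIPELINE = ["map", "reduce", "reduce", "reduce"]

mr_jobs = plan_as_mapreduce_jobs(PIPELINE)
dag_jobs = plan_as_single_dag(PIPELINE)
dummy_maps = sum(job.count("identity-map") for job in mr_jobs)

print(len(mr_jobs), "MapReduce jobs,", dummy_maps, "dummy maps")  # 3 MapReduce jobs, 2 dummy maps
print(len(dag_jobs), "DAG job")                                   # 1 DAG job
```

In this model every job boundary stands for a job start-up plus a round of IO synchronisation, so going from three jobs to one is where the saving would come from.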
Together these improvements will change the way we perceive and use Hadoop. YARN (or Corona) as a cluster management framework will open Hadoop to many new data processing paradigms. The query acceleration in Hive and Pig makes Hadoop an option for interactive data processing and warehousing, and lowers costs on existing use-cases.
These changes will become part of production systems in the next couple of years, so I am expecting some interesting developments to spin off along the way. At the same time I am not discounting Impala and Drill. They are more focused in their applications and promise even greater performance gains. Hadoop is here to stay, and there are exciting times ahead.
Re: Real time a real option Yes (http://www.forbes.com/sites/bwoo/2013/02/28/311/) indeed, it basically makes big data accessible to a lot of people with a common skill: SQL. That is what makes Hive so attractive. However, in big companies where people are trained on standards and vendors rule the land, things have to be fully standards-compatible (and it sometimes makes sense to be compatible with existing code/queries). So we are seeing a big push this year by the big boys towards SQL standards and fast/interactive querying on top of Hadoop.
Re: Real time a real option Certainly, full SQL support (on Hive) is not yet there. On the other hand, Hive gives you awesome power through the custom map/reduce scripts you can inject.
Larger companies sometimes use Teradata to join data from traditional stores like RDBMS with Hadoop in an SQL-like fashion, and add Tableau or similar to visualise and explore the data. Hadoop is nowhere near replacing such a comprehensive toolset. We can, however, see that as it matures it eats vertically and horizontally into other products' markets. At the data warehouse core it might become an alternative for some soon, maybe in combination with a lightweight RDBMS for fast access to the final aggregated data.
Re: Real time a real option It is not an out-of-the-box solution and won't be for a while. A lot of know-how is required internally, and it competes (on the data warehouse side) with feature-rich, mature products. So nothing for the faint-hearted, but an opportunity for daring, well-run small companies? You can, however, throw money at the problem and get support and consulting services from distribution suppliers. Time will tell if it is a viable option. It is early days.
Re: Real time a real option Real time might be a bit too ambitious with Pig and Hive. It certainly makes data warehousing with interactive querying and fast, scalable processing using Hive/Pig/Impala/Drill/Hue/Beeswax a realistic option. I think getting a 2-in-1 solution with Hadoop (MR and cluster resource management with YARN, as well as a viable data warehousing interface) makes it very interesting indeed. Not only are the solutions tightly integrated, it also means 'only' one platform, filesystem and HA/SLA challenge to worry about. So the investment in hardware can be utilised in two or three ways. Not bad.