If there was any doubt that the Apache Hadoop platform has captured the hearts and minds of big data believers everywhere, the recent Hadoop Summit in San Jose on June 26 and 27, 2013, may have settled the question once and for all.
What I took away from the event is that Hadoop is not only alive and well but also moving forward at light speed. I witnessed the energy myself: the new Hadoop 2.0 technology is advancing by leaps and bounds, the quality of use cases is extraordinary, and the platform’s vendor ecosystem is innovating at a rapid pace.
This was the sixth annual Hadoop Summit (in addition to a separate summit held in Amsterdam in early 2013), and it attracted over 2,500 attendees, up from just 200 at the first event. The conference was co-hosted by Hortonworks, Inc. and Yahoo.
The event was all that a technical conference should be: a great location in Silicon Valley, an accommodating venue in the San Jose Convention Center, sizzling hot technology, an abundance of enthusiastic techies, great geek perks like the MapR hats and bean bag giveaways, an excellent breadth of technical sessions, probing keynote talks, and a rocking party at the Tech Museum. The excitement reminded me fondly of the early JavaOne conferences held at San Francisco’s Moscone Center.
My favorite technical session was “Enabling R on Hadoop,” which discussed the RHadoop technology for optimizing R workloads for use on Hadoop. This directly relates to my work as a data scientist, so I’m quite excited about the direction both R and Hadoop are going.
The Hadoop Summit was all about extending Hadoop's disruptive data processing framework with new elements that build on an already wildly successful technology stack. Collectively, this next-generation platform is called Hadoop 2.0. Here’s a short list of new technical directions highlighted at the summit:
Ambari -- aimed at making the lives of Hadoop operators, users, and integrators simpler by providing a management interface for deploying, configuring, and managing large Hadoop and HBase clusters.
Giraph -- performs offline, batch processing of very large graph datasets on top of a Hadoop cluster.
Tez -- the next generation query processing framework for Hadoop written on top of YARN.
Falcon -- a new data processing and management platform for Hadoop.
Use cases abound
The second-day keynote session featured a number of compelling use case examples of the Hadoop architecture. For example, cycling champion Sky Christopherson described how the US Women’s Cycling Team used a Hadoop-based big data implementation to help win a silver medal at the 2012 London Olympics by integrating, analyzing, and visualizing sensor and device data critical to making informed training decisions.
Other compelling use cases highlighted at the summit included Hadoop for high-performance climate analytics at the NASA Center for Climate Simulation (NCCS), Hadoop-enabled analytics for law enforcement and national security, online dating site eHarmony using Hadoop to make love connections, Hadoop in healthcare for storing and managing vital-sign data, and much more.
The Hadoop ecosystem
I really enjoyed the panel discussion on the second day of the show, where a group of Hadoop vendors discussed how they were helping push the Hadoop ecosystem forward. Led by Teradata, Microsoft, and many others, application vendors are waking up to the reality that their applications must run on Hadoop.
Already, it seems everyone is building a reference architecture that incorporates Hadoop and HDP to leverage all the goodness they already provide around data lifecycle management, data governance, security, etc. Meanwhile the Hadoop community is doing everything it can to foster adoption by the ISVs. As a member of the press, I was bombarded by hyperactive PR folks reminiscent of the COMDEX days years ago.
Here is a short list of important announcements from the more than 60 big data market players at the show:
Continuuity -- development tools now support batch processing
Datameer -- debuts new release of big data analytics software
Hortonworks -- offers community review of HDP 2.0 with YARN
Kognitio -- introduces version 8 of its analytic platform
WANdisco -- unveils new Hadoop distribution, high availability software
Zettaset -- demonstrates support for latest Cloudera, Hortonworks platforms
The State of the Elephant
One overriding theme of the conference was how Hadoop is “crossing the chasm” as it is now enjoying mainstream adoption with the emergence of vertical solutions.
To share some of the excitement, you can listen to the Day 1 and Day 2 opening remarks to get a pulse of the conference for yourself.
Re: The growing elephant A further breakdown of the types of data mining should appear in the wider public consciousness in the next few years. At the moment, Hadoop is often (wrongly) used to describe any kind of quick or huge data processing, whether it is correlative, causation-based, real-time, or something else.
Obviously that's not the way it's seen inside the data teams, but hopefully the understanding of broader churches will start to spread into the way the media and others discuss these areas.
User Rank: Exabyte Executive 7/4/2013 | 8:39:07 AM
Re: The growing elephant Nice article. With Hadoop 2.0 on the horizon, the data reservoir will continue to grow as more types of data become available for analysis. If it lives up to its promises, a lot of data will suddenly be more accessible within the native Hadoop platform, which will greatly streamline and speed up the task of finding useful information.
User Rank: Exabyte Executive 7/3/2013 | 9:58:38 PM
Re: The growing elephant Nice article. I do believe that Hadoop is alive and well. In fact, I'm pretty bullish on the idea that any company that is collecting data will eventually use this system. Data science is going to be big; I just hope that we are educating enough data scientists in university settings. I've heard that the government is already using it, so I'm sure there are a ton of PhD candidates out there working on theses that use it.
User Rank: Petabyte Pathfinder 7/3/2013 | 8:58:06 PM
Re: The growing elephant @Daniel, thanks for a great article. Hadoop is getting much attention these days; it's the de facto processing infrastructure for today's big data and is designed to solve big data problems. But there is still a great deal of confusion about its strengths and weaknesses: it's built on a foundation that severely limits its ability to act as an analytic database.
Re: The growing elephant I fully expect interest in Hadoop, and therefore attendance at the Summit, to continue to grow. This is one of the hottest areas of big data and it was exciting to witness.
As for the breadth of attendees, the conference wisely published a smartphone app for navigating the show, and each attendee was included in its database along with their affiliation. I went through it carefully and found an impressive list of companies from corporate America and around the world. A lot of big firms are behind this technology.