Whether you are using Hadoop yourself or planning to have someone on your team use it, you should know how it works. Try this exercise to run your very first MapReduce job.
What is Hadoop?
First, let's look at the basics. Hadoop is an open-source framework for processing large datasets using a simple programming model. The key is that it lets you process this data in a clustered environment built from commodity hardware. You can start with a single machine during development and add machines without increasing complexity. This is referred to as the ability to scale out. Hadoop uses concepts from functional programming (another hot phrase) to handle large datasets. Finally, it has a built-in model for handling failures within the cluster, so the data is processed even if a machine fails.
Why Hadoop?
The simple answer is it provides a method to process large amounts of data using a clustered and scalable architecture. It provides a means to process data when the amount, complexity, or unstructured nature of the data overwhelms conventional methods. This is because it was built from the ground up to work in a clustered environment; as the amount of data grows, computing power can be added simply through configuration.
Finally, though other models address these same challenges, Hadoop has become a standard with commercial backing, a growing knowledge base, good documentation, and a growing talent pool. Put together, these reasons make Hadoop a viable alternative, and they are why it is being used with ever increasing frequency.
Components of Hadoop
To understand Hadoop, you need to understand the basic building blocks of the framework. The first component, Hadoop Common, contains (as the name implies) common functionality shared by the other components. In addition, Hadoop has its own file system, HDFS, which is designed to be fault tolerant and is the cornerstone that lets Hadoop run on commodity hardware.
The final component of the Hadoop system is MapReduce. It implements the model that allows data to be processed in parallel: a map step turns each input record into key/value pairs, the framework groups the pairs by key, and a reduce step combines the values for each key. In a future post, we will implement our own code in Java for this layer. Note that the upcoming 2.x release contains a second-generation implementation of MapReduce that runs on a new resource-management layer called YARN (but I won't go into that here, as it is still an Alpha version at the time of this writing).
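To make the model concrete, here is a minimal sketch of a word count written as separate map, group, and reduce steps in plain Java. It does not use the Hadoop API and is not the code behind Hadoop's bundled example; the class name WordCountSketch and its method names are hypothetical, chosen only for illustration, and everything runs in a single JVM rather than on a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of the MapReduce model; no Hadoop classes are used.
class WordCountSketch {

    // Map step: turn one line of text into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Group step: collect all values that share a key.
    // In a real cluster the framework does this between map and reduce.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: combine the values for one key into a single total.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Run the three steps in order over a list of input lines.
    static Map<String, Integer> countWords(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : group(pairs).entrySet()) {
            totals.put(entry.getKey(), reduce(entry.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("the quick brown fox", "the lazy dog");
        // prints {brown=1, dog=1, fox=1, lazy=1, quick=1, the=2}
        System.out.println(countWords(lines));
    }
}
```

Because each call to map looks at only one line and each call to reduce looks at only one key, the calls are independent of one another. That independence is what lets a framework like Hadoop spread them across many machines.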
Try it out
Now that we have set the scene for using Hadoop and quickly covered a few introductory concepts, it is time to get the software. The Hadoop stable release can be found here. At the time of this writing, this will download version 1.0.4.
Once it is extracted, you will need to go to the conf directory and set JAVA_HOME in hadoop-env.sh. From the base directory where Hadoop is installed, create a directory called input, and copy some text files into it. (I used a text version of my resume.) Then run the following:
bin/hadoop jar hadoop-examples-1.0.4.jar wordcount input output
This will run the MapReduce job and create the output directory containing an output file with the result. Running in this mode, we are not using HDFS, but rather the local file system. You will get a chance to explore HDFS soon enough.
Congratulations. You just ran your first MapReduce job in standalone mode.
MDMConsult,
User Rank: Exabyte Executive 11/30/2012 | 12:41:35 PM
Re: A great first step... Cultivating a data-driven culture is what is important today. Management should be able to evaluate current and potential Big Data approaches like Hadoop for consideration. IT and management should work with the entire organization to develop a plan and business strategy.
smkinoshita,
User Rank: Exabyte Executive 11/29/2012 | 2:00:37 PM
Re: A great first step... @Saul -- Love the anecdote about the open-source gaming networks and BBS. We're re-entering those pioneering days again, aren't we?
I think it's important for more than just the I.T. department to understand Hadoop; the business side needs to have a basic understanding as well, if only to understand the limitations and possibilities involved.
legalcio,
User Rank: Exabyte Executive 11/29/2012 | 11:53:12 AM
Re: A great first step... Saul, maybe not better support, because individual members can be wrong, but certainly more enthusiastic, creative, and probably more timely support comes from user communities. This blog is the best I've read about removing the veil from Hadoop. Hadoop for Dummies next?
brianeno,
User Rank: Blogger 11/29/2012 | 10:37:52 AM
Re: Basics, good move Look forward to hearing your results. Hadoop is a little touchy under Windows using Cygwin, but I haven't had any issues using a Mac or Linux. I imagine someone has built a Windows installation into their offering (like DataStax did with Cassandra), though I haven't tried any yet.
Saul Sherry,
User Rank: Blogger 11/29/2012 | 10:33:51 AM
Re: A great first step... Back in the days of open source gaming, the community spirit and help (in the early days of BBS bulletin boards) far outstripped that of the commercial orgs. Is there a case that open source environments actually offer better support? More people at the same coal face? I always find that sense of community to be massively beneficial (and many organizations are looking to leverage such community spirit for their commercial products to cut down on call centre costs).
brianeno,
User Rank: Blogger 11/29/2012 | 4:12:22 AM
Re: A great first step... That is an interesting question that is often asked by users or potential users of open source software. I think it is comparable. With open source there are two types of support you receive. The first is from the community itself, which in the case of popular frameworks like Hadoop is very good. The second is from companies who build their business model around these frameworks and charge for added functionality (e.g. administration tools) and for support. For Hadoop there is a group of companies like Cloudera, MapR and Hortonworks (there are many, many more, I just chose these from the list).
So in summary, yes, I do believe that the level of support is as good as can be found in other "closed source" environments.
pauls,
User Rank: Bit Player 11/28/2012 | 8:31:18 PM
Re: Basics, good move This is all good stuff... Free course materials and now a tutorial on Hadoop. I am looking forward to checking things out and giving this a try. I am looking forward to more posts like this one.
alvb1227,
User Rank: Petabyte Pathfinder 11/28/2012 | 8:29:53 PM
A great first step... Excellent article @Brian. As someone who is always looking to learn more about programming and data management, this type of straightforward information always seems to be hard to find and understand. I certainly look forward to reading more!
I am curious to hear your thoughts on the community approach that is taken when it comes to open-source platforms. Do you feel that the "support" (for lack of a better term) is better/worse/the same as standardized offerings, like Microsoft?
brianeno,
User Rank: Blogger 11/28/2012 | 1:40:05 PM
Re: Basics, good move Daniel, I am pleased you like this model of presenting Hadoop. There will be more, but it is not officially a "series". However, we do hope to introduce people to additional concepts around Hadoop.