We've shown you how to get Hadoop running on a single machine; now let's look at running Hadoop in a pseudo-distributed mode.
In our first lesson, Hadoop 101, we ran Hadoop in standalone mode, where there was no distributed file system, using a single Java VM (JVM). In pseudo-distributed mode, we can interact with Hadoop in a production-like setting while still installing on a single machine. In this mode, Hadoop processing is distributed over all of the cores or processors on the machine. More importantly for learning Hadoop, this mode employs the Hadoop Distributed File System (HDFS). For now, we can think of HDFS as a file system contained within your native OS file system.
The final mode possible with Hadoop is "fully clustered," where Hadoop interacts across multiple machines. This is obviously where the real power of Hadoop comes to light and how most companies will run their Hadoop implementations. Standalone or pseudo-distributed modes are useful for development and test environments.
There are several configuration steps, but following them is not complicated, and I will supply the values at each step. To do this, we will modify a few files from the initial installation.
Note that these installation instructions pertain to setting up your Hadoop environment on Mac OS. Linux installation is very similar; for Windows, you may consider downloading a pre-configured VM, such as the one described on the Yahoo Developer Network, or installing Cygwin.
For this step, all of the files to be modified are in the conf directory of your Hadoop installation.
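The exact property values are not reproduced in this text. As a minimal sketch, a typical Hadoop 1.x pseudo-distributed setup edits the following three files in conf/, using the commonly cited single-node values from the Apache documentation; treat these as illustrative defaults (the ports and property values in your own install may differ):

conf/core-site.xml:
<configuration>
  <property>
    <!-- default file system URI: point Hadoop at HDFS on this machine -->
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

conf/hdfs-site.xml:
<configuration>
  <property>
    <!-- single machine, so keep only one copy of each block -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

conf/mapred-site.xml:
<configuration>
  <property>
    <!-- address the JobTracker listens on -->
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>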
First, if running on a Mac, you will make an additional change to the conf/hadoop-env.sh file by adding the following line (visit Stackoverflow.com if you need an explanation as to why):
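The line itself is not reproduced above. The commonly cited fix for the Mac-specific "Unable to load realm info from SCDynamicStore" warning (the realm issue discussed in the comments below) is to pass empty Kerberos realm settings to the JVM, along these lines:

# commonly cited workaround; your exact JVM options may differ
export HADOOP_OPTS="$HADOOP_OPTS -Djava.security.krb5.realm= -Djava.security.krb5.kdc="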
The final piece required is the ability to ssh to localhost without a passphrase. This is needed because our Hadoop daemons will use it for their JVMs to communicate with each other. Start by checking whether you can ssh to localhost without a passphrase:
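The check itself is not shown above; assuming the standard single-node setup, it is simply an ssh to your own machine:

ssh localhost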
If you are not prompted for a passphrase, you are fine and can skip the next step. Otherwise, you need to execute the following two commands.
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Note that all of these commands assume you are in your Hadoop base directory. Also, depending on your permission setup, you may need to run them using sudo.
Format the file system
Since this is the first time we are going to use HDFS, we need to format the file system. Do not worry: this won't affect any of your normal files; it only prepares our "internal" file system for use by Hadoop. This creates the HDFS storage directories and the initial version of the persistent data structures contained within it. Running the following command does this:
bin/hadoop namenode -format
You will see some output, and then your new file system is ready to use with Hadoop!
Starting Hadoop
Starting Hadoop with all of the required processes is simply a matter of running this command, which is supplied with your distribution.
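The script is not reproduced above; assuming a Hadoop 1.x distribution like the one used in these examples, the standard script that launches the HDFS and MapReduce daemons (NameNode, DataNode, JobTracker, TaskTracker, and so on) is:

bin/start-all.sh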
To test that Hadoop is running, we can use the administration consoles for the NameNode and the Job Tracker, found here:
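The URLs are not listed above; on a default single-node Hadoop 1.x install, the two web consoles are typically at:

http://localhost:50070/   (NameNode)
http://localhost:50030/   (JobTracker)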
MapReduce to test
Of course, we will want to run a MapReduce job to fully test our installation. This starts with copying files up to our newly formatted file system for use in our job. Note how we use "fs" in the Hadoop command to address the distributed file system; otherwise, many of the commands will look very familiar to Mac or Linux users. We will copy all the files from our conf directory to a directory called input.
bin/hadoop fs -put conf input
Getting a directory listing is done with this command:
bin/hadoop fs -ls input
Our sample MapReduce job will scan these various configuration files and use a regular expression to look for any string beginning with "dfs":
bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
You will notice this runs for some time. Normally Hadoop jobs run on very large data sets and not a group of small text files as we have here. Finally, when completed, we can check the output with the following command:
bin/hadoop fs -cat output/part-00000
If you like, you can copy the files from HDFS to your local file system, where they can be viewed, using this command:
bin/hadoop fs -get output output
That was a lot of setup and configuration! But now you have a fully operating pseudo-distributed Hadoop installation and a distributed file system to work with. Along the way we learned some basic ways to interact with HDFS, run a MapReduce job, and access a couple of web-based administration tools. Now the real fun can begin!
User Rank: Bit Player 3/6/2013 | 2:41:18 PM
Re: Great tutorial! It doesn't hurt, that's for sure. And for some things, this pseudo-distributed mode is more than sufficient. MR's method of distributing the job often reduces the computational complexity a lot, and therefore the job can get done with a lot less computing power.
User Rank: Bit Player 3/5/2013 | 6:46:39 AM
Great tutorial! This is a great tutorial with very practical uses. When developing a new MR program, implementing the whole thing on a single machine is a great way to debug and get many of the kinks ironed out. Thanks for sharing!
User Rank: Blogger 2/28/2013 | 2:29:14 PM
Re: A stepping stone? Hi Saul, pseudo-distributed is generally used in development and testing scenarios. It allows interacting with HDFS (and testing any peripheral processes related to that) as well as getting information using the web admin clients.
In production (and this is the whole reason Hadoop has been such a success), you will be running clustered on a set of commodity hardware. Pseudo-distributed is as near as one can get to that architecture, but on one machine.
User Rank: Blogger 2/28/2013 | 2:25:07 PM
Re: A stepping stone? I agree. The issue is more related to Java and Mac than to Java and Hadoop. That relationship was in a sort of upheaval, and this is one example of the extra configuration needed. I would not say the JRE has been a frequent source of problems, though.
Re: A stepping stone? great tutorial -- as a newbie to Hadoop I appreciate these walkthrus.
re: the realm issue in the conf file, it's related to the JRE (supposedly fixed in 1.7) but that concerns me in general...what's oracle's relationship with hadoop at this point? has the JRE been a frequent source of problems for hadoop? i'm curious because it sounds like that's a small hack for a basic problem that shouldn't exist.
User Rank: Blogger 2/28/2013 | 10:39:43 AM
A stepping stone? Great technical insight Brian. Would this Pseudo-Distributed mode only ever be used in a trial scenario, so you can get it to work as it would in situ? Or would it have some real-world use?