Tuesday, December 16, 2014

Configuring Hadoop on Ubuntu in pseudo-distributed mode


Hadoop is an open-source Apache project that enables processing of extremely large datasets in a distributed computing environment. There are three different modes in which it can be run:

1. Standalone Mode
2. Pseudo-Distributed Mode
3. Fully-Distributed Mode

This post covers setting up Hadoop 2.5.2 in pseudo-distributed mode on an Ubuntu machine. For setting up Hadoop on OS X, refer to this post.

Prerequisites


Java: Install Java if it isn’t installed on your system.
Keyless SSH: First, ensure ssh is installed. Then generate a key pair and append the public key to your authorized keys.
$ sudo apt-get install ssh
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Now ssh into your localhost and accept the host key, so that subsequent logins are passwordless (see the check below).
rsync utility:
$ sudo apt-get install rsync
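A quick sanity check, assuming Java is on your PATH and the key was set up correctly; the ssh login should not prompt for a password:
$ java -version
$ ssh localhost
$ exit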

Installation


Download Hadoop from the Apache Hadoop site. Unpack the tarball to a location of your choice and assign ownership to the user who will run Hadoop. At the time of this writing, the latest version available is 2.5.2.
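As a rough sketch, assuming version 2.5.2, an install location of /user/hadoop (used throughout this post), and the Apache archive as the download mirror; adjust the URL, path and user to your setup:
$ wget http://archive.apache.org/dist/hadoop/common/hadoop-2.5.2/hadoop-2.5.2.tar.gz
$ sudo mkdir -p /user
$ sudo tar -xzf hadoop-2.5.2.tar.gz -C /user
$ sudo mv /user/hadoop-2.5.2 /user/hadoop
$ sudo chown -R $USER /user/hadoop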

Configuration


Every component of Hadoop is configured through an XML file located in hadoop-2.5.2/etc/hadoop. MapReduce properties go in mapred-site.xml, HDFS properties in hdfs-site.xml, and common properties in core-site.xml. The general Hadoop environment settings are found in hadoop-env.sh.

hadoop-env.sh
# set to the root of your Java installation
export JAVA_HOME=/usr

# Assuming your installation directory is /user/hadoop
export HADOOP_PREFIX=/user/hadoop
For the rest of this post, $HADOOP_HOME refers to /user/hadoop.
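Since the commands later in this post use $HADOOP_HOME as a shell variable, it is convenient to export it as well, for example in ~/.bashrc; this is a shell convenience, not a Hadoop requirement:
$ export HADOOP_HOME=/user/hadoop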

core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
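Note that Hadoop's working data, including HDFS blocks and NameNode metadata, lives under hadoop.tmp.dir, which defaults to a directory inside /tmp and may not survive a reboot. Optionally, you can add a property pointing it at a more permanent location; the path below is only an example:
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/user/hadoop/tmp</value>
    </property>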

hdfs-site.xml

The Hadoop Distributed File System properties go in this config file. Since we are only setting up one node, we set the value of dfs.replication to 1.
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>


Execution


Before starting the daemons, we must format the newly installed HDFS.
$ cd $HADOOP_HOME
$ bin/hdfs namenode -format

Start the Daemons:
$ cd $HADOOP_HOME
$ sbin/start-dfs.sh

Monitoring
By default, the web interface for NameNode is available at http://localhost:50070
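You can also verify from the command line that the DataNode has registered with the NameNode and that HDFS reports usable capacity:
$ cd $HADOOP_HOME
$ bin/hdfs dfsadmin -report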

Check the output of jps:
$ jps
10582 SecondaryNameNode
10260 NameNode
10685 Jps
10404 DataNode
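
If any of these daemons is missing, check its log under $HADOOP_HOME/logs; the exact file name depends on your username and hostname, for example:
$ tail -n 50 $HADOOP_HOME/logs/hadoop-*-datanode-*.log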

Running Examples
1. Create the HDFS directories required to execute MapReduce jobs:
$ cd $HADOOP_HOME
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/<username>

2. Copy the input files to the Hadoop Distributed File System
$ bin/hdfs dfs -put etc/hadoop input
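Relative HDFS paths such as input resolve under /user/<username>, so you can verify the copy with:
$ bin/hdfs dfs -ls input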

3. Run the example provided
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar grep input output 'dfs[a-z.]+'

4. View the output files on HDFS
$ bin/hdfs dfs -cat output/*
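Alternatively, copy the output from HDFS to the local filesystem and view it there:
$ bin/hdfs dfs -get output output
$ cat output/*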

Stop the Daemons:
$ cd $HADOOP_HOME
$ sbin/stop-dfs.sh

Reference: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation
