Tuesday, December 23, 2014

Configuring Hive on Ubuntu


Hive facilitates querying and managing large datasets residing in distributed storage. It is built on top of Hadoop and defines a simple SQL-like query language, the Hive Query Language (HQL), which lets users familiar with SQL query the data. Hive converts HQL queries into a series of MapReduce jobs for execution on a Hadoop cluster. In this post we will configure Hive on an Ubuntu machine.

Download Hive from the Apache Hive site. Unpack the .tar to the location of your choice and assign ownership to the user setting up Hive. At the time of this writing, the latest version available is 0.14.0.

Prerequisites:
Java: 1.6 or higher; 1.7 is preferred.
Hadoop: 2.x. For Hadoop installation you can refer to this post.

Installation

Set the environment variable HIVE_HOME to point to the installation directory. You can set this in your .bashrc
export HIVE_HOME=/user/hive

Finally, add $HIVE_HOME/bin to your PATH.
$ export PATH=$HIVE_HOME/bin:$PATH
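To keep these settings across terminal sessions without piling up duplicate lines in .bashrc, something like the following sketch could be used. The helper name append_once is hypothetical (it is not part of Hive), and /user/hive is the installation directory assumed in this post.

```shell
# append_once FILE LINE: add LINE to FILE only if an identical line is not
# already there. Hypothetical helper, shown for illustration only.
append_once() {
    file="$1"
    line="$2"
    touch "$file"
    # -x matches whole lines only, -F treats the pattern as a fixed string
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example calls (left commented out so the sketch never edits a real file
# unless you ask it to):
# append_once "$HOME/.bashrc" 'export HIVE_HOME=/user/hive'
# append_once "$HOME/.bashrc" 'export PATH=$HIVE_HOME/bin:$PATH'
```

Running the two calls more than once leaves a single copy of each line in the file. Open a new terminal or run source ~/.bashrc for the change to take effect.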

Setting HADOOP_HOME in hive-config.sh
Append the following line to the file $HIVE_HOME/bin/hive-config.sh.
export HADOOP_HOME=/user/hadoop


Running Hive
You must create /tmp and /user/hive/warehouse and set appropriate permissions before you can create any table in hive.
$ hadoop fs -mkdir -p /user/hive/warehouse
$ hadoop fs -chmod g+w /user/hive/warehouse
$ hadoop fs -mkdir -p /tmp
$ hadoop fs -chmod g+w /tmp
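Before starting Hive, it can be worth confirming that both directories really exist in HDFS. The sketch below is one way to do that; the function name check_hive_dirs is hypothetical, and it assumes the hadoop command is on your PATH and HDFS is running.

```shell
# check_hive_dirs: report whether each HDFS directory Hive needs exists.
# `hadoop fs -test -d PATH` exits with status 0 when PATH is a directory.
check_hive_dirs() {
    status=0
    for dir in /tmp /user/hive/warehouse; do
        if hadoop fs -test -d "$dir"; then
            echo "ok: $dir"
        else
            echo "missing: $dir"
            status=1
        fi
    done
    return "$status"
}

# Run the check only when the hadoop command is available.
if command -v hadoop >/dev/null 2>&1; then
    check_hive_dirs || echo "create the missing directories before starting Hive"
fi
```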

Start the hive shell
$ hive

The shell should look something like this:
Logging initialized using configuration in jar:file:/user/hive/lib/hive-common-0.14.0.jar!/hive-log4j.properties
hive>
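Once the prompt appears, a short non-interactive smoke test can confirm that Hive is able to create and list tables. The sketch below writes a few HQL statements to a scratch file and runs them with hive -f. The table name demo_pokes is a made-up example, and the table is dropped again at the end.

```shell
# Write a few HQL statements to a scratch file.
hql_file="$(mktemp /tmp/hive_smoke.XXXXXX)"
cat > "$hql_file" <<'EOF'
CREATE TABLE IF NOT EXISTS demo_pokes (foo INT, bar STRING);
SHOW TABLES;
DESCRIBE demo_pokes;
DROP TABLE IF EXISTS demo_pokes;
EOF

# hive -f FILE executes the statements in FILE and then exits.
if command -v hive >/dev/null 2>&1; then
    hive -f "$hql_file"
else
    echo "hive not found on PATH; skipping the smoke test" >&2
fi
```

If the HDFS directories from the previous step are missing or not writable, the CREATE TABLE statement is the one most likely to fail.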

Reference : https://cwiki.apache.org/confluence/display/Hive/Home

Tuesday, December 16, 2014

Configuring Hadoop on Ubuntu in pseudo-distributed mode


Hadoop is an open-source Apache project that enables processing of extremely large datasets in a distributed computing environment. There are three different modes in which it can be run:

1. Standalone Mode
2. Pseudo-Distributed Mode
3. Fully-Distributed Mode

This post covers setting up Hadoop 2.5.2 in pseudo-distributed mode on an Ubuntu machine. For setting up Hadoop on OS X, refer to this post.

Prerequisites


Java: Install Java if it isn’t installed on your system.
Keyless SSH: First, ensure ssh is installed. Then generate a key pair with an empty passphrase and authorize it for logins to your own account.
$ sudo apt-get install ssh
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Now ssh into localhost once and accept the host key when prompted.
rsync utility:
$ sudo apt-get install rsync
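The Hadoop start scripts log in over ssh, so it helps to confirm that the login really works without a prompt. A minimal sketch, with a hypothetical function name check_passwordless_ssh:

```shell
# check_passwordless_ssh HOST: succeed only if ssh can log in without prompting.
# BatchMode=yes makes ssh fail instead of asking for a password or passphrase;
# ConnectTimeout=5 keeps the check from hanging on an unreachable host.
check_passwordless_ssh() {
    host="${1:-localhost}"
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
        echo "OK: passwordless ssh to $host works"
    else
        echo "FAILED: ssh to $host needs a password or is unreachable"
        return 1
    fi
}

# Example call:
# check_passwordless_ssh localhost
```

Because BatchMode also refuses to ask about unknown host keys, run this only after you have logged in to localhost manually once.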

Installation


Download Hadoop from the Apache Hadoop site. Unpack the .tar to the location of your choice and assign ownership to the user setting up Hadoop. At the time of this writing, the latest version available is 2.5.2.

Configuration


Every component of Hadoop is configured using an XML file located in hadoop-2.5.2/etc/hadoop. MapReduce properties go in mapred-site.xml, HDFS properties in hdfs-site.xml, and common properties in core-site.xml. The general Hadoop environment properties are found in hadoop-env.sh.

hadoop-env.sh
# set to the root of your Java installation
export JAVA_HOME=/usr

# Assuming your installation directory is /user/hadoop
export HADOOP_PREFIX=/user/hadoop
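For the rest of this post, $HADOOP_HOME refers to this installation directory, /user/hadoop. If the variable is not already set in your shell, export it the same way before running the commands below.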
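If you would rather point JAVA_HOME at the actual Java installation directory than at /usr, a sketch like the following can derive it from the java binary on your PATH. The function name java_home_from_binary is hypothetical, and the approach assumes GNU readlink, which Ubuntu ships.

```shell
# java_home_from_binary PATH_TO_JAVA: print the directory two levels above
# the real java binary, for example /opt/jdk/bin/java -> /opt/jdk.
# readlink -f resolves chains of symlinks down to the real file.
java_home_from_binary() {
    real_java="$(readlink -f "$1")"
    dirname "$(dirname "$real_java")"
}

# Print a line that could be pasted into hadoop-env.sh.
if command -v java >/dev/null 2>&1; then
    echo "export JAVA_HOME=$(java_home_from_binary "$(command -v java)")"
fi
```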

core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

hdfs-site.xml

The Hadoop Distributed File System properties go in this config file. Since we are only setting up one node, we set the value of dfs.replication to 1.
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
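To check that Hadoop actually picks up the values from these two files, the effective configuration can be queried with hdfs getconf. This is a small sketch with a hypothetical function name show_conf_keys; it assumes the hdfs command ($HADOOP_HOME/bin/hdfs) is on your PATH.

```shell
# show_conf_keys: print key=value for the two properties set in this post.
# `hdfs getconf -confKey NAME` prints the effective value of one property.
show_conf_keys() {
    for key in fs.defaultFS dfs.replication; do
        echo "$key=$(hdfs getconf -confKey "$key")"
    done
}

# Run the check only when the hdfs command is available.
if command -v hdfs >/dev/null 2>&1; then
    show_conf_keys
fi
```

With the files above in place, the two lines should show hdfs://localhost:9000 and 1 respectively.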


Execution


Before starting the daemons we must format the newly installed HDFS.
$ cd $HADOOP_HOME
$ bin/hdfs namenode -format

Start the Daemons:
$ cd $HADOOP_HOME
$ sbin/start-dfs.sh

Monitoring
By default, the web interface for NameNode is available at http://localhost:50070

Check the output of jps to confirm that the three HDFS daemons are running:
$ jps
10582 SecondaryNameNode
10260 NameNode
10685 Jps
10404 DataNode
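The same check can be scripted. The sketch below reads jps output and prints the name of every expected HDFS daemon that is missing; the function name missing_daemons is hypothetical.

```shell
# missing_daemons: read jps output on stdin and print each expected HDFS
# daemon that does not appear in it. Prints nothing when all are running.
missing_daemons() {
    jps_output="$(cat)"
    for daemon in NameNode DataNode SecondaryNameNode; do
        # Compare against the whole second column so that "NameNode"
        # does not accidentally match "SecondaryNameNode".
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            echo "$daemon"
        fi
    done
}

# Run against the live process list when jps is available.
if command -v jps >/dev/null 2>&1; then
    jps | missing_daemons
fi
```

If a daemon is reported as missing, the log files under $HADOOP_HOME/logs are the usual place to look for the reason.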

Running Examples
1. Create the HDFS directories required to execute MapReduce jobs:
$ cd $HADOOP_HOME
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/<username>

2. Copy the input files to the Hadoop Distributed File System
$ bin/hdfs dfs -put etc/hadoop input

3. Run the example provided
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar grep input output 'dfs[a-z.]+'

4. View the output files on HDFS
$ bin/hdfs dfs -cat output/*
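To get a feel for what the example job computes, the same regular expression can be applied locally with ordinary command-line tools. This is only a rough local approximation, not how the MapReduce job is implemented, and the function name local_grep_counts is hypothetical.

```shell
# local_grep_counts DIR: count every match of dfs[a-z.]+ in the files
# directly under DIR and list the matches, most frequent first.
local_grep_counts() {
    # -o prints only the matched text, -h omits file names,
    # -s suppresses messages about unreadable entries
    grep -ohsE 'dfs[a-z.]+' "$1"/* |
        sort | uniq -c | sort -rn |
        awk '{ print $1, $2 }'
}

# Run against the same configuration directory that was uploaded to HDFS.
if [ -d etc/hadoop ]; then
    local_grep_counts etc/hadoop
fi
```

Run from $HADOOP_HOME, the counts should be broadly comparable to what the job writes to the output directory on HDFS.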

Stop the Daemons:
$ cd $HADOOP_HOME
$ sbin/stop-dfs.sh

Reference : http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation

Friday, December 12, 2014

Git Basics - A cheat sheet for your daily git needs.

This post is for anyone to refer to for their daily git needs. We will not be covering any advanced git concepts here.


Git is a distributed version control system.
Some basic terminology:
Directory: A folder that contains multiple files.
Repository: A directory where Git has been initialized to start version controlling your files.

I have created an empty directory called gitBasics on my machine.
$ ls -a
.  ..

Let us initialize an empty git repository.
$ git init
Initialized empty Git repository in /Users/anjana/gitBasics/.git/
$ ls -a
.    ..   .git
As seen above, a hidden .git directory is created inside gitBasics, indicating that a repository has been initialized.

Next, let's see the current status of the directory as compared to the repository.
$ git status
On branch master

Initial commit

nothing to commit (create/copy files and use "git add" to track)

Now let's create a file filename.txt in the directory.
$ ls -a
.            ..           .git         filename.txt

Let's check the status again.
$ git status
On branch master

Initial commit

Untracked files:
  (use "git add <file>..." to include in what will be committed)

 filename.txt

nothing added to commit but untracked files present (use "git add" to track)
git shows that an untracked file is present.

Let's add this file to the staging area.
$ git add filename.txt
$ git status
On branch master

Initial commit

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

 new file:   filename.txt

Next, let's commit these changes.
$ git commit -m"Adding test file"
[master (root-commit) 4b8b52d] Adding test file
 Committer: Shankar 
Your name and email address were configured automatically based
on your username and hostname. Please check that they are accurate.
You can suppress this message by setting them explicitly:

    git config --global user.name "Your Name"
    git config --global user.email you@example.com

After doing this, you may fix the identity used for this commit with:

    git commit --amend --reset-author

 1 file changed, 1 insertion(+)
 create mode 100644 filename.txt

At commit time, Git records the author of the commit. To set the name and email it should use, run the following commands.
$ git config --global user.name "Anjana Shankar"
$ git config --global user.email "***@g***.com"

Now when you run git status, it says that the working directory is clean and there is nothing to commit.
$ git status
On branch master
nothing to commit, working directory clean
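The steps so far (init, add, commit, status) can be replayed in one go in a throwaway directory. This is a sketch under a few assumptions: the scratch directory name and the identity "Your Name" / you@example.com are placeholders, and git -C is used so the script never changes your current directory.

```shell
# Work in a scratch directory so no existing repository is touched.
repo="$(mktemp -d /tmp/gitBasics.XXXXXX)"

git -C "$repo" init -q
echo "First File" > "$repo/filename.txt"
git -C "$repo" add filename.txt

# -c sets the identity for this one command only; the values are placeholders.
git -C "$repo" -c user.name="Your Name" -c user.email="you@example.com" \
    commit -q -m "Adding test file"

# --porcelain prints one line per changed file, so a clean working
# directory produces no output at all.
git -C "$repo" status --porcelain

# One line per commit: abbreviated hash and subject.
git -C "$repo" log --oneline
```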

Next we have the git log command. This command prints the history of the repository.
$ git log
commit 4b8b52d4071a04c7f98436aae959ab9b10fec2ec
Author: Shankar 
Date:   Thu Dec 11 22:24:28 2014 +0530

    Adding test file

Now let's add the remote origin to our local repo.
$ git remote add origin git@github.com:*****/gitBasics.git

After the remote is added, we can push our code to the remote repository. The -u option sets origin/master as the upstream of the local master branch, so later pushes and pulls can omit the arguments:
$ git push -u origin master

To pull from the remote branch, use the following command:
$ git pull origin master

In order to see the differences between the current and the last committed version of code, use the following:
$ git diff HEAD
diff --git a/filename.txt b/filename.txt
index c9e358c..411cdda 100644
--- a/filename.txt
+++ b/filename.txt
@@ -1 +1 @@
-First File
+First File Modified

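When nothing is staged, plain git diff shows the same output, since it compares the working directory against the staging area: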
$ git diff
diff --git a/filename.txt b/filename.txt
index c9e358c..411cdda 100644
--- a/filename.txt
+++ b/filename.txt
@@ -1 +1 @@
-First File
+First File Modified

Lines prefixed with '-' were removed and lines prefixed with '+' were added.

When we use the git add command, we stage the differences. Let's stage the differences first, and then see how to unstage and reverse our changes to get back to the last committed snapshot. I have created another file, 'filename2.txt', committed it, and then made some changes to it.
$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

 modified:   filename2.txt

no changes added to commit (use "git add" and/or "git commit -a")
$ git add filename2.txt 
$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

 modified:   filename2.txt

To see the staged differences, use the following:
$ git diff --staged
diff --git a/filename2.txt b/filename2.txt
index f686acc..5701cbe 100644
--- a/filename2.txt
+++ b/filename2.txt
@@ -1 +1 @@
-Second File
+Second File Modified

You can unstage the files as follows:
$ git reset filename2.txt
Unstaged changes after reset:
M filename2.txt
$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

 modified:   filename2.txt

no changes added to commit (use "git add" and/or "git commit -a")

After unstaging, the changes in the working directory can be discarded as follows:
$ git checkout -- filename2.txt
$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working directory clean
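The stage, unstage, and discard cycle can be tried safely in a scratch repository. The sketch below uses the same commands as this post; the directory name and identity are placeholders. Newer Git releases (2.23 and later) also offer git restore --staged <file> and git restore <file> for the same two undo steps.

```shell
# Set up a scratch repository with one committed file.
repo="$(mktemp -d /tmp/gitUndo.XXXXXX)"
git -C "$repo" init -q
echo "Second File" > "$repo/filename2.txt"
git -C "$repo" add filename2.txt
git -C "$repo" -c user.name="Your Name" -c user.email="you@example.com" \
    commit -q -m "Adding second file"

# Modify the file and stage the change.
echo "Second File Modified" > "$repo/filename2.txt"
git -C "$repo" add filename2.txt

# Unstage: the modification stays in the working directory.
git -C "$repo" reset -q filename2.txt

# Discard: the working copy goes back to the last committed version.
git -C "$repo" checkout -- filename2.txt

# Shows the committed content again.
cat "$repo/filename2.txt"
```

Note that the discard step cannot be undone, because the modified content was never committed.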

Let's talk about branches now:
To create a new branch, use the following:
$ git branch newBranch

Use the following to switch branches
$ git checkout newBranch
Switched to branch 'newBranch'
$ git status
On branch newBranch
nothing to commit, working directory clean

I have modified 'filename2.txt' and pushed changes to this branch.
$ git log
commit 889ab1f0f42e7efd5818f68b30a42ced587db320
Author: Anjana Shankar <*****@gmail.com>
Date:   Fri Dec 12 10:14:05 2014 +0530

    Modified file on the branch

Now let's merge this branch into master. First we have to switch back to master; once on master, we can merge the branch.
$ git checkout master
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.
$ git merge newBranch
Updating 7c4f3ad..889ab1f
Fast-forward
 filename2.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
$ git status
On branch master
Your branch is ahead of 'origin/master' by 1 commit.
  (use "git push" to publish your local commits)
nothing to commit, working directory clean
$ git push
Total 0 (delta 0), reused 0 (delta 0)
To git@github.com:****/gitBasics.git
   7c4f3ad..889ab1f  master -> master

Finally, since we are done with the branch, let's delete it both locally and on the remote.
$ git branch -d newBranch
Deleted branch newBranch (was 889ab1f).
$ git push origin --delete newBranch
To git@github.*****/gitBasics.git
 - [deleted]         newBranch
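The local part of this branch workflow (create, commit, merge, delete) can be rehearsed in a scratch repository without a remote. In this sketch the wrapper function g, the directory name, and the identity are all placeholders. It uses git checkout -b, which creates a branch and switches to it in one step.

```shell
repo="$(mktemp -d /tmp/gitBranch.XXXXXX)"

# Hypothetical wrapper: run git inside the scratch repository with a
# placeholder identity, so nothing in your own configuration is needed.
g() {
    git -C "$repo" -c user.name="Your Name" -c user.email="you@example.com" "$@"
}

g init -q
echo "Second File" > "$repo/filename2.txt"
g add filename2.txt
g commit -q -m "Adding second file"

# Remember the starting branch; its name depends on your Git configuration.
base="$(g rev-parse --abbrev-ref HEAD)"

# Create the branch, switch to it, and commit a change there.
g checkout -q -b newBranch
echo "Second File Modified" > "$repo/filename2.txt"
g commit -q -a -m "Modified file on the branch"

# Switch back and merge. The starting branch has no new commits of its own,
# so Git simply fast-forwards it.
g checkout -q "$base"
g merge -q newBranch

# The branch is fully merged, so -d deletes it without complaint.
g branch -d newBranch
```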

To see the remote branches available, use the following:
$ git branch -r
  origin/master

That's it for this post. I will try to cover a few advanced Git concepts in upcoming posts.
Reference : Pro Git book