Showing posts with label BigData. Show all posts
Showing posts with label BigData. Show all posts

Tuesday, December 23, 2014

Configuring Hive on Ubuntu


Hive facilitates querying and managing large datasets residing in distributed storage. It is built on top of Hadoop. Hive defines a simple query language called as Hive Query language (HQL) which enables users familiar with SQL to query the data. Hive converts your HQL (Hive Query Language) queries into a series of MapReduce jobs for execution on a Hadoop cluster. In this post we will configure Hive on our machine.

Download Hive from the Apache Hive site. Unpack the .tar to the location of your choice and assign ownership to the user setting up Hive. At the time of this writing, the latest version available is 0.14.0.

Prerequisites:
Java: 1.6 or higher. Preferred version would be 1.7
Hadoop: 2.x. For Hadoop installation you can refer to this post.

Installation

Set the environment variable HIVE_HOME to point to the installation directory. You can set this in your .bashrc
export HIVE_HOME=/user/hive

Finally, add $HIVE_HOME/bin to your PATH.
$export PATH=$HIVE_HOME/bin:$PATH

Setting HADOOP_PATH in HIVE config.sh
Append the following line to the file $HIVE_HOME/bin/config.sh.
export HADOOP_HOME=/user/hadoop


Running Hive
You must create /tmp and /user/hive/warehouse and set appropriate permissions before you can create any table in hive.
$ hadoop fs -mkdir /usr/hive/warehouse
$ hadoop fs -chmod g+w /usr/hive/warehouse
$ hadoop fs -mkdir /tmp
$ hadoop fs -chmod g+w /tmp

Start the hive shell
$ hive

The shell would look something like
Logging initialized using configuration in jar:file:/user/hive/lib/hive-common-0.14.0.jar!/hive-log4j.properties
hive >

Reference : https://cwiki.apache.org/confluence/display/Hive/Home

Tuesday, December 16, 2014

Configuring Hadoop on Ubuntu in pseudo-distributed mode


Hadoop is an open-source Apache project that enables processing of extremely large datasets in a distributed computing environment. There are three different modes in which it can be run:

1. Standalone Mode
2. Pseudo-Distributed Mode
3. Fully-Distributed Mode

This post covers setting up of Hadoop 2.5.1 in a Pseudo-distributed mode on an Ubuntu machine. For setting up hadoop on OSx, refer to this post .

Prerequisites


Java: Install Java if it isn’t installed on your system.
Keyless SSH : First, ensure ssh is installed. Then generate the key pairs.
$sudo apt-get install ssh
$ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Now ssh into your localhost and allow authorization.
rsync utility:
$sudo apt-get install rsync

Installation


Download Hadoop from the Apache Hadoop site. Unpack the .tar to the location of your choice and assign ownership to the user setting up Hadoop. At the time of this writing, the latest version available is 2.5.2.

Configuration


Every component of Hadoop is configured using an XML file specifically located in hadoop-2.5.2/etc/hadoop.MapReduce properties go in mapred-site.xml, HDFS properties in hdfs-site.xml and common properties in core-site.xml. The general Hadoop environment properties are found in hadoop-env.sh.

hadoop-env.sh
# set to the root of your Java installation
export JAVA_HOME=/usr

# Assuming your installation directory is /user/hadoop
export HADOOP_PREFIX=/user/hadoop
For the rest of this post, we refer to /user/hadoop when we say $HADOOP_HOME.

core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

hdfs-site.xml

The Hadoop Distributed File System properties go in this config file. Since we are only setting up one node, we set the value of dfs.replication to 1.
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>


Execution


Before starting the daemons we must format the newly installed HDFS.
$ cd $HADOOP_HOME
$ bin/hdfs namenode -format

Start the Daemons:
$ cd $HADOOP_HOME
$ sbin/start-dfs.sh

Monitoring
By default, the web interface for NameNode is available at http://localhost:50070

Check the output of jps
$jps
10582 SecondaryNameNode
10260 NameNode
10685 Jps
10404 DataNode

Running Examples
1. Create the HDFS directories required to execute MapReduce jobs:
$ cd $HADOOP_HOME
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/<username>

2. Copy the input files to the Hadoop Distributed File System
$ bin/hdfs dfs -put etc/hadoop input

3. Run the example provided
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar grep input output 'dfs[a-z.]+'

4. View the output files on HDFS
$ bin/hdfs dfs -cat output/*

Stop the Daemons:
$ cd $HADOOP_HOME
$ sbin/stop-dfs.sh

Reference : http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation

Friday, November 7, 2014

Configuring Hadoop on Mac OSx in pseudo-distributed cluster mode.


Hadoop is an open-source Apache project that enables processing of extremely large datasets in a distributed computing environment. There are three different modes in which it can be run:

1. Standalone Mode
2. Pseudo-Distributed Mode
3. Fully-Distributed Mode

This post covers setting up of Hadoop 2.5.1 in a Pseudo-distributed mode. A Pseudo-Distributed mode is one where each hadoop daemon runs as a separate java process.

Prerequisites


Java: Install Java if it isn’t installed on your mac.
Homebrew: Homebrew is a package manager for Mac. You can find the installation instructions here
Keyless SSH : First, ensure Remote Login under System Preferences -> Sharing is checked to enable SSH. Generate the key pairs.
$ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Now ssh into your localhost and allow authorization.

Installation


This is where Homebrew is used.
$brew install Hadoop
If you do not want to use homebrew or you want to install a specific version of Hadoop, you can download it from the Apache Hadoop. Unpack the .tar to the location of your choice and assign ownership to the user setting up Hadoop.

Configuration


Every component of Hadoop is configured using an XML file specifically located in /usr/local/Cellar/hadoop/2.5.1/libexec/etc/hadoop.MapReduce properties go in mapred-site.xml, HDFS properties in hdfs-site.xml and common properties in core-site.xml. The general Hadoop environment properties are found in hadoop-env.sh.

hadoop-env.sh

Replace the existing HADOOP_OPTS with following.
export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"
If Homebrew was not used to install Hadoop, kindly point the JAVA_HOME to your java installation.

core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

hdfs-site.xml

The Hadoop Distributed File System properties go in this config file. Since we are only setting up one node, we set the value of dfs.replication to 1.
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>


Execution

Before starting the daemons we must format the newly installed HDFS.
$ cd /usr/local/Cellar/hadoop/2.5.1/libexec/bin
$ hdfs namenode -format

Start the Daemons:
$ cd /usr/local/Cellar/hadoop/2.5.1/libexec/sbin
$ ./start-dfs.sh

Monitoring
Check the output of jps
$jps
10756 NameNode
1282 
10842 DataNode
11022 Jps
10951 SecondaryNameNode
1842 

Alternatively, the web interface for the NameNode can be browsed at http://localhost:50070

Running Examples
1. Create the HDFS directories required to execute MapReduce jobs:
$ cd /usr/local/Cellar/hadoop/2.5.1/libexec/bin
$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/<username>

2. Copy the input files to the Hadoop Distributed File System
$ hdfs dfs -put ../etc/hadoop input

3. Run the example provided
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1.jar grep input output 'dfs[a-z.]+'

4. View the output files on HDFS
$ hdfs dfs -cat output/*

Reference : http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation

Thursday, August 21, 2014

An Introduction to Zookeeper - Part I of the Zookeeper series

A distributed system consists of multiple computers that communicate through a computer network and interact with each other to achieve a common goal.Major benefits that distributed systems offer over centralized systems is scalability and redundancy. Systems can be easily expanded by adding more machines as needed, and even if one of the machines is unavailable, the service continues to be available. However, such a system comes with its own challenges.

Zookeeper is an open source, high-performance coordination service for distributed applications. It tackles the common challenges that distributed applications face. It exposes common services that are required in the distributed environment like naming, configuration management, group service and provides the solution to distributed synchronization.

Zookeeper can be run in either a single-server mode or cluster(replicated) mode. Running zookeeper in single-server mode does not take advantage of zookeeper’s inherent features of high availability and resilience. Typically in a production environment, Zookeeper is run in a multi-server mode. A zookeeper cluster is called as an ensemble. A leader is elected on service startup. If the leader goes down, a new leader is elected. Clients only connect to a single zookeeper server and maintain a TCP connection. A Client can read from any zookeeper server, however writes go through the leader and needs a majority consensus.

Zookeeper provides sequential consistency guarantee, i.e., updates are applied in the order in which they are sent. It guarantees atomic updates, i.e., the updates either succeed or fail, there are no partial updates. It guarantees that a zookeeper client sees the same view of the service irrespective of the server in the ensemble that it connects to. A server will not accept a connection from a client until it has caught up with the state of the server to which the client was connected previously. Zookeeper ensures reliability, i.e., if an update succeeds in Zookeeper, then it is not rolled back. Zookeeper guarantees timeliness, i.e., a client is bound to see system changes within a certain time bound.

Zookeeper is an eventually consistent system, i.e., it does not guarantee that different clients will have identical view of zookeeper data at every instance in time. But it guarantees that if a follower falls too far behind the leader, then it goes offline.

The distributed processes using Zookeeper coordinate with each other through shared hierarchical namespaces. These namespaces are organized like UNIX file system. More on Zookeeper in the next post.

Sunday, August 3, 2014

Configuration and Coordination with Zookeeper

It took me a while to understand the concept of Zookeeper and it took me another some to understand how to use it for the task that I had begun with. This post is intended to help others cross the bridge faster.
Dynamic Configuration Management for today's system comes with all the nitty-gritties that are involved with a distributed environment. A distributed environment is relatively unreliable, with problems like network failures, clock synchronization, and it becomes the responsibility of each server to keep track of correctness of current configuration. These problems are the motivation behind configuration and coordination systems.

This particular post looks at Zookeeper as a configuration management tool. A short and good course on Zookeeper

Setup: A zookeeper cluster with 5 servers. A java wrapper that uses zookeeper to keep track of configuration changes. For our purposes we used the Curator framework.

Read zookeeper multi-server setup on a single host for setting up zookeeper cluster.

In your java project include the dependency for the curator framework.

Initializing connection to Zookeeper ensemble
public static CuratorFramework createConnection(String zookeeperConnectionString) {
    //First retry will wait for 1 second, the second will wait up to 2 seconds, the third will wait
    //upto 4 seconds.
    RetryPolicy retryPolicy = new ExponentialBackoffRetry(1000, 3);
    CuratorFramework client = CuratorFrameworkFactory.builder().connectString(zookeeperConnectionString)
            .retryPolicy(retryPolicy)
            .namespace("session_service")
            .canBeReadOnly(true)
            .zookeeperFactory(new DefaultZookeeperFactory())
            .build();
    client.start();
    return client;
}

Creating a new znode
public static void create(CuratorFramework client, String path) throws Exception {
    client.create().forPath(path);
}

Setting data of a znode
public static void setData(CuratorFramework client, String path, String data) throws Exception {
    byte[] payload = data.getBytes();
    client.setData().forPath(path, payload);
}

Deleting a znode
public static void delete(CuratorFramework client, String path) throws Exception {
    client.delete().forPath(path);
}

Get children and set the given watcher on the node
public static List watchedGetChildren(CuratorFramework client, String path) throws Exception {
    return client.getChildren().usingWatcher(new WatcherImpl(client,path)).forPath(path);
}

Get Data of the node and set a watcher on the node
public static String getData(CuratorFramework client, String path) throws Exception{
    String str = new String(client.getData().usingWatcher(new WatcherImpl(client,path)).forPath(path));
    return str;
}

Zookeeper works on the idea of client setting watches on znodes. Whenever the znode changes, the watch is triggered, that is, the client is notified.

We should have a class that implements the Watcher interface. This class should implement the process method. This is called if and when the corresponding changes occur.
public class WatcherImpl implements Watcher{
    @Override
    public void process(WatchedEvent event) {
        if(event.getType() == Event.EventType.NodeDataChanged) {
     System.out.println("The Data has changed");
        }
        else if(event.getType() == Event.EventType.NodeChildrenChanged){
     System.out.println("Children have changed");
        }
    }
}

Its important to note that watches are one time event and if its required to continue monitoring the changes the watch needs to be set again. All the read operations like getData(), getChildren() and exists() have the option of setting a watch. These watches are triggered on the corresponding changes.

That's all for getting the configuration management with zookeeper up and running.