Thursday, 8 November 2012

Hadoop installation in fully distributed mode



Compatibility Requirements

S.No   Category            Supported
1      Languages           Java, Python, Perl, Ruby, etc.
2      Operating System    Linux (server deployment, most preferred), Windows (development only), Solaris
3      Hardware            32-bit Linux (64-bit for large deployments)


Installation Items


S.No   Item                           Version
1      jdk-6u25-linux-i586.bin        Java 1.6 or higher
2      hadoop-0.20.2-cdh3u0.tar.gz    Hadoop 0.20.2 (CDH3u0)

Note: Both items must be installed on the NameNode and all DataNode machines.

Installation Requirements

S.No   Requirement                                                    Reason
1      Operating system – Linux recommended for server deployment
       (production environment)
2      Language – Java 1.6 or higher
3      RAM – at least 3 GB per node
4      Hard disk – at least 1 TB                                      For the NameNode machine.
5      Root credentials                                               Changing some system files requires admin permissions.


    

High level Steps

Step #   Activity
1        Binding the IP addresses to the host names in /etc/hosts
2        Setting up passwordless SSH
3        Installing Java
4        Installing Hadoop
5        Setting the JAVA_HOME and HADOOP_HOME variables
6        Updating the .bash_profile file for Hadoop
7        Creating the required folders for the NameNode and DataNodes
8        Configuring the .xml files
9        Setting the masters and slaves on all machines
10       Formatting the NameNode
11       Starting the DFS and MapReduce services
12       Stopping all services



Binding IP addresses with the host names
Before starting the Hadoop installation, bind the IP address of each machine in the cluster to its host name in the /etc/hosts file.
First, check the hostname of your machine with the following command:
$ hostname

Open the /etc/hosts file to bind the IP addresses to the host names:
$ vi /etc/hosts

Add the IP address and hostname of every machine in the cluster, e.g.:
10.11.22.33    hostname1
10.11.22.34    hostname2

Setting up passwordless SSH login
SSH keys let you log in from one system to another without a password. This is required when you run a cluster, so that the start and stop scripts do not prompt you for a password for every node.

First log in on Host1 (hostname of namenode machine) as hadoop user and generate a pair of authentication keys.  Command is:

hadoop@Host1$ ssh-keygen -t rsa
Note: Accept the defaults when prompted, and do not enter a passphrase if asked.
Now use ssh to create the directory ~/.ssh as user hadoop on Host2 (any machine other than the NameNode machine).

hadoop@Host1$ ssh hadoop@Host2 mkdir -p .ssh
hadoop@Host2's password:

Finally, append Host1's new public key to .ssh/authorized_keys on Host2 and enter Host2's password one last time:

hadoop@Host1$ cat /home/hadoop/.ssh/id_rsa.pub | ssh hadoop@Host2 'cat >> .ssh/authorized_keys'
hadoop@Host2's password:

From now on you can log in to Host2 as hadoop from Host1 without a password:
hadoop@Host1$ ssh hadoop@Host2
hadoop@Host2$


NOTE: Make the following changes on Host2:
- Change the permissions of .ssh to 700
- Change the permissions of .ssh/authorized_keys to 640
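For example (run as the hadoop user on Host2):

hadoop@Host2$ chmod 700 ~/.ssh
hadoop@Host2$ chmod 640 ~/.ssh/authorized_keys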


Prepare for installation

Check for previously installed versions of Java and Hadoop on your machine:
$ rpm -qa | grep java

This lists the names of any installed Java packages; repeat with grep hadoop for Hadoop.

Remove all previous versions of Java and Hadoop installed on the machine:
$ rpm -e <package-name>

NOTE: All the installations and extractions are being done in /home/hadoop/

Installing Java
Use the JDK bin file (jdk-6u25-linux-i586.bin) to install Java on your machine. Copy the .bin file to /home/hadoop/.

Execute ./jdk-6u25-linux-i586.bin in /home/hadoop/ (this unpacks the contents into the folder jdk1.6.0_25).
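For example (the chmod is an extra step, needed only if the downloaded file is not already executable):

$ cd /home/hadoop/
$ chmod +x jdk-6u25-linux-i586.bin
$ ./jdk-6u25-linux-i586.bin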


Extract the hadoop package

Syntax:
$ tar -xzvf <hadoop-tar-package>
$ tar -xzvf hadoop-0.20.2-cdh3u0.tar.gz

Configuring HADOOP_HOME
Check whether HADOOP_HOME is set to the folder containing hadoop-core-VERSION.jar:
$ echo $HADOOP_HOME

If it is not set, set it:
$ export HADOOP_HOME=/home/hadoop/<hadoop-version>
For example:
$ cd /home/hadoop/
$ export HADOOP_HOME=/home/hadoop/hadoop-0.20.2-cdh3u0/


Setting JAVA_HOME
$ cd /home/hadoop/hadoop-0.20.2-cdh3u0/conf/
$ vi  hadoop-env.sh


[Screenshot: hadoop-env.sh file for setting JAVA_HOME]
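In hadoop-env.sh, uncomment and set the JAVA_HOME line; a minimal sketch, assuming the JDK was unpacked into /home/hadoop/jdk1.6.0_25 as above:

export JAVA_HOME=/home/hadoop/jdk1.6.0_25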
Press :wq to save and exit the file

You also need to update the bash profile file:
$ vi   ~/.bash_profile
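A minimal sketch of the lines to append to ~/.bash_profile, assuming the install locations used above:

export JAVA_HOME=/home/hadoop/jdk1.6.0_25
export HADOOP_HOME=/home/hadoop/hadoop-0.20.2-cdh3u0
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin

Reload the profile so the changes take effect in the current shell:
$ source ~/.bash_profile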




Check for hadoop installation confirmation
Run the hadoop command to confirm that the installation is successful.

$ cd <hadoop-home-directory>
Standard path:
$ cd /home/hadoop/hadoop-0.20.2-cdh3u0/
$ bin/hadoop

On a successful installation, the command prints a usage message listing the available hadoop commands.


CONFIGURING HADOOP IN FULLY DISTRIBUTED MODE


Create the dfs.name.dir local directory on the NameNode machine

$ cd /home/hadoop/

 $ mkdir -p data/1/dfs/nn

Creating the directories for storing the data blocks, and the temporary directory (hadoop.tmp.dir), on the DataNode machines

$ cd /home/hadoop/

$ mkdir -p data/1/dfs/dn data/2/dfs/dn data/3/dfs/dn

$ mkdir -p /home/hadoop/tmp

Creating the directories where the TaskTracker stores temporary data, and the system directory for MapReduce jobs

$ cd /home/hadoop/

$ mkdir -p data/1/mapred/local data/2/mapred/local data/3/mapred/local

$ mkdir -p /home/hadoop/mapred/system

Give full permissions to all folders under /home/hadoop/:
$ cd /home/hadoop/

$ chmod 777 *

Navigate to the /home/hadoop/hadoop-0.20.2-cdh3u0/conf directory:

$ cd /home/hadoop/hadoop-0.20.2-cdh3u0/conf

Set up the configuration files under /home/hadoop/hadoop-0.20.2-cdh3u0/conf/.

core-site.xml
$ vi core-site.xml

Parameters of core-site.xml:
fs.default.name → URL of the NameNode.
hadoop.tmp.dir → Path for temporary data.
dfs.replication → The number of times each block is replicated across the cluster.
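A minimal core-site.xml sketch; the NameNode host hostname1, port 54310 and replication factor 3 are placeholder values to adjust for your cluster:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hostname1:54310</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/tmp</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>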



hdfs-site.xml
$ vi hdfs-site.xml

Parameters of hdfs-site.xml:

dfs.name.dir → The directories where the NameNode stores its metadata and edit logs, e.g. the /home/hadoop/data/1/dfs/nn path created above.

dfs.data.dir → The directories where the DataNodes store blocks, e.g. /home/hadoop/data/1/dfs/dn, /home/hadoop/data/2/dfs/dn and /home/hadoop/data/3/dfs/dn.
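A sketch of hdfs-site.xml using the directories created above (note that the directory lists are comma-separated without spaces):

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/data/1/dfs/nn</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/data/1/dfs/dn,/home/hadoop/data/2/dfs/dn,/home/hadoop/data/3/dfs/dn</value>
  </property>
</configuration>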


Press :wq to save and exit the file

mapred-site.xml
$ vi mapred-site.xml

Parameters of mapred-site.xml:

mapred.local.dir → The directories where the TaskTracker stores temporary data and intermediate map output files while running MapReduce jobs, e.g. /home/hadoop/data/1/mapred/local, /home/hadoop/data/2/mapred/local and /home/hadoop/data/3/mapred/local.

mapred.system.dir → Path on HDFS where the MapReduce framework stores system files, e.g. /home/hadoop/mapred/system/.

mapred.job.tracker → Host (or IP) and port of the JobTracker.
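A sketch of mapred-site.xml; the JobTracker host hostname1 and port 54311 are placeholders for the machine you want the JobTracker to run on:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hostname1:54311</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/home/hadoop/data/1/mapred/local,/home/hadoop/data/2/mapred/local,/home/hadoop/data/3/mapred/local</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/home/hadoop/mapred/system</value>
  </property>
</configuration>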



Press :wq to save and exit the file
Set the correct owner and permissions on the local directories:

Directory           Owner           Permissions
dfs.name.dir        hdfs:hadoop     drwx------
dfs.data.dir        hdfs:hadoop     drwx------
mapred.local.dir    mapred:hadoop   drwxr-xr-x



$ chmod 700 /home/hadoop/data/1/dfs/nn/ 
$ chmod 700 /home/hadoop/data/1/dfs/dn/   /home/hadoop/data/2/dfs/dn/ /home/hadoop/data/3/dfs/dn/
$ chmod 755 /home/hadoop/data/1/mapred/local/  /home/hadoop/data/2/mapred/local/ /home/hadoop/data/3/mapred/local/
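If you run the daemons as dedicated hdfs and mapred users (as the owners in the table above suggest), also set the ownership as root (# prompt); in this walkthrough everything runs as the hadoop user, so the following is only a sketch:

# chown -R hdfs:hadoop /home/hadoop/data/1/dfs /home/hadoop/data/2/dfs /home/hadoop/data/3/dfs
# chown -R mapred:hadoop /home/hadoop/data/1/mapred /home/hadoop/data/2/mapred /home/hadoop/data/3/mapred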

Setting up the masters and slaves
$ vi conf/masters
Enter the hostname of the machine acting as the SecondaryNameNode.

$ vi conf/slaves
Enter the hostnames (one per line) of the machines acting as DataNodes and TaskTrackers.
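For example, with the hosts bound earlier (hostname1 acting as NameNode and SecondaryNameNode, hostname2 as a slave; adjust for your cluster):

$ cat conf/masters
hostname1
$ cat conf/slaves
hostname2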


Formatting the namenode
Format the NameNode once, before starting the DFS services for the first time; formatting initializes the dfs.name.dir metadata directory on the NameNode and is not needed on subsequent starts. Do not format a running Hadoop NameNode, and do not re-format a NameNode that already holds data, otherwise all your data in the HDFS filesystem will be erased.

$ cd /home/hadoop/hadoop-0.20.2-cdh3u0/

$ bin/hadoop namenode -format

Note: Answer "Y" if it asks whether to re-format.

Starting the dfs service
Run bin/start-dfs.sh on the machine you want the NameNode to run on. This brings up HDFS with the NameNode running on that machine, and DataNodes on the machines listed in the conf/slaves file.

$ cd /home/hadoop/hadoop-0.20.2-cdh3u0/

$ ./bin/start-dfs.sh

NOTE: For any problems, check the log files under /home/hadoop/hadoop-0.20.2-cdh3u0/logs/ on all the machines, and refer to the troubleshooting guide.

Starting the mapred service

For the MapReduce services, run the following command on the machine you want the JobTracker to run on (in my case the NameNode machine; you can choose another machine as well):

$ ./bin/start-mapred.sh
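As a quick sanity check (an extra step, not part of the original guide), the JDK's jps tool lists the running Hadoop daemons on each machine:

$ jps

On the master you should see NameNode and JobTracker (and SecondaryNameNode, if it runs on the same machine); on the slave machines, DataNode and TaskTracker.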


Checking the DFS service report

$ ./bin/hadoop dfsadmin -report

The report shows the cluster's configured capacity, DFS usage, and the list of live DataNodes; verify that all of your DataNodes appear.

Checking the DFS service on the web interface

http://<namenode-ip>:50070/

Checking MapReduce jobs on the web interface

http://<jobtracker-ip>:50030/ (in this setup, the JobTracker runs on the NameNode machine)


Stopping the dfs and mapred services

$ cd /home/hadoop/hadoop-0.20.2-cdh3u0/

$ ./bin/stop-all.sh




