Compatibility Requirements

| S.No | Category         | Supported                                                                     |
|------|------------------|-------------------------------------------------------------------------------|
| 1    | Languages        | Java, Python, Perl, Ruby, etc.                                                |
| 2    | Operating System | Linux (preferred for server deployment), Windows (development only), Solaris  |
| 3    | Hardware         | 32-bit Linux (64-bit for large deployments)                                   |
Installation Items

| S.No | Item                        | Version            |
|------|-----------------------------|--------------------|
| 1    | jdk-6u25-linux-i586.bin     | Java 1.6 or higher |
| 2    | hadoop-0.20.2-cdh3u0.tar.gz | Hadoop 0.20.2      |
Note: Both items must be installed on the NameNode and DataNode machines.
Installation Requirements

| S.No | Requirement                                                                  | Reason                                                    |
|------|------------------------------------------------------------------------------|-----------------------------------------------------------|
| 1    | Operating system: Linux recommended for server deployment (production env.) |                                                           |
| 2    | Language: Java 1.6 or higher                                                 |                                                           |
| 3    | RAM: at least 3 GB per node                                                  |                                                           |
| 4    | Hard disk: at least 1 TB                                                     | For the NameNode machine                                  |
| 5    | Root credentials                                                             | Admin permissions are needed to change some system files  |
High Level Steps

| Step # | Activity                                                   | Check |
|--------|------------------------------------------------------------|-------|
| 1      | Binding IP addresses with the host names under /etc/hosts  |       |
| 2      | Setting up passwordless SSH                                |       |
| 3      | Installing Java                                            |       |
| 4      | Installing Hadoop                                          |       |
| 5      | Setting the JAVA_HOME and HADOOP_HOME variables            |       |
| 6      | Updating the .bash_profile file for Hadoop                 |       |
| 7      | Creating the required folders for NameNode and DataNodes   |       |
| 8      | Configuring the .xml files                                 |       |
| 9      | Setting the masters and slaves on all the machines         |       |
| 10     | Formatting the NameNode                                    |       |
| 11     | Starting the DFS and MapReduce services                    |       |
| 12     | Stopping all services                                      |       |
Binding IP Addresses with the Host Names
Before starting the Hadoop installation, bind the IP address of each machine to its host name in the /etc/hosts file.
First check the hostname of your machine using the following command:
$ hostname
Open the /etc/hosts file to bind the IPs with the hostnames:
$ vi /etc/hosts
Provide the IP and hostname of all the machines in the cluster, e.g.:
10.11.22.33    hostname1
10.11.22.34    hostname2
Setting Passwordless SSH Login
Key-based SSH lets you log in from one system to another without a password. This is required when you run a cluster, so that starting the services does not prompt for a password again and again.
First log in on Host1 (the hostname of the NameNode machine) as the hadoop user and generate a pair of authentication keys:
hadoop@Host1$ ssh-keygen -t rsa
Note: Use the hostname you obtained with the hostname command above. Do not enter a passphrase if asked.
Now use ssh to create the directory ~/.ssh as user hadoop on Host2 (a hostname other than the NameNode machine):
hadoop@Host1$ ssh hadoop@Host2 mkdir -p .ssh
hadoop@Host2's password:
Finally, append Host1's new public key to hadoop@Host2:.ssh/authorized_keys and enter Host2's password one last time:
hadoop@Host1$ cat /home/hadoop/.ssh/id_rsa.pub | ssh hadoop@Host2 'cat >> .ssh/authorized_keys'
hadoop@Host2's password:
From now on you can log into Host2 as hadoop from Host1 without a password:
hadoop@Host1$ ssh hadoop@Host2
hadoop@Host2$
NOTE: Make the following changes on Host2:
- Change the permissions of .ssh to 700
- Change the permissions of .ssh/authorized_keys to 640
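These two changes correspond to the following commands, run as the hadoop user on Host2:
hadoop@Host2$ chmod 700 ~/.ssh
hadoop@Host2$ chmod 640 ~/.ssh/authorized_keys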
Prepare for Installation
Check for previously installed versions of Java and Hadoop on your machine:
$ rpm -qa | grep java
This displays the package names of the installed versions.
Remove all previous versions of Java and Hadoop installed on the machine:
$ rpm -e <software-name or path-name>
NOTE: All the installations and extractions below are done in /home/hadoop/.
Installing Java
Use the JDK bin file (jdk-6u25-linux-i586.bin) to install Java on your machine. Copy the .bin file to /home/hadoop/.
Execute the command ./jdk-6u25-linux-i586.bin in /home/hadoop/ (this unpacks the contents into the folder jdk1.6.0_25).
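A minimal sketch of the full sequence; the chmod is only needed if the copied .bin file is not already executable:
$ cd /home/hadoop/
$ chmod +x jdk-6u25-linux-i586.bin
$ ./jdk-6u25-linux-i586.bin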
Extract the Hadoop Package
Syntax:
$ tar -xzvf <hadoop-tar-package>
For example:
$ tar -xzvf hadoop-0.20.2-cdh3u0.tar.gz
Configuring HADOOP_HOME
Check whether HADOOP_HOME is set to the folder containing hadoop-core-VERSION.jar:
$ echo $HADOOP_HOME
If it is not set, set it:
$ export HADOOP_HOME=/home/hadoop/<hadoop-version>
For example:
$ cd /home/hadoop/
$ export HADOOP_HOME=/home/hadoop/hadoop-0.20.2-cdh3u0/
Setting JAVA_HOME
$ cd /home/hadoop/hadoop-0.20.2-cdh3u0/conf/
$ vi hadoop-env.sh
Edit the hadoop-env.sh file to set JAVA_HOME.
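Assuming the JDK was unpacked into /home/hadoop/jdk1.6.0_25 as described above, the JAVA_HOME line in hadoop-env.sh would look like this:
export JAVA_HOME=/home/hadoop/jdk1.6.0_25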
Press :wq to save and exit the file.
You also need to update the ~/.bash_profile file:
$ vi ~/.bash_profile
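A sketch of the lines to append to ~/.bash_profile, assuming the paths used throughout this guide:
export JAVA_HOME=/home/hadoop/jdk1.6.0_25
export HADOOP_HOME=/home/hadoop/hadoop-0.20.2-cdh3u0
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin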
Check for Hadoop Installation Confirmation
Run the hadoop command to confirm whether the installation is successful:
$ cd <hadoop-home-directory>
Standard path:
$ cd /home/hadoop/hadoop-0.20.2-cdh3u0/
$ bin/hadoop
On a successful installation, the hadoop command prints its usage message.
CONFIGURING HADOOP IN FULLY DISTRIBUTED MODE
Create the dfs.name.dir local directories on the NameNode machine:
$ cd /home/hadoop/
$ mkdir -p data/1/dfs/nn
Create the directories for storing the data blocks, and the temporary directory for storing process IDs, on the DataNode machines:
$ cd /home/hadoop/
$ mkdir -p data/1/dfs/dn data/2/dfs/dn data/3/dfs/dn
$ mkdir -p /home/hadoop/tmp
Create the directories for storing the temporary data (TaskTracker) and the system files for MapReduce jobs:
$ cd /home/hadoop/
$ mkdir -p data/1/mapred/local data/2/mapred/local data/3/mapred/local
$ mkdir -p /home/hadoop/mapred/system
Give full permissions to all folders under /home/hadoop/:
$ cd /home/hadoop/
$ chmod 777 *
Navigate to the /home/hadoop/hadoop-0.20.2-cdh3u0/conf directory:
$ cd /home/hadoop/hadoop-0.20.2-cdh3u0/conf
Set up the configuration files under /home/hadoop/hadoop-0.20.2-cdh3u0/conf/.
core-site.xml
$ vi core-site.xml
Parameters of core-site.xml:
fs.default.name → URI of the NameNode
hadoop.tmp.dir → base directory for temporary data
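A minimal core-site.xml along these lines, assuming hostname1 is the NameNode machine from the /etc/hosts example and port 54310 is an arbitrary choice:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hostname1:54310</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/tmp</value>
  </property>
</configuration>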
hdfs-site.xml
$ vi hdfs-site.xml
Parameters of hdfs-site.xml:
dfs.name.dir → the directories where the NameNode stores its metadata and edit logs; /home/hadoop/data/1/dfs/nn in the path examples above.
dfs.data.dir → the directories where the DataNode stores blocks; /home/hadoop/data/1/dfs/dn, /home/hadoop/data/2/dfs/dn and /home/hadoop/data/3/dfs/dn above.
dfs.replication → the number of times each block is replicated across the cluster.
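A minimal hdfs-site.xml using the directories created above; the replication factor of 3 is an example value:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/data/1/dfs/nn</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/data/1/dfs/dn,/home/hadoop/data/2/dfs/dn,/home/hadoop/data/3/dfs/dn</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>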
Press :wq to save and exit the file.
mapred-site.xml
$ vi mapred-site.xml
Parameters of mapred-site.xml:
mapred.local.dir → the directories where the TaskTracker stores temporary data and intermediate map output files while running MapReduce jobs, e.g. /home/hadoop/data/1/mapred/local, /home/hadoop/data/2/mapred/local and /home/hadoop/data/3/mapred/local.
mapred.system.dir → the path on HDFS where the MapReduce framework stores system files, e.g. /home/hadoop/mapred/system/.
mapred.job.tracker → the host (or IP) and port of the JobTracker.
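A minimal mapred-site.xml along these lines, assuming the JobTracker runs on hostname1 and port 54311 is an arbitrary choice:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hostname1:54311</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/home/hadoop/data/1/mapred/local,/home/hadoop/data/2/mapred/local,/home/hadoop/data/3/mapred/local</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/home/hadoop/mapred/system</value>
  </property>
</configuration>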
Press :wq to save and exit the file.
Set the correct owner and permissions of the local directories:

| Directory        | Owner         | Permissions |
|------------------|---------------|-------------|
| dfs.name.dir     | hdfs:hadoop   | drwx------  |
| dfs.data.dir     | hdfs:hadoop   | drwx------  |
| mapred.local.dir | mapred:hadoop | drwxr-xr-x  |

$ chmod 700 /home/hadoop/data/1/dfs/nn/
$ chmod 700 /home/hadoop/data/1/dfs/dn/ /home/hadoop/data/2/dfs/dn/ /home/hadoop/data/3/dfs/dn/
$ chmod 755 /home/hadoop/data/1/mapred/local/ /home/hadoop/data/2/mapred/local/ /home/hadoop/data/3/mapred/local/
Setting Up the Masters and Slaves
$ vi conf/masters
Enter the hostname of the machine acting as the SecondaryNameNode.
$ vi conf/slaves
Enter the hostnames of the machines acting as DataNodes and TaskTrackers, one per line.
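For example, with the hosts from the /etc/hosts example above, conf/masters might contain just:
hostname1
and conf/slaves might list each worker, one per line (include hostname1 only if the master machine should also run a DataNode and TaskTracker):
hostname1
hostname2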
Formatting the NameNode
Format the NameNode once, before starting the DFS services for the first time. Formatting initializes the directory where the NameNode stores its metadata and edit logs. Do not format a running or already-populated Hadoop NameNode: re-formatting erases all the data in the HDFS filesystem.
$ cd /home/hadoop/hadoop-0.20.2-cdh3u0/
$ bin/hadoop namenode -format
Note: Enter "Y" if it asks for confirmation to re-format.
Starting the DFS Services
Run the following command on the machine you want the NameNode to run on. This brings up HDFS with the NameNode running on that machine, and DataNodes on the machines listed in the conf/slaves file.
$ cd /home/hadoop/hadoop-0.20.2-cdh3u0/
$ ./bin/start-dfs.sh
NOTE: For any problems, check the log files under /home/hadoop/hadoop-0.20.2-cdh3u0/logs/ on all the machines, and refer to the troubleshooting guide.
Starting the MapReduce Services
Run the following command on the machine you want the JobTracker to run on (here it is the NameNode machine, but you can choose another machine). This starts the JobTracker there, and TaskTrackers on the machines listed in conf/slaves.
$ ./bin/start-mapred.sh
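As a quick sanity check, you can run the JDK's jps tool on each machine to list the running Java daemons. With this layout, the master machine should show NameNode, SecondaryNameNode and JobTracker, and each slave should show DataNode and TaskTracker:
$ jps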
Checking the DFS Service Report
$ ./bin/hadoop dfsadmin -report
Checking the DFS service on the web interface:
http://<ip-address-of-namenode-machine>:50070/
Checking MapReduce jobs on the web interface:
http://<ip-address-of-namenode-machine>:50030/
Stopping the DFS and MapReduce Services
$ cd /home/hadoop/hadoop-0.20.2-cdh3u0/
$ ./bin/stop-all.sh