What is Hadoop?
Hadoop is a framework for storing and processing huge amounts of data on large clusters of commodity machines. It uses a computational model called MapReduce, in which a job is chopped into a number of small subtasks. These subtasks can be run, or re-run, on any node in the cluster.
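The MapReduce idea can be sketched in a few lines of plain Python. This is an illustrative simulation, not actual Hadoop code: a map step emits (word, 1) pairs, a shuffle step groups the pairs by key (the framework does this between phases), and a reduce step sums each group.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data on big clusters", "big data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["big"])  # 3
```

Because each map call touches only its own input line and each reduce call only its own key group, the subtasks are independent, which is exactly what lets Hadoop run or re-run them on any node.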
Use Cases
• Stores terabytes of data: HDFS runs on cheap, low-end commodity computers and storage devices, and pools the storage space of all the systems into one large file system.
• Streaming access to data: HDFS is designed for batch processing rather than interactive use. The focus is on high throughput of data access rather than low latency. The Hadoop framework is thus built for humongous datasets, at the price of slower individual accesses.
• Large data sets: Files in HDFS are typically gigabytes in size. HDFS is tuned to support such large files: it must provide high aggregate data bandwidth and scale to a large number of nodes in a single cluster.
• WORM requirement: HDFS applications need a write-once-read-many access model for files. Once a file is created, written, and closed, it need not be changed. This simplification minimizes data-coherency issues and enables high-throughput data access.
• High availability: HDFS stores multiple replicas of data on different systems in the cluster, so data remains available even when some system is down.
Architecture
Hadoop uses a master-slave architecture. An HDFS cluster has a single Namenode (master node), which manages the file system namespace and regulates access to files by clients. There are a number of Datanodes (slave nodes), usually one per node in the cluster, which manage the storage attached to the machines they run on. HDFS exposes a file system namespace in which user data is stored as files. Internally, each file is split into one or more blocks, and these blocks are stored on a set of Datanodes. The Namenode executes namespace operations such as opening, closing, and renaming files and directories, and it also determines the mapping of blocks to Datanodes. The Datanodes serve read and write requests from the file system's clients, and they create, delete, and replicate blocks upon instruction from the Namenode. The Namenode is the manager and repository of all HDFS metadata; user data itself never flows through the Namenode.
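As a rough sketch (plain Python with hypothetical names, not the real Namenode code), the metadata described above boils down to two maps: filename to an ordered list of block IDs, and block ID to the Datanodes holding a replica. The round-robin placement below is a simplification of HDFS's actual rack-aware policy.

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024   # default HDFS block size: 128 MB
REPLICATION = 3                  # default replication factor

_block_ids = itertools.count()   # monotonically increasing block IDs

class Namenode:
    """Toy model of the metadata the Namenode manages."""

    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.files = {}            # filename -> ordered list of block IDs
        self.block_locations = {}  # block ID -> Datanodes holding a replica

    def add_file(self, name, size):
        # Split the file into fixed-size blocks (the last may be smaller)...
        n_blocks = max(1, -(-size // BLOCK_SIZE))   # ceiling division
        blocks = [next(_block_ids) for _ in range(n_blocks)]
        self.files[name] = blocks
        # ...and place each block on REPLICATION distinct Datanodes.
        for i, b in enumerate(blocks):
            self.block_locations[b] = [
                self.datanodes[(i + r) % len(self.datanodes)]
                for r in range(REPLICATION)
            ]

nn = Namenode(["dn1", "dn2", "dn3", "dn4"])
nn.add_file("/logs/web.log", 300 * 1024 * 1024)   # 300 MB -> 3 blocks
print(len(nn.files["/logs/web.log"]))             # 3
```

Note that the Namenode object holds only bookkeeping, never file contents, mirroring the point that data does not pass through the real Namenode.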
Key Features
• File system namespace: HDFS is a hierarchical file system. It supports create, remove, move, and rename operations on files as well as directories.
• Replication: HDFS stores each file as a sequence of blocks, and replicating those blocks provides fault tolerance. The replication factor and block size are configurable. Files in HDFS are write-once and have strictly one writer at any time. In very large clusters, nodes are distributed across racks; placing replicas on different racks protects data against rack failure, at some cost in inter-rack network bandwidth. To minimize global bandwidth usage and read latency, HDFS tries to satisfy a read request from the replica that is closest to the reader.
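Both knobs mentioned above are set in Hadoop's `hdfs-site.xml` configuration file via the `dfs.replication` and `dfs.blocksize` properties; the values shown here are the usual defaults.

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>            <!-- replicas kept per block -->
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>    <!-- block size in bytes: 128 MB -->
  </property>
</configuration>
```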
What is Hadoop not designed for?
Hadoop is not designed for random reads/writes or sub-second response times. To address such workloads, HBase was built on top of HDFS.
