Tuesday, 6 November 2012

What is Hadoop


What is Hadoop?

Hadoop is a framework for storing and processing huge amounts of data on large clusters of commodity hardware. Hadoop uses a computational model called MapReduce, in which a job is divided into many small subtasks. These subtasks can be run, or re-run after a failure, on any node in the cluster.
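The map/shuffle/reduce flow described above can be sketched in a few lines of plain Python. This is a toy word count, not Hadoop itself: the three functions stand in for the map tasks, the framework's shuffle, and the reduce tasks.

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: each document is processed independently and emits
    (word, 1) pairs -- this is the part Hadoop runs in parallel on many nodes."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle step: group all values by key, as the framework does
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce step: combine the values for each key."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data is big", "hadoop stores big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

Because each map call touches only one document and each reduce call only one key, the work can be spread across any number of nodes, which is exactly what makes the model scale.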
Use Case
Ø  Stores terabytes of data: HDFS runs on cheap, low-end commodity machines and storage devices, and pools the storage space of all the systems into one large file system.

Ø  Streaming access to data: HDFS is designed for batch processing rather than interactive use. The emphasis is on high throughput of data access rather than low latency. The Hadoop framework is built for streaming humongous data sets, at the price of slower individual responses.

Ø  Large data sets: Typical file sizes are in gigabytes. HDFS is tuned to support such large files: it must provide high aggregate data bandwidth and scale to a large number of nodes in a single cluster.

Ø  WORM requirement: HDFS applications need a write-once-read-many access model for files. Once a file is created, written, and closed, it need not be changed. This simplification minimizes data coherency issues and enables high-throughput data access.

Ø  High availability: HDFS stores multiple replicas of data on different systems in the cluster, so data remains available even when a system goes down.



Architecture
Hadoop uses a master-slave architecture. An HDFS cluster has a single Namenode (master node), which manages the file system namespace and regulates access to files by clients. There are a large number of Datanodes (slave nodes), generally one per machine in the cluster, each of which manages the storage attached to the machine it runs on. HDFS exposes a file system namespace and lets user data be stored in files. Internally, each file is split into one or more blocks, and these blocks are stored on a set of Datanodes. The Namenode executes file system namespace operations, viz. opening, closing, and renaming files and directories, and also determines the mapping of blocks to Datanodes. The Datanodes serve read and write requests from the file system's clients, and create, delete, and replicate blocks upon instruction from the Namenode. The Namenode is the manager and repository of all HDFS metadata; in this design, user data never flows through the Namenode.
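The split of responsibilities above can be illustrated with a toy Python sketch. The class name, placement policy, and tiny block size are all assumptions for illustration, not Hadoop's real API: the point is that the Namenode holds only metadata (file → blocks → Datanodes), which clients consult before reading block data directly from the Datanodes.

```python
BLOCK_SIZE = 4  # bytes -- absurdly small for illustration only

class ToyNamenode:
    """Toy sketch of Namenode metadata; not the real Hadoop API."""

    def __init__(self):
        self.file_to_blocks = {}   # filename -> [block ids]
        self.block_locations = {}  # block id -> [datanode names]

    def create_file(self, name, size, datanodes, replication=2):
        """Split a file of `size` bytes into fixed-size blocks and record,
        for each block, which Datanodes hold a replica (naive round-robin
        placement, purely for illustration)."""
        num_blocks = -(-size // BLOCK_SIZE)  # ceiling division
        blocks = []
        for i in range(num_blocks):
            block_id = f"{name}#blk{i}"
            replicas = [datanodes[(i + r) % len(datanodes)]
                        for r in range(replication)]
            self.block_locations[block_id] = replicas
            blocks.append(block_id)
        self.file_to_blocks[name] = blocks

    def locate(self, name):
        """What a client asks the Namenode: where are my file's blocks?
        Actual block contents are then read from the Datanodes directly."""
        return {b: self.block_locations[b] for b in self.file_to_blocks[name]}

nn = ToyNamenode()
nn.create_file("/logs/day1", size=10, datanodes=["dn1", "dn2", "dn3"])
print(nn.locate("/logs/day1"))
```

Note that `locate` returns only metadata; the Namenode never touches the bytes of the file, which is why it does not become a data bottleneck.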



Key Features
Ø  File System Namespace: HDFS is a hierarchical file system. It supports create, remove, move, and rename operations on files as well as directories.

Ø  Replication: HDFS stores each file as a sequence of blocks, and blocks are replicated for fault tolerance. The block size and replication factor are configurable. Files in HDFS are write-once and have strictly one writer at any time. In very large clusters, nodes are distributed across racks; placing replicas on different racks improves fault tolerance, but transfers between racks consume more network bandwidth than transfers within a rack. To reduce global bandwidth consumption and read latency, HDFS tries to satisfy a read request from the replica that is closest to the reader.
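The "closest replica" idea in the replication bullet can be sketched as a tiny distance function. This is an assumed, simplified topology (a node is just a `(rack, host)` pair), not Hadoop's real network-topology code: same node beats same rack, which beats a remote rack.

```python
def distance(reader, replica):
    """Toy network distance: 0 = same node, 1 = same rack, 2 = remote rack.
    Nodes are (rack, host) pairs -- an assumed encoding for illustration."""
    if reader == replica:
        return 0
    if reader[0] == replica[0]:  # same rack id
        return 1
    return 2

def closest_replica(reader, replicas):
    """Pick the replica with the smallest distance to the reader."""
    return min(replicas, key=lambda r: distance(reader, r))

# A block replicated on two nodes in rack1 and one node in rack2:
replicas = [("rack1", "hostA"), ("rack1", "hostB"), ("rack2", "hostC")]

print(closest_replica(("rack1", "hostB"), replicas))  # same node wins
print(closest_replica(("rack2", "hostZ"), replicas))  # same rack wins
```

Reading from the nearest replica keeps most traffic within a rack, which is how replication reduces global bandwidth usage as well as read latency.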


What is Hadoop not designed for?
    Hadoop is not designed for random reads/writes or sub-second responses. To address such workloads, HBase was designed on top of HDFS.






