Monday, December 5, 2011

HDFS - An Introduction

HDFS (Hadoop Distributed File System) is an Apache Software Foundation project and a sub-project of Apache Hadoop. It is well suited for storing large volumes of data, on the order of petabytes. It runs on commodity hardware and scales out simply by adding more nodes/machines.
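To make this concrete, here is a minimal sketch (not production code) of talking to HDFS from Java through the FileSystem API. The NameNode URI hdfs://namenode:9000 and the choice of the root directory are illustrative assumptions, not anything HDFS mandates:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        // Point the client at the NameNode; host and port are placeholders.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // List whatever sits in the root of the distributed file system.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}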

Some of the key features/advantages of HDFS are:
1. Built-in fault tolerance with automatic failure detection and recovery.
2. Portability and scalability across heterogeneous hardware and operating systems.
3. Economic efficiency by using commodity hardware.
4. Practically "infinite" scalability, providing a very large amount of storage space.
5. Extreme reliability by keeping replicas of every block (see the sketch after this list).
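Replication (point 5) is largely just configuration: dfs.replication controls the default number of copies (3 out of the box), and the Java API can also change it per file. A hedged sketch, again assuming a NameNode at hdfs://namenode:9000 and a hypothetical file /data/important.log:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default number of replicas for files created by this client.
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Raise the replication factor of one existing file to 5 for extra safety.
        fs.setReplication(new Path("/data/important.log"), (short) 5);
        fs.close();
    }
}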

So WHAT IS IT GOOD FOR... "Absolutely everything" that has anything to do with processing and storing large volumes and varieties of data.

Here are some of the problems that HDFS can solve for you:
1. Manage and store large amounts of data while keeping ready access to it, unlike tape archives, where getting at the data is a big challenge.
2. Process this large volume of data with the MapReduce framework and generate the business insights you always wanted but could not get, because the available tools could not handle that much data.
3. Stop worrying about storage: simply add another commodity machine to the cluster and get more capacity.
4. Store any kind of data without a pre-defined schema, and without waiting for your DBA to allocate table space and create a schema for you (see the sketch after this list).
5. Run searches and queries against semi-structured and unstructured data.
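For point 4, here is a small sketch of writing schema-free bytes into HDFS and reading them straight back. The file path /user/demo/notes.txt and the NameNode URI are placeholders chosen for illustration:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class WriteReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), new Configuration());
        Path file = new Path("/user/demo/notes.txt");

        // Write: no schema, no table space, just bytes.
        FSDataOutputStream out = fs.create(file);
        out.writeBytes("any semi-structured or unstructured data goes here\n");
        out.close();

        // Read it back and copy straight to stdout.
        FSDataInputStream in = fs.open(file);
        IOUtils.copyBytes(in, System.out, 4096, true);
        fs.close();
    }
}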

In my next post I will take a deep dive into the architecture and other details of HDFS.

