Overview of HDFS

HDFS (the Hadoop Distributed File System) is a core component of an Apache Hadoop distribution; it allows users to store massive quantities of data redundantly across a cluster.

HDFS is most commonly used for Hadoop MapReduce processing (including by higher-level tools such as Hive and Pig), but it is also used by other frameworks (like Apache Spark), and any conceivable application can use it directly via its API.

Several important characteristics of HDFS are:

  • It’s designed to run on commodity hardware, scale to thousands of servers, and operate in a fault-tolerant and economical manner
  • By default it stores three replicas of each block on three different machines in the cluster (this replication factor can be changed).  If a machine fails, HDFS will still properly complete any read or write operation in progress and continue running.
  • It is optimized to stream large volumes of data from large files for the purpose of analytics in which an entire dataset is being read and processed
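As a concrete illustration of the replication behavior described above, the ‘hadoop’ command-line tool can inspect and change a file’s replication factor (the file name below is hypothetical, reusing the example directory shown later in this article):

```shell
# Show a file's replication factor, block size, and length
# (%r = replication, %o = block size in bytes, %b = file length).
hadoop fs -stat "%r %o %b" /user/david/tweets/input/2014.csv

# Lower the file's replication factor to 2 copies;
# -w waits until re-replication has completed.
hadoop fs -setrep -w 2 /user/david/tweets/input/2014.csv
```

These commands require a running HDFS cluster, so they are shown here only as a sketch of the interface.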

Typical HDFS applications include:

  • Creating an inverted term index from a corpus of millions of text documents
  • Running social network analytics against billions of tweets

In both of the above cases, we are processing the entire dataset, not searching for individual records, and HDFS will reliably and rapidly stream the data out.  Ideally each file is large, at least the default block size of 64 MB, and often gigabytes or more in size.
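To check whether the files in a dataset are large enough to stream efficiently, one can list their sizes with the ‘hadoop’ command-line tool (the directory below is the same hypothetical example path used later in this article):

```shell
# Show the size of each file in an HDFS directory,
# with human-readable units (-h).
hadoop fs -du -h /user/david/tweets/input
```

This requires a running HDFS cluster and is shown only to illustrate the kind of inspection involved.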

Hadoop vendors like Cloudera claim that Hadoop storage costs approximately one-tenth as much as storage for traditional commercial data warehouse products (an approximation, since many factors are involved).  Because the cost of HDFS storage is so low, many organizations are able to make all their data “hot” (readily accessible), where in the past this was not possible due to the cost of a SAN or data warehouse.  This is a generalization, but it underscores the potential HDFS has to enable revolutionary approaches to data management for many organizations.

HDFS has both an API (primarily Java) and a command-line interface, so one can write programs that use HDFS directly or interact with it from the shell.  Here’s an example of copying some files from the local filesystem to a directory on HDFS, via the command-line program ‘hadoop’:

hadoop fs -copyFromLocal ~/tweets/*.csv /user/david/tweets/input  

Hortonworks has a good tutorial on interacting with HDFS from the command line.  Many of the operations are similar to Linux filesystem operations, e.g. cp for copy, mkdir for creating directories, etc.
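For instance, a few of those Linux-like operations look as follows (the paths and file names are illustrative, extending the example above):

```shell
# Create a directory on HDFS
hadoop fs -mkdir /user/david/tweets/output

# List the contents of a directory
hadoop fs -ls /user/david/tweets/input

# Print a file to standard output
hadoop fs -cat /user/david/tweets/input/2014.csv

# Copy a file within HDFS
hadoop fs -cp /user/david/tweets/input/2014.csv /tmp/2014.csv

# Remove a file
hadoop fs -rm /tmp/2014.csv
```

As with the earlier example, these commands assume a running HDFS cluster.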

HDFS is nearly ten years old and has remained a resilient and popular open source solution to the problem of reliably storing large quantities of data.  It is free and relatively easy to install and experiment with, and working with Hadoop hands-on is the best way to learn HDFS.

Further Reading: