Sunday, November 21, 2010

Hadoop Research Topics

Recently, I visited a few premier educational institutes in India, e.g. the Indian Institutes of Technology (IIT) at Delhi and Guwahati. Most of the undergraduate students at these two institutes are somewhat familiar with Hadoop and would like to work on Hadoop-related projects as part of their course work. One question I commonly got from these students is: what Hadoop feature can I work on?

Here are some items I have in mind that are good topics for students to attempt if they want to work on Hadoop.
  • Ability to make the Hadoop scheduler resource-aware, especially with respect to CPU, memory and IO resources. The current implementation is based on statically configured slots.
  • Ability to make a map-reduce job accept new input splits even after the job has already started.
  • Ability to dynamically increase replicas of data in HDFS based on access patterns. This is needed to handle hot-spots of data; a minimal sketch built on the existing HDFS replication API appears after this list.
  • Ability to extend the map-reduce framework to process data that resides partly in memory. The current implementation assumes that the map-reduce framework is used to scan data that resides on disk devices. But memory on commodity machines is becoming larger and larger: a cluster of 3000 machines with 64 GB each can keep about 200 TB of data in memory! It would be nice if the Hadoop framework could support caching the hot set of data in the RAM of the TaskTracker machines. Performance should increase dramatically because it is costly to deserialize/decompress data from disk into memory for every query.
  • Heuristics to efficiently 'speculate' map-reduce tasks to work around machines that are laggards. In the cloud, the biggest challenge for fault tolerance is not handling outright failures but rather anomalies that make parts of the cloud slow (without failing completely); these impact the performance of jobs.
  • Make map-reduce jobs work across data centers. In many cases, a single Hadoop cluster cannot fit into a single data center and a user has to partition the dataset across two Hadoop clusters in two different data centers.
  • High Availability of the JobTracker. In the current implementation, if the JobTracker machine dies, then all currently running jobs fail.
  • Ability to create snapshots in HDFS. The primary use of these snapshots is to retrieve a dataset that was erroneously modified/deleted by a buggy application.
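
To make the replication idea above concrete, here is a minimal sketch of how a monitoring daemon could react to a hot file today using only the public HDFS client API (FileSystem.setReplication). The hot-spot detection, the thresholds and the file path are hypothetical; building that detection and automating the decision is the actual project.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /**
     * Toy illustration: bump the replication factor of a "hot" file and
     * drop it back once the access spike subsides. A real solution would
     * need to detect hot-spots from NameNode/DataNode access metrics.
     */
    public class HotFileReplicator {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml/hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path hotFile = new Path(args[0]);           // path of the file detected as hot
        short burstReplication = 10;                // serve many concurrent readers
        short normalReplication = 3;

        // Increase replicas while the file is hot ...
        fs.setReplication(hotFile, burstReplication);

        // ... and restore the default once the access pattern cools down.
        // (The trigger/timing logic is the real research problem.)
        fs.setReplication(hotFile, normalReplication);

        fs.close();
      }
    }
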
The first thing for a student who wants to do any of these projects is to download the code from HDFS and MAPREDUCE. Then create an account in the bug tracking software here. Search for an existing JIRA that describes your project; if none exists, create a new JIRA. Then write a design proposal and post it to the relevant JIRA so that the greater Apache Hadoop community can deliberate on it.

If anybody else has new project ideas, please add them as comments to this blog post.

Tuesday, June 29, 2010

America’s Most Wanted – a metric to detect faulty machines in Hadoop


Handling failures in Hadoop
I have been asked many, many questions about the failure rates of machines in our Hadoop cluster. These questions vary from the innocuous, such as how much time do you spend every day fixing bad machines in your cluster, to the more involved, such as does the failure of Hadoop machines depend on the heat-map of the data center? I do not have answers to these questions, and this is an area of research that has been receiving attention lately. But I have seen that the efficiency of a Hadoop cluster is directly dependent on the amount of manpower needed to operate it.

Common Types of Failures
The three most common categories of failures I have observed are
  • System errors: hardware, OS, JVM, Hadoop, compiler, etc. Hadoop aims to reduce the effect of this broad category of errors.
  • User application errors: bad code written by a user, e.g. bloated memory usage.
  • Anomalous behaviour: nodes not working according to expectation, e.g. slow nodes. These cause the most harm to a Hadoop cluster because they go undetected for long periods of time.
America's Most Wanted (AMW)
I have the privilege of working with one of the most experienced Hadoop cluster administrators, Andrew Ryan. He observed that a few machines in the Hadoop cluster are repeat offenders: they land in trouble, get incarcerated and fixed, and then create trouble again when put back online. He came up with a metric to determine when to throw a machine out of the Hadoop cluster. The following chart shows that 3% of our machines cause 43% of all manual repair events.


This clearly shows the need for a three-strikes-you-are-out law: if a machine goes into repair three times it is better to take it permanently out of the Hadoop cluster.
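
As an illustration of how simple the bookkeeping behind such a policy can be, here is a toy sketch that counts repair events per host and flags the third strike. The host names and the event feed are made up; in practice the events would come from the repair-ticket system.

    import java.util.HashMap;
    import java.util.Map;

    /**
     * Toy tracker for the "three-strikes-you-are-out" policy: count
     * manual repair events per host and flag repeat offenders for
     * permanent removal from the cluster.
     */
    public class RepairTracker {
      private static final int MAX_STRIKES = 3;
      private final Map<String, Integer> strikes = new HashMap<String, Integer>();

      /** Record a repair event; returns true if the host should be retired. */
      public boolean recordRepair(String hostname) {
        int count = strikes.containsKey(hostname) ? strikes.get(hostname) + 1 : 1;
        strikes.put(hostname, count);
        return count >= MAX_STRIKES;
      }

      public static void main(String[] args) {
        RepairTracker tracker = new RepairTracker();
        String[] repairLog = {"host17", "host42", "host17", "host17"};  // hypothetical event feed
        for (String host : repairLog) {
          if (tracker.recordRepair(host)) {
            System.out.println(host + ": third strike, decommission permanently");
          }
        }
      }
    }
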

An exotic question
Another exotic question that I am frequently asked is: do map-reduce jobs written in Python have a higher probability of failure? Here is a chart that tries to answer this question:
  • 5% of all jobs in the cluster are written in Python
  • 15% of cluster CPU is consumed by Python jobs
  • 20% of all failed jobs are written in Python
This does show that jobs written in Python consume more CPU on average than jobs written in Java: Python jobs account for only 5% of all jobs but 15% of cluster CPU. It also shows that a disproportionate share of these jobs fail: they make up 20% of failed jobs, four times their share of all jobs. Why is this? I do not have a definite answer, but my guess is that a developer is more likely to write the first few versions of an experimental query in Python because it is an easy-to-prototype language.

I presented a more detailed version of this at an IFIP Working Group on Dependable Computing workshop (DSN 2010). The aim of the workshop was to understand more about failure patterns on Hadoop nodes, automatic ways to analyze and handle these failures, and how the research community can help Hadoop become more fault-tolerant.




Sunday, May 9, 2010

Facebook has the world's largest Hadoop cluster!

It is not a secret anymore!

The Datawarehouse Hadoop cluster at Facebook has become the largest known Hadoop storage cluster in the world. Here are some of the details about this single HDFS cluster:
  • 21 PB of storage in a single HDFS cluster
  • 2000 machines
  • 12 TB per machine (a few machines have 24 TB each)
  • 1200 machines with 8 cores each + 800 machines with 16 cores each
  • 32 GB of RAM per machine
  • 15 map-reduce tasks per machine
That's a total of more than 21 PB of configured storage capacity! This is larger than the previously largest known cluster, Yahoo!'s 14 PB cluster. Here are the cluster statistics from the HDFS cluster at Facebook:

Hadoop started at Yahoo!, and full marks to Yahoo! for developing such critical infrastructure technology in the open. I started working with Hadoop when I joined Yahoo! in 2006. Hadoop was in its infancy at that time and I was fortunate to be part of the core set of Hadoop engineers at Yahoo!. Many thanks to Doug Cutting for creating Hadoop and to Eric14 for convincing the executive management at Yahoo! to develop Hadoop as open source software.

Facebook engineers work closely with the Hadoop engineering team at Yahoo! to push Hadoop to greater scalability and performance. Facebook has many Hadoop clusters; the largest among them is the one used for Datawarehousing. Here are some statistics that describe a few characteristics of Facebook's Datawarehousing Hadoop cluster:
  • 12 TB of compressed data added per day
  • 800 TB of compressed data scanned per day
  • 25,000 map-reduce jobs per day
  • 65 million files in HDFS
  • 30,000 simultaneous clients to the HDFS NameNode
A majority of this data arrives via Scribe, as described in scribe-hdfs integration. This data is loaded into Hive. Hive provides a very elegant way to query the data stored in Hadoop. Almost 99.9% of Hadoop jobs at Facebook are generated by a Hive front-end system. We provide lots more details about our scale of operations in our SIGMOD paper titled Data Warehousing and Analytics Infrastructure at Facebook.

Here are two pictorial representations of the rate of growth of the Hadoop cluster:

Details about our Hadoop configuration

I have fielded many questions from developers and system administrators about the Hadoop configuration that is deployed in the Facebook Hadoop Datawarehouse. Some of these questions are from Linux kernel developers who would like to make Linux swapping work better with Hadoop workloads; other questions are from JVM developers who may attempt to make Hadoop run faster for processes with large heap sizes; yet others are from GPU architects who would like to port a Hadoop workload to run on GPUs. To enable this type of outside research, here are the details about Facebook's Hadoop warehouse configuration. I hope this open sharing of infrastructure details from Facebook jump-starts the research community in designing ways and means to optimize systems for Hadoop usage.
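
For researchers who want to compare their own clusters against a setup like ours, here is a small sketch that prints a few of the configuration knobs that matter most for this kind of tuning (block size, reserved disk space, slot counts, task JVM options, sort buffers). The property names are common Hadoop 0.20-era keys; the values printed are whatever your local *-site.xml files contain, not Facebook's settings, which are in the details shared above.

    import org.apache.hadoop.conf.Configuration;

    /**
     * Print a handful of the Hadoop settings that matter most for the kind
     * of tuning discussed above. Values come from the local configuration
     * files on the classpath; unset keys are reported as such.
     */
    public class DumpTuningConfig {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        String[] keys = {
          "dfs.block.size",                         // HDFS block size
          "dfs.datanode.du.reserved",               // disk space reserved per DataNode volume
          "mapred.tasktracker.map.tasks.maximum",   // map slots per TaskTracker
          "mapred.tasktracker.reduce.tasks.maximum",
          "mapred.child.java.opts",                 // task JVM heap settings
          "io.sort.mb"                              // map-side sort buffer
        };
        for (String key : keys) {
          System.out.println(key + " = " + conf.get(key, "(not set)"));
        }
      }
    }
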



Sunday, April 25, 2010

The Curse of the Singletons! The Vertical Scalability of Hadoop NameNode

Introduction

HDFS is designed to be a highly scalable storage system, and sites at Facebook and Yahoo! have 20 PB file systems in production deployments. The HDFS NameNode is the master of the Hadoop Distributed File System (HDFS). It maintains the critical data structures of the entire file system. Most of the HDFS design has focused on scalability of the system, i.e. the ability to support a large number of slave nodes in the cluster and an even larger number of files and blocks. However, a 20 PB cluster with 30K simultaneous clients requesting service from a single NameNode means that the NameNode has to run on a high-end non-commodity machine. There have been some efforts to scale the NameNode horizontally, i.e. to allow the NameNode to run on multiple machines. I will defer analyzing those horizontal-scalability efforts to a future blog post; instead, let's discuss ways and means to make our singleton NameNode support an even greater load.


What are the bottlenecks of the NameNode?

Network: We have around 2000 nodes in our cluster and each node is running 9 mappers and 6 reducers simultaneously. This means that there are around 30K simultaneous clients requesting service from the NameNode. The Hive Metastore and the HDFS RaidNode impose additional load on the NameNode. The Hadoop RPCServer has a singleton Listener Thread that pulls data from all incoming RPCs and hands it to a bunch of NameNode handler threads. Only after all the incoming parameters of the RPC are copied and deserialized by the Listener Thread do the NameNode handler threads get to process the RPC. One CPU core on our NameNode machine is completely consumed by the Listener Thread. This means that during times of high load, the Listener Thread is unable to copy and deserialize all incoming RPC data in time, thus leading to clients encountering RPC socket errors. This is one big bottleneck to vertically scaling the NameNode.

CPU: The second bottleneck to scalability is the fact that most critical sections of the NameNode are protected by a singleton lock called the FSNamesystem lock. I did some major restructuring of this code about three years ago via HADOOP-1269, but even that is not enough for supporting current workloads. Our NameNode machine has 8 cores, but a fully loaded system can use at most only 2 cores simultaneously on average, the reason being that most NameNode handler threads encounter serialization via the FSNamesystem lock.

Memory: The NameNode stores all its metadata in the main memory of the singleton machine on which it is deployed. In our cluster, we have about 60 million files and 80 million blocks; this requires the NameNode to have a heap size of about 58 GB, which works out to roughly 400 bytes per namespace object on average. This is huge! There isn't any more memory left to grow the NameNode's heap size! What can we do to support an even greater number of files and blocks in our system?

Can we break the impasse?

RPC Server: We enhanced the Hadoop RPC Server to have a pool of Reader Threads that work in conjunction with the Listener Thread. The Listener Thread accepts a new connection from a client and then hands over the work of RPC-parameter deserialization to one of the Reader Threads. In our case, we configured a pool of 8 Reader Threads. This change doubled the number of RPCs that the NameNode can process at full throttle. It has been contributed to the Apache code via HADOOP-6713.
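
The sketch below illustrates the division of labour introduced by this change: a single listener thread does nothing but accept connections, while a pool of 8 reader threads performs the expensive copy-and-deserialize step before the call ever reaches a handler thread. It is a standalone toy with an invented length-prefixed wire format, not the actual HADOOP-6713 patch, which lives inside org.apache.hadoop.ipc.Server and uses non-blocking I/O.

    import java.io.DataInputStream;
    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    /**
     * Illustration of the Listener/Reader split: the listener thread only
     * accepts connections; a pool of reader threads reads and deserializes
     * the RPC parameters before handing the call to the handler threads.
     */
    public class ListenerWithReaderPool {
      public static void main(String[] args) throws IOException {
        final ExecutorService readers = Executors.newFixedThreadPool(8); // 8 Reader Threads
        ServerSocket listener = new ServerSocket(9000);

        while (true) {                                   // the Listener Thread's only job
          final Socket connection = listener.accept();
          readers.submit(new Runnable() {                // deserialization happens here,
            public void run() {                          // not on the listener thread
              try {
                DataInputStream in = new DataInputStream(connection.getInputStream());
                int length = in.readInt();               // toy wire format: length-prefixed payload
                byte[] rpcParams = new byte[length];
                in.readFully(rpcParams);
                // ... deserialize rpcParams and enqueue the call for a handler thread ...
              } catch (IOException e) {
                // a slow or broken client only ties up one reader, not the listener
              }
            }
          });
        }
      }
    }
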

The above change allowed a simulated workload to be able to consume 4 CPU cores out of a total of 8 CPU cores in the NameNode machine. Sadly enough, we still cannot get it to use all the 8 CPU cores!

FSNamesystem lock: A review of our workload showed that our NameNode typically has the following distribution of requests:
  • stat a file or directory 47%
  • open a file for read 42%
  • create a new file 3%
  • create a new directory 3%
  • rename a file 2%
  • delete a file 1%
The first two operations constitute about 90% of the workload for the NameNode and are read-only operations: they do not change file system metadata and do not trigger any synchronous transactions (the access time of a file is updated asynchronously). This means that if we change the FSNamesystem lock to a readers-writer lock, we can harness the full power of all processing cores in our NameNode machine. We did just that, and we saw yet another doubling of the processing rate of the NameNode! The load simulator can now make the NameNode process use all 8 CPU cores of the machine simultaneously. This code has been contributed to Apache Hadoop via HDFS-1093.
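
The locking pattern behind HDFS-1093 is easy to show in isolation. The sketch below uses java.util.concurrent's ReentrantReadWriteLock: the read-only operations that make up roughly 90% of the workload take the shared read lock and proceed in parallel, while mutations still serialize on the write lock. The class, method and type names here are placeholders, not the real FSNamesystem code.

    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    /**
     * The essence of the change: read-only operations (stat, open-for-read)
     * take the shared read lock and can run on all cores in parallel, while
     * mutations (create, rename, delete) still serialize on the write lock.
     */
    public class NamespaceLockSketch {
      private final ReadWriteLock fsLock = new ReentrantReadWriteLock();

      public FileStatusStub getFileInfo(String path) {   // ~90% of the workload
        fsLock.readLock().lock();
        try {
          return lookup(path);                           // no metadata is modified
        } finally {
          fsLock.readLock().unlock();
        }
      }

      public void createFile(String path) {              // the rare mutating case
        fsLock.writeLock().lock();
        try {
          insert(path);
        } finally {
          fsLock.writeLock().unlock();
        }
      }

      // Placeholder types/methods so the sketch stands alone.
      static class FileStatusStub {}
      private FileStatusStub lookup(String path) { return new FileStatusStub(); }
      private void insert(String path) {}
    }
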

The memory bottleneck issue is still unresolved. People have asked me if the NameNode can keep some portion of its metadata on disk, but this will require a change in the locking model first. One cannot hold the FSNamesystem lock while reading in data from disk: this would cause all other threads to block, thus throttling the performance of the NameNode. Could one use flash memory effectively here? Maybe an LRU cache of file system metadata would work well with current metadata access patterns? If anybody has good ideas here, please share them with the Apache Hadoop community.
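
As a starting point for the LRU idea, here is a minimal sketch of an LRU cache built on LinkedHashMap's access-order mode; metadata evicted from RAM would have to be re-read from flash or disk outside the FSNamesystem lock. This is just an illustration of the data structure, not a proposal for the actual NameNode internals.

    import java.util.LinkedHashMap;
    import java.util.Map;

    /**
     * A minimal LRU cache: the eldest (least recently used) entry is
     * evicted whenever the configured capacity is exceeded.
     */
    public class MetadataLruCache<K, V> extends LinkedHashMap<K, V> {
      private final int capacity;

      public MetadataLruCache(int capacity) {
        super(16, 0.75f, true);          // true => order entries by access, not insertion
        this.capacity = capacity;
      }

      @Override
      protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;        // evict once we exceed the configured capacity
      }
    }

Whether something like this helps depends entirely on whether the hot working set of the namespace is much smaller than the full in-memory image.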

In a Nutshell

The two proposed enhancements have improved NameNode scalability by a factor of 8. Sweet, isn't it?

Saturday, February 6, 2010

Hadoop AvatarNode High Availability



Our Use-Case

The Hadoop Distributed File System's (HDFS) NameNode is a single point of failure. This has been a major stumbling block in using HDFS for a 24x7 type of deployment. It has been a topic of discussion among a wide circle of engineers.

I am part of a team that is operating a cluster of 1200 nodes with a total size of 12 PB. This cluster is currently running Hadoop 0.20. The NameNode is configured to write its transaction logs to a shared NFS filer. The biggest cause of service downtime is when we need to deploy Hadoop patches to our cluster. A fix to the DataNode software is easy to deploy without cluster downtime: we deploy new code and restart a few DataNodes at a time. We can afford to do this because the DFSClient is equipped (via multiple retries to different replicas of the same block) to handle transient DataNode outages. The major problem arises when new software has to be deployed to our NameNode. An HDFS NameNode restart for a cluster of our size typically takes about an hour. During this time, the applications that use the Hadoop service cannot run. This led us to believe that some kind of High Availability (HA) for the NameNode is needed. We wanted a design that can support a failover in about a minute.

Why AvatarNode?

We took a serious look at the options available to us. The HDFS BackupNode is not available in 0.20, and upgrading our cluster to a newer Hadoop release is not a feasible option for us because it would have to go through extensive, time-consuming testing before we could deploy it in production. We started off white-boarding the simplest solution: a Primary NameNode and a Standby NameNode in our HDFS cluster. And we did not have to worry about split-brain scenarios and IO-fencing methods... the administrator ensures that the Primary NameNode is really, really dead before converting the Standby NameNode into the Primary. We also did not want to change the code of the NameNode one single bit and wanted to build HA as a software layer on top of HDFS. We wanted a daemon that can switch from being a Primary to a Standby and vice versa... as if switching avatars... and this is how our AvatarNode was born!

The word avatar means a variant phase or incarnation of a single entity, and we thought it appropriate that a wrapper that can make the NameNode take on two different roles be aptly named the AvatarNode. Our design of the AvatarNode is such that it is a pure wrapper around the existing NameNode in Hadoop 0.20; thus the AvatarNode inherits the scalability, performance and robustness of the NameNode.

What is an AvatarNode?

The AvatarNode encapsulates an instance of the NameNode. You can start an AvatarNode in the Primary avatar or the Standby avatar. If you start it as a Primary avatar, it behaves exactly like the NameNode you know now. In fact, it runs exactly the same code as the NameNode. The AvatarNode in its Primary avatar writes the HDFS transaction logs to the shared NFS filer. You can then start another instance of the AvatarNode on a different machine, this time telling it to start as a Standby avatar.

The Standby AvatarNode encapsulates a NameNode and a SecondaryNameNode within it. The Standby AvatarNode continuously reads the HDFS transaction logs from the same NFS filer and feeds those transactions to the encapsulated NameNode instance. The NameNode that runs within the Standby AvatarNode is kept in SafeMode. Keeping the Standby AvatarNode in SafeMode prevents it from performing any active duties of the NameNode while keeping a copy of all NameNode metadata hot in its in-memory data structures.
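
Conceptually, the Standby avatar runs a loop like the sketch below: tail the shared edits log on the NFS filer and replay each committed transaction into the encapsulated NameNode, which never leaves SafeMode. The interfaces and method names here are illustrative stand-ins, not the real AvatarNode or Hadoop 0.20 APIs.

    /**
     * Conceptual sketch (not the actual AvatarNode code) of the Standby
     * ingestion loop: read edits from the shared filer and keep the
     * in-memory namespace of the encapsulated NameNode hot.
     */
    public class StandbyIngestionSketch {
      interface EncapsulatedNameNode {                 // stand-in for the wrapped NameNode
        void applyTransaction(byte[] editLogRecord);   // replay one edit into memory
        boolean inSafeMode();
      }
      interface EditLogTailer {                        // stand-in for an edits-log reader on NFS
        byte[] nextTransaction() throws InterruptedException;  // blocks until a new edit appears
      }

      void ingestForever(EncapsulatedNameNode nn, EditLogTailer edits)
          throws InterruptedException {
        while (true) {
          byte[] record = edits.nextTransaction();     // next committed edit from the filer
          assert nn.inSafeMode();                      // the standby never serves client requests
          nn.applyTransaction(record);                 // keep the in-memory namespace hot
        }
      }
    }
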

HDFS clients are configured to access the NameNode via a Virtual IP Address (VIP). At the time of failover, the administrator kills the Primary AvatarNode on machine M1 and instructs the Standby AvatarNode on machine M2 (via a command line utility) to assume a Primary avatar. It is guaranteed that the Standby AvatarNode ingests all committed transactions because it reopens the edits log and consumes all transactions till the end of the file; this guarantee depends on the fact that NFS-v3 supports close-to-open cache coherency semantics. The Standby AvatarNode finishes ingestion of all transactions from the shared NFS filer and then leaves SafeMode. Then the administrator switches the VIP from M1 to M2 thus causing all HDFS clients to start accessing the NameNode instance on M2. This is practically instantaneous and the failover typically takes a few seconds!

There is no need to run a separate instance of the SecondaryNameNode. The AvatarNode, when run in the Standby avatar, performs the duties of the SecondaryNameNode too. The Standby AvatarNode continually ingests transactions from the Primary, but once in a while it pauses the ingestion, invokes the SecondaryNameNode to create and upload a checkpoint to the Primary AvatarNode, and then resumes the ingestion of transaction logs.

The AvatarDataNode

The AvatarDataNode is a wrapper around the vanilla DataNode found in Hadoop 0.20. It is configured to send block reports and blockReceived messages to two AvatarNodes. The AvatarDataNode does not use the VIP to talk to the AvatarNode(s); only HDFS clients use the VIP. An alternative to this approach would have been to make the Primary AvatarNode forward block reports to the Standby AvatarNode. We discounted this alternative because it adds code complexity to the Primary AvatarNode: the Primary AvatarNode would have to do buffering and flow control as part of forwarding block reports.

The Hadoop DataNode is already equipped to send (and retry if needed) blockReceived messages to the NameNode. We extend this functionality by making the AvatarDataNode send blockReceived messages to both the Primary and Standby AvatarNodes.
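
In outline, the extension looks like the sketch below: every blockReceived notification is delivered to both AvatarNodes, reusing the DataNode's existing retry behaviour for each target independently. The interfaces here are illustrative stand-ins, not the real DataNode code.

    import java.io.IOException;

    /**
     * Conceptual sketch: report every newly received block to both the
     * Primary and the Standby AvatarNode, so the Standby's block map
     * stays current without the Primary forwarding anything.
     */
    public class DualBlockReportSketch {
      interface AvatarNodeClient {                     // stand-in for the RPC proxy to one AvatarNode
        void blockReceived(String blockId) throws IOException;
      }

      private final AvatarNodeClient primary;
      private final AvatarNodeClient standby;

      DualBlockReportSketch(AvatarNodeClient primary, AvatarNodeClient standby) {
        this.primary = primary;
        this.standby = standby;
      }

      void notifyBlockReceived(String blockId) {
        reportWithRetry(primary, blockId);             // the existing retry logic in the DataNode
        reportWithRetry(standby, blockId);             // is simply applied to a second target
      }

      private void reportWithRetry(AvatarNodeClient node, String blockId) {
        for (int attempt = 0; attempt < 3; attempt++) {
          try {
            node.blockReceived(blockId);
            return;
          } catch (IOException e) {
            // back off and retry; a real implementation would queue and resend
          }
        }
      }
    }
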

Does the failover affect HDFS clients?

An HDFS client that was reading a file caches the location of the replicas of almost all blocks of a file at the time of opening the file. This means that HDFS readers are not impacted by AvatarNode failover.

What happens to files that were being written by a client when the failover occurred? The Hadoop 0.20 NameNode does not record new block allocations to a file until the file is closed. This means that a client that was writing to a file when the failover occurred will encounter an IO exception after a successful failover event. This is somewhat tolerable for map-reduce jobs because the Hadoop MapReduce framework will retry any failed tasks. This solution also works well for HBase because HBase issues a sync/hflush call to persist HDFS file contents before marking an HBase transaction as complete. In the future, it is possible that HDFS may be enhanced to record every new block allocation in the HDFS transaction log, details here. In that case, a failover event will not impact HDFS writers either.

Other NameNode Failover techniques

People have asked me to compare the AvatarNode HA implementation with Hadoop with DRBD and Linux HA. Both approaches require that the Primary write its transaction logs to a shared device. The difference is that the Standby AvatarNode is a hot standby, whereas the DRBD/Linux-HA solution is a cold standby. The failover time for our AvatarNode is about a minute, whereas a DRBD/Linux-HA failover would take about an hour for our cluster, which has about 50 million files. The reason that the AvatarNode can function as a hot standby is that the Standby AvatarNode has an instance of the NameNode encapsulated within it, and this encapsulated NameNode receives messages from the DataNodes, thus keeping its metadata state fully up-to-date.

Another implementation that I have heard about is from a team at China Mobile Research Institute; their approach is described here.

Where is the code?

This code has been contributed to the Apache HDFS project via HDFS-976. A prerequisite for this patch is HDFS-966.