Tuesday, May 17, 2011

Realtime Hadoop usage at Facebook -- Part 1

Facebook recently deployed Facebook Messages, its first ever user-facing application built on the Apache Hadoop platform. It uses HDFS and HBase as core technologies for this solution. Since then, there are many more applications that have started to used HBase. We have gained some experience in deploying and operating HDFS and HBase at peta-byte scale for realtime-workloads and decided to write a paper detailing some of these insights. This paper will be published in SIGMOD 2011.

You can find the full paper here later, but here are some highlights:

WHY HADOOP AND HBASE

The requirements for the storage system for our workloads can be summarized as follows:

1. Elasticity: We need to be able to add incremental capacity to our storage systems with minimal overhead and no downtime. In some cases we may want to add capacity rapidly and the system should automatically balance load and utilization across new hardware.

2. High write throughput: Most of the applications store (and optionally index) tremendous amounts of data and require high aggregate write throughput.

3. Efficient and low-latency strong consistency semantics within a data center: There are important applications like Messages that require strong consistency within a data center. This requirement often arises directly from user expectations. For example ‘‘unread’’ message counts displayed on the home page and the messages shown in the inbox page view should be consistent with respect to each other. While a globally distributed strongly consistent system is practically impossible, a system that could at least provide strong consistency within a data center would make it possible to provide a good user experience. We also knew that (unlike other Facebook applications), Messages was easy to federate so that a particular user could be served entirely out of a single data center making strong consistency within a single data center a critical requirement for the Messages project. Similarly, other projects, like realtime log aggregation, may be deployed entirely within one data center and are much easier to program if the system provides strong consistency guarantees.

4. Efficient random reads from disk: In spite of the widespread use of application level caches (whether embedded or via memcached), at Facebook scale, a lot of accesses miss the cache and hit the back-end storage system. MySQL is very efficient at performing random reads from disk and any new system would have to be comparable.

5. High Availability and Disaster Recovery: We need to provide a service with very high uptime to users that covers both planned and unplanned events (examples of the former being events like software upgrades and addition of hardware/capacity and the latter exemplified by failures of hardware components). We also need to be able to tolerate the loss of a data center with minimal data loss and be able to serve data out of another data center in a reasonable time frame.
6. Fault Isolation: Our long experience running large farms of MySQL databases has shown us that fault isolation is critical. Individual databases can and do go down, but only a small fraction of users are affected by any such event. Similarly, in our warehouse usage of Hadoop, individual disk failures affect only a small part of the data and the system quickly recovers from such faults.

7. Atomic read-modify-write primitives: Atomic increments and compare-and-swap APIs have been very useful in building lockless concurrent applications and are a must have from the underlying storage system.

8. Range Scans: Several applications require efficient retrieval of a set of rows in a particular range. For example all the last 100 messages for a given user or the hourly impression counts over the last 24 hours for a given advertiser.

It is also worth pointing out non-requirements:

1. Tolerance of network partitions within a single data center: Different system components are often inherently centralized. For example, MySQL servers may all be located within a few racks, and network partitions within a data center would cause major loss in serving capabilities therein. Hence every effort is made to eliminate the possibility of such events at the hardware level by having a highly redundant network design.

2. Zero Downtime in case of individual data center failure: In our experience such failures are very rare, though not impossible. In a less than ideal world where the choice of system design boils down to the choice of compromises that are acceptable, this is one compromise that we are willing to make given the low occurrence rate of such events. We might revise this non-requirement at a later time.

3. Active-active serving capability across different data centers: As mentioned before, we were comfortable making the assumption that user data could be federated across different data centers (based ideally on user locality). Latency (when user and data locality did not match up) could be masked by using an application cache close to the user.

Some less tangible factors were also at work. Systems with existing production experience for Facebook and in-house expertise were greatly preferred. When considering open-source projects, the strength of the community was an important factor. Given the level of engineering investment in building and maintaining systems like these –– it also made sense to choose a solution that was broadly applicable (rather than adopt point solutions based on differing architecture and codebases for each workload).

After considerable research and experimentation, we chose Hadoop and HBase as the foundational storage technology for these next generation applications. The decision was based on the state of HBase at the point of evaluation as well as our confidence in addressing the features that were lacking at that point via in- house engineering. HBase already provided a highly consistent, high write-throughput key-value store. The HDFS NameNode stood out as a central point of failure, but we were confident that our HDFS team could build a highly-available NameNode (AvatarNode) in a reasonable time-frame, and this would be useful for our warehouse operations as well. Good disk read-efficiency seemed to be within striking reach (pending adding Bloom filters to HBase’’s version of LSM Trees, making local DataNode reads efficient and caching NameNode metadata). Based on our experience operating the Hive/Hadoop warehouse, we knew HDFS was stellar in tolerating and isolating faults in the disk subsystem. The failure of entire large HBase/HDFS clusters was a scenario that ran against the goal of fault-isolation, but could be considerably mitigated by storing data in smaller HBase clusters. Wide area replication projects, both in-house and within the HBase community, seemed to provide a promising path to achieving disaster recovery.

HBase is massively scalable and delivers fast random writes as well as random and streaming reads. It also provides row-level atomicity guarantees, but no native cross-row transactional support. From a data model perspective, column-orientation gives extreme flexibility in storing data and wide rows allow the creation of billions of indexed values within a single table. HBase is ideal for workloads that are write-intensive, need to maintain a large amount of data, large indices, and maintain the flexibility to scale out quickly.

HBase is now being used by many other workloads internally at Facebook . I will describe these different workloads in a later post.

(Credit to the authors of the paper: Dhruba Borthakur Kannan Muthukkaruppan Karthik Ranganathan Samuel Rash Joydeep Sen Sarma Jonathan Gray Nicolas Spiegelberg Hairong Kuang Dmytro Molkov Aravind Menon Rodrigo Schmidt Amitanand Aiyer)

20 comments:

  1. Nice job, Dhruba.

    Look forward to your following installments of the post.

    ReplyDelete
  2. This comment has been removed by a blog administrator.

    ReplyDelete
  3. D - congrats to you and crew ... Well done ...

    This rocks !!

    PK

    ReplyDelete
  4. Great. Looking forward for the paper.

    ReplyDelete
  5. Thanks for sharing your insights

    ReplyDelete
  6. Thank you! Can't wait for the details!!!

    ReplyDelete
  7. Thanks for this post. I'm really looking forward to the future posts in this series.

    ReplyDelete
  8. H, if you can make the decision today, do you select hbase/hadoop again or cassandra ? And please can you give some more details about this from to todays perspective.

    Serdar

    ReplyDelete
  9. Nice work. Wanna read more.

    ReplyDelete
  10. Dhruba:

    very illuminating. Just wanted to find out why a lot of accesses miss memcached ... is it a RAM capacity issue or is it just the nature of the accesses that, for instance, even doubling the RAM will not really improve the hit rate.

    ReplyDelete
  11. Very thoughtful to see how HBASE and HDFS work together for reliable storage and workload.

    ReplyDelete
  12. This comment has been removed by a blog administrator.

    ReplyDelete
  13. "This" are my musings --> "These" are my musings...

    ReplyDelete
  14. The Blog title is the name of the album and the number in brackets next to it tells you the number of images currently stored. At the bottom of the page, you can see a message in green small font which gives you an overall idea of how much storage space all your Blog images have taken up. www.acnc.com

    ReplyDelete