Thursday, May 28, 2009

Report from my visit to the Berkeley RAD Lab

I attended the UC Berkeley RAD Lab Spring Retreat held at Santa Cruz. The lab has about 30 PhD students, and the quality of their work really impressed me. Most of their research is on distributed systems. Many of the students were working with Hadoop, and it is amazing to see Hadoop at the core of so much research activity... when the Hadoop project started three years ago, I certainly did not imagine it would get to this state!

I had read David Patterson's papers during my graduate studies at the Univ of Wisconsin-Madison, and it was really nice to be able to meet him in person. The group of students he leads at the RAD Lab is of very high calibre. Most people must have already seen the Above the Clouds whitepaper that the RAD Lab has produced. It tries to clear up the muddle about what Cloud Computing really is, its benefits, and its possible usage scenarios. A good read!

A paper titled Detecting Large-Scale System Problems by Mining Console Logs describes how the logs of a distributed application can be mined to detect bugs, problems, and anomalies in the system. As an example, the authors process 24 million log lines produced by HDFS to detect a bug (page 11 of the paper). I am not really convinced about the bug itself, but this is an effort in the right direction.
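The paper's actual technique extracts message templates from source code and runs PCA over per-identifier event-count vectors; the following is only a toy sketch of that idea, with made-up log lines and a naive "differs from the common pattern" check standing in for PCA.

```python
from collections import Counter, defaultdict
import re

# A few synthetic log lines in the spirit of HDFS datanode logs
# (the real paper mines ~24 million lines; these are made up).
logs = [
    "Receiving block blk_1 src: /10.0.0.1 dest: /10.0.0.2",
    "Received block blk_1 of size 67108864",
    "Receiving block blk_2 src: /10.0.0.1 dest: /10.0.0.3",
    "Received block blk_2 of size 67108864",
    "Receiving block blk_3 src: /10.0.0.1 dest: /10.0.0.4",
    # blk_3 never logs "Received" -- an anomalous event sequence
]

def template(line):
    """Reduce a log line to its constant part by masking variables."""
    line = re.sub(r"blk_\d+", "blk_*", line)
    line = re.sub(r"/\d+\.\d+\.\d+\.\d+", "/IP", line)
    line = re.sub(r"\d+", "N", line)
    return line

# Build a per-block vector of message-template counts.
vectors = defaultdict(Counter)
for line in logs:
    m = re.search(r"blk_\d+", line)
    if m:
        vectors[m.group()][template(line)] += 1

# Flag blocks whose count vector differs from the common pattern.
common = Counter()
for v in vectors.values():
    common[frozenset(v.items())] += 1
normal = common.most_common(1)[0][0]
anomalies = [blk for blk, v in vectors.items() if frozenset(v.items()) != normal]
print(anomalies)  # -> ['blk_3']
```

Even this crude version surfaces the block whose life cycle is incomplete; the appeal of the paper is doing this at scale, without hand-written rules.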
My employer, Facebook, is an enabler of research in distributed systems. To this end, Facebook has allowed researchers from premier universities to analyze Hadoop system logs. These logs typically record machine usage and Hadoop job performance metrics. The Hadoop job records are inserted into a MySQL database for easy analysis (HADOOP-3708). Some students at the RAD Lab have used this database to show that Machine Learning techniques can be used to predict Hadoop job performance; that paper is not yet published. Another paper analyzes the performance characteristics of the Hadoop fair-share scheduler. Abstracts of most of these publications are available here.
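To give a flavor of what prediction over such job records looks like, here is a minimal sketch. The records and their values are invented (loosely shaped like the per-job metrics HADOOP-3708 collects), and a simple one-variable least-squares fit stands in for the real Machine Learning models in the unpublished paper.

```python
# Hypothetical job records: (map_tasks, input_gb, runtime_seconds).
# The numbers are made up for illustration.
jobs = [
    (10, 1.0, 120), (20, 2.1, 235), (40, 3.9, 460),
    (80, 8.0, 910), (15, 1.6, 180), (60, 6.2, 700),
]

# Fit runtime ~ a * input_gb + b with ordinary least squares.
n = len(jobs)
xs = [j[1] for j in jobs]
ys = [j[2] for j in jobs]
mx = sum(xs) / n
my = sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
b = my - a * mx

# Predicted runtime for a hypothetical 5 GB job.
predicted = a * 5.0 + b
print(round(predicted))
```

With real history one would obviously use more features (map/reduce counts, cluster load, data locality) and a richer model, but even a fit like this captures how strongly runtime tracks input size.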

Last, but not least, SCADS is a scalable storage system designed specifically for social-networking-style applications. It has a declarative query language and supports various consistency models. It also supports a rich data model that includes joins on pre-computed queries.
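I have not worked with SCADS's actual API, so the following is only a hypothetical Python sketch of the general idea behind joins on pre-computed queries: do the join work at write time by maintaining a materialized per-user result, so that the common social-network read ("show me my friends' posts") is a cheap lookup. All names here are invented.

```python
from collections import defaultdict

friends = defaultdict(set)     # user -> set of users they follow
timeline = defaultdict(list)   # user -> pre-computed feed (the "join")

def add_friend(user, friend):
    friends[user].add(friend)

def post(author, text):
    # Write amplification: push the post into every follower's
    # timeline so the join is already done when the page renders.
    for user, flist in friends.items():
        if author in flist:
            timeline[user].append((author, text))

add_friend("alice", "bob")
post("bob", "hello world")
print(timeline["alice"])  # -> [('bob', 'hello world')]
```

The trade-off is classic: writes fan out and cost more, but reads never pay for the join, which is the right balance for read-heavy social workloads.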

Sunday, May 24, 2009

Better Late than Never

For quite a while, I have been thinking about blogging on Hadoop in general and the Hadoop Distributed File System (HDFS) in particular. Why, you may ask?

Firstly, students from as far away as Bangladesh and Fiji have contacted me via email with questions about HDFS. This made me think that disseminating internal details of HDFS to the whole wide world would benefit a lot of people. I like interacting with these budding engineers; their questions, though elementary in nature, sometimes make me really ruminate on why we adopted a particular design and not another. I will sprinkle in a few of these examples next week.

Secondly, I visited a few universities last month, among them Carnegie Mellon University and my alma mater, the Univ of Wisconsin. On my flight, I was bored to death: I did not like the movie that was playing, and I had not carried any reading material. (Usually I like to read and re-read Sherlock Holmes.) But like they say, "an idle mind is the devil's workshop"... I started to jot down some exotic design ideas about HDFS... And lo and behold, I have a list of ideas that I would like to share! I will post them next week as well.