I woke up early and cheery Wednesday morning to attend the 2011 Hadoop Summit in Santa Clara, after a long drive from Los Angeles and the Big Data Camp that lasted until 10pm the night before. Having been to Hadoop Summit 2010, I was interested to see how much of the content in the conference had changed.
This year, there were approximately 1,600 participants, and the summit moved a few feet away from the Hyatt to the Convention Center. Still, space and seating were pretty cramped. That just goes to show how much the Hadoop field has grown in just one year.
Keynotes
We first heard a series of keynote speeches, which I will summarize. The first keynote was from Jay Rossiter, SVP of the Cloud Platform Group at Yahoo. He introduced how Hadoop is used at Yahoo, which is fitting since they organized the event. The content of his presentation was very similar to last year’s. One interesting application of Hadoop at Yahoo was for “retiling” the map of the United States. I imagine this refers to the change in aerial imagery over time. When performed by hand, retiling took 6 weeks; with Hadoop, it took 5 days. Yahoo also uses Hadoop for fraud detection, spam detection, search assist, geotagging data/local indexing, ad targeting, predicting supply and demand, and the aggregation and categorization of news stories. Jay also mentioned that Dapper runs models on data with Hadoop for ad personalization, and that Big Data conferences all over the country are selling out.
Eric Baldeschwieler, the CEO of Hortonworks, was next. Hortonworks seems to be a new company that spun off from Yahoo. Their goal is to provide commercial support and a full Apache Hadoop platform for users. Yes, they are very similar to Cloudera, and yes, they are competition. (Hortonworks and MapR both did a good job of not stepping on everyone’s toes in terms of how they presented themselves.) Cloudera provides its own distribution of Hadoop, which is of course similar to the Apache version. Hortonworks’ goal is to provide similar services, but with more transparency by using the Apache Hadoop distribution rather than wrapping its own. Paraphrasing Eric, Hortonworks is open-source from the ground up. A bit later, Sanjay Radia, also of Hortonworks, discussed Hadoop for the enterprise. Hortonworks has contributed, or is working on, security (preventing users from deleting others’ data), service level agreements (SLAs), predictability and a Fair-Share scheduler.
Anant Jhingran, CTO of IBM, discussed how Hadoop was used in IBM Watson. It seemed pretty obvious that Hadoop or some form of map-reduce was used in the system, but it did not seem to be highly publicized. Watson learned from 200 million pages of data, about 2-5TB, and required between 3000 and 4000 Watts. Anant went quickly through a cool user interface representing a Jeopardy board and stated that the user interface to an artificial intelligence application is just as important as the application itself. He also prefers the term IA (intelligence augmentation) over AI, and apparently this is a common distinction. I interpret AI vs. IA to be artificial intelligence vs. knowledge discovery (data mining).
Karthik Ranganathan from Facebook discussed Facebook’s messaging system, which was built on HBase, HDFS and MapReduce. Facebook sees 15 billion messages per month, excluding SMS and email, approximately 14TB of data! There are also 120 billion chat messages (25TB), for a grand total of almost 300TB per month. (I may have missed something, as these numbers do not add up.) Facebook uses HBase for the bodies of small messages, metadata, and the search index, because of its high write throughput and easy horizontal scalability. Facebook uses another system called Haystack for photos, bodies of large messages and attachments. Of course, HDFS is used for fault tolerance, scalability, checksums for data integrity and its MapReduce abilities. Profiles and services are partitioned by user. Each machine has 16 cores, 12 1TB hard disks, and 48GB RAM (24GB used for HBase). Some things that Facebook would like to contribute and improve: NameNode high availability and a second NameNode, better performance overall, and the use of flash memory to improve performance. Facebook often adds several columns to a table so that DevOps does not need to take the server offline to add new columns.
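As a quick sanity check on the figures above, here is a back-of-the-envelope sketch. The variable names are mine, the numbers come straight from my notes, and treating 1 TB as 10**12 bytes is an assumption.

```python
# Monthly volumes as I noted them during the talk.
# Assumption: decimal terabytes (1 TB = 10**12 bytes).
TB = 10**12

messages_per_month = 15 * 10**9
message_bytes = 14 * TB
chats_per_month = 120 * 10**9
chat_bytes = 25 * TB

avg_message_size = message_bytes / messages_per_month  # bytes per message
avg_chat_size = chat_bytes / chats_per_month            # bytes per chat message
subtotal_tb = (message_bytes + chat_bytes) / TB          # itemized total in TB

print(f"average message: {avg_message_size:.0f} bytes")
print(f"average chat message: {avg_chat_size:.0f} bytes")
print(f"messages + chat: {subtotal_tb:.0f} TB per month, versus the ~300 TB quoted")
```

The two itemized figures only account for 39 TB, so the rest of the quoted total presumably comes from something I did not write down.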
Breakout Sessions
There were so many great sessions and I can only summarize the ones I attended. Check out the event agenda for abstracts on all sessions.
First I attended Web Crawl Cache – Using HBase to Manage a Copy of the Web. In this talk, we learned about Yahoo’s Web Crawl Cache (WCC), which collects and organizes data from Microsoft as a result of a search deal. These snapshots of the web are not only useful for search, but also for drilling into other avenues such as local assets, influence and language corpora. WCC uses HBase for several reasons: bulk loading, efficient MapReduce jobs, random-access reads, a usable consistency model, and the ease of dynamically adding columns (this seems to contradict Karthik’s claim).
It was very difficult to pick a session for the 1:45 to 2:15 time slot. Options included Next Generation Hadoop, Scaling out Realtime Data (Facebook) and Building Kafka (LinkedIn). I admire the work and clout that LinkedIn has built over the past year or two, so I attended Jay Kreps’s session. LinkedIn’s data pipeline includes a lot of tracking, logging, metrics, messages and queuing. LinkedIn attempted to use messaging systems such as JMS and RabbitMQ. Streaming data is prevalent at LinkedIn: search trends, click trends, invitation social networks, etc. Kafka is LinkedIn’s solution for a distributed message queue; rather than polling for data, users subscribe to a data stream and data sources publish data to it. Kafka is 7000 lines of Scala, a functional and object-oriented language on top of the Java Virtual Machine (JVM). Kafka can produce about 250,000 messages per second (50 MB/s) and consume 550,000 messages per second (110 MB/s).
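To make the publish/subscribe idea concrete, here is a minimal in-memory sketch. ToyBroker and its methods are names I made up for illustration; this is not Kafka's API, and it says nothing about how Kafka actually stores or delivers messages.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class ToyBroker:
    """In-memory publish/subscribe broker (hypothetical; not Kafka's API)."""

    def __init__(self) -> None:
        # Maps a topic name to the callbacks subscribed to it.
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        """Register interest in a topic once, instead of repeatedly polling a source."""
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: str) -> int:
        """Hand the message to every subscriber of the topic; return how many received it."""
        for callback in self._subscribers[topic]:
            callback(message)
        return len(self._subscribers[topic])


broker = ToyBroker()
click_audit: List[str] = []
search_trends: List[str] = []
broker.subscribe("clicks", click_audit.append)
broker.subscribe("searches", search_trends.append)

# Data sources only know the topic name, not who is listening.
broker.publish("clicks", "user=1 page=/jobs")
broker.publish("searches", "query=hadoop")
broker.publish("clicks", "user=2 page=/home")

print(click_audit)
print(search_trends)
```

The point of the design is decoupling: a new consumer of click data can be added with one subscribe call, and the publishing side does not change.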
Next I attended another talk by Hortonworks, this time on HCatalog. HCatalog changes the way we think about data in HDFS. No longer do we need to worry about files and directories. Instead, HCatalog seems to add a layer of abstraction on top of HDFS that treats data as a set of tables. Tools such as Pig and Hive use this layer of abstraction, and currently Hive is tightly integrated with HCatalog. Hortonworks intends to add support for HBase and Streaming later this year.
I waited all day to see Matei Zaharia’s talk on Spark. Zaharia is a graduate student at UC Berkeley and it was a nice change of pace to see a student present some work. Spark is a data processing platform that sits on top of the Mesos cluster management project (also produced by Berkeley). Mesos can handle tens of thousands of nodes and hundreds of concurrent jobs, and workloads can be isolated in Linux containers (e.g. OpenVZ). Spark aims to extend MapReduce for iterative algorithms and interactive, low-latency data mining. One major difference between MapReduce and Spark is that MapReduce is acyclic. That is, data flows in from a stable source, is processed, and flows out to a stable filesystem. Spark allows iterative computation on the same data, which would form a cycle if jobs were visualized. The Resilient Distributed Dataset (RDD) serves as an abstraction over raw data, and some data is kept in memory and cached for later use. This last point is very important; Spark allows data to be cached in RAM for an approximate 20x speedup over disk-based MapReduce. RDDs are immutable and created through parallel transformations such as map, filter and groupBy. RDD immutability is similar to immutable types in functional programming languages. It does not mean that the dataset cannot change. Instead, it means that a new copy of the dataset is created, with the change included. The user can also perform actions on RDDs such as reduce, count, collect, etc. Some applications using Spark are traffic prediction (Berkeley), spam classification (Twitter), k-means, alternating least squares matrix factorization, and network simulation.
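Here is a toy sketch of the ideas above: transformations return a new dataset rather than modifying the old one, nothing is computed until an action runs, and a cached dataset is reused from memory. ToyDataset is a name I invented; this is plain single-machine Python, not Spark's API.

```python
from typing import Callable, Iterable, List, Optional


class ToyDataset:
    """Immutable, lazily evaluated dataset (hypothetical; not Spark's API)."""

    def __init__(self, compute: Callable[[], Iterable]) -> None:
        self._compute = compute              # recipe that produces the elements
        self._cache: Optional[List] = None   # materialized elements, if cached
        self._keep_in_memory = False

    @staticmethod
    def from_list(items: List) -> "ToyDataset":
        snapshot = list(items)  # copy so later changes to `items` do not leak in
        return ToyDataset(lambda: snapshot)

    # Transformations: each returns a NEW dataset and computes nothing yet.
    def map(self, fn: Callable) -> "ToyDataset":
        return ToyDataset(lambda: (fn(x) for x in self._items()))

    def filter(self, pred: Callable) -> "ToyDataset":
        return ToyDataset(lambda: (x for x in self._items() if pred(x)))

    def cache(self) -> "ToyDataset":
        """Ask for the elements to be kept in memory after the first computation."""
        self._keep_in_memory = True
        return self

    # Actions: these trigger the actual computation.
    def collect(self) -> List:
        return list(self._items())

    def count(self) -> int:
        return len(self.collect())

    def _items(self) -> List:
        if self._cache is not None:
            return self._cache
        result = list(self._compute())
        if self._keep_in_memory:
            self._cache = result
        return result


evaluations: List[int] = []  # records every time the filter predicate runs


def is_even(n: int) -> bool:
    evaluations.append(n)
    return n % 2 == 0


numbers = ToyDataset.from_list([1, 2, 3, 4, 5, 6])
evens = numbers.filter(is_even).cache()
squares = evens.map(lambda n: n * n)

print(numbers.collect())  # the original dataset is unchanged
print(evens.count())      # first action on evens: runs the filter and caches
print(squares.collect())  # reuses the cached evens; the filter does not run again
print(len(evaluations))   # number of predicate calls across both actions
```

An iterative algorithm that loops over the same cached dataset pays the filtering cost once, which is the intuition behind the in-memory speedup mentioned in the talk.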
The main takeaway from Hadoop Summit 2010 was Cascalog. I predict the main takeaway from Hadoop Summit 2011 is Spark.
One time at work I had a bizarre issue with corrupted data in HDFS. After that, I began blaming everything on HDFS. The next session, Data Integrity and Availability of HDFS, was enlightening. HDFS takes good care of Yahoo’s data. We can trust Yahoo because if HDFS breaks, Yahoo begins losing money, so they know what they are talking about! Yahoo’s goal is to have 60 PB online all the time. The key to HDFS reliability is replication. A replication factor of 3 (that is, 3 copies of every block) is appropriate. A replication factor of 2 is also quite robust, but should only be used when there is a backup of the data because the probability of data loss is much higher. Yahoo has had issues with losing blocks (blocks are pieces of data, so lost blocks = data loss). There are a variety of reasons and most of them had nothing to do with HDFS. One cause of lost blocks is a bug in a Hadoop component like Pig, particularly a new version. In one incident, a new version of Pig opened a lot of files without closing them, and created a lot of abandoned files. In the speaker’s anecdotal case study, none of the incidents of data loss were caused by HDFS proper. Other causes of data loss encountered were exhausting disk space, users hammering HDFS, and “other.” The speaker noted that NameNode high availability (a hot topic) would have only helped in 8 of the 36 incidents studied. Some ways of preventing data loss include resource allocation, selecting good tenants of a cluster, and fixing hardware errors quickly.
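To see why the replication factor matters so much, here is a deliberately naive model of my own, not something from the talk. It assumes each replica of a block is lost independently with the same probability before the cluster can re-replicate it, and the 0.001 figure is made up.

```python
def block_loss_probability(replica_failure_prob: float, replication_factor: int) -> float:
    """Probability that every replica of one block is lost.

    Assumption: replicas fail independently with identical probability,
    which ignores correlated failures such as a whole rack going down.
    """
    return replica_failure_prob ** replication_factor


# Hypothetical chance of losing any single replica before it is re-replicated.
P_REPLICA_LOST = 0.001

for factor in (1, 2, 3):
    p = block_loss_probability(P_REPLICA_LOST, factor)
    print(f"replication factor {factor}: block loss probability {p:.0e}")
```

Under these assumptions each extra replica cuts the chance of losing a block by a factor of a thousand, which matches the speaker's advice that a factor of 2 is robust but should be paired with a backup.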
If your job isn’t running, it’s not likely caused by HDFS.
Bill Graham of CBS Interactive gave an interesting talk about using Hadoop to build a graph of users and content. Surprisingly, CBSi has quite a large arsenal of MapReduce-enabled technologies: Chukwa, Pig, Hive, HBase, Cascading, Sqoop and Oozie. CBSi uses only 100 nodes with 500 TB of disk space for processing data associated with 235 million uniques (individuals, roughly). Mapping users to content should be easy, right? Well, some users have multiple identities, including anonymous identities. The goal is to create a holistic graph that “matches” all of the identities efficiently for uses such as ad targeting. CBSi’s needs in a Hadoop platform are rapid experimentation and data mining, and powering new site features and ad optimization. The main vehicle for representing data is a Pig RDF that allows for a kind of graph-based join, so to speak. CBSi hopes to add Oozie, Azkaban, HCatalog and Hama (graph processing) to its arsenal.
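The identity-matching problem can be pictured as finding connected groups of identities. Below is a small union-find sketch of that idea. It is my own illustration, not CBSi's method (theirs runs on Pig and Hadoop), and the class name and identity strings are hypothetical.

```python
from typing import Dict, List, Set


class IdentityGraph:
    """Groups identities that are known to belong to the same person (illustration only)."""

    def __init__(self) -> None:
        self._parent: Dict[str, str] = {}

    def _find(self, identity: str) -> str:
        """Return the representative identity of the group, compressing the path."""
        self._parent.setdefault(identity, identity)
        root = identity
        while self._parent[root] != root:
            root = self._parent[root]
        while self._parent[identity] != root:
            next_identity = self._parent[identity]
            self._parent[identity] = root
            identity = next_identity
        return root

    def link(self, a: str, b: str) -> None:
        """Record evidence that identities a and b are the same user."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_a] = root_b

    def same_user(self, a: str, b: str) -> bool:
        return self._find(a) == self._find(b)

    def users(self) -> List[Set[str]]:
        """Return one set of identities per distinct user."""
        groups: Dict[str, Set[str]] = {}
        for identity in list(self._parent):
            groups.setdefault(self._find(identity), set()).add(identity)
        return list(groups.values())


graph = IdentityGraph()
graph.link("cookie:abc", "login:alice")              # anonymous visit later logged in
graph.link("login:alice", "email:alice@example.com")
graph.link("cookie:xyz", "login:bob")

print(graph.same_user("cookie:abc", "email:alice@example.com"))
print(graph.same_user("cookie:abc", "cookie:xyz"))
print(len(graph.users()))
```

Once identities collapse into groups like this, content viewed under any one identity can be attributed to the whole group, which is what makes the graph useful for ad targeting.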
MapR was a very prominent sponsor of Hadoop Summit. M. C. Srivas presented a technical discussion of MapR’s capabilities and how it differs from Apache Hadoop. MapR is a full distribution of Hadoop and is 100% compatible with the Apache distribution and projects such as Pig and Oozie. MapR is fast and boasts high availability by rethinking the NameNode. The NameNode is a bottleneck since 60% of file operations are metadata operations. The NameNode and its limitations limit the size of a cluster. To resolve some problems with the NameNode, MapR turns every server into a metadata server. Since metadata is seldom retrieved, it is paged to disk so more RAM can be used for MapReduce proper. MapR distributes NameNode functionality and provides full random read and write semantics as well as export to NFS. With the distributed NameNode, runaway tasks no longer take down the NameNode. MapR has some lofty performance goals. While HDFS can handle 10-50 PB, MapR can handle 1-10 EB (exabytes). While HDFS can handle 2000 nodes in a cluster, MapR can handle 10,000 or more. It was mentioned at BigDataCamp that MapR does not rely on HDFS at all.
The final session I attended was Avery Ching’s talk on Giraph: Large-scale Graph Processing on Hadoop. Unfortunately, Avery jumped right into the technical details of Giraph without giving a high-level overview of the problem Giraph solves. Also, his slides were in 10 point font and I could not read them. Combine this with the fact that my brain was exhausted, and I was ready to head to the bar. Vanilla Hadoop incurs too much overhead for graph data processing. Yahoo used MPI in the past for graph data, but it had no fault tolerance and was too generic. Giraph is a library for iterative graph processing. Giraph is fault-tolerant and dynamic. Giraph takes a vertex-centric approach to graph data. I found this interesting because most of my work is edge-centric. Overall, Giraph is similar in goal to Pregel, but available to non-Googlers and has no single point of failure (except those incurred by Hadoop).
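Since the talk skipped the overview, here is my own small sketch of what "vertex-centric" means, in the style of Pregel's well-known maximum-value example. Each vertex only looks at the messages it received and decides what to send to its neighbors. The function name is hypothetical and this is single-machine Python, not Giraph's API.

```python
from typing import Dict, List


def propagate_max(edges: Dict[str, List[str]], values: Dict[str, int]) -> Dict[str, int]:
    """Spread the largest vertex value through the graph in synchronized supersteps.

    `edges` maps every vertex to its out-neighbors; every vertex must be a key.
    """
    values = dict(values)  # work on a copy so the caller's data is untouched

    # Superstep 0: every vertex sends its own value to its neighbors.
    inbox: Dict[str, List[int]] = {v: [] for v in edges}
    for vertex, neighbors in edges.items():
        for neighbor in neighbors:
            inbox[neighbor].append(values[vertex])

    # Later supersteps: a vertex only acts if a message raises its value.
    while any(inbox.values()):
        outbox: Dict[str, List[int]] = {v: [] for v in edges}
        for vertex, messages in inbox.items():
            if messages and max(messages) > values[vertex]:
                values[vertex] = max(messages)
                for neighbor in edges[vertex]:
                    outbox[neighbor].append(values[vertex])
        inbox = outbox  # messages are delivered at the start of the next superstep

    return values


chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
start = {"a": 3, "b": 6, "c": 2, "d": 1}
print(propagate_max(chain, start))
```

The computation stops when no vertex has anything new to say, which is why iterative graph algorithms fit this model better than a chain of separate MapReduce jobs.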
Now I have to catch my breath with some wine, beer and cheese at the nice happy hour reception afterwards. It was a long day and a great day at Hadoop Summit 2011, and I will of course be back next year. I have no clue what is in store for me next year. Will the NameNode be removed as the single point of failure? Will other open-source software start integrating Hadoop? We shall see…
And now it is time to head back to Los Angeles.
Have a happy and safe Fourth of July!