I am summarizing all of the days together since each talk was short and I was too exhausted to write a post after each day. Because of the broken-up schedule of the KDD sessions, I have grouped everything by topic instead of switching back and forth among a dozen different ones. By far the most enjoyable and interesting parts of the conference were the breakout sessions.
Keynotes
KDD 2011 featured several keynote speeches, spread across all three days of the conference, and this year's lineup included a few big names.
For more information about convex optimization, see the website for Convex Optimization by Boyd and Vandenberghe. The book is available for free, along with lecture slides and other materials. Even better, I had not realized that the second author is at UCLA!
(There was a keynote by a fellow named David Haussler about cancer genomics. After an exhausting first two days, I skipped this talk as I needed to sleep…and I was not incredibly interested in the topic.)
Pearl believes that humans communicate not with probability but with causality (I do not entirely agree). I appreciated that he acknowledged it takes work to overcome the difference in thinking between probability and causality. In statistics, we use data and a joint distribution P to make inferences about some quantity or variable. In causality, an intentional intervention changes the joint distribution P into another joint distribution P′. Causality requires new language and mathematics (I am not yet convinced). To use causality, one must introduce some untestable hypothesis. Pearl mentioned that these non-standard mathematical methods include counterfactuals and structural equation modeling. I do not know how I feel about any of this. For more information, check out Pearl's book Causality.
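To make the distinction concrete, here is a toy simulation of my own (not from the talk) using Pearl's do-notation, with made-up numbers: when a hidden confounder Z drives both X and Y, the observational quantity P(Y | X) and the interventional quantity P(Y | do(X)) disagree, and no amount of (X, Y) data alone can settle which causal story is correct.

```python
# Toy illustration (my own, not Pearl's): observing X vs. intervening on X.
import random

random.seed(0)
N = 100_000

def simulate(force_x=None):
    """Draw (x, y) samples; if force_x is given, intervene and set X = force_x."""
    data = []
    for _ in range(N):
        z = random.random() < 0.5                      # hidden confounder Z
        if force_x is None:
            x = random.random() < (0.8 if z else 0.2)  # Z drives X
        else:
            x = force_x                                # do(X): Z's arrow into X is severed
        y = random.random() < (0.7 if z else 0.3)      # Z drives Y; X has no effect on Y
        data.append((x, y))
    return data

obs = simulate()
p_y_given_x1 = sum(y for x, y in obs if x) / sum(1 for x, y in obs if x)
p_y_do_x1 = sum(y for _, y in simulate(force_x=True)) / N

print(f"P(Y=1 | X=1)     ~ {p_y_given_x1:.2f}")  # ~0.62: X carries information about Z
print(f"P(Y=1 | do(X=1)) ~ {p_y_do_x1:.2f}")     # ~0.50: forcing X changes nothing
```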
Data Mining Competitions
One interesting event during KDD 2011 was the panel Lessons Learned from Contests in Data Mining. The panel featured Jeremy Howard (Kaggle), Yehuda Koren (Yahoo!), Tie-Yan Liu (Microsoft), and Claudia Perlich (Media6Degrees). Both Kaggle and Yahoo run data mining competitions: Kaggle hosts its own series of competitions, and Yahoo is a major sponsor of the KDD Cup. Perlich has participated in and won many data mining competitions, and Liu provided a different perspective as an industry observer.
Jeremy Howard gave some insight into the history of data mining competitions. He credited KDD 97 with holding the first one. He announced to the crowd that companies spend 100 billion dollars every year on data mining products and services (not including in-house costs such as employment) and that there are approximately 2 million Data Scientists. The estimate of the number of Data Scientists was based on the number of times R has been downloaded, per a blog post by David Smith (Revolution Computing). I love R, and every Data Scientist should use it, but there are several problems with this estimate. First, not everyone who uses R is a Data Scientist; a large portion of R users are statisticians (“beautiful theory”), teachers, students, etc. Second, not all Data Scientists use R. Some are even more creative and write their own tools or use little-adopted software packages, and a lot of Data Scientists use Python instead of R. Howard also announced that over the next year, Kaggle will be starting thousands of “invitation only” competitions. Personally, I do not care for this type of exclusion, even though their intentions are good.
Yehuda Koren introduced the crowd to Yahoo's involvement in data mining competitions. Yahoo is a major force behind the KDD Cup and the Heritage Health Prize, and Yahoo also won a progress award in the Netflix challenge. Koren then described how data mining competitions help the community: they raise awareness and attract research to a field, usually involve the release of a cool dataset to the community, encourage contribution and education, and provide publicity for participants and winners. Contestants are attracted to competitions for various reasons, including fun, competitiveness, fame, the desire to learn, peer pressure, and of course the monetary reward. As with every competition, data mining competitions have rules, and Koren stated that rules are very difficult to enforce. Data mining is vague as it is, so I believe competition rules end up just as vague. The challenge is to write rules that maximize fairness and innovation without discouraging participation. Such rules cover things like discouraging huge ensembles (which probably overfit anyway), limiting submission frequency, preventing duplicate teams, and capping team size (the winning KDD Cup team had 25 members). Some obvious keys to success in data mining competitions are ensembles, hard work, team size, innovation over fancy models, quick coding, and patience.
I felt that Tie-Yan Liu from Microsoft served as the Simon Cowell of the panel, and his role was necessary. He provided industry insight that was a bit of a reality check on what data mining competitions do and do not accomplish. Liu questioned whether the problems being solved in data mining competitions are really important problems. Part of the problem is that many datasets are censored to protect privacy; additionally, the really interesting problems cannot be opened to the public because they involve trade secrets. I consider myself an inclusive guy – I do not like the concept of winners and losers – so I was elated that Liu brought up the point: “what about the losers?” Is it bad publicity to “lose” several (or all) competitions? The answer varies person to person. I honestly believe that the goal of these competitions is of the open-source nature (fun, share, learn, solve) and not so much to cure cancer. They are great for college students and others who are interested in data science but do not have access to great data. For the rest of us, learning on our own using interesting data is probably better.
Claudia Perlich (Media6Degrees) discussed her experience participating in data mining competitions; she has won several contests. She commented on the distinction between sterile/cleaned data and real data, since competitions can include either type. Occam's Razor applies to data mining competitions: Perlich won most of her competitions with linear models, distinguished by more complex and creative features. She emphasized that complex features beat complex models.
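To sketch what "complex features, simple model" can look like in practice (this is my own illustration, not Perlich's actual pipeline; the column names, derived features, and toy target are all made up), assuming scikit-learn, pandas, and NumPy:

```python
# "Complex features, simple model": hand-crafted features feeding a plain
# logistic regression. Every column and the label below are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def engineer(df: pd.DataFrame) -> pd.DataFrame:
    """Derive richer features (ratios, interactions, transforms) from raw columns."""
    out = pd.DataFrame(index=df.index)
    out["spend_per_visit"] = df["total_spend"] / df["visits"].clip(lower=1)
    out["recency_x_frequency"] = df["days_since_last"] * df["visits"]
    out["log_spend"] = np.log1p(df["total_spend"])
    return out

rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "total_spend": rng.exponential(50, size=1000),
    "visits": rng.poisson(5, size=1000),
    "days_since_last": rng.integers(1, 60, size=1000),
})
y = (raw["total_spend"] / raw["visits"].clip(lower=1) > 12).astype(int)  # toy label

scores = cross_val_score(LogisticRegression(max_iter=1000), engineer(raw), y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```

The model stays a vanilla logistic regression; all of the cleverness lives in `engineer()`.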
Considering the Netflix Prize has been one of the biggest data mining competitions, I was disappointed that Netflix was not represented on the panel, especially since several of its researchers were at the conference.
Rather than write a few sentences for each topic, I will just bullet the goals of the research discussed in the sessions. Descriptions with a star (*) denote my favorite papers and are cited later.
Text Mining
I attended two of the three text mining sessions. I must say that I am quite topic-modeled and LDAed out! Latent Dirichlet Allocation (LDA) and several variations were part of every talk I heard. That was very exciting and reaffirms that I am in a hot field. Still, nobody has taken my dissertation topic yet (which I have remained quiet about).
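For anyone who has not yet run into LDA, here is a minimal topic-model sketch using scikit-learn's LatentDirichletAllocation on a made-up four-document corpus (my own example, not from any of the papers presented):

```python
# Minimal LDA demo: two topics over a tiny toy corpus.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the pitcher threw the baseball past the batter",
    "the batter hit the baseball into the outfield",
    "the senate passed the budget bill last night",
    "the president vetoed the spending bill",
]

vec = CountVectorizer(stop_words="english").fit(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(vec.transform(docs))

# Print the top words per topic; with luck they split into "sports" and "politics".
terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:4]]
    print(f"topic {k}: {top}")
```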
Social Network Analysis and Graph Analysis
The Social Networks session conflicted with one of the Text Mining sessions, but since I knew there would be two more, I decided to attend this one instead. I also combined the two Graph Analysis sessions into this section since they are so related. The goals of the research presented in these talks were as follows:
User Modeling
I suspect this session was similar to the Web User Modeling session; it focused on recommendation engines and rating prediction.
Frequent Sets
I did some work with itemset mining at my last job, and I was not particularly interested in the Online Data and Streams session at the time, so I attended this one.
Classification
I got stuck in this session because the session I really wanted to attend, Web User Modeling, was full with nowhere to sit or stand. This session was more technical and theoretical. The only talk that I really enjoyed was about a classifier called CHIRP, based on Composite Hypercubes on Iterated Random Projections. I did not follow the details, but it is a paper I am interested in reading: the authors classify spaces that have complex topology (think of classifying items that appear in a bullseye/dartboard pattern).*
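I could not resist trying the bullseye idea on toy data. This is not CHIRP itself (I have not implemented it); it is just scikit-learn's make_circles showing why that topology breaks a linear boundary while a local method handles it fine:

```python
# Bullseye-style data: a linear classifier fails, a local one succeeds.
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

for name, clf in [("linear", LogisticRegression()),
                  ("5-NN", KNeighborsClassifier(5))]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: {acc:.2f}")  # linear hovers near chance; 5-NN is near perfect
```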
Unsupervised Learning
This session was similar to the Classification session but, in my opinion, more practical.
Favorite Papers
Below is a short bibliography of the papers that were my favorites, including a few (the first four) from the poster session.
Wrapping Up
I had an awesome time at KDD and wish I could go next year, but it will be held in Beijing. I got to meet a lot of different people in the field who have the same passion for data, and that was really cool. I also got to meet with recruiters from a few different companies and pick up some swag from Yahoo and Google.
It was awesome being around such greatness. I ran into Peter Norvig several times, ran into Judea Pearl in the restroom (I already know him), as well as Christos Faloutsos (I am a huge fan) and Ross Quinlan. I stopped at the Springer booth and found a cool book about link prediction with Faloutsos as one of the authors. I went to buy it, handed the lady my credit card, and learned that it was $206 (AFTER conference discount)! Interestingly… Amazon has the same book for $165. I will probably order it anyway.
Here’s hoping that KDD returns to California (or the US) real soon!
Candid Shots
Ross Quinlan enjoying a beer during the poster session. What a cool guy!
Christos Faloutsos talking with a student during the poster session.