During the past few decades that I have been in graduate school (no, not literally), I have boycotted JSM on the notion that “I am not a statistician.” OK, I am a renegade statistician, a statistician by training. JSM 2012 was held in San Diego, CA, one of the best places to spend a week during the summer. This time, I had no excuse not to go, and I figured that in order to get my Ph.D. in Statistics, I have to have been to at least one JSM. The conference itself was five days, but I did not think I could hang with statisticians for five days, and more importantly, taking three days off to attend the conference was more reasonable since I work in industry.
I arrived at the conference on the third day, July 31. Unfortunately, upon arriving at the Manchester Grand Hyatt, I was informed that they had screwed up my reservation and that they had overbooked for the night. No wonder I never received a confirmation email despite calling them three times. Grmph. But my faith in humanity was restored by their compensation package: one night FREE next door at the Marriott, which was closer to the conference, FREE parking for the entire stay, FREE Internet for the entire stay, and FREE breakfast at their buffet for the entire stay. I also got a free upgrade to a room with a view of the bay and the USS Midway.
My First Day, July 31
After I had been lost in the convention center for quite a while, one of the first people I saw was a friend I had met at KDD-2011, who encouraged me to attend the Facebook talk, strangely titled Stat-Us. All of the speakers in this session were Data Scientists at Facebook, including Jonathan Chang, the author of the LDA package in R, called lda. Most of their talks discussed cleaning the social graph using machine learning. Facebook actively removes fan pages and similar objects that are deemed to be duplicates, using decision trees and other algorithms with features such as the age of the creator, grammar and number of fans. Jonathan Chang discussed disambiguating and clustering places for the Places product. For example, users have created several places to refer to Disneyland: “disneyland”, “Disney Land”, “Happiest Place on Earth” etc. Facebook uses some NLP techniques, but even more effective are techniques that compare the distribution and demographics of check-ins at each of the places. Of course, seasonality in check-ins is another aspect used in their model. But what about ubiquitous venues such as McDonald’s or Starbucks? Studying the radial density of these establishments makes it easier to disambiguate which location a user is discussing. This also makes it easier to correct places tagged at the wrong location. One other interesting point is sterile computing. Much of the information Facebook data scientists work with is highly personal. During some sensitive analyses, data scientists use sterile machines that are not connected to the Internet. Any analysis, diagnostics and graphics to be run during the process must be constructed beforehand. Also, data scientists have no access to any of the raw data in this system, only the results of their analysis. All in all, I was excited, and I was surprised at the level of detail they provided.
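To make the idea of comparing check-in distributions a bit more concrete, here is a minimal sketch. It is not Facebook's method, which the talk did not spell out at this level. It assumes each candidate place has check-in counts per age bucket and uses the Jensen-Shannon divergence, my choice for illustration, to flag pairs of names whose audiences look alike. The place names, counts, bucket labels and the 0.05 threshold are all made up.

```python
import math
from collections import Counter


def normalize(counts, keys):
    """Turn raw counts into a probability distribution over a fixed key order."""
    total = sum(counts.get(k, 0) for k in keys)
    if total == 0:
        raise ValueError("no check-ins to normalize")
    return [counts.get(k, 0) / total for k in keys]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: 0 for identical distributions, at most 1 bit."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def likely_duplicates(checkins, keys, threshold=0.05):
    """Return pairs of place names whose check-in distributions nearly match."""
    names = sorted(checkins)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = js_divergence(normalize(checkins[a], keys),
                              normalize(checkins[b], keys))
            if d < threshold:
                pairs.append((a, b))
    return pairs


# Hypothetical check-in counts by age bucket (invented for illustration).
AGE_BUCKETS = ["13-17", "18-24", "25-34", "35-49", "50+"]
checkins = {
    "disneyland": Counter({"13-17": 120, "18-24": 300, "25-34": 340,
                           "35-49": 200, "50+": 40}),
    "Disney Land": Counter({"13-17": 14, "18-24": 29, "25-34": 33,
                            "35-49": 21, "50+": 3}),
    "Joe's Tavern": Counter({"18-24": 10, "25-34": 45, "35-49": 60,
                             "50+": 85}),
}

print(likely_duplicates(checkins, AGE_BUCKETS))
```

In this toy data only the two Disneyland spellings fall under the threshold. A real system would presumably combine a signal like this with text similarity, geography and seasonality rather than rely on one demographic histogram.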
That afternoon I attended one of the many sessions titled Clustering. I had a difficult time choosing among L1 Regression, Prediction in Social Networks and Clustering, but since my dissertation topic involves Latent Dirichlet Allocation, a form of clustering text, I felt this was the most interesting. Most of the talks were very good. The most interesting talks in the session were about analyzing massive structured data using sparse generalized PCA, estimating similarity metrics to evaluate clustering algorithms, clustering autoregressive time series and using clustering to detect network intrusions.
After the Clustering talk, I got to meet John Ramey (Baylor, @ramhiser), one of the speakers and also a Twitter friend. As we were talking, Yihui Xie (Iowa State, @xieyihui) joined us. It wasn’t until we were all in conversation that I realized Yihui is the author of the now popular knitr package! Later, John and I grabbed a drink nearby and talked about R, Python, computing and academia. It was a great conversation because we have similar interests in both statistics and computer science, and a similar way of using tools to solve problems. This was great to hear because lately I have been working as a data scientist in more of a software engineering capacity than a typical data scientist capacity. Since this field is still new, and since I am still new to industry, I am not sure what is “typical” yet. I also got some great advice about publications and getting back into academia. (My “life plan” is to work in industry in a research capacity and return to academia later in life.) Later in the evening I met up with some colleagues who have since graduated. I also got to meet an intern data scientist from Redfin.
My Second Day, August 1
The next day I had my usual stressful dilemma of trying to pick which sessions to attend. I was faced with the choice of non-parametrics, high dimensional learning, cheating on tests, and networks. My original interest in the field of Statistics was a fascination with Psychometrics, so I attended “Statistics in Uncovering Administrative Cheating on Tests” first. I only attended the first talk, which was by CTB/McGraw-Hill. They were on my short list of companies I wanted to work for when I started graduate school, and some of their products include familiar names like the California Achievement Tests (CAT/6) and the Comprehensive Tests of Basic Skills (CTBS), both of which are deprecated and replaced by TerraNova. The speaker addressed three issues: copied answers, fraudulent erasures and stolen test items. Detection of all of these depended on the three-parameter logistic model common in item response theory (IRT). Detecting copying simply involved doing pairwise comparisons of multinomial distributions (the response to each item forms a multinomial random variable). Fraudulent erasures were more interesting, and occur when a teacher changes a student’s incorrect answer to a correct one. Empirically, students usually change an answer from an incorrect one to a correct one, but sometimes the opposite occurs. McGraw-Hill detects fraudulent erasures by looking for an unusually high number of answers changed from incorrect to correct relative to the student’s ability. Modern optical mark readers (“Scantrons”) report not only the selected answer, but also a parameter that shows other answers that were chosen and erased, and how dark the mark was. Stolen test items can be detected using the amount of time it took a student to mark an answer with respect to their ability. Stolen items will show a statistically significant residual: the student answered the item much faster than their ability would predict.
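For readers who have not seen it, the three-parameter logistic (3PL) model gives the probability of a correct answer as c + (1 - c) / (1 + exp(-a(θ - b))), where θ is the examinee's ability and a, b and c are the item's discrimination, difficulty and guessing parameters. The sketch below implements that formula and then, as my own simplification rather than the speaker's procedure, computes a standardized residual comparing a student's final number correct (after erasures) to what the model expects. The item parameters and ability value are hypothetical.

```python
import math


def p_correct(theta, a, b, c):
    """3PL probability that an examinee of ability theta answers an item correctly.

    a = discrimination, b = difficulty, c = pseudo-guessing lower asymptote.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def score_z(theta, items, observed_correct):
    """Standardized residual of the observed number correct vs. the 3PL expectation.

    Treats item responses as independent Bernoulli trials given theta.
    """
    probs = [p_correct(theta, a, b, c) for a, b, c in items]
    expected = sum(probs)
    variance = sum(p * (1.0 - p) for p in probs)
    return (observed_correct - expected) / math.sqrt(variance)


# Hypothetical nine-item test: (a, b, c) with difficulties from -2 to 2.
ITEMS = [(1.0, -2.0 + 0.5 * k, 0.2) for k in range(9)]

# A low-ability student (theta = -2) whose sheet ends up perfect after
# erasures produces a large positive residual; a score of 3 does not.
print(score_z(-2.0, ITEMS, observed_correct=9))
print(score_z(-2.0, ITEMS, observed_correct=3))
```

A large positive residual alone does not prove tampering. It only marks an answer sheet as worth a closer look, for example at the wrong-to-right erasure counts reported by the mark reader.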
The speaker showed an illuminating example from the GMAT. After this first talk, I ran over to catch the high dimensional learning talk. The most interesting talk was from Peter Hall, where he considered using pairs and groups of interacting variables for machine learning prediction. Although this was interesting, this is very common in machine learning already. I then ran back to the networks talk. At that point, I was exhausted from running back and forth and could only try to listen.
Later in the morning I attended the Internet-Scale Statistical Computing talk. For some reason, the organizers chose a room that was far too small for the interest. I cannot say I am surprised. One of the talks discussed how Google scales R. Although the talk was great, we all know that Google is probably never going to open-source what they are working on, and that was clear from their responses to some of the audience questions. Google has created a package for R called flume which they say is similar to open-source projects Cascading and Clank. They also use an abstraction they call distributed data objects which is probably some way of storing and replicating data across systems using GFS. The final talk was from Saptarshi Guha from Mozilla. Saptarshi developed the package RHIPE which is the original interface to Hadoop from R and is still in active development. Mozilla uses RHIPE to analyze Firefox crash logs and Saptarshi showed an example of using quantile regression in a RHIPE context. After lunch I attended a talk on the LASSO, but I was far too exhausted to hear about theoretical methods and applications to biological data. All I could think of is that song by Phoenix…
If there is one way to bore me, it is to make me listen to a talk about biological and medical applications of statistics.
–Ryan
My Final Day, August 2
I attended the Visualizing Complex Models talk first thing in the morning. These talks basically discussed methods for visualizing specific model types related to the speaker’s research rather than a broad overarching discussion of complex models. Models that were discussed include the hierarchical linear model (HLM), Likert scales, generalized linear models (GLM), logic forests and maps. I learned about a technique called logic regression, which attempts to predict a conjunction or disjunction of terms using predictors which are themselves conjunctions or disjunctions of logical atoms. I would not be surprised if this has already been done in the artificial intelligence community, but it was very cool to see it discussed from the statistics angle. The most entertaining talk came from Samuel Buttrey from the Naval Postgraduate School. He discussed a system called DaViTo, which is basically a Java dashboard graphically displaying statistics about incidents occurring in Afghanistan, including where they occurred (maps) and how they vary throughout the day and throughout the year. The beauty was that the backend which computed the statistics and drew the graphs was R. Buttrey had a knack for keeping the audience on its toes like a comedian. At one point he even rapped, which is something I have never seen at a conference. For those inquiring minds, it went something like…
Then ya spot a fine woman sittin in your row
She’s dressed in yellow, she says “Hello,
come sit next to me you fine fellow.”
–Bust a Move, Young MC
Um, yeah.
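Back to logic regression for a moment: the core idea of predicting an outcome with Boolean combinations of binary predictors can be shown with a toy search. This is only a sketch under heavy assumptions. It exhaustively scores every two-literal AND/OR rule, whereas actual logic regression searches a much larger space of logic trees and embeds them in a regression model. The variable names and data are invented.

```python
from itertools import combinations, product


def literal(name, negated):
    """Render a possibly negated predictor as text."""
    return ("not " if negated else "") + name


def best_logic_rule(rows, outcome, predictors):
    """Score every 2-literal AND/OR rule; return (errors, rule_text) for the best.

    Ties are broken in favor of the first rule encountered.
    """
    best = None
    for x1, x2 in combinations(predictors, 2):
        for n1, n2, op in product([False, True], [False, True], ["and", "or"]):
            errors = 0
            for row in rows:
                v1 = bool(row[x1]) != n1  # XOR applies the negation
                v2 = bool(row[x2]) != n2
                pred = (v1 and v2) if op == "and" else (v1 or v2)
                errors += pred != bool(row[outcome])
            rule = f"{literal(x1, n1)} {op} {literal(x2, n2)}"
            if best is None or errors < best[0]:
                best = (errors, rule)
    return best


# Toy data: all 8 combinations of A, B, C, with the outcome Y = A and not C.
rows = [{"A": a, "B": b, "C": c, "Y": a and not c}
        for a, b, c in product([False, True], repeat=3)]

print(best_logic_rule(rows, "Y", ["A", "B", "C"]))  # (0, 'A and not C')
```

With noisy data the winning rule would have a nonzero error count, and brute force quickly becomes impractical as predictors and rule sizes grow, which is why a smarter search is needed in practice.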
Free Time
JSM was rare in that I always felt like I was in a rush to get from one talk to another, and I felt like I had very little free time. I spent most of Thursday afternoon and Friday touring San Diego. I love the beach, so I spent most of my time walking through Seaport Village and the neighboring parks. I also visited many of the tiki-looking bars in the area. I also spent a few hours at the USS Midway, which was a very interesting but strange experience. Apparently I am claustrophobic, because not being able to see the sun while I was on the lower decks and having to crouch on the stairs was kind of scary. I also got lost in the ship’s maze of hallways, and they had these dolls and animatronic figures that freaked me out. The flight deck was amazing though, and I got to talk to a few of the WWII-era docents. It also provided a great view of San Diego Bay, Coronado Island and the departures and arrivals at San Diego International. It was cool getting to see such a big part of U.S. history.
Conclusion and Differences from Other Conferences
I had a great time at JSM and it was great to see what I have and have not been missing. Still, I prefer Computer Science conferences. I love mathematics, and I love getting into the hairy details… but only of my own research. I found that most JSM talks got too lost in the math, and oftentimes the speaker did not really convey the grand point.
Logistically, JSM also has stark differences from other conferences I have attended:
Next year JSM will be held in Montreal, Quebec, Canada. If I submit something and it is accepted, I may attend. Otherwise, I am happy I got to go to a JSM after all these years.
Finally, I’d like to thank my employer, GumGum, for allowing me to attend JSM 2012 during the work week.