Imagine you're building a recommendation algorithm for your new online site. How do you measure its quality, to make sure it's sending users relevant and personalized content? Click-through rate (CTR) may be your initial hope, but after a bit of thought, it's not clear that it's the best metric after all.
Take Google's search engine. In many cases, improving the quality of search results will decrease CTR! For example, the ideal scenario for a query like "When was Barack Obama born?" is that users never have to click at all, since the question should be answered on the results page itself.
Or take Twitter, who might one day want to recommend you interesting tweets. Metrics like CTR, or even the number of favorites and retweets, will probably optimize for showing quick one-liners and pictures of funny cats. But is a Reddit-like site what Twitter really wants to be? For many people, Twitter started out as a source of news, so users may prefer seeing links to deeper and more interesting content, even if they're less likely to click on any individual suggestion.
Or take eBay, who wants to help you find the products you want to buy. Is CTR a good measure? Perhaps not: more clicks may be an indication that you're having trouble finding what you're looking for. What about revenue? That might not be ideal either: from the user perspective, you want to make your purchases at the lowest possible price, and by optimizing for revenue, eBay may be turning you towards more expensive products that make you a less frequent customer in the long run.
And so on.
So on many online sites, it's unclear how to measure the quality of personalization and recommendations using metrics like CTR, or revenue, or dwell time, or whatever. What's an engineer to do?
Well, consider the fact that many of these are relevance algorithms. Google wants to show you relevant search results. Twitter wants to show you relevant tweets and ads. Netflix wants to recommend you relevant movies. LinkedIn wants to find you relevant people to follow. So why do we so rarely try to measure the relevance of our models?
I'm a big fan of man-in-the-machine techniques, so to get around this problem, I'm going to talk about a human evaluation approach to measuring the performance of personalization and discovery products. In particular, I'll use the example of related book suggestions on Amazon as I walk through the rest of this post.
(Let's continue motivating a bit why log-based metrics are often imperfect measures of relevance and quality, as this is an important but difficult-to-understand point.)
So take Amazon's Customers Who Bought This Item Also Bought feature, which tries to show you related books.
To measure its effectiveness, the standard approach is to run a live experiment and measure the change in metrics like revenue or CTR.
But imagine we suddenly replace all of Amazon's book suggestions with pornography. What's going to happen? CTR's likely to shoot through the roof!
Or suppose that we replace all of Amazon's related books with shinier, more expensive items. Again, CTR and revenue are likely to increase, as the flashier content draws eyeballs. But is this anything but a short-term boost? Perhaps the change decreases total sales in the long run, as customers start to find Amazon too expensive for their tastes and move to other marketplaces.
Scenarios like these are the machine learning analogue of turning ads into blinking marquees. While they might increase clicks and views initially, they're probably not optimizing user happiness or site quality for the future. So how can we avoid them, and ensure that the quality of our suggestions remains consistently high? This is a related books algorithm, after all – so why, by sticking to a live experiment and metrics like CTR, do we never inspect the relatedness of our recommendations?
Solution: let's inject humans into the process. Computers can't measure relatedness (if they could, we'd be done), but people of course can.
For example, in the screenshot below, I asked a worker on Mechanical Turk to rate the first three Customers Who Bought This Also Bought suggestions for a book on barnesandnoble.com.
(Copying the text from the screenshot:
The recommendations are decent, but already we see a couple ways to improve them:
CTR and revenue certainly wouldn't give us this level of information, and it's not clear that they could even tell us our algorithms are producing irrelevant suggestions in the first place. Nowhere does the related scroll panel make it clear that two of the books are part of a series, so the CTR on those two books would be just as high as if they were indeed the series introductions. And if revenue is low, it's not clear whether it's because the suggestions are bad or because, separately, our pricing algorithms need improvement.
So in general, here's one way to understand the quality of a Do This Next algorithm:
Let's make this process concrete. Pretend I'm a newly minted VP of What Customers Also Buy at Amazon, and I want to understand the product's strengths and flaws.
I started by going to Mechanical Turk and asking a couple hundred workers to take a book they enjoyed in the past year, and to find it on Amazon. They'd then take the first three related book suggestions from a different author, rate them on the following scale, and explain their ratings.
(Note: I usually prefer a three-point or five-point Likert scale with a "neutral" option, but I was feeling a little wild.)
For example, here's how a rater reviewed the related books for Anne Rice's The Wolves of Midwinter.
So how good are Amazon's recommendations? Quite good, in fact: 47% of raters said they'd definitely buy the first related book, another 29% said it was good enough that they might buy it, and only 24% of raters disliked the suggestion.
The second and third book suggestions, while a bit worse, seem to perform pretty well too: around 65% of raters rated them positively.
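To give a sense of how numbers like these fall out of the raw rater output, here's a minimal sketch of the aggregation, assuming the Turk results are exported as a CSV; the file name, the "slot" and "rating" columns, and the positive labels are all hypothetical stand-ins for whatever the actual task produces. It tallies ratings per recommendation slot and computes a simple relevance score: the fraction of positive ratings.

```python
from collections import Counter, defaultdict
import csv

# Labels that count as "positive" -- hypothetical stand-ins for the actual
# options on the rating scale; anything else is treated as negative.
POSITIVE = {"would definitely buy", "might buy"}

def summarize(path):
    """Tally ratings per recommendation slot and compute a relevance score
    (the fraction of positive ratings), analogous to quoting a CTR."""
    by_slot = defaultdict(Counter)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Assumed columns: "slot" (1, 2, 3) and "rating" (a scale label).
            by_slot[int(row["slot"])][row["rating"].strip().lower()] += 1

    for slot, counts in sorted(by_slot.items()):
        total = sum(counts.values())
        positive = sum(counts[label] for label in POSITIVE)
        print(f"Slot {slot}: relevance score {positive / total:.2f}")
        for label, n in counts.most_common():
            print(f"  {label}: {n / total:.0%}")

summarize("related_books_ratings.csv")  # hypothetical export of the Turk task
```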
What can we learn from the bad ratings? I ran a follow-up task that asked workers to categorize the bad related books, and here's the breakdown.
So to improve its recommendations, Amazon could improve its topic models, add age-based features to its books, distinguish between textbooks and novels, and invest in series detectors. (Of course, for all I know, they do all this already.)
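The tally for the follow-up task works the same way; here's a similarly minimal sketch, again with hypothetical file and column names, that turns the raters' categorizations into a breakdown.

```python
from collections import Counter
import csv

# Assumed export of the follow-up task: one row per bad suggestion, with a
# free-text "category" column. The file and column names are hypothetical.
with open("bad_suggestion_categories.csv", newline="") as f:
    categories = Counter(row["category"].strip().lower()
                         for row in csv.DictReader(f))

total = sum(categories.values())
for category, count in categories.most_common():
    print(f"{category}: {count} ({count / total:.0%})")
```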
We now have a general grasp of Amazon's related book suggestions and how they could be improved, and just like we could quote a metric like a CTR of 6.2% or whatnot, we can also now quote a relevance score of 0.62 (or whatever). So let's turn to the question of how Amazon compares to other online booksellers like Barnes & Noble and Google Play.
I took the same task I used above, but this time asked raters to review the related suggestions on those two sites as well.
In short,
Why are the Play Store's suggestions so bad? Let's look at a couple examples.
Here's the Play Store page for John Green's The Fault in Our Stars, a critics-loved-it book about cancer and romance (and now also a movie).
Two of the suggestions are completely random: a poorly-rated Excel manual and a poorly-reviewed textbook on sexual health. The others are completely unrelated cowboy books, by a different John Green.
Here's the page for The Strain. In this case, all the suggestions are in a different language! And there are only four of them.
Once again, I asked raters to categorize all of the Play Store's bad recommendations...
So despite Google's state-of-the-art machine learning elsewhere, its Play Store suggestions couldn't really get much worse.
Let's step back a bit. So far I've been focusing on an absolute judgments paradigm, in which judges rate how relevant a book is to the original on an absolute scale. This model is great for understanding the overall quality of Amazon's related book algorithm.
In many cases, though, we want to use human evaluation to compare experiments. For example, it's common at many companies to:
For these kinds of tasks, what's preferable is often a side-by-side model, wherein judges are given two items and asked which one is better. After all, comparative judgments are often much easier to make than absolute ones, and we might want to detect differences at a finer level than what's available on an absolute scale.
The idea is that we assign a score to each rating (negative, say, if the rater prefers the control item; positive if the rater prefers the experiment), and aggregate these into an overall score for the side-by-side. Then, in much the same way that a drop in CTR may block an experiment launch, a negative human evaluation score should also be cause for concern.
Unfortunately, I don't have an easy way to generate data for a side-by-side (though I could perform a side-by-side on Amazon vs. Barnes & Noble), so I'll omit an example, but the idea should be pretty clear.
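Still, here's a rough sketch of how the scoring might work once you have side-by-side ratings in hand. The labels, weights, and example ratings below are all made up for illustration; they're one plausible choice, not a standard.

```python
import math

# Hypothetical side-by-side labels and weights: positive means the rater
# prefers the experiment, negative means they prefer the control.
SCORES = {
    "much better (experiment)": 2,
    "slightly better (experiment)": 1,
    "about the same": 0,
    "slightly better (control)": -1,
    "much better (control)": -2,
}

def side_by_side_score(ratings):
    """Average preference score with a rough 95% confidence interval.

    A clearly negative score means raters prefer the control, and should
    block a launch much like a drop in CTR would."""
    scores = [SCORES[r] for r in ratings]
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    stderr = math.sqrt(variance / n)
    return mean, (mean - 1.96 * stderr, mean + 1.96 * stderr)

# Made-up ratings, purely for illustration.
ratings = (["slightly better (experiment)"] * 40
           + ["about the same"] * 35
           + ["slightly better (control)"] * 25)
mean, (low, high) = side_by_side_score(ratings)
print(f"Side-by-side score: {mean:+.2f} (95% CI {low:+.2f} to {high:+.2f})")
```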
Here's another subtlety. In my examples above, I asked raters to pick a starting book themselves (one that they read and loved in the past year), and then rate whether they personally would want to read the related suggestions.
Another approach is to pick the starting books for the raters, and then have them rate the related suggestions more objectively, by trying to put themselves in the shoes of someone who'd be interested in the starting item.
Which approach is better? As you can probably guess, there's no clear answer – it depends on the task and goals at hand.
Pros of the first approach:
Pros of the second approach:
Let's review what I've discussed so far.
What are some of the benefits and applications?
In the spirit of item-item similarities, here are some other posts readers of this post might also want to read.
And finally, I'll end with a call for information. Do you run any human evaluations, or use crowdsourcing platforms like Mechanical Turk or CrowdFlower (whether for evaluation purposes or not)? Or do you want to? I'd love to talk and learn more about what you're doing, so please feel free to hit me up over email!