(Update: we recently open-sourced Clockwork Raven, one of the human evaluation tools on top of Mechanical Turk that we built at Twitter.)
Big data’s all the rage, but sometimes a couple thousand human-generated labels can be pretty effective as well. And since I’ve been using Amazon’s Mechanical Turk system a lot recently, I figured I’d share some of the things I’ve learned.
What is MTurk?
Mechanical Turk is a crowdsourcing system developed by Amazon that connects you to a relatively cheap source of human labor on the fly.
For example, suppose you have 10,000 websites that you want to classify as spam or not. To get these classifications, you (the Requester):
- Create a CSV file containing the links and any other information.
- Log onto MTurk and create a HIT (Human Intelligence Task) describing the job (possibly by using Amazon’s WYSIWYG editor or writing your own HTML, which can refer to columns in your CSV). [There’s also an MTurk API, if you don’t want to use the terrible UI; see the sketch after this list.]
- Within hours of starting the task, your judgments will be completed by Turkers around the world for pennies each.
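If you’d rather skip the UI entirely, here’s roughly what creating a HIT looks like through the API. This is a minimal sketch using Python’s boto3 client (the API has gone through several iterations since this post, so treat the parameter names as assumptions and check the current docs); the question form itself is a placeholder.

```python
import boto3

# Point at the sandbox endpoint while testing; drop endpoint_url for production.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

def spam_question(url):
    # Unlike the CSV-upload UI, the API won't template ${url} for you,
    # so substitute each row's data into the HTML yourself.
    # Opening the site in a new tab (target="_blank") saves workers a step.
    return f"""
<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <form>
      <p>Is <a href="{url}" target="_blank">this site</a> spam?</p>
      <label><input type="radio" name="spam" value="yes"/> Yes</label>
      <label><input type="radio" name="spam" value="no"/> No</label>
      <!-- A real form must also POST the assignmentId back to
           https://www.mturk.com/mturk/externalSubmit -->
    </form>
  ]]></HTMLContent>
  <FrameHeight>450</FrameHeight>
</HTMLQuestion>"""

hit = mturk.create_hit(
    Title="Classify a website as spam or not spam",
    Description="Look at a website and tell us whether it's spam.",
    Keywords="classification, spam, websites",
    Reward="0.05",                    # dollars per judgment, as a string
    MaxAssignments=3,                 # distinct workers per item
    AssignmentDurationInSeconds=600,
    LifetimeInSeconds=86400,
    Question=spam_question("http://example.com"),
)
print(hit["HIT"]["HITId"])
```

In practice you’d loop this over every row of your CSV, creating one HIT (or one item in a batch) per website.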
More Example Tasks
So what can you use MTurk for? Here are a couple of my favorite uses:
- The Sheep Market: asking Turkers to draw sheep
- Blurry Text Transcription (Seriously! How is this possible?!)
And here are some more practical tasks, from HITs running right now:
- Categorize the sentiment of a tweet towards Panera Bread
- Copy text from a business card
- Judge entity relatedness
Increasing the quality of your judgments
So what will the quality of your judgments look like?
If you don’t do anything special, then your output will contain a lot of garbage. I’ve thrown out entire tasks because of scammers who spend less than 5 seconds on each judgment (Amazon records the time each worker spends) and submit random clicks as output (e.g., labeling Nike as a food category).
Luckily, Amazon provides a few worker filters:
- You can require that only Turkers with, say, at least a 99% approval rate across at least 10,000 prior judgments are allowed to work on your task. (If you see bad judgments from a worker, you can reject them and get your money back.)
- About a year ago, Amazon launched its “categorization masters” and “photo masters” programs, which let you restrict your HITs to workers holding a masters badge. According to a chat with a member of the MTurk team, Amazon assigns these badges by anonymously posting special tasks for which it already knows the answers, and measuring the quality of each worker’s responses.
- You can also create a custom filter and handpick who gets to work for you, or set up a qualification test that workers must take before working on your tasks. (The sketch below shows what the first two filters look like when attached through the API.)
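Here’s a sketch of the first two filters expressed as API-level qualification requirements. The two QualificationTypeIds below are Amazon’s built-in system qualifications for approval rate and approved-HIT count; I believe they’re stable, but verify them (and the Masters ID, which differs by environment) against the current MTurk docs.

```python
# Hypothetical filter: >= 99% approval rate over >= 10,000 approved HITs.
qualification_requirements = [
    {
        "QualificationTypeId": "000000000000000000L0",  # Worker_PercentAssignmentsApproved
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [99],
    },
    {
        "QualificationTypeId": "00000000000000000040",  # Worker_NumberHITsApproved
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [10000],
    },
    # To require a masters badge instead, add a requirement whose
    # QualificationTypeId is the Masters qualification for your environment
    # (look it up in the MTurk docs), with "Comparator": "Exists".
]

# Pass these when creating the HIT:
# mturk.create_hit(..., QualificationRequirements=qualification_requirements)
```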
I’ve used different combinations of the first two filters, and gotten excellent results – compared to in-house judges I’ve worked with in person and paid \$20-30 an hour, the judgments on Mechanical Turk have been just as good and sometimes even better. (I often ask my judges to explain their judgments, which makes it easy to spot high-quality workers.) For example, here are some typical responses I’ve received when asking judges to determine which of two products a given Twitter user might be more interested in:
The user is a female obsessed with Twilight Movies and Rob Pattinson. She tweets and follows both subjects. Movie tickets would be interesting to her.
He doesn’t seem to play video games, and he doesn’t seem technical enough to care about running Windows on a Mac. Neither of these products are a good fit for him.
In fact, I’ll frequently also get emails from Turkers giving me suggestions on how to improve my tasks or asking how they can do them better. (Amazon allows workers to email you. The only way for the requester to initiate a conversation, though, is by paying the worker a small bonus for excellent work, and including a message with the bonus.) Here are excerpts from some emails I’ve received:
I just wanted to check in to be sure that once I figured things out that I was doing your hits the way you intended them to be done. I want to be sure that you are getting the data that you need from the work. Please do not hesitate to let me know if there is anything that I can do to improve the way I am working your HITs. This is my full time job while I stay at home with my kids, so I like to check with the requesters to be sure that I am putting out the work that they are looking for. Any suggestion is welcome.
Frankly, lingerie, makeup, and feminine hygiene are the only male-exclusionary topics I can think of, and it feels knee-jerk sexist to mark any sports-related site for men. That said, should I hew more closely to gender stereotypes or be politically correct? (from a HIT where I was gathering gender classification data)
I do think a few more categories are needed but keeping the number down overall is good - 50 or 60 to choose from can be overwhelming and not worth the time. I may have mentioned I never used the Photography one (and I did a lot of those) so that is a good candidate for elimination.
That said, despite the approval-rate filters and masters badges, I do occasionally get a couple of scammers in the mix (or just judges whose work isn’t quite as good). So one suggestion is to run an initial task with these filters applied, find the workers who produce the best quality, and from then on use a custom pool containing those Turkers alone; one way to wire this up is sketched below.
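Concretely, a custom pool is just a private qualification that you grant by hand. Here’s a minimal sketch with boto3, assuming a hypothetical list of worker IDs collected from the trial run:

```python
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# Hypothetical worker IDs who did well on the trial run.
trusted_workers = ["A1EXAMPLEWORKER", "A2EXAMPLEWORKER"]

# Create a private qualification and grant it to each trusted worker.
qual = mturk.create_qualification_type(
    Name="Trusted judges",
    Description="Workers who produced high-quality judgments on earlier tasks.",
    QualificationTypeStatus="Active",
)
qual_id = qual["QualificationType"]["QualificationTypeId"]

for worker_id in trusted_workers:
    mturk.associate_qualification_with_worker(
        QualificationTypeId=qual_id,
        WorkerId=worker_id,
        IntegerValue=1,
        SendNotification=False,
    )

# Future HITs can then require the qualification:
# QualificationRequirements=[{"QualificationTypeId": qual_id, "Comparator": "Exists"}]
```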
How much to pay
So how much should you pay your workers?
New Turkers and Turkers who don’t meet the strict filters can be paid less, but most of my high-quality workers expect to make about \$8-14 an hour. (You can only specify how much you pay per judgment, but Amazon will tell you how long each item ends up taking on average.) For example, here’s what several Turkers said when I asked them directly how much they make:
Most of the work I do is either writing or editing. When editing work is available, I make \$15-20 per hour. I’m a slower writer than an editor, so I average \$10-12 per hour with writing. I also judge sentiments of messages and average about \$8 per hour with that type of work. I would like to average a minimum of no less than \$8 per hour.
A big factor in deciding to do a task or not comes from the time investment involved. The two big time sinks are either googling/searching/having to go to another site, or having to write something as part of your reply. If I remember correctly, a) your tasks did require looking at another page but either the link was right there OR, better yet, you had that page embedded in the HIT itself so clicking out of the window wasn’t necessary (turkers get very excited about this), and b) the quality of the pay rate was such that it easily outweighed the time it took to leave an explanatory comment.
For me at least, those things can’t be underestimated. Sure, your tasks may be a little time-consuming, but I figure a good task is one I can make 10 to 12 cents a minute on. Your task might take longer but I’m definitely still coming out ahead.
From my own experience, I work hardest and best for a requester that pays well and doesn’t reject (or at least seems to have a reason for a rejection when it happens). If a requester is going to accept the majority of my work, I as a worker feel that obligates me to provide them with the best quality possible. Similarly, although I’m conscientious with all tasks, I’m especially so with a high-paying one: it would be easy to take advantage of a high-pay, low-reject requester - which would ultimately lead them to either lower the pay or change the acceptance criteria. I don’t want that!! That’s the kind of requester I want around. I’m grateful for high pay and fair policies and that kind of requester gets an above-and-beyond effort from me.
For the pay, I have worked on master’s hits that have ranged from \$6-\$16 per hour. Averaging them out works out to around \$9, which isn’t a bad wage. I have two requesters that I work for that don’t use the master qualification but instead have closed qualifications that they’ve assigned to their best workers. Those tasks pay between \$12 and \$15 per hour, so no matter what I’m working on I will stop what I’m doing to work on them. The best paying hits are always done very quickly, so most of the time if you check out mturk and look at the tasks available you won’t get a very good idea of average pay because the terrible paying hits will sit on the board until they expire.
Obviously, this is self-reported, so there’s a strong possibility that the Turkers are artificially inflating their numbers. But this does match what I’ve been told by a manager on the MTurk team, as well as what Turkers self-report on TurkerNation.
A good suggestion regarding pay is to start at the lower end of the scale, around \$6-8 per hour, and increase that until you get both the quality and speed you want.
Other design tips
Interestingly, according to what Turkers (see the excerpts above) and my Amazon contact say, as well as other research I’ve seen (e.g., this paper), pay is not at the absolute forefront of Turkers’ minds when they decide what to work on. Instead, they focus more on requesters they’ve already established a good relationship with, HITs with many items (so they can quickly settle into a rhythm), HITs they know they’ll be paid for (so they’re not worried about rejections), and HITs that they generally enjoy doing more.
So here are a few suggestions:
- If your task is hard and there’s no clearly correct answer, even good Turkers might worry that you’ll reject their judgments (and so they might skip over your HIT). So make it clear in your instructions that you won’t reject any judgments, or at least that you won’t reject any judgment made with an honest effort.
- Make your instructions collapsible, or link to them on a separate page. Scrolling is kind of annoying on Mechanical Turk (I know – I’ve tried working on HITs myself), so you should minimize the amount workers have to scroll. Ideally, everything fits on a single screen. Plus, the less workers have to scroll, the faster your HITs will get done. For example, here are excerpts from emails I received from two different Turkers when I first started out:
I have a suggestion that would really make things go a little quicker. Is there anyway you could script the twitter link to automatically open in a new tab? It amazes me how much it can slow you down to have to right click and open it manually in another tab, and when you forget, you have to take a few more steps to get back to where you were.
It would be amazing if the Twitter account could be on the same page instead of having to click to get to another screen - the work would go *exponentially* faster! Overall, I’m enjoying them - and I’m not the only one. Despite your stringent requirements these are disappearing pretty quickly.
- Introduce yourself on TurkerNation, a forum where Turkers and Requesters go to talk about Mechanical Turk. This helps establish your reputation as a good requester who listens to feedback, which will make good Turkers want to work for you. (More on this below.)
- Approve judgments quickly: Turkers want money now instead of money later. For example, one worker told me:
Quick approval is important, too. Watching that money pile up is a serious motivator; I’ll sometimes choose a lower-paying task that approves in close to real time over a higher-paying one that won’t pay out for several days.
When using my trusted set of workers, I let Amazon auto-approve all judgments within a couple hours.
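Both approaches are a one-liner through the API. A sketch (the two-hour delay mirrors what I use with my trusted pool; the HIT ID is a placeholder):

```python
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")
hit_id = "3EXAMPLEHITID"  # hypothetical: a HIT you created earlier

# Option 1: have Amazon auto-approve a couple hours after submission.
# Set at creation time: mturk.create_hit(..., AutoApprovalDelayInSeconds=2 * 60 * 60)

# Option 2: approve submitted work yourself, right away.
submitted = mturk.list_assignments_for_hit(
    HITId=hit_id, AssignmentStatuses=["Submitted"]
)
for assignment in submitted["Assignments"]:
    mturk.approve_assignment(
        AssignmentId=assignment["AssignmentId"],
        RequesterFeedback="Thanks for the careful work!",
    )
```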
Reputation
Reputation is pretty important. Turkers love requesters who take the time to respond to emails and incorporate suggestions. Excerpts from emails I’ve received:
I LOVE it when requesters care enough to ask the opinion of us lowly turkers and am more than willing to take a few minutes to help them with anything. I look forward to seeing what you cook up!
Thanks for taking the time to try to make your hits better in both pay and design. It’s great to see a requester that actually cares, when most don’t. If you have any other questions for me, feel free to ask. I hope to work for you again soon.
I’ve gotten great suggestions from a lot of Turkers (sometimes, when launching a new type of experiment, I’ll do a quick trial run in order to get some fast feedback before spending more time on the HIT design), and I suspect it’s partly because I’ve taken the time to connect with my workers.
So, as suggested above, one way of quickly garnering some goodwill when you’re first getting started is to make a post introducing yourself on TurkerNation. (There’s a sub-forum devoted to this exact purpose, in fact.)
This is useful because workers will often start new threads recommending particular requesters and encouraging other Turkers to work for them. In the amusing thread praising me, for example, one worker mentioned that she’d been hesitant to work on my HITs until she saw the post confirming I was a good requester.
Also, many Turkers mention that they always refer to Turkopticon, a Firefox extension that displays ratings of requesters by other Turkers, before accepting work from a requester they haven’t worked for before.
Here are some comments about Turkopticon on TurkerNation:
I think that it is well worth taking the time to check reputation of requesters via TurkOpticon and/or in this forum. Checking first substantially minimizes your risk of rejection, of being blocked, and of being paid sub-human wages.
Blindly doing hits for requesters that were never heard of before got me with a pretty bad approval rate when I first started turking. After that, I rigorously inspect every requester that doesn’t have any ratings on Turkopticon. Actually, because of that little add-on I’ve been able to maintain a steady 98-99% approval rate ever since I began using it.
Waiting Time
So how long does it take to get judgments? I’ve restricted the available worker pool pretty strongly to ensure high quality, and it’s still only taken a few hours to get a thousand judgments.
That’s pretty awesome. I’ve worked a lot with human evaluation systems before, but always using a small in-house set of judges – and what with constraints on when those judges were available, how much they were able to work each week, and other tasks taking higher priority, it’d invariably take at least a few days before I’d receive any useful data back.
Getting thousands of judgments in a couple hours means I can launch an MTurk task when I leave for work in the morning and have it done before lunch, which makes experimenting with a lot of different ideas much faster and easier.
Scale
So how many judgments can you actually get before you run out of workers? I’m still a small fish in the MTurk system, but I’m told by my MTurk contact at Amazon that there are companies getting over a million judgments each month.
I also asked my pool of workers how much they’re available to work, in case I would need to scale up to more judgments later on, and here are some samples from what they said:
Typically, I work a total of 20-25 hours per week for a small select group of requesters. I could put in at least 20 hours per week for you alone if you were to make a custom qualification for me. If I know that I can continue to do exemplary work beyond 20 hours, I would be willing to put forth more hours of work. I want to make sure that you are getting the quality of work that you need.
On a day when I don’t have those other assignments, I’d guess I’m turking 5 to 7 hours a day (including weekends). I like to look for a large batch of HITs (preferably in the thousands) so that I can settle into a groove of being able to do them fairly quickly and once I find something like that I can happily settle in for several hours at a time.
I spend more time than the average person on mturk. I log on at about 5:30 AM and am constantly checking for work throughout the day. If the work is available, I will spend until 9PM working. Granted, I do have to take some breaks throughout the day to take care of my 3 year old, but for the most part, I am doing my best to earn while the hits are posted. If I take any time off, it is on the weekend (if I reach my earning goals for the week).
Of course, how much I can work varies. My main source of income is transcription for a market research company and mturk fills in my downtime. If I have an audio file from them, that gets my attention. If not, I’m on mturk. As a single mother working from home, I love the flexibility.
End
I’ll end with a couple other notes.
- How do other companies use human evaluation systems? Google and Bing use human judgments in their search metrics, though I think they use an in-house set of judges rather than Mechanical Turk. I’ve heard Aardvark and Quora used Mechanical Turk to seed answers when they first launched their sites. There’s also a nice set of case studies here (search for the “On-Demand Workforce” section); in particular, Knewton’s use of MTurk for performance and QA testing is pretty interesting.
- I’ve described one way of finding good workers, namely, using the filters Amazon provides. Another way could be to build a reputation system yourself, perhaps using an EM-style algorithm to determine judge quality; a toy sketch of this idea appears after this list.
- Crowdflower is another crowdsourcing system. There are a few differences from MTurk:
- Crowdflower’s worker pool consists of about 20 different sources, including Mechanical Turk, as well as sources like TrialPay (people can opt to complete a crowdsourced task in exchange for some kind of TrialPay deal).
- Crowdflower offers both a self-serve platform (like MTurk) and a more enterprise-centric solution (where you work directly with a Crowdflower employee). The enterprise offering is pretty nice, since Crowdflower will take care of the lower-level details for you (like actually designing and creating the job), and they can offer suggestions for designing the HIT based on their experience.
- Crowdflower provides the option of adding gold standard judgments to your task (items for which you provide a known answer; these are randomly mixed into the work and used to monitor judges), and it tries to automatically determine judge quality and item accuracy for you (e.g., by having each item judged by three different workers).
- An excellent crowdsourcing resource is CrowdScope. I also like the Deneme blog (though it hasn’t been updated in a while) for a lot of fun experiments. Panos Ipeirotis’ blog has good information as well.
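To flesh out the EM idea mentioned above: alternate between estimating each item’s true label from accuracy-weighted votes and re-estimating each judge’s accuracy against those labels, in the spirit of the classic Dawid-Skene estimator. Here’s a toy sketch for binary labels, on made-up data:

```python
# labels[(item, judge)] = 0 or 1: the judge's vote on the item.
# Toy, made-up data: judge "j3" often disagrees with the others.
labels = {
    ("item1", "j1"): 1, ("item1", "j2"): 1, ("item1", "j3"): 0,
    ("item2", "j1"): 0, ("item2", "j2"): 0, ("item2", "j3"): 0,
    ("item3", "j1"): 1, ("item3", "j2"): 1, ("item3", "j3"): 0,
}
items = sorted({i for i, _ in labels})
judges = sorted({j for _, j in labels})

# Initialize each item's P(true label = 1) with a straight majority vote.
p_true = {
    i: sum(v for (it, _j), v in labels.items() if it == i)
       / sum(1 for (it, _j) in labels if it == i)
    for i in items
}
accuracy = {j: 0.8 for j in judges}  # prior guess at each judge's accuracy

for _ in range(20):
    # M-step: a judge's accuracy is how often they agree with the
    # current (soft) estimates of the true labels.
    for j in judges:
        vs = [(i, labels[(i, j)]) for i in items if (i, j) in labels]
        agree = sum(p_true[i] if v == 1 else 1 - p_true[i] for i, v in vs)
        accuracy[j] = agree / len(vs)

    # E-step: re-estimate each item's label, weighting judges by accuracy.
    for i in items:
        like1 = like0 = 1.0
        for j in judges:
            if (i, j) not in labels:
                continue
            a, v = accuracy[j], labels[(i, j)]
            like1 *= a if v == 1 else 1 - a
            like0 *= a if v == 0 else 1 - a
        p_true[i] = like1 / (like1 + like0)

# j1 and j2 come out near 1.0; the noisy j3 comes out much lower.
print({j: round(a, 2) for j, a in accuracy.items()})
```

On real tasks you’d run this over your actual judgment matrix and use the fitted accuracies to decide whom to admit into your trusted pool (and the fitted labels as cleaner ground truth than a raw majority vote).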