Quality Management on Amazon Mechanical Turk

Crowdsourcing services, such as Amazon Mechanical Turk, allow for the easy distribution of small tasks to a large number of workers.  Unfortunately, manually verifying the quality of the submitted results is hard, and malicious workers exploit this difficulty to submit low-quality answers.  Currently, most requesters rely on redundancy to identify the correct answers.  However, redundancy is not a panacea: massive redundancy is expensive and significantly increases the cost of crowdsourced solutions.  Therefore, we need techniques that accurately estimate the quality of workers, so that low-performing workers and spammers can be rejected and blocked.
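To make the redundancy baseline concrete, the sketch below aggregates redundant labels per task by simple plurality voting; the function and data names are illustrative and not taken from the paper.

```python
from collections import Counter

def majority_vote(labels_by_task):
    """Aggregate redundant labels per task by simple plurality.

    labels_by_task: dict mapping task id -> list of observed labels
    (hypothetical structure, for illustration only).
    Returns a dict mapping task id -> most frequent label.
    """
    return {
        task: Counter(labels).most_common(1)[0][0]
        for task, labels in labels_by_task.items()
    }

# Example: three workers label two tasks; a single noisy vote is outvoted.
votes = {
    "task1": ["spam", "spam", "not_spam"],
    "task2": ["not_spam", "not_spam", "not_spam"],
}
print(majority_vote(votes))  # {'task1': 'spam', 'task2': 'not_spam'}
```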


However, existing techniques cannot separate the true (unrecoverable) error rate from the (recoverable) biases that some workers exhibit, and this lack of separation leads to incorrect assessments of a worker's quality.  We present algorithms that improve on the existing state-of-the-art techniques by separating bias from error.  Our algorithm generates a scalar score representing the inherent quality of each worker.  We also show how to incorporate cost-sensitive classification errors into the overall framework and how to seamlessly integrate unsupervised and supervised techniques for inferring worker quality.  We present experimental results demonstrating the performance of the proposed algorithm under a variety of settings.
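The abstract does not spell out the estimation procedure, so the following is only a sketch of one plausible instantiation: an EM procedure in the spirit of Dawid and Skene that jointly estimates a confusion matrix per worker and a posterior over each task's true label, plus a cost-based scalar score computed from the "soft label" that a worker's answers induce.  All names, smoothing constants, and the exact scoring rule are assumptions made for illustration, not the paper's definitive implementation.

```python
import numpy as np

def em_confusion_matrices(labels, n_classes, n_iters=50):
    """Jointly estimate a posterior over each task's true label and a
    confusion matrix per worker, EM-style (in the spirit of Dawid & Skene).

    labels: list of (task_id, worker_id, observed_label) with integer labels.
    Returns (posteriors[task, class], confusion[worker, true, observed], prior).
    """
    tasks = sorted({t for t, _, _ in labels})
    workers = sorted({w for _, w, _ in labels})
    t_idx = {t: i for i, t in enumerate(tasks)}
    w_idx = {w: i for i, w in enumerate(workers)}

    # Initialize posteriors from per-task vote shares.
    post = np.full((len(tasks), n_classes), 1e-9)
    for t, _, l in labels:
        post[t_idx[t], l] += 1.0
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # M-step: re-estimate each worker's confusion matrix and class priors
        # from soft counts (with a small smoothing constant).
        conf = np.full((len(workers), n_classes, n_classes), 1e-2)
        for t, w, l in labels:
            conf[w_idx[w], :, l] += post[t_idx[t]]
        conf /= conf.sum(axis=2, keepdims=True)
        prior = post.mean(axis=0)

        # E-step: recompute the posterior over the true label of each task.
        post = np.tile(np.log(prior), (len(tasks), 1))
        for t, w, l in labels:
            post[t_idx[t]] += np.log(conf[w_idx[w], :, l])
        post = np.exp(post - post.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)

    return post, conf, prior

def worker_cost(conf_w, prior, cost):
    """One possible scalar quality score: the minimum expected
    misclassification cost of the soft label a worker's answer induces.
    cost[i, j] is the penalty for treating true class i as class j."""
    # Joint probability of (true class i, observed label j) for this worker.
    joint = prior[:, None] * conf_w
    p_obs = joint.sum(axis=0) + 1e-12
    n = len(prior)
    expected = 0.0
    for j in range(n):
        soft = joint[:, j] / p_obs[j]   # posterior over the true class
        expected += p_obs[j] * min(soft @ cost[:, k] for k in range(n))
    return expected
```

Under this kind of score, a worker with a systematic but invertible bias (e.g., a confusion matrix that permutes the classes) induces concentrated soft labels and therefore a low expected cost, while a worker answering at random induces soft labels close to the class prior and a high cost; this is the bias-versus-error distinction the abstract refers to.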