Crowdsourcing is the practice of obtaining information or input into a task from a large number of people, either paid or unpaid, typically via the Internet. With its fast growth, crowdsourcing has produced large volumes of data manually labeled by human crowds. People process this data with various machine learning algorithms, expecting to derive meaningful information that meets their objectives. The authors refer to the “dehumanization effects” of crowdsourcing because both data collection and processing are carried out by machines.
Due to the open nature of crowdsourcing, the data collected is prone to biases arising from various human factors, such as age, country of residence, culture, ethics, gender, knowledge level, and so on. Data with human bias may affect the quality of the derived information either positively or negatively. After providing strong evidence that biased labels skew the resulting information, the authors propose a labeling framework that takes human factors into consideration to improve the efficacy of crowdsourcing. The key idea of the framework is that different tasks have specific preferences related to human factors. Therefore, a requester should transparently specify the relevant settings before launching a task. Making decisions about tradeoffs in such specifications is a kind of rehumanization.
Furthermore, because of the framework’s transparency, requesters are made aware of any potential issues introduced and can mitigate biases in the process at any point after a task is launched. Deploying the framework, implemented in Python, on a popular crowdsourcing platform, the authors report “experiments with 1,919 workers collecting 160,345 human judgments.” The authors explain:
By routing microtasks to workers based on demographics and appropriate pay, our framework mitigates biases in the contributor sample and increases the hourly pay given to contributors.
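The routing idea in the quoted passage can be pictured with a short sketch. This is only an illustration under stated assumptions, not the authors' implementation: the `Worker` class, the `route` and `pay_per_task` functions, the per-group quota scheme, and the pay formula are all hypothetical names and choices introduced here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Worker:
    worker_id: str
    group: str  # hypothetical demographic attribute, e.g. an age band


def route(workers, quotas):
    """Accept workers in arrival order until each group's quota is filled.

    `quotas` maps a group label to the number of workers the requester
    wants from that group (an assumed, requester-specified setting).
    Workers from groups without a quota are not accepted.
    """
    accepted = []
    filled = Counter()
    for worker in workers:
        if filled[worker.group] < quotas.get(worker.group, 0):
            accepted.append(worker)
            filled[worker.group] += 1
    return accepted


def pay_per_task(target_hourly_pay, median_seconds_per_task):
    """Per-task reward that reaches the hourly target at the median task time."""
    return target_hourly_pay * median_seconds_per_task / 3600.0


if __name__ == "__main__":
    arrivals = [
        Worker("a", "18-29"),
        Worker("b", "18-29"),
        Worker("c", "30-49"),
        Worker("d", "18-29"),
    ]
    chosen = route(arrivals, {"18-29": 2, "30-49": 1})
    print([w.worker_id for w in chosen])
    print(pay_per_task(12.0, 30))
```

In this sketch the requester states the desired sample composition and pay target up front, which mirrors the transparency the review attributes to the framework; how the actual framework balances these settings is described in the paper itself.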
The quality of crowdsourcing work depends on the quality of the labels collected. While popular mobile computing devices and broadband networks make it easy to collect input from the public, controlling data quality remains a challenge. This paper provides a practical approach for managing human factors in crowdsourcing with convincing results. Researchers and practitioners working in the area of socially aware computing and machine learning should benefit from reading this paper.