Human crowds are valuable assets for supplying, in real time, additional answers that complement query results derived solely from relational database management systems (RDBMSs). But when should the collection of crowd answers, gathered to augment a database query result, be terminated so that the combined response remains reliable? Trushkowsky and colleagues offer statistical tools that let users and developers of crowd-backed RDBMSs scrutinize the tradeoffs among time, cost, accuracy, and completeness of query responses.
Because an open-ended crowd query has no fixed answer set, the size of the query result (its cardinality) must be estimated; knowing it allows, for example, computing what fraction of each group of interest a survey has actually covered. The authors convincingly introduce a data model based on a power-law distribution of answer popularity, which helps to account for sampling problems attributable to cultural and regional biases and to the varied web search strategies that workers employ.
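The skew this data model captures is easy to see in a toy simulation. The sketch below is my own illustration, not the authors' code; the Zipf exponent, result-set size, and answer count are arbitrary assumptions. It draws crowd answers from a power-law popularity distribution: a handful of popular items absorb most answers, so a naive count of distinct items stalls well below the true cardinality.

```python
# Toy illustration: crowd answers drawn from a Zipfian (power-law)
# popularity distribution. Popular items dominate, so unique-counting
# underestimates the true result-set size.
import random
from collections import Counter

random.seed(42)

TRUE_CARDINALITY = 200   # hypothetical true size of the result set
ZIPF_EXPONENT = 1.2      # assumed skew of answer popularity
N_ANSWERS = 500          # number of crowd answers collected

# Item i is proposed with probability proportional to 1 / i**s.
weights = [1.0 / (i ** ZIPF_EXPONENT) for i in range(1, TRUE_CARDINALITY + 1)]
answers = random.choices(range(1, TRUE_CARDINALITY + 1),
                         weights=weights, k=N_ANSWERS)

freq = Counter(answers)
print(f"distinct items seen: {len(freq)} of {TRUE_CARDINALITY}")
print(f"answers consumed by the 10 most popular items: "
      f"{sum(c for _, c in freq.most_common(10))} of {N_ANSWERS}")
```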
The paper introduces and evaluates a metric for judging when the estimated cardinality of crowdsourced query results, gathered as human intelligence tasks (HITs) on Amazon's Mechanical Turk (AMT), has stabilized and converged. The authors present algorithms that minimize the influence of individual workers who might otherwise dominate and bias query responses, and they characterize classes of coverage and variance distributions of worker responses to crowdsourced queries. Experiments with several thousand HITs on AMT, using queries over United Nations and US data sets, show significant improvement over methods from well-known prior studies.
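To make the coverage idea concrete, here is a sketch of the classic Chao92 species estimator, which is representative of the coverage-based family the paper builds on; it is my own reconstruction for illustration, not the authors' refined algorithm. It turns a multiset of crowd answers into a cardinality estimate via Good-Turing sample coverage, with a coefficient-of-variation correction for skewed item frequencies.

```python
# Chao92-style coverage-based cardinality estimation (illustrative sketch).
from collections import Counter

def chao92(answers):
    """Estimate total result-set cardinality from a multiset of crowd answers."""
    n = len(answers)                                # total answers received
    freq = Counter(answers)                         # per-item answer counts
    c = len(freq)                                   # distinct items seen so far
    f1 = sum(1 for v in freq.values() if v == 1)    # items seen exactly once
    if n == 0:
        return 0.0
    if f1 == n:                                     # every answer unique:
        return float("inf")                         # zero coverage, estimate diverges
    coverage = 1.0 - f1 / n                         # Good-Turing sample coverage
    # f_counts[i] = number of items observed exactly i times.
    f_counts = Counter(freq.values())
    # Squared coefficient of variation, correcting for power-law-skewed frequencies.
    gamma_sq = max(
        (c / coverage)
        * sum(i * (i - 1) * f for i, f in f_counts.items())
        / (n * (n - 1))
        - 1.0,
        0.0,
    )
    return c / coverage + n * (1.0 - coverage) / coverage * gamma_sq

# Example: 12 answers covering 6 distinct items, heavily skewed toward a few.
sample = ["usa", "uk", "france", "usa", "uk", "usa",
          "ghana", "peru", "usa", "uk", "nepal", "usa"]
print(f"observed {len(set(sample))} distinct, estimated {chao92(sample):.1f} total")
```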
List walking occurs when multiple workers draw their answers, in order, from the same source list, producing heavily correlated responses that cause the total size of a query result to be underpredicted. The authors propose and validate a heuristic, based on binomial probability, for detecting and overcoming list walking; it successfully detected severe list walks in the United Nations data set.
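The following is a simplified reconstruction of that kind of binomial heuristic, not the authors' exact formula; the null probability of a chance match and the significance threshold are assumptions of mine. The idea: if far more workers repeat the same ordered run of answers than independent responding would predict, flag a list walk.

```python
# Binomial tail test for list walking (illustrative sketch with assumed
# parameters P_MATCH and ALPHA, not the paper's exact heuristic).
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more matches by luck."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def looks_like_list_walk(matching_workers, total_workers,
                         p_match=0.05, alpha=1e-4):
    """Flag when this many workers repeating the same ordered run of
    answers is too improbable under independent responding."""
    return binom_sf(matching_workers, total_workers, p_match) < alpha

# Example: 9 of 20 workers submitted the same five answers in the same order.
print(looks_like_list_walk(9, 20))   # True -> likely copied from a shared list
```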
The authors also present algorithms for computing the cost-benefit tradeoff between the precision of a query response estimate and the expense of gathering further answers, for results drawn from both traditional and real-time crowdsourced RDBMSs. Users should indeed be empowered to contribute to, and reason about, query results in relational database search and retrieval. Beyond the new light this paper sheds on applications of the well-known power law [1] and the binomial distribution, I encourage statisticians and database specialists to read it and address the open questions the authors raise. How do relational operators such as SELECT, PROJECT, and JOIN behave over real crowdsourced query results? What impact do human behaviors have on the sampling process in crowdsourced queries?