“Data in the wild” refers to raw data that can be gathered from social networking and microblogging sites, as well as other sites featuring user-generated content. In this brief paper, the authors discuss the problems associated with gathering data in the wild for research purposes.
The authors note that research based on data in the wild deviates from the usual design of a research project. Such data is not created with research questions in mind, and it is generated through channels over which the investigator has little control. Information that a survey would ask for directly, such as the gender or educational background of participants, is difficult to ascertain from data in the wild.
Scientific research generally involves developing a hypothesis, selecting a means of testing it, and analyzing the gathered data. In the social sciences, this is typically done with surveys developed specifically for the research purpose. With data in the wild, researchers instead mine existing data for patterns and then work backward to develop hypotheses about what they see.
There are also potential ethical issues in conducting research on data in the wild. Subjects cannot be informed about the type of research performed on their data, so obtaining their informed consent to participate is impossible.
Finally, the authors note legal problems associated with gathering social data for research purposes. Some of the authors attempted research on self-disclosure and privacy by sampling screenshots from live webcams on a publicly available social networking site. They quickly realized that they could inadvertently collect material that was illegal or disturbing (such as pornography or child pornography) and abandoned the project.
The authors call for multidisciplinary research involving law, computer science, social science, and the humanities to address the concerns discussed in this paper. They note a need to develop guidelines for conducting research with data in the wild.