Computing Reviews, the leading online review service for computing literature.

Adversarial Information Retrieval: The Manipulation of Web Content

Dennis Fetterly
Microsoft Research

1. Introduction

Adversarial information retrieval has become a very active research area in the last few years. As the name indicates, adversarial information retrieval research differs from traditional information retrieval in that the content providers may have an adversarial relationship with the entity (usually a search engine) that consumes their content. The fraction of Web page referrals that come from search engines is significant, which has created an economic incentive for Web site operators to attempt to manipulate search engine rankings. This manipulation, which is often called Web spam or spamdexing, can take many forms, all of which are challenging to identify. These forms include content manipulation, link creation and manipulation, cloaking, click fraud, and tag spam. Similar to conventional information retrieval, adversarial information retrieval is an interesting area to work in because of the nature of the problem, and the potential for novel research to have a positive impact on the search results that are sent to millions of users.

Research into adversarial information retrieval enables search engines to identify content that attempts to game their rankings. Search engines need to identify this content in order to maintain a positive user experience. Once this content has been identified, the search engine can penalize those documents or sites in a variety of ways. The penalties range in severity from suppressing the pages in results, to stripping them of any endorsement power, to purging them from the search engine’s index. A recent Pew Internet study [1] found that 55 percent of people had lost trust in email as a result of email spam. This study also found that 19 percent of users have reduced their email usage as a result of spam messages. The presence of Web spam in search results could have a similar impact on people’s trust in those results if the issue is not addressed satisfactorily. This, coupled with the low cost of switching between search engines, gives search engines significant motivation to thwart any attempted manipulation of result rankings.

One of the many interesting challenges that this area shares with traditional information retrieval is the scale at which algorithms must correctly operate. Given the infinite nature of some Web sites, especially spam sites that return a new randomly generated page for every request they receive, it is impossible to accurately count the number of pages on the Web today, much less the number of spam Web pages. All of the major search engines currently select billions or tens of billions of documents from this infinite collection to index. One recent study [2] classified 13.8 percent of English language Web pages as spam, which gives an idea of the extensive scope of the problem and suggests that solutions requiring manual intervention will not scale.

2. Content and Link Manipulation

One major issue in this area is the ease, both administratively and economically, with which adversaries can create content to attempt to manipulate the search engine results. In vanilla PageRank, every page has a small endorsement power and there is extremely little cost to create additional pages on a Web server. The cost is incurred in registering an Internet domain name and obtaining an Internet protocol (IP) address for a Web server. These costs can be amortized over a large number of hosts and pages by mapping multiple domains to a single IP address and creating a large number of hostnames or subdomains within the registered Internet domain name. Even minimal hosting costs can be avoided by enterprising spammers who have hosted their spam sites on one of the many blog hosting sites that provide free hosting. In addition to spam blogs, these services are also under a deluge of comment spam, where scripts post comments containing link spam on blogs, guestbooks, and wikis.

A related issue is one of trust and reputation, which arises in many areas of computer science. In almost all cases, it is either free or extremely inexpensive to create additional content to be consumed by the search engine, and there is no centralized authority of the kind that often serves as the basis for a trust or reputation system. Some approaches attempt to approximate a centralized authority by focusing on scarce resources, such as domain name registrations or IP addresses, but these approaches don’t completely address the issue.

There are many different types of spam Web pages, some of which are described in this essay. At the First International Workshop on Adversarial Information Retrieval on the Web, Gyöngyi and Garcia-Molina published a taxonomy of spam Web pages [3], which describes these variants in more detail. There could be spurious content in the document, such as additional terms in the meta tag that are unrelated to the document itself, or content could be duplicated multiple times on a page. Pages could be generated by completing a template with particular words, Mad Libs style, choosing words at random, or stitching a page together from sentences chosen at random. There has been a significant body of research aimed at detecting spam Web pages by analyzing their content [2], templatic structure [4,5], and phrase-level replication [6]. Duplicate content, especially at the partial document level, is a large open problem in this research area. It is extremely difficult or impossible to automatically determine the provenance of duplicated content. If the search engine chooses to penalize a page due to duplicate content, which site should be considered the originator of that content?
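
To make the idea of phrase-level replication concrete, the following is a minimal sketch and not the method of [6]. It assumes that a "phrase" is a window of five consecutive words (a shingle) and that overlap is measured with plain set operations; the function names and the window size are illustrative choices.

```python
def shingles(text, k=5):
    """Return the set of k-word windows (shingles) in a text."""
    words = text.lower().split()
    if len(words) < k:
        # Texts shorter than k words yield one shingle, or none if empty.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def shared_phrase_fraction(page, other, k=5):
    """Fraction of the page's shingles that also occur in the other text."""
    page_shingles = shingles(page, k)
    if not page_shingles:
        return 0.0
    return len(page_shingles & shingles(other, k)) / len(page_shingles)
```

A high shared fraction against many unrelated pages would hint that a page was stitched together from copied sentences. Note that the measure is symmetric in spirit: it says nothing about which page is the original, which is exactly the provenance problem raised above.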

3. Cloaking

The adversarial Web server could also employ cloaking, a technique in which different content is served to the search engine’s crawler and to the end user’s Web browser. The page could also contain scripts that modify the document at load time. As a result, search engine crawlers that do not implement a script interpreter would index one version of the content, while users with a script interpreter would see another version of the document. Users who visit a spam page could also have their browsers redirected to another page via meta refresh tags or scripts. Wu and Davison performed a preliminary study of cloaking and redirection [7] in which they compared pages downloaded using clients that identified themselves either as a search engine crawler or as a standard Web browser. Research by Chellapilla and Chickering analyzes search query logs and advertising click logs to show that results for popular queries are more likely to be cloaked, and that the probability increases if the query is one likely to generate advertising revenue [8].
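
The comparison of crawler and browser copies can be illustrated with a small sketch. It is not the procedure of [7]; it assumes the two copies of a page have already been downloaded, strips tags with a crude regular expression, and scores the difference between the two term sets. The function names and the scoring formula are illustrative assumptions.

```python
import re


def terms(html):
    """Crudely strip tags and return the set of lowercase terms."""
    text = re.sub(r"<[^>]+>", " ", html)
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def cloaking_score(crawler_copy, browser_copy):
    """Share of terms that appear in only one of the two copies.

    0.0 means identical term sets; 1.0 means no terms in common.
    """
    a, b = terms(crawler_copy), terms(browser_copy)
    union = a | b
    if not union:
        return 0.0
    return len(a ^ b) / len(union)
```

A nonzero score alone does not prove cloaking, because legitimately dynamic pages differ between any two downloads. A practical detector would need a baseline, such as the difference between two copies fetched with the same client identity.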

4. Link-Based Spam Detection

There are also various link-based spam detection techniques. New ranking algorithms such as TrustRank [9] and AntiTrustRank [10] mitigate the impact of spam Web pages by propagating trust or distrust from an initial seed set of trusted or distrusted uniform resource locators (URLs). Benczúr et al. [11] also explore propagating trust and distrust from a seed set, using link-based similarity features instead of ranking. In other recent work, Gyöngyi et al. attempt to estimate the impact of link spam on a page’s ranking [12]; their algorithm uses a host-level graph, which significantly reduces the resource requirements. Gibson et al. present an algorithm for identifying dense subgraphs in the Web graph, many of which are the result of link spam [13].
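
Trust propagation from a seed set can be sketched as a PageRank-style iteration in which the random jump lands only on trusted seeds. The code below is a simplified illustration in the spirit of TrustRank [9], not a reproduction of it: seed selection is skipped, trust reaching pages without out-links is simply dropped, and the graph representation, function name, damping value, and iteration count are assumptions.

```python
def trust_rank(out_links, seeds, damping=0.85, iterations=50):
    """Propagate trust from seed pages along out-links.

    out_links maps each page to the list of pages it links to.
    seeds is a non-empty set of pages assumed to be trustworthy.
    """
    nodes = set(out_links) | set(seeds)
    for targets in out_links.values():
        nodes.update(targets)

    # The jump distribution is uniform over the seeds, zero elsewhere.
    seed_mass = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    trust = dict(seed_mass)

    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * seed_mass[n] for n in nodes}
        for src in nodes:
            targets = out_links.get(src, [])
            if not targets:
                continue  # simplification: trust at dangling pages is dropped
            share = damping * trust[src] / len(targets)
            for dst in targets:
                nxt[dst] += share
        trust = nxt
    return trust
```

In this sketch, pages that cannot be reached from any seed end up with a trust score of exactly zero, and trust decays with each hop away from the seeds. Running the same iteration on the reversed graph from a set of known spam pages gives a rough analogue of distrust propagation.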

Research addressing blog comment spam has included work utilizing language models of the original blog post, the comment, and pages that the comment links to [14]. More recently, there has been research to identify spam blogs, or splogs, as well [15]. Later results from the same research group find that the incidence rate of spam blogs in popular blog search engines is as high as 20 percent. It is worth noting that the spam pages in this sample have all passed the search engine’s spam filters before they are displayed as results, so this is a conservative estimate of the actual amount of this type of spam.
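
The intuition behind comparing language models of a post and a comment can be shown with a toy example. This sketch is not the model of [14]: it assumes unigram models with add-one smoothing over the joint vocabulary and uses Kullback-Leibler divergence as the disagreement measure. The function names and the smoothing constant are illustrative.

```python
import math
from collections import Counter


def unigram_model(text, vocab, alpha=1.0):
    """Smoothed unigram probabilities of a text over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL divergence D(p || q) for two models over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def disagreement(post, comment):
    """Divergence between the post's and the comment's unigram models."""
    vocab = set(post.lower().split()) | set(comment.lower().split())
    if not vocab:
        return 0.0
    return kl_divergence(unigram_model(post, vocab),
                         unigram_model(comment, vocab))
```

A comment whose vocabulary has little in common with the post scores higher than one that stays on topic, so a threshold on the score could serve as one weak signal among several.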

5. Conclusion

A common trend in many of the recent approaches to detecting Web spam is to combine a number of features using machine learning classifiers to improve the overall performance of the spam classifier. Another recent development in this area is the public availability of a test collection, which enables greater participation in the field, as well as objective comparisons between algorithms from different research groups. Becchetti et al. [16] advocate that search engines develop a clear set of rules, and they liken these rules to the “anti-doping rules” in sports competitions.

A key part of the scientific process is the reproducibility of research results, which requires that the process used to obtain those results be explained in sufficient detail to reproduce them. One obstacle that sets this research area apart from many other communities in computer science is that publishing research results communicates them to the adversaries as well as to other researchers. Computer security is another discipline where a similar tension exists, and there it is discussed under the term “Responsible Disclosure.” Of course, this publishing dilemma is also significant outside of computer science. In fields where the risks are potentially life threatening, the responsibility for determining what to publish and what to withhold is shared by the authors and editors, and has been the subject of ongoing discussion.

Created: July 10 2007
Last updated: July 10 2007

Web Pages

Adversarial Information Retrieval on Wikipedia: provides a basic description of adversarial IR.

WEBSPAM-UK2006 Dataset: a dataset collected in 2006 containing classifications of Web pages as normal, spam, or borderline. It is useful for people who want to attack the Web spam problem, but who lack the resources to perform large-scale collection and labeling on their own.

Web Spam Challenge: a competition whose goal is to identify and compare machine learning methods for automatically detecting Web spam.

Articles

Topical TrustRank: using topicality to combat Web spam Wu B., Goel V., Davison B. WWW2006

Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb 2007), Banff, Alberta, Canada, May 8th, 2007

Books

Google’s PageRank and beyond: the science of search engine rankings Langville A., Meyer C., 2006

Conferences and Workshops

ACM Conference on Electronic Commerce (ACM EC): an annual conference sponsored by the ACM Special Interest Group on Electronic Commerce (SIGECOM)

Adversarial Information Retrieval on the Web Workshop (AIRWeb): a workshop for researchers and practitioners that covers advances in Web-based adversarial information retrieval

ACM SIGIR International Conference (SIGIR): an annual conference that focuses on new research and developments in the field of information retrieval

International WWW Conference (WWW): an annual conference aimed at researchers, developers, and users that covers “the evolution of the Web, the standardization of its associated technologies, and the impact of those technologies on society and culture”

Reviews

Google’s PageRank and beyond: the science of search engine rankings
Langville A., Meyer C., 2006

Inside PageRank
Bianchini M., Gori M., Scarselli F. ACM Transactions on Internet Technology 5(1): 92-128, 2005

A large-scale study of the evolution of Web pages
Fetterly D., Manasse M., Najork M., Wiener J. Software--Practice & Experience 34(2): 213-237, 2004


1)	Fallows, D. Spam 2007. Pew Internet & American Life Project, May (2007).
2)	Ntoulas, A., Najork, M., Manasse, M., Fetterly, D. Detecting spam Web pages through content analysis. In Proc. of the 15th International Conference on World Wide Web (WWW2006), ACM Press (2006), 83-92.
3)	Gyöngyi, Z., Garcia-Molina, H. Web spam taxonomy. In Proc. of the 1st International Workshop on Adversarial Information Retrieval (AIRWeb), Lehigh University (2005), Article No. 5.
4)	Fetterly, D., Manasse, M., Najork, M. Spam, damn spam, and statistics: using statistical analysis to locate spam Web pages. In Proc. of the 7th International Workshop on the Web and Databases (WebDB), ACM Press (2004), 1-6.
5)	Urvoy, T., Lavergne, T., Filoche, P. Tracking Web spam with hidden style similarity. In Proc. of the 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Lehigh University (2006), 25-31.
6)	Fetterly, D., Manasse, M., Najork, M. Detecting phrase-level duplication on the World Wide Web. In Proc. of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press (2005), 170-177.
7)	Wu, B., Davison, B. Cloaking and redirection: a preliminary study. In Proc. of the 1st International Workshop on Adversarial Information Retrieval (AIRWeb), Lehigh University (2005), Article No. 2.
8)	Chellapilla, K., Chickering, D. Improving cloaking detection using search query popularity and monetizability. In Proc. of the 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Lehigh University (2006), 17-23.
9)	Gyöngyi, Z., Garcia-Molina, H., Pedersen, J. Combating Web spam with TrustRank. In Proc. of the 30th International Conference on Very Large Data Bases (VLDB), Morgan Kaufmann (2004), 576-587.
10)	Krishnan, V., Raj, R. Web spam detection with anti-trust rank. In Proc. of the 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Lehigh University (2006), 37-40.
11)	Benczúr, A., Csalogány, K., Sarlós, T. Link-based similarity search to fight Web spam. In Proc. of the 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Lehigh University (2006), 9-16.
12)	Gyöngyi, Z., Berkhin, P., Garcia-Molina, H., Pedersen, J. Link spam detection based on mass estimation. In Proc. of the 32nd International Conference on Very Large Data Bases (VLDB), ACM Press (2006), 439-450.
13)	Gibson, D., Kumar, R., Tomkins, A. Discovering large dense subgraphs in massive graphs. In Proc. of the 31st International Conference on Very Large Data Bases (VLDB), ACM Press (2005), 721-732.
14)	Mishne, G., Carmel, D., Lempel, R. Blocking blog spam with language model disagreement. In Proc. of the 1st International Workshop on Adversarial Information Retrieval (AIRWeb), Lehigh University (2005), Article No. 1.
15)	Kolari, P., Java, A., Finin, T., Oates, T., Joshi, A. Detecting spam blogs: a machine learning approach. In Proc. of the 21st National Conference on Artificial Intelligence (AAAI 2006), AAAI (2006).
16)	Becchetti, L., Castillo, C., Donato, D., Leonardi, S., Baeza-Yates, R. Link-based characterization and detection of Web spam. In Proc. of the 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Lehigh University (2006), 1-8.

Reproduction in whole or in part without permission is prohibited. Copyright 1999-2024 ThinkLoud®