ComputingReviews.com

A combination approach to Web user profiling
Tang J., Yao L., Zhang D., Zhang J. ACM Transactions on Knowledge Discovery from Data 5(1): 1-44, 2010. Type: Article

Date Reviewed: 04/06/11

The Web contains rich sets of information about researchers in all areas. It would be extremely useful to be able to retrieve an accurate and complete profile of a researcher by simply supplying a name to a search engine. In this paper, Tang et al. describe their exciting work in ArnetMiner (http://www.arnetminer.org/), a tool that accomplishes this task very well.

ArnetMiner is a search engine dedicated to providing information about academic publications and their authors. This paper presents an exercise that uses the publication database in ArnetMiner to accurately retrieve profiles of authors from varying sources and present them to the user in a unified format. The general approach is to retrieve information from the database, such as the author’s name, publication, and affiliation, and then to mine that information in order to extract connections among the authors, their publications, and their affiliations. The authors’ work involves three major tasks: profile extraction, name disambiguation, and user interest compilation from the collected data.

Three steps are taken to extract the profile of a researcher: page finding, pre-processing, and tagging. A collection of initial Web pages related to the researcher, obtained from a Google search, is used as a starting point. A support vector machine (SVM) classifier is used to extract features that contain the person’s information, such as a title, email address, or uniform resource locator (URL). The retrieved Web pages are broken into tokens, which in turn are tagged using a trained tagging model. The tagging model contains a set of predefined terms, such as doctoral major, address, publication venue, and research interest. The result is a network of researcher names, affiliations, publications, and research interests.
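To make the tokenize-then-tag step concrete, here is a minimal sketch in Python. It is not the authors' implementation: hand-written regular expressions stand in for the trained tagging model, and the tag names, patterns, sample page text, and function names are all assumptions made for illustration.

```python
import re

# Hypothetical tag patterns. The paper uses a trained tagging model;
# these regular expressions are only a stand-in for illustration.
TAG_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "url": re.compile(r"^https?://\S+$"),
}


def tokenize(page_text):
    """Split page text into whitespace-delimited tokens."""
    return page_text.split()


def tag_tokens(tokens):
    """Pair each token with a tag; 'other' when no pattern matches."""
    tagged = []
    for token in tokens:
        label = "other"
        for name, pattern in TAG_PATTERNS.items():
            if pattern.match(token):
                label = name
                break
        tagged.append((token, label))
    return tagged


def extract_profile(page_text):
    """Collect tagged tokens into a profile keyed by tag name."""
    profile = {}
    for token, label in tag_tokens(tokenize(page_text)):
        if label != "other":
            profile.setdefault(label, []).append(token)
    return profile


# Made-up page text for a hypothetical researcher.
page = "Contact: jane.doe@example.edu Homepage: http://example.edu/~jdoe"
print(extract_profile(page))
```

A learned tagger would replace `tag_tokens` and would cover the richer tags the review lists, such as affiliation or research interest, which simple patterns cannot capture.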

The problem of name ambiguity comes from the fact that some researchers share the exact same name. The goal of name disambiguation is to identify the correct person through relevant information, such as affiliations and research interests. In Tang et al.’s work, publication data is used for name disambiguation. The team extracted six attributes for each publication: paper title, publication venue, publication year, abstract, author names, and references. A hidden Markov random field method is used for the task; in essence, it uses conditional probability to estimate the likelihood that an author with a particular name is the person who appears on a particular paper.
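The idea of assigning a paper to one of several same-named researchers by conditional probability can be illustrated with a toy example. The sketch below is not a hidden Markov random field: it scores each candidate by attribute overlap and normalizes the scores with a softmax. The scoring rule, the candidate identifiers, and the sample data are all assumptions.

```python
import math


def overlap_score(paper, candidate):
    """Count shared coauthors, plus one point for a matching venue."""
    shared = len(set(paper["coauthors"]) & set(candidate["coauthors"]))
    venue_match = 1 if paper["venue"] in candidate["venues"] else 0
    return shared + venue_match


def candidate_probabilities(paper, candidates):
    """Softmax over overlap scores: P(candidate | paper) in this toy model."""
    scores = {name: overlap_score(paper, c) for name, c in candidates.items()}
    normalizer = sum(math.exp(s) for s in scores.values())
    return {name: math.exp(s) / normalizer for name, s in scores.items()}


# Two hypothetical researchers who share the same name.
candidates = {
    "zhang_a": {"coauthors": ["Li", "Wang"], "venues": ["KDD"]},
    "zhang_b": {"coauthors": ["Smith"], "venues": ["SIGGRAPH"]},
}
# A made-up paper whose author must be resolved.
paper = {"coauthors": ["Li", "Wang"], "venue": "KDD"}

probs = candidate_probabilities(paper, candidates)
print(max(probs, key=probs.get), probs)
```

The method described in the paper also models relationships between papers, which this per-paper scoring ignores.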

As mentioned above, the third task of this paper is to analyze a user’s research interest. The data available for this task comes from the results of the previous two tasks (the extracted and integrated user profiles, and the collection of publications relating to each user). Tang et al. use the author-conference-topic (ACT) model, which “utilizes the topic distribution to represent the interdependencies among authors, papers, and publication venues.” Collectively, “the conference information [venue, publications, and authors] is associated with each word as a stamp.” The authors explain the basic idea: “When preparing a paper, an author would be responsible for a word; he writes the word based on his research interests ... then each topic in this paper determines a proportion on where to publish the paper.”
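The generative story in these quotations can be sketched as a small sampling routine. This is one reading of the quoted description, not the authors' code: for each word, one of the paper's authors is picked, a topic is drawn from that author's interests, and then a word and a venue "stamp" are drawn from the topic. All distributions, names, and the uniform choice of author are assumptions.

```python
import random


def generate_paper(authors, author_topics, topic_words, topic_venues, n_words, rng):
    """Sample (author, topic, word, venue) tuples for one toy paper."""
    tokens = []
    for _ in range(n_words):
        # One author is "responsible" for each word (uniform choice assumed).
        author = rng.choice(authors)
        # The author picks a topic according to research interests.
        topics, t_weights = zip(*author_topics[author].items())
        topic = rng.choices(topics, weights=t_weights)[0]
        # The topic generates the word itself.
        words, w_weights = zip(*topic_words[topic].items())
        word = rng.choices(words, weights=w_weights)[0]
        # The topic also generates the venue stamp attached to the word.
        venues, v_weights = zip(*topic_venues[topic].items())
        venue = rng.choices(venues, weights=v_weights)[0]
        tokens.append((author, topic, word, venue))
    return tokens


# Made-up toy distributions for two hypothetical authors and two topics.
author_topics = {
    "author_1": {"mining": 0.8, "retrieval": 0.2},
    "author_2": {"mining": 0.3, "retrieval": 0.7},
}
topic_words = {
    "mining": {"cluster": 0.5, "pattern": 0.5},
    "retrieval": {"query": 0.6, "ranking": 0.4},
}
topic_venues = {
    "mining": {"KDD": 0.9, "SIGIR": 0.1},
    "retrieval": {"KDD": 0.2, "SIGIR": 0.8},
}

rng = random.Random(0)  # fixed seed so repeated runs give the same sample
sample = generate_paper(
    ["author_1", "author_2"], author_topics, topic_words, topic_venues, 5, rng
)
for token in sample:
    print(token)
```

Fitting such a model means running this process in reverse: given observed words, authors, and venues, the per-author topic proportions are inferred, and those proportions serve as the research interest profile.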

The paper presents experimental results that show significant improvement over existing models in all three tasks (profile extraction, name disambiguation, and research interest finding). The paper is well written and strikes a good balance between theory and implementation details.

Reviewer: Xiannong Meng

Review #: CR138963 (1110-1075)

Reproduction in whole or in part without permission is prohibited. Copyright 2024 ComputingReviews.com™