Computing Reviews
A combination approach to Web user profiling
Tang J., Yao L., Zhang D., Zhang J. ACM Transactions on Knowledge Discovery from Data 5(1): 1-44, 2010. Type: Article
Date Reviewed: Apr 6 2011

The Web contains rich sets of information about researchers in all areas. It would be extremely useful to be able to retrieve an accurate and complete profile of a researcher by simply supplying a name to a search engine. In this paper, Tang et al. describe their exciting work in ArnetMiner (http://www.arnetminer.org/), a tool that accomplishes this task very well.

ArnetMiner is a search engine dedicated to providing information about academic publications and their authors. This paper presents an exercise that uses the publication database in ArnetMiner to accurately retrieve profiles of authors from varying sources and present them to the user in a unified format. The general approach is to retrieve information from the database, such as the author’s name, publication, and affiliation, and then to mine that information in order to extract connections among the authors, their publications, and their affiliations. The authors’ work involves three major tasks: profile extraction, name disambiguation, and user interest compilation from the collected data.

Three steps are taken to extract the profile of a researcher: page finding, pre-processing, and tagging. A collection of initial Web pages related to the researcher (the result of a Google search) is used as a starting point. A support vector machine (SVM) classifier is used to extract features that contain the person’s information, such as a title, email, or uniform resource locator (URL). The retrieved Web pages are broken into tokens, which in turn are tagged using a trained tagging model. The tagging model contains a set of predefined terms, such as doctoral major, address, publication venue, and research interest. The result is a network of researcher names, affiliations, publications, and research interests.
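
The pre-processing and tagging steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the trained tagging model is replaced here by a few hand-written regular expressions, and the names TAG_RULES, preprocess, tag_tokens, and extract_profile, as well as the tag labels, are hypothetical choices made only for this illustration.

```python
import re

# Hypothetical tag rules standing in for the trained tagging model
# described in the review; a real system would learn these labels.
TAG_RULES = [
    ("email", re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")),
    ("url", re.compile(r"^https?://\S+$")),
    ("position", re.compile(r"^(professor|lecturer|researcher)$", re.IGNORECASE)),
]


def preprocess(page_text):
    """Break page text into whitespace-delimited tokens, trimming edge punctuation."""
    tokens = (tok.strip(".,;:()<>") for tok in page_text.split())
    return [tok for tok in tokens if tok]


def tag_tokens(tokens):
    """Assign each token the first matching tag, or 'other' if none matches."""
    tagged = []
    for tok in tokens:
        label = "other"
        for name, pattern in TAG_RULES:
            if pattern.match(tok):
                label = name
                break
        tagged.append((tok, label))
    return tagged


def extract_profile(page_text):
    """Collect the tagged tokens of one page into a profile dictionary."""
    profile = {}
    for tok, label in tag_tokens(preprocess(page_text)):
        if label != "other":
            profile.setdefault(label, []).append(tok)
    return profile
```

Calling extract_profile on the text of a candidate home page returns a dictionary keyed by tag, which is the kind of unified record the review describes; the page-finding step (the Web search and the classifier) is deliberately left out of this sketch.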

The problem of name ambiguity comes from the fact that some researchers share the exact same name. The goal of name disambiguation is to identify the correct person through relevant information, such as affiliations and research interests. In Tang et al.’s work, publication data is used for name disambiguation. The team extracted six attributes for each publication: paper title, publication venue, publication year, abstract, author names, and references. A hidden Markov random field method is used for the task; it essentially uses conditional probability to estimate the likelihood that an author with a particular name is the person who appears on a particular paper.
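
To make the disambiguation task concrete, the following Python sketch groups papers that carry the same author name into clusters, one per presumed person. It does not implement the hidden Markov random field method used in the paper; it swaps in a much simpler rule (merge two papers when they share a coauthor or a venue) purely to show the shape of the input and output. The function names, the threshold parameter, and the dictionary keys are assumptions of this sketch.

```python
from itertools import combinations


def similarity(p, q, name):
    """Count shared evidence: common coauthors (excluding the ambiguous name) plus same venue."""
    shared = (set(p["authors"]) & set(q["authors"])) - {name}
    return len(shared) + (1 if p["venue"] == q["venue"] else 0)


def disambiguate(papers, name, threshold=1):
    """Cluster papers by the ambiguous name using union-find over similar pairs."""
    parent = list(range(len(papers)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(papers)), 2):
        if similarity(papers[i], papers[j], name) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, paper in enumerate(papers):
        clusters.setdefault(find(i), []).append(paper["title"])
    return sorted(clusters.values())
```

Each returned cluster is a list of paper titles attributed to one person. A probabilistic model such as the one in the paper would weigh all six attributes jointly rather than applying a hard threshold to two of them.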

As mentioned above, the third task of this paper is to analyze a user’s research interest. The data available for this task comes from the results of the previous two tasks (the extracted and integrated user profiles, and the collection of publications relating to each user). Tang et al. use the author-conference-topic (ACT) model, which “utilizes the topic distribution to represent the interdependencies among authors, papers, and publication venues.” Collectively, “the conference information [venue, publications, and authors] is associated with each word as a stamp.” The authors explain the basic idea: “When preparing a paper, an author would be responsible for a word; he writes the word based on his research interests ... then each topic in this paper determines a proportion on where to publish the paper.”
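
The quoted description suggests a per-word generative process, which the Python sketch below tries to make tangible. It is one reading of the quotations, not the authors' model code: for each word, an author is picked, a topic is drawn from that author's interests, and then both a word and a venue "stamp" are drawn from the topic. The function name generate_paper, its parameters, and the toy distributions a caller would pass in are all hypothetical.

```python
import random


def generate_paper(authors, author_topics, topic_words, topic_venues, n_words, seed=0):
    """Sample (author, topic, word, venue) tuples for one paper.

    author_topics, topic_words, and topic_venues map a key to a
    {outcome: weight} dictionary; the weights are illustrative only.
    """
    rng = random.Random(seed)
    tokens = []
    for _ in range(n_words):
        # One author is "responsible" for this word.
        author = rng.choice(authors)
        # The author's research interests determine the topic.
        topics, t_weights = zip(*author_topics[author].items())
        topic = rng.choices(topics, weights=t_weights)[0]
        # The topic determines the word that is written ...
        words, w_weights = zip(*topic_words[topic].items())
        word = rng.choices(words, weights=w_weights)[0]
        # ... and the venue stamp attached to that word.
        venues, v_weights = zip(*topic_venues[topic].items())
        venue = rng.choices(venues, weights=v_weights)[0]
        tokens.append((author, topic, word, venue))
    return tokens
```

Fitting such a model means inverting this process: given the observed words, authors, and venues, infer the topic distributions, which then serve as each researcher's interest profile.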

The paper presents experimental results that show significant improvement over existing models in all three tasks (profile extraction, name disambiguation, and research interest finding). The paper is well written and strikes a nice balance between theory and implementation detail.

Reviewer:  Xiannong Meng Review #: CR138963 (1110-1075)
Information Search And Retrieval (H.3.3)
Database Applications (H.2.8)
General (H.4.0)
 
Reproduction in whole or in part without permission is prohibited.   Copyright 1999-2024 ThinkLoud®