In information retrieval, clustering aims to improve the user's search experience by grouping related objects. For example, search results can be clustered so that users can quickly find related items together. Recently, largely because of social bookmarking, user-assigned descriptive tags have become available alongside the page text for some web pages. Such user-generated content can enrich page descriptions and improve clustering performance, yielding more cohesive clusters.
In this study, the authors aim to obtain highly discriminative features from page text and user tags, and thereby improve clustering performance. To this end, they develop a method based on multiview learning and kernel canonical correlation analysis. The underlying idea is that, by exploiting the correlation between page text and tags, it is possible to obtain better features for clustering. The authors consider both partially and fully tagged corpora, and their experiments employ various combinations of page text and tag information (page text only, tags only, and both combined). They experiment with 2,000 tagged Open Directory Project (ODP) web pages and show that their approach improves clustering performance.
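To make the idea of correlating the two views concrete, the following is a minimal sketch of regularized kernel canonical correlation analysis in NumPy. It is not the authors' implementation: the linear kernels, the regularization value `reg`, the toy data, and the function names `center_kernel` and `kcca_features` are all assumptions made for illustration, and the eigenproblem used is one common regularized formulation rather than one confirmed by the paper.

```python
import numpy as np

def center_kernel(K):
    """Center a Gram matrix in feature space: H K H with H = I - 11^T / n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca_features(K_text, K_tags, reg=0.1, n_components=1):
    """Regularized KCCA on two Gram matrices (hypothetical helper).

    Returns one projection matrix per view, each of shape (n, n_components).
    """
    n = K_text.shape[0]
    Kx, Ky = center_kernel(K_text), center_kernel(K_tags)
    I = np.eye(n)
    # Eigenproblem: (Kx + reg*I)^-1 Ky (Ky + reg*I)^-1 Kx alpha = rho^2 alpha
    Ry = np.linalg.solve(Ky + reg * I, Kx)        # (Ky + reg*I)^-1 Kx
    M = np.linalg.solve(Kx + reg * I, Ky @ Ry)
    eigvals, eigvecs = np.linalg.eig(M)
    top = np.argsort(-eigvals.real)[:n_components]
    alpha = eigvecs[:, top].real                  # dual weights, text view
    beta = Ry @ alpha                             # dual weights, tag view (up to scale)
    return Kx @ alpha, Ky @ beta

# Toy data (assumed): 40 "pages" in two groups, seen through two noisy views.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 20)
text = 4.0 * labels[:, None] + 0.5 * rng.standard_normal((40, 5))
tags = 3.0 * labels[:, None] + 0.5 * rng.standard_normal((40, 4))

# Linear kernels are an assumption; any positive semidefinite kernel could be used.
z_text, z_tags = kcca_features(text @ text.T, tags @ tags.T)
rho = np.corrcoef(z_text[:, 0], z_tags[:, 0])[0, 1]
print("correlation of first canonical projections:", rho)
```

The projected features could then be passed to a standard clustering algorithm. Which kernel and clustering algorithm the authors actually use is not stated here, so those choices are left open in the sketch.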
The study is useful for clustering objects described by more than one set of features. The authors suggest several directions for future research, such as clustering medical records and multilingual data; however, understanding and applying the method to such data requires knowledge of the relevant domains and languages.