As an alternative to visiting Web pages, really simple syndication (RSS) feeds offer a number of advantages: they come in an Extensible Markup Language (XML)-based format, they include content-based meta-information, and they can be aggregated and presented in a variety of ways. In particular, users who are interested in obtaining a quick overview of events in a particular domain can do this by combining related RSS feeds from various sources and displaying the headings of individual items (and some content, if desired) on one page.
One problem with such RSS feeds is the semantic overlap: news items from the same source may be offered through multiple RSS feeds, leading to duplicate entries, and items covering the same event may have a significant overlap in content, even though their presentation is different. While outright duplication is fairly straightforward to resolve, the semantic overlap of similar items is a more challenging problem. The authors of this paper propose an approach to merge similar items based on their relatedness.
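To illustrate why outright duplication is the easy case, here is a minimal sketch of exact-duplicate removal. It is not taken from the paper; it assumes each item has already been parsed into a dictionary, and that the RSS `guid` element (or, failing that, `link`) identifies an item. The function name `drop_exact_duplicates` is hypothetical.

```python
def drop_exact_duplicates(items):
    """Keep the first occurrence of each item, keyed by guid or link.

    Assumption: items are dicts with optional "guid" and "link" keys,
    mirroring the RSS 2.0 elements of the same names.
    """
    seen = set()
    unique = []
    for item in items:
        key = item.get("guid") or item.get("link")
        if key is None:
            # No identifier available: keep the item rather than guess.
            unique.append(item)
            continue
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```

A check of this kind only catches items that are literally the same entry; two differently worded reports of the same event pass through untouched, which is the harder problem the paper addresses.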
For human readers, it is often sufficient to glance at the heading and maybe the first few sentences of two or more news items to determine if they are related; for computers, this is a challenge. Computers treat RSS entries as structured entities consisting of strings that are disassociated from the meaning that humans attribute to the words specified by these strings; determining the topic and content of an item requires additional measures. RSS items often have tags (keywords) that describe their content. An informal or formal structure (such as a taxonomy, folksonomy, or ontology) can be used to describe the relations between tags. It is assumed that RSS entries with a large overlap in their tag sets are related, and the additional structure allows an expansion or refinement of the relatedness between items.
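The tag-overlap assumption can be made concrete with a small sketch. The measure used here, the Jaccard coefficient, is chosen for illustration and is not necessarily the measure used by the authors; the function name `tag_overlap` is hypothetical, and the sketch ignores any taxonomy or ontology over the tags.

```python
def tag_overlap(tags_a, tags_b):
    """Jaccard overlap of two tag collections, compared case-insensitively.

    Returns a value between 0.0 (no shared tags) and 1.0 (identical sets).
    """
    a = {t.strip().lower() for t in tags_a}
    b = {t.strip().lower() for t in tags_b}
    if not a and not b:
        # Two untagged items carry no evidence of relatedness.
        return 0.0
    return len(a & b) / len(a | b)


# Two items sharing two of three distinct tags.
score = tag_overlap(["Politics", "Election"], ["election", "politics", "Europe"])
```

A structure over the tags would refine such a score, for example by treating a tag and its parent category in a taxonomy as a partial match rather than a mismatch.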
Tags are usually created by the item’s author(s); by themselves, they are not a sufficiently reliable basis to determine the relatedness between items. Thus, Taddesse et al. expand the calculation of relatedness to the actual contents of the items. The goal is to group similar items together and to possibly merge some of their constituent components into one overall item, to be presented to the reader (the provenance of the individual pieces is still traceable).
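As a rough illustration of content-based relatedness and grouping, the sketch below compares item texts with a bag-of-words cosine similarity and groups items greedily. This is a stand-in technique, not the method of Taddesse et al.; the names `cosine_similarity` and `group_related`, the `text` and `source` keys, and the threshold of 0.5 are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercased word counts; a deliberately crude tokenizer."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two bag-of-words vectors."""
    a, b = term_counts(text_a), term_counts(text_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def group_related(items, threshold=0.5):
    """Greedy grouping: an item joins the first group whose first member
    is at least `threshold` similar; otherwise it starts a new group.

    Items are kept whole inside each group, so the source of every
    piece remains traceable.
    """
    groups = []
    for item in items:
        for group in groups:
            if cosine_similarity(item["text"], group[0]["text"]) >= threshold:
                group.append(item)
                break
        else:
            groups.append([item])
    return groups
```

Because each group holds the original items rather than a fused text, a later merging step can still attribute every component to its feed, in line with the traceable provenance described above.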
Taddesse et al.’s experiments on two sets of RSS items show that the core method works reasonably well. Because the relatedness and merging problems are difficult overall, the work is not yet complete: the authors are developing a full merging language that incorporates user preferences, and they plan to extend their methods to other media types.