The retrievable units in Extensible Markup Language (XML) documents are individual elements, and the length distribution of these elements differs from that of standard documents. Document length normalization in XML retrieval therefore requires a different approach from the one used in standard document retrieval. In this paper, the authors investigate document length normalization in the XML retrieval context by analyzing the length distributions of XML elements and by carrying out an experiment on length normalization techniques in XML retrieval.
The element length analysis indicates that, although the distribution of arbitrary elements is skewed toward short elements, the distribution of relevant elements is fairly even, except for the shortest elements. In addition, the prior probability of relevance, viewed as a function of element length, is heavily skewed toward long elements.
The experiment evaluates the effects of smoothing, length priors, and index cut-offs on retrieval performance. The results indicate that length priors improve retrieval performance significantly. Removing shorter elements from the index also improves performance, but this improvement is far smaller than that obtained with length priors. The results also indicate that the setting of the smoothing parameter depends on the length prior.
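To make the three factors concrete, the following is a minimal sketch of how smoothing, a length prior, and an index cut-off might interact in a language-model ranker. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the retrieval model, so the linear-interpolation smoothing, the prior taken as element length raised to a power, and the names `score_elements`, `lam`, `prior_exp`, and `min_len` are all hypothetical.

```python
import math
from collections import Counter


def score_elements(query, elements, lam=0.5, prior_exp=1.0, min_len=0):
    """Rank elements by log P(e) + sum over query terms of
    log(lam * P(t|e) + (1 - lam) * P(t|C)).

    Assumptions (not taken from the paper):
      - lam is a linear-interpolation smoothing weight,
      - the length prior is P(e) proportional to len(e) ** prior_exp,
      - min_len is an index cut-off on element length in tokens.
    """
    # Background statistics over all elements, including those later
    # removed by the cut-off (an arbitrary choice for this sketch).
    collection = Counter()
    for tokens in elements.values():
        collection.update(tokens)
    total = sum(collection.values())

    scores = {}
    for name, tokens in elements.items():
        # Index cut-off: skip elements that are too short (or empty).
        if len(tokens) < max(min_len, 1):
            continue
        tf = Counter(tokens)
        # Length prior in log space; prior_exp = 0 disables it.
        log_score = prior_exp * math.log(len(tokens))
        for term in query:
            p_elem = tf[term] / len(tokens)
            p_coll = collection[term] / total
            p = lam * p_elem + (1 - lam) * p_coll
            if p == 0.0:
                # Term unseen everywhere: the element cannot match.
                log_score = float("-inf")
                break
            log_score += math.log(p)
        scores[name] = log_score
    # Highest score first.
    return sorted(scores, key=scores.get, reverse=True)


# Toy elements, keyed by a made-up element name.
elements = {
    "title": ["xml", "retrieval"],
    "section": ["xml", "retrieval", "length", "normalization",
                "in", "xml", "retrieval", "experiments"],
    "para": ["length", "priors", "help"],
}
query = ["xml", "retrieval"]

no_prior = score_elements(query, elements, prior_exp=0.0)
with_prior = score_elements(query, elements, prior_exp=1.0)
with_cutoff = score_elements(query, elements, prior_exp=0.0, min_len=3)
```

On this toy data the short title ranks first without a prior, while the length prior moves the longer section above it; the cut-off simply removes the title from the ranking. Because the prior and the smoothing weight both shift scores as a function of element length, it is plausible in such a model that the two would need to be tuned together, which is consistent with the dependence reported above.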
The primary contribution of this paper is to reconsider document length normalization in a new context, that of XML retrieval. The paper also presents possible techniques for XML element length normalization.