For decades, sorting has been a popular topic. Sorting strings, however, has attracted less attention than sorting numbers. In particular, there have not been many metrics proposed to assist in the analysis of string sorting. Previous efforts mainly provided simplistic, average-case formulas. This paper reports the results of an empirical study to determine the average number of character comparisons during string sorting, as well as the required run times, assuming that strings (in English and Arabic) are sorted using the quicksort algorithm.
The paper uses the concept of critical factor (CF), which is defined as the position number where two strings disagree in a character-by-character comparison. The paper also introduces the notion of average CF, which is defined over a set of sorted strings, and the notion of run-time average CF.
Data strings were extracted from bibliographic databases: the data was not simple words, but rather sentences. Using the above metrics, the authors performed an extensive experiment to investigate differences between the English and Arabic languages. Interestingly, the average CF ranged from 1 to 7 for about 50 percent of English cases and about 60 percent of Arabic cases, whereas 24 percent of the Arabic strings and 16 percent of the English strings disagreed in the first, second, or third letter. Several similar remarks are extracted by interpreting the results. The paper is an interesting first step towards the empirical analysis of string sorting.