This paper is a brief, readable, and informative summary of research over the past decade on simulating the rhythm of natural English for the purposes of speech synthesis. It describes the phonetic and rhythmic influences on speech timing and discusses evidence for and against the hypothesis that rhythmic stresses tend to be equally spaced in English (the theory of isochronous feet). It outlines two algorithms for assigning phoneme durations to synthetic speech. The first proceeds top down, allocating durations to interstress intervals on the basis of the isochrony hypothesis and then subdividing them among various lower-level entities. The second is founded instead on an extensive statistical analysis of phoneme durations in human speech. The latter method, which is judged superior, is described in some detail; this is welcome since it was originally published at a specialist phonetics conference.
There is some discussion of the rhythmic effects produced by these methods. It is surprising, however, to find no mention of which variety of English was used for synthesis. Linguists on either side of the Atlantic have, over the years, developed radically different systems for characterizing the prosody of speech (including rhythm); some discussion of British/American dialect differences would have been both appropriate and interesting.