Introduction to pitch and its use with SSML

This section describes pitch and how it varies, and explains how pitch is used within the context of IBM TTS and SSML.

What is pitch?

The terms pitch and pitch range are familiar to musicians. Pitch is usually specified as the name of a note and an octave number. For example, A4 is the note "A" in octave 4. Pitch range is specified as the number of octaves that an instrument or a singer can cover, from the lowest note to the highest. For example, if the lowest note is A2 and the highest is A4, then the range is two octaves.

In the equal-tempered scale, each octave is divided into 12 semitones.

Each note corresponds to sound vibrations at a particular frequency, measured in hertz (Hz), or cycles per second. For example, the frequency of the note A4 is 440 Hz. An interval of one octave corresponds to a doubling or halving of the frequency. Thus A3 is at 220 Hz, and A5 is at 880 Hz.

The frequency of any note can be calculated from the formula $f = 440 \cdot 2^{n/12}$, where f is the frequency in Hz and n is the number of semitones between A4 and the note in question (positive for notes above A4, negative for notes below). Thus, A#4 is one semitone higher, so its frequency is $440 \cdot 2^{1/12} \approx 466.16376$ Hz. A5 is an octave higher, that is, 12 semitones, so its frequency is $440 \cdot 2^{12/12} = 440 \cdot 2 = 880$ Hz.
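The note-frequency formula can be checked with a few lines of Python. This is a minimal sketch; the function name note_frequency and the constant A4_HZ are illustrative choices, not part of IBM TTS or SSML.

```python
A4_HZ = 440.0  # reference frequency of A4, as given in the text


def note_frequency(semitones_from_a4: float) -> float:
    """Frequency in Hz of the note n semitones above A4 (negative n is below)."""
    return A4_HZ * 2 ** (semitones_from_a4 / 12)


if __name__ == "__main__":
    # Offsets in semitones relative to A4.
    for name, n in [("A3", -12), ("A4", 0), ("A#4", 1), ("A5", 12)]:
        print(f"{name}: {note_frequency(n):.5f} Hz")
```

Negative offsets give notes below A4, so an offset of -12 reproduces the 220 Hz value quoted for A3.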

What is meant by the pitch range of a speaking voice?

Although the pitch range of a keyboard instrument is well defined, the pitch range of a speaking voice is not. Because of phenomena such as glottalization, it is possible for the time interval between pitch periods to occasionally be very large, which, if taken literally, would imply an extremely low pitch frequency. Similarly, there may be occasional excursions to high pitches not really characteristic of the speaker's general style. To achieve a more stable and meaningful measure of the speaker's pitch range, we define the 5th percentile as the bottom of the range and the 95th percentile as the top. In other words, the pitch range is defined so that the speaker's pitch stays in that range 90% of the time. During 5% of the time the pitch is actually below the "bottom" and during another 5% it is above the "top." These extreme excursions are considered outliers, and not part of the normal pitch range.
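The percentile-based definition above can be sketched as follows. The sample values are made up for illustration, and the function name speaking_pitch_range is hypothetical; the sketch only shows how the 5th and 95th percentiles exclude occasional outliers, not how the synthesizer measures pitch.

```python
import statistics


def speaking_pitch_range(pitch_hz):
    """Return (bottom, top): the 5th and 95th percentiles of the pitch samples."""
    # n=20 splits the data into 20 equal-probability groups, so the first
    # cut point is the 5th percentile and the last one is the 95th.
    cuts = statistics.quantiles(pitch_hz, n=20)
    return cuts[0], cuts[-1]


if __name__ == "__main__":
    # Hypothetical per-frame pitch estimates in Hz: mostly 90-139 Hz, plus one
    # glottalized frame (25 Hz) and one high excursion (400 Hz).
    samples = [25.0] + [90.0 + i for i in range(50)] + [400.0]
    bottom, top = speaking_pitch_range(samples)
    print(f"bottom = {bottom:.1f} Hz, top = {top:.1f} Hz")
```

With these samples, the 25 Hz and 400 Hz frames fall outside the reported range, which matches the treatment of outliers described above.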

The outliers, outside the normal pitch range, occur infrequently enough that in many single-sentence utterances, they may not occur at all. In fact, most short utterances will have a much narrower pitch range than the specified nominal range. Thus, if you request a pitch range of 200 Hz, and ask the synthesizer to say "Hello," you would not expect this short utterance to cover the full 200-Hz range.

What is base pitch?

The term base pitch can be defined as the average frequency of the speaker's voice, measured in Hz. This means that the bottom of the pitch range will be below the base pitch, and the top of the range will be above it.

What is the relationship between pitch range and base pitch?

Normally, a higher base pitch goes along with a higher pitch range, when measured in Hz. If you raise the base pitch, but the pitch range in Hz is not changed, the voice begins to sound monotone. If the range is measured in semitones, however, it need not be changed when the base pitch is changed.
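The difference between measuring a range in Hz and in semitones can be illustrated with a short sketch. The frequencies used here are arbitrary examples, not TTS defaults, and the function names are hypothetical.

```python
import math


def semitone_span(low_hz: float, high_hz: float) -> float:
    """Width of the interval from low_hz to high_hz, in semitones."""
    return 12 * math.log2(high_hz / low_hz)


def scale_range(low_hz: float, high_hz: float, factor: float):
    """Raise (or lower) both ends of a range by the same frequency ratio."""
    return low_hz * factor, high_hz * factor


if __name__ == "__main__":
    low, high = 80.0, 160.0  # illustrative values only
    new_low, new_high = scale_range(low, high, 1.5)
    # Width in Hz, then width in semitones, before and after raising the pitch.
    print(high - low, semitone_span(low, high))
    print(new_high - new_low, semitone_span(new_low, new_high))
```

Raising both ends by the same ratio widens the range in Hz but leaves its width in semitones unchanged, which is why a range expressed in semitones need not be adjusted when the base pitch changes.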

The larger the pitch range, the greater the difference between the lowest pitch and the base pitch, and the same goes for the difference between the base and the highest pitch. In other words, as the pitch range increases, the bottom pitch drops and the top pitch increases.

If I ask for a base pitch of 100 Hz and a range of 200 Hz, will the bottom of the range be 0 Hz and the top 200 Hz?

No. The top of the pitch range will be about 230 Hz, and the bottom will be about 30 Hz.

The pitch range is not centered at base pitch, when measured in Hz. The difference between the base pitch and the lowest pitch will be smaller than the difference between the base pitch and the highest pitch. In this example, the bottom is 70 Hz below the base, but the top is 130 Hz above the base pitch. The frequency of the bottom pitch is constrained to be non-negative, but there is no mathematical constraint on the top of the pitch range.

In technical terms, it is assumed that the pitch distribution is log-normal in the frequency domain, which leads to a normal (Gaussian) bell curve in the semitone domain. The mean and the median of the log-normal density function are not equal. If you define the base pitch to be the mean, then the median will decrease if the pitch range is increased while the base pitch is kept constant.

To explain this in regular terms: if you ask for a large pitch range, but specify a low base pitch, then the overall pitch will spend most of the time well below the base, or average value, but will have occasional peaks at values much higher than the average so that the average still comes out correctly. For example, if the base pitch is 100 Hz and the range is 200 Hz, then the top of the pitch range will be about 230 Hz, and the bottom will be about 30 Hz. Even though the average pitch is 100 Hz, the pitch contour will actually spend more than half of the time below 100 Hz. In fact, half the time it will be below 83 Hz, but during the other half it will go high enough to bring the overall average up to 100 Hz.
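The figures in this example can be approximately reproduced from the log-normal assumption. The sketch below is not the synthesizer's actual algorithm. It assumes that the base pitch is the mean of the distribution and that the range in Hz is the distance between the 5th and 95th percentiles, and it solves for the spread numerically by bisection. The function name lognormal_pitch_model is hypothetical.

```python
import math
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.95)  # about 1.645 standard deviations


def lognormal_pitch_model(base_hz: float, range_hz: float):
    """Return (bottom, median, top) in Hz for a log-normal pitch distribution.

    Assumes base_hz is the mean and range_hz is the distance in Hz between
    the 5th and 95th percentiles.
    """
    def width(sigma: float) -> float:
        # For a log-normal distribution, median = mean * exp(-sigma^2 / 2),
        # and the 5th/95th percentiles are median * exp(-/+ Z95 * sigma).
        median = base_hz * math.exp(-sigma ** 2 / 2)
        return 2 * median * math.sinh(Z95 * sigma)

    lo, hi = 0.0, Z95  # width() increases with sigma on this interval
    if not 0 < range_hz < width(hi):
        raise ValueError("range_hz is outside what this sketch can solve")
    for _ in range(100):  # bisection on sigma
        mid = (lo + hi) / 2
        if width(mid) < range_hz:
            lo = mid
        else:
            hi = mid
    sigma = (lo + hi) / 2
    median = base_hz * math.exp(-sigma ** 2 / 2)
    return (median * math.exp(-Z95 * sigma), median,
            median * math.exp(Z95 * sigma))


if __name__ == "__main__":
    bottom, median, top = lognormal_pitch_model(100.0, 200.0)
    print(f"bottom={bottom:.0f} Hz, median={median:.0f} Hz, top={top:.0f} Hz")
```

For a base pitch of 100 Hz and a range of 200 Hz, this model gives a bottom near 30 Hz, a median in the low 80s, and a top near 230 Hz, in line with the approximate values quoted above.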

Are all combinations of base pitch and range possible?

Although MRCP and SSML allow base pitch and pitch range to be specified independently, not all combinations are usable.

The synthesizer will not produce a pitch lower than 50 Hz. This limit may be encountered if the base pitch is fairly low, for example, 100 Hz (typical for male speakers), and the requested pitch range is large, for example, 200 Hz. In that case, the bottom of the pitch range would be about 30 Hz, which is below the capability of the synthesizer. If you want such a large pitch range, you may want to experiment with slightly higher base pitches.

In general, pitch ranges that are equal to or greater than double the base pitch, in Hz, may cause difficulty. In semitones, this corresponds to a range greater than about 36 semitones, or 3 octaves.
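The two limits described above can be combined into a rough pre-check. This sketch relies on the log-normal assumption described earlier (base pitch as the mean, range as the distance between the 5th and 95th percentiles), so its estimate of the bottom pitch is only approximate. The function names and warning texts are hypothetical and are not part of MRCP, SSML, or IBM TTS.

```python
import math
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.95)  # z-score of the 95th percentile
MIN_PITCH_HZ = 50.0               # synthesizer floor stated in the text


def estimated_bottom(base_hz: float, range_hz: float) -> float:
    """Estimate the 5th-percentile pitch under the log-normal assumption."""
    def width(sigma: float) -> float:
        return 2 * base_hz * math.exp(-sigma ** 2 / 2) * math.sinh(Z95 * sigma)

    lo, hi = 0.0, Z95
    if not 0 < range_hz < width(hi):
        raise ValueError("range_hz is outside what this sketch can solve")
    for _ in range(100):  # bisection on the spread parameter sigma
        mid = (lo + hi) / 2
        if width(mid) < range_hz:
            lo = mid
        else:
            hi = mid
    sigma = (lo + hi) / 2
    return base_hz * math.exp(-sigma ** 2 / 2 - Z95 * sigma)


def check_combination(base_hz: float, range_hz: float):
    """Return a list of warnings; an empty list means no problem was found."""
    warnings = []
    if range_hz >= 2 * base_hz:
        warnings.append("range is at least double the base pitch")
    if estimated_bottom(base_hz, range_hz) < MIN_PITCH_HZ:
        warnings.append("estimated bottom of range is below 50 Hz")
    return warnings


if __name__ == "__main__":
    for base, rng in [(100.0, 200.0), (150.0, 200.0), (200.0, 200.0)]:
        print(base, rng, check_combination(base, rng))
```

Under these assumptions, a base pitch of 100 Hz with a 200 Hz range triggers both warnings, while the same range with a higher base pitch passes the check.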

As a rule, the default values for base pitch and range will produce the best sound quality, but you may want to change them for specific effects.