SSML tags

This section gives a list of SSML tags and examples of how those tags are used.

<speak>

This is the root element for SSML documents. Valid attributes are:

xml:lang: This is a required attribute specifying the language. Accepted values are the language tags defined in RFC 3066 (http://www.ietf.org/rfc/rfc3066.txt).
xml:base: This is an optional attribute specifying the base URI to use for resolving relative paths.
version: This is a required attribute specifying the version of the SSML specification. The accepted value is "1.0".

Example:

<speak xml:lang="En-US" version="1.0" xml:base="http://www.myfileserver.com/mydir">text to be spoken</speak>
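
As a sketch of how xml:base might be used, the following document pairs it with a relative src on the <audio> tag described later in this section. The file name beep.wav is hypothetical, and the resolved location shown in the comment is an assumption based on the attribute description above, not tested engine behavior.

```xml
<speak xml:lang="En-US" version="1.0" xml:base="http://www.myfileserver.com/mydir/">
<!-- Assumed to resolve to http://www.myfileserver.com/mydir/beep.wav (hypothetical file) -->
A short tone follows. <audio src="beep.wav"/>
</speak>
```

Note the trailing slash on the base URI in this sketch. Under standard URI resolution rules, a relative reference replaces the last path segment of a base that does not end in a slash.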

<paragraph> or <p> or <sentence> or <s>

These are optional tags that can be used to give text structure hints to the TTS system. The only valid attribute is xml:lang, which accepts a value even though language switching is not supported.

Example:

<speak xml:lang="En-US" version="1.0">
<paragraph>
<sentence>Text within a sentence tag.</sentence>
<s>More text.</s>
</paragraph>
</speak>

Note: If the enclosed text in an SSML <sentence> or <paragraph> tag does not end with an end-of-sentence punctuation character (like a period), a longer than normal pause is added to the synthesized audio for this text.
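
To illustrate the note above, the following sketch ends each enclosed sentence with a period so that, according to the note, no longer-than-normal pause should be added. The sentence text is illustrative only.

```xml
<speak xml:lang="En-US" version="1.0">
<paragraph>
<!-- Each sentence ends with end-of-sentence punctuation inside the tag -->
<sentence>Your order has shipped.</sentence>
<sentence>It should arrive on Friday.</sentence>
</paragraph>
</speak>
```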

<say-as>

The say-as tag allows the author to indicate the type of text contained within the tag and to help specify the level of detail for rendering the text. The required attribute for this tag is interpret-as. There are two optional attributes, format and detail, which are only used with particular values of the interpret-as attribute. These optional attributes are illustrated within the entries for their associated values.

letters

This value spells out the characters in a given word within the enclosed tag.

Example (This will spell out "HELLO"):

<speak xml:lang="En-US" version="1.0">
<say-as interpret-as="letters">Hello</say-as>
</speak>

digits

This value spells out the digits in a given number within the enclosed tag.

Example (This will spell out "123456"):

<speak xml:lang="En-US" version="1.0">
<say-as interpret-as="digits">123456</say-as>
</speak>

vxml:digits

This value performs the same function as the digits value.

<speak xml:lang="En-US" version="1.0">
<say-as interpret-as="vxml:digits">123456</say-as>
</speak>

date

This value will speak the date within the enclosed tag, using the format given in the associated format attribute. The format attribute is nominally required with the date value of interpret-as, but if format is not present, the engine will still attempt to pronounce the date.

Example (This gives a list of dates in all the supported formats):

<speak xml:lang="En-US" version="1.0">
<say-as interpret-as="date" format="mdy">12/17/2005</say-as>
<say-as interpret-as="date" format="ymd">2005/12/17</say-as>
<say-as interpret-as="date" format="dmy">17/12/2005</say-as>
<say-as interpret-as="date" format="ydm">2005/17/12</say-as>
<say-as interpret-as="date" format="my">12/2005</say-as>
<say-as interpret-as="date" format="md">12/17</say-as>
<say-as interpret-as="date" format="ym">2005/12</say-as>
</speak>
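
The description above notes that the engine still attempts to pronounce a date when format is missing. A minimal sketch of that case follows. How the engine chooses the field order in this situation is not stated here, so supplying format is likely the safer choice.

```xml
<speak xml:lang="En-US" version="1.0">
<!-- No format attribute: the engine is left to work out the field order -->
<say-as interpret-as="date">12/17/2005</say-as>
</speak>
```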

ordinal

This value will speak the ordinal value for the given digit within the enclosed tag.

Example (This will say "second first"):

<speak xml:lang="En-US" version="1.0">
<say-as interpret-as="ordinal">2</say-as>
<say-as interpret-as="ordinal">1</say-as>
</speak>

cardinal

This value will speak the cardinal number corresponding to the Roman numeral within the enclosed tag.

Example (This will say "Super Bowl thirty-nine"):

<speak xml:lang="En-US" version="1.0">
Super Bowl <say-as interpret-as="cardinal">XXXIX</say-as>
</speak>

number

This value is an alternative to using the values given above. By using the format attribute to determine how the number is to be interpreted, you can enter one series of digits and have it pronounced several different ways, as in the example. The example also includes two different ways of pronouncing a series of digits as a telephone number. To have the series pronounced with the punctuation included, you must add the detail attribute.

Example:

<speak xml:lang="En-US" version="1.0">
<say-as interpret-as="number">123456</say-as>
<say-as interpret-as="number" format="ordinal">123456</say-as>
<say-as interpret-as="number" format="cardinal">123456</say-as>
<say-as interpret-as="number" format="telephone">555-555-5555</say-as>
<say-as interpret-as="number" format="telephone" detail="punctuation">555-555-5555</say-as>
</speak>

vxml:boolean

This value will speak "yes" or "no" depending on the value given within the enclosed tag.

Example:

<speak xml:lang="En-US" version="1.0">
<say-as interpret-as="vxml:boolean">true</say-as>
<say-as interpret-as="vxml:boolean">false</say-as>
</speak>

vxml:date

This value works like the date value, except that the format is predefined as YYYYMMDD. When a field is not known, or you do not want it to be spoken, question marks replace that field, as shown in the example.

Example:

<speak xml:lang="En-US" version="1.0">
<say-as interpret-as="vxml:date">20050720</say-as>
<say-as interpret-as="vxml:date">????0720</say-as>
<say-as interpret-as="vxml:date">200507??</say-as>
</speak>

vxml:currency

This value is used to control the synthesis of monetary quantities. The string must be written in the "UUUmm.nn" format, where "UUU" is the three character currency indicator specified by ISO standard 4217, and "mm.nn" is the amount.

Example (This will say "forty-five dollars and thirty cents"):

<speak xml:lang="En-US" version="1.0">
<say-as interpret-as="vxml:currency">USD45.30</say-as>
</speak>

If there are more than two decimal places in the number within the enclosed tag, the amount will be synthesized as a decimal number followed by the currency indicator. If the three character currency indicator is not present, the number will be synthesized as a decimal only, with no pronunciation of currency type.

Example 2 (This will say "forty-five point three two nine US dollars"):

<speak xml:lang="En-US" version="1.0">
<say-as interpret-as="vxml:currency">USD45.329</say-as>
</speak>
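
The remaining case described above, where the three character currency indicator is omitted, can be sketched as follows. Based on the description, the amount should be read as a plain decimal number with no currency type. This has not been verified against an engine.

```xml
<speak xml:lang="En-US" version="1.0">
<!-- No ISO 4217 indicator: expected to be read as a decimal number only -->
<say-as interpret-as="vxml:currency">45.30</say-as>
</speak>
```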

vxml:phone

This value will speak a phone number with both digits and punctuation, similar to the number value used with format="telephone".

Example:

<speak xml:lang="En-US" version="1.0">
<say-as interpret-as="vxml:phone">555-555-5555</say-as>
</speak>

<phoneme>

The SSML phoneme tag enables users to provide a phonetic pronunciation for the enclosed text. This tag has two attributes:

alphabet: This attribute specifies the phonology used. The supported alphabets are "ipa", for the International Phonetic Alphabet, and "ibm", for the SPR representation discussed in Introduction to symbolic phonetic representations. The alphabet attribute is optional. If no alphabet is designated, the default value used is "ibm".
ph: This attribute specifies the pronunciation. It is a required attribute.

This example shows how a pronunciation for "tomato" is specified using the IPA phonology, where the symbols are given using Unicode:

<speak xml:lang="En-US" version="1.0">
<phoneme alphabet="ipa" ph="t&#x259;mei&#x27E;ou&#x325;">tomato</phoneme>
</speak>

This example shows how a pronunciation for "tomato" is specified using the SPR phonology:

<speak xml:lang="En-US" version="1.0">
<phoneme alphabet="ibm" ph=".0tx.1me.0fo">tomato</phoneme>
</speak>
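
Because the alphabet attribute defaults to "ibm", the SPR example can presumably also be written without it. The following is a sketch under that assumption, reusing the same ph value as above.

```xml
<speak xml:lang="En-US" version="1.0">
<!-- alphabet omitted: the ph value is assumed to be read as SPR ("ibm") -->
<phoneme ph=".0tx.1me.0fo">tomato</phoneme>
</speak>
```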

<sub>

This tag is used to indicate that the text included in the alias attribute is to replace the text enclosed within the tag when speech is synthesized. The only attribute for this tag is the alias attribute, and it is required. If the alias attribute is not defined, an error results.

Example:

<speak xml:lang="En-US" version="1.0">
<sub alias="International Business Machines">IBM</sub>
</speak>

<voice>

This tag is used when a change in voice is required. Each of the attributes listed is optional, but at least one must be present: a voice tag with no attributes results in an error. The optional attributes are:

age

Accepted values are positive integers between 14 and 60, for both male and female voices.

gender

Accepted values are "male" and "female".

name

Accepted values are the installed voices' names.

variant

Accepted values are positive integers.

Examples:

<speak xml:lang="En-US" version="1.0">
<voice age="30">Use a voice with an age of 30.</voice>
<voice gender="female">This is a female voice.</voice>
<voice name="Allison">Use the IBM TTS voice named Allison.</voice>
<voice name="Allison, Andrew, Tyler">Use the first available IBM TTS voice named in the given list.</voice>
</speak>

When using voice variant, you must have two female voices of the same language installed.

<voice variant="1">Hello, my name is Tyler, I am the second US English female voice for TTS.</voice>

You do not need to specify a voice variant to use the default voice, but to change to a different voice, you must specify the voice variant as "1".
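
Placed in a complete document, the variant switch might look like the following sketch. It assumes that text outside the voice tag is spoken with the default voice, which is implied but not stated above.

```xml
<speak xml:lang="En-US" version="1.0">
This sentence uses the default voice.
<!-- variant="1" requests the second installed voice, as described above -->
<voice variant="1">This sentence uses the second voice.</voice>
This sentence is assumed to return to the default voice.
</speak>
```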

<emphasis>

The <emphasis> element is currently not supported.

<break>

This tag inserts pauses into the spoken text. It has the following optional attributes:

strength: This attribute specifies the length of a pause in terms of varying strength values: "none", "x-weak", "weak", "medium", "strong", or "x-strong".
time: This attribute specifies the length of the pause in seconds or milliseconds. The value formats are "NNNs" for seconds or "NNNms" for milliseconds.

Example:

<speak xml:lang="En-US" version="1.0">
Different sized <break strength="none"/> pauses.
Different sized <break strength="x-weak"/> pauses.
Different sized <break strength="weak"/> pauses.
Different sized <break strength="medium"/> pauses.
Different sized <break strength="strong"/> pauses.
Different sized <break strength="x-strong"/> pauses.
Different sized <break time="1s"/> pauses.
Different sized <break time="1000ms"/> pauses.
</speak>

<prosody>

This tag controls the pitch, range, speaking rate, and volume of the text. Each attribute is optional, but at least one must be present: a prosody tag with no attributes results in an error. Here is a description of the optional attributes:

pitch

This attribute modifies the baseline pitch for the text enclosed within the tag. Accepted values are either:

a number followed by the Hz designation
a relative change
"x-low"
"low"
"medium"
"high"
"x-high"
"default"

range

This attribute modifies the pitch range for the text enclosed within the tag. Accepted values for this attribute are the same as the accepted values for pitch.

rate

This attribute indicates a change in the speaking rate for contained text. Accepted values are:

a relative change
a positive number
"x-slow"
"slow"
"medium"
"fast"
"x-fast"
"default"

The rate is specified in terms of words-per-minute. If the speaking rate is 50 words per minute, then rate=50. If the setting is rate=+10, the speaking rate will be 10 words per minute faster than your current rate setting.

Note: When rate is set to a positive number, the implementation is not compliant with the current W3C prosody rate attribute specification.

volume

This attribute modifies the volume for the contained text. Accepted values are numbers from "0.0" to "100.0" or one of the following labels:

"silent"
"x-soft"
"soft"
"medium"
"loud"
"x-loud"
"default"

Examples:

<speak xml:lang="En-US" version="1.0">
<prosody pitch="150Hz">Modified pitch</prosody>
<prosody pitch="-20Hz">Modified pitch</prosody>
<prosody pitch="+20Hz">Modified pitch</prosody>
<prosody pitch="-12st">Modified pitch</prosody>
<prosody pitch="+12st">Modified pitch</prosody>
<prosody pitch="x-low">Modified pitch</prosody>
<prosody range="150Hz">Modified pitch range</prosody>
<prosody range="-20Hz">Modified pitch range</prosody>
<prosody range="+20Hz">Modified pitch range</prosody>
<prosody range="-12st">Modified pitch range</prosody>
<prosody range="+12st">Modified pitch range</prosody>
<prosody range="x-high">Modified pitch range</prosody>
<prosody rate="slow">Modified speaking rate</prosody>
<prosody rate="+25">Modified speaking rate</prosody>
<prosody rate="-25">Modified speaking rate</prosody>
<prosody volume="88.9">Modified volume</prosody>
<prosody volume="loud">Modified volume</prosody>
</speak>
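
The examples above do not include an absolute rate. A sketch based on the words-per-minute description of the rate attribute follows; the sentence text is illustrative only.

```xml
<speak xml:lang="En-US" version="1.0">
<!-- Absolute rate: 50 words per minute, per the rate description above -->
<prosody rate="50">Spoken at fifty words per minute.</prosody>
<!-- Relative rate: 10 words per minute faster than the current setting -->
<prosody rate="+10">Spoken slightly faster than the current rate.</prosody>
</speak>
```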

<audio>

This tag inserts recorded audio into the TTS-generated audio. Its only attribute, src, is required and specifies the location of the file to be inserted.

Example:

<speak xml:lang="En-US" version="1.0">
This is an example of the <audio src="http://www.myfiles.com/files/beep.wav"/> audio being inserted from somewhere else.
</speak>

<mark>

This empty element tag allows the user to place a marker into the text to be synthesized. The synthesis engine notifies the calling program when the engine reaches the marker during synthesis. The mark tag does not affect speech output. It has one required attribute: name. The name attribute is of the type xsd:token.

Example:

<speak xml:lang="En-US" version="1.0">
Example using <mark name="here"/> mark tags.</speak>
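
A document can presumably carry several markers, for example to let the calling program track progress through a prompt. The marker names in this sketch are hypothetical.

```xml
<speak xml:lang="En-US" version="1.0">
<!-- Marker names are hypothetical; each is expected to trigger one notification -->
<mark name="greeting-start"/> Welcome to the service.
<mark name="menu-start"/> Please choose an option.
</speak>
```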

<lexicon>

This tag introduces pronunciation dictionaries for the given SSML document. The lexicon tag is an immediate child of the speak tag. Its required attribute is uri, which specifies the location of the lexicon file.

Example:

<speak xml:lang="En-US" version="1.0">
<lexicon uri="http://www.myfiles.com/lexicons.lex"/>
</speak>
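
In practice the lexicon tag is followed by the text it applies to. The sketch below places it first, as in the W3C SSML 1.0 specification, where lexicon elements come before other content. The sentence text is illustrative only.

```xml
<speak xml:lang="En-US" version="1.0">
<!-- lexicon is a direct child of speak, placed before the text it should affect -->
<lexicon uri="http://www.myfiles.com/lexicons.lex"/>
Text that may contain words defined in the lexicon.
</speak>
```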

SSML tips

Spacing and the <sub> element

The following syntax will manifest a problem when spoken:

Example: <s> The distance is 17<sub alias="feet">ft</sub>. </s>

In this example the TTS engine is supposed to read the word feet normally. Instead, because the <sub> element is adjacent to the numeral 17, the word feet is erroneously spelled character by character. To resolve this, insert a space on either side of the annotation.
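
Applying that fix, a corrected form might look like the following sketch. It assumes the annotation is a <sub> tag whose alias expands ft to feet; the trailing words are illustrative only.

```xml
<speak xml:lang="En-US" version="1.0">
<!-- Spaces sit outside the sub tag, on both sides of the annotation -->
<s> The distance is 17 <sub alias="feet">ft</sub> from here. </s>
</speak>
```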

Additional information regarding the <sub> element and spaces:

When using the <sub> element, ensure that any spaces you need are outside the tag rather than inside it. Any spaces inside the tag are replaced, along with the rest of the enclosed text, by the value of the alias attribute.

For example:

This is 3<sub alias="feet"> ft</sub> -----> will become: This is 3feet
This is 3 <sub alias="feet">ft</sub> -----> will become: This is 3 feet

Similarly with spaces after the <sub> element:

This is 3<sub alias="feet"> ft </sub>2 inches -----> will become: This is 3feet2inches
This is 3 <sub alias="feet">ft</sub> 2 inches -----> will become: This is 3 feet 2 in