This section gives a list of SSML tags and examples of how those
tags are used.
- <speak>
- This is the root element for SSML documents. Valid attributes are:
- xml:lang
- This is a required attribute specifying the language. Accepted values
are the language tags defined in RFC 3066 (http://www.ietf.org/rfc/rfc3066.txt).
- xml:base
- This is an optional attribute specifying the base URI to use for resolving
relative paths.
- version
- This is a required attribute specifying the version of the SSML specification.
The accepted value is "1.0".
Example:
<speak xml:lang="En-US" version="1.0"
xml:base="http://www.myfileserver.com/mydir">text to be spoken</speak>
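When SSML is generated by a program, the required attributes of the root element can be enforced in one place. The following is a minimal Python sketch, not part of the TTS product: the helper name speak and its parameters are hypothetical, and it only builds the string described above.

```python
from xml.sax.saxutils import escape, quoteattr


def speak(text, lang, base=None):
    """Wrap text in a <speak> root element (hypothetical helper).

    xml:lang and version are always written because both are required;
    xml:base is written only when a base URI is supplied.
    """
    if not lang:
        raise ValueError("xml:lang is a required attribute")
    attrs = ["xml:lang=" + quoteattr(lang), 'version="1.0"']
    if base is not None:
        attrs.append("xml:base=" + quoteattr(base))
    # escape() protects characters such as < and & in the spoken text.
    return "<speak " + " ".join(attrs) + ">" + escape(text) + "</speak>"


print(speak("text to be spoken", "En-US",
            base="http://www.myfileserver.com/mydir"))
```

Escaping the text keeps the document well-formed even when the spoken text contains characters that are special in XML.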
- <paragraph> or <p> or <sentence> or <s>
- These are optional tags that can be used to give text structure hints
to the TTS system. The only valid attribute is xml:lang. A value is
accepted for it, but it has no effect because language switching is not supported.
Example:
<speak xml:lang="En-US" version="1.0">
<paragraph>
<sentence>Text within a sentence tag.</sentence>
<s>More text.</s>
</paragraph>
</speak>
Note: If the enclosed text in an SSML <sentence> or <paragraph> tag
does not end with an end-of-sentence punctuation character (like a period),
a longer than normal pause is added to the synthesized audio for this text.
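Because an SSML document is XML, a missing angle bracket or an unclosed structure tag can be caught before the text is sent to the TTS engine. The sketch below uses only the Python standard library; the function name is_well_formed is hypothetical, and the check covers XML well-formedness only, not whether the engine supports a given tag or attribute.

```python
import xml.etree.ElementTree as ET


def is_well_formed(ssml):
    """Return True if the SSML string parses as well-formed XML."""
    try:
        ET.fromstring(ssml)
    except ET.ParseError:
        return False
    return True


document = (
    '<speak xml:lang="En-US" version="1.0">'
    "<paragraph>"
    "<sentence>Text within a sentence tag.</sentence>"
    "<s>More text.</s>"
    "</paragraph>"
    "</speak>"
)
print(is_well_formed(document))
```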
- <say-as>
- The say-as tag allows the author to indicate information on the type of
text contained within the tag and to help specify the level of detail for
rendering the text. The required attribute for this tag is interpret-as.
There are two optional attributes, format and detail, which
are only used with particular values within the interpret-as attribute.
These optional attributes are illustrated within the entries for their associated
values.
- letters
- This value spells out the characters in a given word within the enclosed
tag.
Example (This will spell out "HELLO"):
<speak xml:lang="En-US" version="1.0">
<say-as interpret-as="letters">Hello</say-as>
</speak>
- digits
- This value spells out the digits in a given number within the enclosed
tag.
Example (This will spell out "123456"):
<speak xml:lang="En-US" version="1.0">
<say-as interpret-as="digits">123456</say-as>
</speak>
- vxml:digits
- This value performs the same function as the digits value.
Example:
<speak xml:lang="En-US" version="1.0">
<say-as interpret-as="vxml:digits">123456</say-as>
</speak>
- date
- This value will speak the date within the enclosed tag, using the format
given in the associated format attribute. The format attribute
is nominally required with the date value of interpret-as. If format is
omitted, the engine will still attempt to pronounce the date.
Example (This lists a date in each of the supported formats):
<speak xml:lang="En-US" version="1.0">
<say-as interpret-as="date" format="mdy">12/17/2005</say-as>
<say-as interpret-as="date" format="ymd">2005/12/17</say-as>
<say-as interpret-as="date" format="dmy">17/12/2005</say-as>
<say-as interpret-as="date" format="ydm">2005/17/12</say-as>
<say-as interpret-as="date" format="my">12/2005</say-as>
<say-as interpret-as="date" format="md">12/17</say-as>
<say-as interpret-as="date" format="ym">2005/12</say-as>
</speak>
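The format codes in the example above map directly onto date patterns, so the enclosed text and the format attribute can be generated together. The following Python sketch assumes the slash-separated layout used in the example; the table DATE_PATTERNS and the helper say_as_date are hypothetical names, not part of the TTS product.

```python
from datetime import date

# Format codes from the example above, mapped to strftime patterns.
DATE_PATTERNS = {
    "mdy": "%m/%d/%Y",
    "ymd": "%Y/%m/%d",
    "dmy": "%d/%m/%Y",
    "ydm": "%Y/%d/%m",
    "my": "%m/%Y",
    "md": "%m/%d",
    "ym": "%Y/%m",
}


def say_as_date(value, fmt="mdy"):
    """Build a <say-as interpret-as="date"> element for a date object."""
    if fmt not in DATE_PATTERNS:
        raise ValueError("unsupported date format: " + fmt)
    text = value.strftime(DATE_PATTERNS[fmt])
    return '<say-as interpret-as="date" format="%s">%s</say-as>' % (fmt, text)


print(say_as_date(date(2005, 12, 17), "dmy"))
```

Generating both pieces from one table keeps the format attribute and the written date from drifting apart.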
- ordinal
- This value will speak the ordinal value for the given digit within the
enclosed tag.
Example (This will say "second first"):
<speak xml:lang="En-US" version="1.0">
<say-as interpret-as="ordinal">2</say-as>
<say-as interpret-as="ordinal">1</say-as>
</speak>
- cardinal
- This value will speak the cardinal number corresponding to the Roman numeral
within the enclosed tag.
Example (This will say "Super Bowl thirty-nine"):
<speak xml:lang="En-US" version="1.0">
Super Bowl <say-as interpret-as="cardinal">XXXIX</say-as>
</speak>
- number
- This value is an alternative to using the values given above. The format attribute
determines how the number is interpreted, so you can enter one series
of digits and have it pronounced several different ways, as in the example.
The example also includes two different ways of pronouncing a series of digits
as a telephone number. To have the series pronounced with the punctuation
included, you must add the detail attribute.
Example:
<speak xml:lang="En-US" version="1.0">
<say-as interpret-as="number">123456</say-as>
<say-as interpret-as="number" format="ordinal">123456</say-as>
<say-as interpret-as="number" format="cardinal">123456</say-as>
<say-as interpret-as="number" format="telephone">555-555-5555</say-as>
<say-as interpret-as="number" format="telephone" detail="punctuation">555-555-5555</say-as>
</speak>
- vxml:boolean
- This value will speak "yes" or "no" depending on the value given within
the enclosed tag.
Example:
<speak xml:lang="En-US" version="1.0">
<say-as interpret-as="vxml:boolean">true</say-as>
<say-as interpret-as="vxml:boolean">false</say-as>
</speak>
- vxml:date
- This value works like the date value, except that the format is
predefined as YYYYMMDD. When a field is not known, or you do not want it to
be spoken, question marks replace the digits of that field, as shown in the
example.
Example:
<speak xml:lang="En-US" version="1.0">
<say-as interpret-as="vxml:date">20050720</say-as>
<say-as interpret-as="vxml:date">????0720</say-as>
<say-as interpret-as="vxml:date">200507??</say-as>
</speak>
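The YYYYMMDD string, including the question-mark placeholders, can be assembled from optional fields. This is a minimal Python sketch under that reading of the format; the helper name vxml_date is hypothetical.

```python
def vxml_date(year=None, month=None, day=None):
    """Build a <say-as interpret-as="vxml:date"> element.

    A field passed as None is written as question marks, one per digit.
    """
    y = "%04d" % year if year is not None else "????"
    m = "%02d" % month if month is not None else "??"
    d = "%02d" % day if day is not None else "??"
    return '<say-as interpret-as="vxml:date">%s%s%s</say-as>' % (y, m, d)


print(vxml_date(2005, 7, 20))
print(vxml_date(month=7, day=20))
print(vxml_date(2005, 7))
```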
- vxml:currency
- This value is used to control the synthesis of monetary quantities. The
string must be written in the "UUUmm.nn" format, where "UUU" is the
three-character currency indicator specified by ISO standard 4217, and "mm.nn" is
the amount.
Example (This will say "forty-five dollars and thirty cents"):
<speak xml:lang="En-US" version="1.0">
<say-as interpret-as="vxml:currency">USD45.30</say-as>
</speak>
If there are more than two decimal places in the number within the enclosed
tag, the amount will be synthesized as a decimal number followed by the
currency indicator. If the three-character currency indicator is not present,
the number will be synthesized as a decimal only, with no pronunciation of
currency type.
Example 2 (This will say "forty-five point three two nine US dollars"):
<speak xml:lang="En-US" version="1.0">
<say-as interpret-as="vxml:currency">USD45.329</say-as>
</speak>
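Since an amount with more than two decimal places is read as a plain decimal, a generator may want to fix the amount at exactly two places before writing the "UUUmm.nn" string. The Python sketch below does that with the standard decimal module; the helper name vxml_currency is hypothetical, and rounding to two places is a choice made for this sketch, not a requirement of the engine.

```python
from decimal import Decimal


def vxml_currency(amount, code="USD"):
    """Build a <say-as interpret-as="vxml:currency"> element.

    The amount is rounded to two decimal places so that it is written
    in the "UUUmm.nn" form described above.
    """
    if len(code) != 3 or not code.isalpha():
        raise ValueError("currency indicator must be three letters")
    # str() avoids binary floating-point noise when a float is passed in.
    value = Decimal(str(amount)).quantize(Decimal("0.01"))
    return '<say-as interpret-as="vxml:currency">%s%s</say-as>' % (
        code.upper(), value)


print(vxml_currency("45.3"))
```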
- vxml:phone
- This value will speak a phone number with both digits and punctuation,
similar to the number value used with format="telephone".
Example:
<speak xml:lang="En-US" version="1.0">
<say-as interpret-as="vxml:phone">555-555-5555</say-as>
</speak>
- <phoneme>
- The SSML phoneme tag enables users to provide a phonetic pronunciation
for the enclosed text. This tag has two attributes:
- alphabet
- This attribute specifies the phonology used. The supported alphabets are
"ipa", for the International Phonetic Alphabet, and "ibm", for the SPR
representation discussed in "Introduction to symbolic phonetic representations".
The alphabet attribute is optional. If no alphabet is designated, the default
value used is "ibm".
- ph
- This attribute specifies the pronunciation. It is a required attribute.
This example shows how a pronunciation for "tomato" is specified
using the IPA phonology, where the symbols are given using Unicode:
<speak xml:lang="En-US" version="1.0">
<phoneme alphabet="ipa" ph="təmeiɾou̥">tomato</phoneme>
</speak>
This example shows how a pronunciation for "tomato"
is specified using the SPR phonology:
<speak xml:lang="En-US" version="1.0">
<phoneme alphabet="ibm" ph=".0tx.1me.0fo">tomato</phoneme>
</speak>
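The attribute rules for the phoneme tag (ph required, alphabet optional and limited to two values) can be checked when the element is generated. The following Python sketch is illustrative only; the helper name phoneme is hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr

SUPPORTED_ALPHABETS = ("ipa", "ibm")


def phoneme(text, ph, alphabet=None):
    """Build a <phoneme> element.

    When alphabet is None the attribute is omitted, which leaves the
    engine on its documented default of "ibm".
    """
    if not ph:
        raise ValueError("ph is a required attribute")
    if alphabet is not None and alphabet not in SUPPORTED_ALPHABETS:
        raise ValueError("alphabet must be 'ipa' or 'ibm'")
    attrs = []
    if alphabet is not None:
        attrs.append("alphabet=" + quoteattr(alphabet))
    attrs.append("ph=" + quoteattr(ph))
    return "<phoneme " + " ".join(attrs) + ">" + escape(text) + "</phoneme>"


print(phoneme("tomato", ".0tx.1me.0fo", alphabet="ibm"))
```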
- <sub>
- This tag is used to indicate that the text included in the alias attribute
is to replace the text enclosed within the tag when speech is synthesized.
The only attribute for this tag is the alias attribute, and it is required.
If the alias attribute is not defined, an error results.
Example:
<speak xml:lang="En-US" version="1.0">
<sub alias="International Business Machines">IBM</sub>
</speak>
- <voice>
- This tag is used when a change in voice is required. Each attribute
listed is optional on its own, but at least one must be defined. A voice tag
with no attributes results in an error. The optional attributes are:
- age
- Accepted values are integers between 14 and 60, for both male and
female voices.
- gender
- Accepted values are "male" and "female".
- name
- Accepted values are the installed voices' names.
- variant
- Accepted values are positive integers.
Examples:
<speak xml:lang="En-US" version="1.0">
<voice age="30">A voice with an age of 30.</voice>
<voice gender="female">This is a female voice.</voice>
<voice name="Allison">Use the IBM TTS voice named Allison.</voice>
<voice name="Allison, Andrew, Tyler">Use the first available IBM TTS voice named in the given list.</voice>
</speak>
When using voice variant, you must have two
female voices of the same language installed.
Example:
<voice variant="1">Hello, my name is Tyler, I am the second
US English female voice for TTS.</voice>
You do not need to specify a voice variant to use the default voice,
but to change to a different voice, you must specify the voice variant as
"1".
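The constraints on the voice tag (at least one attribute, age between 14 and 60, gender limited to two values, variant a positive integer) lend themselves to a small check at generation time. The Python sketch below is one way to write it; the helper name voice is hypothetical, and it does not verify that a named voice is actually installed.

```python
from xml.sax.saxutils import escape, quoteattr


def voice(text, age=None, gender=None, name=None, variant=None):
    """Build a <voice> element, rejecting a tag with no attributes."""
    attrs = []
    if age is not None:
        if not 14 <= age <= 60:
            raise ValueError("age must be between 14 and 60")
        attrs.append('age="%d"' % age)
    if gender is not None:
        if gender not in ("male", "female"):
            raise ValueError("gender must be 'male' or 'female'")
        attrs.append('gender="%s"' % gender)
    if name is not None:
        attrs.append("name=" + quoteattr(name))
    if variant is not None:
        if variant < 1:
            raise ValueError("variant must be a positive integer")
        attrs.append('variant="%d"' % variant)
    if not attrs:
        raise ValueError("a <voice> tag needs at least one attribute")
    return "<voice " + " ".join(attrs) + ">" + escape(text) + "</voice>"


print(voice("This is a female voice.", gender="female"))
```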
- <emphasis>
- The <emphasis> element is currently not supported.
- <break>
- This tag inserts pauses into the spoken text. It has the following optional
attributes:
- strength
- This attribute specifies the length of a pause in terms of varying strength
values: "none", "x-weak", "weak", "medium", "strong", or "x-strong".
- time
- This attribute specifies the length of the pause in terms of seconds or
milliseconds. The value formats are "NNNs" for seconds or "NNNms" for milliseconds.
Example:
<speak xml:lang="En-US" version="1.0">
Different sized <break strength="none"/> pauses.
Different sized <break strength="x-weak"/> pauses.
Different sized <break strength="weak"/> pauses.
Different sized <break strength="medium"/> pauses.
Different sized <break strength="strong"/> pauses.
Different sized <break strength="x-strong"/> pauses.
Different sized <break time="1s"/> pauses.
Different sized <break time="1000ms"/> pauses.
</speak>
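The accepted strength values and the "NNNs" / "NNNms" time formats can be validated before a break tag is written. The following Python sketch is illustrative; the helper name break_tag is hypothetical. It writes the element in its empty form, as the W3C SSML 1.0 specification defines it, and reads "NNN" as one or more digits, which is an assumption about the format.

```python
import re

STRENGTHS = ("none", "x-weak", "weak", "medium", "strong", "x-strong")
# One or more digits followed by "s" or "ms", for example "1s" or "250ms".
TIME_PATTERN = re.compile(r"\d+(ms|s)")


def break_tag(strength=None, time=None):
    """Build an empty <break/> element from the optional attributes."""
    attrs = []
    if strength is not None:
        if strength not in STRENGTHS:
            raise ValueError("unknown strength: " + strength)
        attrs.append('strength="%s"' % strength)
    if time is not None:
        if not TIME_PATTERN.fullmatch(time):
            raise ValueError("time must look like '1s' or '1000ms'")
        attrs.append('time="%s"' % time)
    if not attrs:
        return "<break/>"
    return "<break " + " ".join(attrs) + "/>"


print("Different sized " + break_tag(time="1000ms") + " pauses.")
```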
- <prosody>
- This tag controls the pitch, range, speaking rate, and volume of the text.
All attributes are optional, but if no attribute is given an error results.
Here is a description of the optional attributes:
- pitch
- This attribute modifies the baseline pitch for the text enclosed within
the tag. Accepted values are either:
- a number followed by the Hz designation
- a relative change
- "x-low"
- "low"
- "medium"
- "high"
- "x-high"
- "default"
- range
- This attribute modifies the pitch range for the text enclosed within the
tag. Accepted values for this attribute are the same as the accepted values
for pitch.
- rate
- This attribute indicates a change in the speaking rate for contained
text. Accepted values are:
- a relative change
- a positive number
- "x-slow"
- "slow"
- "medium"
- "fast"
- "x-fast"
- "default"
The rate is specified in terms of words per minute. If the speaking
rate is 50 words per minute, then rate="50". If the setting is rate="+10", the
speaking rate will be 10 words per minute faster than your current rate setting.
Note: When rate is set to a positive number, the implementation is not compliant
with the current W3C prosody rate attribute specification.
- volume
- This attribute modifies the volume for the contained text. Accepted values
are numbers from "0.0" to "100.0" or one of the following labels:
- "silent"
- "x-soft"
- "soft"
- "medium"
- "loud"
- "x-loud"
- "default"
Examples:
<speak xml:lang="En-US" version="1.0">
<prosody pitch="150Hz"> Modified pitch </prosody>
<prosody pitch="-20Hz"> Modified pitch </prosody>
<prosody pitch="+20Hz"> Modified pitch </prosody>
<prosody pitch="-12st"> Modified pitch </prosody>
<prosody pitch="+12st"> Modified pitch </prosody>
<prosody pitch="x-low"> Modified pitch </prosody>
<prosody range="150Hz"> Modified pitch range</prosody>
<prosody range="-20Hz"> Modified pitch range</prosody>
<prosody range="+20Hz">Modified pitch range</prosody>
<prosody range="-12st">Modified pitch range</prosody>
<prosody range="+12st">Modified pitch range</prosody>
<prosody range="x-high">Modified pitch range</prosody>
<prosody rate="slow">Modified speaking rate</prosody>
<prosody rate="+25">Modified speaking rate</prosody>
<prosody rate="-25">Modified speaking rate</prosody>
<prosody volume="88.9">Modified volume</prosody>
<prosody volume="loud">Modified volume</prosody>
</speak>
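As one illustration of the value rules above, the volume attribute accepts either a number in the 0.0 to 100.0 range or one of the listed labels. The Python sketch below checks that rule before writing the tag; the helper name prosody_volume is hypothetical, and formatting the number with one decimal place is a choice made for this sketch.

```python
from xml.sax.saxutils import escape

VOLUME_LABELS = ("silent", "x-soft", "soft", "medium", "loud", "x-loud",
                 "default")


def prosody_volume(text, volume):
    """Build a <prosody volume="..."> element.

    volume is either one of the labels above or a number from 0.0 to 100.0.
    """
    if isinstance(volume, str) and volume in VOLUME_LABELS:
        value = volume
    else:
        number = float(volume)  # raises ValueError for unknown labels
        if not 0.0 <= number <= 100.0:
            raise ValueError("volume must be between 0.0 and 100.0")
        value = "%.1f" % number
    return '<prosody volume="%s">%s</prosody>' % (value, escape(text))


print(prosody_volume("Modified volume", 88.9))
print(prosody_volume("Modified volume", "loud"))
```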
- <audio>
- This tag inserts recorded elements into the TTS generated audio. The only
attribute is src, which is required. This attribute specifies the location
of the file to be inserted.
Example:
<speak xml:lang="En-US" version="1.0">
This is an example of the <audio src="http://www.myfiles.com/files/beep.wav"/> audio being inserted from somewhere else.
</speak>
- <mark>
- This empty element tag allows the user to place a marker into the text
to be synthesized. The synthesis engine notifies the calling program when
the engine reaches the marker during synthesis. The mark tag does not affect
speech output. It has one required attribute: name. The name attribute
is of the type xsd:token.
Example:
<speak xml:lang="En-US" version="1.0">
Example using <mark name="here"/> mark tags.</speak>
- <lexicon>
- This tag introduces pronunciation dictionaries for the given SSML document.
The lexicon tag is an immediate child of the speak tag. Its required attribute
is uri, which specifies the location of the lexicon file.
Example:
<speak xml:lang="En-US" version="1.0">
<lexicon uri="http://www.myfiles.com/lexicons.lex"/>
</speak>
SSML tips
- Spacing and the <sub> element
- The following syntax causes a problem when spoken:
- Example: <s> The distance is 17<sub alias="feet"> ft. </sub></s>
In this example the TTS engine is supposed to read the word feet normally.
Instead, because the <sub> element is adjacent to the
numeral 17, the word feet is erroneously spelled character-by-character.
To resolve this, insert a space on either side of the <sub> element.
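When SSML is assembled by a program, the spacing rule can be applied automatically. The Python sketch below always places one space on each side of the element and strips whitespace from the enclosed text; the helper name sub is hypothetical, and padding with exactly one space is a choice made for this sketch.

```python
from xml.sax.saxutils import escape, quoteattr


def sub(text, alias):
    """Build a <sub> element with a space on either side of it.

    Whitespace inside the enclosed text is stripped, so the only spaces
    are the ones written outside the tag.
    """
    if not alias:
        raise ValueError("alias is a required attribute")
    inner = text.strip()
    if not inner:
        raise ValueError("the enclosed text must not be empty")
    return " <sub alias=" + quoteattr(alias) + ">" + escape(inner) + "</sub> "


print("This is 3" + sub(" ft ", "feet") + "2 inches")
```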
Additional information regarding the <sub> element and spaces:
When using the <sub> element, ensure any spaces you
need are on the outside of the tag as opposed to inside the tag. Any spaces
inside the tag will be replaced by whatever values are in the alias attribute.
For example:
- This is 3<sub alias="feet"> ft</sub> -----> will become: This is 3feet
- This is 3 <sub alias="feet">ft</sub> -----> will become: This is 3 feet
Similarly with spaces after the </sub>:
- This is 3<sub alias="feet"> ft </sub>2 inches -----> will become: This is 3feet2inches
- This is 3 <sub alias="feet">ft</sub> 2 inches -----> will become: This is 3 feet 2 in