Configuring voice settings for agents
Create and manage voice configurations that connect your agent to speech services, enabling spoken conversations through audio-based channels such as Phone with Genesys Audio Connector and Phone with SIP.
Configure your agent to communicate with users by voice. Voice interactions can improve accessibility and enhance user engagement. When an agent is configured with a voice, that voice can be used in audio-based channels, including Phone with Genesys Audio Connector and Phone with SIP. This integration ensures a consistent, natural conversational experience across all supported voice interactions.
On IBM watsonx Orchestrate, you can assign a voice configuration to multiple agents. However, each agent in any environment, whether Draft or Live, can have only one voice configuration.
Before you begin
Before you configure a voice for your agent, make sure that you have the required resources for the Speech to Text and Text to Speech services that you plan to use.
- If you select IBM Watson Speech to Text and IBM Watson Text to Speech as your service providers, you need:
  - Access to IBM Watson Speech to Text and Text to Speech service instances. These services are required to convert voice input into text and to generate spoken responses from text.
  - API details, including the API key, for both instances. You need the API key and endpoint URL for each service to connect them to your agent. To get the API details or to create new instances of these services, access the IBM Cloud page.
- If you select ElevenLabs, you can either bring your own API key from ElevenLabs or sign up on ElevenLabs to create a new one. The API key must:
  - Include the Text to Speech, Voices (read), and Models (read) scopes in ElevenLabs.
  - Be generated from a paid or enterprise trial account. Personal free accounts do not work.
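Optionally, you can confirm that your Watson credentials work before you create the configuration. The following sketch assumes the ibm-watson Python SDK; the API keys and the us-south endpoint URLs are placeholders for your own instance details.

```python
# A minimal sketch, assuming the ibm-watson Python SDK (pip install ibm-watson).
# All keys and URLs below are placeholders; substitute your own instance details.
from ibm_watson import SpeechToTextV1, TextToSpeechV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

stt = SpeechToTextV1(authenticator=IAMAuthenticator("your-stt-api-key"))
stt.set_service_url("https://api.us-south.speech-to-text.watson.cloud.ibm.com")
print(len(stt.list_models().get_result()["models"]), "Speech to Text models available")

tts = TextToSpeechV1(authenticator=IAMAuthenticator("your-tts-api-key"))
tts.set_service_url("https://api.us-south.text-to-speech.watson.cloud.ibm.com")
print(len(tts.list_voices().get_result()["voices"]), "Text to Speech voices available")
```

If both calls return without an authorization error, the same key and URL pairs can be used in the voice configuration.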
Creating a voice configuration
To enable voice interactions in your agent, you must first create a voice configuration. This configuration connects your agent to Speech to Text and Text to Speech services to understand spoken input and respond with synthesized speech. After you create the configuration, assign it to your agent to support voice-based conversations.
To create a voice configuration and enable it in an agent:
- From the main menu, select Manage > Voice.
- Click Create voice configuration.
- In the Details tab, enter a name for the voice configuration and click Next.
- In the Speech to Text tab, configure the Speech to Text service.
  Note: watsonx Orchestrate now supports Deepgram as a Speech to Text provider.
  - Select the Speech to text provider.
  - If you select Watson Speech to Text:
    - Enter the API URL of the Watson Speech to Text instance.
    - Enter the API key of this instance.
    - Select the Speech to Text language model.
    - Enter the unique identifier of the custom language model that you want to use. Leave the field blank if you do not want any customization.
      Note: A custom model can be used only with the base model for which it was created. The base model must match the language model that you select. By default, no custom language model is used with a request.
    - Select a value to set the level at which background audio and side conversations are suppressed in the input audio. The default value is 0.0, which applies no suppression of background audio.
    - Select a value to set the silence duration. This value is the pause interval at which the service splits a transcript into multiple final results when it encounters silence. By default, the service uses a pause interval of 0.8 seconds for all languages except Chinese, for which it uses 0.6 seconds.
    - Set Profanity filter to On if you want the service to censor profanity in its results. By default, the service obscures all profanity by replacing it with a series of asterisks in the transcript.
      Note: The profanity filter is generally available for US English and Japanese only.
    - Select Low latency if you want to receive results faster by optimizing for speed over accuracy.
      Note: Low latency is not available with large speech models and previous-generation models.
    - Set Smart formatting to On if you want dates, times, numbers, phone numbers, currency values, email addresses, and web addresses converted into readable formats that support better post-processing of the transcription.
    - Set Redaction to On if you want numeric data masked in final transcripts. Redaction obscures sensitive numeric data, such as credit card numbers: in any number that has three or more consecutive digits, each digit is replaced with an X character.
    For more information, see the Speech to Text API documentation and Parameter Summary. The sketch after this branch shows how these settings map to API parameters.
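The Watson Speech to Text settings above correspond to parameters of the Speech to Text API. The following sketch assumes the ibm-watson Python SDK; the model name, file name, and values are examples only.

```python
# A hedged sketch of the API parameters behind the UI settings above,
# using the ibm-watson Python SDK. Model, file, and values are examples.
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

stt = SpeechToTextV1(authenticator=IAMAuthenticator("your-stt-api-key"))
stt.set_service_url("https://api.us-south.speech-to-text.watson.cloud.ibm.com")

with open("caller_audio.wav", "rb") as audio:
    response = stt.recognize(
        audio=audio,
        content_type="audio/wav",
        model="en-US_Telephony",            # Speech to Text language model
        background_audio_suppression=0.5,   # 0.0 (default) = no suppression
        end_of_phrase_silence_time=0.8,     # silence duration, in seconds
        profanity_filter=True,              # censor profanity with asterisks
        low_latency=True,                   # favor speed over accuracy
        smart_formatting=True,              # readable dates, numbers, addresses
        redaction=True,                     # mask runs of 3+ digits with X
    ).get_result()

print(response["results"][0]["alternatives"][0]["transcript"])
```

As the notes above describe, some of these parameters apply only to certain models and languages.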
  - If you select Deepgram:
    - Select the model of the voice.
    - Select the language of the voice.
      Note: If you select Multilingual, you can transcribe conversations in which speakers switch between multiple languages.
    - Set Use numerals to On if you want to convert numbers from textual format to numerical format.
    - Enter key terms that help the model recognize important words, such as names, uncommon phrases, or jargon.
    The sketch after this branch shows an equivalent Deepgram request.
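For comparison, the equivalent options appear as query parameters on Deepgram's transcription endpoint. The sketch below uses the requests library; the model name, the parameter spellings, and the key terms are examples, so verify them against Deepgram's current documentation.

```python
# A rough sketch only: parameter names follow Deepgram's public
# transcription API at the time of writing; verify against their docs.
import requests

DEEPGRAM_API_KEY = "your-deepgram-api-key"   # placeholder

with open("caller_audio.wav", "rb") as audio:
    response = requests.post(
        "https://api.deepgram.com/v1/listen",
        headers={
            "Authorization": f"Token {DEEPGRAM_API_KEY}",
            "Content-Type": "audio/wav",
        },
        params={
            "model": "nova-3",        # example model
            "language": "multi",      # multilingual code-switching
            "numerals": "true",       # textual numbers -> digits
            "keyterm": ["Orchestrate", "Genesys"],  # example key terms
        },
        data=audio,
    )
response.raise_for_status()
print(response.json()["results"]["channels"][0]["alternatives"][0]["transcript"])
```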
- Click Next.
- In the Text to Speech tab, configure the Text to Speech service.
  Note: watsonx Orchestrate now supports ElevenLabs and Deepgram as Text to Speech providers.
  - Select the Text to speech provider.
  - If you select Watson Text to Speech:
    - Enter the API URL of the Watson Text to Speech instance.
    - Enter the API key of this instance.
    - Select the model language.
    - Select the model voice.
    - Set the speed and pitch of the voice.
    - Enter the unique identifier of the custom model that you want to use. Leave the field blank if you do not want any customization.
    For more information, see the Text to Speech documentation. The sketch after this branch shows how these settings map to API parameters.
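These settings map to the Watson Text to Speech API. The sketch below assumes the ibm-watson Python SDK; the voice name, URL, and percentage values are examples, with rate_percentage and pitch_percentage adjusting speed and pitch relative to the voice's defaults.

```python
# A minimal sketch, assuming the ibm-watson Python SDK. Voice, URL, and
# adjustment values are examples only.
from ibm_watson import TextToSpeechV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

tts = TextToSpeechV1(authenticator=IAMAuthenticator("your-tts-api-key"))
tts.set_service_url("https://api.us-south.text-to-speech.watson.cloud.ibm.com")

audio = tts.synthesize(
    text="Hello, how can I help you today?",
    voice="en-US_MichaelV3Voice",   # model language and voice
    accept="audio/wav",
    rate_percentage=10,             # speak 10% faster than the default
    pitch_percentage=-5,            # lower the pitch slightly
).get_result().content

with open("greeting.wav", "wb") as f:
    f.write(audio)
```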
  - If you select ElevenLabs:
    - Select the appropriate data center region.
      Note: ElevenLabs operates data centers in both the United States and the European Union. Access to the EU region is available only to users who have a specially provisioned, EU-specific ElevenLabs account.
    - Enter the API key for the data center region that you selected.
    - Select the model.
    - Select the model voice.
    - Select the model language.
    - Set Speaker Boost to On if you want to improve the voice quality.
    - Set the speed and stability of the voice.
    - Set the style and similarity of the voice.
    - Select Auto, On, or Off to control text normalization.
    A request sketch for these settings follows this branch.
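For reference, the sliders and toggles in this branch correspond to fields of the ElevenLabs text-to-speech REST API. The sketch below uses placeholder IDs; confirm the exact field names against the ElevenLabs documentation.

```python
# A hedged sketch of an ElevenLabs synthesis request. The voice ID, model
# ID, and settings are placeholders; verify field names in ElevenLabs docs.
import requests

ELEVENLABS_API_KEY = "your-elevenlabs-api-key"  # needs the Text to Speech scope
VOICE_ID = "your-voice-id"                      # placeholder voice identifier

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": ELEVENLABS_API_KEY},
    json={
        "text": "Hello, how can I help you today?",
        "model_id": "eleven_multilingual_v2",   # example model
        "voice_settings": {
            "stability": 0.5,           # stability slider
            "similarity_boost": 0.75,   # similarity slider
            "style": 0.0,               # style slider
            "use_speaker_boost": True,  # Speaker Boost toggle
        },
    },
)
response.raise_for_status()
with open("greeting.mp3", "wb") as f:
    f.write(response.content)
```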
  - If you select Deepgram:
    - Select the model language.
    - Select the model voice.
    A request sketch for these settings follows this branch.
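In Deepgram's text-to-speech endpoint, the model identifier encodes both the language and the voice. The following is a rough sketch with an example Aura model name; check Deepgram's documentation for current model identifiers and the endpoint shape.

```python
# A rough sketch of a Deepgram text-to-speech request; the model name is
# an example, and the endpoint shape should be verified against their docs.
import requests

DEEPGRAM_API_KEY = "your-deepgram-api-key"  # placeholder

response = requests.post(
    "https://api.deepgram.com/v1/speak",
    headers={"Authorization": f"Token {DEEPGRAM_API_KEY}"},
    params={"model": "aura-asteria-en"},   # example voice model
    json={"text": "Hello, how can I help you today?"},
)
response.raise_for_status()
with open("greeting.mp3", "wb") as f:
    f.write(response.content)
```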
- Use the Preview on the right side of the page to test the voice that you configured.
- Click Next.
- Configure call holds during voice interactions in Audio Cues.
  - Clear Play a typing sound while the AI agent generates a response if you do not want to hear a typing sound during AI responses. By default, this option is selected. The typing sound or on-hold flow plays only when the system detects that the agent is taking longer than expected to respond.
    - If you select this option, select a value to set the time duration (in seconds) for which the typing sound plays.
  - Set the Pre-hold message that plays before the hold music begins.
  - Select the music that plays while the call is on hold.
  - Select a value to set the time duration (in seconds) for which the music plays before the hold message begins.
  - Set the message that plays while callers are on hold.
- Set Enable Voice Activity Detection (VAD) to On if you want to configure how interruptions are handled during a voice conversation.
  - Select a value to set the confidence threshold for detecting speech.
  - Enter the minimum duration (in seconds) of detected speech before it is considered valid and triggers an interruption.
  - Enter the duration of silence (in seconds) that is required to mark the end of speech.
  - Select a value to set the minimum volume level that qualifies as speech.
  The sketch after this step illustrates how these four thresholds interact.
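To make the relationship between these four thresholds concrete, here is a hypothetical sketch of the kind of decision loop a VAD implements. It is illustrative only and does not represent watsonx Orchestrate's internal implementation; the frame structure and values are invented for the example.

```python
# A hypothetical illustration of how VAD thresholds interact; not the
# product's actual implementation.
from dataclasses import dataclass

@dataclass
class VadConfig:
    confidence_threshold: float = 0.6   # minimum speech confidence
    min_speech_duration: float = 0.3    # seconds of speech to count as valid
    end_silence_duration: float = 0.8   # seconds of silence that end speech
    min_volume: float = 0.1             # minimum volume to qualify as speech

def detect_interruption(frames, cfg: VadConfig, frame_seconds: float = 0.02):
    """frames: iterable of (confidence, volume) pairs per 20 ms audio frame."""
    speech, silence = 0.0, 0.0
    for confidence, volume in frames:
        if confidence >= cfg.confidence_threshold and volume >= cfg.min_volume:
            speech += frame_seconds
            silence = 0.0
            if speech >= cfg.min_speech_duration:
                return True       # valid speech: interrupt the agent
        else:
            silence += frame_seconds
            if silence >= cfg.end_silence_duration:
                speech = 0.0      # long silence: speech has ended, reset
    return False

frames = [(0.9, 0.5)] * 20 + [(0.1, 0.0)] * 50   # 0.4 s speech, then silence
print(detect_interruption(frames, VadConfig()))   # True: speech exceeds 0.3 s
```

Lower confidence and volume thresholds make the agent easier to interrupt; a longer minimum speech duration filters out coughs and background noise.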
- Configure Dual-Tone Multi-Frequency (DTMF) settings to customize how your agent handles keypad input.
  - Select a value to set the maximum wait time for additional keypad input after the caller presses a digit.
  - Select the DTMF termination character that ends input collection, such as #.
  - Enter a value to set the maximum number of digits to collect before the input is processed.
  The sketch after this step illustrates how these settings interact during digit collection.
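To see how the three DTMF settings work together, consider this hypothetical digit-collection loop. It illustrates the behavior described above and is not product code; the queue-based input source is invented for the example.

```python
# A hypothetical illustration of DTMF digit collection; not product code.
import queue

def collect_dtmf(events: "queue.Queue[str]",
                 interdigit_timeout: float = 3.0,  # max wait after each digit
                 termination_char: str = "#",      # ends input collection
                 max_digits: int = 6) -> str:      # process once reached
    """Collect digits until the terminator, the digit limit, or a timeout."""
    digits = ""
    while True:
        try:
            key = events.get(timeout=interdigit_timeout)
        except queue.Empty:
            return digits                  # timed out waiting for more input
        if key == termination_char:
            return digits                  # terminator ends collection
        digits += key
        if len(digits) >= max_digits:
            return digits                  # digit limit reached

q = queue.Queue()
for key in "1234#":
    q.put(key)
print(collect_dtmf(q))   # prints 1234
```

For example, with a terminator of # and a maximum of 6 digits, the input 1234# returns 1234, while 123456 is processed as soon as the sixth digit arrives.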
- Set Manage user silence to On if you want to set up silence detection and recovery prompts for your agent.
  - Select a value to set the silence duration threshold. The agent waits for this duration (in seconds) before it identifies the user as silent.
  - Enter the number of check-in repetitions. This value determines how many times the agent attempts to reengage a silent user. After these attempts, the agent ends the conversation.
  - Enter the prompt message that is used for both the initial and repeated check-ins.
  - Enter the message to play before the call ends when the maximum number of attempts is reached. Leave this field empty to end the call without playing a message.
  The sketch after this step illustrates the resulting check-in loop.
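The silence-management settings drive a simple check-in loop, sketched hypothetically below. The helper functions are stubs invented for illustration, not real platform APIs.

```python
# A hypothetical sketch of the silence check-in loop; the helper functions
# are stubs invented for illustration, not real platform APIs.
import random

def wait_for_user_speech(timeout: float) -> bool:
    """Stub: pretend to listen for speech for `timeout` seconds."""
    return random.random() < 0.3  # 30% chance the user speaks

def play(message: str) -> None:
    print(f"[agent says] {message}")

def manage_silence(silence_threshold: float, max_checkins: int,
                   checkin_prompt: str, closing_message: str) -> None:
    for _ in range(max_checkins):
        if wait_for_user_speech(timeout=silence_threshold):
            return                      # the user spoke; resume the conversation
        play(checkin_prompt)            # try to reengage the silent user
    if closing_message:
        play(closing_message)           # optional message before the call ends
    print("[call ended]")

manage_silence(10.0, 3, "Are you still there?", "Goodbye for now.")
```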
- Click Finish.
You have a voice configuration available to define voice interactions for your agent.
Editing a voice configuration
You can update an existing voice configuration to change the voice settings that your agent uses. By editing a configuration, you can switch to a different voice or adjust language support without creating a new configuration.
To edit the voice configuration:
- On the Voice page, find the voice configuration and click its vertical ellipsis icon.
- Select Edit.
- Apply your changes across the three tabs.
- Click Save.
After you save your changes, the updated voice configuration is applied to your agent.
Deleting a voice configuration
If a voice configuration is no longer needed, you can delete it to keep your agent settings organized and up to date.
To delete the voice configuration:
- On the Voice page, find the voice configuration and click its vertical ellipsis icon.
- Select Delete.
- In the confirmation window that appears, click Delete.
If the voice configuration is linked to one or more agents, you must remove all of those links before you can delete it.
After you delete the configuration, it is removed from the list and can no longer be used by any agents.
Selecting the voice in the agent
After you create a voice configuration, assign it to your agent to enable voice communication. This step connects your agent to the configured speech services, allowing it to process spoken input and respond with synthesized speech. Assigning the voice configuration ensures that your agent supports voice-based interactions during conversations.
To select the voice configuration in your agent:
- Open the agent in the agent builder.
- In the Profile tab, go to the Voice modality section.
- In the Voice configuration field, start typing the name of the voice configuration. From the list that is displayed, select the voice configuration.
  Note: To enable the voice modality, you must create at least one voice configuration.
- Configure the welcome message type that the agent delivers at the start of a voice interaction.
  - Select AI-generated welcome message if you want the agent to deliver an AI-generated welcome message.
  - Select Static welcome message if you want the agent to deliver the custom welcome message that you set in Welcome message.
After you select the voice configuration, your agent is ready to handle voice interactions.
Testing the voice
After you assign a voice configuration to your agent, you can test it in the chat preview to make sure that voice interactions work as expected. Testing helps you verify that the agent can recognize spoken input and respond with the correct voice output before you deploy the agent.
To test the voice configuration:
- Open the agent in the agent builder.
- In the Preview, click the voice chat icon to begin the voice chat.
- Allow microphone access in your browser when prompted.
- Click the microphone control to mute or unmute the conversation.
- Click Show keypad to open the keypad.
  You can use the keypad to test DTMF settings without ending the current voice chat or making a phone call. When you use the keypad, it sends DTMF events to the agent. Opening the keypad does not end the voice chat. You can continue speaking while the keypad is displayed.
- Click Hide keypad to close the keypad and continue the voice chat session.
- After you finish speaking with the agent, click the end-chat control to end the voice chat.
You can continue in chat mode in the same session when the voice chat ends. When you begin typing, the voice mode automatically switches to chat mode, and the voice chat icon changes to the send icon. The voice mode and send message controls are consolidated into a single control for a cleaner chat interface.
After testing, review the agent’s responses to confirm that the voice behaves as intended. If needed, edit the voice configuration to make adjustments before you deploy the agent.
Enabling Voice Mode
To speak with the agent on the Orchestrate Chat page, you must enable the Voice Mode toggle. Enabling Voice Mode makes your agent ready for voice-based interactions in the live chat environment.
To enable Voice Mode:
- Open the agent in the agent builder.
- In the Channels tab, select the Home page section, and enable the toggle.
After you enable Voice Mode, click the voice chat icon to start speaking with the agent on the Orchestrate Chat page. Allow microphone access in your browser when prompted.
During your conversation, you can click the microphone control to mute or unmute the conversation. By clicking Show keypad, you can open the keypad and use it to test DTMF settings without ending the current voice chat or making a phone call. When you use the keypad, it sends DTMF events to the agent. Opening the keypad does not end the voice interaction. You can continue speaking while the keypad is displayed. By clicking Hide keypad, you can close the keypad and continue the voice chat session.
After you finish speaking with the agent, click the end-chat control to end the voice chat.
You can continue in chat mode in the same session when the voice chat ends. When you begin typing, the voice mode automatically switches to chat mode, and the voice chat icon changes to the send icon. The voice mode and send message controls are consolidated into a single control for a cleaner chat interface.
Enabling Voice in Embedded agent
You can enhance your embedded agent with voice input and output to support natural, spoken interactions. After you create a voice configuration and assign it to your agent, you can enable it in the embedded agent to improve your customized chat experience. This setup allows the embedded agent to interpret user speech, generate audio responses, and engage users through seamless, conversational voice interactions.
For more information, see Enabling voice capabilities in the embedded agent.
What to do next
After you create a voice configuration and assign it to an agent, the agent can be connected to audio-based channels, including Phone with Genesys Audio Connector and Phone with SIP.
For more information, see: