Troubleshooting your Watson Speech services installation

You can use this troubleshooting information to diagnose and resolve problems with your Speech services installation. It describes example failure scenarios and explains how to identify and debug their root causes.

Permissions you need for these tasks:
You must be an administrator of the Red Hat® OpenShift® project to manage the cluster.

Troubleshooting topics

See the following scenarios for more information about troubleshooting the different problems:

Note: In the commands, ${PROJECT_CPD_INST_OPERATORS} is the name of the project (namespace) in which the Watson Speech operator is deployed, and ${PROJECT_CPD_INST_OPERANDS} is the name of the project (namespace) in which the Speech services are installed.

Watson Speech is missing RabbitMQ operand pods after upgrade to 4.8.5

Symptoms: The stt-async pod is stuck in the Init state, waiting for RabbitMQ:
speech-cr-stt-async-849f8d46b9-zdlkn                              0/1     Init:1/3    0             8h
The RabbitMQ operand pods are not created, and the RabbitMQ StatefulSet (sts) reports errors. To see the errors, run the following command:
oc describe sts speech-cr-rabbitmq -n ${PROJECT_CPD_INST_OPERANDS}
Error:
Warning  FailedCreate  8m38s (x27 over 69m)  statefulset-controller  create Pod speech-cr-rabbitmq-ibm-rabbitmq-0 in StatefulSet speech-cr-rabbitmq-ibm-rabbitmq failed error: pods "speech-cr-rabbitmq-ibm-rabbitmq-0" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider "wkc-iis-scc": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .initContainers[0].runAsUser: Invalid value: 999: must be in the ranges: [1000790000, 1000799999], provider restricted-v2: .containers[0].runAsUser: Invalid value: 999: must be in the ranges: [1000790000, 1000799999], provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "noobaa-db": Forbidden: not usable by user or serviceaccount, provider "noobaa-endpoint": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "rook-ceph": Forbidden: not usable by user or serviceaccount, provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "rook-ceph-csi": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount]
Workaround: Delete the RabbitMQ custom resource (CR). The Speech operator then recreates it.
oc delete rabbitmqcluster speech-cr-rabbitmq -n ${PROJECT_CPD_INST_OPERANDS}
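
The check and the workaround can be combined into a small guard, so that the RabbitMQ custom resource is deleted only when the security context constraint error is actually present. The following is a minimal sketch, not an official procedure. It assumes the resource names shown above and that ${PROJECT_CPD_INST_OPERANDS} is set; the function name rabbitmq_blocked_by_scc is hypothetical.

```shell
# Returns success (0) when the StatefulSet description contains the
# security context constraint error shown above.
rabbitmq_blocked_by_scc() {
  oc describe sts speech-cr-rabbitmq -n "${PROJECT_CPD_INST_OPERANDS}" 2>/dev/null \
    | grep -q "unable to validate against any security context constraint"
}

# Example use: delete the custom resource only if the error is present.
# if rabbitmq_blocked_by_scc; then
#   oc delete rabbitmqcluster speech-cr-rabbitmq -n "${PROJECT_CPD_INST_OPERANDS}"
# fi
```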

The Watson Speech runtime pods and model/voice upload pods are stuck in Init state

The Multicloud Object Gateway must be installed as a prerequisite of the Watson Speech service. If the Speech to Text runtime pod and model upload pod (or the Text to Speech runtime pod and voice upload pod) are stuck in the Init state, Multicloud Object Gateway may not be installed or configured correctly. To confirm, run the following command:


oc describe pod runtime-pod-name -n ${PROJECT_CPD_INST_OPERANDS}

If the output contains the following warning message, the Multicloud Object Gateway datastore was not installed or configured properly.


Warning  FailedMount  101s (x10 over 5m51s)  kubelet            MountVolume.SetUp failed for volume "minio-account" : secret "noobaa-account-watson-speech" not found

Once the Multicloud Object Gateway is installed and the Watson Speech service secret is configured, the service will be able to connect to the Multicloud Object Gateway and installation will proceed.
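
Because the warning names the missing secret, you can also check for the secret directly instead of reading the pod events. A minimal sketch, assuming the secret name that appears in the warning message above; the function name mcg_secret_exists is hypothetical.

```shell
# Returns success (0) when the secret that the runtime pods mount
# exists in the project where the Speech services are installed.
mcg_secret_exists() {
  oc get secret noobaa-account-watson-speech \
    -n "${PROJECT_CPD_INST_OPERANDS}" >/dev/null 2>&1
}

# Example use:
# if mcg_secret_exists; then
#   echo "Secret found."
# else
#   echo "Secret missing: check the Multicloud Object Gateway setup."
# fi
```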

The Watson Speech operator pod fails to start

The Watson Speech operator pod fails to start.

  1. Learn the name of the pod for the operator:

    oc get pods -l app.kubernetes.io/name=watson-speech -n ${PROJECT_CPD_INST_OPERATORS}
  2. Use the following command to learn more about the nature of the problem. In the command, pod-name is the name of a pod whose status you want to learn.

    oc describe pod pod-name -n ${PROJECT_CPD_INST_OPERATORS}
  3. You can send the log files for the pod to IBM Support for further help.
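
To gather the information for IBM Support in one pass, the description and logs of each operator pod can be written to files. This is a sketch under stated assumptions: the function name collect_operator_diagnostics and the file naming scheme are hypothetical, and the label selector is the one used in step 1.

```shell
# Writes the description and logs of every operator pod to files in
# the directory that is passed as the first argument.
collect_operator_diagnostics() {
  out_dir="$1"
  mkdir -p "$out_dir" || return 1
  for pod in $(oc get pods -l app.kubernetes.io/name=watson-speech \
      -n "${PROJECT_CPD_INST_OPERATORS}" -o name); do
    # "-o name" prints entries such as pod/<pod-name>; drop the prefix
    # so that the pod name can be used in file names.
    name="${pod#pod/}"
    oc describe "$pod" -n "${PROJECT_CPD_INST_OPERATORS}" > "$out_dir/$name.describe.txt"
    oc logs "$pod" --all-containers -n "${PROJECT_CPD_INST_OPERATORS}" > "$out_dir/$name.log"
  done
}

# Example use:
# collect_operator_diagnostics ./speech-operator-diagnostics
```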

Some pods are in the pending state

Some Speech services pods are stuck in the Pending status.

  1. Use the following command to learn more about the nature of the problem. In the command, pod-name is the name of a pod whose status is Pending.

    oc describe pod pod-name -n ${PROJECT_CPD_INST_OPERANDS}

Some possible causes of the problem follow:

  • Insufficient resources (memory and CPU) are available for the pod.

  • The pod is unable to pull the container image or images.
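
When several pods are pending, it can be quicker to list them all and print only the events for each one, because the events usually state the cause (for example, a scheduling or image pull failure). A minimal sketch; the function names list_pending_pods and pod_events are hypothetical.

```shell
# Lists the pods whose phase is Pending, one per line, as pod/<pod-name>.
list_pending_pods() {
  oc get pods -n "${PROJECT_CPD_INST_OPERANDS}" \
    --field-selector=status.phase=Pending -o name
}

# Prints only the Events section of the description of one pod.
pod_events() {
  oc describe "$1" -n "${PROJECT_CPD_INST_OPERANDS}" | sed -n '/^Events:/,$p'
}

# Example use:
# for pod in $(list_pending_pods); do
#   echo "== $pod"
#   pod_events "$pod"
# done
```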

Installation of Watson Speech services fails

Installation of the Watson Speech services returns an error message of the following form:

TASK [utils : applying CR <speech-cr> for Watson Speech to Text] ********************************************
Tuesday 1 November 2022 17:44:48 +0000 (0:00:02.140) 0:01:08.881 ****** fatal: [localhost]: FAILED! =>
{"changed": false, "error": 422, "msg": "Failed to create object: b'{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},
\"status\":\"Failure\",\"message\":\"WatsonSpeech.speech.watson.ibm.com \\\\\"<speech-cr>\\\\\" is invalid: spec.tags: 
Required value\",\"reason\":\"Invalid\",\"details\":{\"name\":\"<speech-cr>",\"group\":\"speech.watson.ibm.com\",\"kind
\":\"WatsonSpeech\",\"causes\":[{\"reason\":\"FieldValueRequired\",\"message\":\"Required value\",\"field\":\"spec.tags\"}]},
\"code\":422}\\n'", "reason": "Unprocessable Entity", "status": 422}

This message indicates that all of the Speech microservices were set to false during initial installation of the Speech services with the param-file option. You must set at least one of the microservices to true for the installation to succeed. For more information, see Specifying installation options.

The Watson Speech operator is running but no microservices are installed

The Watson Speech operator is running but none of the Speech microservices is being installed.

  1. Use the following command to verify that you created the Speech services custom resource in the desired namespace:

    oc get WatsonSpeech -n ${PROJECT_CPD_INST_OPERANDS}

    If the command does not return any information in its output, you might need to create the custom resource. For more information, see Installing Watson Speech services.

  2. If the custom resource exists, check the operator logs to determine why the microservices are not being installed.
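
Operator logs can be long, so it may help to filter them for failed tasks first. The following sketch assumes that the operator log uses the same Ansible-style fatal: and FAILED! markers as the installation error shown earlier on this page. The function name failed_log_lines and the list of patterns are illustrative, not an official log format.

```shell
# Reads a log on stdin and prints, with line numbers, the lines that
# contain common failure markers. The patterns are assumptions.
failed_log_lines() {
  grep -n -i -E 'fatal:|FAILED!|error'
}

# Example use (operator-pod-name is a placeholder):
# oc logs operator-pod-name -n "${PROJECT_CPD_INST_OPERATORS}" | failed_log_lines
```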

Jobs to pull Speech to Text models are taking a long time to finish

Jobs with names like <custom-resource-name>-stt-models-string pull the images for Speech to Text models from Docker and upload them to the Multicloud Object Gateway datastore. The custom resource runs one job per enabled model.

The Speech to Text models are large. In some cases, these jobs can take 20 to 25 minutes to complete. If you are concerned about the time it is taking to pull and upload the images, check the status of the jobs and the log files of their pods.

Speech services initContainers are running for a long time

An initContainer is a special container that runs before the regular containers of a pod. All initContainers for a pod must complete before the pod can start its regular containers.

The Speech to Text and Text to Speech runtimes use an initContainer to wait for the Multicloud Object Gateway datastore to be running and for all installed models and voices to be uploaded. The initContainer for either of the runtime microservices might run for as long as 30 minutes, especially in the case of an online install, while it pulls the images from the IBM registry. If the initContainer for either runtime microservice continues to run for more than 30 minutes, use the appropriate command to check its log file:

  • For the Speech to Text service, run the following command to check the status of the service's models:

    oc logs -f runtime-pod-name -c wait4models -n ${PROJECT_CPD_INST_OPERANDS}
  • For the Text to Speech service, run the following command to check the status of the service's voices:

    oc logs -f runtime-pod-name -c wait4voices -n ${PROJECT_CPD_INST_OPERANDS}

In both commands, runtime-pod-name specifies the name of the pod for the Speech to Text or Text to Speech runtime. For example, the name is something like speech-cr-stt-runtime-85957944ff-wrzl4 for the Speech to Text runtime or speech-cr-tts-runtime-858bd6f96f-g7dcw for the Text to Speech runtime.

Possible reasons for the initContainer to run for a long time include the following:

  • The runtime pod is not able to connect to Multicloud Object Gateway, which might be in the process of starting up or might have experienced an error. Wait for Multicloud Object Gateway to start or check its log files.

  • Multicloud Object Gateway might be waiting for all of the required models and voices to be installed. Wait for all of the models and voices to be uploaded. You can use the following command to check the status of the jobs that are uploading the models and voices:

    oc get jobs -l 'app.kubernetes.io/component in (stt-models,tts-voices)' -n ${PROJECT_CPD_INST_OPERANDS}

    You can check the log files for the pods to determine whether a failure has occurred. Otherwise, wait for the upload jobs to complete.
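
Instead of checking the job status repeatedly, you can block until the upload jobs finish. A minimal sketch that uses the same label selector as the command above; the function name wait_for_upload_jobs and the default timeout of 30 minutes are assumptions.

```shell
# Blocks until every model and voice upload job reports the Complete
# condition, or until the timeout elapses (first argument, default 30m).
wait_for_upload_jobs() {
  oc wait --for=condition=complete job \
    -l 'app.kubernetes.io/component in (stt-models,tts-voices)' \
    -n "${PROJECT_CPD_INST_OPERANDS}" --timeout="${1:-30m}"
}

# Example use:
# wait_for_upload_jobs 45m || echo "Upload jobs did not finish in time; check the pod logs."
```

A nonzero exit status means that at least one job did not complete within the timeout, which is the point at which checking the pod logs is worthwhile.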

The Watson Speech operator log indicates that the TLS secret was not created on time

The Watson Speech operator uses the certificate manager from the foundational services to create a secret that the Speech microservices use as a TLS certificate. The following error in the Watson Speech operator log might indicate that the foundational services are not configured properly. In the message, <custom-resource-name> is the name of your Speech services custom resource.

Secret: <custom-resource-name>-instance-tls did not get created in time.

If this error occurs, contact IBM Support for assistance.

No status is reported for the Watson Speech service

The following command fails to report any status for the Watson Speech service (the response is empty):

oc get WatsonSpeech ${CUSTOM_RESOURCE_SPEECH} -n ${PROJECT_CPD_INST_OPERANDS}

To determine the cause of this problem, do the following:

  1. Use the following command to determine whether the Watson Speech operator pod is running:

    oc get pods -n ${PROJECT_CPD_INST_OPERATORS}

    The operator must be running for the oc get WatsonSpeech command to report a status. If the status of the operator pod indicates that it is still starting, wait for the operator to start running. The operator can take 20 to 60 minutes to create or apply changes to your custom resource.

  2. Check the log file for the Watson Speech operator pod and check for any errors or problems.

Some Speech services are not running

The following command reports the status of some Speech services as NotRunning:

oc get WatsonSpeech ${CUSTOM_RESOURCE_SPEECH} -n ${PROJECT_CPD_INST_OPERANDS}

A status of NotRunning can indicate that the process is still starting up. It can take 20-60 minutes for the operator to complete and for the service to start running.

Training of custom acoustic models is failing

When you attempt to train custom acoustic models for Speech to Text, the service reports the following error message:

Unresponsive backend detected. Please try later.

This message indicates that the Speech to Text AM Patcher does not have sufficient resources to handle its requests. To increase the number of CPUs that are available to the AM Patcher, use the custom resource property named sttAMPatcher.resources.requestsCPU to increase the value of the property from 1 to 5.

Allocating more resources prevents this error and enables custom acoustic models to be trained as expected. Increasing the value of the property increases the size of the deployment.
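
One way to change the property is to patch the custom resource. The sketch below builds a JSON merge patch for the property. The placement of sttAMPatcher under spec and the use of a bare number as the value are assumptions, so compare them with your own custom resource before you apply the patch. The function name am_patcher_cpu_patch is hypothetical.

```shell
# Builds a JSON merge patch that sets sttAMPatcher.resources.requestsCPU
# to the value passed as the first argument.
# Assumption: the property lives under "spec" in the custom resource.
am_patcher_cpu_patch() {
  printf '{"spec":{"sttAMPatcher":{"resources":{"requestsCPU":%s}}}}' "$1"
}

# Example use:
# oc patch WatsonSpeech "${CUSTOM_RESOURCE_SPEECH}" -n "${PROJECT_CPD_INST_OPERANDS}" \
#   --type=merge --patch "$(am_patcher_cpu_patch 5)"
```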

PostgreSQL pods stuck in Terminating state on upgrade

When you upgrade the Watson Speech services, you might encounter an issue where the PostgreSQL pods become stuck in the Terminating state. If this problem occurs during your upgrade, perform the following steps to resolve the problem.

  1. Use the following command to identify pods that remain in the Terminating state:

    oc get pods -n ${PROJECT_CPD_INST_OPERANDS} -o wide | grep Terminating
  2. Use the following command to set the shell variable pods to the names of the pods that remain in the Terminating state:

    pods=$(oc get pods -n ${PROJECT_CPD_INST_OPERANDS} -o wide | grep Terminating | awk '{print $1}')
  3. Use the following command to delete the stuck pods so that the upgrade process can continue:

    oc delete pod $pods -n ${PROJECT_CPD_INST_OPERANDS} --force=true --grace-period=0
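
The three steps can be combined so that the delete command runs only when at least one stuck pod was found, because oc delete pod with an empty list of names fails. A minimal sketch; the function names are hypothetical, and the filter assumes the default oc get pods column order, in which STATUS is the third column.

```shell
# Reads "oc get pods" output on stdin and prints the names of the pods
# whose STATUS column (third column) is Terminating.
terminating_pod_names() {
  awk '$3 == "Terminating" {print $1}'
}

# Force-deletes the stuck pods, but only if at least one was found.
delete_terminating_pods() {
  pods=$(oc get pods -n "${PROJECT_CPD_INST_OPERANDS}" --no-headers | terminating_pod_names)
  if [ -z "$pods" ]; then
    echo "No pods are in the Terminating state."
    return 0
  fi
  # $pods is intentionally unquoted so that each name is a separate argument.
  oc delete pod $pods -n "${PROJECT_CPD_INST_OPERANDS}" --force=true --grace-period=0
}

# Example use:
# delete_terminating_pods
```

Matching on the STATUS column rather than using grep avoids selecting a pod whose name happens to contain the word Terminating.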

Upgrade to Watson Speech services version 4.8 and later fails to complete

When you upgrade to Watson Speech services version 4.8 and later, the upgrade of the MinIO custom resource can fail because the MinIO backup job or the MinIO PVC creation job was not deleted during the previous upgrade. The solution is to delete the backup and PVC creation jobs. The upgrade then proceeds normally. Perform the following steps to resolve the problem.

  1. To check the status of the MinIO custom resource, issue the following command:

    oc get MinioCluster ${CUSTOM_RESOURCE_SPEECH} -n ${PROJECT_CPD_INST_OPERANDS}

    The failed MinIO custom resource is identified by an entry of the following form:

    <custom-resource-name>   MinioCluster   8d    4          ReleaseFailed   True     UpgradeError

    You can run the following command to get more detailed information about the failure:

    oc describe MinioCluster ${CUSTOM_RESOURCE_SPEECH} -n ${PROJECT_CPD_INST_OPERANDS}

    The custom resource returns a status message similar to the following:

    - lastTransitionTime: "2023-04-18T11:05:05Z"
      message: 'failed to upgrade release: pre-upgrade hooks failed: warning: Hook pre-upgrade
        ibm-minio/templates/minio-createpvc-job.yaml failed: jobs.batch "<custom-resource-name>-ibm-minio-create-pvc"
        already exists'
      reason: UpgradeError
      status: "True"
      type: ReleaseFailed
  2. To delete the failed MinIO PVC creation job, issue the following command:

    oc delete job ${CUSTOM_RESOURCE_SPEECH}-ibm-minio-create-pvc --namespace ${PROJECT_CPD_INST_OPERANDS}
  3. To determine whether the MinIO backup job remains undeleted, issue the following command:

    oc get job --namespace ${PROJECT_CPD_INST_OPERANDS} | grep ${CUSTOM_RESOURCE_SPEECH}-ibm-minio-backup

    The MinIO backup job that is not deleted is identified by an entry of the following form:

    <custom-resource-name>-ibm-minio-backup   1/1   3m25s   1d
  4. To delete the backup job, issue the following command:

    oc delete job ${CUSTOM_RESOURCE_SPEECH}-ibm-minio-backup --namespace ${PROJECT_CPD_INST_OPERANDS}

After you delete these jobs, the upgrade continues and completes.
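
If you prefer to remove both jobs in one pass, the two delete commands can be wrapped in a loop. A minimal sketch, assuming the job names used in the steps above; the function name delete_minio_upgrade_jobs is hypothetical.

```shell
# Deletes both leftover MinIO jobs. The --ignore-not-found option makes
# the command succeed even when a job was already removed.
delete_minio_upgrade_jobs() {
  for suffix in ibm-minio-create-pvc ibm-minio-backup; do
    oc delete job "${CUSTOM_RESOURCE_SPEECH}-${suffix}" \
      --namespace "${PROJECT_CPD_INST_OPERANDS}" --ignore-not-found
  done
}

# Example use:
# delete_minio_upgrade_jobs
```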