Troubleshooting your Watson Speech services installation

You can use this troubleshooting information to diagnose and resolve problems with your Speech services installation. The information documents example scenarios of things that can go wrong and how to identify and debug the root-cause problems.

Permissions you need for these tasks: You must be an administrator of the Red Hat® OpenShift® project to manage the cluster.

Troubleshooting topics

See the following scenarios for more information about troubleshooting the different problems:

Installation of Watson Speech services fails
The Watson Speech operator pod fails to start
Some pods are in the pending state
The Watson Speech operator is running but no microservices are installed
Jobs to pull Speech to Text models are taking a long time to finish
Speech services initContainers are running for a long time
The Watson Speech operator log indicates that the TLS secret was not created on time
MinIO pods fail to start or have errors
No status is reported for the Watson Speech service
Some Speech services are not running
Training of custom acoustic models is failing
PostgreSQL pods stuck in Terminating state on upgrade
Upgrade to Watson Speech services version 4.6.3 and later fails to complete
Upgrade to Watson Speech services version 4.6.0 and later leaves unneeded PostgreSQL pods

Note: In the commands, ${PROJECT_CPD_OPS} is the name of the project (namespace) in which the Watson Speech operator is deployed, and ${PROJECT_CPD_INSTANCE} is the name of the project (namespace) in which the Speech services are installed.

The Watson Speech operator pod fails to start

The Watson Speech operator pod fails to start.

Learn the name of the pod for the operator:

oc get pods -l app.kubernetes.io/name=watson-speech -n ${PROJECT_CPD_OPS}

Use the following command to learn more about the nature of the problem. In the command, pod-name is the name of a pod whose status you want to learn.
```
oc describe pod pod-name -n ${PROJECT_CPD_OPS}
```
You can send the log files for the pod to IBM Support for further help. For more information, see Retrieving logs for the Watson Speech operator.
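
The two steps above can be combined into one helper. The following is a minimal sketch, assuming that the app.kubernetes.io/name=watson-speech label selects only the operator pods; the function name describe_operator_pods is hypothetical and is not part of the oc CLI.

```shell
# Describe every Watson Speech operator pod in the given project.
# describe_operator_pods is a hypothetical helper name, not an oc command.
describe_operator_pods() {
  ns="${1:-${PROJECT_CPD_OPS}}"
  # -o name prints one "pod/<pod-name>" entry per line.
  oc get pods -l app.kubernetes.io/name=watson-speech -n "$ns" -o name |
  while read -r pod; do
    # Redirect stdin so that the command cannot consume the pod list.
    oc describe "$pod" -n "$ns" < /dev/null
  done
}

# Usage: describe_operator_pods "${PROJECT_CPD_OPS}"
```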

Some pods are in the pending state

Some Speech services pods are stuck in the Pending status.

Use the following command to learn more about the nature of the problem. In the command, pod-name is the name of a pod whose status is Pending.
```
oc describe pod pod-name -n ${PROJECT_CPD_INSTANCE}
```

Some possible causes of the problem follow:

Insufficient resources (memory and CPU) are available for the pod.
The pod is unable to pull the container image or images.
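
To find the Pending pods without reading the full pod list, you can filter on the pod phase and then look at the events for one pod. The following is a minimal sketch; the helper names list_pending_pods and show_pod_events are hypothetical, and the example event reasons in the comments are typical Kubernetes reasons rather than output taken from a Speech services cluster.

```shell
# List the names of pods whose phase is Pending in the given project.
list_pending_pods() {
  ns="${1:-${PROJECT_CPD_INSTANCE}}"
  oc get pods -n "$ns" --field-selector=status.phase=Pending -o name
}

# Show the events for one pod. The events usually state why scheduling or
# an image pull failed (for example, FailedScheduling or ErrImagePull).
show_pod_events() {
  pod="$1"
  ns="${2:-${PROJECT_CPD_INSTANCE}}"
  oc get events -n "$ns" --field-selector "involvedObject.name=${pod}"
}

# Usage:
#   list_pending_pods "${PROJECT_CPD_INSTANCE}"
#   show_pod_events pod-name "${PROJECT_CPD_INSTANCE}"
```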

Installation of Watson Speech services fails

Installation of the Watson Speech services returns an error message of the following form:

TASK [utils : applying CR <speech-cr> for Watson Speech to Text] ********************************************
Tuesday 1 November 2022 17:44:48 +0000 (0:00:02.140) 0:01:08.881 ****** fatal: [localhost]: FAILED! =>
{"changed": false, "error": 422, "msg": "Failed to create object: b'{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},
\"status\":\"Failure\",\"message\":\"WatsonSpeech.speech.watson.ibm.com \\\\\"<speech-cr>\\\\\" is invalid: spec.tags: 
Required value\",\"reason\":\"Invalid\",\"details\":{\"name\":\"<speech-cr>",\"group\":\"speech.watson.ibm.com\",\"kind
\":\"WatsonSpeech\",\"causes\":[{\"reason\":\"FieldValueRequired\",\"message\":\"Required value\",\"field\":\"spec.tags\"}]},
\"code\":422}\\n'", "reason": "Unprocessable Entity", "status": 422}

This message indicates that all of the Speech microservices were set to false during initial installation of the Speech services with the param-file option. You must set at least one of the microservices to true for the installation to succeed. For more information, see Specifying additional installation options.

The Watson Speech operator is running but no microservices are installed

The Watson Speech operator is running but none of the Speech microservices is being installed.

Use the following command to verify that you created the Speech services custom resource in the desired namespace:
```
oc get WatsonSpeech -n ${PROJECT_CPD_INSTANCE}
```
If the command does not return any information in its output, you might need to create the custom resource. For more information, see Installing Watson Speech services.
If the custom resource exists, check the operator logs to determine why the microservices are not being installed. For more information about checking the operator logs, see Retrieving logs for the Watson Speech operator.
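
The existence check can be scripted so that it returns a clear result. The following is a minimal sketch; check_speech_cr is a hypothetical helper name, and it relies on oc printing nothing to standard output when no matching resource exists.

```shell
# Report whether a WatsonSpeech custom resource exists in the project.
# Returns 0 if at least one resource is found, 1 otherwise.
check_speech_cr() {
  ns="${1:-${PROJECT_CPD_INSTANCE}}"
  # -o name prints one line per resource; errors are discarded.
  found=$(oc get WatsonSpeech -n "$ns" -o name 2>/dev/null)
  if [ -z "$found" ]; then
    echo "No WatsonSpeech custom resource found in project ${ns}"
    return 1
  fi
  echo "$found"
}

# Usage: check_speech_cr "${PROJECT_CPD_INSTANCE}"
```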

Jobs to pull Speech to Text models are taking a long time to finish

Jobs with names like <custom-resource-name>-stt-models-string pull the images for Speech to Text models from Docker and upload them to the MinIO datastore. The custom resource runs one job per enabled model.

The Speech to Text models are large. In some cases, these jobs can take 20 to 25 minutes to complete. If you are concerned about the time it is taking to pull and upload the images, check the log files for your pods to see whether any errors have occurred. For more information, see Retrieving logs for pods.

Speech services initContainers are running for a long time

An initContainer is a special container that runs before the regular containers of a pod. All initContainers for a pod must complete before the pod can start its regular containers.

The Speech to Text and Text to Speech runtimes use an initContainer to wait for the MinIO datastore to be running and for all installed models and voices to be uploaded. The initContainer for either of the runtime microservices might run for as long as 30 minutes, especially in the case of an online install, while it pulls the images from the IBM registry. If the initContainer for either runtime microservice continues to run for more than 30 minutes, use the appropriate command to check its log file:

For the Speech to Text service, run the following command to check the status of the service's models:
```
oc logs -f runtime-pod-name -c wait4models -n ${PROJECT_CPD_INSTANCE}
```
For the Text to Speech service, run the following command to check the status of the service's voices:
```
oc logs -f runtime-pod-name -c wait4voices -n ${PROJECT_CPD_INSTANCE}
```

In both commands, runtime-pod-name specifies the name of the pod for the Speech to Text or Text to Speech runtime. For example, the name is something like speech-cr-stt-runtime-85957944ff-wrzl4 for the Speech to Text runtime or speech-cr-tts-runtime-858bd6f96f-g7dcw for the Text to Speech runtime.

Possible reasons for the initContainer to run for a long time include the following:

The runtime pod is not able to connect to MinIO. MinIO might be in the process of starting up or might have experienced an error. Wait for MinIO to start or check its log file. For more information, see Retrieving logs for pods.
MinIO might be waiting for all of the required models and voices to be installed. Wait for all of the models and voices to be uploaded. You can use the following command to check the status of the jobs that are uploading the models and voices:
```
oc get jobs -l 'app.kubernetes.io/component in (stt-models,tts-voices)' -n ${PROJECT_CPD_INSTANCE}
```
You can check the log files for the pods to determine whether a failure has occurred. Otherwise, wait for the upload jobs to complete.
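
Instead of checking the upload jobs repeatedly, you can block until they finish. The following is a minimal sketch that uses oc wait with the same label selector; wait_for_upload_jobs is a hypothetical helper name and the 30m default timeout is an assumption based on the durations mentioned in this topic. A job that fails never reaches the complete condition, so the command then returns an error only when the timeout expires.

```shell
# Block until all model and voice upload jobs complete, or the timeout expires.
wait_for_upload_jobs() {
  ns="${1:-${PROJECT_CPD_INSTANCE}}"
  timeout="${2:-30m}"
  oc wait job \
    -l 'app.kubernetes.io/component in (stt-models,tts-voices)' \
    --for=condition=complete \
    --timeout="$timeout" \
    -n "$ns"
}

# Usage: wait_for_upload_jobs "${PROJECT_CPD_INSTANCE}" 45m
```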

The Watson Speech operator log indicates that the TLS secret was not created on time

The Watson Speech operator uses the certificate manager from the foundational services to create a secret that the Speech services microservices use as a TLS certificate. The following error in the Watson Speech operator log might indicate that microservices of the foundational services are not configured properly. In the message, <custom-resource-name> is the name of your Speech services custom resource.

Secret: <custom-resource-name>-instance-tls did not get created in time.

If this error occurs, contact IBM® Support for assistance.

MinIO pods fail to start or have errors

If the MinIO pods fail to start or generate errors, check the log files for the pods for possible problems. For more information, see Retrieving logs for pods.

Some possible causes of problems follow:

The required MinIO secret does not exist. Make sure you specified the correct secret in the custom resource.
The persistent volume claims (PVCs) were not bound. Use the following command to make sure that the PVCs are bound in your namespace:
```
oc get pvc -l "release in (${CUSTOM_RESOURCE_SPEECH}, ${CUSTOM_RESOURCE_SPEECH}-name-rabbitmq)" -n ${PROJECT_CPD_INSTANCE}
```
If the PVCs are not bound, use the following command to describe the PVC to determine the cause, where <pvc-name> is the name of an unbound PVC:
```
oc describe pvc <pvc-name> -n ${PROJECT_CPD_INSTANCE}
```
Your storage classes might not have been created or possibly the storage class was omitted from or set incorrectly in the Speech services custom resource. Use the following command to make sure that the storage class was created in your namespace:
```
oc get storageclass | grep -e portworx-db-gp3-sc -e portworx-shared-gp3
```
This command uses the Portworx storage classes. Substitute the name of the block and file storage classes that are associated with the storage solution that you are using.
If you created the Speech services custom resource multiple times, a stale PVC from a previous custom resource might still exist. Remove the Speech services custom resource, then remove any stale PVCs for MinIO and RabbitMQ. You can then re-create the custom resource.
- For more information about removing the custom resource, see Uninstalling Watson Speech services.
- For more information about reinstalling the custom resource, see Installing Watson Speech services.
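
To spot unbound or stale PVCs quickly, you can filter the PVC list on its STATUS column. The following is a minimal sketch; list_unbound_pvcs is a hypothetical helper name, and it lists every PVC in the project that is not Bound, not only the MinIO and RabbitMQ PVCs.

```shell
# Print the names of PVCs in the project whose STATUS column is not Bound.
list_unbound_pvcs() {
  ns="${1:-${PROJECT_CPD_INSTANCE}}"
  # With --no-headers, column 1 is the PVC name and column 2 is its status.
  oc get pvc -n "$ns" --no-headers | awk '$2 != "Bound" {print $1}'
}

# Usage: list_unbound_pvcs "${PROJECT_CPD_INSTANCE}"
```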

No status is reported for the Watson Speech service

The following command fails to report any status for the Watson Speech service (the response is empty):

oc get WatsonSpeech ${CUSTOM_RESOURCE_SPEECH} -n ${PROJECT_CPD_INSTANCE}

To determine the cause of this problem, do the following:

Use the following command to determine whether the Watson Speech operator pod is running:
```
oc get pods -l app.kubernetes.io/name=watson-speech -n ${PROJECT_CPD_OPS}
```
The operator must be running for the oc get WatsonSpeech command to report its status. If the status of the operator pod indicates that it is still in the process of starting, wait for the operator to start running. The operator can take 20 to 60 minutes to create or apply changes to your custom resource.
Check the log file for the Watson Speech operator pod and check for any errors or problems. For more information, see Retrieving logs for the Watson Speech operator.

Some Speech services are not running

The following command reports the status of some Speech services as NotRunning:

oc get WatsonSpeech ${CUSTOM_RESOURCE_SPEECH} -n ${PROJECT_CPD_INSTANCE}

A status of NotRunning can indicate that the process is still starting up. It can take 20-60 minutes for the operator to complete and for the service to start running.

You can also check the log file of the pod for any service that is not running. For more information, see Retrieving logs for pods.

Training of custom acoustic models is failing

When you attempt to train custom acoustic models for Speech to Text, the service reports the following error message:

Unresponsive backend detected. Please try later.

This message indicates that the Speech to Text AM Patcher does not have sufficient resources to handle its requests. To increase the number of CPUs that are available to the AM Patcher, increase the value of the custom resource property named sttAMPatcher.resources.requestsCPU from 1 to 5.

Allocating more resources prevents this error and enables custom acoustic models to be trained as expected. Increasing the value of the property increases the size of the deployment.
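
One way to change the property is a JSON merge patch against the custom resource. The following is a sketch only. It assumes that sttAMPatcher.resources.requestsCPU sits under the spec section of the WatsonSpeech custom resource and that it accepts an integer; neither detail is confirmed by this topic, so verify both against your custom resource before you apply the patch. The helper names am_patcher_cpu_patch and set_am_patcher_cpu are hypothetical.

```shell
# Build a JSON merge patch that sets the AM Patcher CPU request.
# ASSUMPTION: the property lives under .spec and takes an integer value.
am_patcher_cpu_patch() {
  cpus="${1:-5}"
  printf '{"spec":{"sttAMPatcher":{"resources":{"requestsCPU":%s}}}}' "$cpus"
}

# Apply the patch to the Speech services custom resource.
set_am_patcher_cpu() {
  oc patch WatsonSpeech "${CUSTOM_RESOURCE_SPEECH}" -n "${PROJECT_CPD_INSTANCE}" \
    --type merge -p "$(am_patcher_cpu_patch "$1")"
}

# Usage: set_am_patcher_cpu 5
```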

PostgreSQL pods stuck in Terminating state on upgrade

When you upgrade the Watson Speech services, you might encounter an issue where the PostgreSQL pods become stuck in the Terminating state. If this problem occurs during your upgrade, perform the following steps to resolve the problem.

Use the following command to identify pods that remain in the Terminating state:
```
oc get pods -n ${PROJECT_CPD_INSTANCE} -o wide | grep Terminating
```
Use the following command to set the shell variable pods to the list of pods that remain in the Terminating state:
```
pods=$(oc get pods -n ${PROJECT_CPD_INSTANCE} -o wide | grep Terminating | awk {'print $1'})
```
Use the following command to delete the stuck pods so that the upgrade process can continue:
```
oc delete pod $pods -n ${PROJECT_CPD_INSTANCE} --force=true --grace-period=0
```
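
The steps above can be combined into one guarded helper that skips the delete when no pod is stuck. The following is a minimal sketch; delete_terminating_pods is a hypothetical helper name. Like the commands above, it acts on every Terminating pod in the project, not only on the PostgreSQL pods.

```shell
# Force-delete pods that are stuck in Terminating, if any exist.
delete_terminating_pods() {
  ns="${1:-${PROJECT_CPD_INSTANCE}}"
  # With --no-headers, column 1 is the pod name and column 3 is its status.
  pods=$(oc get pods -n "$ns" --no-headers | awk '$3 == "Terminating" {print $1}')
  if [ -z "$pods" ]; then
    echo "No pods are in the Terminating state in project ${ns}"
    return 0
  fi
  # $pods is intentionally unquoted so that each pod name becomes one argument.
  oc delete pod $pods -n "$ns" --force=true --grace-period=0
}

# Usage: delete_terminating_pods "${PROJECT_CPD_INSTANCE}"
```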

Upgrade to Watson Speech services version 4.6.3 and later fails to complete

When you upgrade to Watson Speech services version 4.6.3 or later, the upgrade of the MinIO custom resource can fail because the MinIO backup job or the MinIO PVC creation job was not deleted during the previous upgrade. The solution is to delete the backup and PVC creation jobs, after which the upgrade proceeds normally. Perform the following steps to resolve the problem.

To check the status of the MinIO custom resource, issue the following command:

oc get MinioCluster ${CUSTOM_RESOURCE_SPEECH} -n ${PROJECT_CPD_INSTANCE}

The failed MinIO custom resource is identified by an entry of the following form:

<custom-resource-name>   MinioCluster   8d    4          ReleaseFailed   True     UpgradeError

You can run the following command to get more detailed information about the failure:

oc describe MinioCluster ${CUSTOM_RESOURCE_SPEECH} -n ${PROJECT_CPD_INSTANCE}

The custom resource returns a status message similar to the following:

  - lastTransitionTime: "2023-04-18T11:05:05Z"
    message: 'failed to upgrade release: pre-upgrade hooks failed: warning: Hook pre-upgrade
      ibm-minio/templates/minio-createpvc-job.yaml failed: jobs.batch "<custom-resource-name>-ibm-minio-create-pvc"
      already exists'
    reason: UpgradeError
    status: "True"
    type: ReleaseFailed

To delete the failed MinIO PVC creation job, issue the following command:

oc delete job ${CUSTOM_RESOURCE_SPEECH}-ibm-minio-create-pvc --namespace ${PROJECT_CPD_INSTANCE}

To determine whether the MinIO backup job remains undeleted, issue the following command:
```
oc get job --namespace ${PROJECT_CPD_INSTANCE} | grep ${CUSTOM_RESOURCE_SPEECH}-ibm-minio-backup
```
The MinIO backup job that is not deleted is identified by an entry of the following form:
```
<custom-resource-name>-ibm-minio-backup   1/1   3m25s   1d
```

To delete the backup job, issue the following command:

oc delete job ${CUSTOM_RESOURCE_SPEECH}-ibm-minio-backup --namespace ${PROJECT_CPD_INSTANCE}

After you delete these jobs, the upgrade continues and completes.
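
Both deletions can be run together without first checking which job is left over. The following is a minimal sketch; delete_minio_upgrade_jobs is a hypothetical helper name. It uses the --ignore-not-found option of oc delete so that a job that is already gone does not cause an error.

```shell
# Delete the leftover MinIO PVC creation and backup jobs, if they exist.
delete_minio_upgrade_jobs() {
  cr="${1:-${CUSTOM_RESOURCE_SPEECH}}"
  ns="${2:-${PROJECT_CPD_INSTANCE}}"
  for suffix in create-pvc backup; do
    # --ignore-not-found makes the delete succeed when the job is absent.
    oc delete job "${cr}-ibm-minio-${suffix}" --namespace "$ns" --ignore-not-found
  done
}

# Usage: delete_minio_upgrade_jobs "${CUSTOM_RESOURCE_SPEECH}" "${PROJECT_CPD_INSTANCE}"
```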

Upgrade to Watson Speech services version 4.6.0 and later leaves unneeded PostgreSQL pods

Prior to version 4.6.0, the PostgreSQL datastore was installed with all Watson Speech services deployments, but PostgreSQL was not used by the Speech to Text and Text to Speech runtime microservices. As of version 4.6.0, PostgreSQL is installed only if at least one of the following microservices is installed:

Speech to Text asynchronous microservice
Speech to Text customization microservice
Text to Speech customization microservice

When you upgrade from a version earlier than 4.6.0 to version 4.6.0 or later, unnecessary pods for the PostgreSQL datastore can remain in your environment. If you do not use the asynchronous or customization microservices listed previously, you can use the following procedure to delete the unnecessary PostgreSQL pods. Do not delete the PostgreSQL pods if you use the asynchronous or customization microservices.

To query for the presence and status of PostgreSQL pods, run the following command:
```
oc get pods -n ${PROJECT_CPD_INSTANCE} | grep ${CUSTOM_RESOURCE_SPEECH}-postgres
```
Three PostgreSQL pods exist. The command returns status similar to the following for each pod. Unused PostgreSQL pods are in the crashed state: CrashLoopBackOff.
```
zen   <custom-resource-name>-postgres-3   0/1   CrashLoopBackOff   206 (2m31s ago)   17h
```
If you use only the runtime microservices, use the following command to delete the unnecessary PostgreSQL pods and the associated PVCs:
```
oc delete cluster ${CUSTOM_RESOURCE_SPEECH}-postgres -n ${PROJECT_CPD_INSTANCE}
```
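
Before you delete the cluster, it can help to confirm which PostgreSQL pods are actually crashing. The following is a minimal sketch; list_crashing_postgres_pods is a hypothetical helper name, and it assumes that the PostgreSQL pod names begin with the custom resource name followed by -postgres, as in the sample output above.

```shell
# Print the PostgreSQL pods of the custom resource that are in CrashLoopBackOff.
list_crashing_postgres_pods() {
  cr="${1:-${CUSTOM_RESOURCE_SPEECH}}"
  ns="${2:-${PROJECT_CPD_INSTANCE}}"
  # Column 1 is the pod name and column 3 is its status.
  oc get pods -n "$ns" --no-headers |
    awk -v prefix="${cr}-postgres" \
      'index($1, prefix) == 1 && $3 == "CrashLoopBackOff" {print $1}'
}

# Usage: list_crashing_postgres_pods "${CUSTOM_RESOURCE_SPEECH}" "${PROJECT_CPD_INSTANCE}"
```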