Backing up and restoring Watson Speech services data

To recover from potential disasters, you must back up and be prepared to restore your data from the Speech services. You are responsible for understanding your data and how you use the services. You are also responsible for being prepared to restore your data in the event of data loss.

You need to back up your data only if you use one or more of the sttCustomization, ttsCustomization, and sttAsync Speech microservices. These microservices are stateful. If you do not install or use any of these microservices, you do not need to create backups.

The Speech services store the following stateful data, which must be backed up, in the MinIO and PostgreSQL datastores:

- All Speech to Text asynchronous HTTP callback URLs and the results of all recent asynchronous jobs (typically those submitted in the past week)
- All Speech to Text customization data (custom language models and custom acoustic models)
- All Text to Speech customization data (custom models and speaker models)

The frequency of your backups depends on factors such as the following:

- How often you modify your Speech to Text and Text to Speech custom models and data, and how easy it is to re-create changes that might be lost.
- How heavily you use the Speech to Text asynchronous HTTP interface, and how severely the loss of the results of completed jobs might impact your applications.

Whatever frequency you choose, employ it consistently to avoid the irreversible loss of data in case of a catastrophic failure.

Note: To back up and restore your data, you use the import and export utilities described in Importing and exporting Watson Speech services data. The utilities work directly on the stateful data contained in the MinIO and PostgreSQL datastores. This topic refers to the import and export documentation for some details of the procedure. That documentation refers to source and target clusters and deployments. For purposes of backup and restore:

- A source cluster or deployment is one from which you are backing up data.
- A target cluster or deployment is one to which you are restoring previously backed up data.

Permissions you need for these tasks:

- You must be an administrator of the Red Hat® OpenShift® project for the clusters from which you want to back up data and to which you want to restore data.
- You must have permission to access the PostgreSQL and MinIO authentication secrets and to exec to a PostgreSQL pod for both clusters.
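
As a rough pre-flight check, the `oc auth can-i` command can confirm the secret and exec permissions before you start. The following sketch assumes that the `oc` CLI is already logged in to the cluster. The function name is illustrative and not part of the utilities, and the check does not prove that you are a project administrator.

```shell
# Hypothetical helper (not part of the import and export utilities).
# Returns 0 only if the current user can read secrets and exec into pods
# in the given project. Assumes `oc` is logged in to the cluster.
check_speech_backup_permissions() {
  local project="$1"
  local status=0

  # Needed to read the PostgreSQL and MinIO authentication secrets.
  oc auth can-i get secrets -n "$project" >/dev/null || status=1

  # Needed to exec into a PostgreSQL pod.
  oc auth can-i create pods --subresource=exec -n "$project" >/dev/null || status=1

  return "$status"
}

# Example (not run here):
#   check_speech_backup_permissions "${PROJECT_CPD_INSTANCE}" \
#     && echo "Secret and exec permissions look sufficient"
```

Run the check against both the source and the target cluster, because the permissions are required on each.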

Error using the OADP backup and restore utility when Watson Speech services are installed

Watson Speech services do not support the Cloud Pak for Data OADP backup and restore utility. If the Speech services are installed on a cluster, you might not be able to use the Cloud Pak for Data OADP backup and restore utility to back up other services that are installed on that cluster. This limitation applies to version 4.0.0 and later versions of the Speech services.

Before you begin

The import and export utilities are supported only on clusters that have the following configuration:
- Cloud Pak for Data 4.6.x
- Watson Speech services 4.6.x
- Red Hat OpenShift version depends on the version of Watson Speech services you are using:
  - For Watson Speech services versions 4.6.0 and 4.6.2, Red Hat OpenShift versions 4.8 and 4.10
  - For Watson Speech services version 4.6.3, Red Hat OpenShift version 4.10
  - For Watson Speech services version 4.6.4, Red Hat OpenShift versions 4.10 and 4.12
The import and export utilities consist of a set of Bash shell scripts that are required to complete the backup and restore procedures:
- import_export.sh saves data from one cluster so that it can be restored to the same or a different cluster. The same script is used for both backup and restore.
- transfer_ownership.sh transfers ownership of the data from the owning instance in one cluster to the owning instance in another cluster.
Each script supports a -h (--help) option to display information about the script and its usage.
You must run the scripts from a Linux terminal that has access to the Red Hat OpenShift project for the Speech services.

Backup and restore topics

For more information about backing up and restoring your data, see the following topics:

Downloading the import and export utilities
Backing up your data
- Preparing for backup
- Backing up your data
Restoring your data
- Preparing for restore
- Restoring your data
Transferring resource ownership

Downloading the import and export utilities

You can download the scripts of the import and export utilities in the following ways:

- By cloning the following public GitHub repository with HTTPS or SSH:
  https://github.com/watson-developer-cloud/speech-import-export
- By downloading a .zip file of the repository from the following URL:
  https://github.com/watson-developer-cloud/speech-import-export/archive/refs/heads/master.zip

After downloading the speech-import-export-master.zip file, you must uncompress the file before you can use the scripts.

The structure of the repository or downloaded .zip file is a directory that has the following contents:

speech-import-export
├── import_export.sh
├── lib
│   └── utils.sh
└── transfer_ownership.sh

The scripts must be available to each Linux machine on which you plan to back up and restore data. (You do not call the utils.sh script directly.)
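
Before relying on a copy of the utilities, it can help to confirm that the directory has the layout shown above. The following is a minimal sketch; the function name is hypothetical, and the default directory name assumes a Git clone (an unpacked .zip file typically produces a directory named speech-import-export-master instead).

```shell
# Hypothetical helper: confirm that a directory holds the three scripts
# listed in the repository structure. Prints any missing file to stderr.
check_speech_utilities() {
  local root="${1:-speech-import-export}"
  local missing=0
  local file

  for file in import_export.sh transfer_ownership.sh lib/utils.sh; do
    if [ ! -f "$root/$file" ]; then
      echo "Missing: $root/$file" >&2
      missing=1
    fi
  done

  return "$missing"
}

# Example (not run here):
#   check_speech_utilities ./speech-import-export && ./speech-import-export/import_export.sh -h
```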

Backing up your data

The import and export utilities automatically back up all data for any instances of the Speech to Text and Text to Speech services that are provisioned in your deployment. You cannot back up data for only a specified instance or for only one of the two Speech services.

Preparing for backup

To back up your data, you need the following:

- The name of your Speech services custom resource.
- The name of the PostgreSQL datastore authentication secret that you used during installation of the Speech services. The default value is custom-resource-name-postgres-auth-secret.
- The name of the MinIO datastore authentication secret that you used during installation of the Speech services. The default value is custom-resource-name-ibm-minio-auth.
- The instance IDs of your service instances. You need the IDs to transfer ownership of the resources in the event of a catastrophic failure.

You can find the instance ID for a service instance in its URL under the Access information section in the Cloud Pak for Data console. The following example shows two Speech services instances, one for Text to Speech and one for Speech to Text:

Example image that shows Speech services instances.

Clicking the name of the Speech to Text instance, speech-to-text-c248, causes the console to display the About this instance page with details about the instance. The instance ID in the following example is circled in the URL that is located under Access information:

Example image that shows detailed information and highlighted credentials for a selected Speech to Text instance.

You can use this procedure to find the instance IDs for all service instances in any deployment.
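
If you copy the instance URLs from the console, the ID can be pulled out with a short helper rather than by hand. This sketch assumes that the URL contains a path segment of the form /instances/<numeric ID>; that shape, the function name, and the example host are assumptions, so compare the result with what the console shows.

```shell
# Hypothetical helper: print the numeric instance ID found after
# "/instances/" in a URL. Prints nothing if the URL has no such segment.
extract_instance_id() {
  local url="$1"
  printf '%s\n' "$url" | sed -n 's#.*/instances/\([0-9][0-9]*\).*#\1#p'
}

# Example with an illustrative URL (the host and path prefix are made up):
#   extract_instance_id "https://cpd.example.com/speech-to-text/release/instances/1624643480182330/api"
```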

Backing up your data

From a Linux machine that has internet access, use the import_export.sh script to back up your data to a local directory:

Change directories to the root of the directory into which you downloaded or cloned the utilities:
```
cd speech-import-export
```
Run the import_export.sh script with the following options:
```
./import_export.sh export -c ${CUSTOM_RESOURCE_SPEECH} \
   -o <export_directory> -v <version> \
   -p <postgres_auth_secret_name> -m <minio_auth_secret_name> \
   -n ${PROJECT_CPD_INSTANCE} --no-quiesce
```
where the options and arguments provide the following information:

${CUSTOM_RESOURCE_SPEECH}

The required name of the Speech services custom resource from which the script is to back up data.

export_directory

The required path name of the directory in which the script is to save the backed up data. The script creates the directory if it does not already exist. If the directory already exists and is not empty, the script overwrites the existing contents.

version

The version from which you are backing up data. Use 46 for Cloud Pak for Data version 4.6.x.

postgres_auth_secret_name

The Kubernetes secret that is used to authenticate to the PostgreSQL datastore. You can omit the authentication secret if it is the same as the default value (custom-resource-name-postgres-auth-secret). You must provide the secret if it is different from the default value.

minio_auth_secret_name

The Kubernetes secret that is used to authenticate to the MinIO datastore. You can omit the authentication secret if it is the same as the default value (custom-resource-name-ibm-minio-auth). You must provide the secret if it is different from the default value.

${PROJECT_CPD_INSTANCE}

The name of the project namespace in which the Speech services are installed. If the namespace is different from the current Red Hat OpenShift context, you must provide the Red Hat OpenShift namespace in which the Speech services are installed. Otherwise, you can omit the namespace.

--no-quiesce

Proceeds with the backup without first putting the service into read-only mode, so the backup takes less time. Do not use the option in production environments.
Create a file to record the instance IDs of your service instances. Save the file outside of the export_directory to which your data is backed up. Do not store the instance IDs in the export_directory. That directory has a well-defined structure that must remain unchanged. For more information about the format of the backup data, see Exported data format.
To store your backup data efficiently, you can compress the output directory and the file of instance IDs by using a lossless compression algorithm.
Store your backup data on a mounted volume or file system that resides on a different host from the datastores. Alternatively, you can copy the backup files to a safe location.
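
The last three steps can be sketched as follows. The sketch assumes that the export directory and the instance ID file are siblings in one parent directory, and that a gzip-compressed tar archive is an acceptable lossless format. The function and file names are illustrative and not part of the utilities.

```shell
# Hypothetical helper: write instance IDs, one per line, to a file.
# Keep this file next to the export directory, never inside it.
record_instance_ids() {
  local ids_file="$1"
  shift
  printf '%s\n' "$@" > "$ids_file"
}

# Hypothetical helper: bundle the export directory and the instance ID file
# (both located in parent_dir) into one gzip-compressed tar archive.
# Write the archive outside parent_dir so that tar does not try to read it.
archive_backup() {
  local parent_dir="$1" export_name="$2" ids_name="$3" archive="$4"
  tar -czf "$archive" -C "$parent_dir" "$export_name" "$ids_name"
}

# Example (not run here; paths are illustrative):
#   record_instance_ids /backups/current/instance_ids.txt 1624643480182330 1624643447500603
#   archive_backup /backups/current export instance_ids.txt /mnt/offhost/speech-backup.tar.gz
```

Writing the archive directly to a volume on another host covers the final step without a separate copy.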

The script reports on its status as it creates the backup data. It shows its status for each individual datastore and then provides a final status report when the entire backup operation is complete. If the script reports no errors, the backup process completed successfully. For more information about what the script does when it backs up data, see What happens when you export data.

Restoring your data

The import and export utilities restore all data for all instances of the Speech to Text and Text to Speech services that were backed up. You cannot restore data for only a specified instance or for only one of the two Speech services.

Preparing for restore

To restore your data, you need the following:

- The name of your Speech services custom resource.
- The name of the PostgreSQL datastore authentication secret that you used during installation of the Speech services. The default value is custom-resource-name-postgres-auth-secret.
- The name of the MinIO datastore authentication secret that you used during installation of the Speech services. The default value is custom-resource-name-ibm-minio-auth.
To restore your data, your deployment must already have installed the same Red Hat OpenShift and Speech services versions as the deployment from which the data was backed up. The deployment must run the same Speech microservices as the original deployment.
- For Speech to Text, the microservices can include sttRuntime, sttAsync, and sttCustomization.
- For Text to Speech, they can include ttsRuntime and ttsCustomization.
For example, if the deployment from which you backed up the data ran all of these microservices, the deployment to which you are restoring the data must also run all of the microservices. Otherwise, the necessary datastores might not exist, and the restore will fail.
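
To decide whether you can omit the -p and -m options, you can check whether secrets with the default names exist in the target project. The following sketch derives the default names from the custom resource name by using the pattern stated above. It assumes that the `oc` CLI is logged in to the cluster, and the function names are hypothetical.

```shell
# Default secret names, following the pattern given in this topic.
default_postgres_secret() { printf '%s-postgres-auth-secret\n' "$1"; }
default_minio_secret()    { printf '%s-ibm-minio-auth\n' "$1"; }

# Hypothetical helper: report whether both default secrets exist in a project.
# Returns non-zero if either secret is not found.
check_default_secrets() {
  local cr="$1" project="$2"
  local secret status=0

  for secret in "$(default_postgres_secret "$cr")" "$(default_minio_secret "$cr")"; do
    if oc get secret "$secret" -n "$project" >/dev/null 2>&1; then
      echo "found: $secret"
    else
      echo "not found: $secret (pass the actual secret name with -p or -m)"
      status=1
    fi
  done

  return "$status"
}

# Example (not run here):
#   check_default_secrets "${CUSTOM_RESOURCE_SPEECH}" "${PROJECT_CPD_INSTANCE}"
```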

Restoring your data

From a Linux machine that has internet access, use the import_export.sh script to restore the data that you previously backed up:

Determine the type of restore you are performing:
- If you are restoring your data to a new deployment, possibly to recover from a catastrophic failure, you must reinstall Red Hat OpenShift and the Speech services to duplicate the environment from which the data was backed up. For more information, see Installing Watson Speech services.
- If you are restoring your data to the same deployment, possibly to recover from corruption or to revert to a previous known good state, you do not need to re-create your deployment. So long as all of the microservices remain the same and have the same instance IDs, you can restore to the same existing deployment.
Change directories to the root of the directory into which you downloaded or cloned the utilities:
```
cd speech-import-export
```
Run the import_export.sh script with the following options:
```
./import_export.sh import -c ${CUSTOM_RESOURCE_SPEECH} \
   -o <import_directory> -v <version> \
   -p <postgres_auth_secret_name> -m <minio_auth_secret_name> \
   -n ${PROJECT_CPD_INSTANCE} --no-quiesce
```
where the options and arguments provide the following information:

${CUSTOM_RESOURCE_SPEECH}

The required name of the Speech services custom resource to which the script is to restore data.

import_directory

The required path name of the directory from which the script can access the previously backed up data. The directory must exist and must contain the backup data in the format in which it was created. If you compressed the backup data, you must uncompress the data before you can restore it.

version

The version to which you are restoring data. Use 46 for version 4.6.x of Cloud Pak for Data.

postgres_auth_secret_name

The Kubernetes secret that is used to authenticate to the PostgreSQL datastore. You can omit the authentication secret if it is the same as the default value (custom-resource-name-postgres-auth-secret). You must provide the secret if it is different from the default value.

minio_auth_secret_name

The Kubernetes secret that is used to authenticate to the MinIO datastore. You can omit the authentication secret if it is the same as the default value (custom-resource-name-ibm-minio-auth). You must provide the secret if it is different from the default value.

${PROJECT_CPD_INSTANCE}

The name of the project namespace in which the Speech services are installed. If the namespace is different from the current Red Hat OpenShift context, you must provide the Red Hat OpenShift namespace in which the Speech services are installed. Otherwise, you can omit the namespace.

--no-quiesce

Proceeds with the restore without first putting the service into read-only mode, so the restore takes less time. Do not use the option in production environments.
Determine whether you need to transfer ownership of the restored data:
- If you are restoring your data to a new deployment, you must transfer ownership of the restored resources to make the data accessible in the new deployment. To transfer ownership, complete the procedure in Transferring resource ownership. You can transfer ownership as soon as the import script is complete. You do not need to wait for the services to exit read-only mode to transfer ownership.
- If you are restoring your data to the same deployment, you do not need to transfer ownership. The original instance IDs already exist because the deployment already exists.

The script reports on its status as it restores data. It shows its status for each datastore and then provides a final status report when the entire restore operation is complete. If the script reports no errors, the restore process completed successfully. For more information about what the script does when it restores data, see What happens when you import data.
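
If you compressed the backup, the import directory must be unpacked and non-empty before you run the import step. The following sketch assumes a gzip-compressed tar archive; the function names and paths are illustrative and not part of the utilities.

```shell
# Hypothetical helper: unpack a gzip-compressed tar archive into a directory,
# creating the directory if needed.
unpack_backup() {
  local archive="$1" dest="$2"
  mkdir -p "$dest" && tar -xzf "$archive" -C "$dest"
}

# Hypothetical helper: succeed only if the import directory exists and
# contains at least one entry.
check_import_dir() {
  local dir="$1"
  [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

# Example (not run here; paths are illustrative):
#   unpack_backup /mnt/offhost/speech-backup.tar.gz /restore/current
#   check_import_dir /restore/current/export && echo "Ready to import"
```

The check confirms only that data is present. It does not validate the exported data format.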

Transferring resource ownership

After you restore your data, all stateful data from the backed up Speech services deployment is present in the restored deployment. However, if you are restoring your data to a new deployment, the data is still owned by the original service instances (instance IDs) of the deployment from which it was backed up. To make it accessible to the service instances in the new deployment, you must transfer ownership of the resources to the corresponding service instances in the new deployment.

From a Linux machine that has internet access, use the transfer_ownership.sh script to transfer ownership of your resources:

Obtain the instance IDs of the service instances in the new deployment. To learn the instance IDs, use the procedure described in Preparing for backup.
Change directories to the root of the directory into which you downloaded or cloned the utilities:
```
cd speech-import-export
```
Run the transfer_ownership.sh script with the following options and arguments:
```
./transfer_ownership.sh <source_instance_ID> <target_instance_ID> \
   -c ${CUSTOM_RESOURCE_SPEECH} -v <version> [-p <postgres_auth_secret_name>] \
   -n ${PROJECT_CPD_INSTANCE}
```
where the options and arguments provide the following information:

source_instance_ID

The required instance ID of a service instance in the deployment from which you backed up the data.

target_instance_ID

The required instance ID of the corresponding service instance from the deployment to which you restored the data.

${CUSTOM_RESOURCE_SPEECH}

The required name of the Speech services custom resource to which the script is to transfer ownership.

version

The version to which you are transferring ownership. Use 46 for Cloud Pak for Data version 4.6.x.

postgres_auth_secret_name

The Kubernetes secret that is used to authenticate to the PostgreSQL datastore to which you are transferring ownership. You can omit the authentication secret if it is the same as the default value (custom-resource-name-postgres-auth-secret). You must provide the secret if it is different from the default value.

${PROJECT_CPD_INSTANCE}

The name of the project namespace in which the target Speech services (the services to which you restored the data) are installed. If the namespace is different from the current Red Hat OpenShift context, you must provide the Red Hat OpenShift namespace in which the Speech services are installed. Otherwise, you can omit the namespace.
Repeat the previous step once for each instance ID for which you need to transfer ownership.

For example, suppose you need to transfer ownership from two source instance IDs, one for Speech to Text, 1624643480182330, and one for Text to Speech, 1624643447500603. Assume you are transferring ownership to the following two target instance IDs: for Speech to Text, 1624909653168795, and for Text to Speech, 1624909606530051. In this case, you would run the transfer_ownership.sh script twice, once for the Speech to Text instance and once for the Text to Speech instance:

```
./transfer_ownership.sh 1624643480182330 1624909653168795 \
   -c ${CUSTOM_RESOURCE_SPEECH} -v <version> -p <postgres_auth_secret_name> \
   -n ${PROJECT_CPD_INSTANCE}

./transfer_ownership.sh 1624643447500603 1624909606530051 \
   -c ${CUSTOM_RESOURCE_SPEECH} -v <version> -p <postgres_auth_secret_name> \
   -n ${PROJECT_CPD_INSTANCE}
```
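
When there are several instances, the repeated calls can be driven from a file of ID pairs. The following is a sketch only: the pairs file format, the function name, and the DRY_RUN switch are assumptions made for illustration. It omits the -p option, so it applies only when the PostgreSQL secret has the default name.

```shell
# Hypothetical helper: run transfer_ownership.sh once per line of a pairs
# file, where each line is "<source_instance_ID> <target_instance_ID>".
# Set DRY_RUN=1 to print the commands instead of running them.
# Must be called from the root of the utilities directory.
transfer_all_instances() {
  local pairs_file="$1" cr="$2" version="$3" project="$4"
  local source_id target_id

  while read -r source_id target_id || [ -n "$source_id" ]; do
    # Skip blank lines and comment lines that start with "#".
    case "$source_id" in ''|'#'*) continue ;; esac

    if [ "${DRY_RUN:-0}" = "1" ]; then
      printf '%s\n' "./transfer_ownership.sh $source_id $target_id -c $cr -v $version -n $project"
    else
      # Redirect stdin so the script cannot consume the rest of the pairs file.
      ./transfer_ownership.sh "$source_id" "$target_id" \
        -c "$cr" -v "$version" -n "$project" </dev/null || return 1
    fi
  done < "$pairs_file"
}

# Example (not run here):
#   DRY_RUN=1 transfer_all_instances pairs.txt "${CUSTOM_RESOURCE_SPEECH}" 46 "${PROJECT_CPD_INSTANCE}"
```

A dry run first makes it easy to confirm that each source ID is paired with the intended target ID before any ownership changes.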

After transferring ownership of every instance ID, your restored data is ready to use. As a quick verification test, you can use the URL and credentials for the instance IDs to list asynchronous jobs (for Speech to Text) or custom models (for Speech to Text or Text to Speech) in the new deployment. If you successfully transferred ownership, the jobs and custom models are present in the new deployment. You then need to update your applications to use the URLs and credentials for the new instances.
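
The verification can be sketched with curl. The sketch assumes that you authenticate with a bearer token and that the instance URL is the one shown under Access information. The paths /v1/recognitions (Speech to Text asynchronous jobs) and /v1/customizations (custom models) come from the public Speech to Text and Text to Speech APIs. The function name is hypothetical.

```shell
# Hypothetical helper: send a GET request to one API path of an instance.
#   $1 = instance URL from the Access information section
#   $2 = bearer token that is valid for that instance (assumed auth method)
#   $3 = API path, for example /v1/recognitions or /v1/customizations
list_speech_resources() {
  local url="${1%/}" token="$2" path="$3"
  curl -sS -H "Authorization: Bearer ${token}" "${url}${path}"
}

# Examples (not run here; STT_URL, TTS_URL, and TOKEN are placeholders):
#   list_speech_resources "$STT_URL" "$TOKEN" /v1/recognitions
#   list_speech_resources "$STT_URL" "$TOKEN" /v1/customizations
#   list_speech_resources "$TTS_URL" "$TOKEN" /v1/customizations
```

If the cluster uses a self-signed certificate, curl might also need the certificate authority supplied, which this sketch does not handle.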