Importing and exporting data

Import and export lets you transfer all state associated with a Watson Speech services deployment from a source cluster to a target cluster. The procedure can be used, for example, to move data from a development cluster to a production cluster or when in-place upgrades are not possible or desired.

The import and export utilities export and import all stateful Speech services data and transfer ownership of the data from a source deployment to a target deployment. The utilities work directly on the MinIO and PostgreSQL datastores to transfer the following data:

  • All Speech to Text asynchronous HTTP callback URLs and the results of all recent asynchronous jobs (typically those submitted in the past week)
  • All Speech to Text customization data (custom language models and custom acoustic models)
  • All Text to Speech customization data (custom models and speaker models)

The import and export utilities are used as the foundation of other procedures.

Permissions you need for these tasks:
  • You must be an administrator of the Red Hat® OpenShift® project for your source and target clusters to import and export data for your Speech services.
  • You must have permission to access the PostgreSQL and MinIO authentication secrets and to exec to a PostgreSQL pod for both the source and target clusters.

Before you begin

You need to know the following to use the utilities:

  • The import and export utilities are supported only on clusters that have the following configuration:

    • Red Hat OpenShift 4.6 or 4.8
    • Cloud Pak for Data 4.x
    • Watson Speech services 4.x
  • The import and export utilities consist of a set of Bash shell scripts that are required to complete the import and export procedure:
    • import_export.sh moves data from one cluster to another. The same script is used for both import and export.
    • transfer_ownership.sh transfers ownership of the data from the owning instance in the source cluster to the owning instance in the target cluster.

    Each script supports a -h (--help) option to display information about the script and its usage.

  • You must run the scripts from a Linux terminal that has access to the Red Hat OpenShift project for the Speech services.
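For example, a terminal session might be prepared as follows. The login details and project name are placeholders, and the cluster commands are shown as comments because they require a live cluster:

```shell
# Hypothetical project name; substitute the project that contains your
# Speech services installation.
PROJECT_CPD_INSTANCE="cpd-instance"
# Log in and switch to that project before you run the scripts:
#   oc login --token=<token> --server=<server_URL>
#   oc project "$PROJECT_CPD_INSTANCE"
# Each script prints its usage with -h:
#   ./import_export.sh -h
echo "target project: $PROJECT_CPD_INSTANCE"
```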

Import and export topics

For more information about importing and exporting your data, see the following topics:

Downloading the import and export utilities

You can download the scripts of the import and export utilities by cloning their repository or by downloading them as a .zip file.

The structure of the repository or downloaded .zip file is a directory that has the following contents:

speech-import-export
├── import_export.sh
├── lib
│   └── utils.sh
└── transfer_ownership.sh

The scripts must be available to each Linux machine on which you plan to export and import data. (You do not call the utils.sh script directly.)

Exporting your data from a source cluster

The import and export utilities automatically export all data for any instances of Speech to Text and Text to Speech services that are provisioned in the source deployment. You cannot export data for only a specified instance or for only one of the two Speech services.

Preparing for export

To export data from your source cluster, you need the following:

  • The name of your Speech services custom resource for the source deployment.
  • The name of the PostgreSQL datastore authentication secret that you used during installation of the Speech services of the source deployment. The default value is custom-resource-name-postgres-auth-secret.
  • The name of the MinIO datastore authentication secret that you used during installation of the Speech services of the source deployment. The default value is custom-resource-name-ibm-minio-auth.
  • The instance IDs of the service instances that are associated with the source deployment. After you import your data to a target deployment, you need the IDs from the source deployment to transfer ownership of the resources to the instances in the target deployment.

You can find the instance ID for a service instance in its URL under the Access information section in the Cloud Pak for Data console. For example, the following image shows two Speech services instances, one for Text to Speech and one for Speech to Text:

Example image that shows Speech services instances.

Clicking the name of the Speech to Text instance, speech-to-text-c248, causes the console to display the About this instance page with details about the instance. The instance ID in the following example is circled in the URL that is located under Access information:

Example image that shows detailed information and highlighted credentials for a selected Speech to Text instance.

You can use this procedure to find the instance IDs for all service instances in the source and target deployments.
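The default secret names listed earlier are derived from the custom resource name. A minimal sketch, assuming a hypothetical custom resource named speech-cr:

```shell
# Hypothetical custom resource name; yours comes from the source deployment.
CR_NAME="speech-cr"
# Default secret names when none were overridden at install time:
PG_SECRET="${CR_NAME}-postgres-auth-secret"
MINIO_SECRET="${CR_NAME}-ibm-minio-auth"
echo "$PG_SECRET"
echo "$MINIO_SECRET"
```

If your installation overrode either secret name, use the actual names instead and pass them with the -p and -m options.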

Exporting your data

From a Linux machine that has internet access, use the import_export.sh script to export your data to a local directory:

  1. Change directories to the root of the directory into which you downloaded or cloned the utilities:
    cd speech-import-export
  2. Run the import_export.sh script with the following options:
    ./import_export.sh export -c <custom_resource_name> \
       -o <export_directory> -v <version> \
       -p <postgres_auth_secret_name> -m <minio_auth_secret_name> \
       -n ${PROJECT_CPD_INSTANCE}
    where the options and arguments provide the following information:
    • custom_resource_name is the required name of the Speech services custom resource from which the script is to export data.
    • export_directory is the required path name of the directory in which the script is to save the exported data. The script creates the directory if it does not already exist. If the directory does already exist but is not empty, the script overwrites the existing contents.
    • version is 40, which indicates version 4.0.x of Cloud Pak for Data from which you are exporting data.
    • postgres_auth_secret_name is the Kubernetes secret that is used to authenticate to the source PostgreSQL datastore. You can omit the authentication secret if it is the same as the default value (custom-resource-name-postgres-auth-secret). You must provide the secret if it is different from the default value.
    • minio_auth_secret_name is the Kubernetes secret that is used to authenticate to the source MinIO datastore. You can omit the authentication secret if it is the same as the default value (custom-resource-name-ibm-minio-auth). You must provide the secret if it is different from the default value.
    • ${PROJECT_CPD_INSTANCE} is the name of the project namespace in which the source Speech services are installed. If the namespace is different from the current Red Hat OpenShift context, you must provide the Red Hat OpenShift namespace in which the source Speech services are installed. Otherwise, you can omit the namespace.
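With hypothetical values substituted, the invocation might be assembled like this; the sketch omits -p and -m because it assumes the default secret names, and it echoes the command instead of running it:

```shell
# Sketch: assemble the export invocation with hypothetical values.
CR_NAME="speech-cr"
EXPORT_DIR="/tmp/speech-export"
PROJECT_CPD_INSTANCE="cpd-instance"
# -p and -m are omitted because this sketch assumes the default secret
# names that are derived from the custom resource name.
set -- ./import_export.sh export -c "$CR_NAME" -o "$EXPORT_DIR" -v 40 -n "$PROJECT_CPD_INSTANCE"
echo "$@"
```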

The script reports on its status as it creates the export data. It shows its status for each individual datastore and then provides a final status report when the entire export operation is complete. If the script reports no errors, the export process completed successfully.

What happens when you export data

When you export data, the import_export.sh script does the following:

  1. Automatically downloads the following required programs when you run the script for the first time on any machine:
    • mc - The MinIO client
    • cpdbr - The Cloud Pak for Data backup and restore utility, which puts the Speech services into read-only mode
  2. Waits for any asynchronous recognition jobs that are processing or queued to complete.
  3. Puts the Speech services in read-only mode. This can take up to 30 minutes.
  4. Uses the mc program to copy the contents of the MinIO datastore to the export_directory/minio directory.
  5. Connects to a PostgreSQL pod and uses the psql utility to export a binary dump of the datastore to the export_directory/postgres directory.
  6. Takes the Speech services out of read-only mode. It can take up to 30 minutes for the Speech services to be completely ready to serve requests.

Read-only operations, such as speech recognition, speech synthesis, and listing information about models and voices, remain available while a service is in read-only mode. Changes to stateful data, such as creating Speech to Text asynchronous jobs, modifying Speech to Text or Text to Speech custom models, and creating new custom models for either service, are unavailable until the service exits read-only mode.

Exported data format

The exported data adheres to a well-defined directory structure that you must preserve for a subsequent import to succeed. A sample of the directory structure of the exported data follows:

export-directory
├── minio
│   └── stt-customization-icp
│       └── customizations
│           ├── 3755a526-8714-4cac-be0a-f9a9832d0f1b
│           │   ├── allpatches-untarred.zip
│           │   ├── data
│           │   │   └── audio1.1
│           │   ├── patch_acoustic.en-US_BroadbandModel.v2020-01-16
│           │   └── patch_acoustic.en-US_BroadbandModel.v2020-01-16.trainSnapshotDir
│           │       ├── features-000000-000000.tar.gz
│           │       ├── features.txt
│           │       └── train.snapshot
│           └── d861fc22-8a07-41a6-89b5-4543a217b17f
│               ├── allpatches-untarred.zip
│               ├── data
│               │   └── healthcare.1
│               └── patch.en-US_BroadbandModel.v2020-01-16
└── postgres
    ├── stt-async.export.dump
    ├── stt-customization.export.dump
    └── tts-customization.export.dump

If you need to move the exported data to a new machine to import it to the target deployment, you can compress the output directory for the transfer by using a lossless compression algorithm.
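For example, using tar with gzip (any lossless archiver works); the mkdir line stands in for real export output so that the sketch is self-contained:

```shell
# Sketch: compress the export directory with a lossless archiver
# (tar + gzip). The mkdir stands in for real export output.
mkdir -p export-directory/minio export-directory/postgres
tar -czf export-directory.tar.gz export-directory
# On the import machine, restore the directory structure before importing:
#   tar -xzf export-directory.tar.gz
```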

Importing your data to a target cluster

The import and export utilities import all data for any instances of the Speech to Text and Text to Speech services that were exported from the source deployment. You cannot import data for only a specified instance or for only one of the two Speech services.

Preparing for import

To import data to your target cluster, you need the following:

  • The name of your Speech services custom resource for the target deployment.
  • The name of the PostgreSQL datastore authentication secret that you used during installation of the Speech services of the target deployment. The default value is custom-resource-name-postgres-auth-secret.
  • The name of the MinIO datastore authentication secret that you used during installation of the Speech services of the target deployment. The default value is custom-resource-name-ibm-minio-auth.
  • Your target deployment must already have the same Red Hat OpenShift and Speech services versions installed as the source deployment. The target deployment must run the same Speech microservices as the source deployment.

    • For Speech to Text, the microservices can include sttRuntime, sttAsync, and sttCustomization.
    • For Text to Speech, they can include ttsRuntime and ttsCustomization.

    For example, if your source deployment ran all of these microservices, the target deployment must also run all of the microservices. Otherwise, the necessary datastores might not exist, and the import will fail.
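The comparison can be sketched in plain shell. The microservice lists below are hypothetical; take the real values from each deployment's custom resource configuration:

```shell
# Hypothetical microservice lists; take the real values from each
# deployment's custom resource configuration.
SOURCE_SERVICES="sttRuntime sttAsync sttCustomization ttsRuntime ttsCustomization"
TARGET_SERVICES="sttRuntime sttAsync sttCustomization ttsRuntime ttsCustomization"
MISSING=""
for s in $SOURCE_SERVICES; do
  case " $TARGET_SERVICES " in
    *" $s "*) ;;                  # microservice present in the target
    *) MISSING="$MISSING $s" ;;   # its datastore may be missing; import would fail
  esac
done
echo "missing services:${MISSING:- none}"
```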

Importing your data

From a Linux machine that has internet access, use the import_export.sh script to import data that you previously exported with the same script:

  1. Change directories to the root of the directory into which you downloaded or cloned the utilities:
    cd speech-import-export
  2. Run the import_export.sh script with the following options:
    ./import_export.sh import -c <custom_resource_name> \
       -o <import_directory> -v <version> \
       -p <postgres_auth_secret_name> -m <minio_auth_secret_name> \
       -n ${PROJECT_CPD_INSTANCE}
    where the options and arguments provide the following information:
    • custom_resource_name is the required name of the Speech services custom resource to which the script is to import data.
    • import_directory is the required path name of the directory from which the script can access the previously exported data. The directory must exist and must contain the exported data in the format in which it was created. If you compressed the export data, you must uncompress the data before you can import it.
    • version is 40, which indicates version 4.0.x of Cloud Pak for Data to which you are importing data.
    • postgres_auth_secret_name is the Kubernetes secret that is used to authenticate to the target PostgreSQL datastore. You can omit the authentication secret if it is the same as the default value (custom-resource-name-postgres-auth-secret). You must provide the secret if it is different from the default value.
    • minio_auth_secret_name is the Kubernetes secret that is used to authenticate to the target MinIO datastore. You can omit the authentication secret if it is the same as the default value (custom-resource-name-ibm-minio-auth). You must provide the secret if it is different from the default value.
    • ${PROJECT_CPD_INSTANCE} is the name of the project namespace in which the target Speech services are installed. If the namespace is different from the current Red Hat OpenShift context, you must provide the Red Hat OpenShift namespace in which the target Speech services are installed. Otherwise, you can omit the namespace.
  3. When the import script completes, you must transfer ownership of the imported resources to make the data accessible in the target deployment. To transfer ownership, complete the procedure in Transferring resource ownership. You can transfer ownership as soon as the import script is complete. You do not need to wait for the target services to exit read-only mode to transfer ownership.

The script reports on its status as it imports data. It shows its status for each datastore and then provides a final status report when the entire import operation is complete. If the script reports no errors, the import process completed successfully.

What happens when you import data

When you import data, the import_export.sh script does the following:

  1. Automatically downloads the following required programs when you run the script for the first time on any machine:
    • mc - The MinIO client
    • cpdbr - The Cloud Pak for Data backup and restore utility, which puts the Speech services into read-only mode
  2. Waits for any asynchronous recognition jobs that are processing or queued to complete.
  3. Puts the Speech services in read-only mode. This can take up to 30 minutes.
  4. Uses the mc program to copy the contents of the MinIO datastore from the import_directory/minio directory.
  5. Connects to a PostgreSQL pod and uses the psql utility to import the binary dump of the datastore from the import_directory/postgres directory.
  6. Takes the Speech services out of read-only mode. It can take up to 30 minutes for the Speech services to be completely ready to serve requests.

Read-only operations, such as speech recognition, speech synthesis, and listing information about models and voices, remain available while a service is in read-only mode. Changes to stateful data, such as creating Speech to Text asynchronous jobs, modifying Speech to Text or Text to Speech custom models, and creating new custom models for either service, are unavailable until the service exits read-only mode.

Transferring resource ownership

After you complete the import procedure in the previous section, all stateful data from the source Speech services deployment is present in the target deployment. However, the data is still owned by the original service instances (instance IDs) in the source deployment. To make it accessible to the service instances in the target deployment, you must transfer ownership of the resources to the corresponding service instances in the target deployment.

You can transfer ownership of resources in one of two ways:

  • By transferring ownership from one instance ID in the source deployment to one corresponding instance ID in the target deployment. This is the recommended approach because it is the most direct procedure.
  • By transferring ownership from multiple instance IDs in the source deployment to a single instance ID in the target deployment. This approach combines multiple former instance IDs into a single new instance ID.
    Note: You cannot map the resources associated with a single instance ID in the source deployment to multiple instance IDs in the target deployment.

From a Linux machine that has internet access, use the transfer_ownership.sh script to transfer ownership of your resources:

  1. Obtain the instance IDs of the service instances in the target deployment. To learn the instance IDs, use the procedure described in Preparing for export.
  2. Change directories to the root of the directory into which you downloaded or cloned the utilities:
    cd speech-import-export
  3. Run the transfer_ownership.sh script with the following options and arguments:
    ./transfer_ownership.sh <source_instance_ID> <target_instance_ID> \
       -c <custom_resource_name> -v <version> [-p <postgres_auth_secret_name>] \
       -n ${PROJECT_CPD_INSTANCE}
    where the options and arguments provide the following information:
    • source_instance_ID is the required instance ID of a service instance from the source deployment.
    • target_instance_ID is the required instance ID of the corresponding service instance from the target deployment.
    • custom_resource_name is the required name of the Speech services custom resource to which the script is to transfer ownership.
    • version is 35 or 40, which indicates the version of Cloud Pak for Data to which you are transferring ownership.
    • postgres_auth_secret_name is the Kubernetes secret that is used to authenticate to the PostgreSQL datastore to which you are transferring ownership. You can omit the authentication secret if it is the same as the default value (custom-resource-name-postgres-auth-secret for version 4.0.x, user-provided-postgressql for version 3.5). You must provide the secret if it is different from the default value.
    • ${PROJECT_CPD_INSTANCE} is the name of the project namespace in which the target Speech services are installed. If the namespace is different from the current Red Hat OpenShift context, you must provide the Red Hat OpenShift namespace in which the target Speech services are installed. Otherwise, you can omit the namespace.
  4. Repeat the previous step once for each instance ID for which you need to transfer ownership. If you are transferring ownership from multiple instance IDs to a single instance ID, use the same target_instance_ID for each source_instance_ID.
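When you are merging multiple source instance IDs into a single target instance, the repetition can be scripted as a loop. All IDs and names below are hypothetical, and echo prints each command instead of running it:

```shell
# Sketch: merge multiple source instance IDs into one target instance.
# All IDs are hypothetical; echo prints each command instead of running it.
TARGET_ID="1000000000000099"
for SRC_ID in 1000000000000001 1000000000000002; do
  CMD="./transfer_ownership.sh $SRC_ID $TARGET_ID -c speech-cr -v 40 -n cpd-instance"
  echo "$CMD"
done
```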

For example, suppose you need to transfer ownership from two source instance IDs, one for Speech to Text, 1624643480182330, and one for Text to Speech, 1624643447500603. Assume you are transferring ownership to the following two target instance IDs: for Speech to Text, 1624909653168795, and for Text to Speech, 1624909606530051. In this case, you would run the transfer_ownership.sh script twice, once for the Speech to Text instance and once for the Text to Speech instance:

./transfer_ownership.sh 1624643480182330 1624909653168795 \
   -c <custom_resource_name> -v <version> -p <postgres_auth_secret_name> \
   -n ${PROJECT_CPD_INSTANCE}
./transfer_ownership.sh 1624643447500603 1624909606530051 \
   -c <custom_resource_name> -v <version> -p <postgres_auth_secret_name> \
   -n ${PROJECT_CPD_INSTANCE}

After transferring ownership of every source instance ID to a target instance ID, your imported data is ready to use. As a quick verification test, you can use the URL and credentials for the target instance IDs to list asynchronous jobs (for Speech to Text) or custom models (for Speech to Text or Text to Speech) in the target deployment. If you successfully transferred ownership, the jobs and custom models from the source deployment are present in the target deployment. You then need to update your applications to use the URLs and credentials for the target instances.