Prerequisites to migrating data (Watson Knowledge Catalog)

Before you migrate data from Information Server to Cloud Pak for Data, you must complete several prerequisite steps.

Optional: Stop synchronization of information assets to the default catalog

Stop the synchronization of information assets only when you are importing large volumes of data. During synchronization, information assets are synchronized between the Watson™ Knowledge Catalog repository services (Xmeta and CAMS). If you migrate a large amount of data, the synchronization process might take a significant amount of time and slow down the overall migration. You can optionally stop the synchronization by deleting the default catalog or the catalog that you configured for sharing assets. After the migration is finished, you can resume the synchronization by re-creating the catalog.

To delete the catalog, complete these steps:
  1. In Cloud Pak for Data, go to Administration > Catalogs.
  2. Open the Catalog Setup tab and check which catalog is configured for sharing assets with Information Governance Catalog. It is usually Default Catalog.
  3. Go to Catalogs > All catalogs and find this catalog.
  4. From the menu, select Delete.
After you finish the migration, recreate the catalog.

Optional: Disable automatic profiling of data assets

When a data asset is added to a catalog, it is automatically profiled to get additional metadata. During data migration, the volume of data added to the catalog is large. You can temporarily disable automatic profiling to speed up the migration process and later enable it again.

To disable automatic profiling, complete these steps:
  1. In Cloud Pak for Data, go to Administration > Catalogs.
  2. Open the Catalog Setup tab and check which catalog is configured for sharing assets with Information Governance Catalog. It is usually Default Catalog.
  3. On the Overview tab, find this catalog and open it.
  4. Go to the Settings tab and clear the Automatically create profiles for data assets option. If the option already appears disabled, enable it and then disable it again to make sure that the setting is saved.
After you finish the migration, enable automatic profiling again.

Make sure the default catalog in Cloud Pak for Data does not contain user data

To prevent the creation of duplicates, the target default catalog where the data will be migrated to cannot contain any user-defined data.

If you have data in your catalog and want to delete all the data, you can use the following method. To be able to complete the procedure, you must have the OpenShift® Container Platform command-line interface (CLI) installed. For information about installing the CLI, see the instructions in the OpenShift Container Platform documentation.
  1. Log in to the Watson Knowledge Catalog Db2 pod:
    oc exec -it c-db2oltp-wkc-db2u-0 -- /bin/bash
  2. Run the following commands:
    su - db2inst1
    
    db2 connect to BGDB
    db2 "set schema bg"
    db2 "drop table \"flyway_schema_history\""
    db2 "update GLOSSARY_STORAGE_VERSION set version = '0.0'"
    db2 "delete from SCHEMAVERSION"
  3. Restart the wkc-glossary-service pod by deleting it. The pod name suffix varies per cluster. For example:
    oc delete pod wkc-glossary-service-849fdd8cd7-6nq52
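The lookup-and-delete can be scripted so that the changing pod-name suffix does not have to be typed by hand. A minimal sketch; the oc lines are shown commented because they require a live cluster session:

```shell
# Sketch: pick the wkc-glossary-service pod from `oc get pods -o name` output
# without hard-coding the replica-set suffix.
pick_pod() { grep "$1" | head -n 1; }

# pod=$(oc get pods -o name | pick_pod wkc-glossary-service)
# oc delete "$pod"
```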

Delete predefined data classes

If you have any predefined data classes in your target Cloud Pak for Data environment, remove them. When you import data classes from Information Server, these predefined data classes are imported as well. This step is especially important if you modified predefined data classes in your source environment.

Install CLI for Red Hat OpenShift

If you don’t have the OpenShift Container Platform CLI, you must install it to be able to run various commands needed to complete the migration process. For information about installing the CLI, see the instructions in the OpenShift Container Platform documentation.

You must have appropriate roles to run the following commands:
  • oc login
  • oc edit
  • oc delete
  • oc get pods
  • oc cp
  • oc exec
  • oc set
For more information about roles, see the roles descriptions in the OpenShift Container Platform documentation.
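You can check up front whether your account is allowed to run these commands with `oc auth can-i`, which prints yes or no per verb and resource. The verb/resource pairs below are a sketch covering the commands listed above; `oc edit` and `oc set` map to the `update` verb, and `oc exec` requires `create` on `pods/exec`:

```shell
# Sketch: verify the cluster permissions that the migration commands rely on.
oc auth can-i get pods
oc auth can-i delete pods
oc auth can-i update configmaps      # needed for oc edit cm
oc auth can-i update statefulsets    # needed for oc edit sts / oc set resources
oc auth can-i create pods/exec       # needed for oc exec
```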

Configure Redis settings

Redis is used by many microservices to cache information. Before you start the migration, you must configure its settings so that it doesn’t run out of memory. Complete these steps:
  1. Edit the value of the maxmemory property in the redis.conf file. Run this command:
    oc edit cm redis-ha-configmap
    Change the value to "1573741824". It must be enclosed in double quotation marks.
  2. Increase the Redis memory limit to 2 GB by running this command:
    oc set resources sts redis-ha-server -c redis --limits=memory=2Gi
  3. Update the CAMS OMRS cache TTL setting by running this command:
    oc set env deploy catalog-api -c catalog-api omrs_cache_ttl_days=1
To verify this setting, open this URL:
https://<target_host_name>/k8s/ns/wkc/deployments/catalog-api/environment
The omrs_cache_ttl_days property should be set to the value 1.
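The same three settings can also be verified from the CLI instead of the console URL; a sketch, assuming the default resource names used above:

```shell
# Sketch: verify the Redis and catalog-api settings from the command line.
oc get cm redis-ha-configmap -o yaml | grep maxmemory
oc get sts redis-ha-server \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="redis")].resources.limits.memory}'
oc set env deploy/catalog-api --list | grep omrs_cache_ttl_days
```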

Increase available resources for services in Cloud Pak for Data

Before you start the migration, you must increase the memory limits for the Cassandra, Solr, event consumer, iis-services, and conductor services. The increased limits are required for operations like imports to ensure optimal performance.

Complete these steps:
  1. Log in to the Red Hat® OpenShift cluster with this command:
    oc login
  2. Modify the HEAP SETTINGS section of the Cassandra JVM options.
    1. Run this command:
      oc -n ${PROJECT_CPD_INSTANCE} edit cm cassandra-jvm-options
    2. Modify the values. The -Xms and -Xmx options must have the same value. The value of the -Xmn option must be one quarter of the -Xmx value. The following excerpt shows recommended values. If you have more resources, you can increase the values.
      #################
      # HEAP SETTINGS #
      #################
      
      # Heap size is automatically calculated by cassandra-env based on this
      # formula: max(min(1/2 ram, 1024MB), min(1/4 ram, 8GB))
      # That is:
      # - calculate 1/2 ram and cap to 1024MB
      # - calculate 1/4 ram and cap to 8192MB
      # - pick the max
      #
      # For production use you may wish to adjust this for your environment.
      # If that's the case, uncomment the -Xmx and Xms options below to
      # override the automatic calculation of JVM heap memory.
      #
      # It is recommended to set min (-Xms) and max (-Xmx) heap sizes to
      # the same value to avoid stop-the-world GC pauses during resize, and
      # so that we can lock the heap in memory on startup to prevent any
      # of it from being swapped out.
      #-Xms1024M
      #-Xmx1024M
      -Xms4096M
      -Xmx4096M
      # Young generation size is automatically calculated by cassandra-env
      # based on this formula: min(100 * num_cores, 1/4 * heap size)
      #
      # The main trade-off for the young generation is that the larger it
      # is, the longer GC pause times will be. The shorter it is, the more
      # expensive GC will be (usually).
      #
      # It is not recommended to set the young generation size if using the
      # G1 GC, since that will override the target pause-time goal.
      # More info: http://www.oracle.com/technetwork/articles/java/g1gc-1984535.html
      #
      # The example below assumes a modern 8-core+ machine for decent
      # times. If in doubt, and if you do not particularly want to tweak, go
      # 100 MB per physical CPU core.
      #-Xmn256M
      -Xmn1024M
  3. Modify the resource requests and limits for the Cassandra StatefulSet.
    1. Run this command:
      oc -n ${PROJECT_CPD_INSTANCE} edit sts cassandra
    2. Modify the values. The memory request value must be equal to the value of the -Xmx option. The memory limit value must be four times the request value. The following excerpt shows recommended values.
        resources:
          limits:
            cpu: 2
            memory: 16Gi
          requests:
            cpu: 1
            memory: 4Gi
      
    3. Restart the Cassandra pod by running this command:
      oc -n ${PROJECT_CPD_INSTANCE} delete pod cassandra-0
  4. Modify the HEAP SETTINGS section of the iis-services configuration.
    1. Run this command:
      oc -n ${PROJECT_CPD_INSTANCE} edit cm iis-server
    2. Search for the -Xmx option and change its value. The recommended value is -Xmx16384m.
    3. Find the name of the iis-services pod. Run this command:
      oc get pods | grep iis-services
    4. Restart the iis-services pod. Use the name that was returned by the command in the previous step. For example:
      oc -n ${PROJECT_CPD_INSTANCE} delete pod iis-services
  5. Modify the resource requests and limits for the Solr StatefulSet.
    1. Run this command:
      oc -n ${PROJECT_CPD_INSTANCE} edit sts solr
    2. Modify the values. The following excerpt shows recommended values.
        resources:
          limits:
            cpu: 2
            memory: 4Gi
          requests:
            cpu: 1
            memory: 1Gi
      
    3. Restart the Solr pod by running this command:
      oc -n ${PROJECT_CPD_INSTANCE} delete pod solr-0
  6. Modify the resource request and limit values for the event consumer StatefulSet.
    1. Run this command:
      oc -n ${PROJECT_CPD_INSTANCE} edit sts shop4info-event-consumer
    2. Modify the values. The following excerpt shows recommended values.
        resources:
          limits:
            cpu: 3
            memory: 4Gi
          requests:
            cpu: 200m
            memory: 1Gi
      
    3. Restart the event consumer pod by running this command:
      oc -n ${PROJECT_CPD_INSTANCE} delete pod shop4info-event-consumer-0
  7. Modify the resource limits for the conductor StatefulSet.
    1. Run this command:
      oc -n ${PROJECT_CPD_INSTANCE} edit sts is-en-conductor
    2. Modify the values. The following excerpt shows recommended values.
        resources:
          limits:
            cpu: 6
            memory: 16Gi 
      
    3. Restart the conductor pod by running this command:
      oc -n ${PROJECT_CPD_INSTANCE} delete pod is-en-conductor-0
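After the edits, you can confirm the requests and limits on each modified workload in one pass; a sketch:

```shell
# Sketch: print the container resource settings of each modified StatefulSet.
for sts in cassandra solr shop4info-event-consumer is-en-conductor; do
  echo "== $sts =="
  oc -n ${PROJECT_CPD_INSTANCE} get sts "$sts" \
    -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'
done
```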

Increase the size of the Db2 secondary log

If you want to import 50,000 glossary assets or more, increase the size of the Db2® secondary log.
  1. Find the name of the Db2 pod (wdp-db2-0) by using db2 as the search string:
    oc get pods | grep db2
  2. Log in to the Db2 pod.
    oc exec -it wdp-db2-0 bash
  3. Switch to the db2inst1 user:
    su - db2inst1
  4. Run the following command:
    db2 "update db cfg for ilgdb using logsecond max_num_allowed"

    The value you set for max_num_allowed is the maximum number of secondary log files that can be created and is usually calculated as 256 minus the number of primary log files. For more information about the logsecond configuration parameter, see the Db2 documentation.
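The calculation above can be scripted: read the number of primary log files from the same database configuration and subtract it from 256, the Db2 cap on primary plus secondary log files combined. A minimal sketch; the db2 lines are shown commented because they must run as db2inst1 inside the pod:

```shell
# Sketch: derive the logsecond value as 256 minus the number of primary logs.
calc_logsecond() { echo $((256 - $1)); }

# logprimary=$(db2 get db cfg for ilgdb | awk -F'= *' '/LOGPRIMARY/ {print $2}')
# db2 "update db cfg for ilgdb using logsecond $(calc_logsecond "$logprimary")"
```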

Configure IOPS settings for the NFS server

Configure the NFS server to have at least 10 IOPS. For more information, see the Adjusting IOPS topic in the IBM Cloud documentation.

Configure the timeout values for importing data

When you import large amounts of data, it is recommended to increase timeout values in the target Cloud Pak for Data environment. Complete these steps:
  1. Find the name of the conductor pod (is-en-conductor-0) by using conductor as the search string:
    oc get pods | grep conductor
  2. Log in to the conductor pod.
    oc exec -it is-en-conductor-0 bash
  3. Open the /opt/IBM/InformationServer/ASBNode/eclipse/plugins/com.ibm.iis.client/iis.client.site.properties file and add the following property:
    com.ibm.iis.http.soTimeout=36000000
  4. Find the name of the iis-services pod by using services as the search string:
    oc get pods | grep services
  5. Log in to the iis-services pod. Use the name that was returned in the previous step. For example:
    oc exec -it iis-services bash
  6. Run the following commands:
    /opt/IBM/InformationServer/ASBServer/bin/iisAdmin.sh -set -key com.ibm.iis.gov.vr.setting.maxObjectsInMemory -value 4000000
    /opt/IBM/InformationServer/ASBServer/bin/iisAdmin.sh -set -key com.ibm.iis.gov.xFrameOptions -value SAMEORIGIN
  7. Change the value of the Xmx option in the configMap file.
    1. Run the following command:
      oc -n ${PROJECT_CPD_INSTANCE} edit cm iis-server
    2. Modify the -Xmx option to have the -Xmx16384m value.
    3. Find the name of the iis-services pod. Run this command:
      oc get pods | grep iis-services
    4. Restart the iis-services pod. Use the name that was returned by the command in the previous step. For example:
      oc -n ${PROJECT_CPD_INSTANCE} delete pod iis-services
  8. Stop the Information Server application server by running the following command:
    /opt/IBM/InformationServer/wlp/bin/server stop iis
  9. Open the /opt/IBM/InformationServer/wlp/usr/servers/iis/server.xml file and configure the options to the following values:
    <httpSession ... invalidationTimeout="3600" ... />
    <ltpa expiration="7600m"/>
    <transaction ... clientInactivityTimeout="36000" propogatedOrBMTTranLifetimeTimeout="72000" totalTranLifetimeTimeout="72000" ... />
  10. Start the Information Server application server again by running the following command:
    /opt/IBM/InformationServer/wlp/bin/server start iis
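You can spot-check the new values after the restart. A sketch, run inside the iis-services pod, assuming the iisAdmin.sh -display option is available as in the commands above:

```shell
# Sketch: verify the timeout-related settings after the changes.
/opt/IBM/InformationServer/ASBServer/bin/iisAdmin.sh -display \
  -key com.ibm.iis.gov.vr.setting.maxObjectsInMemory
grep -E 'invalidationTimeout|clientInactivityTimeout|expiration' \
  /opt/IBM/InformationServer/wlp/usr/servers/iis/server.xml
```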

Create users in the target Cloud Pak for Data system

Before you start the migration, you must create Information Server users in Cloud Pak for Data manually. Complete these steps:
  1. In Cloud Pak for Data, go to Administration > User management.
  2. Click New user.
  3. Provide the required information and save the changes.
Important:
  • All user names in Cloud Pak for Data are always in lower case. As a result, if a user name in the source system contains any capital letter, the associations between that user and assets (properties like steward or created by) are ignored during migration. No workaround is available; you must re-create these associations manually.
  • To preserve the associations between stewards and assets, you must add the Data Steward role to re-created users in Cloud Pak for Data. This is valid only for users whose user names in the source system don’t contain capital letters.
    The re-created users should log in to Cloud Pak for Data at least once before you run the migration. Otherwise, you must manually add those users as stewards in Information Governance Catalog before running the migration. To manually add those users, log in to Information Governance Catalog by entering this URL in your browser:
    https://<source-host-name>/ibm/iis/igc/
    Then, go to the Administration page.
For information about roles and permissions in Cloud Pak for Data, see Managing users.
The following table contains information about Cloud Pak for Data permissions and the equivalent Information Server user roles.
Table 1. Roles and permissions
Information Server role → Cloud Pak for Data role or permission
  • Information Governance Catalog User; Data Preview Service User → View information assets
  • Suite Administrator; Information Governance Catalog Information Asset Administrator; Information Analyzer Project Administrator; Information Analyzer Data Administrator; Information Governance Catalog Glossary Administrator → Administrator role
  • Information Governance Catalog User → Access governance artifacts
  • (No equivalent role) → Manage governance categories
  • Common Metadata Importer or Common Metadata Administrator; Information Analyzer Data Administrator; Data Operator role at the workspace level; Business Analyst at the workspace level → Manage asset discovery
  • (No equivalent role) → Manage governance workflows
  • Information Governance Catalog Information Asset Administrator; Information Governance Catalog Information Asset Author; Data Preview Service User → Manage information assets
  • Common Metadata Administrator → Manage metadata
  • Rules Administrator; Rules Author; Rules Manager; Information Analyzer Data Administrator; Information Analyzer Project Administrator → Manage data quality
  • Rules User; Information Analyzer User → Access data quality
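Because associations are silently dropped for user names that contain capital letters, it can help to list the affected users before you migrate. A minimal sketch; users.txt is a hypothetical export of the Information Server user names, one per line:

```shell
# Sketch: list user names that contain at least one capital letter.
# Associations (steward, created by) for these users are lost on migration.
list_capitalized() { grep '[A-Z]' "$1" || true; }

# list_capitalized users.txt
```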

Install native connectors

You must install the following native connectors to be able to import metadata and run data discovery:
  • Db2 connector
  • Netezza® connector
Db2 connector
Complete the following steps:
  1. Download the installation files install.sh and db2_client.tar.gz from Fix Central.
  2. Copy the files to the /tmp directory on Cloud Pak for Data.
  3. Get the name of the conductor pod by running this command.
    oc get pods -n ${PROJECT_CPD_INSTANCE}| grep conductor
    The output looks similar to the following example, where is-en-conductor-0 is the pod name.
    is-en-conductor-0 1/1 Running 0 1d
  4. Copy the files to the conductor pod by running this command:
    oc cp /tmp/install.sh <project-name>/is-en-conductor-0:/tmp
    oc cp /tmp/db2_client.tar.gz <project-name>/is-en-conductor-0:/tmp
  5. Log in to the conductor pod by running this command:
    oc -n ${PROJECT_CPD_INSTANCE} exec -it is-en-conductor-0 bash
  6. Check whether the /mnt/dedicated_vol/Engine/is-en-conductor-0/EngineClients/ directory exists by running this command:
    [root@is-en-conductor-0 EngineClients]# ls /mnt/dedicated_vol/Engine/is-en-conductor-0/EngineClients/
    If the directory doesn’t exist, create it and navigate to it by running these commands:
    mkdir -p /mnt/dedicated_vol/Engine/is-en-conductor-0/EngineClients/
    cd /mnt/dedicated_vol/Engine/is-en-conductor-0/EngineClients/
  7. Copy the install.sh and db2_client.tar.gz files to this directory by running this command:
    cp /tmp/install.sh /tmp/db2_client.tar.gz /mnt/dedicated_vol/Engine/is-en-conductor-0/EngineClients/
  8. Create a new directory by running this command:
    mkdir db2_client
  9. Extract the db2_client.tar.gz file.
    [root@is-en-conductor-0 EngineClients]# tar -xvf db2_client.tar.gz
  10. Edit the db2client.rsp file to contain a Db2 install path, for example /mnt/dedicated_vol/Engine/is-en-conductor-0/EngineClients.
  11. Run the install.sh file:
    [root@is-en-conductor-0 EngineClients]# ./install.sh
  12. Verify the current working directory:
    [root@is-en-conductor-0]# pwd
    /home/dsadm/sqllib
  13. Set up your environment by running this command:
    source db2profile
  14. Get the IP address of the metadata repository (XMETA) container. Run this command:
    [root@is-en-conductor-0]# ifconfig
  15. Run the CATALOG TCPIP NODE command. Use the IP address that you retrieved in the previous step. For example:
    [root@is-en-conductor-0 sqllib]# db2 "catalog tcpip node docker remote 192.0.2.2 server 50000"
  16. Run the CATALOG DATABASE command:
    [root@is-en-conductor-0 sqllib]# db2 "catalog database xmeta at node docker"
  17. Connect to the metadata repository database:
    [root@is-en-conductor-0 sqllib]# db2 connect to xmeta user db2inst1 using isadmin
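You can confirm the new catalog entries with the Db2 directory commands; a short sketch, run in the same session:

```shell
# Sketch: verify the node and database entries that were just cataloged.
db2 list node directory   # should show the "docker" TCP/IP node
db2 list db directory     # should show XMETA cataloged at node "docker"
```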
Netezza connector
Complete the following steps:
  1. Download the installation file nz-linuxclient-v7.0.3-P2.tar.gz from Fix Central.
  2. Copy the file to the /tmp directory on Cloud Pak for Data.
  3. Get the name of the conductor pod by running this command.
    oc get pods -n ${PROJECT_CPD_INSTANCE}| grep conductor
    The output looks similar to the following example, where is-en-conductor-0 is the pod name.
    is-en-conductor-0 1/1 Running 0 1d
  4. Copy the installation file to the conductor pod by running this command:
    oc cp /tmp/nz-linuxclient-v7.0.3-P2.tar.gz <project-name>/is-en-conductor-0:/tmp
  5. Log in to the conductor pod by running this command:
    oc -n ${PROJECT_CPD_INSTANCE} exec -it is-en-conductor-0 bash
  6. Check whether the /mnt/dedicated_vol/Engine/is-en-conductor-0/EngineClients/ directory exists by running this command:
    [root@is-en-conductor-0 EngineClients]# ls /mnt/dedicated_vol/Engine/is-en-conductor-0/EngineClients/
    If the directory doesn’t exist, create it and navigate to it by running these commands:
    mkdir -p /mnt/dedicated_vol/Engine/is-en-conductor-0/EngineClients/
    cd /mnt/dedicated_vol/Engine/is-en-conductor-0/EngineClients/
  7. Copy the nz-linuxclient-v7.0.3-P2.tar.gz file to this directory by running this command:
    cp /tmp/nz-linuxclient-v7.0.3-P2.tar.gz /mnt/dedicated_vol/Engine/is-en-conductor-0/EngineClients/
  8. Create the target directory for the Netezza client by running this command:
    mkdir nz
  9. Extract the nz-linuxclient-v7.0.3-P2.tar.gz file:
    [root@is-en-conductor-0 EngineClients]# tar -xvf nz-linuxclient-v7.0.3-P2.tar.gz
  10. Go to the extracted directory linux64:
    [root@is-en-conductor-0 EngineClients]# cd linux64
  11. Unpack the NPS® Linux® Client:
    [root@is-en-conductor-0 linux64]# ./unpack
    When you are prompted Unpack the client to [/usr/local/nz], enter /mnt/dedicated_vol/Engine/is-en-conductor-0/EngineClients/nz. If the directory doesn’t exist, specify y to create it.
  12. Go back to the parent directory:
    [root@is-en-conductor-0 linux64]# cd ..
  13. Check the contents of the directory:
    [root@is-en-conductor-0 EngineClients]# ls
    bin64 datadirect.package.tar.z db2_client lib lib64 licenses linux linux64 nz nz-linuxclient-v7.0.3-P2.tar.gz sys webadmin
  14. Navigate to the nz directory and list its contents:
    [root@is-en-conductor-0 EngineClients]# cd nz
    [root@is-en-conductor-0 nz]# ls
    bin64 lib lib64 licenses sys
  15. Edit the odbc.ini file by running this command:
    vi $ODBCINI
    Add the following data source information to the odbc.ini file. Replace the values of Servername, Port, Username, and Password with the proper values for your system.
    [NZDSN]
    Driver=/mnt/dedicated_vol/Engine/is-en-conductor-0/EngineClients/nz/lib64/libnzodbc.so
    Description=NetezzaSQL ODBC
    Servername=203.0.113.17
    Port=5480
    Database=netezzadb
    Username=user1
    Password=password
    ReadOnly=false
    ShowSystemTables=false
    LegacySQLTables=false
    LoginTimeout=0
    QueryTimeout=0
    DateFormat=1
    NumericAsChar=false
    SQLBitOneZero=false
    StripCRLF=false
    securityLevel=preferredUnSecured
    caCertFile=
  16. Open the dsenv file in the /opt/IBM/InformationServer/Server/DSEngine/ directory and add the following commands to the file:
    export PATH=/mnt/dedicated_vol/Engine/is-en-conductor-0/EngineClients/nz/bin64:$PATH
    export LD_LIBRARY_PATH=/mnt/dedicated_vol/Engine/is-en-conductor-0/EngineClients/nz/lib64:$LD_LIBRARY_PATH
    export NZ_ODBC_INI_PATH=/opt/IBM/InformationServer/DSEngine
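After you edit dsenv, you can confirm in a new shell that the client resolves; a sketch (nzsql ships in the client's bin64 directory):

```shell
# Sketch: source dsenv and check that the Netezza client tools are reachable.
. /opt/IBM/InformationServer/Server/DSEngine/dsenv
command -v nzsql          # should resolve under .../EngineClients/nz/bin64
echo "$NZ_ODBC_INI_PATH"  # directory that contains odbc.ini
```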