Data Cataloging Harvester CLI

Data Cataloging Harvester is a new capability that imports external data into the Data Cataloging service catalog database, even when the data does not come from a Db2 database that uses the same schema.

Before you begin

  • Make sure that the following software is installed:
    • Python 3.11 or later
    • The curl, tar, gzip, sed, and pip3 packages
    • The Red Hat® OpenShift® CLI
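
Before you continue, it can help to confirm that the host meets these prerequisites. The following is a minimal sketch in Python, which is already required for the Harvester, that checks the interpreter version and the presence of the listed command-line tools. The tool list is an assumption based on this procedure and can be adjusted.

import shutil
import sys

# The Harvester CLI requires Python 3.11 or later.
if sys.version_info < (3, 11):
    raise SystemExit(f"Python 3.11 or later is required, found {sys.version.split()[0]}")

# Command-line tools that this procedure uses; adjust the list as needed.
for tool in ("curl", "tar", "gzip", "sed", "pip3", "oc"):
    status = "found" if shutil.which(tool) else "MISSING"
    print(f"{status:7} {tool}")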

With the Data Cataloging Harvester, you can import external data into the catalog database even if it does not come from a Db2 table that uses the exact same schema. IBM Fusion 2.8.0 supports importing data that is associated with a single data source, also known as a connection. When the CLI is run against a valid SQLite file, it creates a connection and imports either all the metadata stored in the import file records or additional tags, depending on the command that is executed.

Supported import formats

The Harvester CLI reads records in the supported format and sends them to the import service component that runs on the cluster, where they are validated before Db2 uses them. The input file must not only be in a format that the Harvester supports, but must also pass the schema rules that are checked during the analysis tasks.

The Data Cataloging Harvester requires the following schemas, with at least these two tables:
Connections table
It is used to store information about the data source.
CREATE TABLE CONNECTIONS (
    name TEXT,
    platform TEXT,
    cluster TEXT,
    datasource TEXT,
    site TEXT,
    host TEXT,
    mount_point TEXT,
    application_code TEXT,
    local_mount TEXT,
    total_records INTEGER,
    current_gen TEXT,
    password TEXT,
    online TEXT,
    scan_topic TEXT,
    le_topic TEXT,
    le_enabled INTEGER
)
Metaocean table
It contains all the base metadata present in the data source.
CREATE TABLE METAOCEAN (
    INODE INTEGER,
    OWNER TEXT,
    [GROUP] TEXT,
    PERMISSIONS TEXT,
    UID INTEGER,
    GID INTEGER,
    PATH TEXT,
    FILENAME TEXT,
    MTIME TEXT,
    ATIME TEXT,
    CTIME TEXT,
    SIZE INTEGER,
    TYPE TEXT,
    STATE TEXT
)
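
As an illustration of the required schemas, the following Python sketch builds a minimal SQLite import file that contains both tables and one placeholder row each. All values, including the path /tmp/example.sqlite, are illustrative assumptions; replace them with real metadata from your data source.

import sqlite3

# Build a minimal import file with the two required tables.
con = sqlite3.connect("/tmp/example.sqlite")
cur = con.cursor()

cur.execute("""CREATE TABLE IF NOT EXISTS CONNECTIONS (
    name TEXT, platform TEXT, cluster TEXT, datasource TEXT, site TEXT,
    host TEXT, mount_point TEXT, application_code TEXT, local_mount TEXT,
    total_records INTEGER, current_gen TEXT, password TEXT, online TEXT,
    scan_topic TEXT, le_topic TEXT, le_enabled INTEGER
)""")

cur.execute("""CREATE TABLE IF NOT EXISTS METAOCEAN (
    INODE INTEGER, OWNER TEXT, [GROUP] TEXT, PERMISSIONS TEXT, UID INTEGER,
    GID INTEGER, PATH TEXT, FILENAME TEXT, MTIME TEXT, ATIME TEXT,
    CTIME TEXT, SIZE INTEGER, TYPE TEXT, STATE TEXT
)""")

# One placeholder connection (the data source) ...
cur.execute(
    "INSERT INTO CONNECTIONS VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    ("example-connection", "example-platform", "cluster1", "datasource1", "site1",
     "host1.example.com", "/export/data", "", "/mnt/data", 1, "1", "", "1", "", "", 0),
)

# ... and one placeholder file record from that data source.
cur.execute(
    "INSERT INTO METAOCEAN VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    (1001, "root", "root", "rw-r--r--", 0, 0, "/export/data/", "file1.txt",
     "2024-01-01 00:00:00", "2024-01-01 00:00:00", "2024-01-01 00:00:00",
     4096, "file", ""),
)

con.commit()
con.close()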

The SQLite database can contain additional, optional tables if you want to import tag values for extra metadata. These additional tables are processed only when the tags mode of the Harvester CLI identifies them by their names.

Example of additional tags table:
Note: The inode and filename columns are the only required columns; the remaining columns are the names of tags that are already created in Data Cataloging.
CREATE TABLE META (
    inode TEXT,
    filename TEXT,
    fkey TEXT,
    ExampleCustomTag1 TEXT,
    ExampleCustomTag2 TEXT,
    ExampleCustomTag3 TEXT
)
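
Continuing the same sketch, tag values can be added by creating the optional table and inserting rows whose inode and filename match records in METAOCEAN. The custom tag columns and all values here are hypothetical; each tag column name must correspond to a tag that already exists in Data Cataloging.

import sqlite3

con = sqlite3.connect("/tmp/example.sqlite")
cur = con.cursor()

# Optional tags table; the columns after fkey must match existing tags.
cur.execute("""CREATE TABLE IF NOT EXISTS META (
    inode TEXT, filename TEXT, fkey TEXT,
    ExampleCustomTag1 TEXT, ExampleCustomTag2 TEXT, ExampleCustomTag3 TEXT
)""")

# inode and filename identify the record that receives the tag values,
# so they should match a row that is already present in METAOCEAN.
cur.execute(
    "INSERT INTO META VALUES (?, ?, ?, ?, ?, ?)",
    ("1001", "file1.txt", "", "value1", "value2", "value3"),
)

con.commit()
con.close()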

Setting up the external host

  1. Log in to the Red Hat OpenShift CLI. For the procedure, see Red Hat OpenShift CLI.
  2. If the Harvester CLI runs on a host outside of the OpenShift cluster, SSL problems can occur because of self-signed certificates. To resolve them, use the following commands to download the router certificate to the host that runs the Harvester, and then set environment variables that point to it. An optional verification sketch follows this procedure.
    1. Run the following command to download the OpenShift router TLS certificate.
      oc -n openshift-ingress get secret router-certs-default -o json | jq -r '.data."tls.crt"' | base64 -d > $(pwd)/router_tls.crt
    2. Run the following command to export the environment variables to define the path of the TLS certificate.
      export IMPORT_SERVICE_CERT=$(pwd)/router_tls.crt
      export DCS_CERT=$(pwd)/router_tls.crt
  3. Download the Harvester CLI.
    export DCS_NAMESPACE=ibm-data-cataloging
    export DCS_IMPORT_SVC_HOST=$(oc -n $DCS_NAMESPACE get route isd-import-service -o jsonpath="{.spec.host}")
    export DCS_CONSOLE_HOST=$(oc -n $DCS_NAMESPACE get route console -o jsonpath="{.spec.host}")
    export DCS_USERNAME=$(oc -n $DCS_NAMESPACE get secret keystone -o jsonpath="{.data.user}" | base64 -d)
    export DCS_PASSWORD=$(oc -n $DCS_NAMESPACE get secret keystone -o jsonpath="{.data.password}" | base64 -d)
    
    curl "https://${DCS_IMPORT_SVC_HOST}/v1/download_harvester" --output harvester.tar.gz --insecure
    tar xf harvester.tar.gz
    rm -vf harvester.tar.gz
  4. Run the following setup script to start using the Harvester CLI.
    chmod +x ./harvester/setup.sh
    ./harvester/setup.sh
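
After the setup completes, you can optionally confirm that the router certificate is trusted for the import service route. The following Python sketch reuses the DCS_IMPORT_SVC_HOST and IMPORT_SERVICE_CERT environment variables that were exported in the previous steps and opens the download_harvester endpoint, which is already known from this procedure; any response that arrives without an SSL error confirms the trust chain.

import os
import ssl
import urllib.request

host = os.environ["DCS_IMPORT_SVC_HOST"]    # exported in step 3
cafile = os.environ["IMPORT_SERVICE_CERT"]  # exported in step 2

# Trust only the downloaded OpenShift router certificate.
context = ssl.create_default_context(cafile=cafile)

url = f"https://{host}/v1/download_harvester"
with urllib.request.urlopen(url, context=context) as response:
    print("TLS verification succeeded, HTTP status:", response.status)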

Running the Harvester CLI

  1. Run the following commands to import the records with all the base metadata.
    SQLITE_FILE_PATH=/tmp/example.sqlite
    CONFIG_FILE=$(pwd)/harvester/config.ini
    python3 harvester metaocean $SQLITE_FILE_PATH -c $CONFIG_FILE
  2. Use the -p nfs option if you want to import NFS-based metadata.
    SQLITE_FILE_PATH=/tmp/example.sqlite
    CONFIG_FILE=$(pwd)/harvester/config.ini
    python3 harvester metaocean $SQLITE_FILE_PATH -c $CONFIG_FILE -p nfs
  3. Run the following command to import tag values after the tags are created by using either the user interface or the API.
    SQLITE_FILE_PATH=/tmp/example.sqlite
    CONFIG_FILE=$(pwd)/harvester/config.ini
    python3 harvester tags $SQLITE_FILE_PATH -c $CONFIG_FILE
  4. Use the -p nfs option if you want to import NFS-based tag values.
    SQLITE_FILE_PATH=/tmp/example.sqlite
    CONFIG_FILE=$(pwd)/harvester/config.ini
    python3 harvester tags $SQLITE_FILE_PATH -c $CONFIG_FILE -p nfs
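
If you prefer to drive both imports from a single script, the following Python sketch wraps the commands above with subprocess. The file paths and the NFS flag setting are assumptions to adapt to your environment; the metaocean command runs first so that the connection and base metadata exist before the tag values are imported.

import subprocess

SQLITE_FILE = "/tmp/example.sqlite"
CONFIG_FILE = "./harvester/config.ini"
USE_NFS = False  # set to True to add the -p nfs option

def run_harvester(mode: str) -> None:
    cmd = ["python3", "harvester", mode, SQLITE_FILE, "-c", CONFIG_FILE]
    if USE_NFS:
        cmd += ["-p", "nfs"]
    subprocess.run(cmd, check=True)  # fail fast if the harvester exits with an error

run_harvester("metaocean")  # base metadata first, which also creates the connection
run_harvester("tags")       # then values for tags that already exist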

Configuring the Harvester

The config.ini file inside the harvester directory controls the data import behavior. The following variables are available to configure:
BlockSizeBytes
It indicates the block size that is configured in the data source that is being imported. Normally, the value is set to 4096 bytes. This value is used to calculate the real capacity that is allocated on the file system, based on the number of blocks that are required to store each file.
AvailableMOVW
It must be set to 10, which is the only supported value for optimal data ingestion. It helps the harvester calculate how a set of records can be divided into pieces for parallel insertion.
MaxProcessingWorkers
This variable is ignored; the harvester always uses one worker process. If a higher level of parallelism is required, run another harvester instance to process an additional SQLite file.
IngestRequestRetries
After the harvester finishes processing the metadata from the SQLite file, it sends the metadata in an HTTP request to the import service so that it can be indexed in the database. This request can fail for different reasons, such as temporary network unavailability or a considerable API overload that blocks incoming requests.

By default, the harvester tries to send the metadata up to 120 times, limited by whichever of the retry count and the time limit is reached first. The suggested range is 120 to 240.

StatusRequestRetries
After the ingest request is sent, the harvester follows up with additional requests to get the status of the ingest operation and confirm whether it finished successfully. The operation might take a long time, so retrying is required. The suggested range is 240 to 360.
TagsWorkdir
The name of the directory, inside the harvester directory, where temporary files are stored during tag harvesting. By default, this directory is named tmp, but you can change it if a more descriptive name is needed.
MOBatchSize and TagsBatchSize
The number of rows to read from the SQLite file and load into memory at a time; a higher value results in higher memory usage. Values higher than 100,000 are not recommended because they do not provide a performance benefit. The suggested values are 10,000, 50,000, and 100,000.
MOBatchesPerSegment and TagsBatchesPerSegment
The number of batches to process before sending them in an ingestion request; a higher value results in bigger files. Using a value higher than 100 does not provide a direct performance benefit. The suggested values are 10, 50, and 100.
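
To see how the batch settings combine, the following Python sketch reads config.ini with configparser and reports how many rows are held in memory per batch and sent per ingestion request. The key names come from the list above; because the section header inside config.ini is not documented here, every section is searched, and the fallback defaults are taken from the suggested values.

import configparser

cfg = configparser.ConfigParser()
cfg.read("./harvester/config.ini")

def read_int(key: str, default: int) -> int:
    # Search every section for the key; fall back to a suggested default.
    for section in cfg.sections():
        if cfg.has_option(section, key):
            return cfg.getint(section, key)
    return default

mo_batch_size = read_int("MOBatchSize", 10000)
mo_batches_per_segment = read_int("MOBatchesPerSegment", 10)

print("Rows loaded into memory per batch:", mo_batch_size)
print("Rows sent per ingestion request:  ", mo_batch_size * mo_batches_per_segment)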