Data Cataloging Harvester CLI
Data Cataloging Harvester is a new capability designed to import external data into the Data Cataloging service catalog database, even if the data does not come from a Db2 database that uses the same schema.
Before you begin
- Make sure that you have the following installed:
  - Python 3.11 or later
  - curl, tar, gzip, sed, and pip3 packages
  - Red Hat® OpenShift® CLI
With the Data Cataloging Harvester, you can import external data into the catalog database even if it does not come from a Db2 table with the exact same schema. IBM Fusion 2.8.0 supports importing data associated with a single data source, also known as a connection. When the CLI is run against a valid SQLite file, it creates a connection and imports either all the metadata stored in the import file records or additional tags, depending on the command that is executed.
Supported import formats
The Harvester CLI reads records in the specified format and sends them to the import service component that runs on the cluster, where they are validated before being ingested into Db2. The input file must be in a format that the Harvester supports and must also pass the schema validation rules that are applied during analysis.
- Connections table
- It is used to store information about the data source.
CREATE TABLE CONNECTIONS (
    name TEXT, platform TEXT, cluster TEXT, datasource TEXT,
    site TEXT, host TEXT, mount_point TEXT, application_code TEXT,
    local_mount TEXT, total_records INTEGER, current_gen TEXT, password TEXT,
    online TEXT, scan_topic TEXT, le_topic TEXT, le_enabled INTEGER
)
- Metaocean table
- It contains all the base metadata present in the data source.
CREATE TABLE METAOCEAN (
    INODE INTEGER, OWNER TEXT, [GROUP] TEXT, PERMISSIONS TEXT,
    UID INTEGER, GID INTEGER, PATH TEXT, FILENAME TEXT,
    MTIME TEXT, ATIME TEXT, CTIME TEXT, SIZE INTEGER,
    TYPE TEXT, STATE TEXT
)
The SQLite database can contain more optional tables if you want to import tag values as additional metadata. These additional tables are processed only in the tags mode of the Harvester CLI, which identifies the tables by their names. A sketch that populates these tables with example values follows the META table definition.
CREATE TABLE META (
    inode TEXT, filename TEXT, fkey TEXT,
    ExampleCustomTag1 TEXT, ExampleCustomTag2 TEXT, ExampleCustomTag3 TEXT
)
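As an illustration only, the following sketch shows how such an import file might be populated by using the sqlite3 command-line shell. It assumes that the CONNECTIONS, METAOCEAN, and META tables were already created with the definitions shown above; all values, including the platform string, time stamp format, and fkey value, are hypothetical placeholders rather than values required by the Harvester.
# Hypothetical example: add one connection, one base-metadata record, and
# one custom-tag record to /tmp/example.sqlite. All values are placeholders.
sqlite3 /tmp/example.sqlite <<'SQL'
-- One data source (connection) for this import file
INSERT INTO CONNECTIONS (name, platform, cluster, datasource, site, host,
  mount_point, application_code, local_mount, total_records, current_gen,
  password, online, scan_topic, le_topic, le_enabled)
VALUES ('example-connection', 'example-platform', 'example-cluster',
  'example-datasource', 'example-site', 'nfs-server.example.com',
  '/export/data', 'example-app', '/mnt/data', 1, '1', '', 'true', '', '', 0);

-- One base-metadata record for a single file
INSERT INTO METAOCEAN (INODE, OWNER, [GROUP], PERMISSIONS, UID, GID, PATH,
  FILENAME, MTIME, ATIME, CTIME, SIZE, TYPE, STATE)
VALUES (101, 'root', 'root', 'rw-r--r--', 0, 0, '/export/data/', 'report.csv',
  '2024-01-01 00:00:00', '2024-01-01 00:00:00', '2024-01-01 00:00:00',
  10000, 'file', 'online');

-- Optional custom tag values for the same file (read only in tags mode);
-- the fkey value is a placeholder
INSERT INTO META (inode, filename, fkey, ExampleCustomTag1, ExampleCustomTag2, ExampleCustomTag3)
VALUES ('101', 'report.csv', '/export/data/report.csv', 'project-a', 'tier-1', 'finance');
SQL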
Setting up the external host
- Log in to the Red Hat OpenShift CLI. For the procedure, see Red Hat OpenShift CLI.
- If the Harvester CLI runs on a host outside of the OpenShift cluster, SSL problems caused by self-signed certificates can occur. To resolve them, use the following commands to download the router certificate to the host that runs the Harvester, and then set environment variables that point to it. An optional verification sketch follows this procedure.
- Run the following command to download the OpenShift router TLS
certificate.
oc -n openshift-ingress get secret router-certs-default -o json | jq -r '.data."tls.crt"' | base64 -d > $(pwd)/router_tls.crt
- Run the following commands to export the environment variables that define the path of the TLS certificate.
export IMPORT_SERVICE_CERT=$(pwd)/router_tls.crt
export DCS_CERT=$(pwd)/router_tls.crt
- Download the Harvester CLI.
export DCS_NAMESPACE=ibm-data-cataloging
export DCS_IMPORT_SVC_HOST=$(oc -n $DCS_NAMESPACE get route isd-import-service -o jsonpath="{.spec.host}")
export DCS_CONSOLE_HOST=$(oc -n $DCS_NAMESPACE get route console -o jsonpath="{.spec.host}")
export DCS_USERNAME=$(oc -n $DCS_NAMESPACE get secret keystone -o jsonpath="{.data.user}" | base64 -d)
export DCS_PASSWORD=$(oc -n $DCS_NAMESPACE get secret keystone -o jsonpath="{.data.password}" | base64 -d)
curl "https://${DCS_IMPORT_SVC_HOST}/v1/download_harvester" --output harvester.tar.gz --insecure
tar xf harvester.tar.gz
rm -vf harvester.tar.gz
- Run the following setup script to start using the Harvester CLI.
chmod +x ./harvester/setup.sh
./harvester/setup.sh
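Optionally, before you run the Harvester, you can sanity-check the downloaded certificate and the exported variables from the external host. The following sketch uses standard openssl and curl commands; the request against the root path of the import service route is only an illustrative connectivity check and is not part of the documented procedure.
# Optional, illustrative checks from the external host
# Inspect the downloaded router certificate
openssl x509 -in "$IMPORT_SERVICE_CERT" -noout -subject -issuer -dates

# Confirm that the import service route is reachable with the certificate;
# the root path "/" is only an example and may return a redirect or an error
# page, which is still enough to confirm name resolution and TLS trust
curl --cacert "$DCS_CERT" "https://${DCS_IMPORT_SVC_HOST}/" -o /dev/null -s -w '%{http_code}\n'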
Running the Harvester CLI
- Make sure that you export the records with all the base metadata, and then run the following command to import them. An optional pre-flight check sketch follows these steps.
SQLITE_FILE_PATH=/tmp/example.sqlite
CONFIG_FILE=$(pwd)/harvester/config.ini
python3 harvester metaocean $SQLITE_FILE_PATH -c $CONFIG_FILE
- Use the -p nfs option if you want to import NFS-based metadata.
SQLITE_FILE_PATH=/tmp/example.sqlite
CONFIG_FILE=$(pwd)/harvester/config.ini
python3 harvester metaocean $SQLITE_FILE_PATH -c $CONFIG_FILE -p nfs
- Run the following command to import tag values after the tags are created by using either the user interface or the API.
SQLITE_FILE_PATH=/tmp/example.sqlite
CONFIG_FILE=$(pwd)/harvester/config.ini
python3 harvester tags $SQLITE_FILE_PATH -c $CONFIG_FILE
- Use the -p nfs option if you want to import NFS-based metadata.
SQLITE_FILE_PATH=/tmp/example.sqlite
CONFIG_FILE=$(pwd)/harvester/config.ini
python3 harvester metaocean $SQLITE_FILE_PATH -c $CONFIG_FILE -p nfs
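Optionally, before you run either command, you can confirm that the SQLite file contains the tables and record counts that you expect. The following pre-flight sketch uses the sqlite3 command-line shell and the table names from the supported import formats; it is not part of the Harvester CLI itself.
# Optional, illustrative pre-flight check on the import file
SQLITE_FILE_PATH=/tmp/example.sqlite

# List the tables present in the file
sqlite3 "$SQLITE_FILE_PATH" ".tables"

# Count the base-metadata records that a metaocean import would read
sqlite3 "$SQLITE_FILE_PATH" "SELECT COUNT(*) FROM METAOCEAN;"

# Show the connection that the import would create
sqlite3 "$SQLITE_FILE_PATH" "SELECT name, platform, datasource, total_records FROM CONNECTIONS;"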
Configuring the Harvester
The config.ini file inside the harvester directory plays a critical role in the data import behavior. The following are the available variables to configure (a hypothetical config.ini sketch follows this list):
- BlockSizeBytes
- It indicates the block size configured in the data source that is being imported. Normally, the value is set to 4096 bytes, which is used to calculate the real capacity allocated on the file system based on the number of blocks required to store the file. For example, a 10,000-byte file with a 4096-byte block size occupies 3 blocks, or 12,288 bytes of allocated capacity.
- AvailableMOVW
- It must be set to 10, which is the only supported value for optimal data ingestion. It helps the harvester calculate how a set of records can be divided into pieces for parallel insertion.
- MaxProcessingWorkers
- This variable is ignored; one worker process is always used. If a higher level of parallelism is required, run another harvester instance to process an additional SQLite file, as shown in the parallel-run sketch at the end of this section.
- IngestRequestRetries
- After the harvester finishes processing the metadata from the SQLite file, it sends the metadata in an HTTP request to the import service so that it can be indexed in the database. This request can fail for different reasons, such as temporary network unavailability or a considerable API overload that blocks incoming requests. By default, the harvester retries sending the metadata up to 120 times, subject to the configured time limit. The suggested range is 120 to 240.
- StatusRequestRetries
- After the ingest request is sent, the harvester follows up with additional requests that get the status of the ingest operation to confirm whether it finished successfully. The operation might take a long time, so retrying is required. The suggested range is 240 to 360.
- TagsWorkdir
- The name of a directory inside the harvester directory where temporary files are stored during tag harvesting. By default, this directory is named tmp, but you can change the name if a more descriptive directory name is necessary.
- MOBatchSize and TagsBatchSize
- The number of rows to read from the SQLite file and load into memory at a time; a higher value results in higher memory usage. Using a value higher than 100,000 is not recommended because it does not provide a performance benefit. The suggested values are 10,000, 50,000, and 100,000.
- MOBatchesPerSegment and TagsBatchesPerSegment
- The number of batches to process before sending them in an ingestion request; a higher value results in bigger files. Using a value higher than 100 does not translate to a direct performance benefit. The suggested values are 10, 50, and 100.
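The exact layout of the shipped config.ini file, including its section headers, is not reproduced here, so the following is only a hypothetical sketch of how the variables described in this section might look. It writes to a separate config.ini.example file so that the file delivered with the CLI is not overwritten; compare it with the delivered file and change only the values that you need.
# Hypothetical sketch only: the [DEFAULT] section header, the file name
# config.ini.example, and the exact key layout are assumptions
cat > harvester/config.ini.example <<'EOF'
[DEFAULT]
BlockSizeBytes = 4096
AvailableMOVW = 10
MaxProcessingWorkers = 1
IngestRequestRetries = 120
StatusRequestRetries = 240
TagsWorkdir = tmp
MOBatchSize = 50000
TagsBatchSize = 50000
MOBatchesPerSegment = 50
TagsBatchesPerSegment = 50
EOF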
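Because the Harvester always uses a single worker process, one way to increase parallelism, as mentioned for MaxProcessingWorkers, is to run one Harvester instance per SQLite file. The following is a minimal sketch; the file paths are examples, and it assumes that each file passes the schema validation on its own.
# Illustrative parallel run: one harvester process per SQLite file
CONFIG_FILE=$(pwd)/harvester/config.ini

for SQLITE_FILE_PATH in /tmp/part1.sqlite /tmp/part2.sqlite /tmp/part3.sqlite; do
  # Start one harvester in the background for each file
  python3 harvester metaocean "$SQLITE_FILE_PATH" -c "$CONFIG_FILE" &
done

# Wait for all background harvester processes to finish
wait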