Data Cataloging Harvester CLI
IBM Data Cataloging Harvester is a new capability that imports external data into the IBM Data Cataloging service catalog database, even when the data does not come from a Db2 database that uses the same schema.
Before you begin
- Make sure that you have the following installations:
  - Python 3.11 or later version
  - curl, tar, gzip, sed, and pip3 packages
  - Red Hat® OpenShift® CLI
With the IBM Data Cataloging Harvester, you can import external data into the catalog database even if it does not come from a Db2 table that uses the exact same schema. IBM Fusion 2.8.0 supports importing data that is associated with a single data source, also known as a connection. When you run the CLI against a valid SQLite file, it creates a connection and imports either the metadata stored in the import file records or additional tags, depending on the command that is executed.
Supported import formats
The Harvester CLI reads records in the specified format and sends them to the import service component that runs on the cluster, where the records are validated before Db2 uses them. The input file must not only be in a format that the Harvester supports, but it must also pass the schema rules that are checked during the analysis tasks.
- Connections table
- It is used to store information about the data source.
CREATE TABLE CONNECTIONS (
    name TEXT,
    platform TEXT,
    cluster TEXT,
    datasource TEXT,
    site TEXT,
    host TEXT,
    mount_point TEXT,
    application_code TEXT,
    local_mount TEXT,
    total_records INTEGER,
    current_gen TEXT,
    password TEXT,
    online TEXT,
    scan_topic TEXT,
    le_topic TEXT,
    le_enabled INTEGER
)
- Metaocean table
- It contains all the base metadata present in the data source.
CREATE TABLE METAOCEAN (
    INODE INTEGER,
    OWNER TEXT,
    [GROUP] TEXT,
    PERMISSIONS TEXT,
    UID INTEGER,
    GID INTEGER,
    PATH TEXT,
    FILENAME TEXT,
    MTIME TEXT,
    ATIME TEXT,
    CTIME TEXT,
    SIZE INTEGER,
    TYPE TEXT,
    STATE TEXT
)
The SQLite database can contain more optional tables if you want to import tag values for additional metadata. These additional tables are processed only when the Harvester CLI runs in tags mode, which identifies the tables by their names.
CREATE TABLE META (
    inode TEXT,
    filename TEXT,
    fkey TEXT,
    ExampleCustomTag1 TEXT,
    ExampleCustomTag2 TEXT,
    ExampleCustomTag3 TEXT
)
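For reference, the following is a minimal sketch of how such an import file could be assembled with the sqlite3 command-line shell. The CREATE TABLE statements are taken from the schemas above; the inserted rows, the connection values, and the timestamp format are illustrative placeholders only and are not taken from the product documentation, so adjust them to match your data source and the value conventions that the import service validation expects.
# Create an example import file; the table definitions match the schemas above
sqlite3 /tmp/example.sqlite <<'EOF'
CREATE TABLE CONNECTIONS ( name TEXT, platform TEXT, cluster TEXT, datasource TEXT, site TEXT, host TEXT, mount_point TEXT, application_code TEXT, local_mount TEXT, total_records INTEGER, current_gen TEXT, password TEXT, online TEXT, scan_topic TEXT, le_topic TEXT, le_enabled INTEGER );
CREATE TABLE METAOCEAN ( INODE INTEGER, OWNER TEXT, [GROUP] TEXT, PERMISSIONS TEXT, UID INTEGER, GID INTEGER, PATH TEXT, FILENAME TEXT, MTIME TEXT, ATIME TEXT, CTIME TEXT, SIZE INTEGER, TYPE TEXT, STATE TEXT );
-- Placeholder connection that describes the data source; real values depend on your environment
INSERT INTO CONNECTIONS (name, platform, datasource, site, host, mount_point, total_records)
VALUES ('example-connection', 'nfs', 'example-datasource', 'example-site', 'nfs.example.com', '/export/data', 1);
-- One placeholder file record; the timestamp, TYPE, and STATE values must match what the import service expects
INSERT INTO METAOCEAN (INODE, OWNER, [GROUP], PERMISSIONS, UID, GID, PATH, FILENAME, MTIME, ATIME, CTIME, SIZE, TYPE, STATE)
VALUES (1001, 'root', 'root', 'rw-r--r--', 0, 0, '/export/data/', 'example.txt', '2024-01-01 00:00:00', '2024-01-01 00:00:00', '2024-01-01 00:00:00', 4096, 'file', 'example-state');
EOF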
Setting up the external host
- Log in to the Red Hat OpenShift CLI. For the procedure, see Red Hat OpenShift CLI.
- If the Harvester CLI runs on a host outside of the OpenShift cluster, you might encounter SSL problems caused by self-signed certificates. To resolve the SSL problems, use the following commands to download the router certificate on the host that runs the Harvester, and then reference the downloaded certificate through environment variables. A verification sketch follows this procedure.
- Run the following command to download the OpenShift router TLS certificate.
oc -n openshift-ingress get secret router-certs-default -o json | jq -r '.data."tls.crt"' | base64 -d > $(pwd)/router_tls.crt
- Run the following commands to export the environment variables that define the path of the TLS certificate.
export IMPORT_SERVICE_CERT=$(pwd)/router_tls.crt
export DCS_CERT=$(pwd)/router_tls.crt
- Download the Harvester CLI.
export DCS_NAMESPACE=ibm-data-cataloging
export DCS_IMPORT_SVC_HOST=$(oc -n $DCS_NAMESPACE get route isd-import-service -o jsonpath="{.spec.host}")
export DCS_CONSOLE_HOST=$(oc -n $DCS_NAMESPACE get route console -o jsonpath="{.spec.host}")
export DCS_USERNAME=$(oc -n $DCS_NAMESPACE get secret keystone -o jsonpath="{.data.user}" | base64 -d)
export DCS_PASSWORD=$(oc -n $DCS_NAMESPACE get secret keystone -o jsonpath="{.data.password}" | base64 -d)
curl "https://${DCS_IMPORT_SVC_HOST}/v1/download_harvester" --output harvester.tar.gz --insecure
tar xf harvester.tar.gz
rm -vf harvester.tar.gz
- Run the following setup script to start using the Harvester CLI.
chmod +x ./harvester/setup.sh
./harvester/setup.sh
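Optionally, before you run the Harvester, you can confirm that the downloaded router certificate is accepted for the import service route. The following is a minimal sketch that reuses the DCS_IMPORT_SVC_HOST and IMPORT_SERVICE_CERT variables that are exported in the previous steps.
# Verify TLS against the router certificate instead of using --insecure.
# Any printed HTTP status code (even 404) confirms that the TLS handshake succeeded;
# otherwise curl reports a certificate error.
curl --cacert "$IMPORT_SERVICE_CERT" "https://${DCS_IMPORT_SVC_HOST}/" \
  --silent --show-error --output /dev/null --write-out '%{http_code}\n'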
Running the Harvester CLI
- Make sure that you export the records with all the base metadata, and then run the following command to import them.
SQLITE_FILE_PATH=/tmp/example.sqlite
CONFIG_FILE=$(pwd)/harvester/config.ini
python3 harvester metaocean $SQLITE_FILE_PATH -c $CONFIG_FILE
- Use the -p nfs option if you want to import NFS-based metadata.
SQLITE_FILE_PATH=/tmp/example.sqlite
CONFIG_FILE=$(pwd)/harvester/config.ini
python3 harvester metaocean $SQLITE_FILE_PATH -c $CONFIG_FILE -p nfs
- Run the following command to import tag values after the tags are created by using either the user interface or the API.
SQLITE_FILE_PATH=/tmp/example.sqlite
CONFIG_FILE=$(pwd)/harvester/config.ini
python3 harvester tags $SQLITE_FILE_PATH -c $CONFIG_FILE
- Use the -p nfs option if you want to import NFS-based metadata.
SQLITE_FILE_PATH=/tmp/example.sqlite
CONFIG_FILE=$(pwd)/harvester/config.ini
python3 harvester metaocean $SQLITE_FILE_PATH -c $CONFIG_FILE -p nfs
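The two modes are typically run in sequence: the metaocean command creates the connection and imports the base metadata, the tags are then created through the user interface or API, and the tags command imports the tag values. The following sketch only chains the commands that are shown above for one SQLite file; the file path is the same illustrative example that is used in the previous steps.
set -e
SQLITE_FILE_PATH=/tmp/example.sqlite
CONFIG_FILE=$(pwd)/harvester/config.ini

# Create the connection and import the base metadata
python3 harvester metaocean $SQLITE_FILE_PATH -c $CONFIG_FILE

# The required tags must already exist (created in the user interface or API)
# before the tag values are imported from the optional tables
python3 harvester tags $SQLITE_FILE_PATH -c $CONFIG_FILE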
Configuring the Harvester
The config.ini file inside the harvester directory controls the data import behavior. The following are the available variables to configure:
- BlockSizeBytes
- It indicates the block size configured in the datasource that is being imported. Normally, the value is set to 4096 bytes, which helps to calculate the real capacity allocated on the filesystem based on the blocks required to store the file.
- AvailableMOVW
- It must be set to 10, which is the only supported value for optimal data ingestion. It helps the harvester calculate into how many pieces a set of records can be divided for parallel insertion.
- MaxProcessingWorkers
- This variable is ignored; one worker process is always used. If a higher level of parallelism is required, run another harvester instance to process an additional SQLite file.
- IngestRequestRetries
- After the harvester finishes processing the metadata from the SQLite file, it sends the metadata in an HTTP request to the import service so that the metadata can be indexed in the database. The request can fail for different reasons, such as temporary network unavailability or a considerable API overload that blocks incoming requests.
By default, the harvester tries to send the metadata up to 120 times. The suggested range is 120 to 240.
- StatusRequestRetries
- After the ingest request is sent, the harvester follows up with additional requests to get the status of the ingest operation and to confirm whether it finished successfully. The operation might take a long time, so retrying is required. The suggested range is 240 to 360.
- TagsWorkdir
- The name of the directory inside the harvester directory where temporary files are stored during tag harvesting. By default, the name of this directory is tmp, but you can change it if a more descriptive directory name is necessary.
- MOBatchSize and TagsBatchSize
- The number of rows to read from the SQLite file and load into memory at a time; a higher value is associated with higher memory usage. A value higher than 100,000 is not recommended because it does not provide a performance benefit. The suggested values are 10000, 50000, and 100000.
- MOBatchesPerSegment and TagsBatchesPerSegment
- The number of batches to process before sending them in an ingestion request; a higher value is associated with bigger files. Using a value higher than 100 does not translate to a direct performance benefit. The suggested values are 10, 50, and 100.
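The following is a sample of how these variables might look in config.ini, using the default and suggested values that are described above. The config.ini that ships in the harvester directory is the authoritative file, so treat this only as an illustration of the tuning values rather than a complete replacement; any section headers or additional keys in the shipped file must be preserved.
# Illustrative tuning values only; keep the layout of the shipped config.ini
BlockSizeBytes = 4096
AvailableMOVW = 10
# MaxProcessingWorkers is ignored; one worker process is always used
MaxProcessingWorkers = 1
IngestRequestRetries = 120
StatusRequestRetries = 240
TagsWorkdir = tmp
MOBatchSize = 50000
TagsBatchSize = 50000
MOBatchesPerSegment = 50
TagsBatchesPerSegment = 50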
Data ingestion system requirements
- 4 CPU cores and 16GB of RAM.
- At least 1.5x the size of the SQLite file to import must be available as free space on the filesystem. For example, if the SQLite file is 1GB, then make sure at least 1.5GB of storage is available.
- Python 3.11 or higher.
- Installation
- Run the following commands to download and install IBMFDC.
Optionally, you can set up a virtual environment to benefit from an isolated installation.
Note: Before proceeding, ensure that you have an active and authenticated session in the OpenShift console. This is necessary for the oc command to retrieve route information correctly. For more information, see Red Hat Documentation.
DCS_NS="ibm-data-cataloging"
DCS_IMPORT_SVC_HOST=$(oc -n ${DCS_NS} get route isd-import-service -o jsonpath='{.spec.host}')
curl -O -J "https://${DCS_IMPORT_SVC_HOST}/v1/ibmfdc" --insecure
WHEEL_FILE=$(ls ibmfdc-*.whl)
pip3 install ${WHEEL_FILE}
rm -f ${WHEEL_FILE}
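After the installation completes, one way to confirm that the package is available is to query pip3 for it; the package name matches the ibmfdc wheel file that is downloaded above.
pip3 show ibmfdc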
- Preparing the CLI configuration file
- By default, the ibmfdc command looks for a file named config.ini in the current working directory. If you want to specify a different configuration file, use the oc command-line argument.
Configuration file format:
ConsoleRouteURL = <DCS_CONSOLE_URL>
ImportCSVRouteURL = <DCS_IMPORT_SVC_URL>
AdminUsername = <DCS_USERNAME>
AdminPassword = <DCS_PASSWORD>
For example:
ConsoleRouteURL = https://console-ibm-data-cataloging.apps.<domain>
ImportCSVRouteURL = https://isd-import-service-ibm-data-cataloging.apps.<domain>
AdminUsername = sdadmin
AdminPassword = Passw0rd
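If you still have an authenticated oc session, the configuration values can also be filled in from the same routes and keystone secret that are used in the installation step, instead of typing them manually. This is a sketch only; it assumes the routes and secret exist in the ibm-data-cataloging namespace as shown earlier and writes config.ini to the current working directory, where ibmfdc looks for it by default.
DCS_NS="ibm-data-cataloging"
CONSOLE_HOST=$(oc -n ${DCS_NS} get route console -o jsonpath='{.spec.host}')
IMPORT_HOST=$(oc -n ${DCS_NS} get route isd-import-service -o jsonpath='{.spec.host}')
DCS_USERNAME=$(oc -n ${DCS_NS} get secret keystone -o jsonpath='{.data.user}' | base64 -d)
DCS_PASSWORD=$(oc -n ${DCS_NS} get secret keystone -o jsonpath='{.data.password}' | base64 -d)

# Write config.ini in the current working directory
cat > config.ini <<EOF
ConsoleRouteURL = https://${CONSOLE_HOST}
ImportCSVRouteURL = https://${IMPORT_HOST}
AdminUsername = ${DCS_USERNAME}
AdminPassword = ${DCS_PASSWORD}
EOF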
- Ingesting remote metadata using the IBMFDC Harvester
- The IBMFDC Harvester supports ingesting both base and additional metadata simultaneously. However, all necessary tags must be created beforehand; otherwise, the meta table in the SQLite file is ignored because none of its columns match any existing tag names.
Importing base and additional metadata:
ibmfdc harvester -m meta <path_to_sqlite_file>
Importing only base metadata (ignores tags):
ibmfdc harvester <path_to_sqlite_file>
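For example, with config.ini in the current working directory and the illustrative SQLite file path used in the earlier steps, the two invocations look as follows.
# Base metadata plus tag values, after the matching tags were created
ibmfdc harvester -m meta /tmp/example.sqlite

# Base metadata only; the META table in the file is ignored
ibmfdc harvester /tmp/example.sqlite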