Supported data sources for curation and data quality
You can connect to many data sources. Depending on the connector, you can import metadata, run metadata enrichment or data quality rules, create dynamic views, and write the output of data quality rules.
Unless otherwise noted, this information applies to all editions of IBM Knowledge Catalog (Base, Premium, and Standard).
- File-storage connectors
- Database connectors
- Connectors and data sources specific to metadata import
- Other data sources
A dash (—) in any of the columns indicates that the data source is not supported for this purpose.
By default, data quality rules and the underlying DataStage flows support standard platform connections. Not all connectors that were supported in traditional DataStage and potentially used in custom DataStage flows are supported in IBM Knowledge Catalog.
Requirements and restrictions
Understand the requirements and restrictions for connections to be used in data curation and data quality assessment.
Required permissions
Users must be authorized to access the connections to the data sources. For metadata import, the user running the import must have the SELECT or a similar permission on the databases in question.
General prerequisites
Connection assets must exist in the project for connections that are used in these cases:
- For running metadata enrichment, including advanced analysis (in-depth primary key analysis, in-depth relationship analysis, or advanced data profiling)
- For running data quality rules
- For creating query-based data assets (dynamic views)
- For writing output of data quality checks or frequency distribution tables
Supported source data formats
In general, metadata import, metadata enrichment, and data quality rules support the following data formats:
- All: Tables from relational and nonrelational data sources. Delta Lake and Iceberg table formats are supported for certain file-storage connectors. For analyses to work as expected, import specific files instead of top-level directories:
  - For Delta Lake tables, import _delta_log files.
  - For Iceberg tables, import metadata/version-hint.text files.
- Metadata import: Any format from file-based connections to the data sources, and tool-specific formats from connections to external tools. For Microsoft Excel workbooks, each sheet is imported as a separate data asset. The data asset name equals the name of the Excel sheet.
- Metadata enrichment: Tabular formats: CSV, TSV, Avro, Parquet, Microsoft Excel. For workbooks uploaded from the local file system, only the first sheet in a workbook is profiled.
- Data quality rules: Tabular formats: Avro, CSV, Parquet, ORC. For data assets uploaded from the local file system, CSV only.
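To illustrate the Delta Lake and Iceberg guidance above, the following sketch filters an object-storage file listing down to the entries you would import. The paths and function name are hypothetical; this is not part of IBM Knowledge Catalog, only a way to visualize which files the import should target:

```python
# Illustrative sketch (hypothetical paths): given a file listing from an
# object-storage bucket, keep only the entries to import so that Delta Lake
# and Iceberg tables are analyzed as expected.

def files_to_import(paths):
    """Return the paths that identify Delta Lake or Iceberg tables."""
    selected = []
    for path in paths:
        # Delta Lake: import files under the table's _delta_log directory.
        if "/_delta_log/" in path:
            selected.append(path)
        # Iceberg: import the metadata/version-hint.text file.
        elif path.endswith("metadata/version-hint.text"):
            selected.append(path)
    return selected

# Hypothetical bucket listing
listing = [
    "sales/_delta_log/00000000000000000000.json",  # Delta Lake -> import
    "sales/part-0000.parquet",                     # data file  -> skip
    "orders/metadata/version-hint.text",           # Iceberg    -> import
    "orders/metadata/v1.metadata.json",            # skip
]
print(files_to_import(listing))
```

Importing the top-level table directories instead of these specific files can cause the tables to be treated as collections of plain data files.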
Data quality features are available only in IBM Knowledge Catalog and IBM Knowledge Catalog Premium.
Lineage import
To import lineage information for your data, you must have one of these services installed:
- IBM MANTA Automated Data Lineage. The column title for lineage imports with this service has the subtitle external lineage.
- IBM Manta Data Lineage. The column title for lineage imports with this service has the subtitle unified lineage.
For more information about differences between these metadata import versions, see Importing metadata.
Database support for analysis output tables
Output tables that are generated during analysis can be written to a subset of the supported databases. If a specific database connector also supports output tables, the Target for output tables column in the following tables shows a checkmark.
File-storage connectors
| Connector | Metadata import (data assets) | Metadata import (external lineage) | Metadata import (unified lineage) | Metadata enrichment | Definition-based rules |
|---|---|---|---|---|---|
| Amazon S3 (Delta Lake tables, Iceberg tables) | ✓ | — | — | ✓ | ✓ |
| Apache HDFS | ✓ | — | — | ✓ | ✓ |
| Box | ✓ | — | — | ✓ 1 | — |
| Generic S3 (Delta Lake tables, Iceberg tables) | ✓ | — | — | ✓ | ✓ |
| Google Cloud Storage (Delta Lake tables, Iceberg tables) | ✓ | — | — | ✓ | ✓ |
| IBM Cloud Object Storage | ✓ | — | — | ✓ 1 | — |
| IBM Cognos Analytics 2 | ✓ | ✓ 3 | — | ✓ 1 | — |
| IBM Match 360 | ✓ | — | — | ✓ | ✓ |
| Microsoft Azure Data Lake Storage (Delta Lake tables, Iceberg tables) | ✓ | — | — | ✓ | ✓ |
Notes:
1 Advanced analysis is not supported for this data source.
2 Cognos Analytics connections that use secrets from a vault as credentials cannot be used for metadata import.
3 Specific JDBC drivers are required. See Uploading JDBC drivers for lineage import in the IBM Software Hub documentation.
Database connectors
| Connector | Metadata import (data assets) | Metadata import (external lineage) | Metadata import (unified lineage) | Metadata enrichment | Definition-based rules | SQL-based rules | SQL-based data assets | Target for output tables |
|---|---|---|---|---|---|---|---|---|
| Amazon RDS for MySQL | ✓ | — | — | ✓ | — | — | — | — |
| Amazon RDS for Oracle | ✓ | — | ✓ | — | ✓ | ✓ | — | — |
| Amazon RDS for PostgreSQL | ✓ | — | ✓ | ✓ | — | — | — | — |
| Amazon Redshift | ✓ | ✓ 9 | — | ✓ 1 | ✓ | ✓ | ✓ | — |
| Apache Cassandra | ✓ | — | — | ✓ | ✓ | ✓ | ✓ | — |
| Apache Hive | ✓ | ✓ 9 11 | — | ✓ | ✓ | ✓ | ✓ | ✓ 5 |
| Apache Impala with Apache Kudu | ✓ | — | — | ✓ | ✓ | ✓ | ✓ | — |
| Denodo | ✓ | — | — | ✓ | ✓ | ✓ | ✓ | — |
| Dremio | ✓ | — | — | ✓ | ✓ | ✓ | ✓ | — |
| Google BigQuery | ✓ | ✓ 10 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ 7 |
| Greenplum | ✓ | ✓ | — | ✓ | ✓ | ✓ | ✓ | — |
| IBM Cloud Databases for MongoDB | ✓ | — | — | ✓ 1 | — | — | — | — |
| IBM Cloud Databases for MySQL | ✓ | — | — | ✓ 1 | — | — | — | — |
| IBM Cloud Databases for PostgreSQL | ✓ | ✓ | ✓ | ✓ 1 | — | — | — | — |
| IBM Data Virtualization | ✓ | — | — | ✓ | ✓ | ✓ | ✓ | — |
| IBM Data Virtualization Manager for z/OS 2 | ✓ | — | — | ✓ 1 | ✓ | ✓ | — | — |
| IBM Db2 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| IBM Db2 Big SQL | ✓ | — | — | ✓ 1 | — | — | — | — |
| IBM Db2 for z/OS | ✓ | ✓ | ✓ | ✓ 1 | — | — | — | — |
| IBM Db2 on Cloud | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | — | ✓ |
| IBM Db2 Warehouse | ✓ | — | — | ✓ 1 | ✓ | ✓ | — | — |
| IBM Informix | ✓ | — | — | ✓ | — | — | — | — |
| IBM Netezza Performance Server | ✓ | ✓ | — | ✓ | ✓ | ✓ | — | — |
| IBM watsonx.data | ✓ | — | — | ✓ | ✓ | ✓ | ✓ | — |
| MariaDB | ✓ | — | — | ✓ | — | — | — | — |
| Microsoft Azure Databricks 8 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Microsoft Azure SQL Database | ✓ | — | ✓ | ✓ 12 | ✓ | ✓ | ✓ | ✓ |
| Microsoft SQL Server | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| MongoDB | ✓ | — | — | ✓ | ✓ | ✓ | — | — |
| MySQL | ✓ | — | — | ✓ | ✓ | ✓ | ✓ | — |
| Oracle 3 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| PostgreSQL | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Presto | ✓ | — | — | ✓ | ✓ | ✓ | — | — |
| Salesforce.com | ✓ | — | — | ✓ 4 | — | — | — | — |
| SAP ASE | ✓ | — | — | ✓ 1 | ✓ | ✓ | ✓ | — |
| SAP HANA | ✓ | — | — | ✓ 1 | ✓ | ✓ | — | — |
| SAP IQ | ✓ | — | — | ✓ 1 | — | — | — | — |
| SAP OData (authentication method: username and password) | ✓ | — | — | ✓ 6 | ✓ | — | — | — |
| SingleStoreDB | ✓ | — | — | ✓ | ✓ | ✓ | ✓ | ✓ |
| Snowflake | ✓ | ✓ | ✓ | ✓ 12 | ✓ | ✓ | ✓ | — |
| Teradata | ✓ | ✓ 9 | — | ✓ | ✓ | ✓ | ✓ | ✓ |
Notes:
1 Advanced analysis is not supported for this data source.
2 With Data Virtualization Manager for z/OS, you add data assets and COBOL copybook assets from mainframe systems to catalogs in IBM Cloud Pak for Data. Copybooks are files that describe the data structure of a COBOL program. Data Virtualization Manager for z/OS helps you create virtual tables and views from COBOL copybook maps. You can then use these virtual tables and views to import mainframe data into IBM Cloud Pak for Data and catalog it as data assets and COBOL copybook assets.
The following types of COBOL copybook maps are not imported: ACI, Catalog, Natural
When the import is finished, you can go to the catalog to review the imported assets, including the COBOL copybook maps, virtual tables, and views. You can use these assets in the same ways as other assets in Cloud Pak for Data.
For more information, see Adding COBOL copybook assets.
3 Table and column descriptions are imported only if the connection is configured with one of the following Metadata discovery options:
- No synonyms
- Remarks and synonyms
4 Some objects in the SFORCE schema are not supported. See Salesforce.com.
5 To create metadata-enrichment output tables in Apache Hive at an earlier version than 3.0.0, you must apply the workaround described in Writing metadata enrichment output to an earlier version of Apache Hive than 3.0.0 in the IBM Software Hub documentation.
6 Whether a data asset is a table or a view cannot be determined, so this information is not shown in the enrichment results.
7 Output tables for advanced profiling: If you rerun advanced profiling at too short intervals, results might accumulate because the data might not be updated fast enough in Google BigQuery. Wait at least 90 minutes before rerunning advanced profiling with the same output target. For more information, see Stream data availability. Alternatively, you can define a different output table.
8 Hive metastore and Unity catalog
9 Specific JDBC drivers are required. See Uploading JDBC drivers for lineage import in the IBM Software Hub documentation.
10 Connections must be configured with the authentication method Account key (full JSON snippet).
11 Hive connections with Kerberos authentication require some prerequisite configurations. See Configuring Hive with Kerberos for lineage imports in the IBM Software Hub documentation.
12 Advanced analysis is supported for this data source starting with IBM Knowledge Catalog 5.1.1.
Connectors and data sources specific to metadata import
If you have one of the services that provide lineage capabilities installed, you can import metadata from additional data sources.
IBM MANTA Automated Data Lineage
You can import metadata from data sources where the asset types such as ETL jobs or data models don't allow for enrichment or running data quality rules. A MANTA Automated Data Lineage for IBM Cloud Pak for Data license key is not required for metadata imports of the type Discover. For lineage capture, the knowledge graph feature and use of the FoundationDB graph database must be enabled in IBM Knowledge Catalog, and a MANTA Automated Data Lineage for IBM Cloud Pak for Data license key must be installed.
The following connectors support metadata import of the types Discover and Get lineage:
- Microsoft Power BI Desktop
- Microsoft Power BI (Azure)
- Microsoft SQL Server Integration Services (SSIS)
- Microsoft SQL Server Reporting Services (SSRS)
- MicroStrategy
- Oracle Business Intelligence Enterprise Edition
- Oracle Data Integrator
- Qlik Sense
- SAP BusinessObjects
- Tableau
In addition, you can import these types of data to catalogs:
- Business intelligence assets from the following tools:
  - Microsoft SQL Server Analysis Services (discovery and lineage)
  - Statistical Analysis System (SAS) (discovery and lineage)

  To import metadata from these sources, you must provide an input file. For more information, see Preparing manual input for importing business intelligence reports.

- Data integration assets from the following tools:
  - DataStage on Cloud Pak for Data (discovery and lineage)
  - Informatica PowerCenter (discovery and lineage)
  - InfoSphere DataStage (discovery and lineage)
  - Talend (discovery and lineage)

  To import metadata from these sources, you must provide an input file. For more information, see Preparing ETL job files for metadata import.

- Data models that were created in the following tools:
  - ER/Studio (discovery)
  - erwin Data Modeler (discovery)
  - SAP PowerDesigner (discovery)

  To import metadata from these sources, you must provide an input file. For more information, see Preparing data model files for metadata import.
A MANTA Automated Data Lineage for IBM Cloud Pak for Data license key is required for importing lineage of business intelligence assets, data integration assets, or data models.
IBM Manta Data Lineage
The knowledge graph feature must be enabled in IBM Knowledge Catalog and the default Neo4j graph database must be used.
The following connectors support metadata import with the goal Import lineage metadata. You cannot use them to import asset metadata.
- IBM DataStage for Cloud Pak for Data
- Microsoft Power BI (Azure)
- Microsoft SQL Server Integration Services (SSIS)
- MicroStrategy
- OpenLineage
- Tableau
You can also import lineage metadata from InfoSphere DataStage. You must provide an input file. See Configuring metadata import for data integration assets.
Other data sources
An administrator can upload JDBC drivers to enable connections to more data sources. However, the administrator must test these custom JDBC drivers to ensure they are compatible with the tools that are used to connect to a data source. Metadata import, metadata enrichment, or data quality rules are not guaranteed to work with all JDBC implementations. See Importing JDBC drivers in the IBM Software Hub documentation and Generic JDBC.
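When testing a custom JDBC driver, it can help to run by hand the kinds of statements that metadata import and profiling typically issue: reading column metadata, counting rows, and running plain SELECTs. The following sketch uses the generic Python DB-API to show that shape; the function and table names are hypothetical, and SQLite stands in for the JDBC source only so the example is self-contained. This is not IBM's validation procedure, just one way to smoke-test basic driver compatibility:

```python
import sqlite3

def smoke_check(conn, table):
    """Issue the kinds of statements curation tools rely on:
    column metadata and a basic row count."""
    cur = conn.cursor()
    cur.execute(f"SELECT * FROM {table} LIMIT 0")   # fetch column metadata only
    columns = [desc[0] for desc in cur.description]
    cur.execute(f"SELECT COUNT(*) FROM {table}")    # basic aggregate
    row_count = cur.fetchone()[0]
    return columns, row_count

# Demonstration against an in-memory SQLite database with hypothetical data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace')")
print(smoke_check(conn, "customers"))  # (['id', 'name'], 2)
```

A driver that fails on metadata queries or simple aggregates like these is unlikely to work with metadata import, metadata enrichment, or data quality rules.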
Connections that are established by using third-party JDBC drivers were tested for the following data sources:
- Amazon Redshift (Metadata enrichment)
- Apache Cassandra
- Apache Hive
- Apache Kudu (Data quality rules)
- Databricks (Data quality rules)
- Snowflake (Metadata enrichment)
- Teradata (Metadata enrichment)
Learn more
- Importing metadata
- Enriching your data assets
- Creating rules from data quality definitions
- Creating SQL-based rules
Parent topic: Curation