Supported data sources in Data Virtualization

Data Virtualization supports the following relational and nonrelational data sources on IBM Cloud Pak for Data.

See also Supported data sources in Cloud Pak for Data.

Size limits
Data Virtualization supports virtualization of tables with a row size of up to 1 MB and up to 2048 columns. However, the number of columns that Data Virtualization can preview depends on many factors, such as the data types of the columns. Currently, preview is limited to 200 columns.
Comment attributes
When virtual tables are created, Data Virtualization does not include comment attributes that were assigned to data source objects. This limitation applies to all data sources.
Data types
Some data types in your data source might not be supported in Data Virtualization. These limitations are documented in the following tables. Data Virtualization might also map some data types in your data source to alternative data types. These mappings are based on underlying Db2® Big SQL mappings. For more information, see Data types in Db2 Big SQL.

IBM data sources

The following table lists the IBM® data sources that you can connect to from Data Virtualization.

Table 1. Supported IBM data sources
Connector Limitations More information
IBM Cloud Databases for MongoDB IBM Cloud Databases for MongoDB is available as beta.

The following MongoDB data types are supported in Data Virtualization: INT32, INT64, DOUBLE, STRING, BOOLEAN, DATE, and BINARY.

 
IBM Cloud Databases for MySQL    
IBM Cloud Databases for PostgreSQL  

This connection is optimized to take advantage of the query capabilities in this data source.

IBM Cloud Object Storage

This connection requires special consideration in Data Virtualization. See Connecting to IBM Cloud Object Storage in Data Virtualization.

For limitations, see Data sources in object storage in Data Virtualization.

 
IBM Data Virtualization Manager for z/OS  

This connection is optimized to take advantage of the query capabilities in this data source.

IBM Db2
  • The NCHAR and NVARCHAR types are not supported in Data Virtualization.
  • You can edit a Db2 connection on the Data sources page in Data Virtualization only if the connection was created on that page. You cannot edit a Db2 connection on the Connectivity page if you created that connection on the Data sources page in Data Virtualization.

This connection is optimized to take advantage of the query capabilities in this data source.

IBM Db2 Big SQL  

This connection is optimized to take advantage of the query capabilities in this data source.

IBM Db2 for i  

This connection is optimized to take advantage of the query capabilities in this data source.

IBM Db2 for z/OS  
IBM Db2 on Cloud The NCHAR and NVARCHAR types are not supported in Data Virtualization.  
IBM Db2 Warehouse    
IBM Informix® INTERVAL, BIGINT, and BIGSERIAL data types are not supported in Data Virtualization. For more information, see Exceptions occur when using the Connect for JDBC Informix driver.
IBM Netezza Performance Server
  • BLOB, XML, and CLOB Netezza data types are not supported in Data Virtualization.
  • When you use a SELECT statement or preview LOB data that is larger than 64 KB, Data Virtualization truncates the data to 64 KB.

This connection is optimized to take advantage of the query capabilities in this data source.

IBM Planning Analytics
  • Data Virtualization can virtualize only TM1® views with one column dimension.
  • Aggregation cannot be pushed down to TM1. If a query against TM1 contains an aggregation operation such as COUNT, SUM, AVG, MAX, MIN, or GROUP BY, you can add the following server option.
    alter server qplex options (add DB2_GROUP_BY 'N') 
  • Data Virtualization supports only VARCHAR and DOUBLE data types in TM1.
  • TM1 doesn't recognize SQL. The native query language of TM1 is MDX, which doesn't have the concept of pushdown. Therefore, this data source is not configured to support optimal predicate pushdown.
 
IBM Data Virtualization
Important: Do not create a connection to your Data Virtualization instance.

See also Manually creating a Data Virtualization connection in a catalog or project.

 

Third-party data sources

The following table lists the third-party data sources that you can connect to from Data Virtualization.

Table 2. Supported third-party data sources
Connector Limitations More information
Amazon RDS for MySQL
  • You must create only one connection to a database on a MySQL data source to avoid duplicate rows on the Virtualize page. MySQL does not support database isolation.
  • The BIT types are not supported in Data Virtualization.

This connection is optimized to take advantage of the query capabilities in this data source.

Amazon RDS for Oracle
  • When you use a SELECT statement or preview LOB data that is larger than 64 KB, Data Virtualization truncates the data to 64 KB.
  • If you create a join view of two tables from the same data source that have two columns with the VARCHAR data type, the join view is not pushed down to the remote data source. To solve this issue, see Cannot push down joined views.
  • The following data types are converted in Data Virtualization:
    • INTERVAL and JSON convert to VARCHAR.
    • National character types convert to CHAR, VARCHAR, or CLOB.
    • TIMESTAMP and TIMESTAMP WITH LOCAL TIME ZONE convert to TIMESTAMP.
    • XML converts to CLOB.

This connection is optimized to take advantage of the query capabilities in this data source.

Amazon RDS for PostgreSQL  

This connection is optimized to take advantage of the query capabilities in this data source.

Amazon Redshift SPATIAL, SKETCH, and SUPER data types are converted to CLOB in Data Virtualization.

This connection is optimized to take advantage of the query capabilities in this data source.

Amazon S3

This connection requires special consideration in Data Virtualization. See Connecting to Amazon S3 in Data Virtualization.

  • Specific rules apply for naming buckets in Amazon S3. For more information, see Bucket naming rules in the Amazon S3 documentation.
  • If you include dots in a bucket's name, you can't use virtual-host-style addressing over HTTPS, unless you perform your own certificate validation. The security certificates that are used for virtual hosting of buckets don't work for buckets with dots in their names.
  • With AWS PrivateLink for Amazon S3, you can provision interface VPC endpoints (interface endpoints) in your virtual private cloud (VPC). You cannot use these endpoints in the Endpoint URL when you create a connection to an Amazon S3 data source. This type of endpoint is not supported. You must use the standard endpoint for Amazon S3, for example, http://s3.eu-central-1.amazonaws.com/.
  • For more limitations, see Data sources in object storage in Data Virtualization.
 
Apache Derby  

This connection is optimized to take advantage of the query capabilities in this data source.

Apache Hive
  • In virtualized tables, you can list only the values of complex data types, such as array, map, struct, and union types. Any other operation on the values of these complex data types is not supported.
  • When you use a SELECT statement or preview LOB data that is larger than 64 KB, Data Virtualization truncates the data to 64 KB.
  • String and Binary types in Hive data sources resolve to VARCHAR(32672) and VARBINARY(32672) by default, instead of CLOB and BLOB. This update optimizes SQL performance for Hive data sources. A Data Virtualization Admin can configure HIVE_MAX_STRING_SIZE and HIVE_MAX_BINARY_SIZE by calling the DVSYS.SETCONFIGPROPERTY procedure.

    For example, to change the default maximum size for the Hive BINARY data type to 2000, which resolves to VARBINARY(2000), run the following command.

    db2 "call dvsys.setconfigproperty ('HIVE_MAX_BINARY_SIZE', '2000', '', ?, ?)"

    The minimum size is 1 and the maximum recommended size is 32672. You can exceed 32672; however, STRING and BINARY types will resolve to CLOB and BLOB types, which can cause certain queries to fail.
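    Similarly, a sketch that follows the same pattern for the Hive STRING type (the value 4000 is an arbitrary example), so that STRING columns resolve to VARCHAR(4000):

    db2 "call dvsys.setconfigproperty ('HIVE_MAX_STRING_SIZE', '4000', '', ?, ?)"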

This connection is optimized to take advantage of the query capabilities in this data source.

Ceph

This connection requires special consideration in Data Virtualization. See Connecting to Ceph in Data Virtualization.

For limitations, see Data sources in object storage in Data Virtualization.

 
Apache Impala  

This connection is optimized to take advantage of the query capabilities in this data source.

Databricks

This connection requires special consideration in Data Virtualization. See Connecting to data sources with a generic JDBC driver in Data Virtualization.

Before you upgrade from a previous version of Data Virtualization on Cloud Pak for Data to 5.0, you must replace any previous Databricks connections that use the connection ID format OTHER1000 with the new connection ID format, DATAB1000.

If you have any Databricks connections that were established through a generic JDBC connection in previous Data Virtualization versions, you must run the following queries to modify the virtualized table list before you upgrade to Cloud Pak for Data 5.0:
  1. In the Data virtualization > Run SQL editor, run the DVSYS.listrdbc query to identify the internal Connection ID (CID) and NODE_NAME for your Databricks data source:
    select * from DVSYS.listrdbc;
  2. Run the following queries for all your Databricks connections, replacing <(CID)>, <catalog_name>, and <NODE_NAME> with your own values:
    call dvsys.SETCONFIGPROPERTY('RELOADTABLES_ALLTABS_<(CID)>', '<catalog_name>, null, null, null','<NODE_NAME>', ?, ?)
    call dvsys.SETCONFIGPROPERTY('RELOADTABLES_ALLCOLS_<(CID)>','<catalog_name>, null, null, null','<NODE_NAME>', ?, ?)
    

    The <catalog_name> is the corresponding catalog where the connection is established.

    See the following statements as example queries, where:
    • <(CID)> has the value OTHER10000,
    • <catalog_name> has the value hive_metastore,
    • and <NODE_NAME> has the value qpendpoint_1:6415.
    call dvsys.SETCONFIGPROPERTY('RELOADTABLES_ALLTABS_OTHER10000', 'hive_metastore,null,null,null','qpendpoint_1:6415', ?, ?)
    call dvsys.SETCONFIGPROPERTY('RELOADTABLES_ALLCOLS_OTHER10000', 'hive_metastore,null,null,null','qpendpoint_1:6415', ?, ?)
    Note: If the pre-upgrade steps to modify the virtualized table list do not work for you, or you did not perform them, follow the steps in this troubleshooting topic to resolve the issue: Virtualized table list for Databricks needs manual updates after upgrade to 5.0.
A default catalog is selected for you when you create connections to Databricks. Also, you can access objects that you virtualized in a different Databricks catalog without switching back to that catalog.
Denodo
  • You can edit a Denodo connection on the Data sources page in Data Virtualization only if the connection was created on that page. You cannot edit a Denodo connection on the Connectivity page if you created that connection on the Data sources page in Data Virtualization.
 
File system
Important: This type of connection is not recommended for use in IBM Cloud Pak for Data. Connect to files in your file system by using a remote connector. For more information, see Accessing data sources by using remote connectors in Data Virtualization.
 
Generic S3

For limitations, see Data sources in object storage in Data Virtualization.

 
Google BigQuery

This connection requires special consideration in Data Virtualization. See Connecting to Google BigQuery in Data Virtualization.

  • In the Google BigQuery data source, Data Virtualization does not support the use of the SELECT statement for columns with data type of RECORD.

This connection is optimized to take advantage of the query capabilities in this data source.

Greenplum  

This connection is optimized to take advantage of the query capabilities in this data source.

MariaDB
  • The LONGTEXT type is converted to CLOB(65535), and the LONGBLOB type is converted to BLOB(65535). VARCHAR with a size that exceeds 32672 is also converted to CLOB(65535). The BIT type is not supported in Data Virtualization.
  • You must create only one connection to a database on a MariaDB data source to avoid duplicate rows on the Virtualize page. MariaDB does not support database isolation.
  • You must have admin permission for the MariaDB data source for advanced operations such as collecting statistics.
  • For MariaDB versions earlier than version 10, collecting catalog statistics is not supported.

This connection is optimized to take advantage of the query capabilities in this data source.

Microsoft Azure Data Lake Storage Gen2    
Microsoft Azure SQL Database    
Microsoft SQL Server
  • When you use a SELECT statement or preview LOB data that is larger than 64 KB, Data Virtualization truncates the data to 64 KB.

This connection is optimized to take advantage of the query capabilities in this data source.

MinIO

This connection requires special consideration in Data Virtualization. See Connecting to MinIO in Data Virtualization.

For limitations, see Data sources in object storage in Data Virtualization.

 
MongoDB
  • The BIT, LONGTEXT, and LONGBLOB types are not supported in Data Virtualization.
  • You must have admin permission for the MongoDB data source for advanced operations such as collecting statistics.
 
MySQL

(MySQL Community Edition)

(MySQL Enterprise Edition)

  • You must create only one connection to a database on a MySQL data source to avoid duplicate rows on the Virtualize page. MySQL does not support database isolation.
  • The BIT types are not supported in Data Virtualization.
  • The BIGINT function is not supported with set operations such as INTERSECT, INTERSECT ALL, EXCEPT, and EXCEPT ALL in Data Virtualization.

This connection is optimized to take advantage of the query capabilities in this data source.

Oracle
  • When you use a SELECT statement or preview LOB data that is larger than 64 KB, Data Virtualization truncates the data to 64 KB.
  • If you create a join view of two tables from the same data source that have two columns with the VARCHAR data type, the join view is not pushed down to the remote data source. To solve this issue, see Cannot push down joined views.
  • The following data types are converted in Data Virtualization:
    • INTERVAL and JSON convert to VARCHAR.
    • National character types convert to CHAR, VARCHAR, or CLOB.
    • TIMESTAMP and TIMESTAMP WITH LOCAL TIME ZONE convert to TIMESTAMP.
    • XML converts to CLOB.
    • SMALLINT, INTEGER, and BIGINT convert to DECIMAL(4,0), DECIMAL(9,0), and DECIMAL(18,0), respectively.

This connection is optimized to take advantage of the query capabilities in this data source.

PostgreSQL  

This connection is optimized to take advantage of the query capabilities in this data source.

Presto

If you require duplicate connections, the best practice is to create them in Platform Connections first and then import them into Data Virtualization. If you follow this method, you can restore any duplicate connections that might be removed from Data Virtualization in the future. Old virtual tables that are linked to those connections are also restored. However, consider limiting the use of duplicate connections because the duplicate connections require more management and can impact query performance.

If you create a duplicate connection directly in the Data Virtualization web client or by using the SETRDBCX API, you cannot restore the connection or the linked virtual tables if they are removed.

A default catalog is selected for you when you create connections to Presto. Also, you can access objects that you virtualized in a different Presto catalog without switching back to that catalog.

REST API This connection requires special consideration in Data Virtualization. See Connecting to REST API in Data Virtualization.

This connection is optimized to take advantage of the query capabilities in this data source.

Salesforce.com  

This connection is optimized to take advantage of the query capabilities in this data source.

SAP ASE    
SAP Business Warehouse (BW)

(Using the SAP OData connector)

If your SAP Business Warehouse (BW) data source is not configured to expose its CatalogService, you must specify an explicit service (or a list of services) by using the DataDirect ServiceList connection property. Include this ServiceList connection property in the SAP Gateway URL connection property.

You can check the CatalogService by using an HTTP request in the following format:

http://<server><:port>/sap/opu/odata/iwfnd/CATALOGSERVICE/
For example:
http://bw4hana.ibm.com:8000/sap/opu/odata/iwfnd/CATALOGSERVICE;v=0002/ServiceCollection
You cannot preview or query nonreadable tables for the following reasons:
  • The SAP OData data source might have write-only access. In such cases, changing the user permissions does not resolve this issue.
  • The SAP OData data source has read access but requires filters. This limitation means that you cannot preview the data, but you can read it if you specify filters.
 
SAP HANA and SAP HANA on SAP Cloud Platform

This connection requires special consideration in Data Virtualization. See Connecting to SAP HANA in Data Virtualization and Connecting to SAP HANA on SAP Cloud Platform in Data Virtualization.

  • You can edit an SAP HANA connection on the Data sources page in Data Virtualization only if the connection was created on that page. You cannot edit an SAP HANA connection on the Connectivity page if you created that connection on the Data sources page in Data Virtualization.
To configure pushdown of VARCHAR predicates to SAP HANA, complete the following steps.
  1. In the Data virtualization > Run SQL editor, run the following statement to identify the internal Connection ID (CID) for your SAP HANA data source:

    select * from DVSYS.listrdbc;

  2. Now, run the following two commands to enable pushdown for the CID. In the following example, the CID is SAPHA10005:

    alter server qplex options(add SAPHA10005@collating_sequence 'Y');
    alter server qplex options(add SAPHA10005@varchar_no_trailing_blanks 'Y');

Note: You must ensure that you have no trailing blanks in the VARCHAR data and that the collating sequence in Data Virtualization is the same as on the remote SAP HANA data source.
Note: If you preview SAP HANA data that contains the TINYINT data type, you might see inaccurate values for some TINYINT rows. However, you can still virtualize the data source; only preview is affected.

This connection is optimized to take advantage of the query capabilities in this data source.

SAP OData
You cannot preview or query nonreadable tables for the following reasons:
  • The SAP OData data source might have write-only access. In such cases, changing the user permissions does not resolve this issue.
  • The SAP OData data source has read access but requires filters. This limitation means that you cannot preview the data, but you can read it if you specify filters.
 
SAP S/4HANA

(Using the SAP OData connector)

If your SAP S/4HANA data source is not configured to expose its CatalogService, you must specify an explicit service (or a list of services) by using the DataDirect ServiceList connection property. Include this ServiceList connection property in the SAP Gateway URL connection property.

You can check the CatalogService by using an HTTP request in the following format: http://<server><:port>/sap/opu/odata/iwfnd/CATALOGSERVICE/.

For more information, see the SAP Help documentation.

You cannot preview or query nonreadable tables for the following reasons:
  • The SAP OData data source might have write-only access. In such cases, changing the user permissions does not resolve this issue.
  • The SAP OData data source has read access but requires filters. This limitation means that you cannot preview the data, but you can read it if you specify filters.
  • You can edit an SAP S/4HANA connection on the Data sources page in Data Virtualization only if the connection was created on that page. You cannot edit an SAP S/4HANA connection on the Connectivity page if you created that connection on the Data sources page in Data Virtualization.
 
Snowflake

This connection requires special consideration in Data Virtualization. See Connecting to Snowflake in Data Virtualization.

  • ARRAY, GEOGRAPHY, and VARIANT data types are converted to VARCHAR.

This connection is optimized to take advantage of the query capabilities in this data source.

Spark SQL This connection requires special consideration in Data Virtualization. See Connecting to Spark SQL.
  • You can edit a Spark SQL connection on the Data sources page in Data Virtualization only if the connection was created on that page. You cannot edit a Spark SQL connection on the Connectivity page if you created that connection on the Data sources page in Data Virtualization.

This connection is optimized to take advantage of the query capabilities in this data source.

Teradata

Teradata JDBC Driver 17.00 Copyright (C) 2024 by Teradata. All rights reserved. IBM provides embedded usage of the Teradata JDBC Driver under license from Teradata solely for use as part of the IBM Watson® service offering.

  • The XML Teradata data type is not supported in Data Virtualization. The XML data type is converted to CLOB.
  • DECIMAL data is supported to 34 digits with DECFLOAT. Data columns that exceed this limit must be edited to DOUBLE during virtualization.
  • The following data types are converted in Data Virtualization:
    • CHAR and VARCHAR data is converted in queries that include a string comparison operation, such as a GROUP BY or WHERE predicate, to handle case sensitivity.
    • VARBYTE is converted to VARCHAR FOR BIT DATA.
    • BYTE is converted to BINARY.
    • BYTEINT is converted to SMALLINT.
    • TIME ZONE BEARING is converted to TIME and TIMESTAMP.
    • PERIOD, DISTINCT, DATASET, and ARRAY are converted to BLOB.
    • NUMBER is converted to DOUBLE.
    • YEAR, YEAR TO MONTH, DAY, DAY TO MINUTE, HOUR TO MINUTE, MONTH, HOUR, and MINUTE are converted to INTEGER.
    • DAY TO SECOND, HOUR TO SECOND, MINUTE TO SECOND, and SECOND are converted to DECIMAL.
  • When you use a SELECT statement or preview LOB data that is larger than 64 KB, Data Virtualization truncates the data to 64 KB.

This connection is optimized to take advantage of the query capabilities in this data source.

User-defined data sources

The following table lists the user-defined data sources that you can connect to from Data Virtualization.

Table 3. Supported user-defined data sources
Connector Limitations More information
Generic JDBC

This connection requires special consideration in Data Virtualization. See Connecting to data sources with a generic JDBC driver in Data Virtualization.

Note: Use this connector only when your data source is not listed as a supported data source.
 

Data sources in object storage in Data Virtualization

You can use data that is stored as files on IBM Cloud Object Storage, Amazon S3, Ceph, or MinIO data sources to create virtual tables. To access data that is stored in cloud object storage, you must create a connection to the data source where the files are located.

You can segment or combine data from one or more files to create a virtual table. Accessing files in object storage in Data Virtualization is built on Db2 Big SQL capabilities that use Hadoop external table support. For more information, see CREATE TABLE (HADOOP) statement.

Terminology

Buckets and partitions play important roles in data organization. The following components are key elements of object storage.
  • A bucket is a logical abstraction that is used to provide a container for data. There is no folder concept in object storage; only buckets and keys. Buckets can be created only in the interface of the object storage data source. They cannot be created in Data Virtualization. Bucket names must be unique and adhere to the rules of the object storage provider. These rules often include restricting the name to 3 - 63 characters with lowercase letters, numbers, and dashes only. Bucket names must begin and end with a lowercase letter or number. When Data Virtualization accesses data in object storage, the bucket name must be unique across all object storage connections.
  • A file path is the complete path to the file where you want to store data. The S3 file system implementation allows zero-length files to be treated like directories, and file names that contain a forward slash (/) are treated like nested directories. The file path includes the bucket name, an optional file path, and a file name. In object storage, the file path is used when a table is created. All files in the same path contribute to the table data. You can add more data by adding another file to the file path (see the example after this list).
  • A partition is data that is grouped by a common attribute in the schema. Partitioning divides the data into multiple file paths, which are treated like directories. Data Virtualization can discover and use partitions to reduce the amount of data that queries must process, which can improve performance of queries that use predicates on the partitioning columns.
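For example, with a hypothetical bucket named sales-data, both of the following files share the file path sales-data/orders/2024/:

    sales-data/orders/2024/part-001.csv
    sales-data/orders/2024/part-002.csv

Here, sales-data is the bucket name, orders/2024/ is the optional file path, and part-001.csv and part-002.csv are file names. Because the files share the same file path, they contribute rows to the same virtualized table.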

Best practices

File formats
Data Virtualization supports PARQUET (or PARQUETFILE), ORC (optimized row columnar), CSV (comma-separated values), TSV (tab-separated values), and JSON file formats. No other file formats are supported.
Note: Data Virtualization virtualizes data files based on their folder or hierarchy structure. You must ensure that all data files within the folder or hierarchy have the same data types and follow the same format.
  • For PARQUET (or PARQUETFILE), file extensions are not required. Metadata is extracted from the data file.
  • For ORC, file extensions are not required. Metadata is extracted from the data file.
  • For CSV and TSV files:
    • The appropriate .csv or .tsv file extension is required, as follows:
      • CSV: The .csv file extension is required and the contents of the file must follow the specifications for comma-separated values.
      • TSV: The .tsv file extension is required and the contents of the file must follow the specifications for tab-separated values.
    • An optional parameter can be used to specify a string delimiter character (quoteChar) that surrounds field values in CSV and TSV files.
      • The performance of querying the data might be negatively impacted if quoteChar is specified.
      • The default value is no delimiter (not specified).
      • The value for quoteChar must be a single character that cannot be a space (blank), backslash, tab, comma, or newline character (\n).
      • If the string value contains the string delimiter (quoteChar), the backslash (\) can be used to escape the string delimiter.
  • For JSON files, the .json file extension is required. JSON files must be coded so that each line is a valid JSON object. Lines must be separated by a newline character (\n). The JSON Lines text format, also called newline-delimited JSON, is the only supported JSON format. This format stores structured data that can be processed one record at a time.
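    For illustration, here are minimal samples of the text formats (hypothetical values, not from the product documentation). First, a CSV file that uses the double quotation mark as its quoteChar; the embedded string delimiter in the last field of the first record is escaped with a backslash:

    id,name,comment
    1,"Smith, Jane","She said \"hello\""
    2,"Brown, Bob",regular value

    And a JSON Lines file, where each line is one complete JSON object, separated by newline characters:

    {"id": 1, "name": "Jane"}
    {"id": 2, "name": "Bob"}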
Note: All other file formats return an error. For more information, see Error message when you try to use an unsupported file format in Cloud Object Storage.
Organizing data
  • Avoid using nonalphanumeric characters in column names because they can interfere with Hive compatibility. Any character that is not an alphanumeric character or the underscore character is encoded as _xNNNN, where NNNN is the hexadecimal value of the character (for example, a percent sign would be encoded as _x0025). If you want to view column names properly, enable the allownonalphanumeric option by completing these steps:
    1. Access the head pod in the Data Virtualization instance (c-db2u-dv-db2u-0).
    2. Run the following command to edit the configuration to include the allownonalphanumeric option:
      db2uctl adm bigsql config --key bigsql.catalog.identifier.mappingrules --value allownonalphanumeric,allowleadingdigits,allowreservedwords
    3. Run the following command to restart Big SQL:
      su - db2inst1 
      bigsql stop ; 
      bigsql start 
      
  • If your object storage data is accessed through a virtualized table, the files that you want to virtualize must be within a single file path and within a single bucket, and the bucket must include at least one file that you add to the cart. All the files in this file path are part of the virtualized table. When more data is added to the table (new files are created in the file path), the data is visible when you access the virtualized table. All files in the file path must use the same file format so that they are virtualized as one table.
  • If you want to virtualize the files in multiple file paths as one table, you can virtualize the bucket that contains all of the files. For example, if you have file paths A/B/C/T1a, A/B/C/T1b, A/B/D/T1c, and A/B/D/T1d, you can virtualize the file path A/B/. All the files in that path and nested paths will be part of the accessible object.
  • Do not create two objects (tables, schemas, or columns) with the same name, even if you use delimited identifiers and mixed case. For example, you cannot have a table t1 and another table that is named T1. These names are considered to be duplicate names in object storage (Hive). For more information, see Identifiers.
  • Db2 supports a wider range of valid delimited identifiers than Hive supports. Some identifier names that are specified when you create virtualized tables over object store might be adjusted before they can be accepted into the Hive catalog. The mapping is done automatically. For more information, see Identifiers.
  • When new data is added to the file path for a virtualized table, consider running the following command to ensure that the metadata cache is updated to see the new data.
    CALL SYSHADOOP.HCAT_CACHE_SYNC(<schema>, <object>)

    For more information, see the HCAT_CACHE_SYNC stored procedure.
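    For example, assuming a hypothetical virtualized table SALES in schema COS_DATA, the call would be:

    CALL SYSHADOOP.HCAT_CACHE_SYNC('COS_DATA', 'SALES')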

  • When new partitions are added to the file path for the virtualized table, click Refresh partitions in the overflow menu on the Virtualized data page to identify new partitions.

    You can also run the following command in the SQL interface to identify the new partitions that were added.

    MSCK REPAIR TABLE <table-name> 

    For more information, see MSCK REPAIR TABLE.
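    For example, with a hypothetical virtualized table named school_records:

    MSCK REPAIR TABLE school_records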

Optimizing query performance
  • Use a compact file format such as ORC or Parquet to minimize network traffic, which improves query performance.
  • Don't use the STRING or TEXT data type. Use the VARCHAR(n) data type, with n set to a value that is appropriate for the column data. Use the following command to alter the table to define an appropriate length for the column.
     ALTER TABLE <schema>.<table> ALTER COLUMN <col> SET DATA TYPE VARCHAR(<size>)
  • Partition your data by using Hive style partitioning. Partitioned data is grouped by a common attribute. Data Virtualization can use partitions to reduce the amount of data that queries must process. Querying the entire data set might not be possible or even necessary. You can use predicates in your queries that include the partitioning columns to improve performance.

    For example, a school_records table that is partitioned on a year column, segregates values by year into separate file paths. A WHERE condition such as YEAR=1993, YEAR IN (1996,1995), or YEAR BETWEEN 1992 AND 1996 scans only the data in the appropriate file path to resolve the query.

    Figure: Example of partitions in cloud object storage.
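    A sketch of what such a partition layout might look like in object storage (hypothetical bucket and file names); each year=<value> file path is treated as a separate partition:

    school-data/school_records/year=1992/part-0001.parquet
    school-data/school_records/year=1993/part-0001.parquet
    school-data/school_records/year=1996/part-0001.parquet

    A query with the predicate YEAR=1993 scans only the second file path.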
  • Define partitioned column types accurately. By default, partitioned character columns are assumed to be STRING type, which is not recommended. Redefine the partitioned columns to an appropriate data type.
  • Collect statistics on the data that is being queried. Data Virtualization uses the ANALYZE command to collect statistics on virtualized tables over object storage. You can collect statistics in the web client or by using SQL. For more information, see Collecting statistics in Data Virtualization.
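    For example, assuming a hypothetical virtualized table COS_DATA.SCHOOL_RECORDS, a call that collects statistics on all columns by using the Db2 Big SQL ANALYZE syntax might look like this:

    ANALYZE TABLE COS_DATA.SCHOOL_RECORDS COMPUTE STATISTICS FOR ALL COLUMNS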

Limitations

  • Only UTF-8 character encoding is supported in Data Virtualization for text files in CSV, TSV, or JSON format. Cloud Object Storage binary formats such as ORC or PARQUET are unaffected because they transparently encode character types.
  • Data Virtualization does not support the TIME data type in a virtualized table over object storage.
  • Preview of assets in cloud object storage shows only the first 200 columns of the table.
  • Before you remove a connection to Cloud Object Storage, you must remove all virtualized tables in the object storage connection. If a connection is removed and you try to remove a table in that connection, you see an error. See Credential error message when you remove a virtualized table in object storage.
  • If the bucket is not specified in the connection properties, the connection is global. In this case, include the bucket name in the file path. You can specify at most one global connection in a Data Virtualization instance.

See also Restrictions in CREATE TABLE (HADOOP) statement.