Creating a virtualized table from files by using a remote connector in Watson Query

Important: IBM Cloud Pak® for Data Version 4.7 will reach end of support (EOS) on 31 July, 2025. For more information, see the Discontinuance of service announcement for IBM Cloud Pak for Data Version 4.X.

Upgrade to IBM Software Hub Version 5.1 before IBM Cloud Pak for Data Version 4.7 reaches end of support. For more information, see Upgrading IBM Software Hub in the IBM Software Hub Version 5.1 documentation.

You can use data that is stored as CSV, TSV, and Excel files on remote data sources to create virtual tables. Segment or combine data from one or more files to create a virtual table.

Remember:

The data requests (Data > Data requests) feature was removed in Cloud Pak for Data Version 4.7.0. Consider workflows instead.

Before you begin

To access data that is stored in CSV, TSV, and Excel files, you must install a remote connector on the data source where the files are located. A remote connector enables Watson Query to automatically access files located in a remote data source. For more information, see Installing connectors on remote data sources.

Watson Query auto-detects a table in an Excel sheet by scanning each row until it reaches a non-empty cell. The final column is determined from the first non-empty cell in the first row. The final row is the first one having empty cells in all columns.

The Excel source wrapper in Watson Query was upgraded in Cloud Pak for Data 4.0.2 to allow access to spreadsheets of unlimited size. In previous releases, Watson Query might fail to access spreadsheets that contain numerous data, for example greater than 250,000 cells, or have a large file size, for example greater than 3 MB, which easily happens when a workbook contains macros. If you encounter any issues (for example, unexpected types or values), you can toggle between new and old source wrappers with the following SQL API calls.

To switch to the previous Excel source wrapper named GExcel, use the following command:

CALL DVSYS.SETCONFIGPROPERTY('EXCEL_SOURCE_WRAPPER_CLASS', 'com.ibm.db2j.GExcel', '', ?, ?)

To switch to the current Excel source wrapper named QExcel, use the following command:

CALL DVSYS.SETCONFIGPROPERTY('EXCEL_SOURCE_WRAPPER_CLASS', 'com.ibm.db2j.QExcel', '', ?, ?)

For more information, see Queries on virtualized flat files fail with incorrect results in Watson Query.

Procedure

On the navigation menu, click Data > Data virtualization.
The service menu opens to the Data sources page by default.
On the service menu, click Virtualization > Virtualize and then click the Files tab.
The list of available data sources appears. You can narrow down the displayed assets by using the available filters.
Select the remote data source where you want to browse for files.
The list of files and directories on your data source appears. You can enter a particular path on the host system in the File path field or click the files and directories on the list to go through the host system.
To edit the remote data source, click Edit connection in the action menu of the remote data source.
Select the file that you want to virtualize and click Add to cart.
- To use the first row of your file as column headings in your virtual table, enable the Use first row as column heading option. If this option is not enabled, you are prompted to specify column names for your virtual file before you add it to your cart.
- If your file has multiple worksheets, you can select a worksheet to view additional information and add it to your cart.
Click View cart to preview your file data selections as a virtual file.
From this window, you can edit the virtual table name to be created from file data. Additionally, you can edit schema names, remove a selection from your cart, or edit the list of remote files that participate in a merged table.

Select the appropriate option to assign the virtual table to be created from file data.

Assign to	When to use this option
Project	Select Project if you created the virtual table to use in a specific project. Then, choose the appropriate project. The table also appears in Virtualized data.
Virtualized data	Select Virtualized data if the table was not created to use in a specific project. This setting is the default if no projects exist.

Select Publish to catalog if you also want to publish to a selected catalog.
A list of available catalogs is shown in the drop-down menu. Each catalog is tagged as Governed or Not governed.
Note: You must have at least one catalog in Watson Knowledge Catalog.
You must have permission to publish to a catalog. An administrator can enable whether all virtual objects are published to a selected governed catalog, which prevents a user from publishing to a specified catalog.
Specify a schema in the Schema field.
You can also create a schema by following these steps.
- If you have the Watson Query Engineer or User role, leave the Schema field as default to create a schema with your user ID.
- If you have the Watson Query Admin role, leave the Schema field as default to create a schema with your user ID or enter the new schema name in the Schema field.
For more information, see Creating schemas for virtual objects.
Click Virtualize to complete the process.
When the status window appears, you can select to view your virtualized data or virtualize more data. However, you must wait until virtualization is complete before you navigate away from the page.

Results

If Watson Query and Watson Knowledge Catalog are installed in the same OpenShift® project (namespace), your virtual object is published to the primary catalog.

What to do next

You can collect statistics for your virtual object. For more information, see Collecting statistics in Watson Query.
On the Virtualized data page, you can publish your virtual object to the catalog. For more information, see Publishing virtual data to the catalog.
You can also create join multiple virtual tables to create a joined view. See Creating a join view from multiple tables.