You can use data that is stored as files on IBM® Cloud Object Storage, Amazon S3, or Ceph® data sources to create virtual tables. Segment or
combine data from one or more files to create a virtual table.
Procedure
- On the navigation menu, click .
The service menu opens to the Data
sources page by default.
- On the service menu, click
and then click the Files tab.
The list of available data sources appears. You can narrow down the displayed assets by
using the available filters.
If you specified a bucket name when you set up the data source
connection, click to expand the object storage connection details to see the
Service type and Bucket information. If you didn't
specify a bucket name when you set up the data source connection, you can use the
Bucket input field to find a specific bucket in the endpoint.
If the
list of data sources does not appear, click Refresh.
- Select the object storage endpoint where you want to browse files and file
paths.
For Cloud Object Storage, the endpoint
is the URL for the object storage.
A list of file paths or buckets in the
endpoint appears. You can navigate through the file path structure or you can click
to see
details of the contents of the first file in the file path.
You cannot add the bucket to your
cart. To add a file path to your cart, you must select the file path to preview the files in the
path, and then click Add to cart. You cannot select a file at the bucket
level, you must add the file to a file path in the bucket.
- Select the file or file path that you want to virtualize and click Add to
cart.
Important:
- You cannot virtualize a single file
in a file path that contains multiple files. The URL resolves to the parent path where the file is
located and the entire file path is virtualized. If you want to virtualize a single file, you can
move it to a separate file path. The separate file path must not be a sub-file path of any other
file path that is, or will be, virtualized.
- Files that you want to virtualize must be within a file path and not at the same level as the
bucket. For example, you cannot virtualize a file
s3a://mynewbigsqlbucket/mydata.csv; you must put
mydata.csv into a file path and virtualize
s3a://mynewbigsqlbucket/fi1epath1/mydata.csv because the virtualization process
cannot create an external table by using only a bucket name without a path.
-
Click View cart to preview your file data selections as a virtual file.
From this window, you can edit schema names, preview files that participate in a merged table,
or remove a selection from your cart.
If you have Watson™ Knowledge
Catalog installed, you can publish your virtual table
to a catalog. For more information, see Publishing virtual data to a catalog with Data Virtualization.
- Recommended: Update the type of partitioned columns from STRING to something more
appropriate. Manually inspect and specify proper types for partitioning columns for best
performance.
- Optional: Click and
select Edit columns.
You
can edit any column name that is not tagged as a Partitioning column and
change column types by using the drop-down menu. When you are happy with your edits, click
Apply. Updated column names are shown after you virtualize the
table.
- Select the appropriate option to assign the virtual table to be created from file data:
Assign to |
When to use this option |
Data request |
Select Data request if you created the virtual table in response to a
data request. Then, choose the appropriate request. The table also appears in Virtualized
data. |
Project |
Select Project if you created the virtual table to use in a specific
analytics project. Then, choose the appropriate project. The table also appears in
Virtualized data. |
Virtualized data |
Select Virtualized data if the table was not created in response to a data request or to use in a specific project. This setting is
the default if no data requests or projects exist. |
- Select Publish to catalog if you also
want to publish to a selected catalog.
A list of available catalogs is shown in the
drop-down menu. Each catalog is tagged as
Governed or
Not
governed.
Note: You must have at least one catalog in
Watson Knowledge
Catalog.
You must have permission to publish to a
catalog. An administrator can enable whether all virtual objects are published to a selected
governed catalog, which prevents a user from publishing to a specified
catalog.
- Specify a schema in the Schema
field.
You can also create a schema by following these
steps.
- If you have the Data
Virtualization
Engineer or User role, leave the
Schema field as default to create a schema with your user ID.
- If you have the Data
Virtualization
Admin role, leave the
Schema field as default to create a schema with your user ID or enter the new
schema name in the Schema field.
-
Click Virtualize to complete the process.
When the status window appears, you can select to view your virtualized data
or virtualize more data. However, you must wait until virtualization is complete before you navigate
away from the page.
Results
If Data
Virtualization and Watson Knowledge
Catalog are installed in the same OpenShift® project (namespace), your virtual object is
published to the primary catalog.
What to do next
- View the table structure and metadata.
- Manage access to the table.
- Edit column names and the types of your object storage assets so that you can prepare accurate
data for virtualization.
- Refresh partitions for your virtualized table so that your data is up to date
and accurate. Click Refresh partitions in the overflow menu on the
Virtualized data page to identify new partitions. For more information, see
Data sources in object storage in Data Virtualization.
- Collect statistics for your virtualized table to optimize query performance. For more
information, see Collecting statistics in Data Virtualization.
-
Optionally, on the Virtualized data page, publish
your virtual object to the catalog. For more information, see Publishing virtual data to the catalog.