Creating a data source
A data source is a storage location that provides data for use in AI workflows. It serves as the foundation for feeding information into the data pipeline. Supported options include Amazon S3 and the IBM Storage Scale file system.
Before you begin
- Verify that the manual setup for Kafka access is configured for the cluster associated with this data source. For more information about the configuration, see Manually set up to connect to Kafka broker. You must repeat the configuration every time you rotate the certificate of the Kafka broker.
- For S3 type data source prerequisites, see Prerequisites for S3 type data source.
About this task
- IBM Fusion Content-Aware Storage (CAS) does not directly interface with S3 storage providers. It accesses S3 storage through the IBM Storage Scale S3 filesystem, which provides a cached copy of the S3 content when the AFM feature is enabled.
- Up to 25 CAS data sources are supported. Each data source can be one of the following:
  - An external S3 bucket that uses an IBM Storage Scale AFM fileset to ingest the S3 content.
  - A fileset residing on the IBM Storage Scale filesystem attached to CAS (without AFM).
- The following constraints apply to change notifications for filesets that reside in an IBM Storage Scale file system that is remote mounted by IBM Storage Scale CNSA (without AFM):
  - Up to 10 million files per fileset
  - Up to 100 million files total monitored
  - Only independent filesets are supported.
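The per-fileset file limit can be checked roughly before you create the data source. The following is a minimal sketch, not a product-supplied tool: the function names count_files and within_limit and the variable MAX_FILES_PER_FILESET are hypothetical, and a plain find walk is assumed to be acceptable for the size of your fileset. For very large filesets, a Storage Scale policy scan is likely to be more appropriate than find.

```shell
# Per-fileset limit taken from the constraints above.
MAX_FILES_PER_FILESET=10000000

# Print the number of regular files under a directory.
# Note: a file name that contains a newline is counted more than once.
count_files() {
  find "$1" -type f | wc -l | tr -d '[:space:]'
}

# Succeed when a count is at or below a limit.
within_limit() {
  [ "$1" -le "$2" ]
}

# Example (the path is illustrative):
#   n="$(count_files /gpfs/gpfs3/my-data)"
#   within_limit "$n" "$MAX_FILES_PER_FILESET" || echo "over limit: $n"
```

The count and the comparison are kept as separate functions so that the same comparison can be reused for the 100 million total by summing the counts of all monitored filesets.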
Procedure
What to do next
Log in to your Scale cluster to check the permissions on the junction path. For example, if your junction path is /gpfs/gpfs3/my-data/, run the following commands to change to the parent directory and list its contents:
cd /gpfs/gpfs3/
ls -la
Example output:
drwxr-x--- 2 root cas 4096 May 22 19:50 my-data
[root@tc11scale1 gpfs3]# getent group cas | cut -d: -f3
310
If the group owner is not root, run the following command to add the annotation to the previously created data source. In this example, the group owner is cas, whose GID is 310.
oc annotate DataSource datasource-name group-id='310' --overwrite
Here, datasource-name and 310 are example values; replace them with your data source name and the GID set in Scale.
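The lookup and the annotation can be combined into one small helper. The following is a sketch under stated assumptions: the function names junction_gid and format_annotate_cmd are hypothetical, GNU coreutils stat is assumed to be available on the Scale node, and GID 0 is assumed to be the root group. The helper only prints the oc command and does not run anything against the cluster.

```shell
# Print the numeric group ID that owns a path (assumes GNU coreutils stat).
junction_gid() {
  stat -c '%g' "$1"
}

# Print the annotate command for a data source name and a GID.
# Prints nothing for GID 0, which is assumed here to be the root group.
format_annotate_cmd() {
  case "$2" in
    ''|*[!0-9]*) echo "GID must be numeric" >&2; return 1 ;;
  esac
  if [ "$2" -eq 0 ]; then
    return 0
  fi
  printf "oc annotate DataSource %s group-id='%s' --overwrite\n" "$1" "$2"
}

# Example (run on the Scale cluster, then run the printed command
# from a session that is logged in with oc):
#   format_annotate_cmd datasource-name "$(junction_gid /gpfs/gpfs3/my-data)"
```

Printing the command instead of running it keeps the GID lookup on the Scale cluster separate from the oc session, which might be on a different host.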