Creating a data source
A data source is a repository or system that stores data and provides it to applications. In this context, data sources are the storage locations that supply data to AI workflows.
Before you begin
- For prerequisites specific to S3 type data sources, see S3 type data source.
- For prerequisites and limitations specific to NFS data sources, see Network File System (NFS) type data source.
About this task
- The data source serves as the foundation for feeding information into the data pipeline. The following data source types are available:
  - Object Storage such as Amazon S3
  - IBM Storage Scale
  - Network File System (NFS)
- Content-Aware Storage (CAS) does not interface directly with S3 storage providers. It accesses S3 storage through the IBM Storage Scale S3 filesystem, which provides a cached copy of the S3 content when the Active File Management (AFM) feature is enabled.
- Up to 25 CAS data sources are supported. Each data source can be one of the following:
  - An external S3 bucket that uses an IBM Storage Scale AFM fileset to ingest the S3 content.
  - A fileset that resides on the IBM Storage Scale filesystem attached to CAS (without AFM).
- CAS supports multiple IBM Storage Scale filesystems. However, all remote filesystems must belong to a single remote IBM Storage Scale cluster.
- The following constraints apply to change notifications for filesets that reside in an IBM Storage Scale filesystem that is remote mounted by the IBM Fusion Global Data Platform service (without AFM):
  - Up to 10 million files per fileset
  - Up to 100 million files total monitored
  - Only independent filesets are supported.
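The numeric limits above can be checked before you create a data source. The following is a minimal sketch, not part of the product: the helper names `check_data_source_count` and `check_file_counts` are hypothetical, the limits are copied from the list above, and the file counts are assumed to be supplied by you (the sketch does not query CAS or IBM Storage Scale).

```shell
# Limits quoted from the constraints listed above.
MAX_DATA_SOURCES=25
MAX_FILES_PER_FILESET=10000000
MAX_FILES_TOTAL=100000000

# check_data_source_count COUNT  (hypothetical helper)
# Succeeds when COUNT does not exceed the data source limit.
check_data_source_count() {
  if [ "$1" -gt "$MAX_DATA_SOURCES" ]; then
    echo "too many data sources: $1 (limit $MAX_DATA_SOURCES)"
    return 1
  fi
  echo "ok"
}

# check_file_counts COUNT...  (hypothetical helper)
# Takes one file count per monitored fileset; the caller supplies the counts.
# Checks the per-fileset limit first, then the total across all filesets.
check_file_counts() {
  total=0
  for count in "$@"; do
    if [ "$count" -gt "$MAX_FILES_PER_FILESET" ]; then
      echo "fileset over per-fileset limit: $count (limit $MAX_FILES_PER_FILESET)"
      return 1
    fi
    total=$((total + count))
  done
  if [ "$total" -gt "$MAX_FILES_TOTAL" ]; then
    echo "total over limit: $total (limit $MAX_FILES_TOTAL)"
    return 1
  fi
  echo "ok"
}
```

For example, `check_file_counts 4000000 9500000` prints `ok`, while a single count of 10000001 fails the per-fileset check. The file-count limits apply only to remote-mounted filesets without AFM, so pass only those filesets to `check_file_counts`.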
Procedure
What to do next
Log in to your Scale cluster to check the permissions on the junction path. For example, if your junction path is /gpfs/gpfs3/my-data/, run the following commands to change to the parent directory and list its contents:
    cd /gpfs/gpfs3/
    ls -la
Example output:
    drwxr-x--- 2 root cas 4096 May 22 19:50 my-data
    [root@tc11scale1 gpfs3]# getent group cas | cut -d: -f3
    310
If the group owner is not root, run the following command to add the annotation to the previously created data source. In this example, the group owner is cas and its GID is 310.
    oc annotate DataSource datasource-name group-id='310' --overwrite
Here, datasource-name and 310 are example values. Replace them with your data source name and the GID that is set in Scale.
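The check above can be scripted. The following is a minimal sketch under stated assumptions: the function names are hypothetical, `stat -c '%G'` assumes GNU coreutils on the Scale node, and the `oc annotate` command is only printed, not run, so that you can review it before you run it on the cluster where the data source was created.

```shell
# annotate_command DATASOURCE GROUP GID  (hypothetical helper)
# Prints the oc annotate command from the example above, or a note
# when the group owner is root and no annotation is needed.
annotate_command() {
  if [ "$2" = "root" ]; then
    echo "group owner is root; no annotation needed"
    return 0
  fi
  printf "oc annotate DataSource %s group-id='%s' --overwrite\n" "$1" "$3"
}

# annotate_command_for_path DATASOURCE JUNCTION_PATH  (hypothetical helper)
# Looks up the group owner and its GID on the Scale node, then prints
# the command. Assumes GNU stat and a group that getent can resolve.
annotate_command_for_path() {
  group=$(stat -c '%G' "$2") || return 1
  # Same GID lookup as in the example output above.
  gid=$(getent group "$group" | cut -d: -f3)
  if [ -z "$gid" ]; then
    echo "could not resolve GID for group: $group"
    return 1
  fi
  annotate_command "$1" "$group" "$gid"
}
```

With the example values, `annotate_command datasource-name cas 310` prints `oc annotate DataSource datasource-name group-id='310' --overwrite`. On the Scale node, `annotate_command_for_path datasource-name /gpfs/gpfs3/my-data` performs the lookup for you.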