Creating a data source

A data source is a repository or system that stores data and provides it to applications. Data sources are the storage locations that supply data for AI workflows.

Before you begin

For prerequisites specific to S3 type data sources, see S3 type data source.
For prerequisites and limitations specific to NFS data sources, see Network File System (NFS) type data source.

About this task

The data source serves as the foundation for feeding information into the data pipeline.
You can choose from the following options:
- Object Storage such as Amazon S3
- IBM Storage Scale
- Network File System (NFS)
Content-Aware Storage (CAS) does not interface directly with S3 storage providers. Instead, it accesses S3 storage through the IBM Storage Scale S3 filesystem, which, with the Active File Management (AFM) feature enabled, provides a cached copy of the S3 content.
Up to 25 CAS data sources are supported. Each data source can be one of the following:
- An external S3 bucket using an IBM Storage Scale AFM fileset to ingest the S3 content.
- A fileset residing on the IBM Storage Scale filesystem attached to CAS (without AFM).
CAS supports multiple IBM Storage Scale filesystems. However, all remote filesystems must belong to a single remote IBM Storage Scale cluster.
The following constraints apply to change notifications for filesets that reside in an IBM Storage Scale file system that is remote mounted by the IBM Fusion Global Data Platform service (without AFM):
- Up to 10 million files per fileset
- Up to 100 million files total monitored
Only independent filesets are supported.
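
As a rough pre-check against the per-fileset limit listed above, the files under a fileset's junction path can be counted on a node where the file system is mounted. The following is a minimal sketch, not an IBM-provided tool: the function names and the MAX_FILES_PER_FILESET variable are hypothetical, it counts regular files only (how CAS itself counts entries is an assumption), and a plain find walk can be slow on very large filesets.

```shell
#!/bin/sh
# Hypothetical helper: count regular files under a junction path and
# compare the total with the documented per-fileset limit (10 million).
MAX_FILES_PER_FILESET="${MAX_FILES_PER_FILESET:-10000000}"

count_fileset_files() {
  # $1 = junction path of the fileset (for example /gpfs/gpfs3/my-data)
  # Note: file names that contain newlines are counted more than once.
  find "$1" -type f -print | wc -l | tr -d ' '
}

check_fileset_limit() {
  # Prints a one-line summary; returns 0 if within the limit, 1 otherwise.
  count=$(count_fileset_files "$1")
  if [ "$count" -le "$MAX_FILES_PER_FILESET" ]; then
    echo "OK: $count files (limit $MAX_FILES_PER_FILESET)"
  else
    echo "OVER LIMIT: $count files (limit $MAX_FILES_PER_FILESET)"
    return 1
  fi
}

# Usage (path is the example value from this topic):
#   check_fileset_limit /gpfs/gpfs3/my-data
```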

Procedure

From the menu, go to Content-aware storage > Data source.
In the Data sources page, click Connect data source.
Enter the name of the Data source.
Select the storage type and click Next.
The available storage types are IBM Cloud, Storage Scale, AWS, S3 Compliant, and NFS.
Enter the following details in the Connection details page.
The Connection details page varies based on your storage type selection.
- IBM Cloud, AWS, or S3 Compliant
  - Enter the Endpoint.
    It refers to the URL through which you can access the bucket and its contents. For more information about Endpoint rules, see AWS documentation.
  - Enter the name of the S3 Bucket.
    For more information about the bucket naming guidelines, see AWS documentation.
  - For AWS, enter the Region. It is the Amazon Web Services region where the bucket is located.
  - Enter the Access key and Secret access key.
    These are security credentials needed to access the contents of the bucket.
  - In the Certificate settings section, enter the Secret name for certificate.
    This is an optional parameter. SSL secured object storage locations require certificates for authentication. Create an OpenShift TLS secret in the ibm-spectrum-fusion-ns namespace or your IBM Fusion namespace. Provide the secret name for the credentials. For more information about creating a secret, see Creating a secret.
- Storage Scale
  Enter the Path. It is the junction path.
- NFS
  Provide the following information:
  - NFS server: Enter the NFS server hostname or IP address.
    Note: Although this field is defined as a list, only a single NFS server is supported in the CAS 1.1.5 release.
  - Export path: Enter the NFS export path to monitor. For example: /home/user/data
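
For the NFS fields above, it can help to confirm beforehand that the server actually exports the path. The sketch below assumes that the standard showmount utility is installed on a client that can reach the server, and that the server answers export-list queries (some NFSv4-only servers do not). The helper name has_export is hypothetical, and nfs.example.com is a placeholder hostname.

```shell
#!/bin/sh
# Hypothetical helper: read `showmount -e` output on stdin and succeed
# only if the given export path appears in the first column.
has_export() {
  # $1 = export path to look for (for example /home/user/data)
  # NR > 1 skips the "Export list for <server>:" header line.
  awk -v path="$1" 'NR > 1 && $1 == path { found = 1 } END { exit !found }'
}

# Usage (placeholder hostname; not run here):
#   showmount -e nfs.example.com | has_export /home/user/data && echo "export found"
```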
In the Caching filesystem section, select the file system where the cache of this data source must be stored.
If only one remote file system is detected, it is selected automatically and this field is not available for selection.
Click Connect to submit the information so that CAS can be enabled for the data source.
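
For the optional Secret name for certificate field described in the Certificate settings step, the secret must exist before you connect the data source. The following sketch only prints an oc create secret tls command for review. It assumes that the certificate and private key are available as PEM files; the helper name, secret name, and file names are hypothetical placeholders, and the Creating a secret topic remains the authoritative reference for what the secret must contain.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) a command that creates a TLS secret
# for the optional "Secret name for certificate" field.
print_tls_secret_cmd() {
  # $1 = secret name, $2 = PEM certificate file, $3 = PEM private key file
  # $4 = namespace (defaults to ibm-spectrum-fusion-ns, as named in this topic)
  ns="${4:-ibm-spectrum-fusion-ns}"
  printf 'oc create secret tls %s --cert=%s --key=%s -n %s\n' "$1" "$2" "$3" "$ns"
}

# Usage (secret and file names are placeholders):
#   print_tls_secret_cmd my-s3-cert tls.crt tls.key
```

Pass your own IBM Fusion namespace as the fourth argument if it differs from the default.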

What to do next

For an IBM Storage Scale type data source, determine the group owner that has read and execute permission on the junction path. CAS needs these permissions to read the files and view the directory.

Log in to your Scale cluster to check the permissions. For example, if your junction path is /gpfs/gpfs3/my-data/, run the following commands to change to the parent directory and list its contents:

cd /gpfs/gpfs3/

ls -la

Example output:

drwxr-x---  2 root cas    4096 May 22 19:50 my-data
[root@tc11scale1 gpfs3]# getent group cas | cut -d: -f3
310
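
The same information can be gathered in one step with stat. This is a minimal sketch, assuming that GNU coreutils stat (with the -c format option) is available on the Scale node. Both helper names are hypothetical, and the check looks only at the POSIX mode bits, not at any ACLs set on the path.

```shell
#!/bin/sh
# Hypothetical helper: print the group name, numeric GID, and permission
# string of a junction path.
junction_group_info() {
  # $1 = junction path (for example /gpfs/gpfs3/my-data)
  stat -c 'group=%G gid=%g mode=%A' "$1"
}

# Hypothetical helper: succeed only if the group has read and execute access.
group_can_read_and_traverse() {
  # %A prints a string such as drwxr-x---; characters 5-7 are the group bits.
  mode=$(stat -c '%A' "$1") || return 2
  group_bits=$(printf '%s' "$mode" | cut -c5-7)
  case "$group_bits" in
    r?x|r?s) return 0 ;;   # read + execute (s = setgid with execute)
    *)       return 1 ;;
  esac
}

# Usage (path is the example value from this topic):
#   junction_group_info /gpfs/gpfs3/my-data
#   group_can_read_and_traverse /gpfs/gpfs3/my-data || echo "group lacks r-x"
```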

If the group owner is not root, run the following command to add the group-id annotation to the data source that you created. In this example, the group owner is cas, whose GID is 310.

oc annotate DataSource datasource-name group-id='310' --overwrite

Here, datasource-name and 310 are example values. Replace them with your data source name and the GID set in Scale.
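
Because the GID is looked up on the Scale cluster while the annotation is applied from an OpenShift client, it can be convenient to generate the exact command once the GID is known. The sketch below uses a hypothetical helper name and only prints the annotate command shown above after checking that the GID is numeric; it does not contact the cluster.

```shell
#!/bin/sh
# Hypothetical helper: print the annotate command for a data source name
# and a numeric GID. Nothing is executed against the cluster.
print_annotate_cmd() {
  # $1 = data source name, $2 = numeric GID from the Scale cluster
  case "$2" in
    ''|*[!0-9]*)
      echo "error: GID must be a non-empty number" >&2
      return 1
      ;;
  esac
  printf "oc annotate DataSource %s group-id='%s' --overwrite\n" "$1" "$2"
}

# Usage with the example values from this topic:
#   print_annotate_cmd datasource-name 310
```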