Creating a data source

A data source is a repository or system that stores data and provides it to applications, in this case to AI workflows. It serves as the foundation for feeding information into the data pipeline. The available options include Amazon S3 and the IBM Storage Scale file system.

About this task

  • IBM Fusion Content-Aware Storage (CAS) does not interface directly with S3 storage providers. It accesses S3 storage through the IBM Storage Scale S3 filesystem, which provides a cached copy of the S3 content when the AFM feature is enabled.
  • Up to 25 CAS data sources are supported. Each data source can be one of the following:
    • An external S3 bucket that uses an IBM Storage Scale AFM fileset to ingest the S3 content.
    • A fileset residing on the IBM Storage Scale filesystem attached to CAS (without AFM).
  • The following constraints apply to change notifications for filesets that reside in an IBM Storage Scale file system that is remote mounted by IBM Storage Scale CNSA (without AFM):
    • Up to 10 million files per fileset
    • Up to 100 million files total monitored
  • Only independent filesets are supported.
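
The file-count constraints above can be checked before a fileset is connected. The following is a minimal sketch, assuming a shell on a node where the fileset junction paths are mounted. The function names are hypothetical and are not part of CAS or IBM Storage Scale, and walking a very large fileset with find can take a long time.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical.
# The limits are the change-notification constraints listed above.
MAX_FILES_PER_FILESET=10000000
MAX_FILES_TOTAL=100000000

# Count regular files under a fileset junction path.
# File names that contain newlines are counted more than once.
count_files() {
    find "$1" -type f -print | wc -l | tr -d ' '
}

# Report the file count for each path and for all paths together.
# Returns non-zero if the per-fileset or total limit is exceeded.
check_notification_limits() {
    total=0
    status=0
    for path in "$@"; do
        n=$(count_files "$path")
        total=$((total + n))
        if [ "$n" -gt "$MAX_FILES_PER_FILESET" ]; then
            echo "$path: $n files (over per-fileset limit)"
            status=1
        else
            echo "$path: $n files"
        fi
    done
    if [ "$total" -gt "$MAX_FILES_TOTAL" ]; then
        echo "total: $total files (over total limit)"
        status=1
    else
        echo "total: $total files"
    fi
    return "$status"
}
```

For example, check_notification_limits /gpfs/gpfs3/my-data reports the count for that one junction path. Pass every monitored junction path to also check the combined total.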

Procedure

  1. From the menu, go to Content-aware storage > Data source.
  2. On the Data sources page, click Connect data source.
  3. Enter the name of the Data source.
  4. Select the storage type and click Next.
    The available storage types are IBM Cloud, IBM Storage Scale, AWS, and S3 Compliant.
  5. Enter the following details on the Connection details page.
    The Connection details page varies based on your storage type selection.
    • IBM Cloud, AWS, or S3 Compliant
      • Enter the Endpoint.

        The endpoint is the URL through which you can access the bucket and its contents. For more information about Endpoint rules, see AWS documentation.

      • Enter the name of the S3 Bucket.

        For more information about the bucket naming guidelines, see AWS documentation.

      • For AWS, enter the Region, which is the Amazon Web Services region where the bucket is located.
      • Enter the Access key and Secret access key.

        These are security credentials needed to access the contents of the bucket.

      • In the Certificate settings section, enter the Secret name for certificate.

        This parameter is optional. SSL-secured object storage locations require certificates for authentication. Create an OpenShift TLS secret in the ibm-storage-fusion-ns namespace or your Fusion namespace, and provide the name of that secret.

    • IBM Storage Scale

      Enter the Path. It is the junction path.

  6. In the Caching filesystem section, select the file system where the cache of this data source must be stored.
    If only one remote file system is detected, it is selected automatically and this field is not available for selection.
  7. Click Connect to submit the information so that CAS can be enabled for the data source.
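
The TLS secret mentioned in the Certificate settings step can be created from the command line before you fill in the Connection details page. The following is a minimal sketch, assuming the certificate and its key are available as PEM files and that you are logged in with oc. The function name, the secret name, and the file paths are hypothetical.

```shell
#!/bin/sh
# Sketch only: the function name, secret name, and file paths are hypothetical.

# Create a TLS secret from a PEM certificate and key.
# Arguments: secret name, certificate file, key file, and optionally the
# namespace (defaults to ibm-storage-fusion-ns).
create_cas_cert_secret() {
    oc create secret tls "$1" \
        --cert="$2" \
        --key="$3" \
        --namespace="${4:-ibm-storage-fusion-ns}"
}
```

For example, create_cas_cert_secret s3-cert ./tls.crt ./tls.key creates a secret named s3-cert, which is then the value to enter in the Secret name for certificate field. If Fusion is installed in a different namespace, pass that namespace as the fourth argument.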

What to do next

For an IBM Storage Scale type data source, determine the group owner of the junction path. That group must have read and execute permission on the junction path so that CAS can read its files and view the directory.
Log in to your Scale cluster to check the permission. For example, if your junction path is /gpfs/gpfs3/my-data/, run the following commands to change to the parent directory and list its contents:
cd /gpfs/gpfs3/
ls -la
Example output:
drwxr-x---  2 root cas    4096 May 22 19:50 my-data
To look up the group ID (GID) of the group owner, run the following command:
getent group cas | cut -d: -f3
Example output:
310
If the group owner is not root, run the following command to add the annotation to the previously created data source. In this example, the group owner is cas.
oc annotate DataSource datasource-name group-id='310' --overwrite

Here, datasource-name and 310 are example values. Replace them with the name of your data source and the GID that is set in Scale.
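
The checks above can also be scripted. The following is a minimal sketch, assuming GNU stat is available on the Scale cluster node. The helper names are hypothetical and are not part of CAS or IBM Storage Scale. Because the permission check runs on the Scale cluster and oc runs against the OpenShift cluster, the last helper only prints the annotate command instead of running it.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Print the numeric GID of the group that owns a path (requires GNU stat).
junction_gid() {
    stat -c '%g' "$1"
}

# Succeed if the group owner has both read and execute permission on the path.
group_has_rx() {
    perms=$(stat -c '%A' "$1") || return 1   # for example: drwxr-x---
    case "$perms" in
        ????r?[xs]*) return 0 ;;             # group bits are characters 5 to 7
        *)           return 1 ;;
    esac
}

# Print the annotate command for a data source name and a GID.
# Nothing is printed for GID 0 (root), where no annotation is needed.
annotate_command() {
    [ "$2" -eq 0 ] && return 0
    printf "oc annotate DataSource %s group-id='%s' --overwrite\n" "$1" "$2"
}
```

For example, on the Scale cluster, run group_has_rx /gpfs/gpfs3/my-data to confirm the permission, and then run annotate_command datasource-name "$(junction_gid /gpfs/gpfs3/my-data)" to print the command to run where oc is logged in to the OpenShift cluster.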