Creating data source

A data source is a repository or system that stores and provides data for use in applications or AI workflows. It serves as the foundation for feeding information into the data pipeline. You can choose from several options, including Amazon S3 and the IBM Storage Scale file system.

Before you begin

Verify that the manual setup to access Kafka is configured for the cluster associated with this data source. For more information about the configuration, see Manually set up to connect to Kafka broker. You must repeat the configuration every time you rotate the certificate of the Kafka broker.
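Because the configuration must be repeated after every certificate rotation, it can help to check a certificate's expiry date with `openssl`. The following is a hedged sketch: against a live broker you would pipe the output of `openssl s_client -connect broker-host:port` into the same `x509` command; here a throwaway self-signed certificate is generated so the inspection step can be shown offline. All file names and the subject are placeholders.

```shell
# Generate a throwaway self-signed certificate (placeholder for the real
# Kafka broker certificate) so the inspection step below can run offline.
openssl req -x509 -newkey rsa:2048 -keyout /tmp/kafka-test.key \
  -out /tmp/kafka-test.crt -days 30 -nodes -subj "/CN=kafka-broker" 2>/dev/null

# Print the certificate's expiry date, e.g. "notAfter=...".
openssl x509 -noout -enddate -in /tmp/kafka-test.crt
```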

About this task

  • IBM Fusion Content-Aware Storage (CAS) does not directly interface with S3 storage providers. Instead, it accesses S3 storage through the IBM Storage Scale S3 filesystem, which provides a cached copy of the S3 content.
  • Up to 25 CAS data sources are supported. Each data source can be one of the following:
    • An external S3 bucket that uses an IBM Storage Scale AFM fileset to ingest the S3 content.
    • A fileset residing on the IBM Storage Scale filesystem attached to CAS (without AFM).
  • The following constraints exist for change notifications for filesets that reside in an IBM Storage Scale file system that is remote mounted by IBM Storage Scale CNSA (without AFM):
    • Up to 10 million files per fileset
    • Up to 100 million files total monitored
  • If the fileset is an independent fileset, only its nested dependent filesets are included in the watch. If the fileset is a dependent fileset, no nested filesets within it are watched.

Procedure

  1. From the menu, go to Content-aware storage > Data source.
  2. On the Data sources page, click Connect data source.
  3. Enter the name of the Data source.
  4. Select S3 or IBM Storage Scale for Type.
  5. Enter the connection information based on your choice of type:
    S3
    1. Enter the Account number. It is a unique identifier assigned to each account.
    2. Enter the Account key. It is the security credential for the account.
    3. Enter the Endpoint. It is the URL through which you access the bucket and its contents.
    4. For Amazon Web Services, enter the Bucket and Region. The Amazon Web Services region is where your bucket is located.
    5. Certificate settings

      SSL-secured object storage locations require certificates for authentication. Create an OpenShift TLS secret in the ibm-storage-fusion-ns namespace, and provide the secret name for the credentials.

    IBM Storage Scale
    Enter the Filesystem Path.
  6. Select the file system where the cache of this data source must be stored.

    If only a single remote file system is detected, it is selected automatically, and the input to choose a file system remains hidden. The input is visible only when multiple file systems are configured.

  7. Click Connect to submit the information so that CAS can be enabled for the data source. For an S3 data source, CAS sets up AFM. For both S3 and IBM Storage Scale data sources, CAS configures a watch folder for the corresponding fileset.
  8. If a group ID is defined for the Scale fileset, run the following command to add the annotation to the previously created data source:
    oc annotate DataSource datasource-name group-id='310' --overwrite

    Here, datasource-name and 310 are example values; replace them with your data source name and the GID set in Scale.
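The certificate settings step above asks for an OpenShift TLS secret in the ibm-storage-fusion-ns namespace. The following is a minimal sketch of such a secret; the secret name and the encoded data values are placeholders.

```yaml
# Hedged sketch of a TLS secret in the ibm-storage-fusion-ns namespace.
# "s3-cert-secret" and the data values are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: s3-cert-secret
  namespace: ibm-storage-fusion-ns
type: kubernetes.io/tls
data:
  tls.crt: <base64-encoded certificate>
  tls.key: <base64-encoded private key>
```

Alternatively, the equivalent secret can be created directly from certificate files with `oc create secret tls s3-cert-secret --cert=server.crt --key=server.key -n ibm-storage-fusion-ns`.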