A data source is a repository or system that stores and provides data for use in applications or systems. Data sources are the storage locations that supply data to AI workflows and serve as the foundation for feeding information into the data pipeline. You can choose from several options, including Amazon S3 and the IBM Storage Scale file system.
Before you begin
Verify that the manual setup to access Kafka is configured for the cluster that is associated with this data source. For more information about the configuration, see Manually set up to connect to Kafka broker. You must repeat this configuration every time you rotate the certificate of the Kafka broker.
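As an optional, illustrative check only (it assumes that the Kafka credentials are stored as secrets in the ibm-storage-fusion-ns namespace, which this procedure does not confirm), you can list the secrets in that namespace and look for Kafka-related entries:
oc get secrets -n ibm-storage-fusion-ns | grep -i kafka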
About this task
- IBM Fusion Content-Aware Storage (CAS) does not directly interface with S3 storage providers. Instead, it accesses S3 storage through the IBM Storage Scale file system, which provides a cached copy of the S3 content.
- Up to 25 CAS data sources are supported. Each data source can be one of the following:
  - An external S3 bucket that uses an IBM Storage Scale AFM fileset to ingest the S3 content.
  - A fileset that resides on the IBM Storage Scale file system attached to CAS (without AFM).
- The following constraints apply to change notifications for filesets that reside in an IBM Storage Scale file system that is remotely mounted by IBM Storage Scale CNSA (without AFM):
  - Up to 10 million files per fileset
  - Up to 100 million monitored files in total
- If the fileset is an independent fileset, only nested dependent filesets are included in its watch. If the fileset is a dependent fileset, no nested filesets within it are watched.
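To check whether a fileset is independent or dependent, you can run the standard IBM Storage Scale mmlsfileset command on the Storage Scale cluster. This is an illustrative sketch; fs1 and fileset1 are placeholder names for your file system and fileset:
mmlsfileset fs1 fileset1 -L
In the -L output, an independent fileset owns its own inode space, while a dependent fileset shares the inode space of its parent fileset.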
Procedure
- From the menu, go to .
- In the Data sources page, click Connect data source.
- Enter the name of the Data source.
- Select S3 or IBM Storage Scale for Type.
- Enter the connection information based on your choice of type:
- S3
  - Enter the Account number. It is a unique identifier that is assigned to each account.
  - Enter the Account key. It is the security credential for the account.
  - Enter the Endpoint. It is the URL through which you can access the bucket and its contents.
  - For Amazon Web Services, enter the Bucket and Region. The Amazon Web Services region is where your bucket is located.
  - Certificate settings: SSL-secured object storage locations require certificates for authentication. Create an OpenShift TLS secret in the ibm-storage-fusion-ns namespace and provide the secret name for the credentials (see the example command after this step).
- IBM Storage Scale
  - Enter the Filesystem Path.
  - Select the file system where the cache of this data source must be stored. If only a single remote file system is detected, it is selected automatically and the input to choose a file system remains hidden. The input is visible only when multiple file systems are configured.
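The following command is a minimal sketch of how the TLS secret for an SSL-secured object storage endpoint can be created; the secret name (s3-endpoint-cert) and the certificate and key file names are placeholder values, not names defined by the product:
oc create secret tls s3-endpoint-cert --cert=tls.crt --key=tls.key -n ibm-storage-fusion-ns
Enter the same secret name in the Certificate settings of the data source.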
- Click Connect to submit the information so that CAS can be enabled for the data source. For an S3 data source, CAS sets up AFM. For either an S3 or a Scale data source, CAS configures a watch folder for the corresponding fileset.
- If a Group ID is defined for the Scale fileset, run the following command to add the annotation to the previously created data source:
oc annotate DataSource datasource-name group-id='310' --overwrite
Here, datasource-name and 310 are example values that change depending on the data source name and the GID that is set in Scale.
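As an optional check that is not part of the documented procedure, you can read the annotation back with oc; datasource-name is the same example value as in the previous command:
oc get DataSource datasource-name -o jsonpath='{.metadata.annotations.group-id}'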