External Resources
IBM StreamSets engines can require access to external files and libraries, depending on how you design pipelines.
For example, JDBC stages require a JDBC driver to access the database. When you use a JDBC stage, you must make the driver available as an external resource.
You configure a deployment to use one of the following sources for external resources:
- None
- Use no configured source when using a single engine instance to get started with IBM StreamSets, or when your pipelines do not require external resources.
- Archive file
- Use an archive file that includes the external resources when the deployment launches multiple engine instances and when your pipelines require external resources.
External Resource Types
| External Resource Type | Description |
|---|---|
| Runtime resource files | Files that define pipeline property values that are called from within a pipeline. |
| External libraries | External libraries required by pipeline stages. External libraries can include JDBC or JMS drivers or external Java libraries. |
| Custom stage libraries | Stage libraries for custom stages. For example, you might develop a custom processor to perform custom processing in a pipeline. For more information, see Custom Stage Libraries in the Data Collector engine documentation. Important: To use custom stage libraries, you must configure the deployment to use an external resource archive. |
No Source
Configure the deployment to use no source for external resources when using a single engine instance to get started with IBM StreamSets, or when your pipelines do not require external resources.
With no configured source, how you upload a resource depends on its type:
- Runtime resource files - Use the engine details page.
- External libraries - Use the pipeline canvas or the engine details page. Tip: In most cases, using the pipeline canvas is simpler because the stage library that requires access to the external library is automatically selected for you.
Uploading Resources from the Pipeline Canvas
When using no configured source for the deployment, you can upload external libraries, such as JDBC drivers, from the pipeline canvas.
Uploading Resources from the Engine Details
When using no configured source for the deployment, you can upload runtime resource files and external libraries from the engine details page.
Removing Resources
When using no configured source for the deployment, you can remove existing runtime resource files and external libraries from the engine details page.
Remove outdated or unused runtime resource files and external libraries to prevent them from being incorrectly used when running a pipeline.
Archive File as the Source
Configure the deployment to use an external resource archive when a deployment launches multiple engine instances and when your pipelines require external resources.
You typically configure a deployment to use an external resource archive when you are ready to move to production, after you have finished building your pipelines and have finalized the list of external resources that your pipelines require.
You generate an archive file in the TGZ or ZIP format, using the required folder names and directory structure. You store the file in a location that is accessible to all machines running an engine instance for the deployment. Then, you edit the deployment to define the location of the archive file.
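Before pointing a deployment at an archive, it can help to confirm that the file really is in one of the two accepted formats. The following is a minimal sketch using only the Python standard library. The function name `archive_format` is hypothetical and is not part of IBM StreamSets.

```python
import tarfile
import zipfile


def archive_format(path):
    """Return "tgz", "zip", or None for the file at `path`."""
    if zipfile.is_zipfile(path):
        return "zip"
    if tarfile.is_tarfile(path):
        # A TGZ file is a gzip-compressed tar, so check the gzip magic bytes.
        with open(path, "rb") as handle:
            if handle.read(2) == b"\x1f\x8b":
                return "tgz"
    return None
```

A result of `None` means the file is neither a ZIP nor a gzip-compressed tar, for example an uncompressed tar file.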
After you configure the external resource archive and restart all engine instances in the deployment, the archive file contents are extracted and copied into each engine instance.
When your pipelines require additional external resources, you extract the archive file, add the additional resources, and then compress the archive file again.
Archive Structure
An external resource archive file must use the required folder names and directory structure.
- resources
- The resources directory must include text files created for runtime resources.
- streamsets-libs-extras
- The streamsets-libs-extras directory must include a subdirectory for each set of required external libraries based on the stage library name, as follows: <stage library name>/lib/
- user-libs
- The user-libs directory must include a subdirectory for each custom stage.
If your pipelines do not use one of the external resource types, you can omit that directory. For example, if you have not developed custom stage libraries, you do not need to include the user-libs directory.
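The layout rules above can be checked mechanically before an archive is published. The sketch below is one possible check, not an official validator. It assumes the file paths passed in are relative to the folder that holds the three directories, with any wrapping parent folder already stripped. The names `ALLOWED_TOP_LEVEL` and `check_layout` are hypothetical.

```python
from pathlib import PurePosixPath

# Directory names required by the archive structure described above.
ALLOWED_TOP_LEVEL = {"resources", "streamsets-libs-extras", "user-libs"}


def check_layout(file_paths):
    """Return a list of problems found in archive-relative file paths."""
    problems = []
    for raw in file_paths:
        parts = PurePosixPath(raw).parts
        if not parts or parts[0] not in ALLOWED_TOP_LEVEL:
            problems.append(f"unexpected top-level entry: {raw}")
        elif parts[0] == "streamsets-libs-extras" and (
            len(parts) < 4 or parts[2] != "lib"
        ):
            # External libraries belong in <stage library name>/lib/.
            problems.append(f"expected <stage library name>/lib/<file>: {raw}")
        elif parts[0] == "user-libs" and len(parts) < 3:
            # Each custom stage needs its own subdirectory.
            problems.append(f"expected a subdirectory per custom stage: {raw}")
    return problems
```

An empty list means no problems were detected under these assumptions. It does not guarantee that an engine will accept the archive.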
Sample
Let's look at the contents of a sample external resource archive file created for Data Collector.
This sample archive file includes a runtime resource file named JDBC.txt, the MySQL JDBC driver for stages included in the JDBC stage library, and the Oracle JDBC driver for the Oracle Bulkload origin included in the JDBC Oracle stage library. It does not include any custom stage libraries:
externalResources
    resources
        JDBC.txt
    streamsets-libs-extras
        streamsets-datacollector-jdbc-lib
            lib
                mysql-connector-java-8.0.12.jar
        streamsets-datacollector-jdbc-oracle-lib
            lib
                ojdbc8-19.3.0.0.jar
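A folder laid out like this sample can be packed into a TGZ file with the Python standard library. The sketch below keeps the folder's own name (`externalResources` in the sample) as the top-level entry in the archive. Whether the engine expects that wrapping folder is an assumption here, so confirm it against the setup instructions for your deployment. The function name `build_archive` is hypothetical.

```python
import tarfile
from pathlib import Path


def build_archive(source_dir, archive_path):
    """Pack source_dir into a gzip-compressed tar file at archive_path."""
    source = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # Assumption: keep the folder name as the root entry, as in the sample.
        tar.add(str(source), arcname=source.name)
    return archive_path
```

For example, `build_archive("externalResources", "externalResources.tgz")` produces an archive whose entries start with `externalResources/`.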
Archive Location
Use the location most appropriate for your deployment type.
For example, for a self-managed deployment of engines to local on-premises machines, you might store the external resource archive file on a networked file system.
For a cloud service provider deployment, it's typically simpler to store the external resource archive file with that same cloud service provider. For example, for an Amazon EC2 deployment, you might store the file in Amazon S3.
File System
You can store an external resource archive file on a local or network file system. To ensure that all engine instances managed by the deployment can access the file, mount that directory from all engine machines or provide all engine machines access to that file using an HTTP URL.
When you configure the external resource location for the deployment, enter the path to the file.
Web Server
You can store an external resource archive file on a web server. To ensure that all engine instances managed by the deployment can access the file, provide all engine machines access to that file using an HTTP URL.
When you configure the external resource location for the deployment, enter the required URL format for the web server. For example:
https://<hostname>:<port>/shared/externalResources.tgz
Cloud Service
You can store an external resource archive file on one of the following cloud services:
- Amazon S3
- You can store the file in a private or public Amazon S3 bucket, based on the deployment type:
- Private bucket - Supported for Amazon EC2 deployments only. To ensure that all engine instances managed by the Amazon EC2 deployment can access the file, your AWS administrator must grant the AWS instance profile associated with the provisioned EC2 instances read access to the bucket.
- Public bucket - Supported for all deployment types, as long as you share the bucket publicly.
- Google Cloud Storage
- You can store the file in a private or public Google Cloud Storage bucket, based on the deployment type:
- Private bucket - Supported for GCE deployments only. To ensure that all engine instances managed by the GCE deployment can access the file, your Google Cloud administrator must grant the GCP instance service account associated with the provisioned VM instances read access to the bucket.
- Public bucket - Supported for all deployment types, as long as you share the bucket publicly.
- Azure Blob Storage or Azure Data Lake Storage Gen2
- You can store the file in a private or public Azure Blob Storage or Azure Data Lake Storage Gen2 container, based on the deployment type:
- Private container - Supported for Azure VM deployments only. To ensure that all engine instances managed by the Azure VM deployment can access the file, your Azure administrator must grant the Azure managed identity associated with the provisioned VM instances read access to the container.
- Public container - Supported for all deployment types, as long as you share the container publicly.
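For the private Amazon S3 bucket case, the read access that an administrator grants to the instance profile is usually expressed as an IAM policy document. The sketch below builds one possible minimal policy in the standard IAM JSON format. It is an illustration only: the text above says just that read access to the bucket is required, so the exact set of permissions that an engine needs is an assumption, and the bucket name `example-bucket` and the function `s3_read_policy` are hypothetical.

```python
import json


def s3_read_policy(bucket, key):
    """Build an IAM policy document that allows reading one S3 object."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Assumption: object read access is sufficient for the archive.
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{key}",
            }
        ],
    }


policy = s3_read_policy("example-bucket", "shared/externalResources.tgz")
print(json.dumps(policy, indent=2))
```

Google Cloud Storage and Azure use their own role and identity models, so the equivalent grant looks different on those services.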
Setting Up an Archive
Set up an external resource archive after you have finalized the list of external resources that your pipelines require.
Updating an Archive
When a deployment uses an external resource archive and your pipelines require additional resources, you manually update the archive file to include new external resources and then restart all engine instances in the deployment.
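The manual update (extract, add the new resource, compress again) can be scripted. The sketch below handles a TGZ archive with the Python standard library. It rewrites the archive in place, so keep a backup copy, and run it only on archives you created yourself, because extraction trusts the archive contents. The function name `add_to_archive` and its parameters are hypothetical.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def add_to_archive(archive_path, new_file, dest_dir):
    """Extract a TGZ archive, copy new_file into dest_dir, and recompress."""
    with tempfile.TemporaryDirectory() as work:
        # Step 1: extract the existing archive into a scratch directory.
        with tarfile.open(archive_path, "r:gz") as tar:
            tar.extractall(work)
        # Step 2: add the new resource at the requested archive-relative path.
        target = Path(work) / dest_dir
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy2(new_file, target)
        # Step 3: compress the scratch directory contents over the old archive.
        with tarfile.open(archive_path, "w:gz") as tar:
            for entry in sorted(Path(work).iterdir()):
                tar.add(str(entry), arcname=entry.name)
```

For example, `add_to_archive("externalResources.tgz", "new-driver.jar", "externalResources/streamsets-libs-extras/streamsets-datacollector-jdbc-lib/lib")` adds a driver for the JDBC stage library, assuming the archive has a top-level `externalResources` folder. The engines pick up the change only after all engine instances in the deployment are restarted.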