Data Collector Configuration
You can edit the Data Collector configuration file, $SDC_CONF/sdc.properties, to configure properties such as the host name, port number, and account information for email alerts.
You can protect sensitive data in Data Collector configuration properties by storing the data in an external location and then using functions provided with the StreamSets expression language to retrieve the data. You can also reference information in an environment variable.
Kerberos Authentication
You can use Kerberos authentication to connect to external systems as well as YARN clusters.
By default, Data Collector uses the user account that started it to connect to external systems. When you enable Kerberos, Data Collector can use the Kerberos principal to connect to external systems.
The following stages can use Kerberos authentication:
- Hadoop FS Standalone origin
- Kafka Multitopic Consumer origin
- MapR FS Standalone origin
- HBase Lookup processor
- Hive Metadata processor
- Kudu Lookup processor
- Cassandra destination, when the DataStax Enterprise Java driver is installed
- Hadoop FS destination
- HBase destination
- Hive Metastore destination
- Kafka Producer destination
- Kudu destination
- MapR DB destination
- MapR FS destination
- Solr destination
- HDFS File Metadata executor
- MapR FS File Metadata executor
- MapReduce executor
- Spark executor
To enable Data Collector to use Kerberos authentication, use the required procedure for your installation type.
Enabling Kerberos for Tarball or RPM
To enable Kerberos authentication for a tarball or RPM installation, perform the following steps:
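As a rough orientation, the Kerberos properties listed in Configuring Data Collector are kerberos.client.enabled, kerberos.client.principal, and kerberos.client.keytab. The following is a minimal sketch of how they might look in $SDC_CONF/sdc.properties. The principal and keytab path are hypothetical placeholders, and the sketch is not a substitute for the full procedure.

```properties
# Sketch only: the principal and keytab location below are hypothetical.
kerberos.client.enabled=true
# Enter a service principal.
kerberos.client.principal=sdc/sdc-host.example.com@EXAMPLE.COM
# Keytab file that contains the credentials for the principal above.
kerberos.client.keytab=/etc/sdc/sdc.keytab
```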
Enabling Kerberos with Cloudera Manager
To enable Kerberos authentication for a Cloudera Manager installation, use Cloudera Manager.
When you enable Kerberos through Cloudera Manager, Cloudera Manager creates the required Kerberos principal and keytab.
- In Cloudera Manager, select the StreamSets service and then click Configuration.
- Select Enable Kerberos Client.
- In the Cloudera Manager home page, click .
- Click Kerberos Credentials.
- Click Generate Missing Credentials.
- Restart Data Collector.
- Configure the stage to use Kerberos.
Sending Email
You can configure email properties to enable Data Collector to send email notifications. Data Collector can send the following types of email:
- Email alert - Sends a basic email when an email-enabled alert is triggered, such as when the error record threshold has been reached.
- Pipeline notification - Sends a basic email when the pipeline state changes to a specified state. For example, you might use pipeline notification to send an email when a pipeline transitions to a Run_Error or Finished state.
- Email executor - Sends a custom email upon receiving an event from an event-generating stage. Use the Email executor in an event stream to send a user-defined email. You can include expressions to provide information about the pipeline or event in the email.
For example, you might use an Email executor to send an email upon receiving a failed-query event from the Hive Query executor, and you can include the failed query in the message.
To enable sending email, in the Data Collector configuration file, configure the mail.transport.protocol property, and then configure the smtp/smtps properties and the xmail properties. For more information, see Configuring Data Collector.
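The properties named above might be combined as in the following sketch for an SMTP server that uses authentication and STARTTLS. The host name, user name, addresses, and password file name are hypothetical; only the property names come from this page.

```properties
# Sketch only: host, account, and file names are hypothetical.
mail.transport.protocol=smtp
mail.smtp.host=smtp.example.com
mail.smtp.port=25
mail.smtp.auth=true
mail.smtp.starttls.enable=true
# Account that Data Collector uses to send email.
xmail.username=sdc-alerts
# Password kept outside the configuration file and read with the file function.
xmail.password=${file("email-password.txt")}
xmail.from.address=sdc-alerts@example.com
```

To use SMTPS instead, the same pattern would apply with mail.transport.protocol set to smtps and the mail.smtps properties in place of the mail.smtp properties.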
Protecting Sensitive Data in Configuration Files
You can protect sensitive data in Data Collector configuration files by storing the data in an external location and then using the file or exec function to retrieve the data.
You can use these functions in the $SDC_CONF/sdc.properties file and in any additional files included in the Data Collector configuration, such as the following files:
- dpm.properties
- vault.properties
- credential-stores.properties
Some configuration properties, such as the https.keystore.password property, require that you enter a password. Instead of entering the password in clear text, you can store the password outside of the configuration file and then use the file or exec function to retrieve the sensitive data.
- From a file
- Store the sensitive data in a separate file and then use the file function in the configuration file to retrieve the data as follows: ${file("<filename>")}
- Using a script or executable
- For increased security, develop a script or executable that retrieves the sensitive data from an external location. For example, you can develop a script that decrypts an encrypted file containing a password. Or you can develop a script that calls an external REST API to retrieve a password from a remote vault system.
When you use either the file or the exec function, Data Collector uses the exact output of the file or script. If the output consists of a password followed by a newline character, Data Collector uses the value with the newline character, which results in a password that is not valid. Carefully design and test the output of the file or script to ensure that the functions return only the expected sensitive data.
Retrieving Sensitive Data from Files
Use the file function in a configuration file to retrieve sensitive data from a local file.
You can store a single piece of information in a file. When Data Collector starts, it retrieves the sensitive data from the referenced files.
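For illustration, a keystore password could be moved out of the configuration file as sketched below. The file name keystore-password.txt is hypothetical, and the file should contain only the password with no trailing newline.

```properties
# Sketch only: keystore-password.txt is a hypothetical file name.
# The file must hold exactly the password and nothing else.
https.keystore.password=${file("keystore-password.txt")}
```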
Retrieving Sensitive Data Using Scripts
Use the exec function in a configuration file to call a script or executable that retrieves sensitive data from an external location.
You must save the script on the local machine where Data Collector runs. When Data Collector starts, it runs the script to retrieve the sensitive data.
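The following sketch assumes that the exec function takes a script name in the same way that the file function takes a file name; that call form is not shown on this page, so treat it as an assumption. The script name get-keystore-password.sh is hypothetical.

```properties
# Sketch only: assumes exec accepts a script name, as file accepts a file name.
# get-keystore-password.sh is a hypothetical script that prints only the
# password, with no trailing newline.
https.keystore.password=${exec("get-keystore-password.sh")}
```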
Referencing Environment Variables
You can reference an environment variable in the Data Collector configuration file, $SDC_CONF/sdc.properties, as follows: ${env("<environment variable>")}
You can also use this format to define runtime properties in the Data Collector configuration file or in separate runtime properties files.
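As a small sketch of the env format, the SMTP host could be taken from the environment of the Data Collector process. The variable name SMTP_HOST is hypothetical and must be set before Data Collector starts.

```properties
# Sketch only: SMTP_HOST is a hypothetical environment variable.
mail.smtp.host=${env("SMTP_HOST")}
```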
Running Multiple Concurrent Pipelines
By default, Data Collector can run approximately 22 standalone pipelines concurrently. If you plan to run a larger number of pipelines at the same time, increase the thread pool size.
The runner.thread.pool.size property in the Data Collector configuration file determines the number of threads in the pool that are available to run standalone pipelines. One running pipeline requires five threads, and pipelines share threads in the pool.
- Use a text editor to open sdc.properties.
- Calculate the approximate runner thread pool size by multiplying the number of running pipelines by 2.2.
- Set the runner.thread.pool.size property to your calculated value.
- To enable the changes, restart Data Collector.
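As a worked example of the calculation above, assume a hypothetical target of 40 concurrent standalone pipelines: 40 × 2.2 = 88.

```properties
# Worked example for a hypothetical target of 40 concurrent pipelines:
# 40 pipelines x 2.2 = 88 threads.
runner.thread.pool.size=88
```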
Hadoop Impersonation Mode
You can configure how Data Collector impersonates a Hadoop user when performing tasks, such as reading or writing data, in Hadoop systems. Data Collector can impersonate a Hadoop user in one of the following ways:
- As the user defined in stage properties - When configured, Data Collector uses the user defined in Hadoop-related stages.
- As the currently logged in Data Collector user who starts the pipeline - When no user is defined in a Hadoop-related stage, Data Collector uses the user who starts the pipeline.
The system administrator can configure Data Collector to always use the user who starts the pipeline by enabling the stage.conf_hadoop.always.impersonate.current.user property in the Data Collector configuration file. When the property is enabled, configuring a user within a stage is not allowed.
Configure Data Collector to always impersonate the user who starts the pipeline when you want to prevent stage-level user properties from being used to access data in Hadoop systems.
For example, say you use roles, groups, and pipeline permissions to ensure that only authorized operators can start pipelines. You expect that the operator user accounts are used to access all external systems. But a pipeline developer can specify an HDFS user in a Hadoop stage and bypass these security measures. To close this loophole, configure Data Collector to always use the currently logged in Data Collector user to read from or write to Hadoop systems.
To always use the user who starts the pipeline, in the Data Collector configuration file, uncomment the stage.conf_hadoop.always.impersonate.current.user property and set it to true.
This property affects the following stages:
- Hadoop FS Standalone origin and Hadoop FS destination
- MapR FS Standalone origin and MapR FS destination
- HBase lookup and destination
- MapR DB destination
- HDFS File Metadata executor
- MapR FS File Metadata executor
- MapReduce executor
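In $SDC_CONF/sdc.properties, the uncommented property described in this section would look like the following sketch.

```properties
# Always use the Data Collector user who starts the pipeline
# for Hadoop-related stages.
stage.conf_hadoop.always.impersonate.current.user=true
```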
Lowercasing User Names
When Data Collector impersonates Hadoop users to perform tasks in Hadoop systems, you can also configure Data Collector to lowercase all user names before passing them to Hadoop.
When the Hadoop system is case sensitive and the user names are lowercase, you might use this property to lowercase mixed-case user names that might be returned, for example, from a case-insensitive LDAP system.
To lowercase user names before passing them to Hadoop, uncomment the stage.conf_hadoop.always.lowercase.user property and set it to true.
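The uncommented property would look like the following sketch in the Data Collector configuration file.

```properties
# Lowercase user names before passing them to Hadoop.
stage.conf_hadoop.always.lowercase.user=true
```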
Using a Partial Control Hub User ID
When Data Collector is registered with Control Hub, you can configure Data Collector to use an abbreviated version of the Control Hub user ID to impersonate a Hadoop user.
Control Hub user IDs use the following format: <ID>@<organization ID>
You can configure Data Collector to use only the ID, ignoring "@<organization ID>". For example, Data Collector can use myname instead of myname@org as the user name.
You might need to use a partial Control Hub user ID when the Hadoop system uses Kerberos, LDAP, or other user authentication methods with user name formats that conflict with the Control Hub format.
To enable using a partial Control Hub user ID for a registered Data Collector, uncomment the dpm.alias.name.enabled property in the Control Hub Configuration File.
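A sketch of the uncommented property follows. This page does not state the value that the property takes; true is assumed here because the property name ends in "enabled".

```properties
# Sketch only: the value true is an assumption, not stated on this page.
dpm.alias.name.enabled=true
```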
Working with HDFS Encryption Zones
Hadoop systems use the Hadoop Key Management Server (KMS) to obtain encryption keys. Data Collector requires a truststore file to verify the identity of the KMS server.
To enable access to HDFS encryption zones while using proxy users, configure KMS to allow the same user impersonation as you have configured for HDFS.
To create a truststore file, follow the same steps that you do when enabling HTTPS. See Step 2. Create a Truststore File.
hadoop.kms.proxyuser.sdc.groups
hadoop.kms.proxyuser.sdc.hosts
For example, the following properties allow users in the Ops group to access the encryption zones:
<property>
<name>hadoop.kms.proxyuser.sdc.groups</name>
<value>Ops</value>
</property>
<property>
<name>hadoop.kms.proxyuser.sdc.hosts</name>
<value>*</value>
</property>
Note that the asterisk (*) indicates no restrictions.
For more information about configuring KMS proxy users, see the KMS documentation for the Hadoop distribution that you are using. For example, for Apache Hadoop, see the Apache Hadoop documentation.
Blocklist and Allowlist for Stage Libraries (6.1 and later)
By default, almost all installed stage libraries are available for use in Data Collector. In Data Collector 6.1 and later, you can use blocklist and allowlist properties to limit the stage libraries that can be used.
To limit the stage libraries created by StreamSets, use one of the following properties:
system.stagelibs.allowlist
system.stagelibs.blocklist
To limit third-party stage libraries, use one of the following properties:
user.stagelibs.allowlist
user.stagelibs.blocklist
The MapR stage libraries are included in the blocklist by default. To use one of the MapR stage libraries, run the MapR setup script as described in MapR Prerequisites.
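As a sketch, an allowlist that permits only two StreamSets stage libraries might look like the following. The comma-separated list format is an assumption, and the two library names are taken from the alias examples elsewhere on this page purely for illustration.

```properties
# Sketch only: assumes a comma-separated list of stage library names.
# Use either the allowlist or the blocklist for system libraries, not both.
system.stagelibs.allowlist=streamsets-datacollector-basic-lib,streamsets-datacollector-jdbc-lib
```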
Blacklist and Whitelist for Stage Libraries (6.0)
By default, almost all installed stage libraries are available for use in Data Collector. In Data Collector 6.0, you can use blacklist and whitelist properties to limit the stage libraries that can be used.
To limit the stage libraries created by IBM StreamSets, use one of the following properties:
system.stagelibs.whitelist
system.stagelibs.blacklist
To limit third-party stage libraries, use one of the following properties:
user.stagelibs.whitelist
user.stagelibs.blacklist
The MapR stage libraries are blacklisted by default. To use one of the MapR stage libraries, run the MapR setup script as described in MapR Prerequisites.
Advanced Thread Pool Properties
The Data Collector configuration file includes a runner.thread.pool.size property described in Running Multiple Concurrent Pipelines.
Though the existing Data Collector configuration properties provide the configuration abilities that most users generally need, when necessary, you can add and configure advanced thread pool properties.
Advanced Thread Pool Property | Description |
---|---|
runner_stop.thread.pool.size | Thread pool size used to force stop pipelines. Default is the value set for the |
event.executor.thread.pool.size | Thread pool size used to react to pipeline events. Default is the value set for the |
manager.executor.thread.pool.size | Thread pool size used to manage background processes. Default is 4. |
bundle.executor.thread.pool.size | Thread pool size used to create support bundles. Default is 1. |
previewer.thread.pool.size | Thread pool size used for data preview. You might increase this setting when previewing multiple pipelines at the same time. Default is 4. |
- Use a text editor to edit the Data Collector configuration file, $SDC_CONF/sdc.properties.
- Add the advanced thread pool properties that you want to configure, then define values for each property.
- To enable the changes, restart Data Collector.
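The following sketch shows lines that might be added for a Data Collector where several users preview pipelines at the same time. The value 8 is an arbitrary illustration, not a recommendation.

```properties
# Advanced thread pool properties, added manually to sdc.properties.
# 8 is an illustrative value; the documented default is 4.
previewer.thread.pool.size=8
# Shown at its documented default for comparison.
manager.executor.thread.pool.size=4
```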
Configuring Data Collector
You can customize Data Collector by editing the Data Collector configuration file, $SDC_CONF/sdc.properties, in a text editor. To enable the changes, restart Data Collector.
General Property | Description |
---|---|
sdc.base.http.url | Data Collector URL that is included in emails sent for metric and data alerts. Default is Be sure to uncomment the property if you change the value. |
http.bindHost | Host name or IP address that Data Collector binds to. You might want to configure a specific host or IP address when the machine that Data Collector is installed on has multiple network cards. Default is 0.0.0.0, which means that Data Collector can bind to any host or IP address. Be sure to uncomment the property if you change the value. |
http.maxThreads | Maximum number of concurrent threads the Data Collector web server uses to serve UI requests. Default is 200. Uncomment the property to change the value, but increasing this value is not recommended. |
http.port | Port number to use for Data Collector. Default is 18630. |
https.port | Secure port number for Data Collector. For example, 18636. Any number besides -1 enables the secure port number. If you use both port properties, the HTTP port bounces to the HTTPS port. Default is -1. For more information, see Enabling HTTPS. |
http2.enable | Enables support of the HTTP/2 protocol for the API. To enable HTTP/2, set this property to true and configure the https.port property, above. Do not use with clients that do not support application layer protocol negotiation (ALPN). Default is |
http.enable.forwarded.requests | Enables handling X-Forwarded-For, X-Forwarded-Proto, X-Forwarded-Port HTTP request headers issued by a reverse proxy such as HAProxy, ELB, or NGINX. Set to Default is |
https.keystore.path | Keystore path and file name used by Data Collector. Enter an absolute path or a path relative to the $SDC_RESOURCES directory. Note: Default is keystore.jks in the $SDC_CONF directory which provides a self-signed certificate that you can use. However, it is best practice to generate a certificate signed by a trusted CA, as described in Enabling HTTPS. |
https.keystore.password | Password to the Data Collector keystore file. To protect the password, store the password in an external location and then use a function to retrieve the password. Default uses the |
https.require.hsts | Requires Data Collector to include the HTTP Strict Transport Security (HSTS) response header. Set to Default is |
http.session.max.inactive.interval | Maximum amount of time that Data Collector can remain inactive before the user is logged out. Use -1 to allow user sessions to remain inactive indefinitely. Default is 86,400 seconds (24 hours). |
http.authentication | HTTP authentication. Use none, basic, digest, or form. The HTTP authentication type determines how passwords are transferred from the browser to Data Collector over HTTP. Digest authentication encrypts the passwords. Basic and form authentication do not encrypt the passwords. When using Default is |
http.authentication.login.module | Indicates where user account information resides: Default is |
http.digest.realm | Realm used for HTTP authentication. Use basic-realm, digest-realm, or form-realm. The associated realm.properties file must be located in the $SDC_CONF directory. Default is |
http.realm.file.permission.check | Checks the permissions for the realm.properties file in use: Relevant when http.authentication.login.module is set to |
http.authentication.ldap.role.mapping | Maps groups defined by the LDAP server to Data Collector roles. Enter a semicolon-separated list as follows: Relevant when http.authentication.login.module is set to |
ldap.login.module.name | Name of the JAAS configuration properties in the $SDC_CONF/ldap-login.conf file. Default is |
http.access.control.allow.origin | List of domains allowed to access the Data Collector REST API for cross-origin resource sharing (CORS). To restrict access to specific domains, enter a comma-separated list as follows: Default is the asterisk wildcard (*) which means that any domain can access the Data Collector REST API. |
http.access.control.allow.headers | List of HTTP headers allowed during a cross-domain request. |
http.access.control.exposed.headers | List of HTTP headers exposed as part of the cross-domain response. |
http.access.control.allow.methods | List of HTTP methods that can be called during a cross-domain request. |
kerberos.client.enabled | Enables Kerberos authentication for Data Collector. Must be enabled to allow non-Kafka stages to use Kerberos to access external systems. For more information, see Kerberos Authentication. |
kerberos.client.principal | Kerberos principal to use. Enter a service principal. |
kerberos.client.keytab | Location of the Kerberos keytab file that contains the credentials for the Kerberos principal. Use a fully-qualified directory or a directory relative to the |
preview.maxBatchSize | Maximum number of records used to preview data. Default is 10. |
preview.maxBatches | Maximum number of batches used to preview data. Default is 10. |
production.maxBatchSize | Maximum number of records included in a batch when the pipeline runs. Default is 50000. |
parser.limit | Maximum parser buffer size that origins can use to process data. Limits the size of the data that can be parsed and converted to a record. By default, the parser buffer size is 1048576 bytes. To increase the size, uncomment and configure this property. For more information about how this property affects record sizes, see Maximum Record Size. |
production.maxErrorRecordsPerStage | Maximum number of error records to save in memory for each stage to display in Monitor mode. When the limit is reached, older error records are discarded. Default is 100. |
production.maxPipelineErrors | Maximum number of pipeline errors to save in memory to display in Monitor mode. When the limit is reached, older errors are discarded. Default is 100. |
max.logtail.concurrent.requests | Maximum number of external processes allowed to access the Data Collector log file at the same time through REST API calls. Default is 5. |
max.webSockets.concurrent.requests | Maximum number of WebSocket calls allowed. |
pipeline.access.control.enabled | Enables pipeline permissions and sharing pipelines. With pipeline permissions enabled, a user must have the appropriate permissions to view or work with a pipeline. Only Admin users and pipeline owners have full access to pipelines. When pipeline permissions are disabled, access to pipelines is based on the roles assigned to the user and its groups. For more information about pipeline permissions, see Permissions. Default is |
ui.header.title | Optional custom header to display in Data Collector next to the StreamSets logo. You can create a header using HTML and include an additional image. To use an image, place the file in a directory local to the following directory: For example, to add custom text, you might use the following HTML: Or to use an image in the $SDC_DIST/sdc-static-web/ directory, you can use the following HTML: We recommend using an image no more than 48 pixels high. |
ui.local.help.base.url | Base URL for the online help installed with Data Collector. Do not change this value. |
ui.hosted.help.base.url | Base URL for the online help. Do not change this value. |
ui.registration.url | URL used to register Data Collector. Do not change this value. |
ui.refresh.interval.ms | Interval in milliseconds that Data Collector waits before refreshing the UI. Default is 2000. |
ui.jvmMetrics.refresh.interval.ms | Interval in milliseconds that the Data Collector metrics are refreshed. Default is 4000. |
ui.enable.webSocket | Enables Data Collector to use WebSocket to gather pipeline information. |
ui.undo.limit | Number of recent actions stored so you can undo them. |
ui.default.configuration.view | Displays basic properties for pipelines and pipeline stages by default. Users can choose to show the advanced options when configuring a pipeline or stage. Uncomment the property and set it to |
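To illustrate how a few of the general properties fit together, the following sketch enables the secure port alongside the default HTTP port. The port numbers and keystore name are the examples and defaults given in the table; the password file name is hypothetical.

```properties
# Default HTTP port; requests bounce to the HTTPS port when both are set.
http.port=18630
# Any value other than -1 enables the secure port.
https.port=18636
# Default keystore in $SDC_CONF, which provides a self-signed certificate.
https.keystore.path=keystore.jks
# keystore-password.txt is a hypothetical file holding only the password.
https.keystore.password=${file("keystore-password.txt")}
```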
Email Property | Description |
---|---|
mail.transport.protocol | Use smtp or smtps. Default is |
mail.smtp.host | SMTP host name. Default is |
mail.smtp.port | SMTP port number. Default is 25. |
mail.smtp.auth | Whether the SMTP host uses authentication. Use true or false. Default is |
mail.smtp.starttls.enable | Whether the SMTP host uses STARTTLS encryption. Use true or false. Default is |
mail.smtps.host | SMTPS host name. Default is |
mail.smtps.port | SMTPS port number. Default is 25. |
mail.smtps.auth | Whether the SMTPS host uses authentication. Use true or false. Default is |
xmail.username | User name for the email account to send email. |
xmail.password | Password for the email account. To protect the password, store the password in an external location and then use a function to retrieve the password. Default uses the |
xmail.from.address | Email address to use to send email. |
Advanced Property | Description |
---|---|
runtime.conf.location | Location of runtime properties. Use to declare where runtime properties are defined: |
The Data Collector configuration file includes properties with a java.security. prefix, which you can use to configure Java security properties. Any Java security properties that you modify in the configuration file change the JVM configuration. Do not modify the Java security properties when running multiple Data Collector instances within the same JVM.
The Data Collector configuration file includes the following Java security property:
Java Security Property | Description |
---|---|
java.security.networkaddress.cache.ttl | Note: This property has been deprecated and may be removed in a future release. If needed, you can configure the networkaddress.cache.ttl property in the $SDC_DIST/etc/sdc-java-security.properties file to cache Domain Name Service (DNS) lookups. Number of seconds to cache Domain Name Service (DNS) lookups. Default is 0, which configures the JVM to use the DNS time to live value. For more information, see the networkaddress.cache.ttl property in the Oracle documentation. |
The Data Collector configuration file includes Security Manager properties that allow you to enable the Data Collector Security Manager for enhanced security. The Data Collector Security Manager does not allow stages to access files in Data Collector configuration, data, and resource directories.
By default, Data Collector uses the Java Security Manager that allows stages to access files in all Data Collector directories.
The Data Collector configuration file includes the following Security Manager properties:
Security Manager Property | Description |
---|---|
security_manager.sdc_manager.enable | Enables the Data Collector Security Manager for enhanced security. The Data Collector Security Manager does not allow stages to access files in protected Data Collector directories. Uncomment the property to enable. |
security_manager.sdc_dirs.exceptions | Files in protected directories that can be accessed by all stage libraries when the Data Collector Security Manager is enabled. Generally, you should not need to change this property. |
security_manager.sdc_dirs.exceptions.<stage_library_name> | Files in protected directories that can be accessed by the specified stage library when the Data Collector Security Manager is enabled. Generally, you should not need to change this property. |
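A sketch of enabling the Data Collector Security Manager follows. The table says only to uncomment the property; the value true is an assumption about how the property appears in the file.

```properties
# Sketch only: the value true is assumed; this page says only to
# uncomment the property to enable the Data Collector Security Manager.
security_manager.sdc_manager.enable=true
```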
Stage-Specific Properties | Description |
---|---|
stage.conf_hadoop.always.impersonate.current.user | Ensures that Hadoop-related stages use the currently logged in Data Collector user to perform tasks, such as writing data, in Hadoop systems. With this property enabled, Data Collector prevents configuring an alternate user in Hadoop-related stages. To use this property, uncomment the property and set it to For more information and a list of affected stages, see Hadoop Impersonation Mode. |
stage.conf_hadoop.always.lowercase.user | Converts the user name to lowercase before passing it to Hadoop. Use to lowercase user names from case-insensitive systems, such as a case-insensitive LDAP installation, before passing the user names to Hadoop systems. To use this property, uncomment the property and set it to |
stage.conf_com.streamsets.pipeline.stage.hive.impersonate.current.user | Enables the Hive Metadata processor, the Hive Metastore destination, and the Hive Query executor to impersonate the current user when connecting to Hive. Default is Set to |
stage.conf_com.streamsets.pipeline.stage.jdbc.drivers.load | Lists JDBC drivers that Data Collector automatically loads for all pipelines. To use this property, uncomment the property and set it to a comma-separated list of JDBC drivers. |
stage.conf_com.streamsets.pipeline.lib.jdbc.disableSSL | Enables Data Collector to attempt to disable SSL for all JDBC connections. Many newer JDBC systems enable SSL by default. When you have JDBC pipelines that do not use SSL, you can use this property to handle JDBC systems with SSL enabled. However, some JDBC vendors do not allow disabling SSL. To use this property, uncomment the property and set it to |
stage.conf_kafka.keytab.location | Storage location for Kerberos keytabs that are specified in Kafka stages. Keytabs are stored only for the duration of the pipeline run. Generally, you should not need to change this property. |
stage.conf_com.streamsets.pipeline.stage.origin.jdbc.cdc.oracle.addrecordstoqueue | Enables the Oracle CDC Client origin to reduce memory usage when the origin is configured to buffer data locally, in memory. This property is enabled by default. Do not disable this property unless recommended by customer support. |
stage.conf_com.streamsets.pipeline.stage.origin.jdbc.cdc.oracle.monitorbuffersize | Enables Data Collector to report memory consumption when the Oracle CDC Client origin uses local buffers. Reporting reduces pipeline performance, so enable the property only as a temporary troubleshooting measure. This property is disabled by default. |
stage.conf_com.streamsets.pipeline.stage.executor.shell.shell | Defines the relative or absolute path to the command line interpreter to use to execute scripts, such as /bin/bash. Default is Used by Shell executors. |
stage.conf_com.streamsets.pipeline.stage.executor.shell.sudo | Defines the relative or absolute path to the sudo to use when executing scripts. Default is Used by Shell executors. |
stage.conf_com.streamsets.pipeline.stage.executor.shell.impersonation_mode | Uses the Data Collector user who starts the pipeline to execute shell scripts defined in Shell executors. When not enabled, the operating system user who started Data Collector is used to execute shell scripts. To enable the secure use of shell scripts through the Shell executor, we highly recommend uncommenting this property. Requires the user who starts the pipeline to have a matching user account in the operating system. For more information about the security ramifications, see Data Collector Shell Impersonation Mode. Used by Shell executors. |
Observer Properties | Description |
---|---|
observer.queue.size | Maximum queue size for data rule evaluation requests. Each data rule generates an evaluation request for every batch that passes through the stream. When the number of requests outstrips the queue size, requests are dropped. Default is 100. |
observer.sampled.records.cache.size | Maximum number of records to be cached for display for each rule. The exact number of records is specified in the data rule. Default is 100. You can reduce this number as needed. |
observer.queue.offer.max.wait.time.ms | Maximum number of milliseconds to wait before dropping a data rule evaluation request when the observer queue is full. |
The Data Collector configuration file includes the following miscellaneous properties:
Miscellaneous Property | Description |
---|---|
max.stage.private.classloaders | Maximum number of stage libraries Data Collector allows. Default is 50. |
runner.thread.pool.size | Pre-multiplier size of the thread pool. One running pipeline requires five threads, and pipelines share threads in the pool. To calculate the approximate runner thread pool size, multiply the number of running pipelines by 2.2. Increasing this value does not increase the parallelization of an individual pipeline. Default is 50, which is sufficient to run approximately 22 standalone pipelines at the same time. For information about advanced thread pool properties, see Advanced Thread Pool Properties. |
runner.boot.pipeline.restart | Automatically restarts all running pipelines on a Data Collector restart. To disable the automatic restart of pipelines, uncomment this property. Disable only for troubleshooting or in a development environment. |
pipeline.max.runners.count | Maximum number of pipeline runners to use for a multithreaded pipeline. Default is 50. |
package.manager.repository.links | Enables specifying alternate locations for the Package Manager repositories. Use this property to install non-StreamSets stage libraries or to install stage libraries from local or alternate repositories. To use alternate Package Manager repositories, uncomment the property and specify a comma-separated list of URLs. |
bundle.upload.enabled | Enables uploading manually-generated support bundles to customer support. When disabled, you can still generate, download, and email support bundles. To disable uploads of manually-generated bundles, uncomment this property. |
bundle.upload.on_error | Enables the automatic generation and upload of support bundles to customer support when pipelines transition to an error state. Use of this property is not recommended. |
stage.alias.streamsets-datacollector-basic-lib,com_streamsets_pipeline_stage_destination_jdbc_JdbcDTarget=streamsets-datacollector-jdbc-lib,com_streamsets_pipeline_stage_destination_jdbc_JdbcDTarget
library.alias.streamsets-datacollector-apache-kafka_0_8_1_1-lib=streamsets-datacollector-apache-kafka_0_8_1-lib
Generally, you should not need to change or remove these aliases.
You can optionally add stage libraries to the following properties to limit the stage libraries Data Collector uses and include additional configuration files. The property names differ depending on the Data Collector version:
Blocklist / Allowlist Property | Description |
---|---|
system.stagelibs.allowlist system.stagelibs.blocklist | Use one list to limit the IBM StreamSets stage libraries that can be used in Data Collector. Do not use both. |
user.stagelibs.allowlist user.stagelibs.blocklist | Use one list to limit the third-party stage libraries that can be used in Data Collector. Do not use both. |
Blacklist / Whitelist Property | Description |
---|---|
system.stagelibs.whitelist system.stagelibs.blacklist |
Use one list to limit the IBM StreamSets stage libraries that can be used in Data Collector. Do not use both. |
user.stagelibs.whitelist user.stagelibs.blacklist |
Use one list to limit the third-party stage libraries that can be used in Data Collector. Do not use both. |
Classpath Validation Property | Description |
---|---|
stagelibs.classpath.validation.enable | Allows you to disable classpath validation when necessary. By default, Data Collector performs classpath validation each time it starts and writes the results to the Data Collector log. Though generally unnecessary, you can disable classpath validation by uncommenting this property and setting it to false. |
stagelibs.classpath.validation.terminate | Prevents Data Collector from starting when it discovers an invalid classpath. To enable this behavior, uncomment this property and set it to true. |
Health Inspector Property | Description |
---|---|
health_inspector.network.host | Host name that the Data Collector Health Inspector uses for the ping and traceroute commands. |
The Data Collector configuration file includes the following property that specifies additional configuration files to include in the Data Collector configuration:
Additional Files Property | Description |
---|---|
config.includes | Additional configuration files to include in the Data Collector configuration. The files must be stored in a directory relative to the $SDC_CONF directory. You can enter multiple file names separated by commas. The files are loaded into the Data Collector configuration in the listed order. If the same configuration property is defined in multiple files, the value defined in the last loaded file takes precedence. By default, the dpm.properties, vault.properties, and credential-stores.properties files are included in the Data Collector configuration. |
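The load-order rule above can be illustrated with a minimal Python sketch. It is not Data Collector's actual loader: the parser handles only simple `key=value` lines, and the file contents and the property names `example.property` and `only.first` are hypothetical, chosen purely to show that the last loaded file wins.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def merge_in_order(file_contents: list) -> dict:
    """Merge parsed files in the listed order; later files override earlier ones."""
    merged = {}
    for text in file_contents:
        merged.update(parse_properties(text))
    return merged


# Hypothetical contents of two included files, listed in load order.
first = "example.property=from-first\nonly.first=1\n"
second = "# loaded last\nexample.property=from-second\n"

merged = merge_in_order([first, second])
print(merged["example.property"])
```

Here both files define `example.property`, so the value from the file listed second is the one that ends up in the merged configuration, while `only.first` is kept unchanged.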
The Data Collector configuration file includes record sampling properties that indicate the size of the sample set chosen from a total population of records. Data Collector uses the sampling properties when you run a pipeline that writes to a destination system using the SDC Record data format and then run another pipeline that reads from that same system using the SDC Record data format. Data Collector uses record sampling to calculate the time that a record stays in the intermediate destination.
By default, Data Collector uses 1 out of 10,000 records for sampling. If you modify the sampling size, simplify the fraction for better performance. For example, configure the sampling size as 1/40 records instead of 250/10000 records. The following properties specify the sampling size:
Record Sampling Property | Description |
---|---|
sdc.record.sampling.sample.size | Size of the sample set. Default is 1. |
sdc.record.sampling.population.size | Size of the total number of records. Default is 10,000. |
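As a small aid for simplifying the sampling fraction, the following Python sketch reduces a sample size and population size to lowest terms and prints the two property lines. The helper name `simplified_sampling` is hypothetical and the validation rules are assumptions; only the two property names come from the table above.

```python
from math import gcd


def simplified_sampling(sample_size: int, population_size: int) -> dict:
    """Reduce sample/population to lowest terms and return both property values."""
    if sample_size <= 0 or population_size <= 0:
        raise ValueError("sizes must be positive integers")
    if sample_size > population_size:
        raise ValueError("sample size cannot exceed population size")
    divisor = gcd(sample_size, population_size)
    return {
        "sdc.record.sampling.sample.size": sample_size // divisor,
        "sdc.record.sampling.population.size": population_size // divisor,
    }


if __name__ == "__main__":
    # 250/10000 reduces to 1/40, matching the example in the text.
    for key, value in simplified_sampling(250, 10000).items():
        print(f"{key}={value}")
```

For the 250/10000 example, the sketch prints a sample size of 1 and a population size of 40, which are the values to enter in the configuration file.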
The Data Collector configuration file includes properties that define how Data Collector caches pipeline states. Data Collector can cache the state of pipelines for faster retrieval of those states on the Home page. If Data Collector does not cache pipeline states, it must retrieve pipeline states from the pipeline data files stored in the $SDC_DATA directory. You can configure the following properties that specify how Data Collector caches pipeline states:
Pipeline State Cache Property | Description |
---|---|
store.pipeline.state.cache.maximum.size | Maximum number of pipeline states that Data Collector caches. When the maximum number is reached, Data Collector evicts the oldest states from the cache. Default is 100. |
store.pipeline.state.cache.expire.after.access | Amount of time in minutes that a pipeline state can remain in the cache after the entry's creation, the most recent replacement of its value, or its last access. Default is 10 minutes. |
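To make the interaction of the two cache properties concrete, here is a rough Python model of a size-bounded cache with expire-after-access semantics. It is only an illustration, not Data Collector's implementation: the class `StateCache`, its method names, and the choice to treat "oldest" as "least recently touched" are all assumptions made for this sketch.

```python
import time
from collections import OrderedDict


class StateCache:
    """Toy model: bounded size, entries expire a fixed time after last touch."""

    def __init__(self, maximum_size=100, expire_after_access_minutes=10, clock=None):
        self._max = maximum_size
        self._ttl = expire_after_access_minutes * 60  # seconds
        self._clock = clock or time.monotonic
        self._entries = OrderedDict()  # name -> (state, last_touched)

    def put(self, name, state):
        # Creation or replacement resets the expiry timer.
        self._entries[name] = (state, self._clock())
        self._entries.move_to_end(name)
        # Assumption: "oldest" means least recently touched.
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        state, touched = entry
        now = self._clock()
        if now - touched >= self._ttl:
            del self._entries[name]  # expired: caller must reload from disk
            return None
        # A successful read also resets the expiry timer.
        self._entries[name] = (state, now)
        self._entries.move_to_end(name)
        return state
```

In this model, a `None` result stands for a cache miss, which corresponds to the case where the state must be read from the pipeline data files instead.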