Troubleshooting
Accessing error messages
Informational and error messages display in different locations based on the type of information:
- Flow configuration issues
- The flow canvas provides guidance and error details as follows:
- Issues found by implicit validation display in the Issues list.
- An error icon displays at the stage where the problem occurs or on the canvas for flow configuration issues.
- Error record information
- You can use the Error Records flow properties to write error records and related details to another system for review. The information in the record header attributes can help you determine the problem that occurred. For more information, see Internal attributes.
Flow basics
Use the following tips for help with flow basics:
- Why isn't the Run icon enabled?
- You can start a flow when it is valid. Use the Issues icon to review the list of issues in your flow. When you resolve the issues, the Run icon becomes enabled.
General validation errors
- The flow has the following set of validation errors for a stage:
-
CONTAINER_0901 - Could not find stage definition for <stage library name>:<stage name>.
CREATION_006 - Stage definition not found. Library <stage library name>. Stage <stage name>. Version <version>
VALIDATION_0006 - Stage definition does not exist, library <stage library name>, name <stage name>, version <version>
Sources
Use the following tips for help with source stages and systems.
Directory
- Why isn't the Directory source reading all of my files?
- The Directory source reads a set of files based on the configured file name pattern, read order, and first file to process. If new files arrive after the Directory source has passed their position in the read order, the Directory source does not read the files unless you reset the source.
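To illustrate why late-arriving files can be skipped, here is a minimal Python sketch. It assumes a read order based on ascending file names, and the function name files_still_to_read and the sample file names are hypothetical. It is not the Directory source's actual implementation.

```python
def files_still_to_read(file_names, last_processed):
    """Return the files a reader would still pick up, in read order.

    Assumption: read order is ascending lexicographic order of file names,
    and the reader never goes back before its current position.
    """
    return sorted(name for name in file_names if name > last_processed)


# The reader has already finished log-003.txt when log-002.txt arrives late.
directory = ["log-001.txt", "log-002.txt", "log-003.txt", "log-004.txt"]
remaining = files_still_to_read(directory, last_processed="log-003.txt")
print(remaining)
```

Under these assumptions, log-002.txt is never returned because it sorts before the current position, which mirrors the situation where the source must be reset to read it.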
Elasticsearch
- A flow with an Elasticsearch source fails to start with an SSL/TLS error, such as the following:
-
ELASTICSEARCH_43 - Could not connect to the server(s) <SSL/TLS error details>
JDBC sources
- My MySQL JDBC Driver 5.0 fails to validate the query in my JDBC Query Consumer source.
- This can occur when you use a LIMIT clause in your query.
- I'm using a JDBC source to read MySQL data. Why are datetime values set to zero being treated like error records?
- MySQL treats invalid dates as an exception, so both the JDBC Query Consumer and the JDBC Multitable Consumer create error records for invalid dates.
- A flow using the JDBC Query Consumer source keeps stopping with the following error:
-
JDBC_77 <db error message> attempting to execute query '<query>'. Giving up after <error count> errors as per stage configuration. First error: <first db error>.
- My flow using a JDBC source generates an out-of-memory error when reading a large table.
- When the Auto Commit property is enabled in a JDBC source, some drivers ignore the fetch-size restriction, configured by the Max Batch Size property in the source. This can lead to an out-of-memory error when reading a large table that cannot entirely fit in memory.
Oracle CDC Client
- Data preview continually times out for my Oracle CDC Client flow.
- Flows that use the Oracle CDC Client can take longer than expected to initiate for data preview. If preview times out, try increasing the Preview Timeout property incrementally.
For more information about using preview with this source, see Data preview with Oracle CDC Client.
- My Oracle CDC Client flow has paused processing during a daylight saving time change.
- If the source is configured to use a database time zone that uses daylight saving time, then the flow pauses processing during the time change window to ensure that all data is correctly processed. After the time change completes, the flow resumes processing at the last-saved offset.
PostgreSQL CDC Client
- A PostgreSQL CDC Client flow generates the following error:
-
com.streamsets.pipeline.api.StageException: JDBC_606 - Wal Sender is not active
Salesforce
- A flow generates a buffering capacity error
- When flows with a Salesforce source fail due to a buffering capacity error, such as Buffering capacity 1048576 exceeded, increase the buffer size by editing the Streaming Buffer Size property on the Subscribe tab.
Scripting sources
- A flow fails to stop when manually stopped
- Scripts must include code that stops the script when users stop the flow. In the script, use the sdc.isStopped method to check whether the flow has been stopped.
- A Jython script does not proceed beyond import lock
- Flows freeze if Jython scripts do not release the import lock upon a failure or error. When a script does not release an import lock, you must restart Data Collector to release the lock. To avoid the problem, use a try statement with a finally block in the Jython script. For more information, see Thread safety in Jython scripts.
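The two scripting tips above can be sketched together as follows. This is an illustrative outline only: FakeSdc is a hypothetical stand-in for the sdc object that the scripting stage provides at run time, and the loop body is a placeholder. The sketch does not model the import lock itself; it only shows the stop check and the try/finally shape.

```python
class FakeSdc(object):
    """Hypothetical stand-in for the sdc object supplied by the scripting stage."""

    def __init__(self, stop_after):
        self._checks_left = stop_after

    def isStopped(self):
        # Report "stopped" once the configured number of checks has passed.
        self._checks_left -= 1
        return self._checks_left < 0


def run_script(sdc):
    """Loop until the flow is stopped, and always run the cleanup step."""
    batches = 0
    cleaned_up = False
    try:
        while not sdc.isStopped():
            batches += 1  # Placeholder for the real work of producing a batch.
    finally:
        # Runs on normal exit and on errors, so cleanup is never skipped.
        cleaned_up = True
    return batches, cleaned_up


result = run_script(FakeSdc(stop_after=3))
print(result)
```

In a real script, the finally block is where any resource or lock acquired in the try block would be released.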
SQL Server CDC Client
- Previewing data does not show any values
- When you set the Maximum Transaction Length property, the source fetches data in multiple time windows. The property determines the size of each time window. Previewing data only shows data from the first time window, but the source might need to process multiple time windows before finding changed values to show in the preview.
To see values when previewing data, increase Maximum Transaction Length or set it to -1 to fetch data in one time window.
- A no-more-data event is generated before reading all changes
- When you set the Maximum Transaction Length property, the source fetches data in multiple time windows. The property determines the size of each time window. After processing all available rows in each time window, the source generates a no-more-data event, even when subsequent time windows remain for processing.
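As a rough illustration of how a window size splits a time range, consider the following Python sketch. It assumes the property value is expressed in seconds, and the function name time_windows is hypothetical. The source's real windowing logic may differ.

```python
from datetime import datetime, timedelta


def time_windows(start, end, max_transaction_length):
    """Split [start, end) into consecutive time windows.

    Assumption: max_transaction_length is in seconds; -1 means one window.
    """
    if max_transaction_length == -1:
        return [(start, end)]
    if max_transaction_length <= 0:
        raise ValueError("window size must be positive or -1")
    step = timedelta(seconds=max_transaction_length)
    windows = []
    cursor = start
    while cursor < end:
        window_end = min(cursor + step, end)
        windows.append((cursor, window_end))
        cursor = window_end
    return windows


start = datetime(2024, 1, 1, 0, 0, 0)
end = datetime(2024, 1, 1, 1, 0, 0)
windows = time_windows(start, end, 900)  # 15-minute windows over one hour
single = time_windows(start, end, -1)    # one window covering the whole range
print(len(windows), len(single))
```

With several windows, a preview limited to the first window sees only changes that fall inside windows[0], and a no-more-data event after each window does not mean the whole range has been read.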
Processors
Use the following tip for help with processors.
Encrypt and Decrypt Fields
- The following error message displays in the log after I start the flow:
-
CONTAINER_0701 - Stage 'EncryptandDecryptFields_01' initialization error: java.lang.IllegalArgumentException: Input byte array has incorrect ending byte at 44
Targets
Use the following tips for help with target stages and systems.
Azure Data Lake Storage
- An Azure Data Lake Storage target seems to be causing out of memory errors, with the following object using all available memory:
-
com.streamsets.pipeline.stage.destination.hdfs.writer.ActiveRecordWriters
Cassandra
- Why is the flow failing entire batches when only a few records have a problem?
- Due to Cassandra requirements, when you write to a Cassandra cluster, batches are atomic. This means that an error in one or more records causes the entire batch to fail.
- Why is all of my data being sent to error? Every batch is failing.
- When every batch fails, you might have a data type mismatch. Cassandra requires the data type of the data to exactly match the data type of the Cassandra column.
Elasticsearch target
- A flow with an Elasticsearch target fails to start with an SSL/TLS error, such as the following:
-
ELASTICSEARCH_43 - Could not connect to the server(s) <SSL/TLS error details>
Kafka Producer
- Can the Kafka Producer create topics?
- The Kafka Producer can create a topic when all of the following are true:
- You configure the Kafka Producer to write to a topic name that does not exist.
- At least one of the Kafka brokers defined for the Kafka Producer has the auto.create.topics.enable property enabled.
- The broker with the enabled property is up and available when the Kafka Producer looks for the topic.
- A flow that writes to Kafka keeps failing and restarting in an endless cycle.
- This can happen when the flow tries to write a message to Kafka 0.8 that is longer than the Kafka maximum message size.
JDBC connections
Use the following tips for help with stages that use JDBC connections to connect to databases. For some stages, Data Collector includes the necessary JDBC driver to connect to the database. For other stages, you must install a JDBC driver.
- JDBC Multitable Consumer source
- JDBC Query Consumer source
- MySQL Binary Log source
- Oracle Bulkload source
- Oracle CDC source
- Oracle CDC Client source
- Oracle Multitable Consumer source
- Oracle target
- SAP HANA Query Consumer source
- JDBC Lookup processor
- JDBC Tee processor
- SQL Parser processor, when using the database to resolve the schema
- JDBC Producer target
- JDBC Query executor
No suitable driver
JDBC_00 - Cannot connect to specified database: com.streamsets.pipeline.api.StageException:
JDBC_06 - Failed to initialize connection pool: java.sql.SQLException: No suitable driver
Verify that you have followed the instructions to install additional drivers, as explained in Install external libraries.
You can also use these additional tips to help resolve the issue:
- The JDBC connection string is not correct.
- The JDBC Connection String property for the stage must include the jdbc: prefix. For example, a PostgreSQL connection string might be jdbc:postgresql://<database host>/<database name>.
- The external resource archive file containing the JDBC driver is not set up correctly.
- When you include the JDBC driver in an external resource archive file, the archive file must use the required folder names and directory structure. For details about the required archive file structure, see Archive structure.
- JDBC drivers do not load or register correctly.
- Sometimes JDBC drivers that a flow requires do not load or register correctly. For example, a JDBC driver might not correctly support JDBC 4.0 auto-loading, resulting in a No suitable driver error message.
Two approaches can resolve this issue:
- Add the class name for the driver in the JDBC Class Driver Name property on the Legacy Drivers tab for the stage.
- Configure Data Collector to automatically load specific drivers. In the Data Collector configuration properties, uncomment the stage.conf_com.streamsets.pipeline.stage.jdbc.drivers.load property and set it to a comma-separated list of the JDBC drivers required by stages in your flows.
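The jdbc: prefix requirement from the connection string tip above can be checked before you configure a stage. The following is a minimal sketch; the function name has_jdbc_prefix and the host and database names are hypothetical, and the check covers only the prefix, not the rest of the URL.

```python
def has_jdbc_prefix(connection_string):
    """Return True when the string starts with the required jdbc: prefix."""
    return connection_string.strip().startswith("jdbc:")


# Hypothetical host and database names, for illustration only.
good = has_jdbc_prefix("jdbc:postgresql://dbhost/salesdb")
bad = has_jdbc_prefix("postgresql://dbhost/salesdb")
print(good, bad)
```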
Cannot connect to database
When Data Collector cannot connect to the database, an error message like the following displays. The exact message can vary depending on the driver:
JDBC_00 - Cannot connect to specified database: com.zaxxer.hikari.pool.PoolInitializationException:
Exception during pool initialization: The TCP/IP connection to the host 1.2.3.4, port 1234 has failed
$ ping 1.2.3.4
PING 1.2.3.4 (1.2.3.4): 56 data bytes
64 bytes from 1.2.3.4: icmp_seq=0 ttl=57 time=12.063 ms
64 bytes from 1.2.3.4: icmp_seq=1 ttl=57 time=11.356 ms
64 bytes from 1.2.3.4: icmp_seq=2 ttl=57 time=11.626 ms
^C
--- 1.2.3.4 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 11.356/11.682/12.063/0.291 ms
$ nc -v -z -w2 1.2.3.4 1234
nc: connectx to 1.2.3.4 port 1234 (tcp) failed: Connection refused
If the host or port is not accessible, check the routing and firewall configuration.
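If nc is not available on the machine, a similar TCP reachability check can be sketched in Python with the standard socket module. The function name port_is_open is hypothetical, and the demonstration uses a temporary local listener so that it runs without access to a real database host.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Try a TCP connection, similar in spirit to `nc -z -w2 host port`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Demonstration against a temporary local listener on an ephemeral port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
open_port = listener.getsockname()[1]

reachable = port_is_open("127.0.0.1", open_port)    # listener is up
listener.close()
unreachable = port_is_open("127.0.0.1", open_port)  # listener is gone
print(reachable, unreachable)
```

For the example host and port in the error message above, the equivalent call would be port_is_open("1.2.3.4", 1234).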
MySQL JDBC driver and time values
Due to a MySQL JDBC driver issue, the driver cannot return time values to the millisecond. Instead, the driver returns the values to the second.
For example, if a column has a value of 20:12:50.581, the driver reads the value as 20:12:50.000.
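The effect on the example value can be reproduced with Python's datetime module. This only illustrates the loss of precision described above; it does not involve the MySQL driver.

```python
from datetime import time

stored = time(20, 12, 50, 581000)        # 20:12:50.581 as stored in the column
as_read = stored.replace(microsecond=0)  # fractional seconds dropped, as described

print(stored.isoformat(timespec="milliseconds"))
print(as_read.isoformat(timespec="milliseconds"))
```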
Performance
- How can I decrease the delay between reads from the source system?
- A long delay can occur between reads from the source system when a flow reads records faster than it can process them or write them to the target system. Because a flow processes one batch at a time, the flow must wait until a batch is committed to the target system before reading the next batch, preventing the flow from reading at a steady rate. Reading data at a steady rate provides better performance than reading sporadically.
- When I try to start one or more flows, I receive an error that not enough threads are available
- By default, Data Collector can run approximately 22 standalone flows at the same time. If you run a larger number of standalone flows at the same time, you might receive the following error:
CONTAINER_0166 - Cannot start flow '<flow name>' as there are not enough threads available
- How can I improve the general flow performance?
- You might improve performance by adjusting the batch size used by the flow. The batch size determines how much data passes through the flow at one time. By default, the batch size is 1000 records.
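To get a feel for how the batch size changes the number of batches a flow handles, the following sketch does the arithmetic. The function name batches_needed and the record counts are hypothetical; only the default of 1000 records comes from the text above. Fewer batches does not by itself guarantee better performance.

```python
import math

DEFAULT_BATCH_SIZE = 1000  # default batch size stated in the text above


def batches_needed(record_count, batch_size=DEFAULT_BATCH_SIZE):
    """Return how many batches are needed to pass record_count records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(record_count / batch_size)


with_default = batches_needed(25000)       # default batch size
with_larger = batches_needed(25000, 5000)  # larger, hypothetical batch size
print(with_default, with_larger)
```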