OpenLineage integration failure
This article explains how to resolve issues that might arise when you integrate OpenLineage with IBM Automatic Data Lineage. Before you start, make sure that you have completed the preliminary steps and entered all the prerequisite information described in OpenLineage integration requirements.
Problem
OpenLineage is unable to transmit OpenLineage events to the configured server. You may see an error message similar to:
io.openlineage.client.OpenLineageClientException: java.net.SocketTimeoutException: Read timed out
Solution
- Check that the OpenLineage server can communicate with the Automatic Data Lineage server by using the ping command: ping <server_name>.com
  Note: Make sure to replace <server_name> with the name of your server.
- If the two machines can communicate with each other, run a curl command from the OpenLineage server to post a sample event. You can use this example curl command:
curl -X POST http://{server_name}:5000/listeners/openlineage/ap/v1/lineage -i -H 'Content-Type: application/json' -d '{ "eventType": "COMPLETE", "eventTime": "2024-09-27T09:48:11.309Z", "run": { "runId": "019232e1-719c-7e9e-8b64-fb9522345678", "facets": { "nominalTime": null, "parent": { "_producer": "https://github.com/OpenLineage", "_schemaURL": "https://openlineage.io/spec/fa", "run": { "runId": "019232e1-719c-7e9e-8b64-fb9522345678" }, "job": { "namespace": "concurrent-testing", "name": "combined_experiment3" } }, "spark_properties": { "_producer": "https://github.com/OpenLineage", "_schemaURL": "https://openlineage.io/spec/2-", "properties": { "spark.master": "spark://spark-master-headless-", "spark.app.name": "combined-experiment3" } }, "processing_engine": { "name": "spark", "version": "3.4.2", "_producer": "https://github.com/OpenLineage", "_schemaURL": "https://openlineage.io/spec/fa", "openlineageAdapterVersion": "1.19.0" }, "environment-properties": { "_producer": "https://github.com/OpenLineage", "_schemaURL": "https://openlineage.io/spec/2-", "environment-properties": {} } } }, "job": { "namespace": "concurrent-testing", "name": "perpetual_process_test_payload", "facets": { "documentation": null, "sourceCodeLocation": null, "sql": null, "jobType": { "_producer": "https://github.com/OpenLineage", "_schemaURL": "https://openlineage.io/spec/fa", "processingType": "BATCH", "integration": "SPARK", "jobType": "SQL_JOB" } } }, "inputs": [], "outputs": [ { "namespace": "s3://wxd-demo-bucket", "name": "playback-db-warehouse/yellow_t", "facets": { "documentation": null, "schema": { "_producer": "https://github.com/OpenLineage", "_schemaURL": "https://openlineage.io/spec/fa", "fields": [ { "name": "VendorID", "type": "long", "description": null }, { "name": "passenger_count", "type": "double", "description": null } ] }, "dataSource": {}, "description": null, "lifecycleStateChange": { "_producer": "https://github.com/OpenLineage", "_schemaURL": 
"https://openlineage.io/spec/fa", "lifecycleStateChange": "ALTER" }, "columnLineage": null, "symlinks": { "_producer": "https://github.com/OpenLineage", "_schemaURL": "https://openlineage.io/spec/fa", "identifiers": [ { "namespace": "hive://source_database", "name": "playbackdb.yellow_taxi_2022", "type": "TABLE" } ] }, "version": { "_producer": "https://github.com/OpenLineage", "_schemaURL": "https://openlineage.io/spec/fa", "datasetVersion": "2294925621186125182" } }, "inputFacets": null, "outputFacets": {} } ], "producer": "https://github.com/OpenLineage", "schemaURL": "https://openlineage.io/spec/2-" }'
  a. If the curl command fails, rerun it with the -v option (verbose mode) to get more information about the potential causes of the issue, or run the curl command from the Automatic Data Lineage server to validate that it can POST to the OpenLineage perpetual process successfully.
  b. If the curl command is successful from the local machine but not from the OpenLineage server, the issue is most likely due to the network. Check whether there is a proxy between the Automatic Data Lineage server and the OpenLineage server. If there is a proxy, validate that the endpoint being used points to the perpetual process port running on Automatic Data Lineage (the default port number is 5000). Alternatively, validate that there are no firewall rules preventing the two machines from communicating.
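The checks in this section can be sketched as two small shell helpers. This is a minimal sketch, not part of the product: the function names check_port and post_event_verbose are hypothetical, the timeout values are arbitrary, and the port check assumes that bash (with /dev/tcp support) and the timeout utility are available on the OpenLineage server. A successful ping does not prove that the perpetual process port is open, so the first helper tests the TCP port directly; the second repeats the POST in verbose mode with explicit timeouts.

```shell
# Hypothetical helper: succeeds (exit 0) if a TCP connection to host:port
# can be opened within 5 seconds. Relies on bash's /dev/tcp feature.
#   $1 = server name, $2 = port (defaults to 5000, the default port named above)
check_port() {
  host="$1"
  port="${2:-5000}"
  timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Hypothetical helper: repeats the POST in verbose mode with explicit timeouts.
#   $1 = full listener URL (the same URL as in the sample curl command)
#   $2 = path to a file that contains the sample event JSON
post_event_verbose() {
  url="$1"
  payload="$2"
  if [ ! -r "$payload" ]; then
    echo "payload file not readable: $payload" >&2
    return 2
  fi
  curl -v --connect-timeout 5 --max-time 30 \
    -X POST "$url" \
    -H 'Content-Type: application/json' \
    -d @"$payload"
}
```

For example, run check_port with your server name and 5000 first. A connection that is refused immediately often means that the host is reachable but nothing is listening on that port, while a connection that times out is more typical of a firewall or proxy dropping the traffic. If the port is reachable, run post_event_verbose with the listener URL and a file that contains the sample event to inspect the verbose output.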
Alternative solution
Testing with manually provided files is another option to consider when you troubleshoot the OpenLineage scanner. OpenLineage provides a configuration option to write the generated events to the local file system. You can find the configuration documentation at https://openlineage.io/docs/client/java/configuration.
- Configure OpenLineage to output the generated events to a local directory.
- Copy those files to the input directory for OpenLineage on Automatic Data Lineage. You can find the documentation for the input directory in OpenLineage Manual Inputs.
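The two steps above might look like the following sketch. It assumes the file transport described in the OpenLineage Java client configuration documentation (transport type file with a location setting); verify the exact keys and how the client locates its configuration file against the documentation for your client version. The paths in EVENTS_FILE and ADL_INPUT_DIR and the name CONFIG_FILE are hypothetical placeholders, not product defaults.

```shell
# Assumed locations; adjust all three to your environment.
CONFIG_FILE="${CONFIG_FILE:-./openlineage.yml}"               # client configuration file
EVENTS_FILE="${EVENTS_FILE:-/tmp/openlineage/events.json}"    # where OpenLineage writes events
ADL_INPUT_DIR="${ADL_INPUT_DIR:-/tmp/adl-openlineage-input}"  # hypothetical input directory

mkdir -p "$(dirname "$EVENTS_FILE")" "$ADL_INPUT_DIR"

# Step 1: write a client configuration that selects the file transport.
cat > "$CONFIG_FILE" <<EOF
transport:
  type: file
  location: ${EVENTS_FILE}
EOF

# Step 2: after the job has run, copy the event file to the input directory.
if [ -s "$EVENTS_FILE" ]; then
  cp "$EVENTS_FILE" "$ADL_INPUT_DIR/"
else
  echo "No events found at $EVENTS_FILE yet" >&2
fi
```

If OpenLineage and Automatic Data Lineage run on different machines, replace the cp command with whatever file transfer method you normally use between them, such as scp.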