OpenLineage integration failure

This article explains how to resolve issues that can occur when you integrate OpenLineage with IBM Automatic Data Lineage. Before you begin, make sure that you have completed the preliminary steps and entered all the prerequisite information described in OpenLineage integration requirements.

Problem

OpenLineage cannot send events to the configured Automatic Data Lineage server. The error message is similar to:

io.openlineage.client.OpenLineageClientException: java.net.SocketTimeoutException: Read timed out

Solution

  1. Check that the OpenLineage server can communicate with the Automatic Data Lineage server by using the ping command: ping {server_name}.

    Note: Replace {server_name} with the host name of your Automatic Data Lineage server.
  2. If the two machines can communicate with each other, run a curl command from the OpenLineage server to post a sample event. You can use this example curl command:

    curl -X POST http://{server_name}:5000/listeners/openlineage/ap/v1/lineage  -i -H 'Content-Type: application/json'   -d  '{
     "eventType": "COMPLETE",
     "eventTime": "2024-09-27T09:48:11.309Z",
     "run": {
       "runId": "019232e1-719c-7e9e-8b64-fb9522345678",
       "facets": {
         "nominalTime": null,
         "parent": {
           "_producer": "https://github.com/OpenLineage",
           "_schemaURL": "https://openlineage.io/spec/fa",
           "run": {
             "runId": "019232e1-719c-7e9e-8b64-fb9522345678"
           },
           "job": {
             "namespace": "concurrent-testing",
             "name": "combined_experiment3"
           }
         },
         "spark_properties": {
           "_producer": "https://github.com/OpenLineage",
           "_schemaURL": "https://openlineage.io/spec/2-",
           "properties": {
             "spark.master": "spark://spark-master-headless-",
             "spark.app.name": "combined-experiment3"
           }
         },
         "processing_engine": {
           "name": "spark",
           "version": "3.4.2",
           "_producer": "https://github.com/OpenLineage",
           "_schemaURL": "https://openlineage.io/spec/fa",
           "openlineageAdapterVersion": "1.19.0"
         },
         "environment-properties": {
           "_producer": "https://github.com/OpenLineage",
           "_schemaURL": "https://openlineage.io/spec/2-",
           "environment-properties": {}
         }
       }
     },
     "job": {
       "namespace": "concurrent-testing",
       "name": "perpetual_process_test_payload",
       "facets": {
         "documentation": null,
         "sourceCodeLocation": null,
         "sql": null,
         "jobType": {
           "_producer": "https://github.com/OpenLineage",
           "_schemaURL": "https://openlineage.io/spec/fa",
           "processingType": "BATCH",
           "integration": "SPARK",
           "jobType": "SQL_JOB"
         }
       }
     },
     "inputs": [],
     "outputs": [
       {
         "namespace": "s3://wxd-demo-bucket",
         "name": "playback-db-warehouse/yellow_t",
         "facets": {
           "documentation": null,
           "schema": {
             "_producer": "https://github.com/OpenLineage",
             "_schemaURL": "https://openlineage.io/spec/fa",
             "fields": [
               {
                 "name": "VendorID",
                 "type": "long",
                 "description": null
               },
               {
                 "name": "passenger_count",
                 "type": "double",
                 "description": null
               }
             ]
           },
           "dataSource": {},
           "description": null,
           "lifecycleStateChange": {
             "_producer": "https://github.com/OpenLineage",
             "_schemaURL": "https://openlineage.io/spec/fa",
             "lifecycleStateChange": "ALTER"
           },
           "columnLineage": null,
           "symlinks": {
             "_producer": "https://github.com/OpenLineage",
             "_schemaURL": "https://openlineage.io/spec/fa",
             "identifiers": [
               {
                 "namespace": "hive://source_database",
                 "name": "playbackdb.yellow_taxi_2022",
                 "type": "TABLE"
               }
             ]
           },
           "version": {
             "_producer": "https://github.com/OpenLineage",
             "_schemaURL": "https://openlineage.io/spec/fa",
             "datasetVersion": "2294925621186125182"
           }
         },
         "inputFacets": null,
         "outputFacets": {}
       }
     ],
     "producer": "https://github.com/OpenLineage",
     "schemaURL": "https://openlineage.io/spec/2-"
    }'
    
  3. a. If the curl command fails, rerun it with the -v option (verbose mode) to get more information about the potential causes of the issue. Alternatively, run the curl command from the Automatic Data Lineage server itself to validate that it can POST to the OpenLineage perpetual process.

    b. If the curl command succeeds when run locally on the Automatic Data Lineage server but fails from the OpenLineage server, the issue is most likely in the network. Check whether there is a proxy between the Automatic Data Lineage server and the OpenLineage server. If there is a proxy, validate that the endpoint in use points to the perpetual process port on Automatic Data Lineage (the default port number is 5000). Alternatively, validate that no firewall rules prevent the two machines from communicating.
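
The checks in step 3 can be scripted. The following shell functions are a minimal sketch and not part of the product: the function names explain_curl_exit and post_test_event, the event file argument, and the timeout values are illustrative assumptions. The first function maps a few common curl exit codes to the likely causes described above. The second posts a sample event stored in a file, with explicit timeouts, and prints the HTTP status code.

```shell
#!/bin/sh
# Sketch only: function names, the event file, and timeouts are illustrative.

# Translate a curl exit status into a likely cause.
# Codes 6, 7, and 28 are listed in the EXIT CODES section of the curl manual.
explain_curl_exit() {
  case "$1" in
    0)  echo "request sent; check the HTTP status code" ;;
    6)  echo "host name could not be resolved; check DNS or the server name" ;;
    7)  echo "connection failed; check the port, proxy, and firewall rules" ;;
    28) echo "timed out; check for a proxy or firewall dropping traffic" ;;
    *)  echo "curl exit code $1; rerun with -v for details" ;;
  esac
}

# POST a JSON event file to the listener and print the HTTP status code.
# $1 = full listener URL (as used in step 2)
# $2 = path to a file that contains the sample event
# The function returns curl's own exit status.
post_test_event() {
  curl -sS -o /dev/null -w '%{http_code}\n' \
       --connect-timeout 5 --max-time 30 \
       -X POST -H 'Content-Type: application/json' \
       --data-binary @"$2" "$1"
}
```

To use the sketch, save the sample event from step 2 to a file, call post_test_event with the listener URL and the file path, and then pass the exit status ($?) to explain_curl_exit. A timeout (exit code 28) from the OpenLineage server, combined with a successful request on the Automatic Data Lineage server, suggests a proxy or firewall between the two machines.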

Alternative solution

Testing with manually provided files is another way to troubleshoot the OpenLineage scanner. OpenLineage provides a configuration option that writes the generated events to the local file system. The configuration is documented at https://openlineage.io/docs/client/java/configuration.

  1. Configure OpenLineage to output the generated events to a local directory.

  2. Copy those files to the input directory for OpenLineage on Automatic Data Lineage. You can find the documentation for the input directory in OpenLineage Manual Inputs.
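
The copy in step 2 can be sketched as a small shell function. This is an illustration under stated assumptions: the function name copy_events is hypothetical, both directory paths are placeholders that you supply, and both directories are assumed to be reachable from the machine where the function runs (otherwise, transfer the files first with a tool such as scp). The function refuses to run if either directory is missing, so that a mistyped input directory is not silently created.

```shell
#!/bin/sh
# Sketch only: copy_events and its arguments are illustrative.

# Copy event files written by OpenLineage into the input directory.
# $1 = directory where OpenLineage wrote the event files
# $2 = OpenLineage input directory on Automatic Data Lineage
copy_events() {
  src="$1"
  dest="$2"
  [ -d "$src" ]  || { echo "source directory not found: $src" >&2; return 1; }
  [ -d "$dest" ] || { echo "input directory not found: $dest" >&2; return 1; }

  count=0
  for f in "$src"/*; do
    [ -f "$f" ] || continue          # skip subdirectories and empty globs
    cp -p "$f" "$dest"/ || return 1  # -p preserves file timestamps
    count=$((count + 1))
  done
  echo "copied $count file(s)"
}
```

Take the actual location of the input directory from OpenLineage Manual Inputs rather than guessing it.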