OpenLineage mappings tutorial

This tutorial uses an example scenario to show how mappings can address the problem of incomplete lineage.

The goal of this tutorial is to create OpenLineage mappings so that data from OpenLineage events is visualized correctly on the lineage. Before you complete the steps, the lineage is incomplete and inaccurate. After you complete them, it shows a correct end-to-end flow of data.

In this tutorial, the following OpenLineage event is used:

{
  "eventTime": "2025-08-21T10:03:33.616Z",
  "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
  "eventType": "COMPLETE",
  "job": {
    "namespace": "custom_etl_tool",
    "name": "workspace_name/folder_1/folder_2/job_name_1"
  },
  "inputs": [
    {
      "namespace": "s3://mybigbucket.com",
      "name": "sales/public/orders",
      "facets": {
        "schema": {
          "fields": [
            {
              "name": "my_one_field",
              "type": "integer"
            }
          ]
        }
      }
    }
  ],
  "outputs": [
    {
      "namespace": "mongodb://analytics-db.company.com:27017",
      "name": "customerdb.mycollection.sales_summary",
      "facets": {
        "schema": {
          "fields": [
            {
              "name": "my_one_field",
              "type": "integer"
            }
          ]
        }
      }
    }
  ]
}

This event contains three sections:

Job
This section represents the process that connects the two datasets. It has the following elements:

  • Namespace: custom_etl_tool
  • Name: workspace_name/folder_1/folder_2/job_name_1

This job uses a custom ETL tool. The event does not indicate which technology the job runs on (for example, Python or Java), so a mapping with a custom technology is required.

Input
This section represents the datasets that are read. It has the following elements:

  • Namespace: s3://mybigbucket.com
  • Name: sales/public/orders

This dataset uses a known and supported technology, Amazon S3.

Output
This section represents the datasets that are written. It has the following elements:

  • Namespace: mongodb://analytics-db.company.com:27017
  • Name: customerdb.mycollection.sales_summary

This dataset uses a known, but unsupported technology, MongoDB, so a mapping is required.
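
Before you create mappings, it can help to list every namespace that appears in an event. The following Python sketch is only an illustration and is not part of the product. It uses a trimmed copy of the example event (schema facets omitted), and the helper name list_namespaces is hypothetical.

```python
import json

# Trimmed copy of the example event from this tutorial (schema facets omitted).
EVENT_JSON = """
{
  "eventType": "COMPLETE",
  "job": {
    "namespace": "custom_etl_tool",
    "name": "workspace_name/folder_1/folder_2/job_name_1"
  },
  "inputs": [
    {"namespace": "s3://mybigbucket.com", "name": "sales/public/orders"}
  ],
  "outputs": [
    {"namespace": "mongodb://analytics-db.company.com:27017",
     "name": "customerdb.mycollection.sales_summary"}
  ]
}
"""


def list_namespaces(event: dict) -> list[tuple[str, str, str]]:
    """Return (section, namespace, name) for the job and for every dataset."""
    rows = [("job", event["job"]["namespace"], event["job"]["name"])]
    for section in ("inputs", "outputs"):
        # A section may be absent from an event, so fall back to an empty list.
        for dataset in event.get(section, []):
            rows.append((section, dataset["namespace"], dataset["name"]))
    return rows


rows = list_namespaces(json.loads(EVENT_JSON))
for section, namespace, name in rows:
    print(f"{section:8} {namespace:45} {name}")
```

Each printed row corresponds to one namespace that either has a default mapping or needs a mapping of its own.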

Prerequisites

Before you begin, make sure that the following prerequisites are met:

  • Data lineage must be configured.
  • You must have the Admin or Editor role in a project.
  • You must have the Manage data lineage permission to create technologies and mappings.
  • The following data source definitions must be created:
    • A data source definition of the OpenLineage type, with the host customETLtoolHost.
    • A data source definition of the OpenLineage type, with the host analytics-db.company.com and port 27017.
    • A data source definition of the Amazon S3 type, with the host s3://mybigbucket.com.
  • Your OpenLineage event must be prepared: save it in the JSON format and compress it into a .zip file.
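
The last prerequisite, packaging the event, can be done with any archiving tool. As one possible approach, the following Python sketch writes an event to a JSON file and compresses it into a .zip file with the standard zipfile module. The file name openlineage_event, the helper package_event, and the choice to place the JSON file at the root of the archive are assumptions made for illustration. The event in the sketch is a shortened placeholder, not the full example event.

```python
import json
import tempfile
import zipfile
from pathlib import Path


def package_event(event: dict, target_dir: Path, stem: str = "openlineage_event") -> Path:
    """Write the event as <stem>.json and compress it into <stem>.zip."""
    json_path = target_dir / f"{stem}.json"
    json_path.write_text(json.dumps(event, indent=2), encoding="utf-8")

    zip_path = target_dir / f"{stem}.zip"
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Store the file at the archive root, without the directory path.
        archive.write(json_path, arcname=json_path.name)
    return zip_path


# Shortened placeholder: replace it with the full event shown in this tutorial.
event = {
    "eventType": "COMPLETE",
    "job": {
        "namespace": "custom_etl_tool",
        "name": "workspace_name/folder_1/folder_2/job_name_1",
    },
}

work_dir = Path(tempfile.mkdtemp())
zip_path = package_event(event, work_dir)
print(zip_path)
```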

You work with mappings in Data > Data lineage > Map lineage > Map OpenLineage.

1. Optional: Review the incomplete lineage

To better understand how mappings improve the quality of lineage, import the OpenLineage event before the mappings are created.

  1. In the project, go to the Assets tab, and click New asset > Import metadata for data assets.
  2. Provide a name for your import, for example OpenLineage example event.
  3. Select the Import lineage metadata goal.
  4. In the data source section, select the OpenLineage data source definition with the customETLtoolHost host.
  5. In the Add inputs from file section, click Add. Upload your .zip file with the OpenLineage event.
  6. Optionally, define other metadata import options.
  7. Save your changes.

As a result, the data lineage is not accurate: namespaces are shown as raw strings, the asset structure is not correct, and placeholders of unknown types are displayed.

2. Review a mapping for the input dataset

Review the following section of the event:

{
  "namespace": "s3://mybigbucket.com",
  "name": "sales/public/orders"
  .....
}

The namespace contains the prefix s3://, which corresponds to Amazon S3. The name field suggests that the data structure is folder/folder/file. Because Amazon S3 is a known technology, a default mapping is already created for it. On the Active mappings tab, search for the s3:// mapping, select it from the list, and review the mapping configuration.

This mapping has the following configuration:

  • Mapping conditions:
    • Rule scope: Datasets.
    • Matching method: Namespace prefix.
    • Namespace prefix: s3://
  • Mapping actions:
    • Technology type: Amazon S3 (default type, already defined).
    • Data source definition: Assigned automatically
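
The Namespace prefix matching method can be pictured as a simple string test. The following Python sketch is a conceptual illustration only, not the product's implementation. The dictionary DATASET_PREFIX_MAPPINGS and the function match_dataset_technology are hypothetical names, and choosing the longest matching prefix is an assumption made here for the sketch.

```python
from typing import Optional

# Hypothetical, simplified view of dataset mappings: namespace prefix -> technology.
DATASET_PREFIX_MAPPINGS = {
    "s3://": "Amazon S3",
}


def match_dataset_technology(namespace: str) -> Optional[str]:
    """Return the technology of the longest matching prefix, or None if nothing matches."""
    matches = [prefix for prefix in DATASET_PREFIX_MAPPINGS if namespace.startswith(prefix)]
    if not matches:
        return None
    return DATASET_PREFIX_MAPPINGS[max(matches, key=len)]


# The input dataset matches the default s3:// mapping.
print(match_dataset_technology("s3://mybigbucket.com"))
# The output dataset has no matching prefix yet, so a new mapping is needed.
print(match_dataset_technology("mongodb://analytics-db.company.com:27017"))
```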

3. Create a mapping for the job

Review the following section of the event:

{
  "namespace": "custom_etl_tool",
  "name": "workspace_name/folder_1/folder_2/job_name_1"
  ....
}

Because this is a custom ETL tool, you need to create a mapping for it.

Complete these steps:

  1. On the Active mappings tab, click Create mapping.
  2. In the Rule scope section, select the type of the namespace. The custom ETL tool is referenced in the job section of the event, so this mapping rule is based on the job namespace. Select Job namespace.
  3. Decide on the namespace matching method. The job namespace in the event is custom_etl_tool. This value is static and does not contain a hostname, so you must provide the entire namespace value. Select Namespace exact value, and enter custom_etl_tool in the value field.
  4. The custom ETL tool technology does not exist. On the next page, in the technology type section, click Select > New technology. Provide the following details:
    • Technology name: Custom ETL tool
    • Branch name: Default
    • Technology type: ETL tool
    • Asset hierarchy number of levels: 3
    • Hierarchy level names: Workspace, Folder, Job
    • Recursive asset level: Folder
  5. Select a data source definition. Search for the OpenLineage data source definition with host customETLtoolHost that you created earlier.
  6. Save the mapping.
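
To see how the settings above relate to the job in the event, consider the following Python sketch. It is a conceptual illustration under stated assumptions, not the product's algorithm: it assumes that the job name is split on the / character and that the recursive level (Folder) absorbs any extra path segments. The function names matches_job_mapping and resolve_hierarchy are hypothetical.

```python
JOB_NAMESPACE = "custom_etl_tool"             # value entered for Namespace exact value
LEVEL_NAMES = ["Workspace", "Folder", "Job"]  # hierarchy level names of the new technology
RECURSIVE_LEVEL = "Folder"                    # level that may repeat


def matches_job_mapping(namespace: str) -> bool:
    """Exact-value matching: the whole namespace must be identical."""
    return namespace == JOB_NAMESPACE


def resolve_hierarchy(name: str) -> list[tuple[str, str]]:
    """Label each path segment with a level, repeating the recursive level as needed."""
    parts = name.split("/")
    if len(parts) < len(LEVEL_NAMES):
        raise ValueError(f"Expected at least {len(LEVEL_NAMES)} segments, got {len(parts)}")
    extra = len(parts) - len(LEVEL_NAMES)
    labels: list[str] = []
    for level in LEVEL_NAMES:
        # Only the recursive level is allowed to cover more than one segment.
        repeat = 1 + extra if level == RECURSIVE_LEVEL else 1
        labels.extend([level] * repeat)
    return list(zip(labels, parts))


levels = resolve_hierarchy("workspace_name/folder_1/folder_2/job_name_1")
for level, value in levels:
    print(f"{level}: {value}")
```

With three hierarchy levels and Folder marked as recursive, the four-segment job name in the event fits the structure Workspace > Folder > Folder > Job.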

4. Create a mapping for the output dataset

Review the following section of the event:

{
  "namespace": "mongodb://analytics-db.company.com:27017",
  "name": "customerdb.mycollection.sales_summary"
  ....
}

The namespace contains the prefix mongodb://, which corresponds to MongoDB. The name field suggests that the data structure is database.collection.document. This technology type is not supported by default, so a new custom technology is required.

Create a mapping for this dataset:

  1. On the Active mappings tab, click Create mapping.
  2. In the Rule scope section, select Dataset namespace (inputs, outputs).
  3. The namespace contains the prefix mongodb://, followed by host and port values. Select Namespace prefix as the matching method. In the Namespace prefix field, enter mongodb://.
  4. On the next page, in the technology type section, click Select > New technology. Provide the following details:
    • Technology name: MongoDB
    • Branch name: Doc
    • Technology type: Database
    • Asset hierarchy number of levels: 3
    • Hierarchy level names: Database, Collection, Document
    • Recursive asset level: None
  5. In the data source definition section, select the Assign automatically option. Based on the static prefix value and the dynamic host and port values, all data is associated with the OpenLineage data source definition with host analytics-db.company.com and port 27017.
  6. Save the mapping.
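
The split between the static prefix and the dynamic host and port values can be sketched in Python with the standard urllib.parse module. This is an illustration only, not a description of how the product parses namespaces. The function names split_namespace and resolve_levels are hypothetical, and splitting the dataset name on the . character is an assumption based on the name in the example event.

```python
from urllib.parse import urlsplit

LEVEL_NAMES = ["Database", "Collection", "Document"]  # hierarchy level names of the new technology


def split_namespace(namespace: str) -> dict:
    """Separate the static prefix from the dynamic host and port values."""
    parts = urlsplit(namespace)
    return {
        "prefix": f"{parts.scheme}://",
        "host": parts.hostname,
        "port": parts.port,
    }


def resolve_levels(name: str) -> list[tuple[str, str]]:
    """Label each dot-separated segment of the dataset name with a hierarchy level."""
    values = name.split(".")
    if len(values) != len(LEVEL_NAMES):
        raise ValueError(f"Expected {len(LEVEL_NAMES)} segments, got {len(values)}")
    return list(zip(LEVEL_NAMES, values))


namespace_parts = split_namespace("mongodb://analytics-db.company.com:27017")
levels = resolve_levels("customerdb.mycollection.sales_summary")
print(namespace_parts)
print(levels)
```

The host and port extracted here are the same values that were used for the OpenLineage data source definition in the prerequisites.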

5. Run metadata import

Create and run a metadata import with the example OpenLineage event to see how the lineage is changed when the new mapping rules are processed.

  1. In the project, go to the Assets tab, and click New asset > Import metadata for data assets.
  2. Provide a name for your import, for example OpenLineage example event.
  3. Select the Import lineage metadata goal.
  4. In the data source section, select the OpenLineage data source definition with the customETLtoolHost host.
  5. In the Add inputs from file section, click Add. Upload your .zip file with the OpenLineage event.
  6. Optionally, define other metadata import options.
  7. Save your changes.

6. Review the lineage

When the new mappings are processed, the lineage contains accurate data:

  • The Amazon S3 dataset is resolved into the structure Bucket > Folder > File.
  • The MongoDB dataset is resolved into the structure Database > Collection > Document.
  • The custom ETL tool job is added under the logical structure Workspace > Folder > Folder > Job.