Targeted scanning

Overview

From R42.5, Targeted scans are enabled by default for new installations.

The Targeted scans feature lets you rescan one or more connections while keeping the lineage of all other already scanned connections intact. This gives you flexibility over what is scanned and when, without having to rescan everything every time.

Targeted scans improve on legacy minor revisions and effectively replace the legacy scanning modes of major scans.

For example, suppose we know there have been recent changes in a BI tool (such as Power BI) and we want to see them in the lineage generated by IBM Automatic Data Lineage. All we need to do is run a Targeted scan that contains only the corresponding Power BI connection. After the scan finishes, the lineage shows the new Power BI changes, is still an end-to-end lineage connected to its source systems, and still contains all other previously scanned connections.

In R42.0-R42.4, Targeted scans are not enabled by default. The default behavior is the same as in the prior versions of Automatic Data Lineage.

Prerequisites

Because Targeted scans are a complex feature that changes much of how scanning works, there are a few steps you need to take before using it:

  1. Make sure that at least 30% of the disk space is free on the machine where Automatic Data Lineage runs.

    • Targeted scans store additional data, which requires more disk space.
  2. Enable the feature.

    • In the standard install of Automatic Data Lineage, open Manta AdminUI, go to the Configuration pane, and choose CLI, then Common, then Common config. On that page, scroll down to the Targeted scanning section and change the value of Enable targeted scanning to true.

    • In dockerized Automatic Data Lineage, add -Dmanta.incremental.updates.enabled=true to MANTA_CLI_PROCESSOPTIONS in docker-compose.yml instead.

  3. R42.0-R42.2: Start with an empty repository. This is mandatory so that Automatic Data Lineage can precompute the data needed for all future Targeted scans. Start with a clean (empty) repository by running a clean repository scenario. Note that this drops all of your historical lineage.

  4. R42.3 and newer: Start with an empty revision. Before running any Targeted scan, you need a clean major revision. This is mandatory so that Automatic Data Lineage can precompute the data needed for all future Targeted scans. There are two ways to achieve this:

    • Start with a clean (empty) repository by running a clean repository scenario. Note that this drops all of your historical lineage.

    • Run a single major scan beforehand (for example, the New Revision and Commit Revision scenarios). This must be done after enabling the Targeted scans feature in step 2, even if you already ran a major scan before enabling the feature.

    • You can omit this step if your Manta database is empty (that is, the current revision is 0).

  5. Set the Revision type to Minor in all workflows (new or existing) that should use Targeted scans.

In R42, new workflows are created with the Major revision type by default, so it is up to the user to switch them to Minor.
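The first two prerequisites can be checked with a small script before you change anything. The following is a minimal sketch, not part of the product: the function names are hypothetical, the path to check is an assumption (point it at the filesystem where Automatic Data Lineage is installed), and the sample option value is only an illustration. The 30% threshold and the -Dmanta.incremental.updates.enabled=true flag come from the steps above.

```python
import shutil

# Threshold from step 1: at least 30% of the disk space should be free.
REQUIRED_FREE_FRACTION = 0.30

# Flag from step 2 (dockerized installs), added to MANTA_CLI_PROCESSOPTIONS.
TARGETED_SCAN_FLAG = "-Dmanta.incremental.updates.enabled=true"


def free_fraction(path: str = ".") -> float:
    """Return the fraction (0.0 to 1.0) of free space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def has_enough_space(path: str = ".", required: float = REQUIRED_FREE_FRACTION) -> bool:
    """Return True if the free fraction on `path` meets the required threshold."""
    return free_fraction(path) >= required


def with_targeted_scan_flag(process_options: str) -> str:
    """Return the options string with the Targeted scans flag appended exactly once.

    This only builds the string; it does not edit docker-compose.yml, and it
    does not rewrite an existing flag that is set to a different value.
    """
    parts = process_options.split()
    if TARGETED_SCAN_FLAG not in parts:
        parts.append(TARGETED_SCAN_FLAG)
    return " ".join(parts)


if __name__ == "__main__":
    # Assumption: "." is on the same filesystem as the installation.
    print(f"Free space: {free_fraction('.'):.0%}")
    print("Enough space for Targeted scans:", has_enough_space("."))
    # "-Xmx4g" is a hypothetical existing option value.
    print(with_targeted_scan_flag("-Xmx4g"))
```

The printed options string is what you would then place in MANTA_CLI_PROCESSOPTIONS by hand.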

Best practices

Here are some tips that will help you maximize the benefits of Targeted scans:

Deleting nodes and edges for deleted connections

For R42.4 and earlier, it is not possible to remove only the nodes of a deleted connection. The only solution is to clean the whole repository and start from the beginning.

For R42.5 and newer, Automatic Data Lineage automatically attempts to delete the objects associated with a deleted connection. It does this by running the Run Clean Connection Data scenario. You can also run the scenario manually if the automated run was terminated unexpectedly. The Run Clean Connection Data workflow has two modes.

Limitations: The second mode, which is used as a backup option, does not work correctly for connections that have the same name and scanner technology as connections that were already deleted. For example, you create a PowerBI connection named power-1, execute the analysis, and then delete the connection. Later, you create a new PowerBI connection with the same name, execute the analysis, and delete the connection again. If you terminate the Run Clean Connection Data workflow before it finishes and then try to clean the visualization with the backup workflow, the nodes and edges of the power-1 connection will still be visible in the visualization. To delete the power-1 nodes in this case, create a connection with the same name and scanner technology, delete it, and wait for the automatically executed workflow to finish successfully.

Example where the order of scans is important

We have two connections: the first is an Oracle database connection, and the second is a Power BI connection that loads data from the Oracle database. Both connections have changed and we want to rescan both of them. We have the following options:

  1. Incorrect solution: Run a Targeted scan of only the Power BI connection, then run a Targeted scan of only the Oracle connection.

    • The Power BI scan uses data from the Oracle connection. If it runs first, the Oracle data has not been updated yet, so the Power BI scan uses outdated Oracle data, and the Oracle scan updates that data only afterwards. Consequently, the resulting Power BI lineage might be imprecise.
  2. Correct solution 1: Run a Targeted scan of only the Oracle connection, then run a Targeted scan of only the Power BI connection.

    • This is the correct order. When Power BI is analyzed, its source Oracle connection is already updated and its up-to-date data is used. The resulting lineage will be correct.
  3. Correct solution 2: Run a Targeted scan that contains both the Oracle connection and the Power BI connection.

    • This is also a correct solution because Automatic Data Lineage first updates (extracts) data for both connections and then analyzes both connections. Consequently, Power BI sees up-to-date Oracle data and the resulting lineage will be correct.
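When more than two connections are involved, the same rule (scan a source before anything that loads data from it) amounts to a topological ordering of the connections. The sketch below only illustrates that planning step. It is not a feature or API of Automatic Data Lineage, and the scan_order function and the dependency map are hypothetical and maintained by hand.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def scan_order(reads_from: dict[str, set[str]]) -> list[str]:
    """Return the connections in an order where every source comes before its consumers.

    `reads_from` maps a connection name to the set of connections it loads
    data from (its sources). Raises graphlib.CycleError if the dependencies
    form a cycle; in that case, put those connections into one Targeted scan.
    """
    return list(TopologicalSorter(reads_from).static_order())


if __name__ == "__main__":
    # The example from this section: Power BI loads data from Oracle.
    dependencies = {"Power BI": {"Oracle"}}
    print(scan_order(dependencies))  # ['Oracle', 'Power BI']
```

Running one Targeted scan per connection in the returned order corresponds to correct solution 1. Putting all the connections into a single Targeted scan (correct solution 2) makes the ordering unnecessary.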

Limitations

There are some limitations of this feature to consider:

Example

Here is an example usage of Targeted scans:

| Revision | Technologies scanned | Notes |
|---|---|---|
| 0.1 | Oracle | The first Targeted scan was done for an Oracle database. The Manta repository now contains data lineage for the Oracle database. |
| 0.2 | Power BI | Power BI was scanned using another Targeted scan. The Manta repository now contains data lineage for the Oracle database and the Power BI connection. |
| 0.3 | Power BI | The admin found that connection alias mappings were needed because, without them, the lineage between Oracle and Power BI did not connect properly in revision 0.2. After creating the alias mappings, the admin ran another Targeted scan targeting only Power BI. This scan replaced the Power BI lineage from revision 0.2 with the correct Power BI lineage, which is properly connected to the source Oracle lineage. |
| 0.4 | Oracle + Power BI | After a few days, both technologies were changed by their developers. Therefore, the admin ran a Targeted scan containing both technologies to update their lineage. This scan replaced the lineage from previous revisions for both technologies. An alternative approach would be to first run a Targeted scan for Oracle and then run a Targeted scan for Power BI (but not the other way around, as described in the previous section). |
| 0.5 | Power BI | After a day, new changes were made only in Power BI. This time, a Targeted scan targeting only Power BI is enough to update the resulting lineage. This scan replaced the Power BI lineage from revision 0.4 while keeping the Oracle lineage from that same revision. |
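The replacement behavior in this example can be modeled in a few lines: each Targeted scan overwrites the lineage of the connections it contains and leaves every other connection untouched. This is a simplified mental model, not the product's implementation. The function names are hypothetical, and the lineage of a connection is reduced to the revision that last produced it.

```python
def apply_targeted_scan(
    lineage: dict[str, str], revision: str, connections: list[str]
) -> dict[str, str]:
    """Return a new mapping in which only the scanned connections point to `revision`."""
    updated = dict(lineage)  # all other connections keep their existing lineage
    for connection in connections:
        updated[connection] = revision
    return updated


def replay(scans: list[tuple[str, list[str]]]) -> dict[str, str]:
    """Apply a sequence of (revision, connections) Targeted scans to an empty repository."""
    lineage: dict[str, str] = {}
    for revision, connections in scans:
        lineage = apply_targeted_scan(lineage, revision, connections)
    return lineage


# The five revisions from the example above.
EXAMPLE_SCANS = [
    ("0.1", ["Oracle"]),
    ("0.2", ["Power BI"]),
    ("0.3", ["Power BI"]),
    ("0.4", ["Oracle", "Power BI"]),
    ("0.5", ["Power BI"]),
]

if __name__ == "__main__":
    print(replay(EXAMPLE_SCANS))  # {'Oracle': '0.4', 'Power BI': '0.5'}
```

After revision 0.5, the model shows Oracle lineage from revision 0.4 and Power BI lineage from revision 0.5, which matches the last row of the example.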