Targeted scanning
Overview
The Targeted scans feature lets you rescan one or more connections while keeping the lineage of all other already scanned connections intact. It gives you control over what is scanned and when, without rescanning everything every time.
Targeted scans improve legacy minor revisions and effectively replace legacy scanning modes of major scans.
For example, suppose there have been recent changes in a BI tool (for example, Power BI) and we want to see them in the lineage generated by IBM Automatic Data Lineage. All we need to do is run a Targeted scan containing only the corresponding Power BI connection. After the scan finishes, the lineage shows all the new Power BI changes. It is still an end-to-end lineage connected to its source systems, and it still contains all other previously scanned connections.
In R42.0-R42.4, Targeted scans are not enabled by default. The default behavior is the same as in the prior versions of Automatic Data Lineage.
Prerequisites
Targeted scans are a complex feature that changes a lot, so there are a few steps you need to take before using it:
1. Make sure that at least 30% of the disk space is free on the machine where Automatic Data Lineage runs.
   - Targeted scans need to store additional data, which requires more disk space.
2. Enable the feature.
   - In the standard install of Automatic Data Lineage, open Manta Admin UI, go to the Configuration pane, choose CLI, then Common, and finally Common config. On that page, scroll down to the Targeted scanning section and change the value of Enable targeted scanning to true.
   - In dockerized Automatic Data Lineage, instead add `-Dmanta.incremental.updates.enabled=true` to `MANTA_CLI_PROCESSOPTIONS` in `docker-compose.yml`.
3. R42.0-R42.2: Start with an empty repository. This is mandatory so that Automatic Data Lineage can precompute the data needed for all future Targeted scans. Start with a clean/empty repository by running a clean repository scenario. Note that this drops all the historical lineage that you have.
4. R42.3 and newer: Start with an empty revision. Before running any Targeted scan, you need a clean major revision. This is mandatory so that Automatic Data Lineage can precompute the data needed for all future Targeted scans. There are two options to achieve this:
   - Start with a clean/empty repository by running a clean repository scenario. Note that this drops all the historical lineage that you have.
   - Run a single major scan beforehand (e.g. New Revision and Commit Revision scenarios). This must be done after enabling the Targeted scans feature in step (2), even if you already did a major scan before enabling the feature.

   This step can be omitted if your Manta database is empty (i.e. the current revision is 0).
5. Set the Revision type to Minor in all workflows (new or existing) that should use Targeted scans.
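
The free disk space prerequisite from the first step can be checked with a short script before enabling the feature. The following is a minimal sketch and not part of the product. The function names and the default path are assumptions chosen for illustration; point the path at the volume that actually hosts your Automatic Data Lineage installation.

```python
import shutil


def meets_threshold(free_bytes: int, total_bytes: int, minimum_ratio: float = 0.30) -> bool:
    """Return True if the free fraction of the disk is at least `minimum_ratio`."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes >= minimum_ratio


def has_enough_free_space(path: str = ".", minimum_ratio: float = 0.30) -> bool:
    """Check the volume that contains `path` against the 30% free-space prerequisite."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (in bytes)
    return meets_threshold(usage.free, usage.total, minimum_ratio)


if __name__ == "__main__":
    # "." is a placeholder: replace it with the installation directory to check.
    if has_enough_free_space("."):
        print("At least 30% of the disk is free.")
    else:
        print("Less than 30% of the disk is free; free up space before enabling Targeted scans.")
```

The threshold logic is kept in a separate pure function so that it can be verified without depending on the state of a real disk.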

Best practices
Here are some tips that will help you maximize the benefits of Targeted scans:
- We recommend using only Targeted scans as of R42. Targeted scans are the way to go for the future. They can do everything the old major scans could do and much more.
- If you would like to enable Targeted scans, we recommend coordinating this action with our support team to make sure the process is seamless. If you have a test environment, we recommend enabling the feature there first.
- When performing multiple Targeted scans, the order of scans might be important.
  - The general recommended order is to scan databases first, and then scan ETLs, BIs, and programming languages.
  - Whenever you scan a dependent connection that uses data from another source connection (e.g. an ETL tool depending on a database), the Targeted scan of the dependent connection uses the newest extracted data of the source connection. However, if these data are outdated, there might be some inconsistencies in the resulting lineage.
  - If such an issue appears, the alternatives are:
    - analyze the tightly dependent connections together in a single Targeted scan, or
    - extract all connections first, then run multiple Targeted scans that only analyze these connections (without extracting them).
Deleting nodes and edges for deleted connections
For R42.4 and earlier, it is not possible to remove only the nodes of a deleted connection. The only solution is to clean the whole repository and start from the beginning.
For R42.5 and newer, Automatic Data Lineage automatically attempts to delete objects associated with the deleted connection. This is done by running the Run Clean Connection Data scenario. You can also run it manually in case the automated run was terminated unexpectedly. The Run Clean Connection Data workflow has two modes:
- Mode 1: When you delete a connection through the connection configurator, Admin UI adds the workflow to the workflow queue in the Process Manager. As a result, nodes and edges are deleted only for the currently deleted connection. Use this mode in most cases.
- Mode 2: In some cases, the workflow started while deleting a connection might be terminated, or Automatic Data Lineage might be shut down unexpectedly before the workflow finishes. This mode serves as a backup option: run the Run Clean Connection Data workflow from the Execute Workflow window in the Process Manager. As in Mode 1, the workflow creates a new minor revision and then commits the changes. The difference is that nodes and edges are deleted for all deleted connections, not only for the currently deleted connection.
Example when order of scans is important
We have two connections: the first is an Oracle database connection, the second is a Power BI connection that loads data from the Oracle database. Both connections have some changes and we want to rescan both of them. What we can do:
- Incorrect solution: Run a Targeted scan of only the Power BI connection, then run a Targeted scan of only the Oracle connection.
  - The Power BI scan uses data from the Oracle connection. However, if it is run first, the data of the Oracle connection are not yet updated. The Power BI scan therefore uses outdated Oracle data, and only afterwards does the Oracle scan update these data. Consequently, the resulting Power BI lineage might be imprecise, because it used outdated Oracle data.
- Correct solution 1: Run a Targeted scan of only the Oracle connection, then run a Targeted scan of only the Power BI connection.
  - This is the correct order. When the Power BI connection is analyzed, its source Oracle connection is already updated and its up-to-date data are used. The resulting lineage will be correct.
- Correct solution 2: Run a Targeted scan that contains both the Oracle connection and the Power BI connection.
  - This is also a correct solution because Automatic Data Lineage first updates (extracts) data for both connections and then analyzes both connections. Consequently, Power BI will see up-to-date Oracle data and the resulting lineage will be correct.
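
When there are more than two connections, the "sources before dependents" rule from Correct solution 1 amounts to a topological ordering of the connections. The following is a minimal sketch of that idea using Python's standard `graphlib` module. It is only an illustration: the connection names and the dependency mapping are hypothetical and maintained by hand, and Automatic Data Lineage does not expose or consume such a structure.

```python
from graphlib import TopologicalSorter


def scan_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Order connections so that every source connection precedes its dependents.

    `depends_on` maps each connection to the set of connections it reads from.
    Raises graphlib.CycleError if the dependencies form a cycle.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical connection names used only for this example.
dependencies = {
    "oracle_dwh": set(),                 # database, no upstream connection
    "powerbi_reports": {"oracle_dwh"},   # Power BI loads data from Oracle
}

order = scan_order(dependencies)
print(order)  # ['oracle_dwh', 'powerbi_reports']
```

Running one Targeted scan per connection in this order matches Correct solution 1. If the dependencies form a cycle, no valid order exists, and scanning the tightly dependent connections together (Correct solution 2) is the option that remains.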
Limitations
There are some limitations of this feature to consider:
- Performance of Targeted scans
  - We need to compute and store additional data to make Targeted scanning possible, so we expect Manta scans to slow down by 5-15%.
- Interpolation
  - If you’re using Interpolation, an additional step is required. Using the Advanced mode in the workflow designer, add an interpolation scenario to all workflows that contain analysis of:
    - a data modeling scanner (i.e. ER Studio, Erwin, and PowerDesigner in R42),
    - a connection that is modeled by these data modeling scanners,
    - a connection with the manual import of data from any non-physical layer.
- Disabling targeted scanning
  - If Targeted scans need to be disabled (this should not be needed), it is necessary to either:
    - upgrade to R42.3 (then it should work fine without any issues), or
    - start again with an empty major revision or a clean repository.
Example
Here is an example usage of Targeted scans:
| Revision | Technologies scanned | Notes |
|---|---|---|
| 0.1 | Oracle | The first Targeted scan was done for an Oracle database. The Manta repository now contains data lineage for the Oracle database. |
| 0.2 | Power BI | Power BI was scanned using another Targeted scan. The Manta repository now contains data lineage for the Oracle database and the Power BI connection. |
| 0.3 | Power BI | The admin found that connection alias mappings were needed because, without them, the lineage between Oracle and Power BI did not properly connect in the previous revision 0.2. After creating alias mappings, another Targeted scan targeting only Power BI was done. This scan replaced the previously created Power BI lineage from revision 0.2 with the correct Power BI lineage that is correctly connected to the source Oracle lineage. |
| 0.4 | Oracle + Power BI | After a few days, both technologies were changed by their developers. Therefore, the admin ran a Targeted scan containing both technologies to update their lineage. This scan replaced lineage from previous revisions for both technologies. An alternative approach to this step would be to first run a Targeted scan for Oracle and then run a Targeted scan for Power BI (but not the other way around, as described in the previous section). |
| 0.5 | Power BI | After a day, new changes were made only in Power BI. This time a Targeted scan targeting only Power BI is enough to update the resulting lineage. This scan replaced the Power BI lineage from revision 0.4 while keeping the Oracle lineage from that same revision. |
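
The bookkeeping in the table can be replayed with a small toy model: each Targeted scan replaces the lineage of the connections it contains and leaves all other connections at the revision in which they were last scanned. The sketch below only mirrors the table. The function and variable names are made up for illustration and say nothing about how the Manta repository actually stores lineage.

```python
def apply_targeted_scan(lineage: dict[str, str], revision: str, scanned: list[str]) -> dict[str, str]:
    """Return a new mapping of connection -> revision that last produced its lineage.

    Only the scanned connections are updated; all others keep their previous revision.
    """
    updated = dict(lineage)
    for connection in scanned:
        updated[connection] = revision
    return updated


# The sequence of Targeted scans from the table above.
scans = [
    ("0.1", ["Oracle"]),
    ("0.2", ["Power BI"]),
    ("0.3", ["Power BI"]),
    ("0.4", ["Oracle", "Power BI"]),
    ("0.5", ["Power BI"]),
]

lineage: dict[str, str] = {}
for revision, scanned in scans:
    lineage = apply_targeted_scan(lineage, revision, scanned)
    print(revision, lineage)

# After revision 0.5, Oracle lineage still comes from 0.4 and Power BI lineage from 0.5.
```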