Manta Insights Overview

Manta Insights (formerly known as Lighthouse) is an analytical extension of the Manta Flow Server that allows you to evaluate specific scenarios identifying important data patterns. You can run these scenarios either directly through the API endpoints or through a user interface that lets you specify all the necessary input parameters for each rule.

Direct usage of the Lighthouse API endpoints is documented in Manta Insights API Usage. This topic describes the usage of Insights via its graphical user interface.

The scenarios can be accessed via the Landing Page URL http://localhost:8080/manta-dataflow-server/viewer/landing-page, from where you can either open the Flow Server Homepage or trigger Insights scenarios, as shown in the screenshot below.


If you select "Always open the repository at launch", the URL http://localhost:8080/manta-dataflow-server/viewer/ automatically redirects you to http://localhost:8080/manta-dataflow-server/viewer/repository, the Manta Repository Homepage. However, you can open the landing page later, either by clicking the IBM Manta Data Lineage logo in the header of the Repository Homepage or by going directly to the Landing Page URL: http://localhost:8080/manta-dataflow-server/viewer/landing-page

Note: You must have the LIGHTHOUSE_READ privilege to use Manta Insights.

Insights rules for deeper data analysis

The following scenarios (a.k.a. "rules") are available and can be used to obtain important insights about your data architecture: Highly Connected Elements, Isolated Elements, and Independent Flows.

Depending on the rule, Insights scans assets at various hierarchy levels: tables, databases, views, reports, columns, etc. Each rule focuses on a single type of pattern in the metadata storage and its flexibility is assured by using various input parameters. These parameters can be common and apply to all rules, or they can be rule-specific. Most of the parameters provide default values.

In general, running a rule involves the following steps:

  1. definition of the scope (i.e. selecting which revision and part of the repository should be analysed)

  2. providing additional input parameters which influence the rule’s behaviour

  3. displaying the result page with the nodes matching the criteria

While the first step is common to all rules, the remaining two are specific to each rule.

Let’s describe each scenario in more detail below.

Highly Connected Elements

Highly-connected assets are important parts of the data flow with a large number of flow edges in the selected direction. These assets may be critical points of the data lineage that are read by many other procedures, such as a database table that contains mappings from usernames to customers’ names. The owner or administrator should therefore guarantee that such a table contains verified data; otherwise, incorrect values will leak into other parts of the lineage. You can also determine which database procedures are used the most and inspect them carefully to make sure they are correct.

For performance reasons, the administrator may duplicate a highly-connected table to allow load balancing and improve efficiency when reading database transactions.

This rule counts the number of flows from / to the elements in the scope and provides a list of the elements that have the most data flows.

Example

The following image shows the data lineage of the detected asset. The rule returns 5 as the number of flows, which represents the five data flows from (numbered 1 and 2) or to (numbered 3, 4, and 5) the other assets.
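As a rough illustration of how such a count comes about, the following minimal Python sketch counts the flows touching each asset in a small hand-written edge list. The asset names and the `count_flows` helper are hypothetical and are not part of Manta Insights; the sketch only mirrors the example of two flows from other assets plus three flows to other assets.

```python
from collections import Counter

# Hypothetical edge list: (source asset, target asset). Names are illustrative only.
edges = [
    ("SRC_A", "CUSTOMER"),  # flow 1: from another asset
    ("SRC_B", "CUSTOMER"),  # flow 2: from another asset
    ("CUSTOMER", "RPT_X"),  # flow 3: to another asset
    ("CUSTOMER", "RPT_Y"),  # flow 4: to another asset
    ("CUSTOMER", "RPT_Z"),  # flow 5: to another asset
]


def count_flows(edges):
    """Count the flows touching each asset, in both directions."""
    counts = Counter()
    for source, target in edges:
        counts[source] += 1  # outgoing flow for the source
        counts[target] += 1  # incoming flow for the target
    return counts


flow_counts = count_flows(edges)
print(flow_counts["CUSTOMER"])  # 5
```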


Step 1 - defining rule scope

As a first step in the rule, decide which revision and part of the repository should be analysed. You can either run the scenario on your whole repository, which might be time-consuming for a complex data architecture, or restrict the scope and run it against a specific part of the repository. All Insights rules work only with assets in layers of type Physical, so there is no need to specify the layer.

If you don't need to evaluate the rule for some parts of the repository or if you want to limit the scope to speed up the evaluation, it is recommended to select only specific Resources or even particular Node sub-hierarchies.

You can select the revision and the part of the repository you are interested in using the selection tree below:


Only node Oracle/ORCL/DWH and its sub-hierarchy will be evaluated by the Insights rule.
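Conceptually, restricting the scope to a node and its sub-hierarchy behaves like a path filter. The sketch below is only an illustration under that assumption: the `in_scope` helper and the slash-separated node paths are hypothetical and do not describe how Insights stores or matches nodes internally.

```python
def in_scope(node_path, scope_roots):
    """Return True if node_path equals a scope root or lies below one."""
    parts = node_path.strip("/").split("/")
    for root in scope_roots:
        root_parts = root.strip("/").split("/")
        # Compare whole path components so DWH does not match DWH_STAGE.
        if parts[:len(root_parts)] == root_parts:
            return True
    return False


scope = ["Oracle/ORCL/DWH"]
nodes = [
    "Oracle/ORCL/DWH/CUSTOMER/ID",
    "Oracle/ORCL/DWH",
    "Oracle/ORCL/DWH_STAGE/ORDERS",  # similar name, but a different node
    "Talend/prod/job_1",
]
selected = [n for n in nodes if in_scope(n, scope)]
print(selected)  # ['Oracle/ORCL/DWH/CUSTOMER/ID', 'Oracle/ORCL/DWH']
```

Comparing path components rather than raw string prefixes keeps sibling nodes with similar names out of the scope.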

Step 2 - selecting edge type and direction

Another input parameter that is common to all three rules is the edge type. You can run the scenarios either for DIRECT edges only or for both DIRECT and FILTER edges. After you choose the edge type, you can decide whether to count only incoming edges, only outgoing edges, or edges in both directions.
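To make the effect of these two choices concrete, here is a minimal Python sketch that counts the edges of one node for a given set of edge types and a given direction. The `Edge` class, the `count_edges` function, and the direction labels "incoming", "outgoing", and "both" are assumptions made for this illustration; only the DIRECT and FILTER edge types come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    kind: str  # "DIRECT" or "FILTER"


def count_edges(edges, node, edge_types, direction):
    """Count edges of the given types that touch node in the given direction.

    direction is "incoming", "outgoing", or "both" (hypothetical labels).
    """
    total = 0
    for e in edges:
        if e.kind not in edge_types:
            continue  # edge type was not selected
        if direction in ("incoming", "both") and e.target == node:
            total += 1
        if direction in ("outgoing", "both") and e.source == node:
            total += 1
    return total


edges = [
    Edge("A", "T", "DIRECT"),
    Edge("B", "T", "FILTER"),
    Edge("T", "C", "DIRECT"),
]
print(count_edges(edges, "T", {"DIRECT"}, "incoming"))        # 1
print(count_edges(edges, "T", {"DIRECT", "FILTER"}, "both"))  # 3
```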

Both of these selections are made in the second step of the input dialog below:


Step 3 - additional input parameters


As explained above, this rule detects elements with a large number of flow edges in the selected direction (or in both directions) and sorts them from largest to smallest. The assets are evaluated on the second-lowest hierarchy level; for example, Tables (the lowest level is Column), Files (the lowest level is File Column), or database procedures.

However, you can control whether the rule returns the number of columns that the edges lead from / to or the number of unique Tables (or second-level nodes in general). This is done with the Count Mode parameter, which has the following values:

Another input parameter is Number of flows, which sets the threshold (T); only the nodes with more than T flows are returned in the result. To limit the number of nodes shown in the result, use the Number of results parameter.
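The interplay of the threshold and the result limit can be sketched in a few lines of Python. The `top_connected` function and the sample counts are hypothetical; the sketch only assumes what is stated above, namely that nodes need more than T flows to qualify and that results are sorted from largest to smallest.

```python
def top_connected(flow_counts, number_of_flows, number_of_results):
    """Keep nodes with more than number_of_flows flows, largest first, capped."""
    matching = [
        (node, count)
        for node, count in flow_counts.items()
        if count > number_of_flows  # strictly more than the threshold T
    ]
    # Sort by flow count descending; break ties by node name for stable output.
    matching.sort(key=lambda item: (-item[1], item[0]))
    return matching[:number_of_results]


flow_counts = {"CUSTOMER": 5, "ORDERS": 3, "AUDIT_LOG": 9, "TMP_1": 1}
result = top_connected(flow_counts, number_of_flows=2, number_of_results=2)
print(result)  # [('AUDIT_LOG', 9), ('CUSTOMER', 5)]
```

In this example ORDERS passes the threshold of 2 but is cut off by the limit of two results.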

Finally, you can select which Element types (a.k.a. Node types) you want to analyse. The selection box at the bottom of the dialog offers only the Element types that were found in the selected repository scope. For example, if you have not selected the MSSQL resource, no MSSQL element types are offered.

After you have specified all the input parameters, click "Run Scan" and wait for the result to appear on the screen.

Displaying the rule result

The result of the rule is displayed on a new page in the form of a table listing the Nodes that match the input criteria:


This image shows the Highly Connected Elements result page with the list of elements with the highest number of flows.

This table can be easily filtered, sorted and exported into CSV.

If you select a node from the list, you can either visualise the node in the Viewer (button Visualize in a graph) or automatically locate the Node in the Repository Search window (button See element in repository).

If you need to go back to the Landing page to try another rule or another combination of input parameters, click the Manta Data Lineage logo in the top left corner.

Isolated Elements

Isolated Elements (a.k.a. Isolated Lineage Segments) are collections of independent elements (Nodes) isolated from the rest of the lineage. Over the course of development, assets often accumulate in the data environment that are no longer used. For instance, the data may exist only in legacy systems that are not utilised anymore, or the development of a particular application may have been scrapped while its remnants are still in the environment.

Isolated elements are assets at various levels of hierarchy — of the type Table, Database, Directory, etc. The assets of the isolated lineage segment may have flows to other assets of the segment but no incoming or outgoing flows to or from assets outside the current isolated segment.

So, for example, an isolated table is an asset of the type Table that may have flow between table columns but doesn't have any flows from or to columns in other tables.

Step 1 - defining rule scope

The first step is to define the scope of the rule, which is done in the same way as for the Highly Connected Elements (HCE) rule. See the previous example.

Step 2 - selecting edge type, isolation mode and element types to be analysed


As with the previous HCE rule, select whether the rule analyses only DIRECT edges or both DIRECT and FILTER edges by choosing the appropriate edge type icon at the top of the dialog.

The next parameter, Isolated Mode, specifies whether you want to list Nodes that are isolated from incoming flows, from outgoing flows, or from both directions (i.e. completely isolated).
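The idea of isolation and its modes can be illustrated with a minimal Python sketch at the table level. The path layout, the `table_of` and `isolated_tables` helpers, and the mode labels are assumptions made for this example; it also simplifies the rule to single tables rather than whole isolated segments. Flows between columns of the same table are ignored, as described for isolated tables above.

```python
def table_of(column_path):
    """Parent table of a column path such as 'DWH/CUSTOMER/ID' (hypothetical layout)."""
    return column_path.rsplit("/", 1)[0]


def isolated_tables(tables, column_edges, mode="both"):
    """Return the tables that are isolated in the given mode.

    mode "incoming": no flows arrive from other tables
    mode "outgoing": no flows leave to other tables
    mode "both":     neither (completely isolated)
    """
    has_incoming, has_outgoing = set(), set()
    for source, target in column_edges:
        src_table, tgt_table = table_of(source), table_of(target)
        if src_table == tgt_table:
            continue  # flows inside one table do not break isolation
        has_outgoing.add(src_table)
        has_incoming.add(tgt_table)
    if mode == "incoming":
        connected = has_incoming
    elif mode == "outgoing":
        connected = has_outgoing
    elif mode == "both":
        connected = has_incoming | has_outgoing
    else:
        raise ValueError(f"unknown mode: {mode}")
    return sorted(t for t in tables if t not in connected)


tables = ["DWH/CUSTOMER", "DWH/ORDERS", "DWH/LEGACY"]
column_edges = [
    ("DWH/CUSTOMER/ID", "DWH/ORDERS/CUSTOMER_ID"),  # flow between two tables
    ("DWH/LEGACY/A", "DWH/LEGACY/B"),               # flow inside one table
]
print(isolated_tables(tables, column_edges, "both"))      # ['DWH/LEGACY']
print(isolated_tables(tables, column_edges, "incoming"))  # ['DWH/CUSTOMER', 'DWH/LEGACY']
```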

Finally, with the last parameter, Element type, you can specify that the rule considers only selected types of Elements. Again, only the element types that are relevant to the resources selected in the first step are offered here.

Displaying the rule result

The result of the rule is displayed on a new page that has the same form and features as for the previous Highly Connected Elements rule, offering a list that can be exported and filtered. If you select a node in the table, detailed information about it is shown on the right side of the window.


This image shows the Isolated Elements result page.

Independent Flows (as of R42.3)

This rule finds the dependencies between particular groups of data flows. If you understand the data flows in your environment, you can plan logical schedules and parallelize some parts of the DWH load, achieving higher effectiveness.

Because it does not make sense to compare all possible data flows of the lineage (there are many), it is important to define which flows should be analyzed. One flow could be specified as a single ETL job of a given type. A job is typically a sequence of commands and transformations that moves data from a source asset (e.g., a table in a database) to a target asset and manipulates it on the way.
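To give a feel for what dependence between jobs can mean, the following Python sketch treats each job as a set of assets it reads and a set of assets it writes, and calls two jobs independent when neither reads what the other writes. This read-after-write definition, the job names, and the `depends_on` and `independent_pairs` helpers are assumptions made for illustration only; they are not the dependency types that Insights itself reports, and the sketch checks direct dependencies only, not transitive ones.

```python
from itertools import combinations

# Hypothetical jobs: each reads some assets and writes others.
jobs = {
    "load_customers": {"reads": {"SRC/CUSTOMERS"}, "writes": {"DWH/CUSTOMER"}},
    "load_orders": {"reads": {"SRC/ORDERS"}, "writes": {"DWH/ORDERS"}},
    "build_report": {
        "reads": {"DWH/CUSTOMER", "DWH/ORDERS"},
        "writes": {"RPT/SALES"},
    },
}


def depends_on(jobs, later, earlier):
    """True if `later` reads an asset that `earlier` writes (assumed definition)."""
    return bool(jobs[later]["reads"] & jobs[earlier]["writes"])


def independent_pairs(jobs):
    """Pairs of jobs with no direct dependency in either direction."""
    return [
        (a, b)
        for a, b in combinations(sorted(jobs), 2)
        if not depends_on(jobs, a, b) and not depends_on(jobs, b, a)
    ]


pairs = independent_pairs(jobs)
print(pairs)  # [('load_customers', 'load_orders')]
```

In this toy example the two load jobs could run in parallel, while the report job has to wait for both of them.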

The default configuration can take a significant amount of time to run. It is recommended to limit the scope of the rule by selecting fewer Resources/Nodes in Step 1 below.

Step 1 - defining rule scope

As with the other rules, the first step is to define the scope of the rule. If you select nodes from more than one Resource, you will get a warning that the computation might take more time. Other than that, you can still select as many Resources as needed; Insights will not block you in any way:


Step 2 - selecting edge type and hierarchy level to analyse


As with the previous rules, select whether the rule analyses only DIRECT edges or both DIRECT and FILTER edges by choosing the appropriate edge type icon at the top of the dialog.

The next parameter, Hierarchy Level, defines at which hierarchy level the dependence is evaluated. The options are Column level, Object level (Table, File, View, etc.) and Group level (Database, Schema, Directory, etc.).

Step 3 - selecting Dependency mode and Job types


Finally, in the last part of the input dialog, you can select which type of dependency you want to analyse and which Job types you are interested in.

Possible dependencies between particular jobs are as follows:

Displaying the rule result

The result of the rule is displayed on a new page that has the same form and features as for the previous rules, offering a list that can be exported and filtered. If you select a node in the table, detailed information about it is shown on the right side of the window.

The resulting table is a list of Start Jobs; for each one, you can expand all dependent jobs, including information about the dependency type.

FAQ

Error: Out of Heap Space Memory

Although we are constantly improving Lighthouse algorithms, the evaluation may fail because the heap space memory ran out while evaluating a large repository. Here are a few things you can try.

  1. Limit Lighthouse to only run on the part of the repository specified by paths; for example, by adding the input parameter target-paths = /Oracle, /Talend/prod.

  2. Allocate more memory to the Server component and restart Server. See Manta Flow Memory Settings: Server.

  3. Submit a ticket and attach the following log files.

Cannot Visualize the Result from the Highly-Connected Elements Rule

After you get a response with the assets detected by the highly-connected elements rule, you may be unable to visualize some of them in Manta Viewer. Assets with high count values in particular may fail to visualize, or may take a long time, because they have too many connections to other assets. In such cases, it helps to locate the node in Repository Search / Homepage and then change the visualization parameters, for example by decreasing the Steps Displayed or ticking off Visualize Indirect Flows.