Manta Insights Overview
Manta Insights (formerly known as Lighthouse) is an analytical extension of the Manta Flow Server that allows you to evaluate specific scenarios identifying important patterns in your data. You can run these scenarios either directly via the API endpoints or through a user interface that lets you specify all the necessary input parameters for each rule.
Direct usage of the Lighthouse API endpoints is documented in Manta Insights API Usage. This topic describes the usage of Insights via its graphical user interface.
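For direct API usage, the exact endpoints, parameters, and response format are described in Manta Insights API Usage. Purely as an illustration of the idea, a scenario could be triggered programmatically along the lines of the sketch below; the endpoint path, parameter names, and response shape used here are invented placeholders, not the documented contract.

```python
import requests

# Base URL of the Manta Flow Server, as used elsewhere in this topic.
BASE_URL = "http://localhost:8080/manta-dataflow-server"

# NOTE: this endpoint path and these parameter names are hypothetical;
# consult Manta Insights API Usage for the real contract.
ENDPOINT = f"{BASE_URL}/lighthouse/highly-connected-elements"

payload = {
    "revision": "HEAD",            # which repository revision to analyse
    "scope": "/Oracle/ORCL/DWH",   # restrict the scan to part of the repository
    "edgeType": "DIRECT",          # DIRECT only, or DIRECT and FILTER
    "numberOfFlows": 5,            # threshold on the flow count
}

# The calling user needs the LIGHTHOUSE_READ privilege (see below).
response = requests.post(ENDPOINT, json=payload, auth=("user", "password"))
response.raise_for_status()

# The response is assumed here to be a JSON list of matching elements.
for element in response.json():
    print(element)
```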
The scenarios can be accessed via the Landing Page URL http://localhost:8080/manta-dataflow-server/viewer/landing-page, from where you can either open the Flow Server Homepage or trigger the Insights scenarios, as shown in the screenshot below.
If you select "Always open the repository at launch", the URL http://localhost:8080/manta-dataflow-server/viewer/ automatically redirects you to http://localhost:8080/manta-dataflow-server/viewer/repository with the Manta Repository Homepage. However, the Landing Page can be opened later either by clicking the IBM Manta Data Lineage logo in the Repository Homepage header or by directly accessing the Landing Page URL: http://localhost:8080/manta-dataflow-server/viewer/landing-page
You need the LIGHTHOUSE_READ privilege to use Manta Insights.
Insights rules for deeper data analysis
The following scenarios (a.k.a. "rules") are available and can be used to obtain important insights about your data architecture:
- Highly Connected Elements (as of R42.2)
- Isolated Elements (as of R42.2)
- Independent Flows (as of R42.3)
Depending on the rule, Insights scans assets at various hierarchy levels: tables, databases, views, reports, columns, etc. Each rule focuses on a single type of pattern in the metadata storage, and its flexibility comes from various input parameters. These parameters can be common to all rules, or they can be rule-specific. Most of the parameters have default values.
In general, triggering a rule involves the following steps:
- defining the scope (i.e., selecting which revision and part of the repository should be analysed)
- providing additional input parameters that influence the rule's behaviour
- displaying the result page with the nodes matching the criteria
While the first step is common to all rules, the remaining two are specific to each rule.
Let’s describe each scenario in more detail below.
Highly Connected Elements
Highly-connected assets are important parts of the data flow with a large number of flow edges in the selected direction. These assets may be critical points of the data lineage read by many other procedures, such as database tables that map usernames to customers' names. Therefore, the owner or administrator should ensure that such a table contains verified data; otherwise, incorrect values will leak into other parts of the lineage. The user can also determine which database procedures are used the most and inspect them carefully to guarantee their accuracy.
For performance reasons, the administrator may duplicate a highly-connected table to allow load balancing and improve the efficiency of database read transactions.
Example
The following image shows the data lineage of the detected asset. The number of flows, 5, which is returned by this rule, represents the five data flows from the other assets (numbered 1 and 2) or to the other assets (numbered 3, 4, and 5).
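Conceptually, the reported number of flows is simply the count of flow edges attached to an asset in the selected direction(s). The following sketch uses a made-up column-level edge list, not the actual Manta repository model, to show how a count of 5 like the one above could be derived.

```python
from collections import Counter

# Hypothetical column-level flow edges (source, target); invented for illustration.
edges = [
    ("STAGE.ORDERS.ID", "DWH.CUSTOMER_MAP.ID"),     # flow 1: from another asset
    ("STAGE.USERS.NAME", "DWH.CUSTOMER_MAP.NAME"),  # flow 2: from another asset
    ("DWH.CUSTOMER_MAP.ID", "RPT.SALES.CUST_ID"),   # flow 3: to another asset
    ("DWH.CUSTOMER_MAP.NAME", "RPT.SALES.CUST"),    # flow 4: to another asset
    ("DWH.CUSTOMER_MAP.NAME", "RPT.CHURN.CUST"),    # flow 5: to another asset
]

def table_of(column: str) -> str:
    """Return the table part of a SCHEMA.TABLE.COLUMN identifier."""
    return column.rsplit(".", 1)[0]

# Count flows per table in both directions, then sort from largest to smallest.
flows = Counter()
for source, target in edges:
    flows[table_of(source)] += 1
    flows[table_of(target)] += 1

# DWH.CUSTOMER_MAP has 2 incoming + 3 outgoing = 5 flows.
print(flows.most_common(3))
```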
Step 1 - defining rule scope
As a first step in the rule, you should decide which revision and part of the repository should be analysed. You can either run the scenario on your whole repository, which might be time-consuming in the case of a complex data architecture, or you can restrict the scope and run it against a specific part of the repository. All Insights rules work only with assets in layers of the type Physical, so there is no need to specify the layer.
You can select the revision and the part of the repository you are interested in using the selection tree below:
Only the node Oracle/ORCL/DWH and its sub-hierarchy will be evaluated by the Insights rule.
Step 2 - selecting edge type and direction
Another input parameter that is common to all three rules is the edge type. You can run the scenarios either for DIRECT edges only or for both DIRECT and FILTER flows. After you choose the edge type, you can decide whether to count only incoming edges, only outgoing edges, or both directions.
Both of these selections can be made in the second step of the input dialog below:
Step 3 - additional input parameters
As explained above, this rule detects elements with a large number of flow edges in the selected direction (or in both directions) and sorts them from largest to smallest. The assets are evaluated on the second-lowest hierarchy level; for example, Tables (the lowest level is Column), Files (the lowest level is File Column), or database procedures.
However, you can influence whether the rule returns the number of columns the edges lead from / to or the number of unique Tables (or second-level nodes in general). This is controlled by the Count Mode parameter, which has the following values:
- Table level — for all elements in the selected scope, count the number of unique tables (or other types on the same hierarchy level) the flow goes from / to.
- Column level — for all elements in the selected scope, count the number of columns (or other types on the same hierarchy level) the flow goes from / to.
Another input parameter is Number of flows, which sets a threshold T; only the nodes having more than T flows are returned in the result. If you want to limit the number of nodes shown in the result, use the Number of results parameter.
Last but not least, you can select which Element types (a.k.a. Node types) you want to analyse. The selection box at the bottom of the dialog offers only the Element types found in the selected repository scope. So, if you have not selected, e.g., an MSSQL resource, no MSSQL element types are offered.
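To make the difference between the two count modes and the Number of flows threshold concrete, here is a minimal sketch on a made-up edge list; the data is invented for illustration and this is not how Insights is implemented internally.

```python
# Hypothetical column-level edges flowing into one table, DWH.SALES.
incoming = [
    ("STAGE.ORDERS.AMOUNT", "DWH.SALES.AMOUNT"),
    ("STAGE.ORDERS.CURRENCY", "DWH.SALES.CURRENCY"),
    ("STAGE.FX_RATES.RATE", "DWH.SALES.AMOUNT"),
]

def table_of(column: str) -> str:
    """Return the table part of a SCHEMA.TABLE.COLUMN identifier."""
    return column.rsplit(".", 1)[0]

# Column level: count the source columns the flows come from.
column_count = len({source for source, _ in incoming})           # 3

# Table level: count only the unique source tables.
table_count = len({table_of(source) for source, _ in incoming})  # 2

# With Number of flows = 2, DWH.SALES is reported in Column level mode (3 > 2)
# but not in Table level mode (2 is not greater than 2).
threshold = 2
print("Column level reported:", column_count > threshold)  # True
print("Table level reported:", table_count > threshold)    # False
```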
After you have specified all the input parameters, click "Run Scan" and wait for the result to appear on the screen.
Displaying the rule result
The result of the rule is displayed on a new page in the form of a table listing the Nodes matching the input criteria:
This image shows the Highly Connected Elements result page with the list of elements with the highest number of flows.
This table can be easily filtered, sorted, and exported to CSV.
If you select a node from the list, you can either visualise the node in the Viewer (button Visualize in a graph) or automatically locate the Node in the Repository Search window (button See element in repository).
If you need to go back to the Landing Page and try another rule or another combination of input parameters, just click the Manta Data Lineage logo in the top left.
Isolated Elements
Isolated Elements (a.k.a. Isolated Lineage Segments) are collections of independent elements (Nodes) isolated from the rest of the lineage. Throughout development, there are often assets in the data environment that are not used. For instance, for various reasons, data may remain only in legacy systems that are not utilised anymore. Or it may be the case that the development of a particular application was scrapped, but its remnants are still in the environment.
Isolated elements are assets at various levels of the hierarchy, of the type Table, Database, Directory, etc. The assets of an isolated lineage segment may have flows to other assets of the segment but no incoming or outgoing flows to or from assets outside the current isolated segment. So, for example, an isolated table is an asset of the type Table that may have flows between table columns but doesn't have any flows from or to columns in other tables.
The rule detects assets having no outside flow edges in the following directions:
- INCOMING direction — A segment with no incoming flow edges is expected to be a source system. However, sometimes an asset without external incoming edges is not a source system, so the data in this segment is static. Although this asset should be externally updated, it is not. Hence, further investigation should be done to find the cause.
- OUTGOING direction — A segment with no outgoing flow edges may be a reporting system or a dead asset. The term dead, in this context, refers to an asset that has no outgoing flow edges and is not a reporting system. Data from dead assets is not populated and, hence, should not be used anymore.
- BOTH directions — A fully isolated segment is completely independent, which may signal either an independent storage architecture or the uselessness of the given asset. If it is redundant, you can safely remove that whole part of the system without affecting the other systems.
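Conceptually, a segment is isolated in a given direction when none of its assets has a flow edge crossing the segment boundary in that direction. The sketch below illustrates the three cases on a toy table-level edge list; the data is invented and this is not the actual Insights implementation.

```python
# Hypothetical table-level flow edges (source_table, target_table).
edges = [
    ("LEGACY.CRM_BACKUP", "LEGACY.CRM_ARCHIVE"),  # flow inside a legacy segment
    ("STAGE.ORDERS", "DWH.SALES"),
    ("DWH.SALES", "RPT.REVENUE"),
]

# The segment we want to test for isolation.
segment = {"LEGACY.CRM_BACKUP", "LEGACY.CRM_ARCHIVE"}

def isolation(segment: set, edges) -> str:
    """Classify a segment by the directions in which it has no external edges."""
    has_incoming = any(t in segment and s not in segment for s, t in edges)
    has_outgoing = any(s in segment and t not in segment for s, t in edges)
    if not has_incoming and not has_outgoing:
        return "BOTH"          # fully isolated segment
    if not has_incoming:
        return "INCOMING"      # no external inputs: source system or static data
    if not has_outgoing:
        return "OUTGOING"      # no external outputs: reporting system or dead asset
    return "NOT ISOLATED"

print(isolation(segment, edges))  # -> BOTH
```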
Step 1 - defining rule scope
As a first step, it is necessary to define the scope of the rule, which is done in the same way as for the Highly Connected Elements (HCE) rule. See the previous example.
Step 2 - selecting edge type, isolation mode and element types to be analysed
Similarly to the previous HCE rule, you need to select whether the rule will analyse only DIRECT edges or both DIRECT & FILTER edges by selecting the proper edge type icon at the top of the dialog.
The next parameter, Isolated Mode, specifies whether you want to list Nodes that are isolated in the incoming direction, the outgoing direction, or both directions (i.e., completely isolated).
Finally, in the last parameter, Element type, you can specify that the rule should consider only the selected types of Elements. Again, only the element types relevant to the resources selected in the first step are offered here.
Displaying the rule result
The result of the rule is displayed on a new page that has the same form and features as for the previous Highly Connected Elements rule, offering a list that can be exported and filtered. If a node is selected in the table, detailed information is shown on the right side of the window.
This image shows the Isolated Elements result page.
Independent Flows (as of R42.3)
This rule finds the dependencies between particular groups of data flows. If you understand the data flows in your environment, you can plan logical schedules and parallelize some parts of the DWH load, achieving higher effectiveness.
Because it does not make sense to compare all possible data flows of the lineage (there are many), it is important to define which flows are supposed to be analyzed. The specification of one flow could be a single ETL job of a given type. One job is typically a sequence of commands and transformations that moves data from a source asset (e.g., a table in a database) to a target asset and manipulates it on the way.
Step 1 - defining rule scope
Similarly to the other rules, as a first step it is necessary to define the scope of the rule. If you select nodes from more than one Resource, you will get a warning that the computation might take more time. Other than that, you can select as many Resources as needed; Insights will not block you in any way:
Step 2 - selecting edge type and hierarchy level to analyse
Similarly to the previous rules, you need to select whether the rule will analyse only DIRECT edges or both DIRECT & FILTER edges by selecting the proper edge type icon at the top of the dialog.
The next parameter, Hierarchy Level, defines at which hierarchy level the dependency is evaluated. The options are Column level, Object Level (Table, File, View, etc.), and Group level (Database, Schema, Directory, etc.).
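The hierarchy level effectively determines how far the column-level lineage is rolled up before dependencies are evaluated. As a rough illustration, assuming slash-separated repository paths in the spirit of the /Oracle/ORCL/DWH example above (an assumption made for this sketch, not documented behaviour):

```python
# A hypothetical fully qualified column path.
column_path = "/Oracle/ORCL/DWH/CUSTOMERS/EMAIL"

def at_level(path: str, level: str) -> str:
    """Truncate a column path to the requested hierarchy level (illustrative only)."""
    parts = path.strip("/").split("/")
    keep = {
        "Column level": len(parts),      # keep the full path
        "Object Level": len(parts) - 1,  # drop the column, keep the table/file/view
        "Group level": len(parts) - 2,   # keep the database/schema/directory
    }[level]
    return "/" + "/".join(parts[:keep])

print(at_level(column_path, "Column level"))  # /Oracle/ORCL/DWH/CUSTOMERS/EMAIL
print(at_level(column_path, "Object Level"))  # /Oracle/ORCL/DWH/CUSTOMERS
print(at_level(column_path, "Group level"))   # /Oracle/ORCL/DWH
```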
Step 3 - selecting Dependency mode and Job types
Finally, in the last part of the input dialog, you can select which type of dependency you want to analyse and which Job types you are interested in.
Possible dependencies between particular jobs are as follows:
- WRITE DEPENDENCY — The jobs are reported if they write to the same target asset.
- READ DEPENDENCY — The jobs are reported if they read from the same source asset.
- PROCESS DEPENDENCY — The jobs are reported if one writes to the target asset from which the other reads.
- ANY DEPENDENCY — Performs all three previous dependencies and combines the results.
- NO DEPENDENCY — The jobs are reported if they are entirely independent of each other.
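In terms of the source and target assets of each job, the dependency types above can be thought of roughly as in the sketch below; the jobs and assets are made up, and this simplification is not the actual Insights implementation (ANY DEPENDENCY would simply combine the first three checks).

```python
# Hypothetical jobs described by the assets they read from and write to.
jobs = {
    "load_sales":   {"reads": {"STAGE.ORDERS"},  "writes": {"DWH.SALES"}},
    "load_returns": {"reads": {"STAGE.RETURNS"}, "writes": {"DWH.SALES"}},
    "build_report": {"reads": {"DWH.SALES"},     "writes": {"RPT.REVENUE"}},
    "load_hr":      {"reads": {"STAGE.HR"},      "writes": {"DWH.EMPLOYEES"}},
}

def dependencies(a: str, b: str) -> set:
    """Return the dependency types holding between two jobs (simplified)."""
    deps = set()
    if jobs[a]["writes"] & jobs[b]["writes"]:
        deps.add("WRITE DEPENDENCY")    # both write the same target asset
    if jobs[a]["reads"] & jobs[b]["reads"]:
        deps.add("READ DEPENDENCY")     # both read the same source asset
    if jobs[a]["writes"] & jobs[b]["reads"] or jobs[b]["writes"] & jobs[a]["reads"]:
        deps.add("PROCESS DEPENDENCY")  # one writes what the other reads
    return deps or {"NO DEPENDENCY"}    # fully independent jobs can run in parallel

print(dependencies("load_sales", "load_returns"))  # {'WRITE DEPENDENCY'}
print(dependencies("load_sales", "build_report"))  # {'PROCESS DEPENDENCY'}
print(dependencies("load_sales", "load_hr"))       # {'NO DEPENDENCY'}
```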
Displaying the rule result
The result of the rule is displayed on a new page that has the same form and features as for the previous rules, offering a list that can be exported and filtered. If a node is selected in the table, detailed information is shown on the right side of the window.
The resulting table is a list of Start Jobs; for each of them, you can expand all dependent jobs, including information about the dependency type.
FAQ
Error: Out of Heap Space Memory
Although we are constantly improving the Lighthouse algorithms, the evaluation may fail because the heap space memory runs out while evaluating a large repository. Here are a few things you can try.
- Limit Lighthouse to only run on the part of the repository specified by paths; for example, by adding the input parameter target-paths = /Oracle, /Talend/prod.
- Allocate more memory to the Server component and restart Server. See Manta Flow Memory Settings: Server.
- Submit a ticket and attach the following log files.
  - <MANTA_HOME>/server/logs/manta-dataflow.log
Cannot Visualize the Result from the Highly-Connected Elements Rule
After getting a response with the assets detected by the highly-connected elements rule, it may be impossible to visualize them in Manta Viewer. Sometimes these assets (especially those with high count values) cannot be visualized (or it takes a long time) in Manta Viewer because they have too many connections to other assets. In such cases, it helps to locate the node in Repository Search / Homepage and then change the visualization parameters, for example, by decreasing the Steps Displayed or ticking off Visualize Indirect Flows.