How to Pre-Process Oracle and SQL*Plus Scripts

This is a guide to using IBM Automatic Data Lineage to scan environments leveraging the SQL*Plus command line tool (executing SQL and PL/SQL code against an Oracle database).

Automatic Data Lineage supports scanning SQL and PL/SQL statements/scripts — but not SQL*Plus or shell or batch scripts with embedded PL/SQL code. Consequently, pre-processing logic is needed to calibrate the scripts for Automatic Data Lineage native scanners. One of the key tenants of Automatic Data Lineage is automation, as we believe it is the most efficient way of managing physical metadata. You can use the guidelines below, as well as a set of pre-processing scripts (i.e., Linux shell scripting), to design an automated approach tailored to your specific environment.

How to Identify SQL*Plus Scripts

SQLPlus scripts can be identified based on the use of specific commands that are not part of PL/SQL but only available in SQL\Plus. You can use the SQL*Plus User’s Guide and Reference as a reference. The most commonly used command is DEFINE, which defines a variable that can be used later on using the syntax &variable_name..

For example:

-- the following two DEFINE commands are SQL*Plus specific syntax
define schema_mid                     = medid_dm
define tblspace_stg_mid               = medid_stg_data

-- use of the variables &schema_mid. and &tblspace_stg_mid. are SQL*Plus specific syntax
create table &schema_mid..intr_4m_member_stg
tablespace &tblspace_stg_mid nologging compress parallel &parallel_high
as 
select  /*+ parallel(&parallel_high)*/  
  mbr.member_id                   
, mbr.unique_member_id        
, mbr.subscriber_id   
from &schema_mid..intr_4m_membership mbr';

Approach to Analyzing Data Lineage

To analyze the lineage of such scripts with Automatic Data Lineage, we recommend doing the following.

Any code in the scripts that has no impact on lineage can simply be commented out.
Currently, unsupported conventions such as Oracle data pump scripts will need to be captured via Automatic Data Lineage custom metadata import files.
Variable name declarations (all variables in the script preceded by an ampersand “&“ and not in single quotes) can be inserted into Automatic Data Lineage replace.csv file (Manta Flow Scanner (Client) Configuration: Placeholder Replacement in Input Scripts).
- Ideally, the process of generating the replace.csv file should be automated to minimize manual work. However, this is highly dependent on the consistency of the coding practices applied when creating such scripts in the past.
- Simple static variable definitions such as
```
define schema_mid                     = medid_dm
```
  in the scripts can be automatically extracted from the scripts themselves and converted into Automatic Data Lineage replace.csv file. Here’s a naive implementation of such pre-processing automation.
  
  generate_replacecsv_for_variables_in_sqlplus_script.sh
  - Also, multi-level resolution needs to be taken into account.
```
DEFINE schema = dwh 
DEFINE suffix_src = prod
DEFINE myschema = &schema
DEFINE target = &schema._.&suffix_src
```
Remove/replace any SQL*Plus-specific syntax affecting lineage; for example, the concatenation operator “.“
Inline any scripts called from the analyzed script to ensure proper lineage analysis.
For “one-offs” and/or unrecognized syntax/conventions, Automatic Data Lineage can recommend overall best practices in their script code so that they (and those implemented in the future) are compatible with the Automatic Data Lineage platform. Such cases can be identified during future/ongoing script analysis.

Approach to Analyzing Process Lineage

Process lineage is not a concept generally captured by Automatic Data Lineage OOTB scanners, but it can be nicely modeled using metadata import via Open Manta Extensions. The high-level idea is to capture the calls and inclusions of one script to another to capture the dependency. We recommend creating a script that:

Scans for all SQL*Plus script calls (“@“) and converts them into Automatic Data Lineage custom metadata.
Includes any additional required custom metadata attributes in scripts in the form of comments.
Utilizes pre-processing logic that will then scrape out these comments and insert them into the required Automatic Data Lineage custom metadata import (CSV) files that Automatic Data Lineage can readily ingest.

Example

The preprocessing using replace.csv (Manta Flow Scanner (Client) Configuration: Placeholder Replacement in Input Scripts) file for the sample script above might look as follows with processing the patterns as regular expressions set to enabled:

&schema_mid[.],medid_dm
&tblspace_stg_mid[.],medid_stg_data
(?i)^DEFINE,-- DEFINE