Preprocessing Teradata shell script wrappers of BTEQ & TPT scripts

Preprocessing shell script wrappers of BTEQ scripts

Issue

IBM Automatic Data Lineage requires pure BTEQ scripts as inputs. If BTEQ is wrapped in Linux shell scripts or Windows batch files (which is fairly common), e.g. as follows (the comments describing the structure are usually not present):

# some shell script syntax that we want to get rid of.
# Any shell variables need to be resolved if they are used in pure BTEQ code
FIELD=$1
ENDDATE=`date +'%Y%m%d%H%M'`
echo "Started execution: "> /tmp/SP_WF_ANTERIOR.txt
date >> /tmp/SP_WF_ANTERIOR.txt

# start of embedded BTEQ code that we want to capture; there can be anything after <<, not just EOF
bteq <<EOF
.LOGON $FIELD
INSERT INTO T_$FIELD(a, b, c) SELECT d, e, f from T_Source where load_date > '$ENDDATE';

.LOGOFF
.quit;
EOF
# end of BTEQ code continued by some shell script syntax that we again want to get rid of

echo "Fin: ">> /tmp/SP_WF_ANTERIOR.txt
date >> /tmp/SP_WF_ANTERIOR.txt

then the following is needed:

  1. Only the BTEQ part needs to be extracted

  2. Any shell variables (patterns like $Xxxx or ${Xxxxx}) have to be replaced with some default value so that the script passes parsing. In this case at least $FIELD (the login username, also used as part of a table name in the SQL query) and $ENDDATE (a date used in the SQL query).

  3. The resulting pure BTEQ syntax needs to be passed to the Teradata scanner via the cli/input/teradata/<teradata.id>/bteq folder.

There are multiple possible approaches to implementing this. Two solutions are proposed below:

Solution using replace.csv

Automatic Data Lineage provides a built-in find & replace capability (see Manta Flow Scanner (Client) Configuration) for manually provided inputs.

The whole BTEQ extraction process is achievable using the bteqReplace.csv capability by specifying

"(?s).*bteq <<([^ ]+?)\\v(.*?)\\v\\1\\v.*","$2"
"[$]FIELD","FIELD"
"[$]ENDDATE","ENDDATE"

or (probably easier to read)

"(?sm).*bteq <<(.+?)$(.*?)^\\1$.*","$2"
"[$]FIELD","FIELD"
"[$]ENDDATE","ENDDATE"

and setting teradata.bteq.replace.regex=true in the Teradata connection or globally for the Teradata scanner. Note that more placeholders may be used in other scripts outside the attached sample set.
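As an illustration of what these two rule sets accomplish, the same extraction and substitution can be sketched with standard Unix tools. This is a simplified sketch only: the file names are made up, and the fixed EOF delimiter is an assumption (the regex rules above also handle other heredoc delimiters):

```shell
#!/bin/sh
# Toy wrapper in the same shape as the example above (names are illustrative)
cat > wrapper.sh <<'WRAP'
FIELD=$1
bteq <<EOF
.LOGON $FIELD
INSERT INTO T_$FIELD(a, b, c) SELECT d, e, f from T_Source where load_date > '$ENDDATE';
.LOGOFF
.quit;
EOF
WRAP

# 1. keep only the heredoc between "bteq <<EOF" and the closing "EOF",
#    then drop those two boundary lines
# 2. substitute $FIELD and $ENDDATE with literal defaults so BTEQ parsing succeeds
sed -n '/^bteq <<EOF$/,/^EOF$/p' wrapper.sh \
  | sed -e '1d' -e '$d' \
        -e 's/[$]FIELD/FIELD/g' \
        -e 's/[$]ENDDATE/ENDDATE/g' > extracted.bteq

cat extracted.bteq
```

The result in extracted.bteq is pure BTEQ syntax, with the shell wrapping stripped and the placeholders replaced, mirroring what the bteqReplace.csv rules do inside the scanner.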

Limitations

Note that the replace.csv approach above expects a fairly fixed pattern for how BTEQ is wrapped in shell scripts. It also requires providing values for all environment variables/placeholders; if they are created dynamically, this may require considerable effort. If more patterns exist or are used inconsistently, they must be reflected in bteqReplace.csv as well.

  1. Place the input shell scripts in cli/input/teradata/<dict.id>/bteq/[default database]

  2. Place the sample content of bteqReplace.csv above in cli/input/teradata/<dict.id>/bteqReplace.csv

Solution using dummy execution of the BTEQ scripts

To avoid having to provide values for all environment variables manually, it is possible to let the Linux shell do the work by simulating the shell script/BTEQ execution without actually calling the bteq utility, replacing it with the /usr/bin/cat command. If input variables can be replaced with dummy values (on the assumption that they are used for audit purposes only), the procedure is as follows:

  1. Replace the shebang #!/bin/bash with #!/bin/bash -i to enable alias expansion

  2. Add two lines after the shebang:
    alias bteq="/usr/bin/cat"
    export LOG_DIR=.

  3. Fix any environment variables that have not been defined. For example (the following items come from a specific script):

    1. Replace export LOGFILE="${LOG_DIR}/MERCURY_DEALER_MASTER_RECORD_TYPE_LOAD.sh.$$.$(date +%Y%m%d).log" by export LOGFILE="${LOG_DIR}/MERCURY_DEALER_MASTER_RECORD_TYPE_LOAD.sql"

    2. Based on the required date format, replace export P_EXTRACTDATE=$1 by export P_EXTRACTDATE='2001/09/01'

    3. Remove line with source ./Activity_Count_Post_Processing_1.sh "$LOGFILE"

    4. Remove lines calling CALL ${P_MERCURY_T_DB}.CREATEAUDITEVENT_ext - these are not interesting from a lineage perspective

  4. Run MERCURY_DEALER_MASTER_RECORD_TYPE_LOAD.sh, which will create MERCURY_DEALER_MASTER_RECORD_TYPE_LOAD.sql with pure BTEQ syntax

  5. Copy MERCURY_DEALER_MASTER_RECORD_TYPE_LOAD.sql to cli/input/teradata/<dict.id>/bteq

All steps can be fully automated by executing some_preprocess.sh prior to the Automatic Data Lineage scan.
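The steps above can be sketched end to end on a toy wrapper. This sketch uses shopt -s expand_aliases instead of the -i shebang flag (it achieves the same alias expansion without requiring an interactive shell); all file names, variables, and SQL below are illustrative assumptions, not the actual MERCURY script:

```shell
#!/bin/bash
# Generate a toy wrapper resembling the scripts described above
cat > toy_load.sh <<'WRAP'
#!/bin/bash
shopt -s expand_aliases            # alternative to "#!/bin/bash -i"
alias bteq="/usr/bin/cat"          # step 2: "bteq" now just echoes its stdin
export LOG_DIR=.
export P_EXTRACTDATE='2001/09/01'  # step 3: give undefined variables dummy values

bteq <<EOF > toy_load.sql
.LOGON dbc
SELECT a FROM t WHERE load_date > '$P_EXTRACTDATE';
.LOGOFF
EOF
WRAP

chmod +x toy_load.sh
./toy_load.sh          # step 4: produces toy_load.sql containing pure BTEQ
cat toy_load.sql       # step 5 would copy this file to the scanner input folder
```

Because bteq is aliased to /usr/bin/cat, the heredoc body, with all shell variables already expanded, is simply written to toy_load.sql instead of being executed against Teradata.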

Preprocessing Job Variables in TPT

Issue

In TPT scripts, job variables (https://docs.teradata.com/r/nYrvpj~lD1VtAtciDfToww/rlawMc2CvccT_8hUgXpuLg) can be used, e.g. @TGT_WRK_DB in

STEP droping_table
(
APPLY
('DROP TABLE ' || @TGT_WRK_DB || '. R_SUB_SOC_WRK;')
TO OPERATOR (DDL_OPERATOR_1);
);

If the job variables are not defined in the TPT script itself but are supplied externally at runtime, they need to be preprocessed, e.g. using the replace.csv functionality of the Teradata TPT scanner.
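For the sample above, a replace.csv entry could substitute the externally supplied job variable with a literal before scanning. The value WRK_DB is an assumption (use the database name actually passed at runtime), and it is wrapped in single quotes so that it becomes a string literal and the surrounding || concatenation remains syntactically valid:

```
"@TGT_WRK_DB","'WRK_DB'"
```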