A mechanism for process logging and scheduling in database applications

Implementing process logging and scheduling mechanism can be a big challange in complex applications. Tracking the status of an ongoing process and its completion phase can be very important for an application that has many nested processes. Users often need to know about the restartability of processes that have an abnormal termination. Also complex processes and sub-processes may need to be scheduled in such a way that they can be automated and require minimal user intervention to start up. This article explains an easy-to-implement mechanism for meeting these requirements.

Share:

Arindam Chakraborty (arinchak@in.ibm.com), Information Architect, IBM

Arindam Chakraborty photo Arindam Chakraborty is a data architect in the Enterprise Application Development and Management Practice. Mr. Chakraborty has over 13 years experience in database design and consulting with emphasis on technical management; software lifecycle methodologies; database administration, performance, and architectural design; and application development and implementation. He has managed the relationships between Information Services and business unit senior management to plan, develop, and implement numerous applications providing solutions to real business problems. Specific areas of software expertise include an Oracle DBA and DB2 certification gained through competitive examination and driving with consulting approach following IBM proven methodologies. Mr. Chakraborty has expertise in data architecture, data modeling, enterprise data warehouse design and data mart, and database administration with performance optimization.



09 September 2010

Also available in Chinese

Introduction

Data transfer among multiple applications which comprise a variety of sources and involve numerous interdependent processes can be a big challenge to schedule and monitor with proper sequencing. Abnormally terminated processes often need to restart from their failover point to fulfill business requirements. Also, users often needs to receive information about the process history for audit tracking and monitoring.

This article describes a solution approach that has been proven to be extremely useful in meeting these requirements and challenges, and can be built with minimal resources. The solution can be build using IBM® DB2® relational database capabilities or other relational database systems. There is no need for any specific scheduler or expensive monitoring tool to be used to monitor or schedule the complex processes. The approach is extremely cost-effective and requires minimal skill to develop and maintain.

The business problem

Here are some of the business problems related to database processes and scheduling. These problems can affect profitability and prevent users from achieving their functional and operational goals.

  • Abnormal execution and completion of the internal database processes which cannot be tracked outside the database often need trained resources to understand and monitor them from either the database level or the operating system level.
  • Processes which may be completed successfully but require abnormal execution time, resulting in unexpected values, can be a real threat to the business. This may happen even within the lowest level of the dependent processes, resulting in operation failure and malfunction for the other major processes as well. Businesses need to be ensured of the perfect execution of those small dependent processes.
  • Multiple sources may pass data into various processes in different sequences and timelines. These processes must be managed and scheduled properly to avoid data inconsistency and overlapping of processes. This may be a big challenge in multi-format complex data integration projects, to achieve the proper scheduling.
  • Abnormal process termination or unsuccessful completion of a process is a common scenario in the data integration process flow. Data may need to be restored from the failover point either immediately through manual intervention, or automatically, based on the business requirements.

Use cases

The following use cases effectively answer these business problems.

  1. The first use case is, consulting the process history by participants of a process execution.
  2. The second use case is business activity monitoring. Business activity monitoring stores, queries, or analyze process execution logs to find useful statistical information about the processes. For example, this kind of monitoring can reveal how much time is spent on average in each step of the process and where the botttlenecks are.
  3. The next use case is the undo functionality and recovery of certain failed processes. Process logs can be used to implement the undo and recoverability of the unsuccessful processes.
  4. The last use case is the process scheduler, which answers the complex data load frequency.

To solve these problems, you must first identify the components based on their functional nature and designs the data model to store and retrieve essential data elements for the log. Once the components have been realized and the data model has been prepared, the process module can be developed and tested as using the operational and functional test cases. The developed module can be used as plug-in with the existing ETL system and used to insert or update data in the respective database entities. The logging operation starts with a process log footmark in the database and continues logging until the process completes, irrespective of the success or failure of the execution.

Architecture

The system architecture consists of three major components:

  • Process logging component: This fulfills use cases 1, 2, and 3.
    This component is responsible for generating process-related information for a specific process in a module. Every time a new process is initiated, this component logs its information with a predefined execution state. Until the process has been completed successfully or abandoned, the component stays alive and tries to log the latest state of the process in the database. The component interacts with the process scheduling component at the time of process initiation and also with the process error management component to log errors that occurred while processing.
  • Process and program scheduling component: This fulfills use case 4.
    This component controls the initiation sequence of the processes. This is a dependent component that interacts with the process logging component and executes the code associated with the process according to the schedule.
  • Process error management component: This fulfills use cases 2, 3 and 4.
    This component is responsible for generating and updating error data from the predefined error list and updating information to the process logging component.

Design and development

The solution of the problem is very effective especially in data integration and data transformation (ETL) applications. The solution is also effective in other message-driven integration projects where many processes are being executed at regular intervals. Figure 1 illustrates the solution data flows.

Figure 1. Data flow diagram
process starts and places an entry in the process_log tale. Then if succesful, it outputs data or process and sends process completed message back. If not sucessful, updates error log and sends message back to process_log.

The solution has two major parts:

  1. Designing the data model
  2. Designing and building the logic of the operation

The logical data model contains the following entities:

  • Process master
  • Process program
  • Process program schedule
  • Process log detail
  • Error master

Figure 2 illustrates the logical data model.

Figure 2. Diagram of the logical data model
Diagram shows relationship of process master to process program, to process program schedule, to process log. Process log also related to process error master

The following sections examine each of these entities.

Process master

This entity contains the information about all the processes available in the application. This is master information, stored in the table in hierarchical format. That is, one process can have a parent process and can also have a grandparent process as well. The process hierarchy needs to be defined accurately to ensure proper sequencing of the interlinked processes. Other information can be also useful to store, though not mandatory.

In a data warehousing system, numerous processes are involved, such as staging of the source data, data quality analysis, transformation and consolidation, loading data into the datamart, and so on. In order to define and distribute those processes into the Process Master, we can arrange data in the following manner:

Table 1. Process master
Process DescriptionProcess IDParent Process IDComments
Database Process0000000000Database process starts with "00000"
Source Data Staging Process1000000000Out of 3 major processes "Source Data Staging" is the first process whose Parent is "Database Process", this Process ID starts with "1" (10000)
Staging Source System -11100010000There are multiple source systems, and the "Source System -1" got Process ID "11000", which means, system can deal with 9000 source systems
Staging Source System -1 – Subsystem – 11110011000There are many Subsystems linked with the "Source System -1", and thus "Subsystem -1" got Process ID "11100", which means the system can deal upto 900 sub-systems
Staging Source System -1 – Subsystem – 1 – Europe Data1110111100This sub-system runs in different countries, and source data can be obtained from those countries as well, here the system can deal upto 90 countries.
Data Transformation Process2000000000Transformation process is the 2nd major process of the parent "Database Process", which starts the ID with "2"
Transform Source System -1 2100020000In the same way like "Staging processes", the Transformation processes can be distributed, all with prefix with "2" determines the "Transformation Process" for each of the "Staging Process"
Transform Source System -1 – Subsystem – 1 2110021000Child of 21000
Transform Source System -1 – Subsystem – 1 – Europe Data2110121100Child of 21100
Hyperion Module Process 9000090000This is a different independent module, and can be taken part in the process master (or can be attached with "Database Process" as well)
Hyperion Module Process – Data Manipulation Process -1 9100090000It has it's own process distribution, and can be scheduled and logged in the output stream as well
Table 2. Process Master Entity Structure
ColumnColumn DescriptionDataType and WidthDefault ValueMandatory or Optional
Process IDContains Unique Process Identification NumberChar 5NAM
Process Parent IDContains Parent Process ID for a specific process and establish a process hierarchiChar 5NAM
Process DescriptionDescription of the processVarchar 200NAM
Process SequenceSequence number of the process in it's hierarchy (say this process is the 5th of it's parent hierarchy, so the value would be "5")Char 2NAO
Process CL ParametersProcess Command Line argumentsVarchar 2000NAO
Process Max Error AllowedHow many error this process should allow (system can look up this number and based on that can terminate the process forcefully)Char 3999O
Application NameName of the Application for which this process has been definedVarchar 1000NAM
Module NameName of the Module for which this process has been definedVarchar 1000NAO
Process Logging FlagWhether the process needs to be loggedChar 1YM
Scheduler FlagWhether Scheduler attached with this processChar 1YM

Process program

A process can consist of one or more programs (code and data structure). Once a process is executed, the system starts searching for the associated programs from the database tables. (Note: This happens only if the scheduler has not been attached for this process. See Scheduler Flag in the Process Master table.) One process can be associated with multiple programs, so, this entity has one-to-many cardinality from process to program.

Table 3. Process-program entity structure
ColumnColumn DescriptionDataType and WidthDefault ValueMandatory or Optional
Process Program Surrogate KeyContains Surrogate key for this Entity IntegerNAM
Program NameContains name of the program (we can separately keep program master Entity and reference program ID here) Varchar 50NAM
Process IDProcess Identification NumberChar 5NAM
Program CL ParametersProgram Command Line argumentsVarchar 2000NAO

Process-Program-Schedule

Each individual program can be scheduled to be executed in some predefined sequence. This entity facilitates the program to be associated with one or more schedules, and also capable of defining and constructing the repetitive process-program information with a specific frequency. (For example, one program can be run at 5 minutes intervals.) This entity has one to many cardinality from Process-Program entity to Process-Program-Schedule entity.

Table 4. Process-Program-Schedule Entity Structure
ColumnColumn DescriptionDataType and WidthDefault ValueMandatory or Optional
Process Program Schedule Surrogate KeyContains Surrogate key for this EntityIntegerNAM
Process Program Surrogate KeyContains Surrogate key from Process Program (Foreign Key) IntegerNAM
Schedule YearContains YearVarchar 4NAM
Schedule MonthContains Month NameVarchar 20NAM
Schedule DayContains DayVarchar 2NAM
Schedule HourContains HourVarchar 2NAM
Schedule MinuteContains MinuteVarchar 2NAM
Repeat FrequencyThis field contains frequency information (i.e. #5 means this program will be executed every 5 minutes from the start time)Varchar 2NAM
Skip ScheduleWhether the program will not be executed in it's scheduleChar 1NAM
Trigger File FlagWhether this program expects a "Trigger File" from another applicationChar 1NO

Process logging

Once a process has been initiated, the system starts searching for the particular program from the Process-Program-Schedule and creates an entry in the Process Log table. The system controls the execution of a process based on the status of the process. In each execution step of a process, there is an attached specific status code, and the system updates the status code field on the basis of the execution status of the process itself. Different status codes are available for a particular process. Table 5 lists the status codes.

Table 5. Process Execution State - Status Code Example
Status CodeStatus DescriptionComments
0Failed with known error codeError Code can be obtained from Error Master
1Process startsWaiting to complete
2Process successfully endsOnly when the process ends without error
3Zero record processedTarget record count is zero
4Zero record in Source EntitySource count is zero
5Process Link ErrorProcess cannot be linked with it's Parent process (in the case of dependent process)
6Abnormal completion timeProcess success but taken abnormal time
7Process progression errorProcess initiated but next record read not happened
8Process restartedProcess restart code
9Failed with UNKNOWN errorError code can not be obtained
*Status of the associated parent process (only when its previous status was "2")Once a child or dependent process start to execute (only possible when it's parent process had been successfully completed and the status was "2") system converts it's parent process status from "2" to " * "
#Process in execution stateWhile executing of the dependent process system converts parent process status from "*" to "#"
$Parent process successOnce dependent process successfully completed, system converts its parent status from "#" to "$". That means no need to process that again.

Process logging algorithm

  1. Start a process
  2. Search the Process-Program-Schedule for confirmation about the schedule and get the program name.
    1. Get the Process ID
    2. Insert a new record in Process Log Entity
    3. Process Status Flag should start with "1"
    4. Execute the program with it's command line parameters
  3. Commit
  4. Upon successful completion of the process:
    1. Update the record with the following detail:
      "Process End Time" = (end timestamp)
      "Process Target Record Count" = (target table record count)
      "Process Status Flag" = "2" (if successful)
      "Process Execution Time" = (end time - start time)
      "Process Comments" = (comments with SQL ERROR code and description)
    2. If this is a dependent process, convert the "Parent Process" status flag from "2" to "*"
    3. Update this parent process status as per the list. This mechanism ensures restorability and continuation of the process and controls the dependent process execution as well.
  5. Upon unsuccessful completion:
    1. Update the same record
    2. "Status Code" should not be "2". Choose appropriate status code from the list.
  6. Commit
Table 6. Process Log Entity Structure
ColumnColumn DescriptionDataType and WidthDefault ValueMandatory or Optional
Process Log IDUniquely identify one Process Log record. This ID may be inserted in the target file / table (in a separate column per record basis) in order to control the next dependent process execution. The next process will access those records only from the table / file which has been processed and was waiting for further transformation. IntegerNAM
Process IDProcess ID which is getting processedChar 5NAM
Process Start TimeThe Start timestamp of the processTimestampNAM
Process Log IDUniquely identify one Process Log record. This ID may be inserted in the target file / table (in a separate column per record basis) in order to control the next dependent process execution. The next process will access those records only from the table / file which has been processed and was waiting for further transformation. IntegerNAM
Process End TimeThe End timestamp of the processTimestampNAM
Process Source File NameName of the file / table of the sourceVarchar 100NAO
Process Source Record CountThe start record count in the source file / table IntegerNAM
Process Target Record CountThe end record count of the processIntegerNAM
Process Status FlagThe status of the process (get from the status list)Varchar 2NAM
Process Start ModuleThe module which starts the process Varchar 100NAO
Process End ModuleThe module where process endsVarchar 100NAO
Process Execution TimeActual process execution time (may not be exactly deduction from end time and start time) DecimalNAM
Process User Log IDThe user who starts the processVarchar 50NAM
Process System StatusSystem (OS / Network) related informationVarchar 2000NAO

Process-Error

This entity has been defined the process related predefined error codes and the description, which are used in a process execution step. When an error occurrs, the system looks for the related error code from this table and updates the Process Log table according to the Process Log ID from the table.

Table 7. Process-Error Entity Structure
ColumnColumn DescriptionDataType and WidthDefault ValueMandatory or Optional
Error CodeContains Error CodeChar 5NAM
Error DescriptionContains Description of the errorVarchar 100NAM
Error TypeWhich type of error: "SQL", "SYSTEM", "DATA", "CODE"Varchar 10NAM
Critical Status5 = most critical, 1= lowest criticalChar 1NAO

Pseudo-code

A sample pseudo-code example is given below. This pseudo-code inserts record in the Process Log table.

Main Procedure "Process Log Insert"
   Define 5 input parameters
   1st parameter is the input "Process ID" [by searching right process from Process-
	Program-Schedule"  Entity ]			
   2nd parameter is the "Start Time" 
   3rd parameter is the "Start Module" 
   4th parameter is the "Mode" [either "Insert" or "Update"]
   5th parameter is the "Process Log ID" (optional - for control the next process 
	execution)
				
   >>Insert a new record in the "Process Log" Entity with the input parameter 
	specified above, use DB2 INSERT statement to insert the record.			
					
   >> Update existing record in the "Process Log"  Entity, use DB2 UPDATE statement
	 to update the record.

Notes

The most important part of the solution is to build the Process Master entity with the proper hierarchy of the processes. Once Process Master has been created successfully, you need to define the Process Program Scheduler data according to the running sequence and frequency of the processes. These two activities are the prerequisites for the overall solution.

Summary

You now should be able to track all processes for the whole application and also the individual nested fine-grained processes as well. This solution is a concept and can be built with any RDBMS (including DB2 or Informix) without having much development effort or resource utilization.

To get optimum performance, you need to create appropriate database indexes in the physical data model on the Process Log table. Periodic archiving of the table is also advisable. Another use of the solution is process restorability, if the application crashes and needs to be restarted. This can be performed based on the Status Code for that particular process entry in the Process Log table. In order to restart a process, the system needs to obtain the latest status code and detailed information about that process.

This solution doesn't offer any interface to load master data or generate report for the end user. The user needs to build them separately. Also, automatic restorability needs to be designed separately.

Resources

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=516615
ArticleTitle=A mechanism for process logging and scheduling in database applications
publish-date=09092010