Advanced topics for DB2 Data Warehouse Edition users, Part 1: Crash recovery utility for DB2 Data Warehouse Edition SQL Warehousing Tool

Recovery for process instances and runtime metadata

Protect your data warehouse environment using a new, downloadable crash recovery utility. With this tool, perform health checks and recover any inconsistent IBM® DB2® Data Warehouse Edition SQL Warehousing runtime metadata caused by unhandled interrupts. After the tool corrects the inconsistent metadata, the interrupted process instances are available for restart or termination.

Share:

Marion Behnen (mbehnen@us.ibm.com), DB2 Data Warehouse Edition SQL Warehousing Development, IBM, Software Group

Marion Behnen is a Senior Software Engineer within the IBM Software Group and works as a technical lead for the Data Warehouse Edition. She has worked as an architect and designer for the SQL Warehousing Runtime and Administration, which included the runtime metadata and statistics. She is a member of the DWE architecture board and participates in future DWE product strategy. She has more than 20 years of experience with database application development, business reporting, data warehousing, and business process integration. Prior to joining IBM, she was involved with many aspects of business process and data integration, in particular for the manufacturing industry.



Cheung-Yuk Wu (wu@us.ibm.com), DB2 Data Warehouse Edition SQL Warehousing Development, IBM, Software Group

Cheung-Yuk Wu has over 15 years of relational database tool and Business Intelligence application development experience in DB2, Oracle, Sybase, Microsoft SQL Server and Informix on Windows and UNIX platforms at IBM. She was the architect for the IBM DB2 Data Warehouse Center and Warehouse Manager. She also developed products including Tivoli for DB2, IBM Data Hub for UNIX, and QMF. She was also a DBA for DB2, CICS and IMS at the IBM San Jose Manufacturing Data Center. She received her Bachelor of Science degree in Computer Science from the California Polytechnic State University, San Luis Obispo.



18 January 2007

Introduction

DB2 Data Warehouse Edition (DWE) Administration Console is a WebSphere®-based application that provides the capability to deploy and manage DWE SQL Warehouse (SQW) applications. Each application may contain one or more processes to perform data transformation steps to build and maintain data warehouses.

Once an instance of a process is started, it travels through different states before it is finished. Each state is recorded in the SQW metadata so that you can monitor the status of the running instances. While the process is executing, certain types of interruptions, such as a power failure during a state change, can cause the SQW Runtime to lose track of an instance -- the failure might not be recorded and the instance will thus be left in an inconsistent state.

The crash recovery utility performs a health check and recovery of the inconsistent SQW Runtime metadata caused by such unhandled interruptions. After the metadata is corrected by the tool, you can resume or terminate the failed instances as appropriate.

Note: An uncaught interrupt might have left inconsistencies in the status of source or target data, not just SQW Runtime metadata. The recovery tool does not tackle inconsistencies in source or target data, only in the SQW Runtime metadata.

This article illustrates the process instance state transitions and describes how inconsistent process instance metadata is repaired.


Understand the process instance states

The DWE Administration Console provides a Web-based client to install, manage, and administer SQW applications and processes (SQW Admin Console). Once an SQW application has been installed, you can start running process instances for processes that are packaged in the installed SQW application. Table 1, below, summarizes the various process instance states in SQW Runtime:

Table 1. Process instance states
StateFinal stateDescription
Not yet startedNoA process instance has been submitted for processing (only set if the instance is triggered manually)
StartedNoThe process instance has been registered and prepared for execution
RunningNoSQW Runtime is executing the process instance
SuccessfulYesThe process instance has finished without errors
Finished with WarningYesThe process instance has finished with warnings (at least one activity has failed, but is connected with an unconditional branch to the next activity)
StoppingNoUser has requested to cancel a running process instance
StoppedYesThe process instance has been cancelled
SuspendingNoUser has requested to suspend a running process instance
SuspendedNoThe process instance has been suspended; a suspend checkpoint has been written
ResumingNoUser has requested to resume a failed or suspended instance
FailedNoThe process instance has failed; a failure checkpoint has been written, which can be used to resume or terminate the instance
TerminatingNoUser has requested to terminate a failed process instance
TerminatedYesThe failed process instance has been terminated

During the execution of a process instance, the process instance state changes based on the progress of the execution. A process instance state can be categorized as transient, final, or intermediate.

Transient process instance states indicate that the process instance is expected to move to the next process instance state without any user intervention. Transience, however, offers no indication for the duration that a process instance will stay in that state. Transient process instance states:

  • Running
  • Stopping
  • Resuming
  • Suspending
  • Terminating
  • Not yet started
  • Started

Intermediate process instance states indicate that the process instance is not expected to move to the next process instance state without any user intervention. A user action is necessary to move process instances to the next state. Intermediate process instance states:

  • Failed
  • Suspended

Final process instance states indicate that the process instance has completed, and there are no process instance states in which to move. Final process instance states:

  • Finished
  • Finished with Warning
  • Stopped
  • Terminated

Process instance life cycle

Figure 1 describes the life cycle of an instance. As soon as you submit a process instance in the SQW Admin Console, it is marked Not yet started. This state is only used for instances that are started manually, not for scheduled instances. It bridges the time until the instance is picked up for execution. When a manual or scheduled request is received by the SQW Runtime, the instance is marked as Started.

When all the parameter and validation checks are completed, and the first activity in a process instance is ready to start, the process instance is marked as Running. During the running period, you can change the process instance state using the "Suspend" or "Stop" actions in the SQW Admin Console. When a process instance has finished, it is in a final state, and no further state change is allowed.

If the process fails, you can correct the errors with necessary cleanup, and restart the failed process instance. If you decide to permanently terminate the failed instance, the "Stop" action will mark the failed instance as Terminating, and the instance will be cancelled and then marked as Terminated.

Figure 1. Instance life cycle
Instance Life Cycle

To illustrate how a process instance is transitioned, Figure 2 includes a state diagram that describes how a process instance can move from one state to another state before reaching its final state. Note: Only processes in a final state can be removed from the SQW Admin Console.

Figure 2. Instance state transition
Instance state transition

User-initiated state changes

You can change certain process instance states to help you efficiently manage the active instances. You can take these actions:

  • Stop instances in Not Yet Started, Started, Running, Suspended, and Failed states
  • Resume instances in Suspending, Suspended, Failed, and Stopping states
  • Suspend instances in Running state

Note: Switching between transient states is possible until the state change request is picked up for processing by the SQW Runtime. This option can be used to quickly undo an action, for example, if a "stop" request has inadvertently been issued for a running process instance.

State-change request rules

In most cases, a state-change request is processed after the execution of the current activity either completes successfully or fails. A stop request for an instance in state Running will instead be applied after the current code unit within the current activity has been executed.

Thus, if you initiate a request to suspend a process instance and the last activity for this instance is currently executed, the state-change request is ignored; the active activity will continue until it completes successfully or fails, and the process instance status is updated accordingly.

To summarize: If a "Stop" or "Suspend" action is issued and:

  • the current activity fails, the request is ignored and the state is set to Failed.
  • the last activity is running, the state is set according to the result of the process execution.

When a process instance is started, SQW Admin Console and SQW Runtime update the process instance to show its progress. The following are the intermediate and transient states that are initiated by the system:

  • Started
  • Running
  • Failed
  • Suspended

All final states are managed by the system as well.


Recovery utility overview

Depending on the complexity of a process instance and the number of activities and code units packaged in the process, it may take a considerable amount of time to finish the instance execution. While the process instance is running, there may be circumstances where the process instance is interrupted unexpectedly (for example, if a power failure or a disk error causes a hard crash of the DWE Administration Console).

As a result of these interruptions, the SQW Runtime metadata may be in an inconsistent state and cause some process instances to be left in an unmanageable state. The recovery utility (see the Download section) is designed to correct the SQW Runtime metadata and put inconsistent process instances in a state that allows to either resume or terminate them.

Note: The tool is provided as-is and has been tested on Windows clients. However, this does not preclude using this tool against DWE installations which reside on supported UNIX platforms.

Dependencies

This recovery utility has a very minimal requirement on the server where it is to be installed. The goal is to enable you to recover the SQW Runtime metadata from any system that has Java (the JRE version should be at least 1.4.2) and DB2 client connectivity to the SQW control database. The user who invokes this tool is usually the DWE Admin Console administrator, who must have connect authority to the SQW control database and SELECT, INSERT, UPDATE, and DELETE privileges on the SQW metadata control tables.

The recovery utility provides a function to find inconsistent instances and to repair inconsistent instance metadata from a list of instances or one instance at a time. When running the tool without restriction, all instances in a non-final state are considered to be inconsistent instances. It is, therefore, important to apply the tool with caution.

Instructions

  1. Verify that your Windows-based computer contains, at minimum, the DB2 Runtime client and a valid Java Runtime Environment.
  2. Download the file dwe911_sqw_recovercdb.zip (see the Download section), and unzip it into your local directory.
  3. Extract the dwe911_sqw_recovercdb.jar file, which contains the class files required to use this tool. (Review the README file for latest updates.)
  4. Set up classpath on Windows:

    set classpath=%classpath%;%DWEROOT%\dwe911_sqw_recovercdb.jar; \
    %DB2HOME%\sqllib\java\db2jcc.jar;%DB2HOME%\sqllib\java\db2jcc_license_cu.jar;

    where
    %DWEROOT% is where you copy the download file to, and
    %DB2HOME% is your DB2 install directory.

  5. Display command syntax on Windows:

    java.exe com.ibm.datatools.etl.dis.MDBRecovery.RecoverMDB help
  6. List all inconsistent process instances (on Windows platform):

    java.exe com.ibm.datatools.etl.dis.MDBRecovery.RecoverMDB preview -hdb_server_hostname-rdb_server_port# -dsqwctrl-udweuser -p dwepwd-lc:\temp\recovermdb.log
  7. You can repair either all instances in a non-final state, one instance at a time, or a list of instances from a file. When using the tool with multiple instances, review the preview list to ensure no active process instances are affected. The tool treats instances in a non-final state as inconsistent instances left from the last interruption and, thus, might cause a conflict if applied to an actively running instance.

    (For a detailed list of the command syntax and arguments, see the README file provided with the utility.)

    1. Select the scope:
      • If you want to repair a single instance, provide the application name and instance name as arguments to the tool.
      • If you want to repair a list of instances, prepare a file containing the list of instances. Each entry should have the application name, followed by a comma delimiter, and then the instance name. Provide the filename as an input argument to the tool.
      • If you want to repair all instances in a non-final state:
        1. Ensure that no new instances can be started. (This will not affect existing instances!)

          Note: New instances can be started by the scheduling application or manually, for example, through the DWE Admin Console. To prevent a manual start of an instance, consider disabling the DWE Admin Console or revoking the "start process" authorization in the permissions file. To prevent new instances being started by the scheduling application, consider disabling the scheduling application or disabling the affected schedules. Another option is to disable the affected SQW applications; this will prevent running a new instance.

        2. Stop or suspend all SQW process instances that are running on the server and are not in the list of inconsistent instances
    2. Make a backup copy of the SQW control database.
    3. Run the recovery tool.

      To run the tool for a single instance, for example:

      java.exe com.ibm.datatools.etl.dis.MDBRecovery.RecoverMDB run -h db_server_hostname -r db_server_port# -dsqwctrl-udweuser-pdwepwd-amyapp-imyinst-lc:\temp\recovercdb.log-t

  8. Review the output log to ensure no errors are reported.

    You can supply the log file using option -l. The default log location is %your_home_dir%\recovercdb.log. You should keep this log file for future reference since it records all fixes made to the SQW Runtime metadata.

  9. Enable starting new instances and/or resume suspended instances as necessary.

Understanding inconsistent process instances

Depending on their state, the recovery tool modifies inconsistent process instances very differently.

Instances with Finished, Finished with Warning, Stopped, and Terminated states are already in a final state and not relevant in this context.

Instances with Not yet started and Started states have not started processing the activities in the instance. They are marked as inconsistent and resolved by being moved to the Failed state and adding an associated failure checkpoint.

Instances with Failed and Suspended states have already finished processing the current activity. They are checked for an existing checkpoint entry and resolved by adding the appropriate checkpoint, if it is missing.

Instances with Terminating state are resolved by removing any remaining checkpoint entry and moved into Terminated state.

The next sections describe the steps taken by the recovery utility to resolve inconsistencies for instances in states Running, Suspending, Stopping, and Resuming.

Note: It is not possible to determine in advance whether the current activity is the last one in a flow because not all activities in a process instance will run for each instance. All activities, except for the start activity, are triggered based on the finish status of the preceding activity. Therefore, for process instances in an inconsistent state, there is no special handling for the last activity that would otherwise be taken when a stop or suspend request is processed.

Running

If the inconsistent process is in the Running state, the instance status will be changed to Failed and will include an appropriate failure checkpoint, which enables you to resume or terminate it later.

The steps needed to clean up the instance depend on the state of the activities and code units in the instance. If there is no activity with state Running in an inconsistent process instance, the instance was either between activities or finishing up the running process when the interrupt occurred.

Note: The following is a list of scenarios for the processing tasks at the time of an interrupt that can cause an inconsistent process in the Running state:

  • A unit is currently executing for an activity on the success path
  • A unit is currently executing for an activity on a failure path
  • A unit is currently executing for an activity of the cleanup path in the process
  • An activity is currently executing a step between code units in the activity
  • The instance executes a step between activities; no activity is marked as Running
    (This is the only case that is guaranteed to have no inconsistency in source or target data caused through the execution of this instance; in all other cases, there might be incomplete paths in a dataflow EPG)

If a checkpoint is available, the tool:

  1. Ensures that the checkpoint is set to failed.
  2. Sets the instance state for units and activities in the Running state and sets the process instance state to Failed.

If no checkpoint is available and no activity is in Running state (the instance was between activities or at the end of the instance), the tool:

  1. Finds the last activity in the instance statistics that belongs to the "success" path.
  2. Inserts a failure checkpoint for this activity with mode "skip."
  3. Issues a warning that error handling for this activity might not be complete (the activities on the failure path might not have been completed), if this activity is in the Failed state.

If no checkpoint is available and an activity is in Running state, the tool inserts a failure checkpoint record and updates the process, activity, and code unit states.

Figure 3 illustrates the decision points for instances in Running state:

Figure 3. Inconsistent process instance in Running state
Inconsistent process instance in Running state

Suspending

The Suspending state is set when you request to suspend a running process instance. This conveys the information to the SQW Runtime to end the current processing with the option to resume the instance again at a later time. As long as this request has not been fully processed, the actual status of the process instance is the same as if the instance were in status Running. The approach for the recovery utility to clean up an instance with status Suspending is thus similar to that for instances with state Running, except for the treatment of suspend checkpoints.

If a suspend checkpoint record is found, the last processed activity in the inconsistent process has already been completed, and it is sufficient for the utility to update the status of the process instance to Suspended.

If no suspend checkpoint record is found, the process instance at the time of the interrupt was either processing an activity or had not completed inserting a checkpoint record. The recovery utility will insert a failure checkpoint and update the status of the process instance and, if necessary, the activity and unit instance to Failed.

Figure 4 illustrates the decision points for instances in Suspending state:

Figure 4. Inconsistent process instance in Suspending state
Inconsistent process instance in Suspending state

Stopping

The Stopping state is set when you request to stop a running process instance. This state change request is processed by the SQW Runtime after the execution of a code unit completes. If a failure occurs during that execution, the state change request is ignored, and the status is set to Failed. Until this processing has been completed, the actual status of the process instance is the same as if the instance were in status Running. Therefore. the approach for the recovery utility to clean up the inconsistent instance matches the steps for those in status Running, except for the following special considerations:

  • If a failure checkpoint is available, the checkpoint will be removed, the summary statistics will be updated to reflect the final state of the instance, and a Terminated state will be set for the process instance.
  • If no failure checkpoint is available, one of the following is true for the process instance at the time of the interrupt:
    • None of the activities has started
    • An activity is currently processed
    • The instance is executing a task in between activities

Resuming

The Resuming state is set when you request to restart a suspended or failed process instance. A request to resume from a transient state does not result in a Resuming state because it is immediately applied; that is, the status of such an instance is immediately set back to Running.

If the state Resuming is encountered for an inconsistent process instance, it indicates that the request to resume a suspended or failed instance has been interrupted. To resolve this, the recovery utility will do the following:

  • If a checkpoint record is found, the recovery tool sets the status of the instance back to Suspended or Failed as appropriate, which enables you to resume the instance again.
  • If no checkpoint record is found, the instance is treated by the recovery tool as if it had the status Running.

Summary

Process instances move through different states reflecting the progress of the execution. Due to certain failure situations, the system might not receive control to update the status and thus leave the instance in an inconsistent state. The recovery utility is a tool that you can run to check for and repair inconsistent process instance metadata in the DB2 Data Warehouse Edition SQL Warehousing Runtime control database. You can use the log file to verify the selected inconsistent process instances are repaired and view details about the steps that were taken.


Acknowledgement

The authors would like to thank Bob Rumsby and Gregg Snodgrass for their help in developing this paper.


Download

DescriptionNameSize
Crash recovery utilitydwe911_sqw_recovercdb.zip36KB

Resources

Learn

Get products and technologies

  • Build your next development project with IBM trial software, available for download directly from developerWorks.

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=189049
ArticleTitle=Advanced topics for DB2 Data Warehouse Edition users, Part 1: Crash recovery utility for DB2 Data Warehouse Edition SQL Warehousing Tool
publish-date=01182007