Enterprises are increasingly implementing data warehouse architectures using multiple production systems to meet the higher business needs of availability and performance. These dual-active system environments use a secondary production system to provide high availability, immediate fail-over for disaster recovery, and load balancing of data warehouse processes. In order for both systems to be used simultaneously, dual-active systems need a data synchronization method to keep each system up to date with the latest data at roughly the same time. When working with Teradata data warehouses, dual load can be used for data synchronization with low-latency requirements.
In this article, we'll compare some data synchronization methods and explain why dual load is recommended for dual-active configurations. We'll describe the characteristics of a dual-load system at a high level, with emphasis on dual-load monitoring and the Teradata Multi-System Manager (TMSM). Finally, we'll show you how to use the Teradata Connector's TMSM integration features to enable monitoring in your dual-load environment.
Data synchronization methods
There are many ways to synchronize data for multi-system configurations. The following methods are the most commonly used:
- Table copy, which uses backup/archive/restore utilities or Teradata FastExport, FastLoad, BTEQ, or Teradata Parallel Transporter (TPT) scripts to perform a partial or full table copy. It provides the data synchronization solution when the data latency between the systems can be up to hours or days.
- Replication services, which requires the initial full table copy and incremental delta data update. The delta updates are provided through Teradata's partner company GoldenGate. GoldenGate provides a change data capture solution for Teradata databases. GoldenGate keeps each system synchronized by capturing data changes of one system and applying them to the other. It provides a low-latency solution for low and medium data volumes.
- Dual load, which uses an extract, transform, and load (ETL) tool to load a large amount of data into multiple systems in parallel. Dual load is the best option for loading dual systems with a large volume of external data feeds and low-latency data synchronization requirements, which is typically the case for mission-critical data and applications in a dual-active system.
Choosing the right method is an important decision for an enterprise, and it's likely you'll employ not just one but many of these methods based on your specific business needs. You'll want to look at many variables to decide the right one to use for each of your data warehouse's requirements. For example, you'll want to consider frequency of data change, latency goals, system size, implementation effort, existing source and load processes, and other factors to determine the correct strategy.
Why dual load?
For synchronization of large volumes of data with low latency, the preferred method is dual load. The goal behind dual load is to minimize the user's view of data latency between multiple systems. Each system in the environment is simultaneously loaded with the same data by an ETL application and from a single-source extraction. In a data warehouse, we typically see large amounts of data regularly loaded, but the data already residing in the warehouse (used for historical reporting and analysis) doesn't change often. For loading this type of data, like tables with large daily batches, dual load is the most resource-efficient way to ensure data availability on both systems. The ETL application needs to extract and transform the data only one time in the load process, then use that single source of data for loading into each system. For other types of data in your warehouse, where latency may not be as important, table copy or replication would do the job and with less effort than implementing synchronization of that data as part of a dual load. Dual load is used for high-volume and low-latency feeds, but can (and probably will) co-exist with other synchronization techniques in your warehouse.
Functions of a dual-load system
While the primary function of the ETL application in a dual load is to write the data into each system, an optimal dual-load solution will leverage the ETL application's support for parallelism, recovery, checkpoints, heterogeneous sources, and monitoring.
Enter IBM InfoSphere DataStage® 8.7 and the Teradata Connector.
The InfoSphere DataStage parallel processing capabilities and Teradata Connector integration with Teradata Parallel Transporter (TPT) allow a job to load data across multiple processes using multiple TPT driver sessions. Load jobs can scale as the volume of data increases, and sessions can be spread across multiple ETL nodes on both symmetric multiprocessing (SMP) and massively parallel processing configurations (MPP). For check-pointing, job sequences in DataStage can be configured to save checkpoint information so that in the event of a failure, only the remaining steps required in the job are restarted. Similarly, the Teradata Connector has built-in checkpoint and restart support with the Teradata load utilities.
For Information Server 8.7, in addition to TMSM integration, the Teradata Connector adds support for:
- LDAP authentication
- Certification with Teradata server 14.0
- Bulk-mode auto-restart
- No primary index tables
- UTF16 support for bulk load
If you're familiar with DataStage, you've seen its comprehensive support for data sources, including relational databases, real-time systems, file access, and much more. Each piece is an important part of dual-load architecture, but we will focus here on the latest enhancements to the connector to support dual-load monitoring and integration with TMSM.
Monitoring and TMSM integration
TMSM is used to monitor, administer, and control each Teradata system in the environment from a single interface. Applications can send events to TMSM to report the progress, health, data-manipulation language (DML), and other vital information of the dual-load jobs during job execution. You can then view these events, thresholds, or alerts and status of each system through Teradata's web-based Viewpoint Portal. TMSM collects a vast amount of information from each system, and the Viewpoint Portal presents it in a web-based graphical interface. When it comes to monitoring your multiple system load processes, you're only as good as the information the ETL application sends to TMSM. When using DataStage 8.7 in your dual load, the Teradata Connector will send comprehensive, accurate, and the most appropriate information to the TMSM.
In DataStage 8.7, the Teradata Connector can be configured to use the TMSM client interface for dispatching events, reporting progress, and issuing alerts during a dual load to a TMSM server. The connector will dispatch detailed events about when the load process starts, each step in the load, and when the load process ends. Each connector participating in the dual load can be configured to be part of the same unit of work in the TMSM, and therefore expected to execute the same set of events for a complete synchronized dual load. The TMSM will establish interdependencies between dual-load processes that are part of the same unit of work and reflect it in the Viewpoint Portal. When the processes are part of the same unit of work, TMSM will compare values at each event to verify that the same work was executed on both Teradata ecosystems and that each step generated the same results. Differences between the load processes could trigger an alert message to the user. This helps ensure that after the dual load, the systems are synchronized.
Enabling TMSM event tracking in the Teradata Connector
For InfoSphere DataStage 8.7, the Teradata Connector supports dispatching load events to a TMSM. The following sections walk through the configuration, job design, execution, and results for a job which uses TMSM monitoring with the Teradata Connector.
In addition to the Teradata Tools and Utilities (TTU) and Teradata Parallel Transporter (TPT) packages, you need to install the TMSM client 13.10 or newer on the DataStage engine tier to run jobs with TMSM dispatching. Jobs that do not use TMSM dispatching do not require the TMSM client. For more information about configuration and installation, see the "Installation and configuration requirements for the TMSM integration" chapter of your Information Server documentation.
When you install the TMSM client, you will be prompted to enter the host and port information of your TMSM server. The installation will create services for the TMSM monitor, configuration files for ActiveMQ JMS, and set up the environment to dispatch events to your TMSM server. For information about installing the TMSM client, consult your Teradata documentation.
Now let's look at how to configure the Teradata Connector to dispatch TMSM events. Here, we will focus only on the connector stage doing the load and not any stages that will do the extract or transform steps upstream. Each Teradata Connector stage that will do the load into the Teradata ecosystems should have its properties set to enable TMSM event dispatching. Depending on your dual-load setup, these stages may reside in different jobs or even different ETL servers, but the procedure is the same.
To enable TMSM event dispatching for a Teradata Connector load job, set the TMSM Event Options property to Yes. When this is enabled, you will have access to a sub-menu to configure the Unit of work (UOW) ID, UOW Class, and UOW Source properties.
Figure 1. Teradata Connector UI for enabling TMSM integration
UOW ID is a required property and an important part of the dual load. The UOW ID is a unique identifier set on each load process participating in a dual load. This identifier is used by the TMSM to track a unique set of data applied to a system and tells it which load processes to group together so it can verify synchronization. From the DataStage perspective, you would use the same UOW ID in each Teradata Connector stage doing the actual load into the table, but make the value unique to each set of data you load. As mentioned, the stages sharing the UOW ID could be in different jobs, different projects, or even different DataStage installations depending on your dual-load setup. This is transparent to the TMSM, and it will keep track of all load processes with the same UOW ID, regardless of which application is doing the load. (You could have a load process in DataStage and another using the FastLoad utility. If they use the same UOW ID, TMSM will establish dependencies between them.) In our job design in Figure 1, the UOW ID is "boca_backup_02132011," and we load to a server with Teradata Director Program ID (TDPID) "bocatera" (the TDPID would be set in the Server property of the connector). We're using the same UOW ID in another job that will be loading to a server with TDPID of "bocaterabak." Together, these jobs make up our dual load, and we'll execute them in parallel. We will see how these values are used in the TMSM after the load has completed. Notice that there is not a property in the connector to define the TMSM server. The TMSM server address and port are configured during installation of the TMSM client. Event payloads sent by the connector will be routed to the TMSM server by the TMSM services running on the client.
You may want to use a job parameter for the UOW ID if it's likely to change for each job run. With a job parameter, you can use a new UOW ID for each run without having to recompile the job. In addition, if you use a unique UOW ID for each (hourly, daily, monthly, etc.) load, then they will appear as different units of work in the TMSM and can be tracked and recorded separately. However, this is not a strict rule; it depends on what information you find useful in the TMSM. You may find it advantageous to reuse the same UOW ID for a dual-load job.
The UOW Class and UOW Source are optional user-defined properties which, along with the UOW ID and event information, can be sent to the TMSM. For example, the UOW Source can be used to define the name of the system the data is sourced from. If this is useful for you to capture in the TMSM, you can also set this in the connector UI. You can read more about these properties in the "Teradata Connector properties" chapter of your Information Server documentation or Teradata TMSM documentation.
When the job runs, the Teradata Connector will load data to the server using its default behavior. There is no difference in the schema, performance, or TTU/TPT usage in a connector job with TMSM dispatching enabled (TMSM Event Options set to Yes). When event reporting is enabled, in addition to writing the data to the table, the connector will attempt to load the TMSM client library, then dispatch a START event, multiple STEP events at different phases of the job, and an END event to the TMSM server. The START and END events inform the TMSM when a load job has started and finished, respectively. The START event will be dispatched for the first logon and the END event for the last logoff of the job. The STEP events represent intermediate steps in the job and are reported to the TMSM so it can verify that each load process participating in the dual load has gone through exactly the same functions in a job. The Teradata Connector will report STEP events to the TMSM for noteworthy functions in the job. Some of the steps the connector reports are the initial logon, when before SQL statements are executed, when the DML statement completed, and when a commit or checkpoint is executed. For a complete listing of events reported to the TMSM by the Teradata Connector, see the "Teradata Connector support for TMSM" chapter of your Information Server documentation.
While the job is running, you can monitor the load process in the Viewpoint Portal and see which steps have occurred. Figure 2 shows the events dispatched by the Teradata Connector we configured (see Figure 1). You can see that the UOW ID defined in the connector is recorded in the TMSM, along with the server bocatera, the DataStage job name, and other information about the process. Looking at the events, we see five events for this load process: Logon, BeforeSQL_Executed, Insert_Completed, Commit_Executed, and Logoff. TMSM will ensure that any other processes sharing this UOW ID will have the same events reported. If there are any differences, which would indicate the data is not synchronized, the TMSM would issue an alert to the user.
Figure 2. Viewing events dispatched by the Teradata Connector in the Process Details portlet of the Viewpoint Portal
In addition to the event information, the connector will send the Data Manipulation Language (DML) mode, health amount, alert code, and other properties used by the TMSM. A complete listing of TMSM client properties can be found in the TMSM documentation.
When the connector encounters an error during the load, it will dispatch an ALERT event to the TMSM. ALERT events are flagged in the Viewpoint Portal to catch your attention, and the portal will show that the dual load is not in a valid state. The connector will log the error code, error description, and the current step where the error occurred as part of the ALERT event. With this information, you can use the portal to identify on which DataStage server the error occurred, in which job and Teradata Connector stage the error was reported, and the reason for the error, all without ever opening DataStage Director. The TMSM gives you a unified view of your environment, and with the detailed data collected from DataStage, you can monitor all your Teradata Connector loads from within the Viewpoint Portal.
You may find it useful during job design to see which events are
dispatched to the TMSM from inside the DataStage Director log. By default, the
events reported to the TMSM are not printed in the log. But if you're
tweaking a job, it will help to see which events are going to be dispatched without actually having
them sent to the TMSM, a sort of debug mode. To support this behavior, you can set
CC_TERA_TMSM_LOG_EVENT environment variable. For more
information about this variable and its usage, see the Teradata Connector support for
TMSM chapter of your Information Server documentation. Figure 3 shows an example
message in the DataStage Directory that logs the TMSM START event payload.
Figure 3. START event payload logged in the DataStage Director
After the dual load completes, you can check the Viewpoint Portal to verify that each load process finished successfully and executed the same job steps. As you can see in the Process Health portlet in Figure 4, the Viewpoint Portal will group processes with the same UOW ID together. There are two processes with boca_backup_02132011, which is the UOW ID we defined in the connector stage. One of the processes is loading to bocatera, and another to bocaterabak. Since both finished successfully and executed the same steps, there is no alert, and the state of each process appears normal in the portlet.
Figure 4. Process Health portlet showing a successful dual load
We can gather from this information that the dual load completed without any errors and that the servers with TDPID bocatera and bocaterabak are now synchronized. We've looked at the Process Health and Process Detail portlets, but Viewpoint Portal has an array of portlets, each with useful information for you to monitor. You should also take a look at some of the other portlets like Ecosystem Health or Table Health to see what information has been collected from DataStage.
We have compared a few data synchronization strategies and showed why dual load is important for an active-active warehouse configuration. We learned that each synchronization strategy has its advantages and disadvantages, and that choosing the right one depends on many factors, primarily your data volume and latency requirements. It's likely a few of these methods will co-exist in your warehouse.
We briefly discussed some of the characteristics of a dual-load environment and saw the importance of having a dual-load environment that leverages an ETL application's support for parallelism, recovery, checkpoints, data transport methods, and monitoring. We learned how IBM InfoSphere DataStage 8.7 and the Teradata Connector support a dual-load environment's requirements, including its latest enhancements for dual-load monitoring and TMSM integration.
Finally, we walked through the job design, execution, and completion of a Teradata Connector job configured to dispatch events to the TMSM. We looked at screenshots of how to configure the connector UI for dual-load monitoring, and showed the data collected from the connector. The connector sends the TMSM a vast amount of information about the load process, allowing you to monitor all your load jobs.
The TMSM and Viewpoint are most useful when they have good integration with the ETL application executing the load. The more accurate, up to date, and pertinent the data collected by the ETL application, the more value added to your monitoring. InfoSphere DataStage 8.7 already supports dual load with its high-performance parallel capabilities and checkpoint mechanisms by integrating with the TPT interface. By leveraging the Teradata TMSM client interface, the connector can now report an extensive set of events and alerts to the TMSM, and adds monitoring support for multi-Teradata system environments. InfoSphere DataStage 8.7 and TMSM seamlessly integrate to make a powerful and easy to use solution for monitoring your dual-load jobs.
- See "Use InfoSphere Information Server to integrate Teradata data" for a detailed look at the Teradata Connector features, including restartability and checkpoints for bulk loading.
- For more information about InfoSphere Information Server 8.7, browse the InfoSphere Information Server 8.7 Information Center.
- Learn more about the Teradata Multi-System Manager and Teradata Dual Active.
- Learn more about Information Management at the developerWorks Information Management zone. Find technical documentation, how-to articles, education, downloads, product information, and more.
- Stay current with developerWorks technical events and webcasts.
- Follow developerWorks on Twitter.
Get products and technologies
- Build your next development project with IBM trial software, available for download directly from developerWorks.
- Now you can use DB2 for free. Download DB2 Express-C, a no-charge version of DB2 Express Edition for the community that offers the same core data features as DB2 Express Edition and provides a solid base to build and deploy applications.