Optim Test Data Management for Teradata: Load techniques and performance tips

Whether it's about managing database archive life cycle, application decommissioning, or test data management using InfoSphere® Optim®, the performance challenges remain the same. A narrow maintenance window, moving billions of rows, complex architecture, and limited hardware resources all contribute to performance in some way or the other. This article provides an insight into best practices and comparisons of various load mechanisms in Optim support for Teradata, along with performance considerations for the same.

Share:

Mitesh Shah (miteshs@us.ibm.com), Senior DBA/Performance Lead, IBM

Author photo of Mitesh ShahMitesh is the Performance Lead for Optim products and holds a certification in Teradata. He is a Senior DBA and works closely with the Optim architecture and development teams. His focus is mainly on Optim performance profiling and troubleshooting, performance scalability testing using large volume data, and supporting customers to achieve the best in Optim performance across various platforms. He also holds certifications in Oracle and DB2 for Linux, UNIX, and Windows databases, and is actively researching techniques to improve product performance.



01 November 2012

Introduction

Teradata is one of the most widely used relational database management systems (RDBMS) in a warehousing and business intelligence environment. One of the common business drivers is to load large amounts of data in a given maintenance window. This article covers the most efficient methods to accomplish this by leveraging Optim and the Teradata load utilities, thus providing a wide range of support for a smooth data movement. This is a must read for all of the Optim administrators, Consultants, Teradata DBA, and analysts to understand and apply the concepts in a data warehousing environment.

From a technical perspective, some of the load considerations are as follows.

  • What kind of Optim LOAD request would be the best choice?
  • Is there LOBS data being loaded?
  • What kind of indexes does the table have?

Recommended training and knowledge

It is assumed that the user has a basic understanding of the Optim installation and configuration, so this article will not be covering them. It is also assumed that Optim DBALIAS connection is already defined to a Teradata server.

Optim requirements overview

The following components should be installed prior to Optim configuration with Teradata.

  • Infosphere Optim Test Data management 8.1 or higher.

  • Access to a Teradata database v12 or higher.

  • Teradata base support: Shared ICU libraries, GSS client, CLIv2, Preprocessor 2 (PP2).

  • Optim loader support for Teradata: Named Pipes Access Module, PIOM, Multi-Load, FastLoad, BTEQ.

  • Optim support for Teradata Parallel Transporter: TPT Components, TPT API, TPT Build, TPT OSL, TPT export operator.

  • Security: At the minimum, Optim database users require access to the database objects and the Select privileges to the DBC catalog. GRANT LOGON TO and GRANT SELECT ON DBC TO.


Architecture: Optim support for Teradata

Based on Teradata's parallel architecture, the rows in the table are distributed across all of the access module processes (AMP), which enables high-speed accessibility and allows parallel access for an SQL query. This is also true for DML operations, which translates into data distribution across all the AMPS in parallel. The choice of the primary index is a decision factor for even distribution, and thus, leading to higher performance. This comes into play when using Optim to manage the Archive Life Cycle or Test Data management by creating a subset of production data to be used for testing. The Performance SLAs make it even more important to pick the right configuration and architecture to maximize the resource usage and meet your goals. Optim provides those choices by integrating and exercising the right tools and techniques, as shown in Figure 1.

Figure 1. Architecture - Optim TDM/DG support for Teradata
Optim Architecture

INSERT versus LOAD

This is a common discussion topic and the advantages of Load outweigh the benefits of the Insert in most cases, especially when working with large volumes of data. You can look at Insert from various perspectives, but when the goal is to implement a Test Data management framework by moving data from production to a test environment, Optim plays an important role. When trying to move data to another table in the same database, there are a few mechanisms that could be used natively such as the following.

  • Single or Multiple INSERT statements.
  • CTAS (Create Table As).
  • Insert into Select from: This uses load parallelism under the covers, when the table is empty and is the most efficient. It will bypass the transient journal, thus reducing the logging overhead.

Optim load support in Teradata

Utilities such as (Fastload, Multi-load, BTEQ) can be used to accommodate a variety of use cases. Some take advantage of the block operations instead of plain SQL processing. The following section examines the strengths and features of each.

Fast load (Fastload)

This is the most common and popular utility to quickly load large amounts of data from an external data source into an empty table on the Teradata RDBMS. The sources may be a file on your client system, and it processes a series of Fastload commands and Teradata SQL commands.

The following are the Fastload requirements and advantages from an Optim perspective.

  • Input file can be an Optim Extract or Archive file. Optim will prepare the input data file (.000) and control file (.500) to load the data.
  • The table has to be EMPTY.
  • The path of the fastload.exe has to be set under Product Options -> Load -> Teradata.
  • Sessions parameter can be used to drive the performance. This is dependent on the number of AMPS and should be discussed with the DBA.
  • The target table should not have secondary indexes defined, and there is no journaling.
  • Only one target table can be processed at a time. Teradata Fastload does not support join indexes, foreign key references in target tables, and tables with secondary index defined. It is necessary to drop any of the constraints listed before loading, and then recreate them afterwards.

Multi-load (MLOAD)

This is another popular utility used to load large amounts of data. Unlike Fastload, MLOAD can be used to update, insert, upsert, and delete operations on large tables in bulk mode.

The following are the MLOAD requirements and advantages from an Optim perspective.

  • Multiple tables can be loaded at a time, and duplicate rows are allowed.
  • The Updates operations is in block mode (one physical write can update multiple rows).
  • It uses table-level locks and could provide highest-level throughput by assembling data rows into 64K blocks and writing them to disk on the AMPs. This is much faster than writing data one row at a time like BTEQ.
  • One of the advantages of MLOAD is that it will still continue if the AMP (Access Module Processor) fails, as long as the table is protected by the Fallback setting. When using FastLoad, you must restart the AMP to restart the job. At the same time, you can use the AMPCHECK option to make it work like FastLoad if needed.
  • MLOAD uses the following types of tables under the covers to manage the load process.
    • Error tables will be used to store any conversion, constraint, or uniqueness violations during a load.
    • Work tables are used to receive and sort data and SQL on each AMP prior to storing them permanently to disk.
    • Log table is used to store successful checkpoints during load processing in case a RESTART is needed.
  • Sometimes it might be necessary to house the supporting tables in a particular database. When these tables are to be stored in any database other than the users own default database, then you must give them a qualified name (databasename.tablename) in the script. This can be handled via OPTIM by means of Column or Table Maps.

BTEQ

This is the third option to load data and is mainly used when the source data has LOBS column. While this is rare, it does provide a channel to manage the loading of LOBS data.


Optim load file formats in Teradata

Optim provides a variety of file formats as part of the load process. This gives flexibility from a functional standpoint, and also takes into consideration the performance aspects of the Load. This article will cover some interesting insights and tips via the performance profiling exercise in the lab that will highlight the strengths of the load formats and their applicable use cases. It is very important to understand that these are not benchmark numbers and in no way indicate the actual performance of the product. This is simply to help you understand the mechanics behind the load formats, and how making the right choice can fine tune the overall load performance.

ASCII

This is a default, out-of-the-box, setting and can be used for regular fixed length data.

  • The choice of whether to use Teradata ASCII or Teradata Delimited may come down to performance, and the difference can be marginal.
  • Teradata ASCII is a fixed-width format. Upon conversion to ASCII, if the data is shorter than the maximum width of the column, then the data is padded with trailing blanks. This means more bytes will be sent over the network to the Teradata server, but you do not need to worry about choosing a delimiter.

ASCII DEL

This option allows the choice of delimiter in the load input file.

  • Teradata Delimited is a varying-length format with a delimiter character separating each column value.
  • Data is not padded, however, you must choose a delimiter that does not exist in your column data.
  • If the data includes free-form text such as a description field, then it may be difficult to choose an appropriate delimiter that you can be sure does not exist in the data.
  • There is a performance impact to using Teradata Delimited as well. When character columns are processed, Optim will pre-scan the data and issue an error if the delimiter is present.

ASCII with NO PRESCAN

Optim Load Request offers a rich set of choices for delimiters, including some control characters that are less likely to be present, and some of those choices will omit the pre-scan. This method would provide a better throughput when processing CSV type files into Teradata, as some of the binary delimiters can be used to indicate the bypass of the scan operation.

Named Pipes

This is a popular method to load data because it provides the following benefits.

  • It is not a file format, rather a mechanism to skip the creation of the intermediate Optim conversion file, and load the data directly into the Teradata database.
  • Named Pipes are memory structures that hold data based on FIFO protocol. Data is read from the Optim Extract/Archive file and written directly to the Pipe, which is then written to the table.
  • One of the biggest advantages is that there is no landing of an intermediate Convert file, as it happens with the other file format processing. This provides a performance boost due to reduction in I/O.
  • Named Pipes should be used when storage space on the Optim Server is limited.

Binary Load

Optim 9.1 provides this new file format for loading the data that was archived or extracted in native Teradata format (such as using a native Teradata DBAlias).

  • This format requires no data conversion or pre-scanning, so it should perform better than ASCII or ASCII Delimited.
  • The Loader input file is also smaller in size compared to the ASCII format, thus providing space savings.

Teradata load performance exercise

This section covers details around the actual performance tests from the laboratory, highlighting the benefits of the various load methods. These were executed using TPCH scale 100 data model on a Teradata 1650 appliance. Please keep in mind that the numbers shown in Figure 2 and Table 1 are not benchmarks, rather just a profiling exercise to highlight the advantages of the load method/formats.

Figure 2. Teradata load performance comparison
Performance Comparison of Load Methods
Table 1. Load method performance test
Load method/formatOptim convert (minutes)Load (Minutes)Process time (minutes)% improvementThroughput (MB/sec)
ASCII load 226103329NA1.46
ASCII load - Delimiter 240863261%1.48
Named pipesNANA22630%2.13
Binary load1202014055%3.45
Binary load (4 sessions)118912761%3.80
Binary load + named pipesNANA11964%4.00

Observations and learning points

  • Increasing the number of Fast Load sessions improves the LOAD by almost 50%.
  • Binary Load improves performance by 20-25% over Named Pipes/ASCII load.
  • Binary Load is very effective with tables containing numeric/date columns.
  • The Teradata fallback table has some overhead due to additional processing (3-5%).
  • The No Pre-Scan method could enhance load performance by leveraging binary (hex) delimiter, where the input source data has only Character data. This ensures that the delimiter is NOT part of the actual data, thus eliminating the need to pre-scan the input.
  • Named Pipes performance boost (approximately 20% compared to the ASCII load)
  • Binary Load minimizes Optim Convert overhead by approximately 25% (compared to ASCII).
  • Optim data and temp folder path settings via Optim personal options influence the performance of the overall Load process, and thus should be chosen carefully (Stripped/RAID 1+0 SAN volumes).

Conclusion

Performance consideration is an important aspect of the solution design and architecture of a Test Data management system. This article provides the fundamental load techniques that you must take into consideration when implementing InfoSphere Optim for Teradata.

In summary, you have learned about different LOAD techniques, and the support of various load file formats that enable Optim to achieve the path to performance optimization, and thus, provide the necessary choice to manage bulk data. However, in addition to best practices, application configuration and database tuning should not be ignored, and there might be additional tuning parameters from an architecture standpoint which are beyond the scope of this article.

Resources

Learn

Get products and technologies

  • Build your next development project with IBM trial software, available for download directly from developerWorks.
  • Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=842892
ArticleTitle=Optim Test Data Management for Teradata: Load techniques and performance tips
publish-date=11012012