Teradata is one of the most widely used relational database management systems (RDBMS) in a warehousing and business intelligence environment. One of the common business drivers is to load large amounts of data in a given maintenance window. This article covers the most efficient methods to accomplish this by leveraging Optim and the Teradata load utilities, thus providing a wide range of support for a smooth data movement. This is a must read for all of the Optim administrators, Consultants, Teradata DBA, and analysts to understand and apply the concepts in a data warehousing environment.
From a technical perspective, some of the load considerations are as follows.
- What kind of Optim LOAD request would be the best choice?
- Is there LOBS data being loaded?
- What kind of indexes does the table have?
Recommended training and knowledge
It is assumed that the user has a basic understanding of the Optim installation and configuration, so this article will not be covering them. It is also assumed that Optim DBALIAS connection is already defined to a Teradata server.
Optim requirements overview
The following components should be installed prior to Optim configuration with Teradata.
Infosphere Optim Test Data management 8.1 or higher.
Access to a Teradata database v12 or higher.
Teradata base support: Shared ICU libraries, GSS client, CLIv2, Preprocessor 2 (PP2).
Optim loader support for Teradata: Named Pipes Access Module, PIOM, Multi-Load, FastLoad, BTEQ.
Optim support for Teradata Parallel Transporter: TPT Components, TPT API, TPT Build, TPT OSL, TPT export operator.
Security: At the minimum, Optim database users require access to the database objects and the Select privileges to the DBC catalog. GRANT LOGON TO and GRANT SELECT ON DBC TO.
Architecture: Optim support for Teradata
Based on Teradata's parallel architecture, the rows in the table are distributed across all of the access module processes (AMP), which enables high-speed accessibility and allows parallel access for an SQL query. This is also true for DML operations, which translates into data distribution across all the AMPS in parallel. The choice of the primary index is a decision factor for even distribution, and thus, leading to higher performance. This comes into play when using Optim to manage the Archive Life Cycle or Test Data management by creating a subset of production data to be used for testing. The Performance SLAs make it even more important to pick the right configuration and architecture to maximize the resource usage and meet your goals. Optim provides those choices by integrating and exercising the right tools and techniques, as shown in Figure 1.
Figure 1. Architecture - Optim TDM/DG support for Teradata
INSERT versus LOAD
This is a common discussion topic and the advantages of Load outweigh the benefits of the Insert in most cases, especially when working with large volumes of data. You can look at Insert from various perspectives, but when the goal is to implement a Test Data management framework by moving data from production to a test environment, Optim plays an important role. When trying to move data to another table in the same database, there are a few mechanisms that could be used natively such as the following.
- Single or Multiple INSERT statements.
- CTAS (Create Table As).
- Insert into Select from: This uses load parallelism under the covers, when the table is empty and is the most efficient. It will bypass the transient journal, thus reducing the logging overhead.
Optim load support in Teradata
Utilities such as (Fastload, Multi-load, BTEQ) can be used to accommodate a variety of use cases. Some take advantage of the block operations instead of plain SQL processing. The following section examines the strengths and features of each.
Fast load (Fastload)
This is the most common and popular utility to quickly load large amounts of data from an external data source into an empty table on the Teradata RDBMS. The sources may be a file on your client system, and it processes a series of Fastload commands and Teradata SQL commands.
The following are the Fastload requirements and advantages from an Optim perspective.
- Input file can be an Optim Extract or Archive file. Optim will prepare the input data file (.000) and control file (.500) to load the data.
- The table has to be EMPTY.
- The path of the fastload.exe has to be set under Product Options -> Load -> Teradata.
- Sessions parameter can be used to drive the performance. This is dependent on the number of AMPS and should be discussed with the DBA.
- The target table should not have secondary indexes defined, and there is no journaling.
- Only one target table can be processed at a time. Teradata Fastload does not support join indexes, foreign key references in target tables, and tables with secondary index defined. It is necessary to drop any of the constraints listed before loading, and then recreate them afterwards.
This is another popular utility used to load large amounts of data. Unlike Fastload, MLOAD can be used to update, insert, upsert, and delete operations on large tables in bulk mode.
The following are the MLOAD requirements and advantages from an Optim perspective.
- Multiple tables can be loaded at a time, and duplicate rows are allowed.
- The Updates operations is in block mode (one physical write can update multiple rows).
- It uses table-level locks and could provide highest-level throughput by assembling data rows into 64K blocks and writing them to disk on the AMPs. This is much faster than writing data one row at a time like BTEQ.
- One of the advantages of MLOAD is that it will still continue if the AMP (Access Module Processor) fails, as long as the table is protected by the Fallback setting. When using FastLoad, you must restart the AMP to restart the job. At the same time, you can use the AMPCHECK option to make it work like FastLoad if needed.
- MLOAD uses the following types of tables under the covers to manage
the load process.
- Error tables will be used to store any conversion, constraint, or uniqueness violations during a load.
- Work tables are used to receive and sort data and SQL on each AMP prior to storing them permanently to disk.
- Log table is used to store successful checkpoints during load processing in case a RESTART is needed.
- Sometimes it might be necessary to house the supporting tables in a particular database. When these tables are to be stored in any database other than the users own default database, then you must give them a qualified name (databasename.tablename) in the script. This can be handled via OPTIM by means of Column or Table Maps.
This is the third option to load data and is mainly used when the source data has LOBS column. While this is rare, it does provide a channel to manage the loading of LOBS data.
Optim load file formats in Teradata
Optim provides a variety of file formats as part of the load process. This gives flexibility from a functional standpoint, and also takes into consideration the performance aspects of the Load. This article will cover some interesting insights and tips via the performance profiling exercise in the lab that will highlight the strengths of the load formats and their applicable use cases. It is very important to understand that these are not benchmark numbers and in no way indicate the actual performance of the product. This is simply to help you understand the mechanics behind the load formats, and how making the right choice can fine tune the overall load performance.
This is a default, out-of-the-box, setting and can be used for regular fixed length data.
- The choice of whether to use Teradata ASCII or Teradata Delimited may come down to performance, and the difference can be marginal.
- Teradata ASCII is a fixed-width format. Upon conversion to ASCII, if the data is shorter than the maximum width of the column, then the data is padded with trailing blanks. This means more bytes will be sent over the network to the Teradata server, but you do not need to worry about choosing a delimiter.
This option allows the choice of delimiter in the load input file.
- Teradata Delimited is a varying-length format with a delimiter character separating each column value.
- Data is not padded, however, you must choose a delimiter that does not exist in your column data.
- If the data includes free-form text such as a description field, then it may be difficult to choose an appropriate delimiter that you can be sure does not exist in the data.
- There is a performance impact to using Teradata Delimited as well. When character columns are processed, Optim will pre-scan the data and issue an error if the delimiter is present.
ASCII with NO PRESCAN
Optim Load Request offers a rich set of choices for delimiters, including some control characters that are less likely to be present, and some of those choices will omit the pre-scan. This method would provide a better throughput when processing CSV type files into Teradata, as some of the binary delimiters can be used to indicate the bypass of the scan operation.
This is a popular method to load data because it provides the following benefits.
- It is not a file format, rather a mechanism to skip the creation of the intermediate Optim conversion file, and load the data directly into the Teradata database.
- Named Pipes are memory structures that hold data based on FIFO protocol. Data is read from the Optim Extract/Archive file and written directly to the Pipe, which is then written to the table.
- One of the biggest advantages is that there is no landing of an intermediate Convert file, as it happens with the other file format processing. This provides a performance boost due to reduction in I/O.
- Named Pipes should be used when storage space on the Optim Server is limited.
Optim 9.1 provides this new file format for loading the data that was archived or extracted in native Teradata format (such as using a native Teradata DBAlias).
- This format requires no data conversion or pre-scanning, so it should perform better than ASCII or ASCII Delimited.
- The Loader input file is also smaller in size compared to the ASCII format, thus providing space savings.
Teradata load performance exercise
This section covers details around the actual performance tests from the laboratory, highlighting the benefits of the various load methods. These were executed using TPCH scale 100 data model on a Teradata 1650 appliance. Please keep in mind that the numbers shown in Figure 2 and Table 1 are not benchmarks, rather just a profiling exercise to highlight the advantages of the load method/formats.
Figure 2. Teradata load performance comparison
Table 1. Load method performance test
|Load method/format||Optim convert (minutes)||Load (Minutes)||Process time (minutes)||% improvement||Throughput (MB/sec)|
|ASCII load - Delimiter||240||86||326||1%||1.48|
|Binary load (4 sessions)||118||9||127||61%||3.80|
|Binary load + named pipes||NA||NA||119||64%||4.00|
Observations and learning points
- Increasing the number of Fast Load sessions improves the LOAD by almost 50%.
- Binary Load improves performance by 20-25% over Named Pipes/ASCII load.
- Binary Load is very effective with tables containing numeric/date columns.
- The Teradata fallback table has some overhead due to additional processing (3-5%).
- The No Pre-Scan method could enhance load performance by leveraging binary (hex) delimiter, where the input source data has only Character data. This ensures that the delimiter is NOT part of the actual data, thus eliminating the need to pre-scan the input.
- Named Pipes performance boost (approximately 20% compared to the ASCII load)
- Binary Load minimizes Optim Convert overhead by approximately 25% (compared to ASCII).
- Optim data and temp folder path settings via Optim personal options influence the performance of the overall Load process, and thus should be chosen carefully (Stripped/RAID 1+0 SAN volumes).
Performance consideration is an important aspect of the solution design and architecture of a Test Data management system. This article provides the fundamental load techniques that you must take into consideration when implementing InfoSphere Optim for Teradata.
In summary, you have learned about different LOAD techniques, and the support of various load file formats that enable Optim to achieve the path to performance optimization, and thus, provide the necessary choice to manage bulk data. However, in addition to best practices, application configuration and database tuning should not be ignored, and there might be additional tuning parameters from an architecture standpoint which are beyond the scope of this article.
- Read the "Optim documentation" for detailed information about Optim.
- Learn more about "Teradata Load and Unload" utilities.
- Visit the developerWorks Information Management zone to find more resources for DB2 developers and administrators.
- Stay current with developerWorks technical events and webcasts focused on a variety of IBM products and IT industry topics.
- Attend a free developerWorks Live! briefing to get up-to-speed quickly on IBM products and tools as well as IT industry trends.
- Follow developerWorks on Twitter.
- Watch developerWorks on-demand demos ranging from product installation and setup demos for beginners, to advanced functionality for experienced developers.
Get products and technologies
- Build your next development project with IBM trial software, available for download directly from developerWorks.
- Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.
- Participate in the discussion forum.
- Get involved in the My developerWorks community. Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.