System Administration Certification exam 919 for Informix 11.70 prep, Part 7: Informix Data Warehousing


In this tutorial, you'll learn about IBM® Informix® Data Warehousing concepts and the tools that you can use to create data warehouses and optimize your data warehouse queries. This tutorial prepares you for Part 7 of the System Administration Certification exam 919 for Informix v11.70.


Manjula Panthagani (manjulap@us.ibm.com), Advisory Software Engineer, IBM

Manjula Panthagani is a member of the Down System and Diagnostics Team for IBM Informix Technical Support, a team responsible for handling critical customer situations. Manjula has been working in this role for more than 10 years. She is an IBM Certified System Administrator for Informix and one of the Redbooks authors of Data Warehousing with Informix server.



Sanjit Chakraborty (sanjitc@us.ibm.com), Advisory Software Engineer, IBM

Sanjit Chakraborty is a member of the Down System and Diagnostics Team for IBM Informix Technical Support, which is responsible for handling critical customer situations and developing support tools for use by the Technical Support Organization. Sanjit has worked more than 15 years in the information technology industry in various roles. He is an IBM Certified System Administrator for Informix and DB2, and a designated archiving subject matter expert. Sanjit developed several Informix features and Down System Support tools. He is also an author and technical reviewer of many technical articles, tutorials, and training course materials on various Informix topics.



06 December 2012

Before you start

About this series

Thinking about seeking certification on System Administration for IBM Informix version 11.70 (Exam 919)? If so, you've landed in the right spot to get started. This series of IBM Informix certification preparation tutorials covers all the topics you'll need to understand before you read that first exam question. Even if you're not seeking certification right away, this set of tutorials is a great place to start learning what's new in IBM Informix 11.70.

About this tutorial

In this tutorial, you'll learn about Data Warehousing concepts, strategies, utilities, and commands. The material provided here primarily covers the objectives in Section 7 of the exam, entitled Warehousing. Topics covered in this tutorial include the following.

  • Multi-index scans
  • Star join optimization
  • Fragment by list and interval
  • Fragment-level statistics
  • MERGE statement
  • External tables

Objectives

After completing this tutorial, you should be able to do the following.

  • Know the concepts and uses of Data Warehousing.
  • Understand the multi-index scans concept.
  • Increase query performance with the star join optimization feature.
  • Understand the use of fragmentation with Data Warehousing.
  • Combine update, insert, and delete operations into a single statement.
  • Understand the use of an external table.

Prerequisites

To understand the material presented in this tutorial you must be familiar with the following.

  • The Informix environment (Configuration file and parameters, installation, and administration).
  • Database server commands (onstat, onmode, oncheck, dbschema).
  • Informix concepts and terminology (dbspaces, chunks, physical log, logical logs, checkpoint, and so on).

System requirements

You do not need a copy of Informix to complete this tutorial. However, you will get more out of the tutorial if you download the free trial version of Informix Innovator-C Edition to work along with this tutorial.


Data Warehousing overview

Currently, data warehouses are an integral part of many businesses. Nevertheless, because legacy applications and various RDBMSs are in use at most businesses, consolidating all of the data becomes a huge problem. The IBM Informix Warehouse Feature provides an integrated and simplified software platform to design and deploy a warehouse repository on your existing Informix infrastructure. Consolidating data using Informix for a data warehouse (DW) is an ideal solution for businesses that want to build end-to-end business intelligence (BI) and reporting solutions.

DW databases provide a decision support system (DSS) that you can use to evaluate the performance of an entire business over time. In the broadest sense, the term DW refers to a database that contains very large stores of historical data. The data is stored as a series of snapshots, in which each record represents data at a specific time. By analyzing these snapshots, you can make comparisons between different time periods. You can then use these comparisons to help make important business decisions.

DW databases are optimized for data retrieval. The duplication or grouping of data, referred to as database de-normalization, increases query performance and is a natural outcome of the dimensional design of a DW. In contrast, an online transaction processing (OLTP) database automates day-to-day transactional operations; OLTP databases are optimized for data storage and strive to eliminate data duplication. A DW can be implemented in several different ways. You can use a single data management system, such as Informix, for both OLTP and DSS environments, or you can build two separate environments, depending on system requirements.

The meaning of the term DW can vary from person to person. A DW environment can encompass any of the following forms.

  • Data warehouse
  • Data mart
  • Operational data store
  • Data staging

Figure 1 is an overview of the software technologies and tools that are typically present or needed in the infrastructure of an end-to-end DW solution. It also shows how the data flows from the sources to the targets in the process of creating and maintaining the DW.

Figure 1. Big picture of DW solution
This image shows the big picture of a data warehouse, including an overview of the software and tools needed for a DW solution.

The DW itself can comprise one or more databases of the repository types (ODS, data staging, data mart, data warehouse). Some of these could be the final data warehouse version that the front-end BI tools will use, and some could be used as staging databases in more complex data consolidation projects. Depending on the business need, you may want to make different warehouse repositories available to different business intelligence tools, including reporting, dashboards, data analytics, and so on. The ETL processes need to be automated for the initial and periodic tasks of consolidating and summarizing data from the different data sources (typically, OLTP database systems, external repositories, and files) into the warehousing databases. Depending on the DW requirements, the ETL processes can be deployed in different places and can work in several stages to deal with the complexity of the data.

Informix Warehouse Feature

Let's take a look at how Informix can be used in a DW solution. You can more effectively leverage Informix for BI using Informix Warehouse capabilities to create and populate a data warehouse repository, and then utilize front-end analysis and reporting tools, like IBM Cognos, to provide BI dashboards and other types of analytic applications and reports on top of the warehouse repository. Informix users can deploy an end-to-end warehouse solution, simplifying operational complexity and reducing costs by using a single database server product for both operational and warehouse data.

Informix Warehouse has a component-based architecture with client and server pieces. The following software components are provided in Informix Warehouse.

  1. Warehouse client: The Informix Warehouse Design Studio is used as the warehouse client. It's an Eclipse-based common design environment for defining the source and target databases involved in your DW project, creating and reverse engineering physical data models of your databases, and building SQL-based data flows and control flows to quickly and easily build in-database data movements and transformations into your warehouse.
  2. Warehouse server: The Informix Warehouse Administration Console is used as the warehouse server. It's a Web-based application for administering database and system resources related to your warehouse, as well as for deploying, scheduling, and monitoring the control flows previously created in Design Studio, through processes called the SQL Warehousing (SQW) services. The Administration Console allows you to do the following.
    • Manage common resources, such as database connections and machine resources.
    • Schedule the execution of control flows (sequences of ELT data flows).
    • Monitor the execution status.

The following sections highlight some useful Informix warehouse functionalities that are available within the Informix product.


Multi-index scans

A search path based on an access method that uses more than one index on the same table is called a multi-index scan. The MULTI_INDEX or INDEX_ALL directive forces the query optimizer to consider a multi-index scan to search the specified table for qualifying rows. The argument list for the MULTI_INDEX or INDEX_ALL directive has the following semantics.

  • If you specify a table as the only argument to the directive, the optimizer considers all of the available indexes on that table, and uses all of them (or a subset) when it searches the table for qualifying rows.
  • If you specify a table and only a single index, the optimizer considers using only that index to scan the table.
  • If you specify a table and more than one index, the optimizer considers a search path that uses all of the specified indexes.

A multi-index scan path can also utilize a single index. In this case, it uses a skip scan method to access the table.

Example of MULTI_INDEX directives

Suppose that you have a table named emp that contains the columns emp_no, dept_no, and job_no, and for which the idx_dept_no index is defined on the dept_no column, and the idx_job_no index is defined on the job_no column.

In the following example, the access-method directive forces the optimizer to consider using a multi-index scan, based on the combined results of scanning both the idx_dept_no index on the dept_no column and the idx_job_no index on the job_no column.

SELECT {+MULTI_INDEX(emp idx_dept_no idx_job_no)} ...
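
For example, a fuller form of this query (the column list and WHERE predicates below are hypothetical and only illustrate where the directive sits) might look like the following.

-- Ask the optimizer to combine scans of idx_dept_no and idx_job_no
SELECT {+MULTI_INDEX(emp idx_dept_no idx_job_no)} emp_no
FROM emp
WHERE dept_no = 10
  AND job_no = 20;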

Restrictions on multi-index scan paths for query execution

The transaction isolation level affects whether the MULTI_INDEX or INDEX_ALL directive can force a multi-index scan execution path. A multi-index scan is not available while the isolation level is Cursor Stability, or Committed Read with the LAST COMMITTED option. The directive is supported, however, in the Dirty Read and Repeatable Read isolation levels, and in Committed Read without the LAST COMMITTED option.

The following additional restrictions apply to multi-index scan access paths.

  • The indexes must be B-tree indexes. These can be attached or detached indexes.
  • These directives are ignored for R-tree indexes, functional indexes, and indexes based on the Virtual Index Interface (VII).
  • The table cannot be a remote table, a pseudo-table, a system catalog table, an external table, or a hierarchical table.
  • A multi-index scan cannot support join predicates as index filters in the underlying index scans.
  • A multi-index scan ignores all columns of a composite index except the leading column.
  • DML statements that perform cascade deletes or declare statement local variables (SLVs) cannot use a multi-index scan.
  • Update queries that activate a FOR EACH ROW triggered action cannot use a multi-index scan.
  • In ANSI-compliant databases, the MULTI_INDEX or INDEX_ALL directive is not followed for a SELECT statement that has no ORDER BY clause, no GROUP BY clause, and no FOR READ ONLY clause, if the FROM clause specifies only a single table. In this special case, the query has implicit cursor behavior that conflicts with a multi-index scan access path.

Specifying the MULTI_INDEX or INDEX_ALL directive circumvents the usual optimizer requirement for statistics on the specified table. The optimizer normally requires at least low-level statistics on a table before it considers a multi-index scan path on that table.


Star join optimization

IBM Informix Server V11.70 introduces a star join optimization feature that can significantly increase query performance for queries on star schemas and snowflake schemas.

A star join query typically involves a fact table and several dimension tables, with join predicates between the fact table and each dimension table. Table-level predicates typically exist on some dimension tables as well.

The star join plan utilizes push-down hash join technology. The primary purpose of a push-down hash join is to reduce the number of rows processed from the fact table, which probes each of the hash tables. Each dimension table contains selectivity that can be used to reduce the number of fact table rows to be processed. Instead of applying the selectivity from the different dimension tables in a step-wise fashion, combining the selectivity of two or more dimension tables and applying it to the fact table at the same time reduces the number of fact table rows earlier. Because the fact table is the largest table in the query, reducing the number of fact table rows to be processed early on improves query performance.

Performance advantages of a star join plan

A star join plan utilizing push-down hash join technology makes hash joins more effective, primarily due to the reduction of fact table (probe table) rows early on. The larger the probe table is in a hash join, the slower the join is. When hash tables are built, join keys are pushed down and used to filter the rows of the fact table, and performance improves dramatically. The performance improvement depends on the selectivity of the filter on the dimension table from which the join keys are pushed down. If only a few rows are selected from the dimension table so that only a few keys are pushed down, then rows of the fact table might be reduced proportionally. The actual proportion depends on the distribution of rows with the pushed-down keys in the fact table.

When the optimizer can choose a star join plan

The optimizer can choose a star join plan only if a query involves tables from a star or snowflake schema that meets the following requirements.

  • Neither the fact table, nor any dimension table participates in an outer join (for example, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN, or Informix style outer join).
  • No table is an external table, remote table, view, or derived table.
  • The join between the fact table and any dimension table is an equi-join, which makes a hash join possible.
  • The values in the join-key columns of the dimension table are unique. The optimizer uses data-distribution statistics to determine whether a column contains unique values. The schema does not need to specify explicitly that a column is a primary or unique key to produce a star join plan.
  • The query includes one or more filters on at least one of the dimension tables.
  • The number of rows selected from each dimension table is less than the number of rows selected from the fact table after all scan filters are applied.
  • All tables involved in the query must have at least UPDATE STATISTICS LOW run against the table.
  • PDQ priority must be set to ON, or a number between 1 and 100. Alternatively, implicit PDQ must be set in the current session.

Some other minor restrictions apply, such as rules governing subscripted keys.
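
For example, a minimal sketch of the statistics and PDQ prerequisites listed above, using the Orders, Time, Location, and Customer example tables described later in this section (the PDQ priority value of 50 is arbitrary), might look like the following.

-- At least low-level statistics on every table in the query
UPDATE STATISTICS LOW FOR TABLE orders;
UPDATE STATISTICS LOW FOR TABLE time;
UPDATE STATISTICS LOW FOR TABLE location;
UPDATE STATISTICS LOW FOR TABLE customer;

-- PDQ priority set to a value between 1 and 100 (or ON) for the session
SET PDQPRIORITY 50;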

Two kinds of push-downs

The pushed-down join keys can be used in an index pushdown or a bit-vector push-down.

An index push-down uses the pushed-down keys to perform index lookups on a foreign key index of the fact table. Index push-downs require only a single-column index on the fact table, although they can use a multi-column index if the required foreign-key column is the first key in the index.

An index push-down has the following advantages.

  • Because the index is used to access the table, I/O is more efficient. The database server optimizes I/O by reading the fact table in a skip scan, where the database server reads the index and sorts the row identifiers (RIDs) so rows can be read from the fact table pages in order, accessing only the table pages that contain required rows.
  • For multiple index push-downs, multi-index scan is used to combine RIDs from different index scans and then skip-scan on the fact table is performed.
  • The resulting list of RIDs contains exactly the rows that will join, which makes the join efficient.

In a bit-vector push-down, for each push-down, the database server hashes the keys from the dimension table to create a bit vector, which is shipped to the fact table. Then, for each row from the fact table scan, the foreign key is also hashed and probed into the bit vector to determine whether the current fact table row is filtered out. Multiple bit vectors can be used for multiple bit-vector push-downs.

A bit-vector pushdown has the following advantages.

  • It requires no indexes.
  • It can be built fast with a low processing cost.

By reducing I/O costs, index push-downs eliminate fact table rows more efficiently than bit-vector push-downs, but their processing cost is higher.

Interpreting explain output for star join plans

The examples in this section use a star schema with the following tables.

  • Orders fact table, containing 100,000,000 rows. The Orders table has a single-column index on each foreign key that is a primary key in the dimension tables: date_time, location_id, and cust_num.
  • Time dimension table, containing 10,000 rows. The Time table has primary key date_time.
  • Location dimension table, containing 10,000 rows. The Location table has primary key location_id.
  • Customer dimension table, containing 10,000 rows. The Customer table has primary key cust_num.

To show how to interpret SET EXPLAIN output for star join plans, you can look at the example query shown in Listing 1.

Listing 1. Example query to interpret SET EXPLAIN output for star join plans
SELECT t.quarter, sum(o.quantity)
FROM orders o, location l, customer c, time t
WHERE o.location_id = l.location_id
    AND o.cust_num = c.cust_num
    AND o.date_time = t.date_time
    AND c.type <= 3
    AND l.region_id = 4
    AND t.quarter between 1 and 20 
GROUP BY t.quarter

The query plan information shown in the following examples is taken from the sqexplain.out file for this query.

In step 1, the join keys from the Location and Time dimension tables are pushed down to filter the Orders table before any hash joins are performed.

  • The location_id join keys that meet the filter requirements on the Location dimension table are pushed down to the index on the Orders table. The retrieved RIDs are sorted and used to read the Orders table in a skip scan.
  • The time_id join keys that meet the filter requirements on the Time dimension table are hashed to create a bit vector that filters rows of the Orders table from the skip scan, as shown in Listing 2.
Listing 2. Step 1 in the query plan
1) stores.o: INDEX PATH (SKIP SCAN) 
    (1) Index Name: stores.fk_location
        Index Keys: location_id   (Parallel, fragments: ALL) 
        Index Filter: stores.o.location_id = stream from stores.l.location_id 
    Bit Vector Filter: stores.o.time_id = stream from stores.t.time_id

In steps 2-4, as shown in Listing 3, the hash joins use the filtered rows of the Orders table to probe the hash tables built on the three dimension tables: first the hash join on the Customer dimension table, then the hash join on the Time dimension table, then the hash join on the Location dimension table.

Listing 3. Steps 2-4 in the query plan
2) stores.c: SEQUENTIAL SCAN  (Parallel, fragments: ALL) 
    Filters: stores.c.cust_type <= 3 
                    
DYNAMIC HASH JOIN 
    Dynamic Hash Filters: stores.o.cust_id = stores.c.cust_id 
                    
3) stores.t: SEQUENTIAL SCAN  (Parallel, fragments: ALL) 
    Filters: (stores.t.quarter >= 1 AND stores.t.quarter <= 20 ) 
                    
DYNAMIC HASH JOIN (Bit Vector Push Down Key: stores.t.time_id to stores.o) 
    Dynamic Hash Filters: stores.o.time_id = stores.t.time_id 
                    
4) stores.l: SEQUENTIAL SCAN  (Parallel, fragments: ALL) 
    Filters: stores.l.region_id = 4 
                    
DYNAMIC HASH JOIN (Index Push Down Key: stores.l.location_id to stores.o) 
    Dynamic Hash Filters: stores.o.location_id = stores.l.location_id

Snowflake schema support

A snowflake schema is an extension of a star schema in which there can be multiple levels of fact-dimension relationships. For example, the Location table is a dimension table for the Orders fact table, but it could have its own dimension table: a Region table that separates different locations into distinct regions. The Location table contains a foreign key (region_id) that corresponds to the primary key of the Region table. The level of a dimension table corresponds to the number of links it needs to connect to the central fact table. For example, the Location table is a level-1 dimension table, while the Region table is a level-2 dimension table.

Encouraging or preventing star join optimization

The following new optimizer directives are introduced for star join optimization.

  • STAR_JOIN - The STAR_JOIN directive instructs the optimizer to favor a star join plan whenever possible.
  • AVOID_STAR_JOIN - The AVOID_STAR_JOIN directive instructs the optimizer to skip consideration of star join plans.
  • FACT(table) - The FACT directive specifies a fact table to be used during star join optimization. A single table can be specified.
  • AVOID_FACT(tab1, tab2, ...) - The AVOID_FACT directive specifies one or more tables that should not be used as fact tables during star join optimization. One or more tables can be specified (delimited by commas or whitespace).

All of the new star join directives are query level directives that affect the execution of a query.
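
As a sketch (not taken from the original tutorial), the example query from Listing 1 could carry the STAR_JOIN directive in the usual Informix comment syntax; the FACT, AVOID_FACT, and AVOID_STAR_JOIN directives are placed the same way.

-- Favor a star join plan for this query whenever possible
SELECT {+ STAR_JOIN} t.quarter, sum(o.quantity)
FROM orders o, location l, customer c, time t
WHERE o.location_id = l.location_id
    AND o.cust_num = c.cust_num
    AND o.date_time = t.date_time
    AND c.type <= 3
    AND l.region_id = 4
    AND t.quarter BETWEEN 1 AND 20
GROUP BY t.quarter;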

Session environment setting

New SET OPTIMIZATION ENVIRONMENT options are also introduced for influencing optimizer consideration of star join plans, as shown in Listing 4.

Listing 4. New SET OPTIMIZATION ENVIRONMENT options
SET OPTIMIZATION ENVIRONMENT STAR_JOIN 'enabled'|'disabled'|'forced'
SET OPTIMIZATION ENVIRONMENT FACT '<table_list>'
SET OPTIMIZATION ENVIRONMENT AVOID_FACT '<table_list>'
SET OPTIMIZATION ENVIRONMENT NON_DIM '<table_list>'

The SET OPTIMIZATION ENVIRONMENT STAR_JOIN statement can be used to enable, disable, or force star join plans for all queries in the current session. The default is enabled. When the forced option is used, star join plans are favored whenever possible for all queries in the current session.

The SET OPTIMIZATION ENVIRONMENT FACT statement can be used to specify a list of tables that can be considered as fact tables during star join optimization for all queries in the current session. If this environment is set, only tables specified in the list can be used as fact tables.

The SET OPTIMIZATION ENVIRONMENT AVOID_FACT statement can be used to specify a list of tables that should not be considered as fact tables during star join optimization for all queries in the current session.

The SET OPTIMIZATION ENVIRONMENT NON_DIM statement can be used to specify a list of tables that should not be considered in any star join optimization, for all queries in the current session.

The format for <table_list> used in the FACT, AVOID_FACT, and NON_DIM options is as follows.

[<database_name>:][<owner>.]<table_name>,[<database_name>:][<owner>.]<table_name>...

The database name and owner name are optional in the table name specification. Multiple table names can be specified in the list, delimited by commas.
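
For instance, a hedged example of this format, assuming the stores database from the earlier explain output and the informix owner name, might look like the following.

-- Only the orders table may be treated as a fact table in this session
SET OPTIMIZATION ENVIRONMENT FACT 'stores:informix.orders';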

If star join directives are used in a query, the directives override the session environment settings.


Fragment by list and interval

Fragmenting by list

Fragmenting by list defines fragments that are each based upon a list of discrete values of the fragment key.

You can use this fragmentation strategy when the values of the fragment key are categories on a nominal scale that has no quantified order within the set of categories. Fragmenting by list is useful when a table contains a finite set of values for the fragment key, and queries on the table have an equality predicate on the fragment key. For example, you can fragment data geographically, based on a list of the states or provinces within a country. The rows that are stored in each fragment can be restricted to a single fragment key value, or to a list of values representing some logical subset of fragment key values, provided that no fragment key value is shared by two or more fragments.

Fragmenting by list also helps to logically segregate data, as shown in Listing 5.

Listing 5. Example of a list fragmented table
CREATE TABLE customer
    (id SERIAL, fname CHAR(32), lname CHAR(32), state CHAR(2), phone CHAR(12))
FRAGMENT BY LIST (state)
    PARTITION p0 VALUES ("KS", "IL") IN dbs0,
    PARTITION p1 VALUES ("CA", "OR") IN dbs1,
    PARTITION p2 VALUES ("NY", "MN") IN dbs2,
    PARTITION p3 VALUES (NULL) IN dbs3,
    PARTITION p4 REMAINDER IN dbs3;

In the previous example, the table is fragmented on column state, which is called the fragment key or partitioning key. The fragment key can be a column expression, as shown in the following example.

FRAGMENT BY LIST (SUBSTR(phone, 1, 3))

The fragment key expression can have multiple columns, as in the following example.

FRAGMENT BY LIST (fname[1,1] || lname[1,1])
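
Putting this together, a sketch of a complete table fragmented by a column-expression key might look like the following (the customer2 table name and the area-code values are hypothetical; dbs0 through dbs2 are the dbspaces used in Listing 5).

CREATE TABLE customer2
    (id SERIAL, fname CHAR(32), lname CHAR(32), state CHAR(2), phone CHAR(12))
FRAGMENT BY LIST (SUBSTR(phone, 1, 3))
    PARTITION p0 VALUES ("415", "510") IN dbs0,
    PARTITION p1 VALUES ("212", "718") IN dbs1,
    PARTITION p2 REMAINDER IN dbs2;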

The fragments must be non-overlapping, which means that duplicates are not allowed in the list values. For example, the following expression lists are not valid for fragments of the same table or index, because their KS expressions overlap.

PARTITION p0 VALUES ("KS", "IL") IN dbs0,
PARTITION p1 VALUES ("CA", "KS") IN dbs1,
PARTITION p0 VALUES ("KS", "IL", "KS") IN dbs0,

The list values must be constant literals. For example, the identifier or variable name is not allowed in the following list.

PARTITION p0 VALUES (name, "KS", "IL") IN dbs0,

A null fragment is a fragment that contains rows with NULL values for the fragment key column. You cannot mix NULL and other list values in a single fragment definition. For example, the following is not allowed.

PARTITION p0 VALUES ("KS", "IL", NULL) IN dbs0,

A remainder fragment is a fragment that stores the rows whose fragment key value does not match any expression in the expression lists of the explicitly defined fragments. If you define a remainder fragment, it must be the last fragment listed in the FRAGMENT BY or PARTITION BY clause that defines the list fragmentation strategy.

Fragmenting by list supports the following features.

  • Both a table and its indexes can be fragmented by list.
  • The fragment key can be a column expression based on a single column or on multiple columns.
  • The list can optionally include a remainder fragment.
  • The list can optionally include a NULL fragment that stores only NULL values.

Fragmenting by list must satisfy the following requirements.

  • The list cannot include duplicates.
  • The list that includes NULL (or IS NULL) cannot include any other value.
  • The fragment key must be based on a single row.
  • The fragment key must be a column expression. This can be based on a single column or on multiple columns.
  • An error is issued if a row matches no value on the list, and no remainder fragment is defined.

Fragmenting by interval

An interval fragmentation strategy partitions data into fragments based on an interval value of the fragment key. The interval value must be a column expression that references a single column of a numeric, DATE, or DATETIME data type. You can use this fragmentation strategy to assign quantified values of the fragment key to non-overlapping intervals within its range of numeric or time values. This kind of fragmentation strategy is useful when all possible fragment key values in a growing table are not known and the DBA does not want to allocate fragments for data that is not yet loaded.

To fragment a table or index according to intervals within the range of its fragment key, you must define the following parameters.

  • A fragmentation key, based on a numeric, DATE, or DATETIME column.
  • An interval size, specifying the range of fragment key values assigned to a single fragment.
  • One or more dbspaces in which to store fragments for values outside the initial fragment key range.
  • One or more named fragments defined by fragment key range expressions and stored in dbspaces.

Fragmentation by intervals of a range does not support a remainder fragment, because after the Interval Fragment clause defines the required parameters and at least one initial fragment, the database server can automatically define new fragments of the same interval size to store rows whose interval key values are outside the range of the initial fragments. The fragments that the server creates are stored in a round-robin fashion in dbspaces that you specify after the STORE IN keywords.

For example, you can create a fragment for every month or for every million customer records. Interval fragmentation requires that at least one fragment be based on a range expression. When rows are inserted with fragment-key values that do not fit in any of the range fragments that you defined, the database server automatically creates one or more new interval fragments, based on the same interval size within the fragment-key range, to store the new rows.

The CREATE INDEX statement supports the same RANGE interval fragmentation strategies. If a table has an attached index defined with the same FRAGMENT BY RANGE syntax, corresponding index fragments (with the same names as the new table fragments) are similarly created automatically when rows outside the existing intervals are inserted.

The fragment by interval fragmentation strategy is useful when all possible fragment key values in a growing table are not known, and the DBA does not want to pre-allocate fragments for data that is not yet there, as shown in Listing 6.

Listing 6. Example of an interval fragmented table
CREATE TABLE employee (id INTEGER, name CHAR(32), basepay DECIMAL (10,2), 
                       varpay DECIMAL (10,2), dept CHAR(2), hiredate DATE)
        FRAGMENT BY RANGE (id) 
        INTERVAL (100) STORE IN (dbs1, dbs2, dbs3, dbs4)
            PARTITION p0 VALUES IS NULL IN dbs0,
            PARTITION p1 VALUES < 200 IN dbs1,
            PARTITION p2 VALUES < 400 IN dbs2;

Here the value of the interval size is 100, the fragment key is the value of the employee.id column, and the VALUES IS NULL keywords define p0 as the table fragment to store rows with no id column value.

When an employee ID value falls beyond the ranges of the initial fragments, fragments are created automatically in intervals of 100, which is the interval value. The automatically created fragments are called interval fragments.

If a row is inserted with an employee ID of 405, a new fragment is created to accommodate the row. The new fragment holds values >= 400 AND < 500.

If a row is updated and the employee ID is modified to 821, a new fragment is created to accommodate the new row. The fragment holds values that are >= 800 AND < 900.
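
For example, a hypothetical insert against the employee table from Listing 6 (all column values other than id are made up for illustration) triggers the automatic creation of an interval fragment.

-- id 405 falls outside the initial fragments (p1: < 200, p2: < 400),
-- so the server automatically creates an interval fragment covering
-- values >= 400 AND < 500 in one of dbs1 through dbs4
INSERT INTO employee VALUES (405, 'Pat Lee', 52000.00, 1000.00, 'IT', '01/15/2012');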

The initial range expressions for interval fragmentation are non-overlapping and there are no remainder fragments.

The fragment or partitioning key for interval fragmentation can reference only a single column. For example, the following is not allowed.

 FRAGMENT BY RANGE (basepay + varpay)

The fragment key can be a column expression, as in the following example.

 FRAGMENT BY RANGE (ROUND(basepay))

The fragment key expression must evaluate to a numeric, DATE, or DATETIME data type.

No exclusive lock is required for fragment creation. The interval value must be a nonzero positive constant expression of a numeric data type for numeric fragment keys, or of an INTERVAL data type for DATE or DATETIME fragment keys.

The interval fragments are created in round-robin fashion in the dbspaces specified in the STORE IN clause. The STORE IN clause is optional; if it is omitted, interval fragments are created in the dbspaces of the initial fragments (dbspaces dbs0, dbs1, and dbs2 in the previous example). If the dbspace selected for the interval fragment is full or down, the system skips that dbspace and selects the next one in the list.
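
As noted earlier, the CREATE INDEX statement supports the same interval fragmentation strategy. The following is a sketch of an attached index that mirrors the employee table from Listing 6 (the index name idx_emp_id is hypothetical).

CREATE INDEX idx_emp_id ON employee (id)
    FRAGMENT BY RANGE (id)
    INTERVAL (100) STORE IN (dbs1, dbs2, dbs3, dbs4)
        PARTITION p0 VALUES IS NULL IN dbs0,
        PARTITION p1 VALUES < 200 IN dbs1,
        PARTITION p2 VALUES < 400 IN dbs2;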


Fragment-level statistics

Fragment-level statistics affect the way UPDATE STATISTICS MEDIUM/HIGH gathers data and generates column distributions on fragmented tables. Statistics generated by UPDATE STATISTICS LOW on tables and indexes have always been maintained at the fragment level. Those statistics are stored in the sysfragments catalog.

Fragment-level statistics offer a finer granularity of statistics for fragmented tables. The statistics are calculated and stored at the individual fragment level.

You may want to consider fragment-level statistics in the following two main situations.

  • You have a time-cyclic database schema where data changes are localized to rows with certain specific column values. For example, a sales table is fragmented by a transaction-date column such that new rows are added each month to store the current month's sales data.
  • You have fragmented tables and you frequently use ALTER FRAGMENT ATTACH/DETACH to add or remove data from the fragmented table. Breaking the column distribution per fragment allows for statistics management at the fragment level. For example, refreshing of statistics can happen at fragment level, and fragment statistics can be added to, or removed from, table statistics.

Setting your system to use fragment-level statistics

Do the following to allow the server to build and use fragment-level statistics.

  1. Set ONCONFIG parameter AUTO_STAT_MODE to 1. This is a default setting and affects the entire system. If you wish to restrict the feature to certain databases only, you can alternatively set the session environment AUTO_STAT_MODE to 1 in your sysdbopen procedure for the desired databases.
  2. Ensure the ONCONFIG parameter SYSSBSPACENAME is set to a valid smartblob space and allocate the smartblob using the onspaces -c -S command. For example:
    onspaces -c -S sbspace -p /work/dbspaces/sbspace -s 100000 -o 0

    The amount of space required for the fragmented statistics of a column depends on its data type, the number of fragments, and the resolution used in UPDATE STATISTICS. The data type of the column determines the actual number of bytes a single minibin requires.
  3. Consider increasing logical logs to account for the additional fragment-level statistics rows in the sysfragdist catalog. Since the fragmented statistics are more granular than table statistics, they do take up more space as well.

With these steps in place, the database server can identify the candidate tables that can benefit from fragment-level statistics and creates those statistics automatically when regular UPDATE STATISTICS is run. The candidate tables are identified by using the following set of rules.

  • Table is fragmented by expression, LIST, or INTERVAL strategy.
  • Table has over a million rows.

Alternatively, you can explicitly indicate which tables should have fragmented statistics by setting the STATLEVEL table property, explained in the next section.

STATLEVEL table property

STATLEVEL is the level, or granularity, of the column distributions. STATLEVEL can be specified with the CREATE TABLE or ALTER TABLE statement.

  • TABLE - Distributions are created at table level.
  • FRAGMENT - Distributions are created and maintained at each fragment.
  • AUTO - In automatic mode, the server applies rules to determine whether fragment-level statistics should be created; otherwise, it defaults to table-level distributions.

The following example forces the use of fragment-level statistics by setting the STATLEVEL at creation time.

Listing 7. Example of fragment-level statistics using STATLEVEL
CREATE TABLE tab1(col1 integer, col2 char(10))
FRAGMENT BY EXPRESSION
(col1 >= 0 AND col1 < 1000) IN dbspace1,
(col1 >= 1000 AND col1 < 2000) IN dbspace2,
(col1 >= 2000 AND col1 < 3000) IN dbspace3,
remainder in rootdbs
STATLEVEL FRAGMENT;

Note: The FRAGMENT option of STATLEVEL can be used for any fragmented table regardless of strategy used.

Alternatively, the following ALTER TABLE command, run at a later time, achieves the same effect.

 ALTER TABLE tab1 STATLEVEL FRAGMENT;

The new fragment-level statistics are created and stored in the sysfragdist system catalog when UPDATE STATISTICS MEDIUM/HIGH is run on table tab1.
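
For example, a minimal sketch of refreshing distributions on tab1 from Listing 7 follows.

-- With STATLEVEL FRAGMENT, the resulting distributions are calculated
-- and stored per fragment in the sysfragdist system catalog
UPDATE STATISTICS MEDIUM FOR TABLE tab1;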

Statistics and ALTER FRAGMENT

When you perform an ALTER FRAGMENT ATTACH/DETACH operation on a fragmented table with fragment-level statistics, the server automatically refreshes the column statistics of the tables involved. The refresh runs in the background, and the ALTER FRAGMENT command returns as soon as the alter operation is completed. The statistics refresh starts after the alter operation is completed or, if you are in a transaction, after the transaction is committed.

  • For an ATTACH operation: fragmented statistics of the new fragment are built, and table-level statistics are rebuilt from all fragmented statistics. Any existing fragments with out-of-date column statistics are also rebuilt at this time.
  • For a DETACH operation: table-level statistics of the resulting tables are rebuilt from the fragmented statistics.

Note: Because any valid existing fragmented statistics are merged directly into the table-level statistics, this is not equivalent to running a complete statistics refresh on the resulting tables of an ALTER FRAGMENT.

The background task that refreshes statistics, called refresh_table_stats, is defined in the ph_task table of the sysadmin database. If any errors occur, they are printed in the online.log file.
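
A hedged way to check whether the task is defined (the tk_name and tk_enable column names assume the standard ph_task layout) is a simple query against the sysadmin database.

-- Confirm that the statistics refresh task exists and whether it is enabled
SELECT tk_name, tk_enable
FROM sysadmin:ph_task
WHERE tk_name = 'refresh_table_stats';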


MERGE statement

The MERGE statement (sometimes referred to as UPSERT) provides a way to join two tables or views and combine update, insert, and delete operations into a single statement. MERGE is a DML statement that is especially useful in data warehouse environments and in the Extraction-Transformation-Loading (ETL) cycle, where transaction tables are merged into bigger warehouse tables.

Depending on the condition you specify, a MERGE statement can have the following effects.

  • Update rows in the target table with rows from the source table that match the condition.
  • Insert new rows into the target table from the source table that do not match the condition.
  • Delete rows from the target table based on the condition.

You must specify at least one of the insert, update, or delete clauses; otherwise, the MERGE statement fails. However, you cannot use all three effects together, because the delete clause and the update clause are mutually exclusive.

Table 1 shows an example of how the MERGE statement works. In this example, contacts is the target table and validated is the source table. The current customer contact records are merged with newly validated records. If a customer contact has not been validated, it is deleted from the contacts table; otherwise, the record is updated.

Table 1. Examples of MERGE statements
SQL statement/clause | Description
MERGE | The MERGE keyword and any optimizer directives, if needed.
INTO contacts AS t | The INTO clause specifies the target table or view.
USING validated AS s | The USING clause specifies the source table or view.
ON t.cust_id = s.cust_id | The ON clause supplies the join condition between the source and target tables.
WHEN NOT MATCHED THEN DELETE | The WHEN NOT MATCHED clause specifies what to do with rows in the target table that have no matching record in the source table. Here the row is deleted from the target table.
WHEN MATCHED THEN UPDATE SET (t.cust_id, t.email) = (s.cust_id, s.email); | The WHEN MATCHED clause specifies what to do when a matching record exists in both the source and target tables. Here the target row is updated based on the values in the source table.
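
Assembled into a single statement, the clauses from Table 1 read as follows.

MERGE INTO contacts AS t
    USING validated AS s
    ON t.cust_id = s.cust_id
    WHEN NOT MATCHED THEN DELETE
    WHEN MATCHED THEN UPDATE SET (t.cust_id, t.email) = (s.cust_id, s.email);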

You can specify optimizer directives in the MERGE statement to specify how the source and target tables are joined, or to control other aspects of the execution plan. Some of the directives that can be used are USE_NL, USE_HASH, AVOID_HASH, FULL, AVOID_FULL, INDEX, ORDERED, and so on. In a high-availability cluster configuration, you can issue the MERGE statement from a primary server or from an updatable secondary server.

Any constraints on the target table are enforced during MERGE operations. If an error occurs while the MERGE statement is executing, then the entire statement is rolled back.

Triggers can be defined on the source and target tables, but during the merge operation only insert, update, and delete triggers are actionable; select triggers are ignored. Depending on the MERGE statement, the before and after events of the insert, update, and delete triggers are activated, and the for-each-row trigger is activated for each row that is inserted, updated, or deleted. Therefore, during an insert operation, only for-each-row insert triggers on the target table are activated. Similarly, during update and delete operations, only for-each-row update or for-each-row delete triggers on the target table are activated.

The MERGE statement supports distributed queries, with the restriction that the target table cannot be a remote table. You can use remote table, synonym, and view references in the sub-queries within the SET and VALUES clauses. Additional restrictions apply to the target table: it cannot be a virtual table, a system catalog table, an external table, a view on which an INSTEAD OF trigger is defined, a read-only view, a pseudo-table with memory-resident objects, and so on.


External tables

An external table is essentially a data file that is external to the Informix database (not managed by Informix) and resides on the file system as a regular flat file. The database server enables you to access records from this external data file as if they were in a table within the database.

When working in data warehouse environments, the ETL cycle frequently requires the user to load information from external sources. The external table provides a powerful, SQL-based technique for moving data into or out of the database. You can specify transformations of the data while mapping an external data file to the external table, using filters and a conversion process.

As part of loading a record from the external data file, Informix performs the conversion required to create the row and then inserts the row into the table. If any error occurs, it writes the data to a reject file.

Use of external table during load

You use SQL statements to define an external table that includes a description of the data file, define the table, and then specify the data to load or unload. Perform the following steps to set up a load operation using an external table.

  1. Define the external table. The CREATE EXTERNAL TABLE statement defines the location of the various external data files, which can be on disk, on tape, or come from a pipe, as well as the format of the external data. Table 2 shows an example of the CREATE EXTERNAL TABLE statement, clause by clause; the assembled statement is shown after these steps.
Table 2. Example of CREATE EXTERNAL TABLE statement
SQL statement/clause | Description
CREATE EXTERNAL TABLE emp_ext (empid int, empdoj date) | The column specification of the external table. External tables support all data types: basic types, UDTs, BYTE/TEXT, and smart blobs. You can use the SAMEAS clause to copy the column specification from a template table.
USING ( | Introduces the external table options described in the following rows.
DATAFILES ("DISK:/data/empdata.unl"), | The DATAFILES clause specifies the location (DISK or PIPE) of the external data file, including the full path name. You can specify more than one file in this clause.
DELIMITER "|", | Specifies the character that separates fields in the external data file. The default delimiter is a vertical bar (|).
FORMAT "DELIMITED", | The FORMAT clause specifies the type of record formatting in the external data file. The database server converts external data from several formats, including delimited and fixed ASCII (FIXED) and IBM Informix internal (INFORMIX) formats. When FORMAT is not specified, DELIMITED is assumed by default.
NUMROWS 100000, | Provides an approximate number of rows that the external table can have. The optimizer uses this information to come up with an optimal plan when the external table is used in joins.
DBDATE 'y4md-', | The date format of date column values in the external data file. Similarly, you can use DBMONEY for money column values.
REJECTFILE "/errlog/empdata.rej", | Sets the full path name of the file where Informix writes rows rejected due to errors.
MAXERRORS 100, | Sets the number of errors that are allowed before the database server stops the load.
DELUXE); | Specifies the mode of the data load. Available options are DELUXE, EXPRESS, and DEFAULT. EXPRESS mode can be used only when loading data into a RAW table. The DEFAULT option leaves the decision to the Informix server.
  2. Create the table (if it doesn't exist) into which the data will be loaded, as follows.
    CREATE TABLE employee (empid int, empdoj date);
  3. Load the data, as follows.
    INSERT INTO employee SELECT * FROM emp_ext;
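
Assembled from the clauses in Table 2, the complete CREATE EXTERNAL TABLE statement from step 1 reads as follows.

CREATE EXTERNAL TABLE emp_ext (empid int, empdoj date)
USING (
    DATAFILES ("DISK:/data/empdata.unl"),
    DELIMITER "|",
    FORMAT "DELIMITED",
    NUMROWS 100000,
    DBDATE 'y4md-',
    REJECTFILE "/errlog/empdata.rej",
    MAXERRORS 100,
    DELUXE);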

Use of external table during unload

You can unload data by creating an external table explicitly, as shown previously in Table 2, and inserting data into it, or by selecting data from an internal table into an external data file. The unload operation can be performed in parallel by initiating a query that runs in parallel and writes its output to multiple external data files. Informix uses a round-robin technique to equalize the number of rows in the external data files. By default, data is unloaded to a delimited-ASCII text file.

In the following example, data will be selected from a customer table and unloaded to an external table 'ext_customer', which uses an external data file on disk located at '/data/customer.dat'.

SELECT * FROM customer INTO EXTERNAL ext_customer
USING (DATAFILES ("DISK:/data/customer.dat"));

Unlike a TEMP table, the external table has a definition that remains in the catalog until it is dropped. Three new catalog tables sysexternal, sysextdfiles and sysextcols are added to track external tables.

External table in high-availability environment

In most cases, secondary servers use external tables the same way as the primary server. You can use them for load and unload operations. However, loading data on SDS, RSS, or HDR secondary servers is slower than loading data on the primary server. Remember that creating an external table in a high-availability environment replicates only the schema of the external table, not the actual external data file.

Restrictions

There are certain restrictions associated with external tables. The following is a list of some limitations with external tables.

  • You cannot create an index or primary, unique, or foreign key constraints on an external table.
  • Triggers are not supported.
  • You cannot use an external table as the target table in a MERGE statement.
  • Explicit truncation of an external table is not possible. Each time you run an INSERT statement against an external table, it automatically overwrites any previous external data file associated with the table.
  • You cannot use the DATAFILES clause with PIPE if the base or external table has a BLOB or CLOB data type; you must use DELIMITED with the FORMAT clause.
  • UPDATE STATISTICS is not allowed on external tables.
  • External data files cannot be compressed.

Conclusion

This tutorial focuses on Informix Data Warehousing concepts, strategies, utilities, and commands. It tells you what you need to know about warehousing before taking the Informix system administration certification exam. You should now have a better understanding of Informix warehousing operations, and you should be able to do the following.

  • Know the concepts and uses of Data Warehousing.
  • Understand the multi-index scans concept.
  • Increase query performance with the star join optimization feature.
  • Understand the use of fragmentation with Data Warehousing.
  • Combine update, insert, and delete operations into a single statement.
  • Understand the use of an external table.

Resources

Get products and technologies

  • Download a trial version of Informix, including the Informix Innovator-C Edition.
  • Build your next development project with IBM trial software, available for download directly from developerWorks.
  • Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.
