The information provided applies to the Informix Server Version 11.70.xC2 and later.
This blog entry discusses some terms commonly used in data warehousing and their significance for the Informix Warehouse Accelerator. This knowledge leads to general recommendations for data mart design and a better understanding of the Accelerated Query Tables (AQTs) that tie the data marts in the accelerator to the Informix server. The main sections of this entry are:
- Terms and Definitions: Data mart, AQT, star and snowflake schemas, fact and dimension tables
- Data mart definition and its implications with Informix Warehouse Accelerator
- Subsets in a snowflake schema
- Overlapping snowflake schemas
- Tip on AQTs
Terms and Definitions
Informix Warehouse Accelerator internally organizes the user data in logical entities called data marts. Wikipedia defines the term data mart like this: "A data mart (DM) is the access layer of the data warehouse (DW) environment that is used to get data out to the users. The DM is a subset of the DW, usually oriented to a specific business line or team."
The accelerator takes this a step further. As the accelerator is mainly concerned with acceleration, a data mart should be both, useful and as minimal as possible. This avoids any overhead caused by unnecessary data. Additional data that does not need accelerated access can rest in the data warehouse on the Informix server instance.
AQTs are Accelerated Query Tables that provide to the Informix server all needed information about data marts existing in the accelerator. The actual data mart definitions in the form of XML files are only kept by the accelerator itself (and possibly the ISAO Studio GUI). Upon creation of a data mart, the accelerator refines the data mart definition into an SQL-like format and sends this information back to the Informix server. Informix server uses it to create the AQTs in the system catalog of the corresponding database. An AQT actually is a special view that is created only for the purpose of acceleration and not usable in any other way.
Informix server uses the AQTs mainly to decide for incoming queries, whether they can be accelerated (i.e. executed by the accelerator) or not (need to be executed by Informix server itself). Additionally, the AQT contains the name of the data mart it belongs to and the name of the accelerator instance where this data mart resides. Informix server will use this to send the acceleration request for the query to the correct accelerator. (One Informix server instance can be connected to several different accelerator instances that reside on different machines, but this detail is not of further interest to this discussion.)
Two facts can be deduced from this:
- A data mart in Informix Warehouse Accelerator belongs to a single database in Informix server. This means that a data mart cannot contain user data that is pulled together from different databases.
- A single data mart can have several AQTs in the corresponding database, with each AQT being a separate view in the database system catalog.
Before looking into the details of data mart design and the corresponding AQTs, I want to introduce a few terms frequently used in data warehousing and quote their short definition from Wikipedia:
"The star schema (also called star-join schema, data cube, or multi-dimensional schema) is the simplest style of data warehouse schema. The star schema consists of one or more fact tables referencing any number of dimension tables. The star schema is considered an important special case of the snowflake schema [...]."
"A snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake in shape. The snowflake schema is represented by centralized fact tables which are connected to multiple dimensions."
"In data warehousing, a fact table consists of the measurements, metrics or facts of a business process. It is often located at the center of a star schema or a snowflake schema, surrounded by dimension tables."
"In data warehousing, a dimension table is one of the set of companion tables to a fact table. [...] Contrary to fact tables, the dimension tables contain descriptive attributes (or fields) which are typically textual fields or discrete numbers that behave like text. These attributes are designed to serve two critical purposes: query constraining/filtering and query result set labeling."
Data mart definition and its implications with Informix Warehouse Accelerator
In general the data marts defined for the accelerator are snowflake schemas, even though they do not need to reflect the complete snowflake schema of a data warehouse. Often they are only a subset of the data warehouse schema, like a simpler star schema or even some degenerate form of a snowflake (like a single, linear 'branch' of a snowflake). An extremely degenerated snowflake would comprise a single fact table only without any dimension tables.
For queries to be accelerated by the accelerator two important prerequisites are that a) the fact table of the query is one of the fact tables in a data mart and b) the joins in the query are reflected as references in that data mart. The optimizer of the Informix server checks an incoming query by matching it against the existing AQTs. To aid this matching a single AQT exists for each fact table in the data mart.
The following graphic shows a rather simple snowflake schema that is a subset of an imaginary data warehouse. The table named "partlist" is the single fact table in the center of the snowflake. All other tables are dimension tables:
Creating a data mart with this snowflake schema as it is will have "partlist" as the single fact table and one AQT will be created in the database system catalog for this data mart. Queries that use "partlist" as fact table and the respective joins to the dimension tables can be matched against the AQT and accelerated with this data mart. Of course not all joins need to be used by a query. A query that only joins "partlist" with "article" can also be accelerated with this data mart. Even a query that only selects data from "partlist", without joining it to any other table, can be accelerated.
Subsets in a snowflake schema
But a query that joins "order" with "customer" and "customer" with "geography", without using "partlist", cannot be accelerated with such a data mart, because the only fact table, "partlist", is not used as fact table by the query. Instead the query uses "order" as fact table, but that is not a fact table in the data mart, it is 'only' a dimension table. The optimizer is not able to match this query against the AQT of this data mart.
A possible solution to this problem is to create a separate data mart with a subset of the above snowflake schema, containing only the tables "order", "customer" and "geography". This by itself can also be considered a snowflake schema, albeit small and somewhat degenerated. Here "order" is the single fact table and "customer" and "geography" are the dimension tables. It reveals the recursive nature of snowflake schemas, as with real snowflakes composed of ice crystals. An according AQT will be created in the database system catalog. Thus the optimizer will be able to match a query that uses "order" as fact table and "customer" and possibly "geography" as dimension tables. The above query can now be accelerated with this data mart.
However, as one data mart in Informix Warehouse Accelerator is an entity by its own, defining in this way a separate data mart containing a subset of tables that are already part of an existing data mart will duplicate the data in the accelerator. As all data of all data marts has to fit into memory, this is not an effective way of using memory resources.
The accelerator's solution to this dilemma is the ability to allow more than one fact table in a single data mart, even though such fact tables are strictly speaking not fact tables in the corresponding snowflake schema. Looking at the previous example it is obvious that tables "article", "order" and "customer" are prime candidates for the role of 'double agent' in the same data mart, being a dimension table for some queries, but a fact table for others. Determining these three tables as fact tables in the data mart definition will create a data mart of the same size as if only "partlist" is the single fact table. It is not necessary to define three separate data marts, each with one of those three tables as fact table, and therefore the data of the tables involved does not need to be duplicated in additional data marts. Instead, for the one data mart with the four fact tables, four corresponding AQTs will be created in the database system catalog. All the queries using one of the four tables as fact table (and the respective joins to dimension tables) can be matched against one of these four AQTs and thus can be accelerated.
As explained before, even a single fact table can be considered as a (very degenerated) snowflake schema. With that it follows that in a data mart in the accelerator every table can be a fact table. Referring to the above example, even the 'leaf node dimension tables' of the snowflake schema, "designer", "color", "label", "supplier" and "geography", could be designated as fact tables in the data mart definition. But it is questionable whether this is always a sensible thing to do. While it would allow queries using just one of these tables without any join to be accelerated, it is not guaranteed that there really will be an acceleration effect. If the dimension tables are relatively small, the overhead of sending the query to the accelerator and sending the result set back to Informix server may outweigh any acceleration effect. This is of course true for any query that only uses a single table (as fact table) without joining other tables. If the query requires a scan of the complete table but produces a small result set (e.g. when using an aggregate function), The accelerator may well be able to really accelerate the query. But if Informix server can gather all the data from an existing index only, without having to read actual data pages of the table, and the result set would still be considerably large, then the accelerator may not be able to execute the query fast enough to still have an acceleration effect after compensating the overhead.
When using the ISAO Studio GUI to design data marts, only table "partlist" of the above snowflake schema will be automatically designated as fact table. ISAO Studio will do this by analyzing the primary and foreign key relationships between all tables added to a data mart and single out as fact table the one table that is only referring to other tables, but not referred to. In the graphic above, table "partlist" has only outbound reference arrows but none coming inbound. Therefore it is automatically recognized as the fact table.
Designating dimension tables as additional fact tables in a data mart definition is a manual task in ISAO Studio to make sure that no undesired side effects are introduced otherwise.
Overlapping snowflake schemas
Of course a data mart can contain more than one table having only outbound references. Each of these tables will automatically be designated as fact table. These tables then will not have a 'double agent' role in the data mart as they are not a dimension table for any other table within the data mart. In fact, such a data mart contains several different snowflake schemas. This still may be a sensible data mart definition if the snowflake schemas overlap to some extent, e.g. some of the dimension tables are shared between the different snowflake schemas. Combining them into one data mart will avoid the duplication of the shared dimension tables. In the analogy to real snowflake ice crystals, this would be a conglomerate of snowflakes that are stuck together at the outer edges while each snowflake still has its own center, as depicted in the drawing. These snowflake centers correspond to the different fact tables in the data mart.
The AQTs in the database system catalog are special views that cannot and should not be used or manipulated with SQL statements. Normally they also cannot be seen, e.g. in the output of the dbschema utility the AQT definitions will be suppressed. Under certain circumstances however it may be desirable to see the AQT view definitions. Setting the UNIX shell environment variable AQT before starting the dbschema utility will include the AQT view definitions in the output. Set the environment variable with a command like the following (Bourne shell syntax):
AQT=1 ; export AQT