By now, you should have already seen that Informix 11.70 (Panther) has gone eGA.
Here's the link to the official announcement:
Panther is a very significant release containing features in many different areas including:
· Flexible Grid
· Application Development
There is a whole host of other collaterals describing all the features in Panther so I don’t want to duplicate the effort here. However, in line with my blog’s focus on Warehousing, I will highlight the significant Panther warehousing features in this blog entry.
For those of you familiar with the Informix product lines of 7.x, 8.x, and 9.x leading up to the current versions of 10.x and 11.x, one knows that the once touted convergence of XPS code line of 8.x with the IDS code line of 9.x (code name: Arrowhead) did not materialize after the IBM acquisition in 2001. I’m happy to report that as of 11.70, the Arrowhead project is now complete.
The significance of this convergence is that the warehousing capabilities of XPS, which have been very popular and well-received over the years, (and for which many key enterprise customers still depend on today), are now largely available in the IDS product. Customers can now expect the combination of key characteristics, i.e. availability, reliability, ease of admin, OLTP performance, to be equally applicable to warehousing workloads. While the XPS product will be maintained and supported indefinitely, customers now have a viable alternative to having a single enterprise RDBMS handle both workloads, either on a single server or in a cluster environment. If migration were to occur from XPS to IDS, it promises to be the easiest transition than any competitive products on the market due to the close proximity in architecture, utilities, administration, and capabilities between XPS and IDS.
There are two significant areas of increased capabilities in Panther for Warehousing. They are:
- Query processing for Warehousing – Star Join, Multi-index scan, Skip Scan and Light Scan
- Time-cyclic Data management – Ability to manage data organized by defined intervals, e.g. months or years, and to perform fragment operations online.
In this entry, I will first discuss advancements in Query Processing.
Up to now, IDS is not able to take advantage of using multiple indexes on a single table to speed query processing. Take for example that you have a CUSTOMER table containing millions; if not billions of rows (e.g. FACEBOOK’s Customer table should have over 500 million rows). Suppose further that you need to look for Male customers having a Masters degree with a HIGH income level that live in a certain zip code.
The SQL query would look like the following:
SELECT count (customer_id)
WHERE gender = “m”
AND income_category = “high”
AND education_level = “masters”
AND zip_code = “95032”;
IDS would currently process such “low-selectivity” predicates like other higher selectivity predicates by simply choosing the highest selectivity predicate, fetching each record (presumably using an index), then simply matching each record to see if the other predicates qualify. It is not difficult to see that many records will be retrieved and discarded before the resulting rows are returned to the user.
In Panther, the system can drastically reduce the number of rows fetched by deploying bit vector technology. Here, each row of the table is represented by a single bit in a bit vector. For each of the columns mentioned in the query, a separate index is deployed. For each predicate, e.g. zip code = “95032”, a number of bits, each representing a row that qualifies in the table is marked. Once the bit vectors are built for each predicate, these bit vectors can be combined using the AND operator where the resulting bit vector represents the intersection of all the conditions specified by the predicates. This single bit vector contains all the rows that need to be retrieved, presumably much fewer than using the previous method. The following diagram shows this method.
As a further optimization for faster retrieval, the Row IDs (RIDs) are sorted prior to the retrieval of the records from the table in order to facilitate sequential reads from the table. By “skipping” to only the records in the resulting bit vector in a sequential manner, known as SKIP SCAN, we have now further reduced the retrieval time needed.
The benefit of Multi-index scan goes beyond processing a single large table as you will see with the discussion on Star Join below.
Star Join is a join method designed to speed up queries in dimensionally modeled schema, called Star Schema. Dimensional Modeling (and the topic in the book) was invented by Ralph Kimball, also the founder of Red Brick Systems. Red Brick was acquired by Informix in 1998 and is part of the Informix product line. Just like XPS, Red Brick is actively supported and maintained by IBM and still used by hundreds of customers world-wide.
As the diagram below depicts, having multiple dimension tables surrounding a central Fact table gives the impression of a star, and thus its name.
I have already discussed the technologies behind Red Brick so for those interested, please refer to my earlier entries in this blog. Since its inception in the 90’s, other competitors have all implemented Star Joins in some way. Instead of materializing the Star Index, all other products have chosen to do Star Joins “on the fly”. XPS’ method is known as Push-Down Hash Join, which is now implemented in IDS as of the Panther release.
Again, without duplicating a lot of other materials written on this topic, the goal of having a Star Join is to reduce the number of records fetched from the Fact table (similar to the example in the previous example involving the Customer table), but also to minimize the number of records produced as intermediate result during pair-wise joins of multiple dimension tables.
For example, if a query using the Star Schema diagram depicted above involving dimension tables and the Fact table, a Left-Deep Tree involving the Fact table and all the dimensions would result, and would look like the following diagram.
Here F denotes the Fact Table and D1 denotes dimension table 1, D2 is dimension table 2, and so on. For the example, it assumes that the Fact table has 10 million rows with default selectivity of 1/1000, and each of the dimension tables have 10K rows with default selectivity of 1/10. While IDS does have the ability to do Hash Join, for which it is superior to standard Nested-Loop Join for warehousing queries, the intermediate results produced after each join is still prohibitively large as you can see from the example.
In Panther, the Star Join method is now included as part of the IDS’ optimizer costing algorithm. If conditions are met based on statistics of the tables and that foreign keys are defined within the Fact table corresponding to primary keys in the dimension tables, then the Star Join method will be chosen.
Instead of the Left Deep tree, a Star Join in IDS becomes a Right Deep Tree, taking advantage of Multi-Index Scan method described earlier to reduce the number rows being fetched from the Fact table, thus significantly reducing the intermediate result set for each of the Hash Joins performed with each dimension table, as shown in the diagram below.
Beta testing by customer/partners have shown significant improvement with such dimensional queries. Depending on the number of rows in each of the tables and the predicates involved, it is easy to see how performance can double, triple or more. The next blog entry will discuss advancements in time-cyclic data management.