SQL OLAP Functions: A Primer
Informix is adding a set of SQL OLAP functions in an upcoming release. The following write-up is to introduce the concept of such functions which may be new to Informix users. It is not meant to be a reference manual that lists the complete syntax and optional parameters. These analytic functions are valuable to users because they simplify the expression of complex but commonly used business questions and are efficient without resorting to subqueries or other calculations.
OLAP is OnLine Analytical Processing (as opposed to OnLine Transaction Processing – OLTP). OLAP is generally described as part of Business Intelligence that allows users to perform business reporting for sales, marketing, inventory control, business processes, etc. It usually works with multi-dimensional databases requiring frequent roll-up, roll-down and slicing and dicing.
While OLAP is the overall encompassing term for the type of processing to be done, there are specific functions known as SQL OLAP functions that are supported by different database systems and/or BI tools such as Cognos, Business Object, etc.
SQL OLAP functions are not new. In fact, they adhere to and extend the ANSI SQL-99 standards, known as the “Elementary OLAP” package. Many other databases and BI tools have provided these functions for some number of years, but SQL OLAP functions have not been supported by the Informix Dynamic Server (IDS) until now.
The OLAP functions to be supported by Informix are as follows:
· Ranking RANK(), DENSE_RANK(), DENSERANK(), CUME_DIST(),
· Numbering ROW_NUMBER(), ROWNUMBER()
· Aggregate RATIO_TO_REPORT, RATIOTOREPORT
· First/Last FIRST, LAST
Note that if the underlying database does not support these OLAP functions, as in the case with Informix currently, then the user must rely on the BI tool to perform these calculations. If the underlying database system supports such OLAP functions, users can choose to submit SQL directly to the database or to submit it through the BI tool. In order to take advantage of the underlying data base support of OLAP functions, the tool must recognize this fact and generates the appropriate SQL to take advantage of it. The difference in performance can be enormous, as in orders of magnitude differences. A new version of Cognos Enterprise BI will generate SQL to take advantage of the OLAP functions implemented in IDS in an upcoming release.
SQL OLAP Functions: Concepts
A window partition is a subset of rows returned by a query, as defined by one or more columns in a special OVER() clause. SQL OLAP functions are applied to these partitioned result sets within the scope of a single query expression.
Olap_function () over (partition by col1, col2…)
Non-analytic functions compute their calculations with respect to every row in the result set. However, an OLAP function in a query is applied with respect to the contents of its window partitions. An example with the RANK function is as follows:
Select date, store_name, sum(dollars) as sales_dols,
rank () over (partition by date order by sales_dols desc) as date_rank
from period, store, sales
where period.perkey = sales.perkey
and store.storekey = sales.storekey
and state = “CA”
group by date, store_name
order by date;
DATE STORE_NAME SALES_DOL DATE_RANK
2012-01-02 Beaches Brew 785.55 1
2012-01-02 Roasters, Los Gatos 762.25 2
2012-01-02 Los Gatos Roasting Company 636.25 3
2012-01-02 Cupertino Coffee Supply 634.00 4
2012-01-02 Instant Coffee 457.75 5
2012-01-03 Instant Coffee 713.75 1
2012-01-03 Los Gatos Roasting Company 633.25 2
2012-01-04 Instant Coffee 1031.50 1
2012-01-04 Los Gatos Roasting Company 613.95 2
The query returns ordinal rankings for sales figures. The stores are ranked according to their sales totals for each value; as the date changes, the ranking values are reset. Multiple sets of ranked values (1 to n) are computed within a single query expression.
In practice, OLAP functions allow application developers to compose analytic business queries more easily and more efficiently. For example, moving averages and moving sums can be calculated over various intervals, aggregations and ranks can be reset as selected column values change; and complex ratios can be expressed in simple terms. Within the scope of a single query expression, several different OLAP functions can be defined, each with its own partitioning rules.
Types of OLAP Functions
OLAP functions supported by Informix fall into four categories:
- DENSE_RANK and DENSERANK (synonyms)
- ROW_NUMBER and ROWNUMBER (synonyms)
- RATIO_TO_REPORT AND RATIOTOREPORT (synonyms)
The OVER Clause
The OLAP OVER clause differentiates OLAP functions from other analytic or reporting functions. The OVER clause has three distinct capabilities:
- Defining window partitions (PARTITION BY clause)
- Ordering rows within partitions (ORDER BY clause)
- Defining window frames (ROWS/RANGE specification)
Depending on the type of OLAP function in question (ranking, aggregation, etc.), the OVER clause is subject to different usage rules. The following discussion explains the concepts of window partitions, ordering partitioned rows, and window frames in general terms.
OLAP Window Partitions
A window partition is a set of rows that are grouped together for the purpose of applying an OLAP function. The OLAP function is applied to every row, but is calculated with respect to the rows in the partition. If no partition is specified, the OLAP function is computed over the complete intermediate result set.
Ordering Rows within Partitions
The OLAP ORDER BY clause is separate and distinct from the ORDER BY clause that can be appended to any query expression. Typically, the ORDER BY clause for an OLAP function defines the expressions for sorting rows within window partitions; however, the ORDER BY clause can be used without a preceding PARTITION BY clause, in which case the sort specification ensures that the OLAP function is applied to a meaningful (and intended) ordering of the intermediate result set.
This specification is a prerequisite for the ranking family of OLAP functions; it is the ORDER BY clause, not an argument to the function itself, that identifies the measures(s) for the ranking values. In the case of OLAP aggregations, the ORDER BY clause is not required in general, but it is a prerequisite to defining a window frame. The partitioned rows must be sorted before the appropriate aggregate values can be computed for each frame.
In the following example, the window partition is defined by the “Dimension” column. The five rows that contain the value “A” comprise a window, as do the five rows for “B” and “C”. The Meaure column is the input data for an OLAP RANK function; the rows within each partition are ordered by the Measure values. When the RANK function is applied, it is calculated over each partition. The Measure values are ranked 1 through 5 within each partition.
The OLAP ROW_NUMBER function is also calculated in this example, without a PARRITION BY clause, to produce consecutive row numbers for the entire result set.
The SQL for this query would look like this:
Select row_number() over () as row,
rank () over (partition by dimension order by measure)
You can define window partitions according to the values in a single dimension or you can specify multiple dimensions. For example, you could partition the rows based on city and state values or month, quarter, and year values.
For non-ranking OLAP functions, such as aggregations, you can define a window frame. A window frame defines a set of rows within a window partition. When a window frame is defined, the OLAP function is computed with respect to the contents of this moving frame rather than the fixed contents of the whole partition.
The definition of a window frame can be row-based (ROWS specification) or value-based (RANGE specification).
Row-Based Window Frames
The reference point for all window frames in the current row. The SQL OLAP syntax provides mechanisms for defining a row-based window frame as any number of rows preceding and/or following the current row.
In the following example, rows 1 through 5 represent a partition; each row becomes the current row as the OLAP window frame slides forward. The frame in this case is defined as between current row and 2 following, so each frame includes a maximum of three rows and a minimum of one row. When the frame reaches the end of the partition, only the current row is included. The shaded areas indicate which rows are excluded from the frame at each step.
In other words, the window frame in this case imposes the following rules:
- When row 1 is the current row, rows 4 and 5 are excluded
- When row 2 is the current row, rows 5 and 1 are excluded
- When row 3 is the current row, rows 1 and 2 are excluded
- When row 4 is the current row, rows 1 to 3 are excluded
- When row 5 is the current row, rows 1 to 4 are excluded
The following example applies these rules to a specific set of values, showing the OLAP AVG function that would be calculated for each row. The sliding calculations produce a moving average with an interval of three rows or fewer, depending on which row is the current one.
1 A 10 53.3
2 A 50 90.0
3 A 100 240
4 A 120 310
5 A 500 500
The SQL for this query would look like this:
Select row_number () over () as row,
avg (measure) over (partition by dimension)
order by measure
rows between current row and 2 following) as OLAP_AVG
The averages are computed as follows:
- Row 1 = (10+50+100)/3 = 53.3
- Row 2 = (50+100+120)/3 = 90.0
- Row 3 = (100+120+500)/3 = 240
- Row 4 = (120+500)/2 = 310
- Row 5 = (500)/1 = 500
Similar calculations would be implemented for all subsequent partitions (B, C, etc.) in the result set.
Value-Based Window Frames
The previous example demonstrates a row-based window frame definition. The SQL OLAP syntax also supports another kind of window frame, whose limits are defined in terms of a value-based or range-based set of rows rather than a specific sequence of rows.
For example, a frame could be defined as the set of rows with Year values some number of years preceding or following the current row’s year. A specific example might be:
Order by year asc range between current row and 1 preceding
where 1 preceding means the current row’s Year value minus 1
This kind of range specification is inclusive. If the current row’s Year value is 2000, all rows in the window partition with Year values 2000 and 1999 qualify for the frame, regardless of the physical position of those rows in the partition. The rules for including and excluding value-based rows are quite different from the rules applied to row-based frames, which depend entirely on the physical sequence of rows.
Put in the context of an OLAP AVG calculation, the following partial result set further demonstrate the concept of a value-based window frame. The frame consists of rows that:
- Have the same year as the current row
- Have the same year as the current row minus 1:
The SQL for this query would look like this:
select row_number() over () as row,
avg (measure) over (partition by dimension
order by year asc
range between current row and 1 preceding ) as olap_avg
The averages are computed as follows:
- Row 1 = 1999; rows 2 to 5 are excluded; AVG = 10000/1 = 10000
- Row 2 = 2001; rows 1, 4, 5 are excluded; AVG = (5000+1000)/2 = 3000
- Row 3 = 2001; rows 1, 4, 5 are excluded; AVG = (5000+1)000/2 = 3000
- Row 4 = 2002; row 1 is excluded; AVG = (5000+1000+12000+3000)/4 = 5250
- Row 5 = 2002; row 1 is excluded; AVG = (5000+1000+12000+3000)/4 = 5250
Most of the information provided here can be found in various manuals of database systems/BI tools that have implemented SQL OLAP standard. The examples used here are taken from the Red Brick Warehouse SQL Reference Guide with slight modifications.
This entry is primarily a recap of a recent article published by IDC (Carl Olofson) titled "The Third Generation of Database Technology: Vendors and Products That Are Shaking Up the Market". My disclaimer for this entry is that I cannot attach the original article published by IDC (due to copyright). Therefore, I will attempt to summarize and discuss the salient points in the article. It will not be a complete summary or review of the article, due to time and relevance to this audience. At times, I may inject my own thoughts and comments (highlighted). I am hoping to make people aware of some of the competitive products already in the marketplace and others along the way (including IBM and Informix). Part I will cover the 1st and 2nd generation technology, while Part II (the good stuff) will cover the 3rd generation systems.
The premise of the article is that a new generation of DBMS technology is sending a simple message to current generation of DBA's and users that "Everything you know is wrong". These new systems will encourage you to forget disk-based partitioning schemes, indexing strategies, and buffer management. It will embrace a world of large-memory models, multi-core processors, clustered servers, and highly compressed columnwise storage. He further predicts that wihin 5 years:
Most data warehouses will be stored in a columnar fashion.
Most OLTP databases will either be segmented by an in-memory database or reside entirely in memory.
Most large-scale database servers will achieve horizontal scalability through clustering.
Many data collection and reporting problems will be solved with databases that have no formal schema at all.
The First Generation
The first generation of database technology started in the 60's and continued into the 70's. It served two distinct purposes: 1) to enable disparate but related applications to share their data by a means other than passing files back and forth, and 2) to provide a platform for independent data query and reporting that did not require custom code.
At the time, computer programs typically ran in batch mode, so the databases were linked to applications as layers of data organization and indexing between the application code and the file system. Because memory was extremely limited, these systems lacked generalized data access support beyond services that required very explicit data structure navigation, which was coded directly into the application.
Each database management system was different, having its own style for organizing data, and its own set of services or access languages (DDL and DML) that were quite distinct and completely incompatible. This meant that once one developed an application for one of the DBMSs, that application was "locked in". A DBMS migration would require a total application rewrite.
Because of this "lock in" factor, most of these DBMSs are still actively in use today. These products include IBM's IMS, CA's IDMS and DATACOM, Unisys' DMS II (that I worked on), and Software AG's Adabas. All these products have undergone various forms of "modernization" since the 1970's, including the addition of relational interfaces, and support for service conventions such as SOA. The reason that these products are out of favor is due to the lock in quality as well as the cost and complexity of developing and maintaining these databases.
The Second Generation
A new paradigm began to emerge in 1970 with the publication of A Relational Model of Data for Large Shared Data Banks by Dr. E. F. Codd. His premise was that data should be made accessible by managing it in a catalog structured along the lines of mathematical set theory, presented as attributes of relations organized into tuples. IBM then developed a test DBMS based on Codd's ideas, called System R using jargons like "tables", "columns" and "rows", metaphors made familiar by popular spreadsheet products that were delivering computer power to the masses. IBM made relational DBMS seem comprehensible and accessible to ordinary people.
Relational DBMS offer a standard way of modeling and accessing data that could enable users to break out of the dreaded lock in effect. At first, multiple competing languages were proposed for relational databases. IBM took a generalized query language that it had developed for System/R, called the structured query language (SQL), and extended it to include full DDL and DML capabilities.
By the mid-1980's, users resented lock in at not only the DBMS level but also the operating system level. Most systems, whether mainframes or midrange, were driven by proprietary OSs designed and controlled by their computer vendors. Users were increasingly attracted to what were then called minicomputers, driven by Unix. Such systems were called "open systems".
New RDBMS vendors quickly embraced open systems, pushing the idea that users could get cheaper processing and escape lock in at both the OS and the DBMS level. The were fantastically successful, and companies like Ingres, Oracle, Informix, and Sybase emerged as the major players in the late 1980s and early 1990s.
These second generation DBMS products were well designed for the economic constraints of their times. Memory was expensive, and most systems had one or a few processors. So, they used buffer management techniques designed to minimize memory usage. They tended to manage processing threads based on single processor architectures, and they based their internal data structures on optimal layout on dedicated disk drives that were typically direct attached storage using SCSI connections. Some RDBMS products offered cluster configurations for greater scalability, but these would typically require purpose-built hardware provided by the manufacturer: Teradata, Sequent, and Tandem (where I also worked) were examples of vendors of such products.
Since the 1990s, some RDBMS products have been enhanced with various kinds of alternate clustering support that features shared disk support enabled by cluster file systems deployed on network-attached storage (NAS) or storage area networks (SANs). [Mach-11 anyone?] Some also offer caching techniques, 64-bit buffers, and automated database and query optimization, multi-level partitioning, and some degree of parallel query processing, all designed to deliver incremental benefits over the basic product.
Nonetheless, these products are still based on the core desgin principles that drove their early development, i.e. laying out data on disk for optimal performance by scattering data across volumes, partition data to simplify indexes and maintenance operations, and query optimization options that take full advantage of the way the data is organized on storage. In short, the systems are still essentially focused on so-called spinning disks, and with that comes both limitations of internal operation scalability and the bottlenecks represented by the disk I/O.
Everything up to now should be already familiar to the reader. It does represent an accurate view (as shared by other authors as well) of the current state of database technology. Stayed tuned for Part II of this entry.
What is the significance of a moving a dot on a graph? If it is Gartner’s Magic Quadrant for Data Warehouse Database Management Systems which it published on Jan 31, 2013, then it is rather significant.
IBM has obtained the right to distribute Gartner’s report. However, we must reference the link and not send the document around. Here is the link:
To appreciate the movement of the dot, one must compare the Magic Quadrant from last year (Feb 2012) and this year’s (Jan 2013) [courtesy of Glen Zimmerman, IBM]. It shows that Gartner has recognized the fact that IBM’s “completeness of vision”, not only is it in the desired upper right hand quadrant, has moved further right in its “visionary” axis beyond that of Oracle. It also managed to stay ahead of SAP (with HANA), Microsoft and EMC (with Greenplum) in both “vision” and “ability to execute”.
As you will see in the report, which goes into fair amount of detail on each vendor, IBM’s portfolio for Data Warehousing includes, ISAS, Netezza, IDAA on System Z, the newly announced PureData System appliances and for the first time, the Informix Warehouse Edition. More importantly, it points out that the Informix offering is the only in-memory solution for data warehousing and analytics within IBM.
Since this is a blog for Informix technologies, I will expand more on the progress of the Informix Warehouse Accelerator product found in the Informix Warehouse Edition.
Since its release in April 2011, IWA has attracted significant market adoption now spanning some 17 countries around the world and across many diverse sectors such as retail, transportation and government. While the majority of IWA customers are current Informix customers using Informix for OLTP, the product has displaced non-IBM vendors previously used for the customers' analytical applications. Customers can now use Informix as an integrated, cost-effective platform for highly reliable transactional applications as well as for large scale data warehousing and analytical needs. We recognized the game-changing BLINK technology from IBM Research for its in-memory, columnar, highly parallel capabilities and have brought the product to market almost 2 years ago.
We are pleased that Gartner has taken notice of our product and that we played a role, as part of IBM, to “move the dot”.
For information about the upcoming Informix 12.10 release, please register to attend the webcast scheduled for March 5th at https://event.on24.com/eventRegistration/EventLobbyServlet?target=registration.jsp&eventid=571631&sessionid=1&key=6740A7076E0FE9C59EBFDC80C8487D34&sourcepage=register.
This is Brussels covered with snow from one of the floors at the Sheraton Hotel where the SW IOT Sales and Tech Sales Training was held during the week of 2/8/10-2/12/10. SW IOT is defined as the Southwest region of Europe comprising of countries like Spain, Portugal, France, Italy, Belgium, Netherlands, and Luxembourg. Subjects covered by this internal sales training include Infosphere, Data Management, Optim, and others. Representing Informix were Terri Gerber for Sales and I for Tech Sales. I presented sessions on the following: 1) Overall Informix business and recent references, 2) Preview of Panther features, 3) Informix Warehousing and ISAO, 4) Informix High Availability features and 5) Informix Embeddability.
The first day consisted of a number of general sessions where Alyse Passerali (WW DM Sales VP) talked about the overall state of business and key plays for DM in 2010. This is followed by general sessions on how to use "White Boarding" for Sales. White-boarding is a new process to get our Sales teams in engaging potential customers in an interactive dialogue to learn their business problems and demonstrate our solutions. I sat through each of the sales sessions that Terri Gerber conducted (Spain/Portugal, Benelux, France, Italy0 and answered questions whereever I could.
My technical sales training sessions were quite well attended, with some dedicated IT specialists from France, Italy, Belgium and Spain. Jose Manuel Gallud has always been a great help in the region. I met others including Mario Rocco Garasto, Stefan Van den Borre and Dr. Simone Moncini. I was able to use a number of public references, e.g. Cisco, Jalisco Mexico, DHL-Germany, China Mobile, Peapod, Hilton Hotels, and others.
One of the keynote session was regarding the new solution offered by Guardiam, a recent acquisition specializing in Data Governance. Their solutions work with any database as they do "sniffing" of SQL operations before they get to the database. Their solution usually involves a hardware component, but a software only solution also exists. I've confirmed with our own in-house security expert (Jonathan Leffler) that this is a solution we should be discussing with our Informix customers.
Besides the presentations, I had good interactions with many of the account teams resulting in possible opportunities in the region. If I do return to the region at their request, I hope it will have better weather. With the snow in Brussels, the airport was temporarily closed. Just walking around the hotel was difficult due to the snow and windy conditions. But there were some fantastic food. Of course, some authentic Belgium waffle and Belgium chocolate. On my way back to the U.S., I was delayed in Washington DC for 6 hours, not due to snow, but due to mechanical problems on the plane. They finally switched plane and instead of getting me back around 7:30pm, I got in at 1:30am. in SFO. Add another hour of driving to get home after that.
In case you're not one of the approximately 1 Billion people tuned into the World Cup soccer event, there was yet another huge controversy similar to the one back in 1966 between the same two teams. If you don't know what I'm referring to, it is the goal that England should have been given in their match with Germany this past weekend. You can see the actual YouTube segment with this link:
http://www.youtube.com/user/KingUndercover1, but the following picture shows where the ball is relative to the goal line.
Of course, the German press had a whole different view of the event, choosing to depict it as follows, which I do admit to be quite clever. (picture courtesy of Sandor Szabo)
I'm sure the English don't find this to be particularly humorous.
So what does this have to do with Informix? Well, if you recall that
one of our best references in recent years has been the solution
offered by Cairos/IMPIRE of Germany, who uses Informix to provide
real-time sports statistics in the German soccer league (Bundesliga). That solution has been very successful and continues to be used today. However, at the same time that the solution is used for statistics, the idea of introducing the use of a RFID chip in the soccer ball also came up. The information captured by the RFID chip will definitively identify the location of the soccer ball. It would put an end to these types of disputes that has plagued the soccer world for so many years. We have the technology with Informix to solve this problem. An article from the RFID Journal points to additional testing that has to be done, but it is nevertheless available.
However, like some many things, it is not about the technology. FIFA (The world Football Association) is refusing to even consider instant-replay, let alone RFID chip in the ball. In fact, their reaction to the weekend fiasco has been to ban all instant replays in the stadium.
I plan to resume my normal blog of hopefully more Informix pertinent information in another week or so (which happens to coincide with the conclusion of the world cup).
By now, you should have already seen that Informix 11.70 (Panther) has gone eGA.
Here's the link to the official announcement:
Panther is a very significant release containing features in many different areas including:
· Flexible Grid
· Application Development
There is a whole host of other collaterals describing all the features in Panther so I don’t want to duplicate the effort here. However, in line with my blog’s focus on Warehousing, I will highlight the significant Panther warehousing features in this blog entry.
For those of you familiar with the Informix product lines of 7.x, 8.x, and 9.x leading up to the current versions of 10.x and 11.x, one knows that the once touted convergence of XPS code line of 8.x with the IDS code line of 9.x (code name: Arrowhead) did not materialize after the IBM acquisition in 2001. I’m happy to report that as of 11.70, the Arrowhead project is now complete.
The significance of this convergence is that the warehousing capabilities of XPS, which have been very popular and well-received over the years, (and for which many key enterprise customers still depend on today), are now largely available in the IDS product. Customers can now expect the combination of key characteristics, i.e. availability, reliability, ease of admin, OLTP performance, to be equally applicable to warehousing workloads. While the XPS product will be maintained and supported indefinitely, customers now have a viable alternative to having a single enterprise RDBMS handle both workloads, either on a single server or in a cluster environment. If migration were to occur from XPS to IDS, it promises to be the easiest transition than any competitive products on the market due to the close proximity in architecture, utilities, administration, and capabilities between XPS and IDS.
There are two significant areas of increased capabilities in Panther for Warehousing. They are:
- Query processing for Warehousing – Star Join, Multi-index scan, Skip Scan and Light Scan
- Time-cyclic Data management – Ability to manage data organized by defined intervals, e.g. months or years, and to perform fragment operations online.
In this entry, I will first discuss advancements in Query Processing.
Up to now, IDS is not able to take advantage of using multiple indexes on a single table to speed query processing. Take for example that you have a CUSTOMER table containing millions; if not billions of rows (e.g. FACEBOOK’s Customer table should have over 500 million rows). Suppose further that you need to look for Male customers having a Masters degree with a HIGH income level that live in a certain zip code.
The SQL query would look like the following:
SELECT count (customer_id)
WHERE gender = “m”
AND income_category = “high”
AND education_level = “masters”
AND zip_code = “95032”;
IDS would currently process such “low-selectivity” predicates like other higher selectivity predicates by simply choosing the highest selectivity predicate, fetching each record (presumably using an index), then simply matching each record to see if the other predicates qualify. It is not difficult to see that many records will be retrieved and discarded before the resulting rows are returned to the user.
In Panther, the system can drastically reduce the number of rows fetched by deploying bit vector technology. Here, each row of the table is represented by a single bit in a bit vector. For each of the columns mentioned in the query, a separate index is deployed. For each predicate, e.g. zip code = “95032”, a number of bits, each representing a row that qualifies in the table is marked. Once the bit vectors are built for each predicate, these bit vectors can be combined using the AND operator where the resulting bit vector represents the intersection of all the conditions specified by the predicates. This single bit vector contains all the rows that need to be retrieved, presumably much fewer than using the previous method. The following diagram shows this method.
As a further optimization for faster retrieval, the Row IDs (RIDs) are sorted prior to the retrieval of the records from the table in order to facilitate sequential reads from the table. By “skipping” to only the records in the resulting bit vector in a sequential manner, known as SKIP SCAN, we have now further reduced the retrieval time needed.
The benefit of Multi-index scan goes beyond processing a single large table as you will see with the discussion on Star Join below.
Star Join is a join method designed to speed up queries in dimensionally modeled schema, called Star Schema. Dimensional Modeling (and the topic in the book) was invented by Ralph Kimball, also the founder of Red Brick Systems. Red Brick was acquired by Informix in 1998 and is part of the Informix product line. Just like XPS, Red Brick is actively supported and maintained by IBM and still used by hundreds of customers world-wide.
As the diagram below depicts, having multiple dimension tables surrounding a central Fact table gives the impression of a star, and thus its name.
I have already discussed the technologies behind Red Brick so for those interested, please refer to my earlier entries in this blog. Since its inception in the 90’s, other competitors have all implemented Star Joins in some way. Instead of materializing the Star Index, all other products have chosen to do Star Joins “on the fly”. XPS’ method is known as Push-Down Hash Join, which is now implemented in IDS as of the Panther release.
Again, without duplicating a lot of other materials written on this topic, the goal of having a Star Join is to reduce the number of records fetched from the Fact table (similar to the example in the previous example involving the Customer table), but also to minimize the number of records produced as intermediate result during pair-wise joins of multiple dimension tables.
For example, if a query using the Star Schema diagram depicted above involving dimension tables and the Fact table, a Left-Deep Tree involving the Fact table and all the dimensions would result, and would look like the following diagram.
Here F denotes the Fact Table and D1 denotes dimension table 1, D2 is dimension table 2, and so on. For the example, it assumes that the Fact table has 10 million rows with default selectivity of 1/1000, and each of the dimension tables have 10K rows with default selectivity of 1/10. While IDS does have the ability to do Hash Join, for which it is superior to standard Nested-Loop Join for warehousing queries, the intermediate results produced after each join is still prohibitively large as you can see from the example.
In Panther, the Star Join method is now included as part of the IDS’ optimizer costing algorithm. If conditions are met based on statistics of the tables and that foreign keys are defined within the Fact table corresponding to primary keys in the dimension tables, then the Star Join method will be chosen.
Instead of the Left Deep tree, a Star Join in IDS becomes a Right Deep Tree, taking advantage of Multi-Index Scan method described earlier to reduce the number rows being fetched from the Fact table, thus significantly reducing the intermediate result set for each of the Hash Joins performed with each dimension table, as shown in the diagram below.
Beta testing by customer/partners have shown significant improvement with such dimensional queries. Depending on the number of rows in each of the tables and the predicates involved, it is easy to see how performance can double, triple or more. The next blog entry will discuss advancements in time-cyclic data management.
While Informix Dynamic Server (IDS) is the flagship database of the IBM Informix family of products, there are other databases in its portfolio that are actively deployed by customers in various industries. In the IIUG survey completed in February of 2010, it reported that 8.6% of the respondents are running on SE, 5% on SE, 3.3% on XPS and 1.3% on Red Brick. Of course, this represents data from customers who responded. We know from PMRs (bug reports) that there are still significant activities on these products.
The purpose of this article is twofold. First to state these products are still being sold by IBM and more importantly, actively supported. It is true that these products are in “maintenance mode” in that there are no significant new features in their fixpacks, but in most cases, they are certified against the latest OS versions on platforms that they are supported in. Second, following my data warehousing theme in past entries, to take a look at key technologies in both XPS and Red Brick, for which we can still learn from and to obtain ideas for which to incorporate into IDS. In this Part I of the entry, I will concentrate on the Red Brick product.
Red Brick Warehouse
Red Brick was originally founded by Ralph Kimball in 1987. It became a product company in 1990 and over the course of the next 15 years, produced significant technologies for data warehousing. Commonly used industry terms like Star Schema, Star Join, and Dimensional Modeling were first used at Red Brick.
Here are descriptions of some of the key technologies:
Star Join/Star Index
A dimensional model contains fact tables (which hold the quantitative data about a business – the facts being queried) and dimension tables (which hold data that describe the dimensional attributes). A dimension table is connected to a fact table by means of a foreign key reference.
All databases join tables in pairs in what’s known as a pairwise join. Red Brick recognized that they could improve performance by building a “pre-join” index consisting of foreign keys of dimension tables. The Starjoin algorithm allows multiple tables to be joined in a fast, single pass operation to yield fast query times. A key advantage for the Starjoin approach in Red Brick is that as you have more dimensions in a single query, the better is its performance as compared to other systems that deploy the pairwise join algorithm. Most database products in the industry have implemented Star Join in one form or another, but no company has implemented it using Star Index.
[Note that the Push-Down Hash Join feature being considered in IDS’ Panther release provides a similar functionality]
TargetIndex and TargetJoin
TargetIndex and Targetjoins make up Red Brick’s bitmap index technology. Bitmap index (also known as bit vectored index) maintains information by tracking each unique column value in a compact bit representation, with a pointer back to each row in the table with that specific value. Sybase-IQ is another vendor that has broadly touted this technology for many years.
TargetIndex exist on single columns that are weakly selective, i.e. where many rows would return based on that constraint. For example, GENDER column returns two values: M/F. Targetjoin is a join algorithm that uses target indexes to identify lists of candidate dimension rows; rows that exist on each list are then retrieved from the fact table.
[Note that the Multi-column index feature being considered for IDS’ Panther release provide similar functionality]
Vista: Precomputed Views
Those familiar with the concept of Materialized Views or Summary Tables will recognize this feature as such. DB2 calls it MQT (Materialized Query Table). One of the most frequent operations performed in decision-support queries is the calculation of aggregate totals such as monthly and quarterly totals, revenue by product or customers, or other types of grouping analysis. In the absence of aggregates, these queries must read hundreds of thousands or million of rows in order to calculate and group the results, and it must do this each time the queries are run. Red Brick’s Vista solution rewrites queries containing aggregate functions automatically and transparently, giving it “another type of index” to use to improve performance. It also comes with a run-time Advisor that provides a cost/benefit analysis on existing and potential new aggregates.
The aggregate table is defined to Vista via a precomputed view. The precomputed view looks much like the standard definition of a view in Red Brick and defines the aggregate and its relationship to the underlying detail table. Vista uses this, along with the defined hierarchies, to determine whether it will rewrite a query to use the aggregate table.
Vista uses defined hierarchies to enable it to use aggregate tables where the aggregation of the table is not at exactly the level the query requires. For example, if a query requires an aggregation by year by store and an aggregate tables exists with an aggregation by month by store that has been defined to Vista, it can be utilized to satisfy the query. Hierarchies greatly reduce the number of summary tables that have to be created.
Versioning (Query Priority Concurrency)
Red Brick’s Versioning facility provides for real-time updates to a data warehouse by allowing data to flow into the warehouse without affecting query response times of the users. The versioning technique permits data modification transactions to occur on a separate version of the same data, while queries are in process. When the data is available for update, the versioned updates are then made the base data.
This facility may seem strange to those accustomed to working with OLTP databases as data in a database should always be kept up to date. In a DSS environment, however, it may be desirable for specific data (or an entire database) to remain static (e.g. pricing information) for a given period before it is updated. Providing fast and uninterrupted access to data is key to warehouse performance.
Table Management Utility (TMU)
Red Brick’s TMU is essentially a fast loader for the warehouse. What makes it different is that it is able to perform loading, do data conversion, perform referential integrity (RI) checks, build/update index all in a single pass, making it query ready without having separate steps to build indexes or check constraints. There were customers that bought Red Brick simply for its load speed.
The Red Brick TMU has different load modes, all to handle duplicates and new rows. They are: Append, Insert, Replace, Modify and Update. One innovative option unique to Red Brick is something called Auto Row Generation. When loading a data warehouse, it is inevitable that there will be “dirty” rows, i.e. missing dimension rows. Other database products will discard these offending rows. After the initial load, the offending rows must be examined, the appropriate dimension rows must be added, and the previously discarded rows must be reloaded, a very time-consuming process.
The Autorowgen feature gives the user the ability to automatically generate rows in the referenced (dimension) tables while the data is being loaded. With Autorowgen, when a row is automatically inserted into a dimension table, the primary key is populated with the missing value, and the rest of the columns are populated with the default value defined for them.
There are many more features in Red Brick designed for data warehousing that deserve attention. Even with the discussion of data warehouse appliances (in my previous entries), there is a need to enhance the core capabilities of the server itself.
Today marks the 10th anniversary of Informix as part of IBM. While IBM has purchased many software companies since 2001, Informix represents the only other major DBMS (besides DB2) in IBM's Data Management portfolio. Informix has had a proud tradition of innovations, as seen in the attached diagram.
Starting with the Dynamic Scalable architecture which dates back to the early 1990's where multi-threading was inherent and features such as Data Fragmentation came as early as the late 1990's. Beyond that, the product included key replication features for Enterprise Replication (ER), High Availability Data Replication (HDR) leading up to the release of Flexible Grid in 11.70. That release also includes significant warehousing capabilities by completing the "Arrowhead" project that was started prior to the IBM acquisition. With the recent addition of the Informix Warehouse Accelerator, Informix now provides a highly performant, highly reliable database platform for OLTP and Business Analytics workloads with minimal administration.
We have also seen a resurgence of interest in the Extensibility features that have been part of the product since its acquisition of Illustra. The Time-Series capabilities has now found tremendous popularity in the Smart-Metering market resulting in wide acceptance within the Energy & Utilities industry segment. And with the thousands of customers running on 4GL, the recent release of Informix Genero is providing them with new ways to modernize their application environments.
10 years in this industry can be a lifetime and Informix has not only survived, but prospered in these years under IBM. We look forward to the next 10 years.
The purpose of this blog entry is to preview that both IBM and HP will be coming out with formal statements stating IBM and more specifically Informix support for Itanium on HP-UX. Current Informix customers running on HP-UX Itanium should not have any doubts about continuing on this platform. HP has declared its HP-UX PA-RISC support will end in 2013 and is actively motivating its customer base on PA-RISC to migrate to HP Itanium. Informix has supported this migration. Besides the fact that Informix has had a long history of supporting the HP-UX operating system, there are specific features that will make this migration easier.
An Informix customer currently running on HP PA-RISC does not have to unload and reload the data when moving to HP-UX Itanium. It is simply a direct upgrade of the hardware. This will save you a tremendous amount of time/resource for migration.
If an Informix customer uses High Availability Data Replication (HDR), it can choose to have the primary on HP-UX PA-RISC and the secondary running on HP-UX Itanium. The reverse is also true. This provides the customer more options as it transitions to HP-UX Itanium while maintaining the ability for High Availability.
If you are an Oracle customer running on HP-UX Itanium today and wondering what you should do, there is a very simple choice. It will be more cost effective to switch data base than to switch hardware. You can switch to an Informix data base providing Enterprise class OLTP and Data Warehousing capabilities with a fraction of the data base administration needed, and at a much lower total cost. Best of all, you do not have to worry about the outcome the lawsuits between the two companies to decide what to do with your system.
This is a continuation of the entry last week where I summarized IDC’s paper on “The Third Generation of Database Technlogy: Vendors and Products That Are Shaking up the Market” by Carl Olofson. Those interested in Part I should refer to the previous entry on this topic.
The Third Generation
The economics of computing have changed in the 2000s such that multi-core processors are common. 4 and 8 way systems, each with dual core processors, are common for even moderate workloads. 64-bit technology is taking over for enterprise servers, especially for database servers, and memory is cheap. Even though disks are also cheaper and faster than ever before, reading and writing whole records on disk represent a drag on overall system performance. Clever adaptations to reduce I/O help, but don’t really eliminate the problem. Third generation DBMS products began to emerge in the late 1990’s and are now beginning to supplant second-generation products in significant ways.
In-Memory Database Technology
Instead of conventional disk-based RDBMS, which involves mapping buffer contents to segments that, in turn, represent pages (disk blocks), with database keys that must be translated and mapped for the data to be located, in-memory databases simply manage blocks of main memory, using memory addresses as direct pointers to the data. This technique eliminates the I/O drag of disk-based data and also claims to reduce the instruction path length of a typical database operation by anywhere from 20 to 200 times.
An IMDB is not simply a shared memory cache, and its contents cannot be accessed directly by applications. Instead, applications access the data through the services of the DBMS as exposed in an API, which may be proprietary, or most commonly a relational convention such as SQL, ODBC, or JDBC. This ensures that not only the application can’t corrupt the database but the database contents are controlled by the IMDB, which uses various techniques to ensure a consistent view of data to each application, and, in most cases, manages multiple concurrent sessions.
A common misconception regarding IMDB is that it lacks the ACID properties of a transactional database. This is not true. Most IMDB implementations used for transaction processing still have transaction logs for error recovery, and can stream the logs to physically persistent storage such as SSM or spinning disk. A few provide full recoverability by nonlogging means. They also commonly replicate their memory contents to other servers, to provide high availability functionality through failover support.
Examples of DBMS products that include this technology are Oracle TimesTen, Sybase ASE 15.5 IMDB, ENEA’s Polyhedra, GeneroDB from Four Js, Xcelerix IMDB from Frontex, IBM’s SolidDB, eXtremeDB from McObject, and VoltDB.
DBMS servers are using peer-to-peer clusters to provide a hybrid of failover and scalability support, spreading workloads across many servers and exchanging small units of data at high rates of speed to execute database operations. This is an especially useful approach when the workload mainly consists in access to a large number of read-only tables, but a small number of updatable tables, and where the transactions do not require a high degree of data sharing for execution, such as classic data entry operations.
Examples of products that include this technology include ParAccel, VoltDB and (Informix Mach11).
Columnar Data Storage
Storing table rows as blocks with selected indexed columns is a spectacularly inefficient approach for databases that are commonly used for statistical analysis, such as data warehouses. This is because many data warehouse operations scan tables, performing aggregate operations on the data, usually in select columns, and because when they randomly select rows, it is normally for only a few columns in that row, and not for the whole row.
Columnar data storage involves storing tables as blocks by column rather than by row. This offers several advantages for analytic databases. One advantage is that operations on a column of data can be carried out with very few I/O operations. Another is that since the column contains all data of the same type, it can very easily compressed to a tiny fraction of its size by using indexing to eliminate duplicate values and then compressing the values themselves. Once that is done, any random select on the table will result in a very quick result because every column is, in effect, indexed. Finally, the index structures that define the columns can themselves be cross-indexed, further optimizing access.
Examples of products that incorporate this technology include Oracle Exadata Database Machine V2, Sybase IQ, Vertica, ParAccel, and VectorWise from Ingres.
Nonschematic DBMS may be one of the most controversial and least understood of the 3rd generation DBMS technologies, though it is based on an idea that is not all that new. It could be argued that every native XML DBMS is really a nonschematic DBMS, in that it dynamically builds index structures based on the tag structures found in the XML documents that are loaded into it, and uses those index structures to organize the data. (Kevin Brown and I have had several discussions with one of the vendors mentioned below to assess its suitability with IDS)
A nonschematic DBMS is one that stores data as a complex of key-value pairs, building cross-pair structures, such as table rows, based on the order inherent in the data presented to it The user can then use tools to discover structures inherent in the data, and derive a schema from those structures.
Nonschematic DBMS could be used as a tool in preparing data to be loaded into a data warehouse, but would make a poor platform for the warehouse itself. It is best used for managing semi-structured data, such as XML, and for aggregating large amounts of data from many sources for performing search and discovery operations. It seems to have a bright future for a variety of applications that may be offered as services on the Internet and in cloud environments.
Examples of products with this technology include Google’s BigTable, Infobionics Knowledge Server, and Amazon’s SimpleDB. Among general-purpose XML DBMS vendors, the pure-plays include Raining Data (TigerLogic XDMS), Software AG (webmethods Tamino), and Xpriori (XMS). A number of leading relational DBMS products also offer native XML support, including IBM DB2, Oracle, and Sybase ASE.
Cloud computing is based on the principles of utility computing, resource virtualization, and an approach to shared resource management, such as multitenancy, that enables a service to offer to users the illusion of limitlessly expandable abstracted resources (memory, processing power, etc.). To fit into such as framework, DBMS technology must also offer virtualization, horizontal scalability, and multitenancy.
These capabilities can be offered through server and storage resources that are tightly controlled, or through virtualization techniques that cooperates with other external resource management systems. A third approach is to encapsulate these capabilities within a physical environment that is directly controlled by the vendor, offering DBMS functionality as cloud services.
Only a few DBMS products feature dynamic, virtualized hardware and software resource management for scaling processing power or storage up or down without manual reconfiguration. Examples of products that explicitly use this technology include Amazon SimpleDB and Microsoft SQL Server Azure.
Part III of this entry will discuss some of the vendors mentioned so far and what specific products that fit into the categories discussed so far. Still further out, there will be discussion on what IBM and Informix are doing with these technologies.
IWA – A Year in Review
This week marks the 1-year anniversary for the release of the Informix Warehouse Accelerator (IWA). This blog entry takes a look at what we have accomplished and what we look forward to.
First of all, for those not yet familiar with the IWA product, it is an Accelerator designed specifically for data warehousing or analytics queries, that runs transparently behind an Informix Dynamic Server. Users submit their queries either through SQL directly or via their favorite BI tool, such as Cognos, Microstrategy or Pentaho. IDS then routes the query to the accelerator based on a defined data mart schema and returns the answer to the user as if the query was answered directly by IDS. The only difference is that the performance of such queries may be 10 times to 100 times faster than before. Queries that used to run in many minutes or even hours can be returned in mere seconds. This is all done without special tuning, adding of indexes or data cubes. It is ideally suited for Informix customers who may have standard OLTP transactions on IDS, while also needing to run analytic queries either on the same system or different systems. The Accelerator runs on commodity Intel hardware that has hundreds of gigs of DRAM at a very affordable price (e.g. less than $100K).
As the primary measure of “success” with any product, especially a new product within IBM/Information Management/Informix, the revenue received has been more than respectable, certainly measured in terms of millions. (I am not at liberty to reveal the exact numbers). We have a number of customers across different geographies and across different industries, including retail, insurance, hospitality, public sector and Energy & Utilities. More importantly, we have partners that have built their own offerings on top of the IWA product and have successfully closed business on their own. Informix has always been a partner-centric and partner-friendly product, and IWA is no different, especially for SMB customers where low cost is of utmost importance. Through a number of workshops and bootcamps around the world, we have pursued an aggressive schedule in enabling our partners and this will continue throughout 2012.
Everyone in the market these days seems to claim that they can do analytics fast. If you include Columnar database offerings, you have upwards of 10 to 20 products competing in this space. One that has received most publicity is HANA from SAP, which shares many of the features and characteristics of IWA. This is a good thing for us. By raising awareness of the capabilities of an in-memory database specifically designed for data warehousing in the industry, customers will further investigate the differences between IWA and HANA.
We already have competitive numbers (based on real customer PoC’s) against other vendors including Teradata, SQL Server and columnar database vendors such as Greenplum and Ingres VectorWise. I am happy to report that we stand up well against them both in terms of performance and in overall cost of ownership. So we welcome such comparisons.
Since our release a year ago, we have added some significant features. They are listed here with brief descriptions:
- Workload Analysis Tool
Takes the guesswork out of the data mart schema needed to deploy IWA on. You simply run your standard set of queries through the tool and it generates the schema in XML format which can then be used to load data into IWA.
- Data Currency
While an IDUG study shows that 90% of data warehouses are only updated once a day, customers who need more frequent refreshes of the entire warehouse can maintain a 2nd copy of data (in compressed form). By toggling between the two copies, one can achieve high data currency and still attain the great performance provided by IWA.
- Cluster Support
IWA can run on a standard SMP machine or on a cluster that is typical with a Blade Server configuration. Here, IWA has access to even more cores and memory across different nodes as opposed to a single node system. IWA efficiently partitions the fact table data while replicating the dimension tables for optimal performance. The number of coordinator and worker processes can be scaled up accordingly.
- Partition Refresh
In the upcoming release of 11.70xC5, we are adding two significant features to IWA. While the current version of IWA requires a complete reload of data when data is refreshed on the host (i.e. IDS), the partition refresh feature now allows users to selectively update one or more partitions. For example, if you have a terabyte size data warehouse already loaded in IWA and updates were made to 10 gigs of data in certain partitions, you can choose to only reload the 10 gigs of changed data. Thee load time for such a refresh is then greatly reduced. In other competitive systems, this is often referred to as Incremental Update or Trickle Feed.
- Loading IWA from Mach11 Secondaries
Taking advantage of the Mach11 capabilities in IDS, one can deploy IWA in a mixed-workload environment by dedicating the primary node for OLTP transactions and having a Mach11 secondary handle the analytic queries. The secondary can be a HDR, a RSS or a SDS node. This secondary node can now handle the loading of data into IWA and can also host the IDS optimizer which determines whether to route a particular query to IWA. The ability to run mixed-workload effectively in a single system is one of the “holy-grails” of database management.
Our goal is to continue to build a base around our Informix customers with analytics needs. We also want to reach beyond the current base into the so-called “white-space”. While there are other overlapping products within IBM’s Information Management portfolio, we believe that the state-of-the-art in-memory technology in IWA together with the strong OLTP capabilities within IDS, provides a unique offering at a price point that is second to none.
For more information, please send a comment to the blog or email me directly at firstname.lastname@example.org
In-Memory Databases Go Mainstream
While much have been written about the subject, including several blog entries I have written here, it takes a database market share leader like Oracle to make it mainstream.
In the super-hyped Oracle OpenWorld event in San Francisco this week, the headline news is the announcement of Oracle 12c and the In-memory option. Several articles have already been published in the press, including:
1) Oracle Open World 2013: Oracle launches in-memory database search
2) Oracle's Ellison promises 'ungodly' database speed with new in-memory option
3) Oracle's Ellison Tries To Outmaneuver SAP Hana
Based solely on publicly available information from these articles, I will point out a few similarities as well as differences from our own offering, namely the Informix Warehouse Accelerator.
Oracle claims its in-memory option requires no SQL changes and no application changes.
100x performance improvement
Runs on both SMP and Cluster
Dual storage approach whereby data is saved in both rows and columns
With the Informix/IWA combination, we store OLTP in its original row format in the Informix database, while in-memory processing is done using a columnar approach. Difference is that IWA never stores the columnar data on disk. HANA (from SAP), in contrast, uses only columnar storage for both OLTP/OLAP processing.
IWA has shipped its product over 2 ½ years ago. We have customers in production today using the technology.
While Oracle has just announced this in-memory option, it has published no Beta or production deployment data.
IWA runs on commodity Intel hardware, for both SMP and Cluster, costing orders of magnitude less than proprietary hardware
Oracle also announced the Big Memory Machine costing some $3 million dollars. It is based on the Sun Sparc M6 chip. This does not include software licensing and maintenance cost.
IWA is integrated with the Informix Flexible Grid architecture whereby query acceleration can run off its primary or secondary servers with no impact on OLTP processing.
IWA is a query accelerator. This means that users can choose to accelerate portions of the database where OLAP performance is most critical. Only parts of the entire database may be impacted.
Regardless of tradeoffs between different approaches, the most important point is that customers should feel that in-memory database is no longer an esoteric technology to watch out for. It has arrived and it is now mainstream. Informix has a strong offering and it has already proven itself with customers.
Carlton Doe offers an updated 1-day Proof of Technology (PoT) on IWA.
Its objective is as follows:
In this Proof of Technology, participants will install and configure the accelerator server. They will learn how to analyze queries to determine what tables and columns should be accelerated for greater performance. Finally, using a combination of graphical and command line utilities, they will create, deploy and load marts of accelerated data. They will execute queries against both non accelerated and accelerated data to see the dramatic and exponentially better response times Informix Warehouse Accelerator provides.
For information about this PoT, contact Carlton at 972-561-6103 or email@example.com.
There are numerous Informix events around the world where IWA is discussed. An upcoming webcast (Oct 1) by one of our partners, Advanced Datatools can be found at:
Modified on by fredho66
Query Acceleration for Business using Informix Warehouse Accelerator
A draft IBM Redbooks publication (Planned Published date of November 2013)
IBM Informix Warehouse Accelerator (IWA) is a state-of-the-art in-memory database designed to exploit affordable innovations in memory and processor technology and trends in novel ways to boost query performance. It is a disruptive technology that changes how organizations provide analytics to its operational and historical data. Informix Warehouse Accelerator leverages columnar, in-memory approach to accelerate even the most complex warehouse and operational queries without application changes or tuning.
This book provides a comprehensive look at the technology and architecture behind the system. It contains information about the tools, data synchronization, and query processing capabilities of Informix Warehouse Accelerator, and steps to implement data analysis by using Informix Warehouse Accelerator within an organization.
Table of contents
Chapter 1. Introduction to Informix Warehouse Accelerator
Chapter 2. Designing data marts
Chapter 3. Designing and sizing an Informix Warehouse Accelerator environment
Chapter 4. Informix Warehouse Accelerator installation and configuration
Chapter 5. Creating Informix Warehouse Accelerator data marts
Chapter 6. Query execution
Chapter 7. Managing Informix Warehouse Accelerator data mart
Chapter 8. Informix Warehouse Accelerator server administration
Chapter 9. Use of Cognos with Informix Warehouse Accelerator
Chapter 10. Proof of concept for Informix Warehouse Accelerator
Appendix A. Tools for Informix Warehouse Accelerator
IBM has obtained distribution rights to the following white paper published by Bloor Research (author: Philip Howard) titled "IBM Informix in hybrid workload environments". It discusses key features such as Flexible Grid, Time-Series support and Informix Warehouse Accelerator which makes Informix ideal for Analytic applications.
There was a recent InformationWeek webcast sponsored by IBM on the topic of
Combining OLTP and Extreme Data Warehouse Performance in which Lester Knutsen of ADT and I discussed the Informix Warehouse Accelerator as part of the release for the Informix Ultimate Warehouse Edition. The webcast was attended by over 350 people and comments were overall quite positive. In this blog entry, I will attempt to answer the questions that were submitted. I consolidated some of the duplicate questions. Questions are answered in no particular order.
For a replay of the webcast, you can go to the following link:
For more information about the product, e.g. white paper and other collateral, you can use either of the following links:
Note that in the upcoming International Users’ Group conference held in Overland Park, Kansas, there will be additional dedicated sessions on this topic. For more information on IIUG Conference for 2011, please refer to the link:
Q: What is MOLAP/ROLAP/HOLAP?
A: MOLAP stands for Multi-dimensional OLAP, ROLAP stands for Relational OLAP, and HOLAP stands for Hybrid-OLAP.
MOLAP is the most common of the three in building multi-dimensional “cubes” which is essentially summary tables at the BI tools layer. Cognos uses a product called PowerPlay for its MOLAP product, and Hyperion has a similar thing. Once these cubes are built, queries can be answered very quickly if the cubes have the same dimensions used in the query. It can take considerable time to build the cubes, usually measured in hours. ROLAP essentially drills through to the backend relational database for the queries. And HOLAP, which is offered by products such as Microsoft SQL Server, does either or both, depending on the query. There are lots of materials written up on how this is done. With Informix IWA, we eliminate the need to build these cubes.
Q: Can IWA be also used to accelerate OLTP queries, and not only OLAP ones?
A: IWA's sweet spot, as mentioned in the presentation, is for OLAP queries against data warehouses where there is at least some form of Star Schema set up. That is, there needs to be a Fact Table with surrounding dimension tables. For OLTP queries, chances are that the transactions are short and would not need the type of acceleration provided by IWA, where it is scanning potentially millions and billions of records. However, if the OLTP schema can be defined in such a way that a data mart can be offloaded to IWA, then the queries will be accelerated. In some situations, e.g. operational data store, both OLTP and warehouse workloads share the same schema. In that case, IWA can be used to accelerate the OLAP queries while co-existing with OLTP. This is the basis for the mixed-workload I was referring to during the webcast.
Q: Is there any formula to determine how much memory you need for in memory database?
A: The amount of memory needed for IWA is dependent on the amount of compression that can be done on the data. Since we use columnar technology, data can be compressed quite well, as data in columns are usually quite similar. We use a rule of thumb of 3:1, i.e. 300 GB of table data means about 100 GB of main memory. There is no need to account for index or temp table space as IWA does not need them. Allow for additional memory if the same system is being used for IDS and other processing on the same system.
Lester's testing showed that he went from 110GB of raw data to 24GB in memory, which is much better than a ratio of 3:1.
Q: Can get special build to get along with IDS11.5 for iWA?
A: IWA is part of Informix Ultimate Warehouse edition and needs to run on 11.70xC2.
Q: If query using temp table, does IWA works in the same way (fast)?
A: Yes, for BI tools can use temp tables, queries that were run to retrieve the result onto the temp tables can be accelerated. The temp table processing itself is done on the IDS side, not in IWA.
Q: What is the largest size database you can have with the in memory db appliance? How do you handle high availability?
A: For the 1st release of IWA, the configuration in a single IBM x-Series machine, it has a maximum main memory capacity of 1.5 TB. With a 3:1 compression ratio, it would mean that the current maximum limit is about 4.5 TB of raw data. X-series is releasing a new system this year that will double that capacity. Note also that IWA can run on any Linux/Intel box, not necessarily from IBM. In an upcoming release, we will support running IWA on a Blade Server, thus allowing maximum scalability to multiple nodes (or blades).
In terms of high availability, the set of disks on IWA stores an in-memory image of what is in-memory for recovery purposes. This set of disk (much smaller than those that are stored on the Informix side) is only used to provide High Availability, not the actual database itself.
Q: How does the performance compare to Oracle Exadata appliances?
A: We have not done direct comparison to Exadata, but one of our large customer prospect is running a Proof of Concept (PoC) comparing the two. We will have much better data points once that are completed. We do believe that our numbers will be very competitive, and also at a much lower cost as compared to Exadata.
Q: IWA is based on MPP architecture; what is added at hardware level to make it MPP?
A: Actually IWA is not based on an MPP architecture. It is based on SMP. In fact, the recommended configuration for the 1st release is a single Intel box (with multiple cpus/cores). However, the original IWA design allows it to be deployed on a Blade Server with multiple nodes. We intend to support that configuration in an upcoming release. As I mentioned in the talk, we don't need extra hardware to do I/O for us, just the Intel box for which IWA resides. The fact that we use columnar technology and SIMD parallelism to scatter our "cells" to the different cores within the system provides us with a very distributed platform similar to what a MPP system would do. In a way, you can say that we combine the best of SMP and MPP onto a single system.
Q: Does IWA support locales other than default?
A: It is correct that as the 1st release, only the default locale is supported as indicated in our manual.
Only the en_us.8859-1 locale is supported.
We recognize this limitation and are working to add more locales for the upcoming fixpack. If there is interest for you company or an end customer, testing using the default locale will demonstrate the performance gain which can lead to successful implementation once we support the additional locales.
Q: Can you comment on costs? Of course it can vary a lot depending on what one want to do?, However can you comment on costs?
A: Cost is highly dependent on the amount of data that you plan to have in memory. As I mentioned during the talk, the rule of thumb is 3:1 so 300GB of raw data in your warehouse would require about 100GB of main memory. There is no need to account for index or temp space since IWA does not require them. The current eX5 system from IBM can contain up to 1.5TB of main memory at a very affordable price.
Q: What vendor/products is IBM positioning this Suite of products (IDS & IWA) against?
Q: How is IBM planning on Marketing/Selling this Suite of products (IDS & IWA) to the General public, the Non-Informix Community?
A: Actually, the two questions are sufficiently related that I think I can answer them together.
IWA is primarily targeted towards the Informix base. Obviously, the solution is generally applicable to non-Informix customers, but we believe there is sufficient need and interest within the customer base to generate sufficient demand. Our target is data warehouses that are less than 5 TB. As mentioned in the webcast, most analytics are done on warehouses of much smaller sizes regardless of how large the enterprise warehouse may be.
Q: Is there a problem\conflict if using CISCO\Riverbed (or any other) compression\acceleration with IWA
A: No, there is no conflict between Wide Area Network compression/acceleration with IWA. Once data comes over from the network to the IDS database, it is in uncompressed format. You may choose to apply IDS compression to store the data on IDS, but data in IWA uses completely different compression techniques and is not affected by other forms of compression.
Q: Is this solution is comparable to Oracle's Exadata?
A: I believe I addressed this somewhat during the call. The IWA solution from Informix is very much comparable to Oracle's Exadata approach, in that it tries to provide a substantial increase in performance for warehousing workloads. The architecture and approach is quite different because instead of utilizing another layer of hardware to perform I/O from the database, IWA eliminates I/O by using pure in-memory technologies, extreme compression and predicate evaluation with compressed data. This is how we can achieve an order of magnitude speedup at a fraction of the cost/admin required by standard relational systems.
Q: The data mart will still exist in the database and then loaded into memory?
A: Yes, database/datawarehouse resides on IDS, the data to be accelerated is loaded by IWA (by IWA). Then all subsequent queries that can be accelerated will be routed automatically to IWA and the answers will be returned to IDS. Applications that access the IDS database, e.g. for OLTP, will continue unchanged. Only the affected data mart (with a defined Fact Table) will be re-routed to IWA.
Q: Is the data loaded into IWA or meta-data gets loaded into IWA? If data is being loaded then is more storage needed?
A: Both IDS and IWA contain metadata for the "data mart" that's defined to be part of IWA. The data warehouse is stored on the IDS side, but not on IWA. It is only loaded into memory. So there is no requirement for additional storage on IWA. The disks shown on IWA are only for recovery purposes in case there is a failure to the IWA box. It is a memory image of what is in IWA.
Q: Putting entirely in memory? And what happens when there's a machine failure? No threads that flush to disk and persistence? Not smart.
Q: You said it was an 'all in memory only' solution, if true, what do you handle system failure, and disaster recovery planning.
A: These first two questions are referring to the same thing. The disks attached to IWA (not the ones on the IDS side that contains the database) are used for machine failure. It contains a memory image of what was loaded in memory. The amount of disk needed on IWA is limited to size of main memory configured, usually much smaller than the data base itself.
Q: So if I understand the diagram... The Warehouse accelerator is a separate box. All in memory solution. The timing listed in Fred's slide show time for queries. But how/when is the data loaded in to the box and what's the time to build the 'In Memory' warehouse that the query is run against?
A: The Beta customer (large shoe retailer) loaded 1 billion records of the Fact Table and two dimension tables (one with 3 million and another with about 330K rows) in about 30 minutes. The customer's comment was that since it used to take him 45 minutes to run a report, he gets his ROI within 2 queries. There are configuration parameters that will help with optimizing the load time.
Q: 512GB of memory? And what's the cost? What's reasonable? I thought the solution was all 'in memory' so why the 6 300GB 'fast' disk?
Q: Again, what's reasonable with respect to price.
A: Answer to both questions - for the recommended configuration using a System X server, the cost of the server including 512GB is around $50K to $60K, for which I believe is a very reasonable price. The software licenses for IWA depends on whether you are coming from the Informix Ultimate Edition ($160/PVU) or if you are buying a new license ($463/PVU). The 6 300GB disk is for recovery purposes as mentioned above.
Q: Will there be any Sessions, Tutorials or Labs for IWA and DW/BI offerings at the upcoming Informix conference in May?
A: There are several dedicated sessions for IWA at IIUG in May. We are also planning a live demo for accessing IWA in the Cloud. There will also be business partner sessions on this topic.
In this final part of Carl Olofson’s paper on “The Third Generation of Database Technology”, I am summarizing the key points in the remainder of his paper, taking liberties to remove much of the detailed descriptions of each vendor, and jumping to his thoughts on Market Strategies, Technology Assessment and Essential Guidance.
It should be apparent to the reader by now that I chose to spend the time on this paper (versus many other papers available on the subject) because it validates a lot of the work that we are currently pursuing within the Research and Development teams at IBM. In addition to the incremental functionalities in each new release, we have to pay particular attention to where the market is headed, both in terms of customer needs and in terms of technology advances. Many of the technologies discussed are already available in the IBM Information Management portfolio, as you will see them mentioned below, and others are being actively developed, though not ready for public announcement. The challenge will always be getting the multiple pieces properly integrated and getting them through the product release process.
Vendors and Products of the “3rd Generation Database Technology” (in Alphabetic Order)
Only the vendor and product names are listed here. You can find technologies associated with each vendor in Part II of my blog entry on this topic.
Amazon – SimpleDB
ENEA – Polyhedra
Four Js – Genero DB
Google – Bigtable
IBM – SolidDB, Streams
Infobionics – Knowledge Server
Ingres – VectorWise
McObject – eXtremeDB
Microsoft – SQL Azure
Oracle – TimesTen, Exadata Storage Server
ParAccel – PADB
RainStor – RainStor cell-based RDBMS
Sybase – ASE IMDB
Vertica – FlexStor
The author believes that fundamentally different approaches to data definition, storage, manipulation, and delivery is needed to provide for more, better, faster, cheaper database management. The desire to acquire and use business analytics in moment-by-moment decision making, accumulate and reconcile large amounts of enterprise data for reporting and analysis, streamline operations with better and more precise data collection and movement all result in demand for DBMS technologies that deliver orders of magnitude better performance, with much higher reliability, than what the leading conventional products can offer.
As stated in previous parts of this blog entry, the author believes that the DBMS industry has long been held back by the relational paradigm, which has proven to be both a blessing and a curse. The blessing is that its simplicity and broad applicability have enabled DBMS technology to become the standard way that business application is stored and managed, which brings order and manageability to both the data and the applications that use it. The curse is that relational databases are unable to hold semantic metadata or directly represent data organization concepts such as multidimensionality, containment, derivation, recursion, or collection. This limitation can force DBAs to store data that reflects such concepts in arcane combinations of cross-reference tables that require multiple complex joins to navigate, and to encapsulate the management of such table relationship combinations in program code or stored procedures both of which, in the absence of detailed documentation, tend to mask the actual nature of the way the data is logically organized. [Though readers familiar with IDS’ extensibility features will know that IDS has solved some of these problems for many years]
The use of Web services and the growth in event-driven architectures (EDAs) are diminishing the need for SQL support in favor of the publication of data services that can be triggered or invoked in a SOA or EDA context. Freeing database management from the limitations imposed by SQL, even at the interface level, can further accelerate the development of DBMS technology that is, from the current perspective, truly revolutionary.
Essential Guidance (Carl Olofson’s conclusions)
The DBMS world is dominated by second-generation disk-based RDBMS technologies today. The third-generation technologies will probably not displace the existing disk-based RDBMS products overnight. In fact, it is more likely that the leading RDBMS products will evolve to include some of these technologies, and the leading vendors may acquire third-generation DBMS companies to add to their data management portfolios.
Third-generation technologies will arise in a market that acknowledges that one size truly does not fit all, and that some existing RDBMS technologies, perhaps with third-generation enhancements, remain perfectly appropriate for a range of OLTP and data warehousing workloads. Other workloads, such as those that depend on the rapid capture and processing of streaming data, involve the collection and analysis of vast amounts of heterogeneous data, or support the highly variable demands of the public cloud, will require these emerging technologies, often deployed in specialized, workload-specific ways.
As a first time blogger, I'm wondering why anyone would want to see another blog. Even for the Informix loyalist, there are many other more knowledgeable and technically savvy people that have blogs on Informix products and other notable developments in the Informix world. So let me start by saying a few words about my past experiences. If that is in any way interesting, perhaps there are some common areas that warrant discussions.
I have been working in the computer industry and primarily in database development in one form or another for some undisclosed number of years. Places I've worked for include Burroughs (for which most people have never heard of), Sperry Univac, HP (for a very short time), Sun (yes, the Oracle one, also for a very short time), Tandem (14 years), Informix (Red Brick for about 3 years) before being acquired by IBM in 2001. Since then, I've managed teams on DWE, Cube Views, IDS Engineering, and in my current position, as manager of the Competitive Technologies and Enablement team.
My area of interest (not necessarily area of expertise) include DBMS, Fault-tolerant systems, Data Warehousing, Business Intelligence, Clusters and High Availability. Over the years, I have been fortunate to work with some real experts in the database area, including 3 of the original 7 authors of the System R paper, who all happenned to have started at IBM Research but were working at Tandem at the time. And at Red Brick, a company originally founded by Ralph Kimball, is where I really understood how data warehousing works.
Feel free to comment if you are interested in any of this. I will discuss more on what the Competitive Team works on in the next comment.
Gartner published a report of this title on January 16, 2014 where descriptions of different vendor offerings were provided. Besides IBM, the big players are SAP, Oracle, Microsoft and Teradata, but there are also a number of smaller vendors, such as Aerospike, Altibase, Kognito, PartStream and VoltDB.
Gartner considers In-Memory DBMS as one of the Top 10 technology trends impacting infrastructure in 2013 and beyond.
Gartner defines two types of IMDBMS:
Operational IMDBMS for managing transactions with both relational and NoSQL data.
Analytical IMDBMS include in-memory column-store DBMSs and in-memory MPP row-store-based technologies. It allows for processing without the need for prepared aggregates, summaries and cubes. It simplifies administration and is often combined with low-latency loading in the database to perform real-time analysis.
Furthermore, the promise of in-memory DBMSs is to enable hybrid transactional and analytical processing platform (HTAP) use cases without having to move the data from one to the other.
Under the section for IBM, it lists 3 products: solidDB, BLU Acceleration (on DB2) and Informix Warehouse Accelerator.
“Informix Warehouse Accelerator (IWA), released in March 2011, also leverages an in-memory, columnar data store to transparently accelerate analytical queries processed by Informix. It is a software appliance that is not tied to particular hardware. Subsequent releases of IWA since 2011 added support for clustering, and more recently, for accelerating transactional workloads including a Trickle-Feed feature that updates the transactional data for near real-time analysis”.
IWA is nearing its third anniversary at the end of March. I believe it is significant not only because of its product revenue success and customers gained across many industry sectors around the world, but also of its recognition by market research companies like Gartner and others. Judging from other vendor products published in this Gartner report, IWA is in good standing in comparison with a number of these products.
2013 was a good year for IWA. We are seeing adoption in Retail, Transportation, Health Care, Public Sector and Finance sectors, among others. We are also seeing "return" customers who are adding to their existing base. The underlying Intel x86 architecture is further expanding to accomodate more DRAM and more cores per socket, making the value proposition for IMDBMS even more attractive.
And as we embarked on 2014, we are reaching to enable an even higher variety of scenarios (and types of data) that solve real problems for our customers. Look for more blog entries on this topic in the coming months.
In building a set of messages towards the upcoming IBM Insight (formerly IOD) conference, this blog entry is being written as part of that effort (in collaboration with IBM Marketing):
The Internet of Things is changing our world with the pervasiveness of connected sensors and devices, expanding the volume of captured data exponentially. This new computing model requires us to rethink our data management strategies in several ways. Since data is captured from numerous devices across multiple data types, succeeding in the IoT space requires the ability to securely combine structured, unstructured, Time-Series, spatial and other sensor data into a single source of intelligence. We need to make sense of this huge influx of data by aggregating and analyzing data locally, at the gateway level, to use it in meaningful ways. This can only be achieved by using an intelligent database with reliability, flexibility, performance and simplicity to handle real-time and context computing on the edge. Embedding an enterprise-level database, like Informix, coupled with the power of a warehouse accelerator, can optimize business results with speed of thought insight.
There are many ways sensor data can be used. Data can be used for real-time analytics to gain intelligence and respond to events as they happen. You can use historical analysis to look through record history for trends, analyze collections of sensors for correlation and formulate hints and suggestions based on usage and patterns. As tremendous growth opportunities emerge, so do challenges. Sensors collect and forward data for a single measurement, but consumers will have many devices and want a consolidated view. These devices generate a huge amount of data and limited space will create network and latency concerns. In addition, the variety and volume of data makes it hard to locate data and perform analytics that join different kinds of data together. Databases solve these problems when data is organized locally in a compact form that is easy to search and use. Simple SQL and JSON application development interfaces provide a flexible schema without requiring upfront definitions of their types. IBM Informix hybrid capabilities provide management of SQL and NoSQL/JSON data seamlessly, in one database.
Warehouse accelerators improve the performance of traditional data warehousing queries. In the “sensor analytics zone” within IBM’s IoT architecture, IBM Informix Warehouse Accelerator speeds complex queries for relational and Time-Series data. You can scale on a cluster and provision both IDS and IWA in the cloud to analyze much more data from different sensors and gateways. IWA can perform analytics of stored sensor data, up to 1000x faster than Time-Series alone. Informix is leading the way with unique capabilities to harness IoT data at the edge and I’ll be talking more about them at Insight 2014.
First webcast is sponsored by InformationWeek titled "Now Everything You Have Ever Wanted in Business Intelligence is Possible".
This webcast is presented by Craig Humphreys from Preferred Partners and I from IBM.
Date: Tuesday, February 28, 2012
Time: 9:00 AM PT / 12:00 PM ET
Duration: 60 minutes
The new Informix Warehouse Accelerator technology from IBM has completely transformed business intelligence. Now, incredible solutions are not only possible, they are affordable and easily achieved.
This webcast will present examples of real world solutions that could not exist without this software-only technology. You'll hear how Informix Warehouse Accelerator leverages a columnar approach to accelerate warehouse queries up to 100 times or more, including:
- How one European government agency reduced its complex query times from 90 minutes to just 67 seconds
- How one large U.S. retailer reduced query times from 20 minutes to just four seconds. That's over 300 times faster
- Design and deployment details will also be discussed, demonstrating just how simple it is to create these solutions
2nd Webcast is sponsored by Advanced Data Tools:
Advanced DataTools will demonstrate queries on IWA from a table with a billions rows.