IBM Business Analytics Proven Practices: Dynamic Cubes Hardware Sizing Recommendations

Product(s): IBM Cognos Dynamic Cubes; Area of Interest: Performance

This document discusses some of the sizing variables involved with Dynamic Cubes and provides high-level estimates to help guide the reader in selecting appropriate resources for their environment.

David Cushing, Software Engineer, IBM

David Cushing is a Software Engineer at IBM Cognos with over 24 years of experience. He has been responsible for the development of data access technologies in a range of products including Cognos Impromptu, Cognos ReportNet, Cognos 8 and Cognos 10, as well as IBM Cubing Services and most recently, IBM Cognos Dynamic Cubes.



Joseph Ng, Advisory Software Engineer, IBM

Joseph Ng is an Advisory Software Engineer at IBM Silicon Valley Laboratory. He has over 12 years of software development experience and has spent the last 8 years at IBM. He was a member of the development team for IBM Alphablox and InfoSphere Warehouse Cubing Services and is currently a developer for IBM Cognos Dynamic Cubes. He holds a Bachelor's degree in Computer Science from the University of California, Berkeley, US.



Igor Kozine, Software Engineer, IBM

Igor Kozine is a Software Engineer at IBM Cognos with nearly 10 years of experience and is currently the Performance and Scalability Verification Lead for the Dynamic Cubes team based in Ottawa, Canada. Since joining IBM in 2003, he has worked on a number of other IBM products and technologies, including Cognos Planning and Finance and data access, most recently focusing on Dynamic Query and Dynamic Cubes.



Dmitriy Beryoza, Software Engineer, IBM

Dmitriy Beryoza is a Software Engineer at IBM Canada with over 20 years of experience. He has been working on a variety of data access technologies including Dynamic Query and Dynamic Cubes.



Tom Jacopi (tjacopi@us.ibm.com), Replication Software Engineer, IBM

Tom Jacopi is a Software Engineer at IBM Cognos with over 28 years of experience. He has been responsible for the development of the in-memory aggregate feature of Cognos 10.2 as well as OmniFind Yahoo! Edition and IBM Data Replication products.



16 November 2012

Introduction

Purpose

The purpose of this document is to provide guidance for determining the minimum hardware requirements to provide an acceptable level of query performance from Dynamic Cubes. After a brief description of the Dynamic Cubes architecture, the document outlines the information that should be collected prior to determining the hardware requirements. It then provides a simple means of using this information to obtain the CPU core, memory and hard disk requirements for Dynamic Cubes.

This document also provides a detailed description of how hardware requirements can be more accurately computed for a particular application and provides additional background to assist in understanding the suggested guidance.

Applicability

This document applies only to IBM Cognos 10.2. It does not apply to previous versions because Dynamic Cubes is only available as of version 10.2, and later versions may have different hardware requirements. For further assistance or clarification, please contact your local IBM Services group to perform a proper requirements gathering exercise.


Overview Of The Dynamic Cubes Architecture

Dynamic Cubes exist within the Dynamic Query Mode (DQM) server. Only a single DQM server may exist under a single Dispatcher and it services queries from all of the Report Servers that reside under the Dispatcher. All of the cubes that are configured to run on a single Dispatcher must have all of the necessary hardware resources (cores and memory) available to the DQM server for the cubes to operate efficiently.

With Dynamic Cubes, the DQM server in essence becomes an in-memory repository of data. Because of the volumes of data that the DQM server is required to store and process, additional CPU cores and memory are required to support Dynamic Cubes over and above what is typically required for a DQM server. A DQM server will make use of available cores to improve Dynamic Cube performance and will make use of available memory, though there may be practical limits to the amount of memory that can be effectively used for Dynamic Cubes. This will be discussed later in this document.

It is very important to note that the sizing recommendations here do not account for the resources required for the Dispatcher or the accompanying Report Server processes.

There are three hardware resources which need to be sized for Dynamic Cubes - CPU cores, memory, and hard disk space. CPU cores are required for two purposes - to support concurrent execution of user queries and to allow Dynamic Cubes to perform certain internal operations in parallel. Memory is required to hold all of the dimensional members of a cube and to provide adequate space for the data retained in its various caches. Finally, hard disk space is required for the result set cache associated with each cube.


Collecting Information About Your Application

Prior to estimating the hardware requirements for Dynamic Cubes, it would be helpful to gather the following information (listed in order of importance) or at least obtain estimates for each item below. If you need to estimate, it is usually better to over-estimate - if you under-estimate, performance could suffer.

  • The total number of cubes being deployed. If possible, identify which cubes are virtual. If unsure, assume all the cubes are base cubes and none are virtual.
  • For each cube, the number of named users granted access to the cube.
  • For each cube, the number of members in total across the two largest dimensions.
  • The number of rows in the fact table.
  • The number of measures.
  • The number of dimensions.
  • If there are virtual cubes, determine if any of the cubes upon which they are based are directly accessible or not.

High Level Sizing Recommendations For Individual Cubes

If you need only to make a high level estimate of the hardware requirements for Dynamic Cubes or cannot perform more detailed estimates based on the information contained later in this document, estimates have been provided for four broad categories of cubes in the table below.

Table 1 - High level sizing recommendations
Configuration | Description | Number of members in 2 largest dimensions | Number of named users
Small | Development environment scale or small Line Of Business application. Small number of concurrent users and small data volume. | 600,000 | 100
Medium | Small to medium sized, enterprise-wide application. | 3,000,000 | 1000
Large | Large enterprise, divisional application, accessing large data volumes. | 15,000,000 | 5000
Extra large | Enterprise-wide user access, core application, accessing very large data volumes at day, consumer, product (SKU) level of data. | 30,000,000 | 10,000

Taking into account that the number of named users of an application and the amount of data within it are distinct from one another, Table 2 provides a quick reference for the small, medium, large, and extra large combinations of user and data volumes. For the purposes of the guidance provided here, the assumption is that a cube contains a total of 12 hierarchies.

Table 2 - CPU core, memory and disk space recommendations
Configuration | CPU cores | Memory | Disk
Small | 1 - 4 | 3 GB | 1 GB
Medium | 4 - 8 | 12 GB | 10 GB
Large | 8 - 16 | 100 GB | 50 GB
Extra Large | 16 - 32 | 300 GB | 100 GB

Figure 1 is a graphical representation of the data in Table 2.

Figure 1 – CPU core, memory and disk space chart using the information from Table 2

Detailed Sizing Recommendations For Dynamic Cubes

Number of CPU Cores

The number of users can be expressed in three ways:

  • Named users are all people who have access to the IBM Cognos 10 BI solution.
  • Active users are a subset of the Named user community who are logged on to the system and may or may not be interacting with the IBM Cognos 10 BI system at any given time.
  • Concurrent users are a subset of the Active user community who, at a given time, are either submitting a request or already have a request in progress.

For the purposes of this document, the assumed relationship is that 100 named users equate to 10 active users, which in turn equate to a single concurrent user. The notion of concurrent requests is discussed at the end of this document as a way of more accurately estimating hardware requirements for Dynamic Cubes.

As the number of users increases, so does the number of CPU cores required to process queries concurrently. Dynamic Cubes can take advantage of additional cores to perform internal operations in parallel, the result of which is that individual queries perform faster. The following table outlines the suggested number of cores based on the number of named users.

Table 3 - Number of CPU cores based on named users
Scenario | Number of named users | Number of CPU cores
Small | < 100 | 4
Medium | 100 - 1000 | 4 - 8
Large | 1000 - 5000 | 8 - 16
Extra Large | 5000 - 10000 | 16 - 32

When scaling beyond 10,000 users, the general rule is to add a minimum of 1 core for each additional 1,000 named users above the first 10,000 named users. For example, if there are 25,000 named users, this means 32 + 15 = 47, or approximately 48 cores. In general, the more cores on your machine the better - cores will not be wasted. Cores that are not used to execute queries concurrently will be used by Dynamic Cubes to perform internal operations in parallel, which improves individual query performance.
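
To make this rule concrete, the following Python sketch (a hypothetical helper, not part of the product) returns the upper bound of each range in Table 3 and applies the 1-core-per-1,000-users rule above 10,000 named users.

import math

def suggested_cpu_cores(named_users):
    """Illustrative sketch of the core guidance in Table 3.

    Returns the upper bound of the relevant range; above 10,000 named
    users, adds one core per additional 1,000 named users (rounded up).
    """
    if named_users < 100:
        return 4
    if named_users <= 1000:
        return 8
    if named_users <= 5000:
        return 16
    if named_users <= 10000:
        return 32
    return 32 + math.ceil((named_users - 10000) / 1000)

print(suggested_cpu_cores(25000))  # 47, rounded up to 48 cores in practice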

Figure 2 is a graphical representation of the data in Table 3.

Figure 2 - IBM Cognos 10.2 named user CPU core recommendations using the data from Table 3

One of the assumptions behind the recommendations above is that the volume of data processed by Dynamic Cubes itself remains relatively constant although the amount of data processed in the relational database may increase. If this assumption does not hold true, it may be prudent to add additional cores based on the size of the aggregate and data caches (see below), allotting an additional core for each 5 GB of memory of the combined size of both caches.

Memory Requirements

Dynamic Cubes uses memory for storing the members of all hierarchies, storing data in the data and aggregate caches, and providing temporary space for the execution of MDX queries within the DQM server. In order to compute the total amount of memory required for a cube, the following computation can be used:

<member cache size> + <data cache size> + <aggregate cache size>
  + <temporary query space> + <JVM adjustment>

Member Cache

Dynamic Cubes loads all of the members of a cube's hierarchies into memory when a cube is started. The intent of this is to provide fast access to member metadata, not just for populating the metadata browser in the various BI studios but also to provide the DQM server with fast access to members and member metadata during query planning and MDX query execution, and to the SQL generation component of Dynamic Cubes.

The estimated amount of memory required to store a member in a Dynamic Cube is 1440 bytes on a 64 bit Java Virtual Machine (JVM) and approximately 720 bytes on a 32 bit JVM (or a 64 bit JVM with compressed references). Compressed references in a 64 bit JVM are only applicable if the memory used does not exceed 32 GB. Since in most cases the largest two dimensions (hierarchies) dwarf the size of all other dimensions, the estimated size of the member cache is the sum of the number of members in all of the hierarchies of these two dimensions multiplied by 1440 bytes. This must be computed for each cube on a server whether it is a base or virtual cube. All hierarchies in the two largest dimensions which are of the same order of magnitude in size must be included in this computation as each hierarchy's members are stored separately from one another.

The 1440 bytes per member does not take into account the presence of additional member properties. Member properties are defined on a level-basis and consume approximately 2 bytes per character of each member property's value for each member. This should be taken into account when the largest dimensions contain additional properties. When calculating the amount of memory for the data cache, the member size of 1440 bytes should be used, regardless of the presence of member properties.

So, the following equation can be used to compute the amount of memory required for the member cache:

<# of members in the largest hierarchies> * 1440 bytes
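
To make the arithmetic concrete, the sketch below (a hypothetical Python helper) applies the 1440-byte figure for a 64-bit JVM and the approximately 2 bytes per character of member property values described above.

def member_cache_bytes(members_in_two_largest_dims,
                       avg_property_chars_per_member=0,
                       bytes_per_member=1440):
    """Estimate member cache size for one cube.

    members_in_two_largest_dims: total members across all hierarchies of
        the two largest dimensions (hierarchies of similar size included).
    avg_property_chars_per_member: characters of additional member property
        values per member, at roughly 2 bytes per character.
    bytes_per_member: 1440 on a 64-bit JVM, ~720 with compressed references.
    """
    per_member = bytes_per_member + 2 * avg_property_chars_per_member
    return members_in_two_largest_dims * per_member

# Example: 3,000,000 members across the two largest dimensions
print(member_cache_bytes(3_000_000) / 1e9)  # ~4.3 GB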

Overview of the Data Cache

When estimating the amount of memory required for the data cache, what is most important is the number of possible data points in a Dynamic Cube's dimensional space and the manner in which those data points are accessed in reports and analyses.

Of all of the possible data points in a cube, it is expected that users as a whole will access a small common subset, and individuals will use smaller portions of the cube that are pertinent to their individual data exploration. This is important because a Dynamic Cube’s data cache only loads data on demand, and all detail facts, unless explicitly required by a query, remain in the relational database.

For example, if a fact table contained 10 billion rows of fact data for 10 years of data, and a query requests the annual sales totals for each year for 25 sales districts, a Dynamic Cube only stores 250 data points in its data cache, even though the database was required to read all data in the fact table.

It is typically only when queries filter large sets of members by their associated measure values that a large number of values are stored in the data cache.

From a reporting/analysis point of view, avoiding filters that span the entire leaf level of a large dimension can help reduce the required size of the data cache – restricting such filter expressions to the descendants of a member at a higher level can make a significant impact (for example, filtering customers within a region as opposed to all customers of an entire organization).

It is also important to keep in mind that the data cache has a fixed size and will evict earlier query results if it approaches its maximum size. Dynamic Cubes use a heuristic to determine which query results it should evict from its cache. If data that was once retrieved is evicted, a subsequent request will cause the Dynamic Cube to re-retrieve the data from the underlying database.

Note that over the course of a business day, users may explore data, most likely making use of data previously retrieved by other users running the same or related reports. During that time, some data will be viewed a number of times and then not examined again – it becomes stale and is an ideal candidate for eviction. The removal of these query results from the data cache does not impact performance, but does allow the data cache to provide space for subsequent queries retrieving previously unviewed data.

Computing Memory Requirements for the Data Cache

The size of the data cache depends on the number of cells required by reports, analyses, and dashboards, as well as the presence of an aggregate cache. Since it is difficult to forecast report/analysis behavior, the data cache size is estimated based on the size of the dimensional space, which is an estimate based on the size of the member cache. As the size of the dimensional space grows, so will the size of the data cache.

Note that each cell stored in a Dynamic Cube consumes 80 bytes in a 64-bit JVM. As a result, each gigabyte of the Dynamic Cube's data cache will store approximately 11 to 12 million values.

In the calculation below, note that the presence of an aggregate cache reduces the amount of space required for the data cache (it needs to contain fewer values). The aggregate cache will be discussed in the next section. As well, each user is likely to retrieve some data that is specific to the report they are executing, even when running the same report as other users (for example, with different prompt values). This per-user data is accounted for by a user factor:

user factor = (# of named users * 200K)

This works out to approximately 1 GB of memory for every 5,000 named users.

If an aggregate cache is present then the minimum data cache size is:

The greater of (10% of the member cache size + user factor) or
((20% of the member cache size + user factor) - (size of the aggregate cache))

If an aggregate cache isn't present then the minimum data cache size is:

(20% of the member cache size) + user factor

Note that this is a minimum requirement - allotting more memory for the data cache will improve performance as the system gets used. If you want to widen the area of analysis by loading more leaf level data, you should consider increasing the data cache and scheduling cache-priming jobs on cube start.
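
The rules above can be expressed as a small Python sketch (a hypothetical helper; the 200 KB-per-named-user figure is the user factor described earlier):

def min_data_cache_bytes(member_cache, named_users, aggregate_cache=None):
    """Minimum data cache size, following the rules above (illustrative).

    The user factor is 200 KB per named user. With an aggregate cache, the
    minimum is the greater of (10% of member cache + user factor) and
    (20% of member cache + user factor - aggregate cache size); without
    one, it is 20% of the member cache plus the user factor.
    """
    user_factor = named_users * 200 * 1024
    if aggregate_cache is None:
        return 0.20 * member_cache + user_factor
    return max(0.10 * member_cache + user_factor,
               0.20 * member_cache + user_factor - aggregate_cache)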

Computing the Size of the Aggregate Cache

The aggregate cache contains a collection of measure values at the intersection of different combinations of levels from various hierarchies in a cube. The aggregate cache can satisfy many data requests without the need to retrieve values from the database which reduces the amount of data that needs to be stored in the data cache.

This is especially true since the aggregate cache's contents are based on user workload - the contents are tuned to maximize the use of the data in the aggregate cache. As a result, when an aggregate cache is present, it can make sense to reduce the size of the data cache.

The contents of the aggregate cache are intended to provide quick access to values computed at higher levels of aggregation in a cube. Though the number of values can grow as a cube's overall size increases, the amount of data required in the aggregate cache is not necessarily linearly correlated to the increase in the cube's size. As a result, a sliding scale is used to compute the minimum size of the aggregate cache relative to the size of the member cache.

The aggregate cache is a case where "more is not necessarily better". Defining more aggregates means more objects in the system as well as additional work to load them. Once you have determined the amount of memory needed to hold the aggregates required by your workload, be wary of exceeding that amount even if the limits above would allow more.

It is recommended that the aggregate cache of a cube not exceed 30 GB. Table 4 provides estimates for the size of the aggregate cache based on a percentage of the member cache size.

Table 4 - Aggregate cache estimates based on member cache
Configuration | Aggregate cache size as % of member cache
Small | 60%
Medium | 50%
Large | 25%
Extra Large | 25%
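
Table 4's sliding scale, together with the 30 GB recommendation above, can be sketched as follows (hypothetical Python helper and names):

# Aggregate cache as a percentage of the member cache, per Table 4.
AGGREGATE_CACHE_PCT = {
    "small": 0.60,
    "medium": 0.50,
    "large": 0.25,
    "extra large": 0.25,
}

def aggregate_cache_bytes(member_cache, configuration):
    """Estimate the aggregate cache size, capped at the recommended 30 GB."""
    pct = AGGREGATE_CACHE_PCT[configuration.lower()]
    return min(pct * member_cache, 30 * 10**9)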

Temporary Query Space

Each concurrent query executed against a Dynamic Cube requires space for the construction of intermediate result sets for use within the DQM MDX engine. In general, there is some overhead per query that needs to be accounted for, as well as space allotted for one or more queries which are atypical and may retrieve large volumes of data that require large amounts of memory to transfer the values within the engine.

On a server which hosts multiple cubes, the assumption is that a single user with access to multiple cubes executes a query on only one of those cubes at a time. Consequently, the amount of memory required for temporary query space is computed for each logical group of concurrent users and the cubes to which they have access. These individual values are added together to obtain the amount of memory required on the server to support concurrent queries across all the groups of users and the cubes to which they have access.

The calculation to compute the memory required for query processing for a group of users and the cubes to which they have access is as follows:

Concurrent query usage size = [size of max query memory usage]
  + [average memory usage per query] * (# of concurrent users - 1)

Note that the average memory usage per query value does not include the size of the maximum query.

The average memory usage per query is approximately 5% of the size of the largest member cache amongst the cubes to which the group of users have access. The size of max query memory usage is approximately 450% of the size of this member cache.
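
The calculation can be sketched as follows (a hypothetical Python helper; the 100:10:1 named/active/concurrent user ratio from earlier in the document is one way to derive the concurrent user count):

def temp_query_space_bytes(largest_member_cache, concurrent_users):
    """Temporary query memory for one group of users (illustrative sketch).

    Average query memory ~ 5% of the largest member cache the group accesses;
    maximum query memory ~ 450% of that member cache.
    """
    avg_query = 0.05 * largest_member_cache
    max_query = 4.50 * largest_member_cache
    return max_query + avg_query * (concurrent_users - 1)

# Example: a cube with a 4 GB member cache and 1,000 named users
# (roughly 10 concurrent users under the 100:1 assumption)
print(temp_query_space_bytes(4 * 10**9, 10) / 1e9)  # 4.5*4 + 0.2*9 = 19.8 GB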

JVM Garbage Collection Adjustment

With Java 6, the garbage collection (GC) cycle can spend considerable time analyzing the heap to identify the garbage that can be discarded by the JVM. In order to reduce both the frequency of GC cycles and the time spent performing them, it is recommended that at least an additional 10% of the calculated memory requirements for a cube be available in memory and assigned to the DQM JVM. This additional memory should not be required if you are making use of the 'balanced' GC mode available with the Java Runtime Environment (JRE) that ships with WebSphere 8.

Miscellaneous Adjustments to the Memory Requirements

If the member cache will be refreshed while a cube is available, then it is necessary to double the amount of memory required for the member cache because Dynamic Cubes will build the new member cache in the background while the current member cache is being used; for a short time both member caches are present in memory.

If a cube exists solely for use within a virtual cube and it is not accessible from any reporting packages, it may not be necessary to assign a data cache to a cube since data will be cached in the virtual cube which is being accessed directly by users. It is also not necessary to account for the user query space in intermediate cubes since they do not involve the MDX engine.

If a virtual cube is built to combine historic and recent data into a single cube, it may make sense to assign a data cache to the base cube as this will ensure fast query performance when the recent data cube is updated.


Disk Space Recommendations For Dynamic Cubes

The result set cache retains a copy of the result of MDX queries executed against a Dynamic Cube on disk in a binary format. Entries in the cache are identified by the MDX query and the combination of security views of the user who executed the query. This information is stored in a small, in-memory index that is used to quickly search for cache entries. The result set of a query is only added to the cache if the query exceeds a predefined, minimum execution time, to ensure the cache is not populated with the results of queries that already execute quickly and require no additional caching. This threshold is configurable in the IBM Cognos Administration console on a per-cube basis.

The benefits of the result set cache are most obvious when a group of users are executing a common set of managed reports that have a limited number of prompts. The more opportunities there are for differences in a report's specification, the less often the result set cache will be used. Ad hoc analysis is unlikely to make much use of the result set cache, except in cases where a user drills up, in which case a previous analysis is re-executed and can make use of the result set cache. Consequently, the retention of an MDX result set in the cache does not necessarily benefit subsequent users, simply because there are limited chances that the exact same query will be executed in the future.

The size of the result set cache is tied to the number of active users - the more users, the larger the cache needs to be to retain MDX query result sets. Result sets are typically very small, approximately 10 KB to 20 KB in size. The average size is conservatively estimated at 50 KB per result set, and each user is allotted 200 entries in the cache, which equates to 10 MB of disk space per active user (or 100K per named user). Table 5 provides a quick reference for disk space estimates.

Table 5 - Disk space estimates
Configuration | Estimated disk space requirement
Small | 1 GB
Medium | 10 GB
Large | 50 GB
Extra Large | 100 GB
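
The per-user arithmetic described above can be sketched as follows (a hypothetical Python helper using the 50 KB average result set size and 200 cache entries per user from the text):

def result_set_cache_disk_bytes(active_users,
                                entries_per_user=200,
                                avg_result_set_bytes=50_000):
    """Disk space for the result set cache (illustrative sketch)."""
    return active_users * entries_per_user * avg_result_set_bytes

# 200 entries * 50 KB = 10 MB of disk per active user
print(result_set_cache_disk_bytes(1) / 1e6)  # 10.0 MB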

A result set cache is only required for cubes which are directly accessible by users, i.e., there is at least one published package which includes the cube as a data source.


Recommended Hardware Sizing For Various Scenarios

Table 6 provides the details behind the initial, high level sizing table earlier in this document. The calculations used to compute the values in the table are those described above. This is strictly meant as an example of how these values can be computed.

The Final Base JVM Heap Size column represents the amount of memory required for a Dynamic Cube, including the JVM adjustment discussed above.

As discussed earlier, the JVM Heap (member cache refresh support) column represents the amount of memory required for a Dynamic Cube for which member cache refresh will be performed. This value is computed as:

Final Base JVM Heap Size + Member Cache Size

Also, the JVM Heap (intermediate cube, no direct access, no member cache refresh) column denotes the amount of memory required for a cube that has no direct access and is used as the basis of a virtual cube. The value is computed as:

Final Base JVM Heap Size - Data Cache Size (base) - Data Cache Size (user adjustment)
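
The three heap figures can be combined into a single sketch that follows the formulas stated in this section (a hypothetical Python helper; the rounded values in Table 6 may differ slightly):

def jvm_heap_estimates(member_cache, data_cache_base, data_cache_user_adj,
                       aggregate_cache, temp_query_space,
                       gc_adjustment_pct=0.10):
    """Compute the three JVM heap figures described above (all values in bytes).

    - final base heap: sum of all components plus the ~10% GC adjustment
    - member cache refresh: final base heap plus a second member cache
    - intermediate cube: final base heap minus both data cache components
    """
    base = (member_cache + data_cache_base + data_cache_user_adj
            + aggregate_cache + temp_query_space)
    final_base = base * (1 + gc_adjustment_pct)
    with_member_refresh = final_base + member_cache
    intermediate_cube = final_base - data_cache_base - data_cache_user_adj
    return final_base, with_member_refresh, intermediate_cube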

The table below contains the values discussed earlier in this document as a single source reference.

Table 6 - All Sizing Estimates from Document
Category | Small | Medium | Large | X Large
Number of Named Users | 100 | 1000 | 5000 | 10000
Number of Members in 2 Largest Dimensions | 600,000 | 3,000,000 | 15,000,000 | 30,000,000
Member Cache Size | 840 MB | 4.22 GB | 21.1 GB | 42.2 GB
Data Cache Size (base) | 84 MB | 420 MB | 2.11 GB | 4.22 GB
Data Cache Size (user adjustment) | 20 MB | 200 MB | 1 GB | 2 GB
Aggregate Cache Size | 500 MB | 2.10 GB | 5.3 GB | 10.6 GB
Average Query Memory per User | 42 MB | 210 MB | 1 GB | 2.1 GB
Peak Query Memory | 420 MB | 2.1 GB | 11 GB | 21.1 GB
Concurrent User Query Memory | 462 MB | 4.2 GB | 61 GB | 231 GB
Base JVM Heap | 1.91 GB | 11.1 GB | 90.5 GB | 290 GB
JVM Heap Adjustment (GC avoidance) | 191 MB | 1.1 GB | 9.1 GB | 29 GB
Final Base JVM Heap Size | 2.1 GB | 12.2 GB | 99.6 GB | 319 GB
JVM Heap (member cache refresh support) | 3.02 GB | 16.9 GB | 123 GB | 365 GB
JVM Heap (intermediate cube, no direct access, no member refresh) | 1.52 GB | 9.22 GB | 84 GB | 289 GB
Hard Disk Space | 1 GB | 10 GB | 50 GB | 100 GB

Appendix A

For more information on Dynamic Cubes see the IBM Cognos Dynamic Cubes Redbook at http://www.redbooks.ibm.com/abstracts/sg248064.html.
