Accelerating SAP CO-PA with DB2 for Linux, UNIX, and Windows
Use DB2 intrapartition parallelism for reporting queries
Traditional online transaction processing (OLTP) workloads are characterized by large numbers of concurrently running queries. Queries that execute independently of one another can be evaluated in parallel by the database server. As a result, large, multicore systems are used to capacity even though each individual query is processed by a single core at a time.
Online analytical processing (OLAP) workloads, on the other hand, are typically characterized by small numbers of parallel-running queries. They read and aggregate a large amount of data and model parts of the application logic. As a result, OLAP queries are more complex and take a long time to run. The processing capabilities of a single CPU core are limited. So, to speed up OLAP queries, you need to process them in parallel to use all the cores of a large, multicore system. This parallel processing is either data-oriented, by using the database partitioning feature (DPF), or query-oriented, by using intrapartition parallelism. Because it's straightforward and easy to maintain, intrapartition parallelism is the best choice for enabling parallel query processing with IBM® DB2®.
SAP® Profitability Analysis (CO-PA) reports, which are primarily used for drill-down analysis, fetch a significant number of records and aggregate them to a small result set. Traditionally, those results are calculated in advance and stored on the SAP application server to give users acceptable response times. Intrapartition parallelism can speed up the query processing significantly, which enables the application to skip the pre-calculation step and to read the data for each navigation step directly from the database. This eliminates the functional restrictions that are typically in place because of resource limitations on the application server and reduces the number of required summarization levels. For an example report, a speedup of eight times or more was achieved.
Data access methods in SAP CO-PA
In SAP CO-PA reports, all data is initially presented at a coarse level of detail. You can refine your query to get a more detailed view on a certain aspect. SAP CO-PA can be configured to load the data all at once during the program start or at each step as you walk through the drill-down steps.
If you request all the data to load during the program start, the data that is read from the database has a fine granularity. If read on demand, the data has the granularity that is needed for the corresponding step. Both methods have advantages and both can profit from intrapartition parallelism.
Requesting data only one time, when the report is loaded, has several drawbacks:
- The initial report might take a long time to load.
- The memory consumption on the application server is high during the entire runtime of the report.
- The amount of data that can be evaluated by a report is limited by the memory available to a single process on the application server.
Loading the data on demand does not require the data to be kept on the application server, which results in lower memory consumption. Additionally, the report starts faster because fewer rows have to be transferred from the database to the application server. On the other hand, this access method places higher demands on query performance: when dealing with huge amounts of data, an acceptable response time can only be achieved with intrapartition parallelism.
To reduce the memory consumption and speed the query processing, SAP CO-PA uses pre-calculated summarization levels. Those summarization levels are physically persistent aggregates at different levels of detail. To fetch as little data as possible, SAP CO-PA uses the most coarse summarization level that matches the requirements of a report. Those summarization levels can be defined and used independently from the SAP CO-PA reports to access all SAP CO-PA data. Also, you can use report-specific summarization data, which contains only the data for a certain SAP CO-PA report.
Reducing the number of summarization levels eliminates redundant information, saves space on the application server, and reduces the time and resource consumption that is required to build the summarization levels.
Reading data on demand eliminates memory limitations on the application server. With faster query processing, fewer summarization levels are required to meet the runtime requirements and creating them is faster as well.
The following example demonstrates how to use intrapartition parallelism to improve the performance of SAP CO-PA queries. To show the workload characteristics of SAP CO-PA, I describe the report configuration and explain the data flow and structure of SAP CO-PA. I also discuss the properties of the data used in this scenario. In addition, I review how to determine the best summarization levels and how to enable intrapartition parallelism.
Reports are configured by using the KE30 transaction. To see the performance settings, select Edit Report > Options > Performance > Presummarized data. From there, you need to choose one of the three data access methods:
- Use summarization data: Use report-specific, dynamically created summarization data (frozen report data).
- Use a summarization level: Read system-wide, defined summarization levels.
- Read upon each navigation step: Read data as required. This option uses summarization levels as well.
The preferred option is "Read upon each navigation step". This data-access method enables on-demand reporting and eliminates the need for long-running calculations to prepare aggregates and the restrictions regarding the memory consumption on the application server.
All data that is collected by SAP CO-PA is stored separately by operating concern; operating concerns correspond to the organizational units of a company. Each operating concern holds its data in four tables that are named according to the schema CEnxxxx, where n is a number from one to four and xxxx is the name of the particular operating concern. For the sake of clarity, they are referenced as CE1 to CE4 in the following example.
Line items are written to the tables CE1 (actual data) and CE2 (plan data). This data is merged and pre-aggregated, and the result is split and written into the segment level table CE3 and the segment table CE4. Key figures and the segment-specific characteristics record type, planned/actual indicator, version, and period are written to CE3. The remaining characteristics are written to CE4. Those two tables are the main data source for reports and for the creation of summarization levels. Usually, several line items share the same combination of characteristics, so the data can be condensed again. In the example scenario, this results in a cardinality of 62,509,896 rows for CE3 and 24,603,426 rows for CE4.
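The split described above can be pictured with a simplified, hypothetical schema. The real generated tables contain far more columns; the column names below (PAOBJNR as the segment number, plus a few example characteristics and one key figure) are only illustrative:

```sql
-- Illustrative sketch of the two segment tables; PAOBJNR models the
-- profitability segment number that links them (n:1 from CE3 to CE4).
CREATE TABLE CE4xxxx (            -- segment table: one row per characteristic combination
  paobjnr   INTEGER NOT NULL,     -- profitability segment number (primary key)
  kndnr     CHAR(10),             -- example characteristic: customer
  artnr     CHAR(18)              -- example characteristic: product
);

CREATE TABLE CE3xxxx (            -- segment level: key figures per segment and period
  paobjnr   INTEGER NOT NULL,     -- references a CE4xxxx row
  vrgar     CHAR(1),              -- record type
  plikz     CHAR(1),              -- planned/actual indicator
  versi     CHAR(3),              -- version
  perio     CHAR(7),              -- period
  revenue   DECIMAL(15,2)         -- example key figure
);
```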
A report provides a view of the data that is created by defining certain
predicates and drill-down possibilities. Predicates reduce the scope of
the report and can be defined by any characteristic. A drill-down is
defined by a sequence of characteristics. Initially, data is aggregated by
the first drill-down characteristic c0. This coarse
overview is presented to the user. When an entry is chosen, its value
v0 is added to the list of predicates and the next
characteristic c1 is added to the
GROUP BY clause of the corresponding SQL statement.
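The drill-down pattern can be sketched with hypothetical SQL. The statements below are not the actual generated CO-PA queries; the column names c0, c1, revenue, and the join key paobjnr are illustrative:

```sql
-- Step 0: aggregate by the first drill-down characteristic c0.
SELECT ce4.c0, SUM(ce3.revenue)
  FROM CE4xxxx ce4
  JOIN CE3xxxx ce3 ON ce3.paobjnr = ce4.paobjnr
 GROUP BY ce4.c0;

-- Step 1: the chosen value v0 becomes a predicate, and the next
-- characteristic c1 joins the GROUP BY clause.
SELECT ce4.c0, ce4.c1, SUM(ce3.revenue)
  FROM CE4xxxx ce4
  JOIN CE3xxxx ce3 ON ce3.paobjnr = ce4.paobjnr
 WHERE ce4.c0 = :v0                -- value chosen in the previous step
 GROUP BY ce4.c0, ce4.c1;
```

Each navigation step therefore narrows the qualifying CE4 rows while refining the grouping.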
As all drill-down characteristics are stored in CE4, the number of rows fetched from CE4 is reduced with each step, while the number of rows fetched from CE3 remains constant. The join eliminates outdated combinations of characteristics as well as items that are filtered by restrictions on CE3. Due to the n:1 mapping between CE3 and CE4, the size of the join result corresponds to the number of qualifying rows of CE3.
You should be aware of four relevant metrics related to the number of processed rows:
- The number of rows read from CE3 and CE4
- The number of rows that meet the join predicates
- The cardinality of the join result
- The size of the grouped result set
Those cardinalities are shown in Figure 1.
Figure 1. This simplified access plan shows the important cardinalities of each processing step.
The example report has a predicate on table CE3. On table CE4, only drill-down predicates are applied. Table 1 shows these four metrics for each navigation step.
Table 1. Rows that are read and returned by each step and operator
|step||join result||aggr. result|
The report-specific predicates reduce the number of rows fetched from CE3 from 62 million rows to 47 million rows. So, the report analyzes 75 percent of the available data. The number of rows fetched from CE4 in step 0 confirms that, initially, there are no restrictive predicates on this table. Only 15 million of the 24 million rows fetched from CE4 meet the join criteria, which indicates that 37 percent of the data is either outdated or filtered indirectly by the predicates on CE3.
Summarization levels
Summarization levels provide data from CE3 and CE4 as persisted aggregates. The level of detail is defined by the characteristics that are chosen when a summarization level is created.
The aggregate is materialized in two tables, K81n and K81(n+1), where n and n+1 are four-digit integers padded with zeros. K81(n+1) is similar to CE3 and, for the sake of clarity, we named it CE3S. It contains all key figures but no characteristics besides period and year. All characteristics, including the remaining four that were originally stored in CE3, are written to K81n, which is named CE4S for clarity in this paper. Reports read data from the summarization level with the lowest level of detail that still matches the report, to reduce the amount of data that must be processed.
Because it takes so long to create them, the number of summarization levels should be as low as possible. However, it's important to have a high level of detail in the summarization levels so that the data can be shared and reused by other reports. At the same time, a summarization level is supposed to reduce the number of rows. Moving the segment-specific characteristics from CE3 to CE4S increases the number of characteristic combinations, while keeping the cardinality of CE3S constant. Thus, choosing a level of detail that is too high can increase the amount of data instead of reducing it.
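The inflation effect can be estimated before a summarization level is built. The hypothetical query below (column names are illustrative, as in the earlier sketches) counts the distinct combinations that CE4S would contain once the segment-specific characteristics are included; if the count approaches or exceeds the cardinality of CE4, the chosen level of detail is too high:

```sql
-- Estimate the CE4S cardinality for a candidate summarization level:
-- the chosen drill-down characteristic (here kndnr) combined with the
-- segment-specific characteristics that move over from CE3.
SELECT COUNT(*)
  FROM (SELECT DISTINCT ce4.kndnr, ce3.vrgar, ce3.plikz, ce3.versi
          FROM CE3xxxx ce3
          JOIN CE4xxxx ce4 ON ce3.paobjnr = ce4.paobjnr) AS combos;
```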
Table 2 shows the cardinalities of the summarization level tables for each drill-down level of the example scenario. In addition to the corresponding drill-down characteristics, an additional characteristic was added to avoid tables that are too small.
Table 2. Comparison of original table cardinalities and the ones of summarization level tables for each drill-down step
The summarization levels for steps 3 and 4 are too detailed. Step 2 qualifies for a generic summarization level, as it is the most detailed one where the cardinality of CE4S is significantly lower than that of CE4. Depending on your performance requirements, you might take step 1 into consideration as well.
Enable intrapartition parallelism
By default, SAP environments disable intrapartition parallelism. SAP Note 2047006 explains how to enable it system-wide or for certain statements or applications. SAP Note 2052896 describes SAP CO-PA-specific aspects.
A system-wide enablement sets the default degree of intrapartition parallelism to ANY. This means that the degree is computed automatically based on the number of available CPU cores. If possible, setting the degree to ANY is the best choice, as it requires the least amount of effort for setup and maintenance.
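As a sketch, a system-wide enablement corresponds to the following plain DB2 settings. In an SAP system, follow SAP Note 2047006 rather than setting these values manually:

```sql
-- Enable intrapartition parallelism at the instance level
-- (requires an instance restart to take effect).
CALL SYSPROC.ADMIN_CMD('UPDATE DBM CFG USING INTRA_PARALLEL YES');

-- Let DB2 compute the default degree from the available CPU cores.
CALL SYSPROC.ADMIN_CMD('UPDATE DB CFG USING DFT_DEGREE ANY');

-- Per-session alternative: override the degree with a special register.
SET CURRENT DEGREE = 'ANY';
```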
If enabled for certain applications, intrapartition parallelism is enabled on the database side, but the default degree is set to 1. This minimizes the impact on other applications or workloads. The default degree can be overridden by an appropriate guideline added to a query by the database shared library (DBSL). The SAP Optimizer profiles give you a way to add optimization guidelines to queries based on the corresponding database object and the query structure. Their general usage is described in SAP Note 1818503.
Intrapartition parallelism for SAP CO-PA queries is enabled by adding an optimizer guideline that sets a parallel degree for queries that read from CE3 and CE4. For this purpose, an optimizer guideline is created by adding an entry to table DB6_OPTPROFILE by using the corresponding transaction. The tabname column corresponds to the object name displayed in performance traces; in this case, it is the name of the CE4 table that belongs to the operating concern. The pattern must match all reading queries against the CE4 and the CE3 table. The actual guideline is written to the column with the same name. Setting the degree value to ANY allows DB2 to set the value according to the available number of CPU cores. This results in a profile entry as shown in Table 3.
SAP CO-PA uses separate tables for each operating concern. You must use a separate profile entry for each operating concern and summarization level.
Because predicates of SAP CO-PA queries are defined on fields of
CE4, it's smart to evaluate them first and then join
CE3 by using the index. Nevertheless, the optimizer
might choose other access plans, depending on the kind of predicates and
the distribution of data. This behavior can negatively affect the
performance and can cause varying response times. It might be necessary to
extend the guideline to enforce a fixed access method of
CE3 as shown in the following example:
<DEGREE VALUE="ANY" /><NLJOIN><ACCESS TABLE='CE4xxxx' /><IXSCAN TABLE='CE3xxxx' /></NLJOIN>
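In effect, the profile causes the guideline to be appended to the matching statements. DB2 accepts inline optimization guidelines in a comment block at the end of a statement; a hypothetical CO-PA query with the guideline applied would look like this (column names are illustrative):

```sql
SELECT ce4.c0, SUM(ce3.revenue)
  FROM CE4xxxx ce4
  JOIN CE3xxxx ce3 ON ce3.paobjnr = ce4.paobjnr
 GROUP BY ce4.c0
/* <OPTGUIDELINES>
     <DEGREE VALUE="ANY" />
     <NLJOIN><ACCESS TABLE='CE4xxxx' /><IXSCAN TABLE='CE3xxxx' /></NLJOIN>
   </OPTGUIDELINES> */;
```

The NLJOIN element fixes the join order (CE4xxxx as the outer table) and forces an index scan on CE3xxxx, which prevents the optimizer from switching to a slower access plan for unfavorable predicate combinations.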
The results describe the database runtime of a query that fetches most of the data from CE3 and CE4, or from the corresponding summarization level tables. All aggregates are updated, and no additional data has to be loaded from the underlying tables.
A measurement with intrapartition parallelism enabled and the default degree set to "1" is used as reference. Based on the available hardware resources, a query degree of "16" is used for the measurements that use intrapartition parallelism.
Without summarization levels
The runtimes in Table 4 are measured without any summarization level and indicate how reports perform on a large amount of data. However, building summarization levels not only reduces the amount of data that must be processed, it also changes the ratio of the table cardinalities, as shown in Table 2. Therefore, the comparability of a measurement on a plain operating concern and a summarization level of a similar size is limited.
Table 4. Drill-down runtimes reading on demand without summarization levels
|step||degree 1||degree 16||speedup|
Reading all data at once takes 3,014.93 seconds. Using intrapartition parallelism at a degree of 16, reduces the query runtime to 1,117.82 seconds, which is 2.70 times faster.
With a single summarization level
The most appropriate level of detail for a summarization level depends on the scenario and requirements. The way to determine the most detailed but still reasonable level of detail is described in Summarization levels. To tune a single report, a lower level of detail might have to be chosen.
The most detailed summarization level for this example scenario is at step 2. Table 5 shows the runtimes and performance improvements that are gained when you use intrapartition parallelism with a summarization level at step 2. Using intrapartition parallelism improves the average performance by 8.05 times.
Table 5. Drill-down runtimes read on-demand with a summarization level at step 2
|step||degree 1||degree 16||speedup|
To get the best performance out of this report, the summarization level's level of detail must be as low as possible. To find it, examine the runtimes of each navigation step without summarization levels in Table 4. If database runtimes below two seconds are required for this scenario, the summarization level must be created at step 1, because it's the first step that does not meet that requirement. For larger scenarios, this procedure might have to be repeated to determine additional summarization levels. The results of a drill-down with a summarization level at step 1 are shown in Table 6.
Table 6. Drill-down runtimes read on demand with a summarization level at step 1
|step||degree 1||degree 16||speedup|
The measurements show that using intrapartition parallelism is 7.24 times faster than without intrapartition parallelism. Although it's a little less than with a more detailed summarization level, the total runtime of the drill-down process is 2.49 times faster, and each query runtime is below 2 seconds.
The increased speed that's created by intrapartition parallelism with a more detailed summarization level indicates that the speed increases as the data that is being processed by the database increases. Depending on the amount of data, you can achieve even better performance improvements when you use a higher degree of parallelization and more CPU cores.
The use of intrapartition parallelism increases query performance significantly, so fewer summarization levels are required to meet the response-time requirements. Because the data model is based on the operating concern and shared among all reports, some summarization levels might still be necessary in cases where the application server reads all the data at once or where a certain report usually reads data at a coarse level of detail.
In general, it makes sense to read data on demand to reduce the resource consumption on the application server. In this way, summarization levels can be used wherever they are required for performance reasons, but they are no longer needed to cope with functional restrictions caused by memory limitations on the application server. Using intrapartition parallelism reduces the query processing time and provides acceptable response times to the user.
The measurements confirm that the benefit of intrapartition parallelism increases with the amount of analyzed data. The example scenarios demonstrate an improvement up to a factor of eight. We tried to use a realistic test setup, but you may see different results in your production environments.
- Access SAP Notes for more information about the SAP products. Notes related to this article are:
- SAP Note 1818503: DB6: SAP Optimizer Profiles
- SAP Note 2047006: DB6: Use of DB2 SMP Parallelism (INTRA_PARALLEL=YES)
- SAP Note 2052896: DB6: Faster CO-PA reporting on database "IBM DB2 for Linux, UNIX, and Windows"
- DB2 partitioning features on developerWorks
- IBM Knowledge Center: Enabling intrapartition parallelism for queries
- Performance in the Information System: Profitability Analysis (COPA) - SAP Library