INTRODUCTION:
When you experience OutOfMemory (OOM) scenarios related to Performance Data Warehouse (PDW) processing, which by default runs in the Support cluster of a gold topology, the root cause will typically be one or a combination of the following:
- Voluminous auto-tracking data
- Looping applications
- PDW-related configuration
Addressing these root causes is described in other documentation and is not discussed here. Instead, this document focuses on homing in on the PDW data that is driving these OOM scenarios and deleting it from the PDW database. Of course, you will also need to address the root cause(s) mentioned above, so that these issues do not persist over time.
Increasing your maximum heap size might be a temporary solution that allows you to continue processing during a PDW-related OOM scenario. However, you will most likely still need to follow through with the steps below, and once the offending data is removed from the database, you should be able to reduce your maximum heap size back to its original value.
An alternative to the database cleanup is to simply drop your PDW database and effectively start over, if that is something you are able to do, as documented here:
http://www-01.ibm.com/support/docview.wss?uid=swg21612755
Additionally, there is the "prune" command, available in various forms, that allows you to delete all tracking data older than a certain age. However, age is the only criterion available for pruning, so it is typically not useful in an OOM scenario, where the problem data is normally very recent, unless you are willing to delete all of your current data by pruning as well.
Otherwise, there is no simplified means of doing selective PDW tracking data deletion. Every scenario is unique and unpredictable in terms of the PDW database relationships that have to be followed and deleted in the correct sequence. This can only be done manually and very carefully, following the database constraints and using uniquely crafted SQL to do the deletions; this is by product design.
OVERVIEW:
A high-level overview for pursuing the selective deletion of PDW data is as follows:
1. Find all tasks in the PDW database with large counts of LSW_TRACKING_POINT_VALUEs.
2. Find all tasks in the PDW database with large counts of LSW_TIMING_INTERVAL_VALUEs.
3. Using these TASK_IDs (preferably one at a time), find the BPD instance IDs that generated this data, and use them to (later) determine the source of the high volume of data.
4. Delete the TG_PROCESSFLOWS data for the TASK_IDs found in steps 1 and 2.
5. Attempt to delete the LSW_TRACKING_POINT_VALUE data for the TASK_IDs from step 1. You will very likely get constraint violations, and will therefore need to follow the constraint violations and delete the data causing them.
Repeat step 5 until you can delete all the LSW_TRACKING_POINT_VALUEs for the TASK_ID.
6. Attempt to delete the LSW_TIMING_INTERVAL_VALUE data for the TASK_IDs from step 2. You will very likely get constraint violations, and will therefore need to follow the constraint violations and delete the data causing them.
Repeat step 6 until you can delete all the LSW_TIMING_INTERVAL_VALUEs for the TASK_ID.
As you work through these steps and make partial cleanups for a TASK_ID, it is very possible for new data to come in and create new relationships if enough time has passed, assuming you have not stopped the Support cluster or the PDW application (IBM_BPM_PerformanceDW) on the Support cluster during this manual deletion process. A good practice, therefore, is to start from the very beginning each time you return to this process, to ensure you clean up any new data created since the last time you ran any SQL deletes.
DETAILS:
All SQL shown here is DB2 SQL; you will need to modify accordingly for MS-SQL or Oracle.
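For example, the DB2 paging clause FETCH FIRST 100 ROWS ONLY used in the queries of steps 1 and 2 below would typically be replaced on the other databases along these lines; treat these variants as sketches and verify them against your own database version.
MS-SQL:
select top 100 FUNCTIONAL_TASK_ID, count(*) as total
from <pdwschema>.LSW_TRACKING_POINT_VALUE
group by FUNCTIONAL_TASK_ID
order by total desc
Oracle:
select * from ( select FUNCTIONAL_TASK_ID, count(*) as total
from <pdwschema>.LSW_TRACKING_POINT_VALUE
group by FUNCTIONAL_TASK_ID
order by total desc )
where ROWNUM <= 100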
For most of the SQL below, it is possible to string together more than one TASK_ID at a time, as shown in the example that follows. Until you are comfortable with the process, you may want to work through the steps after step 2 with one TASK_ID at a time.
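For example, the step 4 delete could target all three of the sample TASK_IDs shown in the results below with a single statement:
delete from <pdwschema>.TG_PROCESSFLOWS
where FUNCTIONAL_TASK_ID in (491, 217, 206)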
1. Find all tasks in the PDW database with large counts of LSW_TRACKING_POINT_VALUEs; start with counts greater than 10,000. Run the following SQL command on your PDW schema to find the TASK_IDs:
select * from ( select FUNCTIONAL_TASK_ID, count(*)
as total
from <pdwschema>.LSW_TRACKING_POINT_VALUE
group by FUNCTIONAL_TASK_ID
order by total desc ) as t
FETCH FIRST 100 ROWS ONLY
Sample result:
FUNCTIONAL_TASK_ID TOTAL
491 24
217 20
206 14
As an example, I will use TASK_ID 491 for the LSW_TRACKING_POINT_VALUE data below.
2. Find all tasks in the PDW database with large counts of LSW_TIMING_INTERVAL_VALUEs; start with counts greater than 10,000. Run the following SQL command on your PDW schema to find the TASK_IDs:
select * from ( select FUNCTIONAL_TASK_ID, count(*)
as total
from <pdwschema>.LSW_TIMING_INTERVAL_VALUE
group by FUNCTIONAL_TASK_ID
order by total desc ) as t
FETCH FIRST 100 ROWS ONLY
Sample result:
FUNCTIONAL_TASK_ID TOTAL
491 8
217 7
206 5
As an example, I will use TASK_ID 491 for the LSW_TIMING_INTERVAL_VALUE data below.
3. This step gives you some options for finding the source of the high volumes of data for the TASK_IDs identified above.
If you are interested in seeing which project and which snapshot this functional task ID belongs to, you can run the following queries (or your own variations of similar queries) against the PDW database. This information is useful because it gives you the names of the sub-BPDs that might be involved, and it can be handed over to your developers to pursue the data volume problems and prevent them in the future.
Using the TASK_IDs selected above, run the following SQL command on your PDW schema to get the BPD instance ID information for the BPD instance involved with the high volumes of PDW data (you may also want to add ACTIVITY_NAME to the query, as shown in the variation below).
select distinct (SYSTEM_FUNCTIONAL_TASK_ID)
as bpd_instance_id, BPD_NAME, SNAPSHOT_ID
from <pdwschema>.LSW_TASK
where FUNCTIONAL_TASK_ID in (491)
Sample result:
BPD_INSTANCE_ID BPD_NAME SNAPSHOT_ID
206 Standard HR Open New Position 3f40e811-a09e-4fc5-8b5d-e4a098cd3e41
When you run this query, if you do not get back an integer for the BPD_INSTANCE_ID (for instance, you get back a value that starts with "x"), then the data problems are not coming from BPDs.
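As noted above, you can also include the ACTIVITY_NAME column so that your developers know which activities are producing the volume; a variation along these lines returns one row per distinct activity:
select distinct SYSTEM_FUNCTIONAL_TASK_ID as bpd_instance_id,
BPD_NAME, ACTIVITY_NAME, SNAPSHOT_ID
from <pdwschema>.LSW_TASK
where FUNCTIONAL_TASK_ID in (491)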
With this snapshot id, you can run another query against the BPM database to get further details:
select
p.name "Project", p.short_name "Short", s.name "Snapshot", p.is_toolkit "Is_toolkit"
from
<bpmschema>.lsw_snapshot s,
<bpmschema>.lsw_project p
where s.snapshot_id like '3f40e811-a09e-4fc5-8b5d-e4a098cd3e41%'
and s.project_id = p.project_id;
The result shows the project name, the project acronym, the snapshot name, and whether or not it is a toolkit.
Sample result:
Project Short Snapshot Is_toolkit
Hiring Sample HSS v3.6 F
4. At this point you are committed to deleting PDW data for the tasks with the large numbers of LSW_TRACKING_POINT_VALUEs and/or LSW_TIMING_INTERVAL_VALUEs. It is not necessary to stop the Support cluster or the PDW application on the Support cluster, but it is a good idea to do so in order to keep new data from coming in while you work through the rest of this process. You may also want to back up the PDW schema, in case you run into problems and need to back out.
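If a full backup of the PDW schema is not practical, a lighter-weight alternative is to export just the rows you are about to delete, so they can be re-imported if needed. As a minimal sketch, run from the DB2 command line processor (the file name here is only an example):
export to lsw_tracking_point_value_491.ixf of ixf
select * from <pdwschema>.LSW_TRACKING_POINT_VALUE
where FUNCTIONAL_TASK_ID in (491)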
Before any other deletions, first delete the rows in the TG_PROCESSFLOWS table for the TASK_IDs; this table is used with autotracking. If the TASK_ID has no autotracking associated with it, no data will be deleted, so it is OK to see nothing being deleted.
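If you want to confirm up front whether a TASK_ID has any autotracking rows, a simple count (shown here with the sample TASK_ID) can be run first:
select count(*) from <pdwschema>.TG_PROCESSFLOWS
where FUNCTIONAL_TASK_ID in (491)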
Run the following SQL command on your PDW schema to delete data from TG_PROCESSFLOWS for the TASK_ID:
delete from <pdwschema>.TG_PROCESSFLOWS
where FUNCTIONAL_TASK_ID in (491)
Sample result:
DB20000I The SQL command completed successfully.
5. Run the following SQL command on your PDW schema to attempt to delete LSW_TRACKING_POINT_VALUE data from the TASK_ID:
delete from <pdwschema>.LSW_TRACKING_POINT_VALUE
where functional_task_id in (491)
Once this command completes successfully, continue with step 6.
Sample result:
DB20000I The SQL command completed successfully.
The command will likely fail at least once, possibly even multiple times, as you go through the following process. You should get a constraint violation because there are Target Group (TG) tables in the database with data referencing the data you are attempting to delete. You may see different details in the constraint messages, but they should all identify the TG table causing the constraint. For example:
SQL0532N A parent row cannot be deleted because the relationship "DB2ADMIN.TG_AEMPREQUISITION121381434563.TG_AEMPREQUISI_FK0" restricts the deletion. SQLSTATE=23504
These constraints can be resolved by running the following delete command (Note: Your table names will be different):
delete from <pdwschema>.TG_AEMPREQUISITION121381434563
where FUNCTIONAL_TASK_ID in (491)
Sample result:
DB20000I The SQL command completed successfully.
If you get a further constraint violation, continue this process with the TG table identified in the new constraint message. Then go back to the first delete in this step, and keep repeating this process until that first delete succeeds.
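Rather than discovering the TG tables one constraint violation at a time, your database administrator may be able to list all of the tables that directly reference the two parent tables up front. On DB2, the SYSCAT.REFERENCES catalog view is one way to do this (the equivalent catalog views differ on MS-SQL and Oracle); note that the TG tables found this way may in turn have their own child tables, so some iteration is still to be expected:
select TABSCHEMA, TABNAME, CONSTNAME
from SYSCAT.REFERENCES
where REFTABNAME in ('LSW_TRACKING_POINT_VALUE', 'LSW_TIMING_INTERVAL_VALUE')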
6. Run the following SQL command on your PDW schema to attempt to delete LSW_TIMING_INTERVAL_VALUE data from the TASK_ID:
delete from <pdwschema>.LSW_TIMING_INTERVAL_VALUE
where functional_task_id in (491)
Then follow the same process as in step 5, until this command completes successfully.
Again, the command will likely fail, for example:
SQL0532N A parent row cannot be deleted because the relationship
"DB2ADMIN.TG_AEMPREQUISITION121381434563.TG_AEMPREQUISI_FK0" restricts the deletion. SQLSTATE=23504
These constraints can be resolved by running the following delete command (Note: Your table names will be different):
delete from <pdwschema>.TG_AEMPREQUISITION121381434563
where FUNCTIONAL_TASK_ID in (491)
Sample result:
DB20000I The SQL command completed successfully.
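Once the deletes in steps 5 and 6 have completed successfully for a TASK_ID, you can confirm the cleanup with simple counts; both of the following should now return zero:
select count(*) from <pdwschema>.LSW_TRACKING_POINT_VALUE
where FUNCTIONAL_TASK_ID in (491)
select count(*) from <pdwschema>.LSW_TIMING_INTERVAL_VALUE
where FUNCTIONAL_TASK_ID in (491)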
MISC. CONSIDERATIONS:
Executing the correct delete SQL for the constraints can get tricky, as many TG tables might be affected. It is suggested that you involve your database administrator to follow the constraints and delete the data from the parent or child tables as appropriate.
After you complete this process for all the tasks returning over 10,000 results in steps 1 and 2 above, restart the Support cluster; it should continue without running out of heap memory.
If you continue to get OOMs, you will need to repeat this entire process until you have found the source of all this data and stopped it from being created in the first place.
Before restarting, check the row count of the BPM LSW_PERF_DATA_TRANSFER table. If it is high, check it again after the restart, once the PDW has had a chance to process that data, and ensure the row count has gone down. Then, because you may have processed a lot of new data from that BPM transfer table, you may have uploaded a lot of new PDW data, and may therefore need to go through the whole process above again.
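A simple count, run against the BPM database rather than the PDW database, is sufficient for this check:
select count(*) from <bpmschema>.LSW_PERF_DATA_TRANSFER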
If you continue to have issues after pursuing this process, or if this process is not working for you, then please open a PMR to engage IBM BPM L2 support for assistance.
Thanks go to the following people who provided input to this article: Prasad Keskar, Klaus Ulrich and Thomas Nold.