Dear Friends of the Debugging-Fun,
A few days ago, I was presented an interesting issue. One of my customers experienced an unacceptable wait problem while starting their BPM AppCluster.
As always, the first step is to look at the SystemOut.log to see, if there is anything interesting in there.
In this case, there were a vast numer of DB queries, similar to this:
select data,version_id from lsw_bpd where version_id in (select distinct ver.po_version_id from lsw_po_versions ver inner join lsw_snapshot snap on (ver.branch_id=snap.branch_id and ver.start_seq_num<=snap.seq_num and ver.end_seq_num > snap.seq_num) where ver.po_type=25 and snap.name is not null)
Yep, yet again another Performance debugging.
As my customer mentioned the delay starts right then, when cache settings were loaded, I checked the SystemOut.log for anything Cache related during the start-up.
I found the following line:
000000ae wle I CWLLG2155I: Cache settings read have been from file file:
As you can see, the threadID for that activity is 000000ae. So I grepped for 000000ae to see the whole thread.
And I was quite surprised to see a !9 MINUTE! delay in the start-up. You know, most of the performance cases we see are on a very high level, meaning we are talking of improvements of a few seconds. But this one, I have to tell you, had a good point. Now 9 minutes is an awefully long time to wait. 9 minutes - that is about 2 cups of coffee. Too long, as you will most likely agree.
From the information in the logs it seemed that the delay was related to the Cache, as it occured between the cache settings being loaded and the cache being initialized.
So, where could this be coming from? Maybe some slow DB causing this delay? Well, we do not yet have enough information to tell, hence we need some traces.
The following trace string is the one to start with for such issues:
The trace.log gives us more info.
We can see the known code from the screenshot above "CWLLG2155I" and some cache activities for e.g. LRU Cache, BDAC(BusinessDataAliasCache), etc.
[3/24/15 9:10:09:109 EET] 000000ae wle I CWLLG2155I: Cache settings read have been from file file:/opt/IBM/BPM/v85/BPM/Lombardi/process-server/lib/procsrv_resources.jar!/LombardiTeamWorksCache.xml
[3/24/15 9:10:09:749 EET] 000000ae wle_versionin 3 com.lombardisoftware.utility.collections.LRUCache createCache maxSize :64
[3/24/15 9:10:09:749 EET] 000000ae wle_versionin 3 com.lombardisoftware.utility.collections.LRUCache <init> New LRUCache created :853f63c0
Here we can see the Cache size, which reminds me of a useful article on how to configure them for tuning purposes. Find the link here.
[3/24/15 9:10:17:496 EET] 000000ae wle 1 com.lombardisoftware.core.cache.LocalCache initCache BusinessDataAliasesCache initC()
[3/24/15 9:10:17:496 EET] 000000ae wle 1 BDAC - initialized
[3/24/15 9:10:17:496 EET] 000000ae wle 1 BDAC - loadCacheData(name=BusinessDataAliasesCache, key=*
[3/24/15 9:10:17:500 EET] 000000ae wle > com.lombardisoftware.client.delegate.common.ServiceLocator lookup ENTRY jdbc/TeamWorksDB
I took a deeper look at the cache activities and stumbled across the following:
The BDAC cache load started at: 09:10:17
And it finished at: 09:16:26
So, what does this mean? These are 6 minutes only for the BDAC cache (which is 2/3 of 9 minutes load time). Aha!
So, now we need some further info on the BDAC. Upon further research, I found this:
* Searchable business data aliases are *
* the columns that you can use in saved *
* searches, the TWSearch API, and the *
* REST Search API. If you set the *
* <use-business-aliases-for-process- *
* instances> property value to "false", *
* the list of searchable business data *
* aliases are returned based on *
* snapshot definitions. *
This APAR was introduced for BPM 188.8.131.52 with a new property named <use-business-aliases-for-process-instances>.
The default value for that property is " T R U E ".
TRUE means that the list of searchable Business data aliases are returned only when they are used in actual process instances.
If FALSE is selected, the Business data aliases will be created at the server startup already BEFORE the first access by process instances. If there is a large number of BPDs, searching all the BPDs can take a very long time.
The more process applications are deployed on server the more aliases will be created, the longer the cache load time takes.
So, if this was introduced in 184.108.40.206 why did I still see this in 8.5.5? APAR JR47818 is part of BPM v8.5.5.
Hm, what could have happened here?
As my grandmother would have said: Never trust a running config (if she had known what a running config was). Hence lets have a look at the 99local.xml file on my BPM 8.5.5 setup to verify the default configuration for that attribute.
Indeed, the value is set to " F A L S E " !
Checking the customer's 99local.xml and 100Custom.xml showed the same.
Let's wrap up our findings:
1) APAR JR47818 (with BPM 220.127.116.11 ) introduced a new property <use-business-aliases-for-process-instances>. The default value for that property is " T R U E ".
2) BPM 18.104.22.168 includes APAR JR47818 but in that version the default value for the property <use-business-aliases-for-process-instances> is " F A L S E ".
That is quite a surprise! But at least, we know what to do next.
Exactly! We need to change that property to true.
So I advised my customer to change the setting for that property to " T R U E " and the problem was solved.
As you can see, sometimes all it takes is a bit of time and research to tackle these kind of issues.
And if this article does not help, take two of these and call me in the morning.