
Loader throwing OutOfMemoryError and/or loader slowness in master cache process due to low memory

Troubleshooting


Problem

Loader throwing OutOfMemoryError and/or loader slowness in master cache process due to low memory.

Symptom

Loader processing can be slow due to memory issues if:
  1. The number of IDs to be cached by an instance is too high, which can happen if the LIF distribution is not in order.
  2. The data to be cached by a loader depends on the properties of the entity and the nature of the data, which differ in every environment. The more properties, and the more incorrect the LIF distribution, the more memory is required.
The following log snippets show the request-response between the loader and the master cache:
  1. Master Cache Response [OK]:

Loader

20:59:32,461 INFO  [LocalCache] TaskID[454], Master cache instance is initialized for eri_eran_20q1_02 calling processRequestMaster

20:59:38,925 INFO  [MasterCacheResponse] TaskID[454], Master Response. Return Code [OK]

--

NC_CACHE

20:59:38,924 INFO  [NcMasterCache] Master Cache processing file [processRequest started  loader [eri_eran_20q1_02] file [/data/ldr/eri/eran/loader/input/INUSE.Sample_data1.lif]]

20:59:38,924 INFO  [NcMasterCache] Master Cache processing summary: file [/data/ldr/eri/eran/loader/input/INUSE.Sample_data1.lif] started [Wed Nov 04 20:59:38 CET 2020], finished [Wed Nov 04 20:59:38 CET 2020], duration [0], response [OK]

2. Master Cache Response [RELOAD]:

Loader

20:59:25,250 INFO  [LocalCache] TaskID[454], Master cache instance is initialized for eri_eran_20q1_02 calling processRequestMaster

20:59:32,460 INFO  [MasterCacheResponse] TaskID[454], Master Response. Return Code [RELOAD]

--

NC_CACHE

20:59:32,440 INFO  [NcMasterCache] Master Cache processing file [processRequest started  loader [eri_eran_20q1_02] file [/data/ldr/eri/eran/loader/input/INUSE.Sample_data2.lif]]

20:59:32,459 INFO  [NcMasterCache] Master Cache processing summary: file [/data/ldr/eri/eran/loader/input/INUSE.Sample_data2.lif] started [Wed Nov 04 20:59:32 CET 2020], finished [Wed Nov 04 20:59:32 CET 2020], duration [19], response [RELOAD]

In both cases, nc_cache finished processing within a few milliseconds, but the request itself reached the master cache (nc_cache) with a delay.

Cause

One aspect of dealing with the slowness issue is to monitor the memory consumption of the loader instances at run time.

If the loader is not directly throwing an OutOfMemoryError in its logs or generating Java heap dumps, GC logging can be enabled at the application level for the loaders. Analyzing the GC logs provides instance-wise loader memory utilization, which helps in addressing the slowness.

Environment

Product: Tivoli Netcool Performance Manager for Wireless

Release: 1.3.2, 1.4.x

Platform: Solaris; Linux; AIX

Diagnosing The Problem

Follow the steps below to enable GC logging:
  • Navigate to /appl/virtuo/bin and search for the file run_njloader.
  • Take a backup of the existing run_njloader file.
  • Open the run_njloader file and add the following line to the JVM properties:
properties="-Xverbosegclog:/appl/GC/verbosegc.%Y%m%d.%H%M%S.%pid.txt $properties"
File contents after adding the above line:
image-20220309140718-11
  • Save the file and exit.
  • Create the directory GC under /appl (that is, /appl/GC) for GC log storage.
  • Stop the problematic loader instance that shows slow processing.
  • Start the loader instance and verify the GC log location in the output.
image-20220309141115-12
  • Restart the sapmgr service for the changes to take effect.
image-20220309141225-13
  • Restart the loader instance again and resume data processing from it.
  • After GC logging is enabled, logs are captured as follows:
image-20220309141341-14
  • Upload this file to an online portal such as GCeasy, which plots graphs for analysis.
image-20220309141424-15
  • The generated graphs look like the following:
        A.  GC graph for one day data loading:
image-20220309141550-16
         B. GC graph for one week’s data loading:
image-20220309141650-17
There are multiple applications that can plot the required graphs for memory analysis. GCeasy is one portal that we used to generate the GC graphs.
Sample GC log analysis and results of local testing:
The simulation used the 3 loader instances listed below; each loaded 1670 LIF files with different dates.
The following table shows that loading is fastest on the loader (huaweilteeutran_v100r016c10spc125) with the default configuration (without weak-HashMap and Refresh Threshold), but it is the first one to throw an OOM error.

The 2nd loader instance (huaweilteeutran_v100r016c10spc125_02) also hit an OOM error, much later, but recovered from it due to the weak-HashMap and continued loading.
However, it took the longest to load the LIF files, more than 1 hour longer than the other two loaders. This was due to the emptied cache, which resulted in high NC time.
The 3rd loader instance (huaweilteeutran_v100r016c10spc125_03) did not throw an OOM error, and took 8 minutes more than the fastest loader (the 1st instance) to load the data.
From the test, the 3rd configuration appears to be the best combination to avoid OOM while keeping loader performance. This instance was faster than the 2nd instance because it cleared the cache only after the threshold was reached.

| Loader Instance | Memory | Threading | Weak-HashMap | Refresh Threshold | No. of LIFs Loaded | Encountered OOM | OK Responses | RELOAD Responses | Time Taken to Load |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| huaweilteeutran_v100r016c10spc125 | 1.5 GB | 0,0 | FALSE | Not set | 1670 | Yes | 1671 | 446 | 4 hours 10 minutes |
| huaweilteeutran_v100r016c10spc125_02 | 2 GB | 0,0 | TRUE | Not set | 1670 | Yes | 1674 | 2143 | 5 hours 23 minutes |
| huaweilteeutran_v100r016c10spc125_03 | 1.5 GB | 0,0 | FALSE | 100 | 1670 | No | 1677 | 531 | 4 hours 18 minutes |

In the above table, you can see there are a lot of RELOAD responses for the 2nd instance because of the emptying of the cache. The 3rd instance also has slightly more RELOADs. Although the 1st instance threw an OOM error, it processed the LIFs fastest.
These instances were fed LIF files over the same time span, all starting at 13:00 hrs. Of the three instances, Instance 1 and Instance 2 went out of memory at a certain point, but after some time the application was revived. The following GC stats were obtained from the generated GC logs (all the red shapes in the graphs relate to GC runs).
GC is the process by which Java performs automatic memory management and frees up heap memory.

Loader Instance 1

 

image-20220309170923-1

The numbers denote the following:

  1. The LIF loading process started.
  2. GC in action, clearing heap memory to prevent an out-of-memory (OOM) error.
  3. GC in action, clearing heap memory, as the previous peak almost reached the total heap memory.
  4. The heap memory is almost exhausted and GC tries to clear out memory to prevent OOM.
  5. Finally, the system goes out of memory. The loader process is restarted by sap-mon and the system is revived. We see a large release of memory, which is due to the start of a new loader process.
  6. The LIF loading process is completed.

Loader Instance 2

 

image-20220309171142-2

The numbers denote the following:

  1. The LIF loading process started.
  2. GC in action, clearing heap memory to prevent OOM. This also involves the weak-HashMap implementation: the GC cycle actively removes the weak-HashMap references.
  3. The system finally goes out of memory. GC works on releasing the memory and the system is revived; thus we see a sharp release of memory. (The loader process is not restarted manually or by sap-mon.)
  4. The LIF loading process is completed.

Loader Instance 3:

 

image-20220309171746-3

The numbers denote the following:
  1. The LIF loading process started.
  2. The heap memory is almost exhausted (free memory reaching 100 MB or below). The threshold configuration comes into action, clearing the cache, which is followed by a GC run that releases the memory.
  3. GC in action, clearing heap memory, as the previous peak almost reached the total heap memory.
  4. Again, the heap memory is almost exhausted (free memory reaching 100 MB or below). The threshold configuration comes into action, clearing the cache, followed by a GC run that releases the memory.
  5. GC in action, clearing heap memory.
  6. The LIF loading process is completed.
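The threshold behaviour described above can be sketched as follows. This is an illustrative, hypothetical sketch, not the loader's actual implementation: when free heap drops below the configured limit, the local cache is cleared so that a subsequent GC run can reclaim the memory.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a threshold-driven cache refresh; class and field
// names are illustrative, not the loader's actual classes.
public class ThresholdCache {
    // Clear the cache when free heap falls below ~100 MB, mirroring the
    // "free memory reaching 100 MB or below" trigger described above.
    private static final long THRESHOLD_BYTES = 100L * 1024 * 1024;
    private final Map<String, String> cache = new HashMap<>();

    public void put(String id, String row) {
        if (Runtime.getRuntime().freeMemory() < THRESHOLD_BYTES) {
            // Entries become unreachable here; the heap itself is reclaimed
            // only on the next GC run.
            cache.clear();
        }
        cache.put(id, row);
    }

    public int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        ThresholdCache cache = new ThresholdCache();
        cache.put("NC_EUTRANCELL:1", "sample row");
        System.out.println("cached entries: " + cache.size());
    }
}
```

Note that clearing the map only makes entries eligible for collection; the heap memory is actually released when GC runs.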
The following is the data cached by the 1st instance before it threw the OOM error:
=================================
15:54:38,905 INFO  [LocalCache] TaskID[534], Allocated memory [1536], Current free/reserved JVM memory [15], Current memory consumed by loader process [1536], Nc cache Memory threshold [0]
15:54:38,905 INFO  [LocalCache] TaskID[534], The table NC_NETWORK has 1 no. of rows
15:54:38,905 INFO  [LocalCache] TaskID[534], The table NC_EUTRANCELL has 121873 no. of rows
15:54:38,905 INFO  [LocalCache] TaskID[534], The table NC_IPSECURITY has 354 no. of rows
15:54:38,905 INFO  [LocalCache] TaskID[534], The table NC_GTP_U has 383 no. of rows
15:54:38,905 INFO  [LocalCache] TaskID[534], The table NC_IPV6 has 4 no. of rows
15:54:38,905 INFO  [LocalCache] TaskID[534], The table NC_ENODEB_SIG has 315429 no. of rows
15:54:38,905 INFO  [LocalCache] TaskID[534], The table NC_REGION has 1 no. of rows
15:54:38,905 INFO  [LocalCache] TaskID[534], The table NC_FIREWALL has 295 no. of rows
15:54:38,905 INFO  [LocalCache] TaskID[534], The table NC_SCTP_H has 1227 no. of rows
15:54:38,906 INFO  [LocalCache] TaskID[534], The table NC_EUTRAN_TRAFFICBOARD has 161 no. of rows
15:54:38,913 INFO  [LocalCache] TaskID[534], The table NC_IKE has 73 no. of rows
15:54:38,913 INFO  [LocalCache] TaskID[534], The table NC_IP_PATH has 264 no. of rows
15:54:38,913 INFO  [LocalCache] TaskID[534], The table NC_SECTOR_EQUIPMENT has 1935 no. of rows
15:54:38,914 INFO  [LocalCache] TaskID[534], The table NC_OPERATOR_CELL has 35 no. of rows
15:54:38,914 INFO  [LocalCache] TaskID[534], The table NC_ETHERNETPORT has 453 no. of rows
15:54:38,914 INFO  [LocalCache] TaskID[534], The table NC_ENODEBFUNCTION has 8891 no. of rows
15:54:38,914 INFO  [LocalCache] TaskID[534], The table NC_BBUBOARD has 303 no. of rows
15:54:38,914 INFO  [LocalCache] TaskID[534], The table NC_IP_LAYER has 227 no. of rows
15:54:38,914 INFO  [LocalCache] TaskID[534], The table NC_MARKET has 1 no. of rows
15:54:38,914 INFO  [LocalCache] TaskID[534], The table NC_NODE_H has 1 no. of rows
=================================
We can see in the above log that the instance has to cache many tables, with row counts that depend on the number and types of blocks and cells being loaded.
So, we saw that the application went out of memory at a certain point but was eventually revived, either by sap-mon restarting the loader instance (Instance 1) or by a GC run (nc_cache aggressive memory clean-up, Instance 2). For the third instance, the threshold configuration and GC worked together efficiently to prevent any OOM.
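The "Allocated memory" and "Current free/reserved JVM memory" figures that the loader logs above are standard JVM heap counters. A minimal sketch of reading the same kind of numbers (generic JVM API, not the loader's own code):

```java
// Minimal sketch: reading JVM heap figures at run time. These are the same
// kind of numbers the loader logs as "Allocated memory" and "Current
// free/reserved JVM memory" (generic JVM API, not the loader's own code).
public class HeapSnapshot {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long totalMb = rt.totalMemory() / (1024 * 1024); // heap currently allocated to the JVM
        long freeMb  = rt.freeMemory()  / (1024 * 1024); // free space within that allocation
        long maxMb   = rt.maxMemory()   / (1024 * 1024); // ceiling set by -Xmx
        System.out.println("total=" + totalMb + " MB, free=" + freeMb
                + " MB, max=" + maxMb + " MB");
    }
}
```

Logging such a snapshot periodically gives a coarse view of heap pressure even before GC logging is enabled.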

Removing anything from the cache does not release heap memory immediately. Heap memory is released only after GC runs and removes weak references, out-of-scope objects, and so on.
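The weak-HashMap behaviour discussed in this document can be illustrated with Java's standard WeakHashMap: an entry whose key is no longer strongly referenced becomes eligible for removal once GC runs. A minimal sketch (the cache contents are hypothetical):

```java
import java.util.Map;
import java.util.WeakHashMap;

public class WeakCacheDemo {
    // Returns true if an entry whose key is still strongly referenced survives
    // a requested GC cycle. The other entry's key has no outside reference,
    // so GC may clear it and the map will then drop that entry.
    static boolean strongKeySurvives() throws InterruptedException {
        Map<Object, String> cache = new WeakHashMap<>();
        Object liveKey = new Object();      // strong reference kept below
        Object collectable = new Object();  // only the map will hold this key (weakly)
        cache.put(liveKey, "entity row A");
        cache.put(collectable, "entity row B");
        collectable = null;                 // drop the last strong reference
        System.gc();                        // advisory request; GC timing is not guaranteed
        Thread.sleep(100);                  // give the collector a moment
        return cache.containsKey(liveKey);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("strongly referenced entry survives: " + strongKeySurvives());
    }
}
```

Whether the weakly keyed entry has actually disappeared depends on when GC runs, which is exactly why clearing or weakening cache references does not free heap memory immediately.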

Thus, the optimal solution is to distribute LIFs according to region and block so that specific cells are catered for by particular loader instances.
Note: The above example was captured in a simulated environment and is meant only for understanding. Memory requirements, time taken to load the LIFs, and other parameters might differ in each environment and depend on the nature and type of the data.

Resolving The Problem

Follow one or more of these steps to resolve the issue:

  1. Identify points on the memory graph, check how frequently it reaches its peak, and increase the memory assigned to the loader.
  2. Implement or modify the LIF distribution so that it is based on node, region, or blocks across instances to minimize OOM errors and cache building. One gateway instance can load raw files for a specific region or regions and transfer LIFs to a set of loader instances; LIFs in these instances can be further distributed based on block.
  3. Modify the LIF distribution to allow an idle time of 2 - 5 minutes before starting the next batch of LIFs for every loader instance.
Note: Any changes should be considered only after analyzing logs and stats, and implemented after testing on a pre-production environment in consultation with administrators.

Document Location

Worldwide

[{"Type":"MASTER","Line of Business":{"code":"LOB45","label":"Automation"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSKGHX","label":"Tivoli Netcool Performance Manager for Wireless"},"ARM Category":[],"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"All Versions"}]

Product Synonym

TNPMW;Tivoli Netcool Performance Manager Wireless

Document Information

Modified date:
13 March 2022

UID

ibm16562159