IBM Support

Strange Crashes/Segmentation Faults in DB2

Technical Blog Post


Body

This is one of the more interesting issues we have recently debugged in the lab. The investigation involved analysis of a core file, the related DB2 source code, and diagnostic data. Here are the key points of this strange issue.
 

The initial symptom: whenever a connection was made to the database, the DB2 instance crashed.
While determining the scope of the issue, we found further problems on this machine (that is, the scope was not limited to one specific DB2 instance or database):
- Creating any DB2 instance or database failed with SQL1224N
- Any new installation of DB2 failed with a ‘segmentation fault’
- Any attempt to connect to an existing database crashed, regardless of which instance on the machine was used
- Any attempt to restore a database on the machine crashed
- Attempts to capture a DB2 trace were incomplete, and formatting the trace dump crashed

 

The db2diag.log showed a 'Memory validation failure' error:

 

2016-12-22-09.28.59.307242-300 E556974486E1023       LEVEL: Critical
PID     : 3195                 TID : 47134241974592  PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : SAMPLE
APPHDL  : 0-9                  APPID: *LOCAL.DB2.161222142900
AUTHID  : DB2INST1             HOSTNAME: db2machine
EDUID   : 33                   EDUNAME: db2taskd (SAMPLE) 0
FUNCTION: DB2 UDB, SQO Memory Management, sqloDiagnoseFreeBlockFailure, probe:10
MESSAGE : ADM14001C  An unexpected and critical error has occurred: "Panic".
          The instance may have been shutdown as a result. "Automatic" FODC
          (First Occurrence Data Capture) has been invoked and diagnostic
          information has been recorded in directory
          "/SAMPLE/home/db2inst1/sqllib/db2dump/FODC_Panic_2016-12-22-09.28.5
          9.289067_0000/". Please look in this directory for detailed evidence
          about what happened and contact IBM support if necessary to diagnose the problem.

2016-12-22-09.28.59.363664-300 E556975510E2580       LEVEL: Severe
PID     : 3195                 TID : 47134241974592  PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : SAMPLE
APPHDL  : 0-9                  APPID: *LOCAL.DB2.161222142900
AUTHID  : DB2INST1             HOSTNAME: db2machine
EDUID   : 33                   EDUNAME: db2taskd (SAMPLE) 0
FUNCTION: DB2 UDB, SQO Memory Management, sqloDiagnoseFreeBlockFailure, probe:999
MESSAGE : Memory validation failure, diagnostic file dumped.
DATA #1 : String, 28 bytes
Corrupt pool free tree node.
DATA #2 : File name, 39 bytes
3195.47134241974592.mem_diagnostics.txt
CALLSTCK: (Static functions may not be resolved correctly, as they are resolved to the nearest symbol)
  [0] 0x00002ADE33A674C4 _ZN13SQLO_MEM_POOL32diagnoseMemoryCorruptionAndCrashEmPKcb + 0x284
  [1] 0x00002ADE34A29D9B _ZN13SQLO_MEM_POOL10MemTreeGetEmmPP17SqloChunkSubgroupPj + 0x46B
  [2] 0x00002ADE34A2A9D3 _ZN13SQLO_MEM_POOL19allocateMemoryBlockEmmjmPP17SqloChunkSubgroupPjP12SMemLogEvent + 0x53
  [3] 0x00002ADE34A280A1 sqlogmblkEx + 0xA21
  [4] 0x00002ADE31D8D8FF _Z13sqliLoadIDXCBP8sqeAgentPP9SQLD_IXCBP13SQLI_ROOTVCTRP8SQLD_TCBtjtP14SQLI_PAGE_DESC + 0x16F
  [5] 0x00002ADE31D8D5D8 _Z25sqliLoadIDXCBFromRootPageP8sqeAgentP16SQLB_OBJECT_DESCP8SQLD_TCBtPP9SQLD_IXCBP9SQLB_PAGEj + 0x188
  [6] 0x00002ADE31D8CE39 _Z8sqliindxP8sqeAgentP16SQLB_OBJECT_DESCP8SQLD_TCBjjPP9SQLD_IXCBt + 0x1B9
  [7] 0x00002ADE2DDF317D /SAMPLE/home/db2inst1/sqllib/lib64/libdb2e.so.1 + 0x135E17D
  [8] 0x00002ADE2DDEA7AB _Z11sqldLoadTCBP8sqeAgentP8SQLD_TCBi + 0xB8B
  [9] 0x00002ADE3499197C _Z10sqldFixTCBP8sqeAgentiiiiPP8SQLD_TCBjj + 0x52C
  [10] 0x00002ADE3496B70A _Z19sqldLockTableFixTCBP8sqeAgenttthmiiimmiPciS1_iP14SQLP_LOCK_INFOPP8SQLD_TCBjj + 0x16A
  [11] 0x00002ADE34986C92 _Z12sqldScanOpenP8sqeAgentP14SQLD_SCANINFO1P14SQLD_SCANINFO2PPv + 0x9D2
  [12] 0x00002ADE31B103E7 _ZN16sqlrlCatalogScan4openEv + 0x467
  [13] 0x00002ADE2DA394CE _ZN9ABPDaemon29distributeToSingleDBPartitionEsbRm + 0x3FE
  [14] 0x00002ADE2DA38AD4 _ZN9ABPDaemon27distributeToAllDBPartitionsEv + 0x424
  [15] 0x00002ADE2DA37F60 _ZN9ABPDaemon4mainEv + 0x270
  [16] 0x00002ADE2DA4153D _Z19abpDaemonEntryPointP8sqeAgent + 0x6D
  [17] 0x00002ADE3085D10C _Z26sqleIndCoordProcessRequestP8sqeAgent + 0x127C
  [18] 0x00002ADE3086B896 _ZN8sqeAgent6RunEDUEv + 0x2B6
  [19] 0x00002ADE31E0DCA4 _ZN9sqzEDUObj9EDUDriverEv + 0xF4
  [20] 0x00002ADE31661617 sqloEDUEntry + 0x2F7
  [21] 0x0000003DB3C0683D /lib64/libpthread.so.0 + 0x683D
  [22] 0x0000003DB30D4FCD clone + 0x6D


Two trap files in FODC_Panic_2016-12-22-09.28.59.289067_0000 showed the following stack traces:
 

<StackTrace>
-----FUNC-ADDR---- ------FUNCTION + OFFSET------
0x00002ADE393C6625 _Z25ossDumpStackTraceInternalmR11OSSTrapFileiP7siginfoPvmm + 0x0385
0x00002ADE393C622C ossDumpStackTraceV98 + 0x002c
0x00002ADE393C132D _ZN11OSSTrapFile6dumpExEmiP7siginfoPvm + 0x00fd
0x00002ADE33A195CF sqlo_trce + 0x03ef
0x00002ADE33A6EBFF sqloEDUCodeTrapHandler + 0x025f
0x0000003DB3C0ECA0 address: 0x0000003DB3C0ECA0 ; dladdress: 0x0000003DB3C00000 ; offset in lib: 0x000000000000ECA0 ;
0x00002ADE33A623D0 sqloCrashOnCriticalMemoryValidationFailure + 0x0020
0x00002ADE33A674CD _ZN13SQLO_MEM_POOL32diagnoseMemoryCorruptionAndCrashEmPKcb + 0x028d
0x00002ADE34A29D9B _ZN13SQLO_MEM_POOL10MemTreeGetEmmPP17SqloChunkSubgroupPj + 0x046b
0x00002ADE34A2A9D3 _ZN13SQLO_MEM_POOL19allocateMemoryBlockEmmjmPP17SqloChunkSubgroupPjP12SMemLogEvent + 0x0053
0x00002ADE34A280A1 sqlogmblkEx + 0x0a21
0x00002ADE31D8D8FF _Z13sqliLoadIDXCBP8sqeAgentPP9SQLD_IXCBP13SQLI_ROOTVCTRP8SQLD_TCBtjtP14SQLI_PAGE_DESC + 0x016f
0x00002ADE31D8D5D8 _Z25sqliLoadIDXCBFromRootPageP8sqeAgentP16SQLB_OBJECT_DESCP8SQLD_TCBtPP9SQLD_IXCBP9SQLB_PAGEj + 0x0188
0x00002ADE31D8CE39 _Z8sqliindxP8sqeAgentP16SQLB_OBJECT_DESCP8SQLD_TCBjjPP9SQLD_IXCBt + 0x01b9
0x00002ADE2DDF317D address: 0x00002ADE2DDF317D ; dladdress: 0x00002ADE2CA95000 ; offset in lib: 0x000000000135E17D ;
0x00002ADE2DDEA7AB _Z11sqldLoadTCBP8sqeAgentP8SQLD_TCBi + 0x0b8b
0x00002ADE3499197C _Z10sqldFixTCBP8sqeAgentiiiiPP8SQLD_TCBjj + 0x052c
0x00002ADE3496B70A _Z19sqldLockTableFixTCBP8sqeAgenttthmiiimmiPciS1_iP14SQLP_LOCK_INFOPP8SQLD_TCBjj + 0x016a
0x00002ADE34986C92 _Z12sqldScanOpenP8sqeAgentP14SQLD_SCANINFO1P14SQLD_SCANINFO2PPv + 0x09d2
0x00002ADE31B103E7 _ZN16sqlrlCatalogScan4openEv + 0x0467
0x00002ADE2DA394CE _ZN9ABPDaemon29distributeToSingleDBPartitionEsbRm + 0x03fe
0x00002ADE2DA38AD4 _ZN9ABPDaemon27distributeToAllDBPartitionsEv + 0x0424
0x00002ADE2DA37F60 _ZN9ABPDaemon4mainEv + 0x0270
0x00002ADE2DA4153D _Z19abpDaemonEntryPointP8sqeAgent + 0x006d
0x00002ADE3085D10C _Z26sqleIndCoordProcessRequestP8sqeAgent + 0x127c
0x00002ADE3086B896 _ZN8sqeAgent6RunEDUEv + 0x02b6
0x00002ADE31E0DCA4 _ZN9sqzEDUObj9EDUDriverEv + 0x00f4
0x00002ADE31661617 sqloEDUEntry + 0x02f7
0x0000003DB3C0683D address: 0x0000003DB3C0683D ; dladdress: 0x0000003DB3C00000 ; offset in lib: 0x000000000000683D ;
0x0000003DB30D4FCD clone + 0x006d
</StackTrace>
 

<StackTrace>
-----FUNC-ADDR---- ------FUNCTION + OFFSET------
0x00002ADE393C6625 _Z25ossDumpStackTraceInternalmR11OSSTrapFileiP7siginfoPvmm + 0x0385
0x00002ADE393C622C ossDumpStackTraceV98 + 0x002c
0x00002ADE393C132D _ZN11OSSTrapFile6dumpExEmiP7siginfoPvm + 0x00fd
0x00002ADE33A195CF sqlo_trce + 0x03ef
0x00002ADE33A6EBFF sqloEDUCodeTrapHandler + 0x025f
0x0000003DB3C0ECA0 address: 0x0000003DB3C0ECA0 ; dladdress: 0x0000003DB3C00000 ; offset in lib: 0x000000000000ECA0 ;
0x00000000004258FE __intel_ssse3_rep_memcpy + 0x19ee
0x00002ADE2DDF0E41 address: 0x00002ADE2DDF0E41 ; dladdress: 0x00002ADE2CA95000 ; offset in lib: 0x000000000135BE41 ;
0x00002ADE2DDEA339 _Z11sqldLoadTCBP8sqeAgentP8SQLD_TCBi + 0x0719
0x00002ADE3499197C _Z10sqldFixTCBP8sqeAgentiiiiPP8SQLD_TCBjj + 0x052c
0x00002ADE3496B70A _Z19sqldLockTableFixTCBP8sqeAgenttthmiiimmiPciS1_iP14SQLP_LOCK_INFOPP8SQLD_TCBjj + 0x016a
0x00002ADE34986C92 _Z12sqldScanOpenP8sqeAgentP14SQLD_SCANINFO1P14SQLD_SCANINFO2PPv + 0x09d2
0x00002ADE31B103E7 _ZN16sqlrlCatalogScan4openEv + 0x0467
0x00002ADE30C68A2C _ZN13sqm_evmon_mgr18getAutostartEvmonsEP14SQLP_LOCK_INFOPj + 0x022c
0x00002ADE30C67F5F _ZN13sqm_evmon_mgr15autostartEvmonsEv + 0x027f
0x00002ADE30C433B9 _Z11sqlm_a_initP8sqeAgent + 0x0469
0x00002ADE30890E95 _ZN14sqeApplication20InitEngineComponentsEcP8sqeAgentP8SQLE_BWAP5sqlcaP22SQLESRSU_STATUS_VECTORc + 0x0975
0x00002ADE3088EBAD _ZN14sqeApplication13AppStartUsingEP8SQLE_BWAP8sqeAgentccP5sqlcaPc + 0x075d
0x00002ADE30885A69 _ZN14sqeApplication13AppLocalStartEP14db2UCinterface + 0x0579
0x00002ADE30A8D970 _Z11sqlelostWrpP14db2UCinterface + 0x0040
0x00002ADE30A8C845 _Z14sqleUCengnInitP14db2UCinterfacet + 0x06f5
0x00002ADE30A8B1F1 sqleUCagentConnect + 0x04b1
0x00002ADE30B97AF6 _Z18sqljsConnectAttachP13sqljsDrdaAsCbP14db2UCinterface + 0x00b6
0x00002ADE30B5B289 _Z16sqljs_ddm_accsecP14db2UCinterfaceP13sqljDDMObject + 0x03b9
0x00002ADE30B50648 _Z17sqljsParseConnectP13sqljsDrdaAsCbP13sqljDDMObjectP14db2UCinterface + 0x0058
0x00002ADE349A0D77 _Z10sqljsParseP13sqljsDrdaAsCbP14db2UCinterfaceP8sqeAgentb + 0x0377
0x00002ADE30B4A8E4 address: 0x00002ADE30B4A8E4 ; dladdress: 0x00002ADE2CA95000 ; offset in lib: 0x00000000040B58E4 ;
0x00002ADE30B48EC9 address: 0x00002ADE30B48EC9 ; dladdress: 0x00002ADE2CA95000 ; offset in lib: 0x00000000040B3EC9 ;
0x00002ADE30B45F69 address: 0x00002ADE30B45F69 ; dladdress: 0x00002ADE2CA95000 ; offset in lib: 0x00000000040B0F69 ;
0x00002ADE30B45B5B _Z17sqljsDrdaAsDriverP18SQLCC_INITSTRUCT_T + 0x00eb
0x00002ADE3086BE91 _ZN8sqeAgent6RunEDUEv + 0x08b1
0x00002ADE31E0DCA4 _ZN9sqzEDUObj9EDUDriverEv + 0x00f4
0x00002ADE31661617 sqloEDUEntry + 0x02f7
0x0000003DB3C0683D address: 0x0000003DB3C0683D ; dladdress: 0x0000003DB3C00000 ; offset in lib: 0x000000000000683D ;
0x0000003DB30D4FCD clone + 0x006d
</StackTrace>

 

Investigation of the core file and the memory diagnostic data showed that the problem lay in __intel_ssse3_rep_memcpy().
However, the core file also showed that the copy requests from DB2 were always correct (that is, they had a valid source, destination, and length).
Further research showed that the memcpy problem occurs when the processor's cache size is incorrectly registered as 0 KB.
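
As an illustration of how the faulting frame can be picked out of a trap file stack trace like the ones above, here is a minimal Python sketch. It rests on an assumption drawn only from the two traces shown in this post: the frame for sqloEDUCodeTrapHandler is followed by one unresolved signal frame, and the frame after that is where the signal was raised. The function name faulting_function is hypothetical and is not part of any DB2 tooling.

```python
import re

# Resolved frame lines look like:
#   0x00002ADE33A6EBFF sqloEDUCodeTrapHandler + 0x025f
# Unresolved frames ("address: ... ; dladdress: ...") do not match this pattern.
FRAME_RE = re.compile(r"^(0x[0-9A-Fa-f]+)\s+(\S+)\s+\+\s+0x[0-9A-Fa-f]+\s*$")


def faulting_function(trace_text, handler="sqloEDUCodeTrapHandler"):
    """Return the symbol of the frame that raised the signal, or None.

    Assumption (from the traces in this post only): the trap handler frame
    is followed by exactly one unresolved signal frame, and the next frame
    is the one that was executing when the signal arrived.
    """
    frames = [ln.strip() for ln in trace_text.splitlines()
              if ln.strip().startswith("0x")]
    for i, frame in enumerate(frames):
        match = FRAME_RE.match(frame)
        if match and match.group(2) == handler:
            if i + 2 >= len(frames):
                return None
            culprit = FRAME_RE.match(frames[i + 2])
            return culprit.group(2) if culprit else None
    return None


# Excerpt of the second trap file shown above.
SAMPLE_TRACE = """<StackTrace>
-----FUNC-ADDR---- ------FUNCTION + OFFSET------
0x00002ADE33A195CF sqlo_trce + 0x03ef
0x00002ADE33A6EBFF sqloEDUCodeTrapHandler + 0x025f
0x0000003DB3C0ECA0 address: 0x0000003DB3C0ECA0 ; dladdress: 0x0000003DB3C00000 ; offset in lib: 0x000000000000ECA0 ;
0x00000000004258FE __intel_ssse3_rep_memcpy + 0x19ee
0x00002ADE2DDEA339 _Z11sqldLoadTCBP8sqeAgentP8SQLD_TCBi + 0x0719
</StackTrace>"""

print(faulting_function(SAMPLE_TRACE))
```

Applied to the first trap file, the same sketch would report sqloCrashOnCriticalMemoryValidationFailure, which is a deliberate crash after corruption was detected, so the second trace is the one that points at the copy routine itself.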
 

If you encounter such a serious situation, check the cache size in /proc/cpuinfo as follows:

$ egrep "cache size" /proc/cpuinfo 
cache size      : 0 KB 
cache size      : 0 KB 
cache size      : 0 KB 
cache size      : 0 KB 

 
If any of them is 0 KB, the value is wrong and makes the Intel memcpy behave incorrectly: the cache size affects its internal block size, and a cache size of 0 can cause the copy to overrun the expected end of the destination. To the best of our knowledge, this means that the Intel processor hardware incorrectly reported the processor cache size to the Linux operating system.
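
The same check can be scripted so that it reports which processors are affected. The following is a minimal Python sketch, assuming the usual x86 Linux /proc/cpuinfo layout with "processor" and "cache size" fields. The function names zero_cache_cpus and check_cpuinfo are hypothetical helpers written for this post, not part of DB2 or Linux.

```python
import re


def zero_cache_cpus(cpuinfo_text):
    """Return the processor ids whose 'cache size' is reported as 0 KB."""
    bad = []
    current = None  # id taken from the most recent "processor" line
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processor entries
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value) if value.isdigit() else value
        elif key == "cache size":
            match = re.match(r"(\d+)\s*KB", value)
            if match and int(match.group(1)) == 0:
                bad.append(current)
    return bad


def check_cpuinfo(path="/proc/cpuinfo"):
    """Read the given cpuinfo file and return the affected processor ids."""
    with open(path) as f:
        return zero_cache_cpus(f.read())
```

On the affected machine, calling check_cpuinfo() would return a non-empty list. An empty list means that every processor entry reports a positive cache size.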

 
Recommended actions:
- Shut down the Linux OS and power off the hardware, then power it back on, boot Linux, and check whether the cache size in /proc/cpuinfo now shows a correct (positive) value. If it is still 0, engage hardware and/or Linux support to fix it.

OR

- If Linux is running in a virtual machine under a hypervisor such as VMware or KVM, restart everything: power the hardware off and on, then boot the host OS, the VM, and Linux, and check whether the cache size in /proc/cpuinfo now shows a correct (positive) value. If it is still 0, engage hardware and/or Linux support to fix it.

 
 
One last important question: why does the problem show up only with DB2 and not anywhere else on the system?

>>> For performance reasons, DB2 uses the Intel Fast memcpy library, which is embedded in the db2sysc binary, instead of the built-in memcpy of the Linux OS.
When the Intel Fast memcpy behaves incorrectly, that is, copies beyond the expected end of the destination, memory corruption occurs elsewhere in DB2 across a variety of operations. Non-DB2 operations and products rarely use the Intel Fast memcpy and typically use the built-in memcpy of the Linux OS, which is why the problems appear only with DB2.

 

Thanks,

Shashank Kharche
IBM DB2 LUW Lab

 
