Tuesday, January 30, 2018

RCA: Core dump in Solaris sparc V9

As long as you have a core dump, you have the root cause.

There are several useful tools coming within a Solaris system. You can use pstack to have a quick look at the stack information when core happens:

-bash-4.1# pstack ./protectpoint.01.05.18
core './protectpoint.01.05.18' of 27684:        protectpoint backup show list scan dd-system primary
 fffffffc8f97b678 _ZN10rpSnapshot3endEv (1044c6480, 100000000, 1044c6480, 1, fffffffc8f97b578, 100000000) + 100
 00000001001b0bdc _ZN11SnapWrapper3endEv (1014308f8, 100946808, 4be, 3286960, 3286960, 100000000) + 154
 00000001001483c8 _ZN7SCMRPPP39nsm_centaur_parseSnapshotBufferStimgIDsEPKcPK13dd_stimg_infoPj (101430870, 10306c848, 0, ffffffff7ffff1bc, aca220, 100000000) + 828
 00000001001674f0 ca_nsm_centaur_get_snapshotBufferStimgIDs (10306c848, ffffffff7ffff1c0, ffffffff7ffff1bc, 0, ff000000000000, ffffffff7ffff220) + b0
 000000010011a220 ???????? (101427650, 102fffa50, 103, 1008ae588, 103, ffffffff7ffff3d8)
 0000000100119c7c _execute_cmd_ca_backup_list (ffffffff7ffff3d8, ffffffff7ffff4e0, 0, 0, 1, ffffffff7ffff4f8) + 1794
 00000001001070f8 ???????? (100ac8308, 7, ffffffff7ffff9b8, 1e, 3d, 2944)
 0000000100107454 _run_cmds (7, ffffffff7ffff9b8, ffffffff7ffff728, 1e, 0, ffffffff7ffff71c) + 1ec
 0000000100106964 ???????? (7, ffffffff7ffff9b8, 178, 937f8, 100b08c48, 93400)
 00000001001069e8 main (7, ffffffff7ffff9b8, ffffffff7ffff9f8, ffffffff7e500380, 2880, 2800) + 18
 00000001000e9a08 _start (0, 0, 0, 0, 0, 100a75450) + 108

Check if there are debuggers like gdb, dbx in the system. If yes, then go and use them. If not you may try something else like adb and mdb. Personally I'd like to use mdb if gdb/dbx is not available on the system.

-bash-4.1# mdb /path/to/your/bin   /path/to/your/core

Next, print stack with $C
> $C
ffffffff7fffe271 libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x100(1044c6480, 100000000, 1044c6480, 1,
fffffffc8f97b578, 100000000)
ffffffff7fffe3c1 _ZN11SnapWrapper3endEv+0x154(1014308f8, 100946808, 4be, 3286960, 3286960, 100000000)
ffffffff7fffe4b1 _ZN7SCMRPPP39nsm_centaur_parseSnapshotBufferStimgIDsEPKcPK13dd_stimg_infoPj+0x828(101430870,
10306c848, 0, ffffffff7ffff1bc, aca220, 100000000)
ffffffff7fffe7a1 ca_nsm_centaur_get_snapshotBufferStimgIDs+0xb0(10306c848, ffffffff7ffff1c0, ffffffff7ffff1bc, 0,
ff000000000000, ffffffff7ffff220)
ffffffff7fffe8c1 0x10011a220(101427650, 102fffa50, 103, 1008ae588, 103, ffffffff7ffff3d8)
ffffffff7fffea71 _execute_cmd_ca_backup_list+0x1794(ffffffff7ffff3d8, ffffffff7ffff4e0, 0, 0, 1, ffffffff7ffff4f8)
ffffffff7fffec21 0x1001070f8(100ac8308, 7, ffffffff7ffff9b8, 1e, 3d, 2944)
ffffffff7fffed51 _run_cmds+0x1ec(7, ffffffff7ffff9b8, ffffffff7ffff728, 1e, 0, ffffffff7ffff71c)
ffffffff7fffee51 0x100106964(7, ffffffff7ffff9b8, 178, 937f8, 100b08c48, 93400)
ffffffff7ffff031 main+0x18(7, ffffffff7ffff9b8, ffffffff7ffff9f8, ffffffff7e500380, 2880, 2800)
ffffffff7ffff101 _start+0x108(0, 0, 0, 0, 0, 100a75450)

You would see the same output as that from pstack.

You can demangle function names with c++filt, e.g.
fos@ubuntu:/usr/src$ echo _ZN10rpSnapshot3endEv |c++filt
rpSnapshot::end()
fos@ubuntu:/usr/src$ echo _ZN11SnapWrapper3endEv |c++filt
SnapWrapper::end()

Up to now, we can see that the process died in function rpSnapshot::end() in a library named libprotectpoint_rp.so.

And the exact address at which the process died is _ZN10rpSnapshot3endEv+0x100
We will figure out at which line exactly that the process died at in the next section.
Before we move on, let's print the general registers with $?
> $?
no process
SIGSEGV: Segmentation Fault
%g0 = 0x0000000000000000                 %l0 = 0xffffffff7f617400 ld.so.1`_rt_boot+0x5c
%g1 = 0x0000000000718ddc                 %l1 = 0x0000000000183000
%g2 = 0x00000001044c64c0                 %l2 = 0xffffffffffe7cc08
%g3 = 0x0000000000000000                 %l3 = 0xfffffffc9009e3f0
%g4 = 0x0000000000000000                 %l4 = 0xfffffffc8ff1aff8
%g5 = 0x0000000000000000                 %l5 = 0xfffffffc8ff1aff8
%g6 = 0x0000000000000000                 %l6 = 0xffffffff7e98f904 libc.so.1`_uberdata+0x2944
%g7 = 0xffffffff7e602a80                 %l7 = 0x0000000000002944
%o0 = 0x0000000000000000                 %i0 = 0x00000001044c6480
%o1 = 0x0000000000000009                 %i1 = 0x0000000100000000
%o2 = 0xfffffffc8ff1afa8                 %i2 = 0x00000001044c6480
%o3 = 0x00000000000000dc                 %i3 = 0x0000000000000001
%o4 = 0xfffffffc8ff1aff8                 %i4 = 0xfffffffc8f97b578 libprotectpoint_rp.so`_ZN10rpSnapshot3endEv
%o5 = 0x0000000000000000                 %i5 = 0x0000000100000000
%o6 = 0xffffffff7fffe271                 %i6 = 0xffffffff7fffe3c1
%o7 = 0xfffffffc8f97b664 libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0xec %i7 = 0x00000001001b0bdc
_ZN11SnapWrapper3endEv+0x154

 %ccr = 0x44 xcc=nZvc icc=nZvc
   %y = 0x0000000000000000
  %pc = 0xfffffffc8f97b678 libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x100
 %npc = 0xfffffffc8f97b67c libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x104
  %sp = 0xffffffff7fffe271
  %fp = 0xffffffff7fffe3c1

 %asi = 0x82
%fprs = 0x07

The above output gives us information o about the signal no.(SIGSEGV) and the values for the general registers. We will use some of the register values for the analysis in the next section.


Now let's have a look at the source code of  rpSnapshot::end():

1 void rpSnapshot::end() {
2   SSError *err = NULL;
3    char* fname = CLASSISnapshot(end);

4    RPENTER(fname);

5    try {
6        m_state = SSHOT_STATE_UNINIT;
7        RPDBGLOG(DEBUGPD, "setting snapshot state to UNINIT");
8        err = m_rpSnapshotPriv->end();
9        if (err != NULL) {
10            RPDBGLOG(DEBUGPI, "Error ending snapshot session: %s",
11                     err->getErrMsg());
12            delete err;
13        }
14    } catch (std::exception & exp) {
15        RPLOG(*msg_create(103000, ERRSEV_UNRECOVERABLE,
16            "Caught an exception while ending RecoverPoint snapshot operation - %s",
17            ARG_ERROR(exp.what())));
18    }

19    m_state = SSHOT_STATE_ERROR;
20    RPDBGLOG(DEBUGPD, "setting snapshot state to ERROR");
    // Reset the flag.
21    m_initDone = FALSE;

22    RPEXIT(fname);
}

The line bolded may stands out as that's the only thing rpSnapshot does in the implementation.
Note that it is an auto_ptr:
auto_ptr m_rpSnapshotPriv;

Next let's look into the disassembles.
> _ZN10rpSnapshot3endEv+0x100 ::dis
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0xd8:       mov       0xdc, %o3
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0xdc:
call      +0x72cf0c     _ZNK3emc6nsm_rp8RpLogger9logprintfEiPKciS3_z
>
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0xe0:       nop
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0xe4:       ldx       [%fp + 0x7ef], %g2
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0xe8:       add       %g2, 0x40, %o0
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0xec:
call      +0x9fa4       _ZNKSt8auto_ptrI14rpSnapshotPrivEptEv
>
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0xf0:       nop
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0xf4:       mov       %o0, %g3
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0xf8:       stx       %g3, [%fp + 0x7c7]
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0xfc:       mov       %g3, %g4
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x100:      ld        [%g4], %o2
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x104:      ld        [%g4 + 0x4], %o3
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x108:      srl       %o3, 0x0, %o4
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x10c:      sllx      %o2, 0x20, %o5
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x110:      or        %o4, %o5, %o7
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x114:      ld        [%o7 + 0x18], %l2
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x118:      ld        [%o7 + 0x1c], %l3
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x11c:      srl       %l3, 0x0, %l4
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x120:      sllx      %l2, 0x20, %l5
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x124:      or        %l4, %l5, %l6
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x128:      mov       %g3, %o0


Notice the function names and const names in the disassembles can help us locate the place where the core dumps.

From the above disassemble output, the core happens right after _ZNKSt8auto_ptrI14rpSnapshotPrivEptEv is called.
fos@ubuntu:/usr/src$ echo _ZNKSt8auto_ptrI14rpSnapshotPrivEptEv |c++filt
std::auto_ptr::operator->() const

Looking back into the source code, that's the line we have suspected and marked:
        err = m_rpSnapshotPriv->end();

The operator-> of auto_ptr dereference object member and returns the element pointed by the auto_ptr object in order to access one of its members. It returns NULL if this auto_ptr does not point to any object.
reference: http://www.cplusplus.com/reference/memory/auto_ptr/operator-%3E/

Up to now we have find the line of source code which causes the core and why it generates the core:
m_rpSnapshotPriv does not point to any object!!!

You can try to disassemble a wider address to catch something useful. For example, we continue to diassemble with
> _ZN10rpSnapshot3endEv+0x128 ::dis
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x100:      ld        [%g4], %o2
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x104:      ld        [%g4 + 0x4], %o3
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x108:      srl       %o3, 0x0, %o4
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x10c:      sllx      %o2, 0x20, %o5
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x110:      or        %o4, %o5, %o7
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x114:      ld        [%o7 + 0x18], %l2
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x118:      ld        [%o7 + 0x1c], %l3
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x11c:      srl       %l3, 0x0, %l4
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x120:      sllx      %l2, 0x20, %l5
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x124:      or        %l4, %l5, %l6
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x128:      mov       %g3, %o0
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x12c:      jmpl      %l6, %o7
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x130:      nop
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x134:      mov       %o0, %l7
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x138:      stx       %l7, [%fp + 0x7b7]
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x13c:      ldx       [%fp + 0x7b7], %g2
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x140:      stx       %g2, [%fp + 0x7e7]
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x144:      ldx       [%fp + 0x7e7], %g2
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x148:      cmp       %g2, 0x0
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x14c:
be,pn     %xcc, +0xd4   _ZN10rpSnapshot3endEv
+0x220>
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x150:      nop
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x154:
call      +0x72cd14     _ZN3emc6nsm_rp8RpLogger8instanceEv
>
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x158:      nop
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x15c:      mov       %o0, %g2
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x160:      stx       %g2, [%fp + 0x79f]
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x164:      ldx       [%fp + 0x7e7], %o0
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x168:      call      +0x72cec0     _ZN7SSError9getErrMsgEv
>
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x16c:      nop
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x170:      mov       %o0, %g2
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x174:      stx       %g2, [%fp + 0x797]
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x178:      ldx       [%fp + 0x79f], %g2

You'll see more function names and const names after the line where core happens, which help you narrow down the location.  For example, the 
fos@ubuntu:/usr/src$ echo _ZN3emc6nsm_rp8RpLogger8instanceEv |c++filt
emc::nsm_rp::RpLogger::instance() 
===> line 10:  RPDBGLOG(DEBUGPI, 

fos@ubuntu:/usr/src$ echo _ZN7SSError9getErrMsgEv |c++filt
SSError::getErrMsg()
===> line 11: err->getErrMsg()

If you know well about assembly language,
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x148:      cmp       %g2, 0x0
is for the source code at line 9:
9        if (err != NULL) {

From the above we can tell, the core dump happens before line 9.


Plus, let's have a look at the line where core happens in the diassembles output:
libprotectpoint_rp.so`_ZN10rpSnapshot3endEv+0x100:      ld        [%g4], %o2

If you scroll up you can find the values for g4 and o2
%g4 = 0x0000000000000000    %o2 = 0xfffffffc8ff1afa8

the assembly command 
    ld [%g4], %o2
moves data pointed by g4 to register o2. Note %g4 is point to 0x0000000000000000  !!!!
That' the culprit.


With only three commands($C, $?, address::dis) in mdb, you can debug a core dump and figure out the root cause of it.
Try it out and have fun.

The next post will look into the real problems behind this core dump - the culprit - why we hit this corner case, why the auto_ptr points to no object.

No comments: