Viewing Core dumps CUCM

Core dumps, ah yes! The boring stuff. Sometimes they are an unavoidable part of the troubleshooting process. They can be particularly useful if you see certain services restart on your server for no apparent reason. This is a quick little post on how to obtain core dumps, how to capture their output, and what to look for once you have it.

What are they?
A core dump is written when a Linux process experiences a fault. This results in an outage of the affected process or service, which must restart to recover. During these incidents the server may remain up, but certain services may experience a brief outage. Core dumps are created on all Linux appliances, such as CUCM, CUPS and CUC.


Where are they?
RTMT generates alerts in Alert Central that indicate core dumps have been created. The content of these alerts looks like this:

 At Wed Aug 28 07:12:44 EST 2013 on node CUC1, the following CoreDumpFileFound events generated:  
TotalCoresFound : 1
CoreDetails : The following lists up to 6 cores dumped by corresponding applications. 
Core1 : Cisco Tomcat (core.17005.6.tomcat.1377637933) AppID : Cisco Log Partition Monitoring Tool ClusterID :  
NodeID : CUC1
 TimeStamp : Wed Aug 28 07:12:18 EST 2013

 
 
Now that we have a time stamp, log on to the OS CLI and list the available core dumps with utils core active list (see below).
Figure 1 – Core dump file list on active partition

 

As you can see, the 07:12 time stamp coincides with the last available Tomcat core dump in our list. So let’s have a look at that file and see if we can produce a backtrace from it.
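If you want to match a core file to an alert without squinting at the list, the file name itself carries the clues. The sketch below assumes the name follows the pattern core.<pid>.<signal>.<process>.<epoch seconds>; that layout is inferred from the example in this post rather than taken from Cisco documentation, and parse_core_name is just an illustrative helper.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class CoreFile(NamedTuple):
    pid: int
    signal: int
    process: str
    created_utc: datetime


def parse_core_name(name: str) -> CoreFile:
    """Split a name like core.17005.6.tomcat.1377637933 into its parts.

    Assumed layout (inferred, not documented): core.<pid>.<signal>.<process>.<epoch>
    """
    prefix, pid, sig, rest = name.split(".", 3)
    if prefix != "core":
        raise ValueError(f"not a core file name: {name!r}")
    # The process name could itself contain dots, so take the epoch from the right.
    process, epoch = rest.rsplit(".", 1)
    created = datetime.fromtimestamp(int(epoch), tz=timezone.utc)
    return CoreFile(int(pid), int(sig), process, created)


core = parse_core_name("core.17005.6.tomcat.1377637933")
print(core.process, core.signal, core.created_utc.isoformat())
# prints: tomcat 6 2013-08-27T21:12:13+00:00
```

21:12:13 UTC is 07:12:13 on 28 August at UTC+10, which lines up with the alert's 07:12:18 time stamp if the "EST" in the alert is Australian Eastern Standard Time (an assumption on my part).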


What’s in them?

Let’s stay with the core dump that was created at 07:12 on 28 August. Issue the following command:


utils core active analyze <file name>

and in this case:

utils core active analyze core.17005.6.tomcat.1377637933

Please note that this can generate substantial I/O, so it should preferably be carried out after hours.

Running this analysis through the CLI produces a lot of output. It is therefore advisable to log the console session so you can paste the output into something like Notepad++ for further analysis.

 

The most interesting part of the core dump analysis is the actual backtrace. This is where you will find an indication of what has gone wrong with a certain service. In this particular case it looked like this:

Program terminated with signal 6, Aborted.
#0  0x00139206 in raise () from /lib/libc.so.6
  ====================================
 backtrace
 ===================================
#0  0x00139206 in raise () from /lib/libc.so.6
#1  0x0013abd1 in abort () from /lib/libc.so.6
#2  0x015d9f7f in os::abort(bool) () from /usr/local/thirdparty/java/j2sdk/jre/lib/i386/server/libjvm.so
#3  0x013dae3b in vm_abort(bool) () from /usr/local/thirdparty/java/j2sdk/jre/lib/i386/server/libjvm.so
#4  0x012f38cc in report_vm_out_of_memory(char const*, int, unsigned int, char const*) () from /usr/local/thirdparty/java/j2sdk/jre/lib/i386/server/libjvm.so
#5  0x01341115 in ElfSymbolTable::ElfSymbolTable(_IO_FILE*, Elf32_Shdr) () from /usr/local/thirdparty/java/j2sdk/jre/lib/i386/server/libjvm.so
#6  0x0134088c in ElfFile::load_tables() () from /usr/local/thirdparty/java/j2sdk/jre/lib/i386/server/libjvm.so
#7  0x01340589 in ElfFile::ElfFile(char const*) () from /usr/local/thirdparty/java/j2sdk/jre/lib/i386/server/libjvm.so
#8  0x012f6273 in Decoder::get_elf_file(char const*) () from /usr/local/thirdparty/java/j2sdk/jre/lib/i386/server/libjvm.so
#9  0x012f613c in Decoder::decode(unsigned char*, char const*, char*, int, int*) () from /usr/local/thirdparty/java/j2sdk/jre/lib/i386/server/libjvm.so
#10 0x015da425 in os::dll_address_to_function_name(unsigned char*, char*, int, int*) () from /usr/local/thirdparty/java/j2sdk/jre/lib/i386/server/libjvm.so
#11 0x01354204 in print_C_frame(outputStream*, char*, int, unsigned char*) () from /usr/local/thirdparty/java/j2sdk/jre/lib/i386/server/libjvm.so
#12 0x0172056b in VMError::report(outputStream*) () from /usr/local/thirdparty/java/j2sdk/jre/lib/i386/server/libjvm.so
#13 0x01721780 in VMError::report_and_die() () from /usr/local/thirdparty/java/j2sdk/jre/lib/i386/server/libjvm.so
#14 0x012f38bf in report_vm_out_of_memory(char const*, int, unsigned int, char const*) () from /usr/local/thirdparty/java/j2sdk/jre/lib/i386/server/libjvm.so
#15 0x01154e08 in ChunkPool::allocate(unsigned int) () from /usr/local/thirdparty/java/j2sdk/jre/lib/i386/server/libjvm.so
#16 0x01154be6 in Arena::grow(unsigned int) () from /usr/local/thirdparty/java/j2sdk/jre/lib/i386/server/libjvm.so
#17 0x01154d12 in Arena::Arealloc(void*, unsigned int, unsigned int) () from /usr/local/thirdparty/java/j2sdk/jre/lib/i386/server/libjvm.so
#18 0x015c5757 in Node::out_grow(unsigned int) () from /usr/local/thirdparty/java/j2sdk/jre/lib/i386/server/libjvm.so
#19 0x0111ade0 in Node::add_out(Node*) () from /usr/local/thirdparty/java/j2sdk/jre/lib/i386/server/libjvm.so
#20 0x012e6e84 in ConNode::make(Compile*, Type const*) () from /usr/local/thirdparty/java/j2sdk/jre/lib/i386/server/libjvm.so
#21 0x016157e5 in PhaseValues::uncached_makecon(Type const*) () from /usr/local/thirdparty/java/j2sdk/jre/lib/

etc.

 
What to do with them?
 
As you can see, this is very cryptic stuff unless you are a developer or are pretty good with Linux. What I tend to do is run through the output and search for words like “error”, “abort” or “restart”. Sometimes you will need to eyeball the bastard in its entirety. I realise it is a bit of a black art. In this example, frames #4 and #14 (report_vm_out_of_memory) suggest that the Tomcat JVM aborted because it ran out of memory. Once you have found something that describes the issue at hand, paste it into the Bug ID search tool on cisco.com. Make sure you do not paste in unique memory addresses (such as 0x012f38cc), because they will make your search too narrow.
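If you have saved the console log to a file, the keyword hunt can be roughed out in a few lines of Python. This is only a sketch: it assumes the gdb-style frame layout shown in the backtrace above, the helper names are made up for illustration, and "out_of_memory" is a keyword I added because it turned up in this particular trace.

```python
import re

# Matches gdb-style frames such as "#4  0x012f38cc in report_vm_out_of_memory(...) () from ...".
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-fA-F]+\s+in\s+([^\s(]+)")
# Keywords from the post, plus "out_of_memory" (my own addition for this trace).
KEYWORDS = ("error", "abort", "restart", "out_of_memory")


def frame_symbols(text: str) -> list[tuple[int, str]]:
    """Return (frame number, function name) pairs with the addresses dropped."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((int(match.group(1)), match.group(2)))
    return frames


def search_terms(text: str) -> list[str]:
    """Unique function names containing a keyword, in first-seen order."""
    seen = []
    for _, name in frame_symbols(text):
        if any(k in name.lower() for k in KEYWORDS) and name not in seen:
            seen.append(name)
    return seen


sample = """\
#0  0x00139206 in raise () from /lib/libc.so.6
#1  0x0013abd1 in abort () from /lib/libc.so.6
#2  0x015d9f7f in os::abort(bool) () from /usr/local/thirdparty/java/j2sdk/jre/lib/i386/server/libjvm.so
#4  0x012f38cc in report_vm_out_of_memory(char const*, int, unsigned int, char const*) () from /usr/local/thirdparty/java/j2sdk/jre/lib/i386/server/libjvm.so
#12 0x0172056b in VMError::report(outputStream*) () from /usr/local/thirdparty/java/j2sdk/jre/lib/i386/server/libjvm.so
"""
print(search_terms(sample))
# prints: ['abort', 'os::abort', 'report_vm_out_of_memory', 'VMError::report']
```

The names that come out contain no memory addresses, so they are reasonable candidates to paste into a bug search. You will still want to read the surrounding frames before settling on a diagnosis.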
 
Also check the uptime of the service that was affected by the reported alert by going to Serviceability. This at least confirms that something is definitely wrong and shows which services are being affected. (In my particular case I could see that the Tomcat service had indeed been restarted at the given time stamp.)
 
If you draw blanks, just log it with TAC and attach the backtrace.
 
Here is some additional reading on the subject, thanks to Matthew Taber:

https://supportforums.cisco.com/docs/DOC-14743