Troubleshooting Java Process Crashes

Source: Internet. Published by: 阐释者淘宝. Editor: 程序博客网. Date: 2024/05/29 05:08

This article provides diagnostic information and troubleshooting guidance for system crashes in a number of specific scenarios.

A crash, or fatal error, causes a process to terminate abnormally. There are various possible reasons for a crash. For example, a crash can occur due to a bug in the HotSpot VM, in a system library, in a Java SE library or API, in application native code, or even in the operating system. External factors, such as resource exhaustion in the operating system, can also cause a crash.

Crashes caused by bugs in the HotSpot VM or in the Java SE library code are rare. This chapter provides suggestions on how to examine a crash. In some cases it is possible to work around a crash until the cause of the bug is diagnosed and fixed.

In general, the first step with any crash is to locate the fatal error log. This is a text file that the HotSpot VM generates in the event of a crash. See Appendix C, Fatal Error Log, for an explanation of how to locate this file, as well as a detailed description of the file.

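4.1 Crash Scenarios

This section presents several examples that show how to analyze the error log to determine the cause of a crash.

4.1.1 Determining Where the Crash Occurred

The error log header indicates the problematic frame. See C.3 Header Format.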

If the top frame type is a native frame and not one of the operating system native frames, then this indicates that the problem is likely in that native library and not in the Java virtual machine. The first step to solving this crash is to investigate the source of the native library where the crash occurred. There are three options, depending on the source of the native library.

If the native library is provided by your application, then investigate the source code of your native library. The option -Xcheck:jni can help find many native bugs. See B.2.1 -Xcheck:jni Option.
If the native library has been provided by another vendor and is used by your application, then file a bug report against this third-party application and provide the fatal error log information.
Determine if the native library is part of the Java runtime environment (JRE) by looking in the jre/lib or jre/bin directories in the JRE distribution. If so, file a bug report, and ensure that this library name is prominently indicated so that the bug report can be routed to the appropriate developers.

If the top frame indicated in the error log is another type of frame, file a bug report and include the fatal error log as well as any information on how to reproduce the problem.
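The triage above can be sketched in a few lines of Java. This is a minimal illustration only: it assumes the header layout shown in the log extracts in this article, where the line after "# Problematic frame:" starts with the frame-type letter (J, j, V, v, or C). The class and method names are hypothetical, and the sketch cannot tell an application library from an operating system library, so that judgment remains manual.

```java
import java.util.Optional;

/** Sketch: read the "Problematic frame" entry from a fatal error log header. */
final class ProblematicFrameSketch {

    /** Returns the line that follows "# Problematic frame:", without the leading "#". */
    static Optional<String> problematicFrame(String log) {
        String[] lines = log.split("\\R");
        for (int i = 0; i < lines.length - 1; i++) {
            if (lines[i].trim().equals("# Problematic frame:")) {
                String next = lines[i + 1].trim();
                if (next.startsWith("#")) {
                    next = next.substring(1).trim();
                }
                return next.isEmpty() ? Optional.empty() : Optional.of(next);
            }
        }
        return Optional.empty();
    }

    /** Maps the frame-type letter to the follow-up suggested in this section. */
    static String nextStep(String frame) {
        if (frame.charAt(0) == 'C') {
            // Native frame: still check by hand that it is not an OS library.
            return "native frame: investigate the native library named in the frame";
        }
        return "other frame type: file a bug report and include the fatal error log";
    }

    public static void main(String[] args) {
        String header = String.join("\n",
            "# An unexpected error has been detected by HotSpot Virtual Machine:",
            "#",
            "#  SIGSEGV (0xb) at pc=0x417789d7, pid=21139, tid=1024",
            "#",
            "# Java VM: Java HotSpot(TM) Server VM (6-beta2-b63 mixed mode)",
            "# Problematic frame:",
            "# C  [libApplication.so+0x9d7]");
        problematicFrame(header)
            .ifPresent(f -> System.out.println(f + " -> " + nextStep(f)));
    }
}
```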

See also the remaining sections in this chapter.

4.1.2 Crash in Native Code

If the fatal error log indicates that the crash was in a native library, there might be a bug in native code or JNI library code. The crash could of course be caused by something else, but analysis of the library and any core file or crash dump is a good starting place. For example, consider the following extract from the header of a fatal error log:

# An unexpected error has been detected by HotSpot Virtual Machine:
#
#  SIGSEGV (0xb) at pc=0x417789d7, pid=21139, tid=1024
#
# Java VM: Java HotSpot(TM) Server VM (6-beta2-b63 mixed mode)
# Problematic frame:
# C  [libApplication.so+0x9d7]

In this case a SIGSEGV occurred with a thread executing in the library libApplication.so.

In some cases a bug in a native library manifests itself as a crash in Java VM code. Consider the following crash, where a JavaThread fails while in the _thread_in_vm state (meaning that it is executing in Java VM code):

# An unexpected error has been detected by HotSpot Virtual Machine:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x08083d77, pid=3700, tid=2896
#
# Java VM: Java HotSpot(TM) Client VM (1.5-internal mixed mode)
# Problematic frame:
# V  [jvm.dll+0x83d77]
---------------  T H R E A D  ---------------
Current thread (0x00036960):  JavaThread "main" [_thread_in_vm, id=2896]
:
Stack: [0x00040000,0x00080000),  sp=0x0007f9f8,  free space=254k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [jvm.dll+0x83d77]
C  [App.dll+0x1047]          <========= C/native frame
j  Test.foo()V+0
j  Test.main([Ljava/lang/String;)V+0
v  ~StubRoutines::call_stub
V  [jvm.dll+0x80f13]
V  [jvm.dll+0xd3842]
V  [jvm.dll+0x80de4]
V  [jvm.dll+0x87cd2]
C  [java.exe+0x14c0]
C  [java.exe+0x64cd]
C  [kernel32.dll+0x214c7]
:

In this case the stack trace shows that a native routine in App.dll has called into the VM (probably with JNI).
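When scanning many logs, the thread state can be extracted from the "Current thread" line mechanically. The following is a rough sketch that assumes the line format shown in the extracts in this article. The class name and the regular expression are illustrative, not part of any JDK tool, and only the two states discussed here are described.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: pull thread type, name, and state out of a "Current thread" line. */
final class CurrentThreadLine {
    // Assumed format, taken from the extracts in this article:
    // Current thread (0x00036960):  JavaThread "main" [_thread_in_vm, id=2896]
    private static final Pattern LINE = Pattern.compile(
        "Current thread \\((0x[0-9a-fA-F]+)\\):\\s+(\\w+) \"([^\"]*)\" \\[(\\w+), id=(\\d+)\\]");

    final String type;
    final String name;
    final String state;

    private CurrentThreadLine(String type, String name, String state) {
        this.type = type;
        this.name = name;
        this.state = state;
    }

    static CurrentThreadLine parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("unrecognized line: " + line);
        }
        return new CurrentThreadLine(m.group(2), m.group(3), m.group(4));
    }

    /** Describes only the states that this article explains. */
    String describeState() {
        switch (state) {
            case "_thread_in_vm":
                return "executing in Java VM code";
            case "_thread_in_native":
                return "executing native or JNI code";
            default:
                return "state not covered in this section";
        }
    }

    public static void main(String[] args) {
        CurrentThreadLine t = parse(
            "Current thread (0x00036960):  JavaThread \"main\" [_thread_in_vm, id=2896]");
        System.out.println(t.type + " " + t.name + ": " + t.describeState());
    }
}
```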

If you get a crash in a native application library (as in the above examples), then you might be able to attach the native debugger to the core file or crash dump, if it is available. Depending on the operating system, the native debugger is dbx, gdb, or windbg.

Another approach is to run with the -Xcheck:jni option added to the command line (see B.2.1 -Xcheck:jni Option). This option is not guaranteed to find all issues with JNI code, but it can help identify a significant number of issues.

If the native library where the crash occurred is part of the Java runtime environment (for example awt.dll, net.dll, and so forth), then it is possible that you have encountered a library or API bug. If after further analysis you conclude this is a library or API bug, then gather as much data as possible and submit a bug report or support call. See Chapter 7, Submitting Bug Reports.
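As an illustration of adding the option to a command line, the sketch below builds a launch command that places -Xcheck:jni before the class path and main class. The jar name "app.jar" and the class "com.example.Main" are hypothetical placeholders, as is the helper class itself. The call that would actually start the child JVM is left commented out.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: assemble a java command line that enables -Xcheck:jni. */
final class XcheckJniLauncher {

    static List<String> command(String javaExecutable, String classPath, String mainClass) {
        List<String> cmd = new ArrayList<>();
        cmd.add(javaExecutable);
        cmd.add("-Xcheck:jni"); // VM options must come before the main class
        cmd.add("-cp");
        cmd.add(classPath);
        cmd.add(mainClass);
        return cmd;
    }

    public static void main(String[] args) {
        // "app.jar" and "com.example.Main" are hypothetical placeholders.
        List<String> cmd = command("java", "app.jar", "com.example.Main");
        System.out.println(String.join(" ", cmd));
        // To start the child JVM with the same console, one could use:
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```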

4.1.3 Crash due to Stack Overflow

A stack overflow in Java language code will normally result in the offending thread throwing java.lang.StackOverflowError. C and C++ code, on the other hand, can write past the end of the stack and provoke a stack overflow. This is a fatal error which causes the process to terminate.
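The Java side of this contrast is easy to observe. The following minimal sketch (class and method names are illustrative) recurses without a base case, catches the resulting StackOverflowError, and shows that the process keeps running afterwards. The depth reached depends on the platform and on the thread stack size, so no particular number should be expected.

```java
/** Sketch: a Java stack overflow surfaces as a catchable StackOverflowError. */
final class StackOverflowDemo {
    private static int depth;

    // Unbounded recursion: each call consumes another Java stack frame.
    private static void recurse() {
        depth++;
        recurse();
    }

    /** Recurses until the JVM throws StackOverflowError, then returns the depth reached. */
    static int depthAtOverflow() {
        depth = 0;
        try {
            recurse();
        } catch (StackOverflowError e) {
            // The error unwinds the stack; the thread and the process continue.
        }
        return depth;
    }

    public static void main(String[] args) {
        System.out.println("Recovered after " + depthAtOverflow() + " nested calls");
    }
}
```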

In the HotSpot implementation, Java methods share stack frames with C/C++ native code, namely user native code and the virtual machine itself. Java methods generate code that checks that stack space is available a fixed distance towards the end of the stack so that the native code can be called without exceeding the stack space. This distance towards the end of the stack is called “Shadow Pages.” The size of the shadow pages is between 3 and 20 pages, depending on the platform. This distance is tunable, so that applications with native code needing more than the default distance can increase the shadow page size. The option to increase shadow pages is -XX:StackShadowPages=n, where n is greater than the default stack shadow pages for the platform.

If your application gets a segmentation fault without a core file or fatal error log file (see Appendix C, Fatal Error Log), or a STACK_OVERFLOW_ERROR on Windows, or the message “An irrecoverable stack overflow has occurred,” this indicates that the value of StackShadowPages was exceeded and more space is needed.

If you increase the value of StackShadowPages, you might also need to increase the default thread stack size using the -Xss parameter. Increasing the default thread stack size might decrease the number of threads that can be created, so be careful in choosing a value for the thread stack size. The thread stack size varies by platform from 256k to 1024k.
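A quick calculation shows why the two settings interact. The sketch below estimates how much of a thread stack the shadow zone covers, using the ranges quoted above (3 to 20 pages, 256k to 1024k stacks). The 4k page size is an assumption and varies by platform, and the estimate ignores any other regions the VM reserves on the stack. The class name is illustrative.

```java
/** Sketch: rough size of the shadow zone relative to the thread stack. */
final class ShadowZoneMath {

    static long shadowZoneBytes(int shadowPages, int pageSizeBytes) {
        return (long) shadowPages * pageSizeBytes;
    }

    /** Stack left over once the shadow zone is set aside (rough estimate only). */
    static long remainingBytes(long threadStackBytes, int shadowPages, int pageSizeBytes) {
        return threadStackBytes - shadowZoneBytes(shadowPages, pageSizeBytes);
    }

    public static void main(String[] args) {
        int pageSize = 4 * 1024;       // assumption: 4k pages
        long stack = 256L * 1024;      // smallest default stack size quoted above
        for (int pages : new int[] {3, 20}) {
            System.out.println(pages + " shadow pages: "
                + shadowZoneBytes(pages, pageSize) / 1024 + "k shadow zone, "
                + remainingBytes(stack, pages, pageSize) / 1024 + "k remaining");
        }
    }
}
```

With 4k pages, 20 shadow pages cover 80k of a 256k stack, which is why raising StackShadowPages on a small stack can call for a larger -Xss value as well.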

The following is a fragment from a fatal error log, on a Windows system, where a thread has provoked a stack overflow in native code.

# An unexpected error has been detected by HotSpot Virtual Machine:
#
#  EXCEPTION_STACK_OVERFLOW (0xc00000fd) at pc=0x10001011, pid=296, tid=2940
#
# Java VM: Java HotSpot(TM) Client VM (1.6-internal mixed mode, sharing)
# Problematic frame:
# C  [App.dll+0x1011]
#
---------------  T H R E A D  ---------------
Current thread (0x000367c0):  JavaThread "main" [_thread_in_native, id=2940]
:
Stack: [0x00040000,0x00080000),  sp=0x00041000,  free space=4k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [App.dll+0x1011]
C  [App.dll+0x1020]
C  [App.dll+0x1020]
:
C  [App.dll+0x1020]
C  [App.dll+0x1020]
...<more frames>...
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  Test.foo()V+0
j  Test.main([Ljava/lang/String;)V+0
v  ~StubRoutines::call_stub

Note the following information in the above output:

The exception is EXCEPTION_STACK_OVERFLOW.
The thread state is _thread_in_native, which means that the thread is executing native or JNI code.
In the stack information the free space is only 4k (a single page on a Windows system). In addition, the stack pointer (sp) is at 0x00041000, which is close to the end of the stack (0x00040000).
The printout of the native frames shows that a recursive native function is the issue in this case.
The output notation ...<more frames>... indicates that additional frames exist but were not printed. The output is limited to 100 frames.
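The free-space figure in the Stack line can be checked by hand: it is the distance between the stack pointer and the low end of the stack range. The sketch below reproduces that arithmetic for the line format shown above. The class name and regular expression are illustrative assumptions, not part of any JDK tool.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: recompute stack size and free space from a "Stack:" log line. */
final class StackLineMath {
    // Assumed format: Stack: [0x00040000,0x00080000),  sp=0x00041000,  free space=4k
    private static final Pattern STACK = Pattern.compile(
        "Stack: \\[0x([0-9a-fA-F]+),0x([0-9a-fA-F]+)\\),\\s+sp=0x([0-9a-fA-F]+)");

    /** Returns {low bound, high bound, stack pointer} as numeric addresses. */
    static long[] parse(String line) {
        Matcher m = STACK.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("unrecognized line: " + line);
        }
        return new long[] {
            Long.parseUnsignedLong(m.group(1), 16),
            Long.parseUnsignedLong(m.group(2), 16),
            Long.parseUnsignedLong(m.group(3), 16)
        };
    }

    /** The stack grows toward the low bound, so free space is sp minus the low bound. */
    static long freeBytes(String line) {
        long[] v = parse(line);
        return v[2] - v[0];
    }

    static long totalBytes(String line) {
        long[] v = parse(line);
        return v[1] - v[0];
    }

    public static void main(String[] args) {
        String line = "Stack: [0x00040000,0x00080000),  sp=0x00041000,  free space=4k";
        System.out.println("total=" + totalBytes(line) / 1024 + "k, free="
            + freeBytes(line) / 1024 + "k");
    }
}
```

For the line above, 0x00041000 minus 0x00040000 is 0x1000, or 4096 bytes, which matches the reported free space of 4k.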

4.1.4 Crash in the HotSpot Compiler Thread

If the fatal error log output shows that the Current thread is a JavaThread named CompilerThread0, CompilerThread1, or AdapterCompiler, then it is possible that you have encountered a compiler bug. In this case it might be necessary to temporarily work around the issue by switching the compiler (for example, by using the HotSpot Client VM instead of the HotSpot Server VM, or vice versa), or by excluding from compilation the method that provoked the crash. This is discussed in 4.2.1 Crash in HotSpot Compiler Thread or Compiled Code.

4.1.5 Crash in Compiled Code

If the crash occurred in compiled code, then it is possible that you have encountered a compiler bug that has resulted in incorrect code generation. You can recognize a crash in compiled code if the problematic frame is marked with the code J (meaning a compiled Java frame). Below is an example of such a crash:

# An unexpected error has been detected by HotSpot Virtual Machine:
#
#  SIGSEGV (0xb) at pc=0x0000002a99eb0c10, pid=6106, tid=278546
#
# Java VM: Java HotSpot(TM) 64-Bit Server VM (1.6.0-beta-b51 mixed mode)
# Problematic frame:
# J  org.foobar.Scanner.body()V
#
:
Stack: [0x0000002aea560000,0x0000002aea660000),  sp=0x0000002aea65ddf0,  free space=1015k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
J  org.foobar.Scanner.body()V
[error occurred during error reporting, step 120, id 0xb]

Note that a complete thread stack is not available. The output line “error occurred during error reporting” means that a problem arose trying to obtain the stack trace (perhaps stack corruption in this example).

It might be possible to temporarily work around the issue by switching the compiler (for example, by using the HotSpot Client VM instead of the HotSpot Server VM, or vice versa) or by excluding from compilation the method that provoked the crash. In this specific example it might not be possible to switch the compiler, because the log was taken from the 64-bit Server VM and it might not be feasible to switch to the 32-bit Client VM.
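One way to express the exclusion is HotSpot's -XX:CompileCommand=exclude option, which takes the class name with slashes followed by the method name. The sketch below derives that option from a J frame line in the old format shown above. It is an illustration under assumptions: newer JVMs print J frames with extra fields that this parser does not handle, the helper class is hypothetical, and whether the option is accepted should be verified for the JVM version in use.

```java
/** Sketch: turn a compiled-frame line into a CompileCommand exclude option. */
final class CompileExcludeOption {

    /** Accepts lines such as "# J  org.foobar.Scanner.body()V" (format used in this article). */
    static String excludeOption(String frameLine) {
        String s = frameLine.trim();
        if (s.startsWith("#")) {
            s = s.substring(1).trim();
        }
        if (!s.startsWith("J")) {
            throw new IllegalArgumentException("not a compiled Java frame: " + frameLine);
        }
        String signature = s.substring(1).trim();
        int paren = signature.indexOf('(');
        if (paren < 0) {
            throw new IllegalArgumentException("no method signature in: " + frameLine);
        }
        String qualified = signature.substring(0, paren);
        int lastDot = qualified.lastIndexOf('.');
        if (lastDot <= 0) {
            throw new IllegalArgumentException("no class name in: " + frameLine);
        }
        // The option expects package separators written as slashes.
        String className = qualified.substring(0, lastDot).replace('.', '/');
        String methodName = qualified.substring(lastDot + 1);
        return "-XX:CompileCommand=exclude," + className + "." + methodName;
    }

    public static void main(String[] args) {
        System.out.println(excludeOption("# J  org.foobar.Scanner.body()V"));
    }
}
```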

4.1.6 Crash in VMThread

If the fatal error log output shows that the Current thread is the VMThread, then look for the line containing VM_Operation in the THREAD section. The VMThread is a special thread in the HotSpot VM. It performs special tasks in the VM such as garbage collection (GC). If the VM_Operation suggests that the operation is a garbage collection, then it is possible that you have encountered an issue such as heap corruption.

The crash might also be a GC issue, but it could equally be something else (such as a compiler or runtime bug) that leaves object references in the heap in an inconsistent or incorrect state. In this case, collect as much information as possible about the environment and try possible workarounds. If the issue is GC-related, you might be able to temporarily work around the issue by changing the GC configuration. This is discussed in 4.2.2 Crash During Garbage Collection.

4.2 Finding a Workaround

If a crash occurs with a critical application, and the crash appears to be caused by a bug in the HotSpot VM, then it might be desirable to quickly find a temporary workaround. The purpose of this section is to suggest some possible workarounds. If the crash occurs with an application that is deployed with the most recent release of the JDK, then the crash should always be reported to Oracle.

0 0