Time: 2021-07-01 10:21:17
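One instance of a customer's 10.2.0.4 database crashed suddenly, and the customer asked us to help analyze the cause. For a sudden crash like this, the first thing we check is the database alert log. It shows that before the crash the SMON process raised ORA-00600 [15709], and the database then immediately logged the message "Fatal internal error happened while SMON was doing active transaction recovery." In other words, SMON hit an exception while performing active transaction recovery, which ultimately brought the instance down. The log output is as follows: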
Fri Sep 26 10:53:35 2014
Errors in file /oracle/app/oracle/admin/wxyydb/bdump/wxyydb_smon_28997.trc:
ORA-00600: internal error code, arguments: [15709], [29], [1], [], [], [], [], []
ORA-30319: Message 30319 not found; product=RDBMS; facility=ORA
Fri Sep 26 10:53:55 2014
Fatal internal error happened while SMON was doing active transaction recovery.
Fri Sep 26 10:53:55 2014
Errors in file /oracle/app/oracle/admin/wxyydb/bdump/wxyydb_smon_28997.trc:
ORA-00600: internal error code, arguments: [15709], [29], [1], [], [], [], [], []
ORA-30319: Message 30319 not found; product=RDBMS; facility=ORA
SMON: terminating instance due to error 474
Termination issued to instance processes. Waiting for the processes to exit
Fri Sep 26 10:54:05 2014
Instance termination failed to kill one or more processes
Instance terminated by SMON, pid = 28997
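When the alert log is long, this first pass can be scripted. The following is only a minimal Python sketch, not an Oracle tool: the function name `summarize_alert_log` and both regular expressions are assumptions written for illustration, based solely on the message formats visible in the log above. It collects the distinct ORA-00600 argument lists and any "terminating instance" events.

```python
import re

# Patterns are assumptions derived from the alert log excerpt in this article.
ORA600_RE = re.compile(
    r"ORA-00600: internal error code, arguments: ((?:\[[^\]]*\](?:, )?)+)"
)
TERMINATE_RE = re.compile(r"(\w+): terminating instance due to error (\d+)")


def summarize_alert_log(text):
    """Return (distinct ORA-00600 argument tuples, instance termination events)."""
    ora600 = []
    for match in ORA600_RE.finditer(text):
        args = re.findall(r"\[([^\]]*)\]", match.group(1))
        # Drop the empty placeholder arguments such as the trailing [].
        args = tuple(a for a in args if a)
        if args not in ora600:
            ora600.append(args)
    terminations = [
        (m.group(1), int(m.group(2))) for m in TERMINATE_RE.finditer(text)
    ]
    return ora600, terminations


# Sample lines copied from the alert log shown in this article.
SAMPLE = """Fri Sep 26 10:53:35 2014
Errors in file /oracle/app/oracle/admin/wxyydb/bdump/wxyydb_smon_28997.trc:
ORA-00600: internal error code, arguments: [15709], [29], [1], [], [], [], [], []
ORA-30319: Message 30319 not found; product=RDBMS; facility=ORA
Fri Sep 26 10:53:55 2014
Fatal internal error happened while SMON was doing active transaction recovery.
SMON: terminating instance due to error 474
"""

if __name__ == "__main__":
    errors, events = summarize_alert_log(SAMPLE)
    print("ORA-00600 arguments:", errors)
    print("Termination events:", events)
```

The first ORA-00600 argument (here 15709) is the value to search for on Oracle Support, and the process named in the termination event tells you which trace file to open next.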
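Next we analyze the wxyydb_smon_28997.trc file. It shows that the SMON process kept trying to perform parallel transaction recovery. During that recovery it hit the ORA-00600 error, and the exception in the underlying code ultimately brought the database down.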
*** 2014-09-26 10:10:36.236
Parallel Transaction recovery caught error 30319
*** 2014-09-26 10:15:10.643
Parallel Transaction recovery caught exception 30319
*** 2014-09-26 10:15:21.816
Parallel Transaction recovery caught error 30319
*** 2014-09-26 10:19:51.707
Parallel Transaction recovery caught exception 30319
*** 2014-09-26 10:53:35.830
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [15709], [29], [1], [], [], [], [], []
ORA-30319: Message 30319 not found; product=RDBMS; facility=ORA
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst()+64 call ksedst1() 000000000 ? 000000001 ?
ksedmp()+2176 call ksedst() 000000000 ?
C000000000000C9F ?
4000000004057F40 ?
000000000 ? 000000000 ?
000000000 ?
ksfdmp()+48 call ksedmp() 000000003 ?
kgeriv()+336 call ksfdmp() C000000000000695 ?
000000003 ?
40000000095185E0 ?
00000EC33 ? 000000000 ?
000000000 ? 000000000 ?
000000000 ?
kgeasi()+416 call kgeriv() 6000000000031770 ?
6000000000032828 ?
4000000001A504E0 ?
000000002 ?
9FFFFFFFFFFFA138 ?
$cold_kxfpqsrls()+1 call kgeasi() 6000000000031770 ?
168 9FFFFFFFFD3D2290 ?
000003D5D ? 000000002 ?
000000002 ? 0000003E7 ?
000003D5D ?
9FFFFFFFFD3D22A0 ?
kxfpqrsod()+1104 call $cold_kxfpqsrls() C0000004FDF7A838 ?
C0000004FDF74430 ?
000000004 ?
9FFFFFFFFFFFA200 ?
C0000000000011AB ?
4000000003AA1250 ?
00000EDF5 ? 000000001 ?
kxfpdelqrefs()+640 call kxfpqrsod() C0000004FDF74430 ?
000000001 ?
60000000000B6300 ?
C000000000000694 ?
4000000003DD14F0 ?
00000EE2D ?
60000000000C6708 ?
kxfpqsod_qc_sod()+2 call kxfpdelqrefs() 00000003E ? 000000001 ?
016 60000000000B6300 ?
C000000000001028 ?
40000000025DE5A0 ?
4000000001B1A110 ?
60000000000C2D04 ?
60000000000C2E90 ?
kxfpqsod()+816 call kxfpqsod_qc_sod() 000000010 ? 000000001 ?
9FFFFFFFFFFFA260 ?
60000000000B6300 ?
9FFFFFFFFFFFA7F0 ?
C000000000001028 ?
40000000025DF810 ?
00000EE65 ?
ktprdestroy()+208 call kxfpqsod() C0000004FDF7A838 ?
000000001 ?
9FFFFFFFFFFFA810 ?
60000000000B6300 ?
9FFFFFFFFFFFAD90 ?
ktprbeg()+8272 call ktprdestroy() C000000000001026 ?
40000000025615B0 ?
000006E61 ? 000000000 ?
4000000001052E40 ?
000000000 ?
ktmmon()+10096 call ktprbeg() 9FFFFFFFFFFFBE70 ?
9FFFFFFFFFFFADA0 ?
60000000000B6300 ?
40000000028B75A0 ?
00000EF21 ?
9FFFFFFFFFFFADD8 ?
9FFFFFFFFFFFADE0 ?
ktmSmonMain()+64 call ktmmon() 9FFFFFFFFFFFD140 ?
ksbrdp()+2816 call ktmSmonMain() C000000100E1CA60 ?
C000000000000FA5 ?
000007361 ?
4000000003B5AE10 ?
C000000000000205 ?
400000000409DCD0 ?
opirip()+1136 call ksbrdp() 9FFFFFFFFFFFD150 ?
60000000000B6300 ?
9FFFFFFFFFFFDC90 ?
4000000002863EF0 ?
000004861 ?
C000000000000B1D ?
60000000000318F0 ?
$cold_opidrv()+1408 call opirip() 9FFFFFFFFFFFEA70 ?
000000004 ?
9FFFFFFFFFFFF090 ?
9FFFFFFFFFFFDCA0 ?
60000000000B6300 ?
C000000000000DA1 ?
sou2o()+336 call $cold_opidrv() 000000032 ?
9FFFFFFFFFFFF090 ?
60000000000C2C78 ?
$cold_opimai_real() call sou2o() 9FFFFFFFFFFFF0B0 ?
+640 000000032 ? 000000004 ?
9FFFFFFFFFFFF090 ?
main()+368 call $cold_opimai_real() 000000003 ? 000000000 ?
main_opd_entry()+80 call main() 000000003 ?
9FFFFFFFFFFFF598 ?
60000000000B6300 ?
C000000000000004 ?
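Searching Oracle Support for ORA-00600 [15709], we found the document "SMON may fail with ORA-00600 [15709] Errors Crashing the Instance" (Doc ID 736348.1), whose error messages closely match ours. The document lists the call stack at the time of the error: kxfpqsrls <- kxfpqrsod <- kxfpdelqrefs <- kxfpqsod_qc_sod <- kxfpqsod <- ktprdestroy <- ktprbeg <- ktmmon. The stack in the SMON trace essentially matches it. So this problem hit bug 6954722 during recovery. If the patch for that bug is already installed and a similar problem still occurs, it is very likely a different but similar bug, 9233544. Oracle really does have a lot of bugs.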
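Comparing a trace against a documented stack by eye is error-prone, so here is a minimal Python sketch of the comparison. It rests on several assumptions: the helper names `stack_functions` and `contains_signature` are hypothetical, the frame regex is based only on the trace layout shown above, the `$cold_` prefix is simply dropped before comparing, and the signature list uses `ktprbeg` as it is spelled in the trace.

```python
import re

# Stack quoted from the support note, innermost frame first
# (ktprbeg spelled as it appears in the SMON trace).
SIGNATURE = [
    "kxfpqsrls", "kxfpqrsod", "kxfpdelqrefs", "kxfpqsod_qc_sod",
    "kxfpqsod", "ktprdestroy", "ktprbeg", "ktmmon",
]

# A frame line starts with "name()"; an optional "$cold_" prefix is ignored.
FRAME_RE = re.compile(r"^(?:\$cold_)?(\w+)\(\)")


def stack_functions(trace_text):
    """Extract the calling-location function names, in trace order."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(match.group(1))
    return frames


def contains_signature(frames, signature):
    """True if signature occurs as a contiguous run inside frames."""
    n = len(signature)
    return any(frames[i:i + n] == signature for i in range(len(frames) - n + 1))


# Frame lines copied from the trace in this article (argument-only lines trimmed).
SAMPLE = """kgeasi()+416 call kgeriv() 6000000000031770 ?
6000000000032828 ?
$cold_kxfpqsrls()+1 call kgeasi() 6000000000031770 ?
168 9FFFFFFFFD3D2290 ?
kxfpqrsod()+1104 call $cold_kxfpqsrls() C0000004FDF7A838 ?
kxfpdelqrefs()+640 call kxfpqrsod() C0000004FDF74430 ?
kxfpqsod_qc_sod()+2 call kxfpdelqrefs() 00000003E ? 000000001 ?
016 60000000000B6300 ?
kxfpqsod()+816 call kxfpqsod_qc_sod() 000000010 ? 000000001 ?
ktprdestroy()+208 call kxfpqsod() C0000004FDF7A838 ?
ktprbeg()+8272 call ktprdestroy() C000000000001026 ?
ktmmon()+10096 call ktprbeg() 9FFFFFFFFFFFBE70 ?
ktmSmonMain()+64 call ktmmon() 9FFFFFFFFFFFD140 ?
"""

if __name__ == "__main__":
    frames = stack_functions(SAMPLE)
    print(frames)
    print("matches note signature:", contains_signature(frames, SIGNATURE))
```

A stack match only narrows the search to candidate bugs with the same signature; which of them applies still has to be confirmed from the installed patches.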
Bug 6954722 affects versions 9.2.0.8 and 10.2.0.4 and is fixed in 10.2.0.4.2, 10.2.0.5, 11.1.0.7, and 11.2.0.1. The ways to resolve bug 6954722 are:
1.Use the following workaround
Set fast_start_parallel_rollback=false and recovery_parallelism=0
OR
2.Apply one-off <
OR
3.Upgrade to fixed release 10.2.0.5, 11.1.0.7 or 11.2.0.1.
Bug 9233544 affects versions 10.2.0.4, 11.1.0.7, and 11.2.0.1 and is fixed in 11.2.0.3 and 12.1. The ways to resolve bug 9233544 are:
1.Apply patchset 11.2.0.3, in which Bug: 9233544 is fixed.
OR
2.Check if one-off Patch:9233544 is available for your release and platform here.
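We carefully checked the patches on the system and found that patch 6954722 was already installed, which points to bug 9233544 as the cause. The options are to upgrade to 11.2.0.3 or to install the one-off patch 9233544. Upgrading to 11.2.0.3 is too big a change, so we told the customer to consider installing the one-off patch instead.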
Original article: ORA-00600: internal error code, arguments: [15709]. Thanks to the original author for sharing.