IT博客汇
  • 首页
  • 精华
  • 技术
  • 设计
  • 资讯
  • 扯淡
  • 权利声明
  • 登录 注册

    mapid溢出导致Oracle RAC 节点重启

    admin发表于 2015-10-07 06:53:11
    love 0

    本站文章除注明转载外,均为本站原创: 转载自love wife & love life —Roger 的Oracle技术博客

    本文链接地址: mapid溢出导致Oracle RAC 节点重启

    某客户的CRM数据库在2015年9月17号11:38分左出现其中一个节点宕机的情况。其中节点1报错信息如下:

    Thu Sep 17 11:38:10 2015
    Use ADRCI or Support Workbench to package the incident.
    See Note 411.1 at My Oracle Support for error and packaging details.
    NOTE: ASMB terminating
    Errors in file /oracle/app/oracle/diag/rdbms/crm2db/crm2db1/trace/crm2db1_asmb_11141424.trc:
    ORA-15064: communication failure with ASM instance
    ORA-00600: internal error code, arguments: [ORA_NPI_ERROR], [600], [ORA-00600: internal error code, arguments: [kffilCreate01], [crm2db1:crm2db], [], [], [], [], [], [], [], [], [], []
    ], [], [], [], [], [], [], [], [], []
    Errors in file /oracle/app/oracle/diag/rdbms/crm2db/crm2db1/trace/crm2db1_asmb_11141424.trc:
    ORA-15064: communication failure with ASM instance
    ORA-00600: internal error code, arguments: [ORA_NPI_ERROR], [600], [ORA-00600: internal error code, arguments: [kffilCreate01], [crm2db1:crm2db], [], [], [], [], [], [], [], [], [], []
    ], [], [], [], [], [], [], [], [], []
    ASMB (ospid: 11141424): terminating the instance due to error 15064
    Thu Sep 17 11:38:10 2015
    System state dump requested by (instance=1, osid=11141424 (ASMB)), summary=[abnormal instance termination].
    System State dumped to trace file /oracle/app/oracle/diag/rdbms/crm2db/crm2db1/trace/crm2db1_diag_12714592.trc
    Thu Sep 17 11:38:10 2015
    opiodr aborting process unknown ospid (22414298) as a result of ORA-1092
    Termination issued to instance processes. Waiting for the processes to exit
    Thu Sep 17 11:38:25 2015
    ORA-1092 : opitsk aborting process
    Thu Sep 17 11:38:26 2015
    Instance termination failed to kill one or more processes
    Instance terminated by ASMB, pid = 11141424
    Thu Sep 17 11:39:43 2015
    Starting ORACLE instance (normal)
    

    从节点1数据库db alert log日志来看,在11:38:10出现ASM故障,最后在11:38:26时间,节点1的ASM实例被Oracle ASM实例的ASMB进程强行终止,因此导致数据库实例也crash掉。如下是相关日志信息:

    Thu Sep 17 11:38:25 2015 ORA-1092 : opitsk aborting process
    Thu Sep 17 11:38:26 2015
    Instance termination failed to kill one or more processes
    Instance terminated by ASMB, pid = 11141424
    

    而此时在节点2的数据库alert log日志中,没有发现其他信息,仅仅只有节点1实例被重启的信息。因此,不难看出,节点ASM实例被强行终止,是此次问题的关键所在。我们进一步分析节点1 ASM实例的日志,发现如下信息:

    Tue Sep 15 21:51:52 2015
    Time drift detected. Please check VKTM trace file for more details.
    Thu Sep 17 11:38:06 2015
    Errors in file /oracle/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_8716976.trc  (incident=179751):
    ORA-00600: internal error code, arguments: [kffilCreate01], [crm2db1:crm2db], [], [], [], [], [], [], [], [], [], []
    Incident details in: /oracle/app/grid/diag/asm/+asm/+ASM1/incident/incdir_179751/+ASM1_ora_8716976_i179751.trc
    Use ADRCI or Support Workbench to package the incident.
    See Note 411.1 at My Oracle Support for error and packaging details.
    Thu Sep 17 11:38:10 2015
    Dumping diagnostic data in directory=[cdmp_20150917113809], requested by (instance=1, osid=8716976), summary=[incident=179751].
    Thu Sep 17 11:38:24 2015
    Sweep [inc][179751]: completed
    Sweep [inc2][179751]: completed
    Thu Sep 17 11:39:29 2015
    NOTE: ASM client crm2db1:crm2db disconnected unexpectedly.
    NOTE: check client alert log.
    NOTE: Trace records dumped in trace file /oracle/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_8716976.trc
    Thu Sep 17 11:39:29 2015
    System State dumped to trace file /oracle/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_42729726.trc
    Thu Sep 17 11:40:10 2015
    NOTE: client crm2db1:crm2db registered, osid 50266474, mbr 0x1
    

    我们可以看到,在11:38:06时间点,ASM实例抛出ORA-00600 [kffilCreate01] 错误后,接着将实例终止,同时将日志信息写入到+ASM1_ora_8716976_i179751.trc文件中,该trace 文件的内容如下:

    Address: 0xfffffffffff3cc8
    Incident ID: 179751
    Problem Key: ORA 600 [kffilCreate01]
    Error: ORA-600 [kffilCreate01] [crm2db1:crm2db] [] [] [] [] [] [] [] [] [] []
    [00]: dbgexProcessError [diag_dde]
    [01]: dbgeExecuteForError [diag_dde]
    [02]: dbgePostErrorKGE [diag_dde]
    [03]: dbkePostKGE_kgsf [rdbms_dde]
    [04]: kgeadse []
    [05]: kgerinv_internal []
    [06]: kgerinv []
    [07]: kgeasnmierr []
    [08]: kffilCreate [ASM]<-- Signaling
    [09]: kfnsUFG [KFNU]
    [10]: kfnsBackground [KFNU]
    [11]: kfnDispatch [KFN]
    [12]: opiodr []
    [13]: ttcpip []
    [14]: opitsk []
    [15]: opiino []
    [16]: opiodr []
    [17]: opidrv []
    [18]: sou2o []
    [19]: opimai_real []
    [20]: ssthrdmain []
    [21]: main []
    

    我们可以看出,Oracle 在调用kffcicreate函数时出现了异常,进而导致asm实例被强行终止。经过分析我们发现该问题跟Oracle Bug 16357438  (ORA-600 [KFFILCREATE01] IN ASM TRIGGERS DB CRASH WHEN MAPID REACHES MAX VALUE )比较符合。如下是Oracle metalink文档针对该bug的call stack函数描述:

    [00]: dbgexProcessError [diag_dde]
    [01]: dbgeExecuteForError [diag_dde]
    [02]: dbgePostErrorKGE [diag_dde]
    [03]: dbkePostKGE_kgsf [rdbms_dde]
    [04]: kgeadse []
    [05]: kgerinv_internal []
    [06]: kgerinv []
    [07]: kgeasnmierr []
    [08]: kffilCreate [ASM]<-- Signaling
    [09]: kfnsUFG [KFNU]
    [10]: kfnsBackground [KFNU]
    [11]: kfnDispatch [KFN]
    [12]: opiodr []
    [13]: ttcpip []
    [14]: opitsk []
    [15]: opiino []
    [16]: opiodr []
    [17]: opidrv []
    [18]: sou2o []
    [19]: opimai_real []
    [20]: ssthrdmain []
    [21]: main []
    [22]: __start []
    

    对于Oracle数据库bug的判断,通常的做法是比较报错日志文件中的call stack函数调用是否与Oracle bug的call stack函数描述一致,一般来讲,相似度超多90%,则意味是符合bug的可能性就极大。基于这样的分析,我对比发现call stack与Oracle bug 的call stack描述完全一致。

    根据Oracle 文档关于该bug的描述,大致意思是指asm实例的mapid溢出,导出asm实例无法正常运作,进而被asmb进程强行终止;当然,强行终止的目的是为了保持集群节点数据的完整性,如下是从trace文件中提取的mapid信息:

    Roger$ cat ASM1_ora_8716976_i179751.trc |grep 'mapid:'
                (kffil) netnm: crm2db1:crm2db, mapid: 4293682398
                (kffil) netnm: crm2db1:crm2db, mapid: 4249534089
                (kffil) netnm: crm2db1:crm2db, mapid: 4185128984
                (kffil) netnm: crm2db1:crm2db, mapid: 4185125967
                (kffil) netnm: crm2db1:crm2db, mapid: 4185121734
                (kffil) netnm: crm2db1:crm2db, mapid: 4152108652
                (kffil) netnm: crm2db1:crm2db, mapid: 4139887931
                 .......
                (kffil) netnm: crm2db1:crm2db, mapid: 520724217
                (kffil) netnm: crm2db1:crm2db, mapid: 482740313
                (kffil) netnm: crm2db1:crm2db, mapid: 394435399
                (kffil) netnm: crm2db1:crm2db, mapid: 298438600
                (kffil) netnm: crm2db1:crm2db, mapid: 171948102
                (kffil) netnm: crm2db1:crm2db, mapid: 291
                (kffil) netnm: crm2db1:crm2db, mapid: 289
                (kffil) netnm: crm2db1:crm2db, mapid: 278
                (kffil) netnm: crm2db1:crm2db, mapid: 275
                 .......
                (kffil) netnm: crm2db1:crm2db, mapid: 42
                (kffil) netnm: crm2db1:crm2db, mapid: 38
                (kffil) netnm: crm2db1:crm2db, mapid: 3
                (kffil) netnm: <null>, mapid: <null>
                (kffil) netnm: crm2db1:crm2db, mapid: 4294967295
                (kffil) netnm: crm2db1:crm2db, mapid: 4294967294
                (kffil) netnm: crm2db1:crm2db, mapid: 4294967293
    

    我们可以看到,mapid已经达到最大值4294967296,这是Oracle数据库中数据对象的上限,同时我们从trace文件中的KST trace信息中也能看mapid溢出的类似信息:

    2015-09-17 11:38:06.075000 :C23CEF42:KFNS:kfn.c@708:kfnDispatch(): calling server stub for KFNOP=5
    2015-09-17 11:38:06.075011 :C23CEF45:KFNU:kfns.c@1725:kfnsBackground(): kfnsBackground consuming in-progress message typ=13 unread=1 status=0xfffffff1
    2015-09-17 11:38:06.075012 :C23CEF46:KFNU:kfns.c@1963:kfnsConsume(): kfnsConsume message 0x700000124bd6670 status=0xfffffff1 flags=0x0
    2015-09-17 11:38:06.075023 :C23CEF4C:KFNU:kfn.c@5313:kfnmsg_Dump(): kfnsBg (kfnmsg) opcode=14 (KFNMS_MAPFREE) fd 0x700000124bea1e0 mid 4294967291, gn [7.3971105324], fn [267.811087543] exts 1025, start 60, cnt 965
    2015-09-17 11:38:06.075025 :C23CEF4E:KFNU:kfns.c@1784:kfnsBackground(): kfnsBackground got targeted message
    2015-09-17 11:38:06.075029 :C23CEF54:db_trace:kfns.c@5342:kfnsFillMapMsg(): [10491:27:865] MAP_FREE: gn=7 fn=267, mapid=-5
    2015-09-17 11:38:06.075047 :C23CEF5C:KFNU:kfns.c@1940:kfnsBackground(): kfnsBackground completed in 0 seconds (KFNPM=2)
    2015-09-17 11:38:06.075048 :C23CEF5D:KFNS:kfn.c@739:kfnDispatch(): completed KFNOP=5 callcnt=-1386638497
    2015-09-17 11:38:06.075107 :C23CEF7A:KFNS:kfn.c@708:kfnDispatch(): calling server stub for KFNOP=5
    2015-09-17 11:38:06.075109 :C23CEF7C:KFNU:kfns.c@1725:kfnsBackground(): kfnsBackground consuming in-progress message typ=14 unread=1 status=0xfffffff1
    2015-09-17 11:38:06.075110 :C23CEF7E:KFNU:kfns.c@1963:kfnsConsume(): kfnsConsume message 0x700000126f38a58 status=0xfffffff1 flags=0x0
    2015-09-17 11:38:06.075113 :C23CEF83:KFNU:kfn.c@5313:kfnmsg_Dump(): kfnsBg (kfnmsg) opcode=14 (KFNMS_MAPFREE) fd 0x700000124beffe0 mid 4294967292, gn [7.3971105324], fn [267.811087543] exts 1025, start 60, cnt 965
    2015-09-17 11:38:06.075114 :C23CEF84:KFNU:kfns.c@1784:kfnsBackground(): kfnsBackground got targeted message
    2015-09-17 11:38:06.075118 :C23CEF85:db_trace:kfns.c@5342:kfnsFillMapMsg(): [10491:27:865] MAP_FREE: gn=7 fn=267, mapid=-4
    

    我们可以看出,mapid已经出现了负数,这边表明mapid出现了溢出。

    基于前面的种种分析,我们认为此次故障完全符合Oracle bug 16357438的描述细节,因此建议客户在下次维护时间窗口中安装相关Patch。

    Related posts:

    1. ocssd.log:clssgmpcBuildNodeList: nodename for node x is NULL


沪ICP备19023445号-2号
友情链接