IT博客汇
  • 首页
  • 精华
  • 技术
  • 设计
  • 资讯
  • 扯淡
  • 权利声明
  • 登录 注册

    某客户的5TB RAC 恢复小记

    admin发表于 2015-03-16 15:52:11
    love 0

    本站文章除注明转载外,均为本站原创: 转载自love wife & love life —Roger 的Oracle技术博客

    本文链接地址: 某客户的5TB RAC 恢复小记

    某客户的核心数据库存储问题导致数据库重启后无法正常启动,根据客户反馈最开始在启动数据库时
    报错控制文件IO错误,如下:

    Sun Mar 15 11:59:37 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_arc1_630876.trc:
    ORA-00204: error in reading (block 1, # blocks 1) of control file
    ORA-00202: control file: '/xxx/xxxx/control01.ctl'
    ORA-17500: ODM err:ODM ERROR V-41-4-2-43-6 No such device or address
    Sun Mar 15 11:59:37 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_arc1_630876.trc:
    ORA-00204: error in reading (block 1, # blocks 1) of control file
    ORA-00202: control file: '/xxx/xxxx/control01.ctl'
    ORA-17500: ODM err:ODM ERROR V-41-4-2-43-6 No such device or address
    Sun Mar 15 11:59:37 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_arc1_630876.trc:
    ORA-00204: error in reading (block 1, # blocks 1) of control file
    ORA-00202: control file: '/xxx/xxxx/control01.ctl'
    ORA-17500: ODM err:ODM ERROR V-41-4-2-43-6 No such device or address
    Sun Mar 15 11:59:37 2015
    Master background archival failure: 204
    Sun Mar 15 11:59:49 2015
    Termination issued to instance processes. Waiting for the processes to exit
    Sun Mar 15 15:40:09 2015
    Starting ORACLE instance (normal)

    上述的问题本质上都跟控制文件有关系,替换掉损坏的控制文件就行。当替换掉控制文件之后,在open数据库时发现报如下错误:

    Sun Mar 15 16:10:48 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_626734.trc:
    ORA-00600: internal error code, arguments: [kcrfr_update_nab_2], [0x70000038F8C94E0], [2], [], [], [], [], []
    Abort recovery for domain 0
    Sun Mar 15 16:10:49 2015
    Aborting crash recovery due to error 600
    Sun Mar 15 16:10:49 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_626734.trc:
    ORA-00600: internal error code, arguments: [kcrfr_update_nab_2], [0x70000038F8C94E0], [2], [], [], [], [], []
    ORA-600 signalled during: ALTER DATABASE OPEN...
    Sun Mar 15 16:10:49 2015
    Trace dumping is performing id=[cdmp_20150315161049]
    Sun Mar 15 16:12:35 2015
    Shutting down instance: further logons disabled
    Sun Mar 15 16:12:35 2015

    该错误本质上是因为redo的问题,即有redo log损坏。通过在RMAN进行recover,发现报如下类似错误:

    Sun Mar 15 16:47:59 2015
    Beginning crash recovery of 2 threads
     parallel recovery setup failed: using serial mode
    Sun Mar 15 16:47:59 2015
    Started redo scan
    Sun Mar 15 16:47:59 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_299470.trc:
    ORA-00313: open failed for members of log group 5 of thread 2
    ORA-00312: online log 5 thread 2: '/xxx/xxxx/redo05a.log'
    ORA-17503: ksfdopn:4 Failed to open file /xxx/xxxx/redo05a.log
    ORA-17500: ODM err:File does not exist
    Sun Mar 15 16:47:59 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_299470.trc:
    ORA-00313: open failed for members of log group 4 of thread 1
    ORA-00312: online log 4 thread 1: '/xxx/xxxx/redo04a.log'
    ORA-17503: ksfdopn:4 Failed to open file /xxx/xxxx/redo04a.log
    ORA-17500: ODM err:File does not exist
    Sun Mar 15 17:03:03 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_299470.trc:
    ORA-00354: corrupt redo log block header
    ORA-00353: log corruption near block 2009344 change 14160745159583 time 03/15/2015 11:56:29
    ORA-00334: archived log: '/xxx/xxxx/redo04b.log'
    Sun Mar 15 17:03:03 2015
    Abort recovery for domain 0
    Sun Mar 15 17:03:03 2015
    Aborting crash recovery due to error 354
    Sun Mar 15 17:03:03 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_299470.trc:
    ORA-00354: corrupt redo log block header
    ORA-00353: log corruption near block 2009344 change 14160745159583 time 03/15/2015 11:56:29
    ORA-00312: online log 4 thread 1: '/xxx/xxxx/redo04b.log'
    ORA-354 signalled during: ALTER DATABASE OPEN...
    Sun Mar 15 17:08:02 2015

    上述过程大致是客户之前的处理过程。我在18点左右介入之后,进行了相关的操作。我最开始尝试在利用RMAN 进行恢复,发现报错:

    RMAN> recover database;
    
    Starting recover at 15-MAR-15
    using target database control file instead of recovery catalog
    allocated channel: ORA_DISK_1
    channel ORA_DISK_1: sid=3268 instance=xxxx2 devtype=DISK
    
    starting media recovery
    
    media recovery failed
    RMAN-00571: ===========================================================
    RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
    RMAN-00571: ===========================================================
    RMAN-03002: failure of recover command at 03/15/2015 18:39:26
    ORA-00283: recovery session canceled due to errors
    RMAN-11003: failure during parse/execution of SQL statement: alter database recover if needed
     start
    ORA-00283: recovery session canceled due to errors
    ORA-00354: corrupt redo log block header
    ORA-00353: log corruption near block 1788672 change 14160744248478 time 03/15/2015 11:54:46
    ORA-00312: online log 4 thread 1: '/xxx/xxxx/redo04a.log'
    

    从上面的错误来看,初步可以判断redo04a.log文件已经损坏,而且是block 1788672的问题。为了验证该block是否损坏,我通过类似如下的dump 命令进行dump,发现报错:

    alter system dump logfile 'xxx' scn min xxxx scn max xxxx;

    由此判断,该block损坏无疑。 由于客户的需求是尽可能快的将数据库拉起来,因此对应redo损坏的情况之下。
    通常只能进程不完全恢复并强制打开,这里我使用了如下的参数:

    *._allow_resetlogs_corruption=TRUE
    *._allow_error_simulation=TRUE

    在open resetlogs之前,我已经将redo备份,resetlogs打开时,发现数据库报错如下:

    Sun Mar 15 19:43:36 2015
    MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
    Sun Mar 15 19:43:36 2015
    SMON: enabling cache recovery
    Sun Mar 15 19:43:37 2015
    Instance recovery: looking for dead threads
    Instance recovery: lock domain invalid but no dead threads
    Sun Mar 15 19:43:37 2015
    ORA-01555 caused by SQL statement below (SQL ID: 5wc2915k44m38, Query Duration=0 sec, SCN: 0x0ce1.0e2d8971):
    Sun Mar 15 19:43:37 2015
    select user#,type# from user$ where name=:1
    Sun Mar 15 19:43:37 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_667814.trc:
    ORA-00704: bootstrap process failure
    ORA-00604: error occurred at recursive SQL level 1
    ORA-01555: snapshot too old: rollback segment number 25 with name "_SYSSMU25$" too small
    Error 704 happened during db open, shutting down database
    USER: terminating instance due to error 704
    Instance terminated by USER, pid = 667814
    ORA-1092 signalled during: alter database open resetlogs...

    从日志来看,大致判断可能是_SYSSMU25$ 回滚段的问题,因此尝试先通过如下隐含参数屏蔽回滚段:

    _corrupted_rollback_segments=_SYSSMU25$
    _offline_rollback_segments=_SYSSMU25$

    屏蔽回滚段之后,尝试打开数据库,发现错误依旧,通过10046 trace跟踪,发现递归SQL在如下的block上执行失败:

    PARSING IN CURSOR #3 len=43 dep=1 uid=0 oct=3 lid=0 tim=37951056727245 hv=1682066536 ad='8cb74a90'
    select user#,type# from user$ where name=:1
    END OF STMT
    PARSE #3:c=0,e=372,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=37951056727243
    BINDS #3:
    kkscoacd
     Bind#0
     oacdty=01 mxl=32(03) mxlc=00 mal=00 scl=00 pre=00
     oacflg=18 fl2=0001 frm=01 csi=852 siz=32 off=0
     kxsbbbfp=1126d4b70  bln=32  avl=03  flg=05
     value="XDB"
    EXEC #3:c=0,e=465,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=37951056727780
    WAIT #3: nam='db file sequential read' ela= 1582 file#=1 block#=282 blocks=1 obj#=44 tim=37951056729421
    WAIT #3: nam='db file sequential read' ela= 6642 file#=1 block#=91 blocks=1 obj#=22 tim=37951056736126

    通过dump file 1 block 91,发现该block上第2个ITL确认存在一个活跃事务。原本计划直接bbed提交该事务,但是当我编译好bbed之后,查看发现该block为一个cluster block.

    对于cluster block的事务修改,相对复杂一些,我的博客有文章描述,大家可以参考,这里不多说。考虑到生产库使用bbed有一定的风险,我并没有使用bbed。

    接着使用undo_management参数启动数据库,然后强制open数据库,发现错误变成如下:

    Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_778430.trc:
    ORA-00600: internal error code, arguments: [kclchkblk_4], [3297], [238525189], [3297], [238091117], [], [], []
    Sun Mar 15 20:50:52 2015
    Instance recovery: looking for dead threads
    Instance recovery: lock domain invalid but no dead threads
    Sun Mar 15 20:50:53 2015
    Redo thread 1 internally disabled at seq 1 (CKPT)
    Sun Mar 15 20:50:53 2015
    ARC1: Archiving disabled thread 1 sequence 1
    Sun Mar 15 20:50:54 2015
    Trace dumping is performing id=[cdmp_20150315205054]
    Sun Mar 15 20:50:54 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_778430.trc:
    ORA-00600: internal error code, arguments: [kclchkblk_4], [3297], [238525189], [3297], [238091117], [], [], []
    Sun Mar 15 20:50:54 2015
    Error 600 happened during db open, shutting down database
    USER: terminating instance due to error 600
    Instance terminated by USER, pid = 778430
    ORA-1092 signalled during: alter database open  resetlogs...

    从错误来看,我们就可以知道,这应该是SCN的问题。如果要手工推进SCN,那么level应该待遇3297*4才行,由于这里的238091117/1024/1024/1024小于1,因此推进scn时,level=3297*4+2 就差不多了。 这里我再次进行了10046 trace,发现了如下信息:

    WAIT #5: nam='db file sequential read' ela= 1021 file#=1 block#=400 blocks=1 obj#=0 tim=37953716489772
    EXEC #5:c=0,e=2969,p=1,cr=2,cu=3,mis=1,r=1,dep=1,og=4,tim=37953716490098
    STAT #5 id=1 cnt=1 pid=0 pos=1 obj=0 op='UPDATE  UNDO$ (cr=2 pr=1 pw=0 time=1542 us)'
    STAT #5 id=2 cnt=1 pid=1 pos=1 obj=34 op='INDEX UNIQUE SCAN I_UNDO1 (cr=2 pr=0 pw=0 time=11 us)'
    WAIT #1: nam='row cache lock' ela= 71 cache id=3 mode=0 request=5 obj#=0 tim=37953716490369
    WAIT #1: nam='db file sequential read' ela= 5783 file#=2 block#=25 blocks=1 obj#=0 tim=37953716496201
    ........
    ........
    GLOBAL CACHE ELEMENT DUMP (address: 700000011f91498):
     id1: 0x19 id2: 0x20000 obj: US#2 block: (2/25)
     lock: SL rls: 0x0000 acq: 0x0000 latch: 3
     flags: 0x41 fair: 0 recovery: 0 fpin: 'ktuwh02: ktugus'
     bscn: 0x0.0 bctx: 0 write: 0 scan: 0x0 xflg: 0 xid: 0x0.0.0
     lcp: 0 lnk: [NULL] lch: [70000023bf4bc20,70000023bf4bc20]
     seq: 3 hist: 143:0 208 352
     LIST OF BUFFERS LINKED TO THIS GLOBAL CACHE ELEMENT:
     flg: 0x00080000 state: READING mode: EXCL
     pin: 'ktuwh02: ktugus'
     addr: 70000023bf4bb10 obj: INVALID cls: UNDO HEAD bscn: 0xce1.e379b05  ---这里的bscn即scn值
     GCS SHADOW 700000011f91508,1 sq[70000035fb339f8,70000035fb339f8] resp[70000035fb339d0,0x19.20000] pkey 4294950914
     grant 1 cvt 0 mdrole 0x21 st 0x40 GRANTQ rl LOCAL
     master 1 owner 1 sid 0 remote[0,0] hist 0x106
     history 0x6.0x4.0x0.0x0.0x0.0x0. cflag 0x0 sender 0 flags 0x0 replay# 0
     disk: 0x0000.00000000 write request: 0x0000.00000000
     pi scn: 0x0000.00000000
    msgseq 0x0 updseq 0x0 reqids[1,0,0] infop 0x0
     GCS RESOURCE 70000035fb339d0 hashq [70000038cbc6658,70000038cbc6658] name[0x19.20000] pkey 4294950914
     grant 700000011f91508 cvt 0 send 0,0 write 0,0@65535
     flag 0x0 mdrole 0x1 mode 1 scan 0 role LOCAL
     disk: 0x0000.00000000 write: 0x0000.00000000 cnt 0x0 hist 0x0
     xid 0x0000.000.00000000 sid 0 pkwait 59s
     pkey 4294950914
     hv 0 [stat 0x0, 1->1, wm 32767, RMno 0, remxxx 0, dom 0]
     kjga st 0x4, step 0.0.0, cxxx 2, rmno 0, flags 0x0
     lb 0, hb 0, myb 6147, drmb 6147, apifrz 0
     GCS SHADOW 700000011f91508,1 sq[70000035fb339f8,70000035fb339f8] resp[70000035fb339d0,0x19.20000] pkey 4294950914
     grant 1 cvt 0 mdrole 0x21 st 0x40 GRANTQ rl LOCAL
     master 1 owner 1 sid 0 remote[0,0] hist 0x106
     history 0x6.0x4.0x0.0x0.0x0.0x0. cflag 0x0 sender 0 flags 0x0 replay# 0
     disk: 0x0000.00000000 write request: 0x0000.00000000
     pi scn: 0x0000.00000000
    msgseq 0x0 updseq 0x0 reqids[1,0,0] infop 0x0
    kjbmbassert [0x19.20000]
    *** 2015-03-15 20:54:54.385
    ksedmp: internal or fatal error
    ORA-00600: internal error code, arguments: [kclchkblk_4], [3297], [238525189], [3297], [238091117], [], [], []
    Current SQL statement for this session:
    alter database open resetlogs

    bscn: 0xce1.e379b05 将该scn进行转换,我们可以发现:0xce1 为3297,e379b05为238525189. 与上述报错信息一致。同时我发现这里使用了第2号回滚段,如下:

    id1: 0x19 id2: 0x20000 obj: US#2 block: (2/25)

    因此,尝试继续使用隐含参数屏蔽这第2号回滚段,并尝试打开数据库,但是错误依旧。看来还是需要调整SCN才行,如下:

    Sun Mar 15 21:23:20 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_774222.trc:
    ORA-00600: internal error code, arguments: [kclchkblk_4], [3297], [238958669], [3297], [238091118], [], [], []
    Sun Mar 15 21:23:20 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_774222.trc:
    ORA-00339: archived log does not contain any redo
    ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
    ORA-00600: internal error code, arguments: [kclchkblk_4], [3297], [238958669], [3297], [238091118], [], [], []
    Sun Mar 15 21:23:21 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_774222.trc:
    ORA-00600: internal error code, arguments: [kclchkblk_4], [3297], [238958669], [3297], [238091118], [], [], []
    Sun Mar 15 21:23:21 2015
    Error 600 happened during db open, shutting down database
    USER: terminating instance due to error 600
    Instance terminated by USER, pid = 774222

    首先我尝试了在会话级别设置:

    alter session set events '10015 trace name adjust_scn level 13190';

    发现alter database open失败,尝试使用*._minimum_giga_scn参数,但是在启动的时候,提示说该参数不支持。从此判断,该环境可能是安装了比较新的PSU,Oracle将该参数废弃掉了,这么说前面的10015 event根本就没起作用。 无奈只能通过oradebug手工修改SCN来启动数据库了,如下:

    SQL> startup mount pfile='/tmp/gb.ora';
    ORACLE instance started.
    
    Total System Global Area 1.5032E+10 bytes
    Fixed Size                  2110096 bytes
    Variable Size            5704256880 bytes
    Database Buffers         9311354880 bytes
    Redo Buffers               14663680 bytes
    Database mounted.
    SQL> oradebug setmypid
    Statement processed.
    SQL> oradebug DUMPvar SGA kcsgscn_
    kcslf kcsgscn_ [7000000100122A8, 7000000100122D8) = 00000000 00000005 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 07000000 ...
    SQL>
    SQL> oradebug poke 0x7000000100122A8 4 3300
    BEFORE: [7000000100122A8, 7000000100122AC) = 00000000
    AFTER:  [7000000100122A8, 7000000100122AC) = 00000CE4
    SQL> oradebug DUMPvar SGA kcsgscn_
    kcslf kcsgscn_ [7000000100122A8, 7000000100122D8) = 00000CE4 00000005 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 07000000 ...
    SQL>
    

    修改SCN之后,顺利打开了数据库,但是数据库很快就crash掉,如下是日志信息:

    Sun Mar 15 21:47:31 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
    ORA-00339: archived log does not contain any redo
    ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
    ......
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
    ORA-00339: archived log does not contain any redo
    ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
    Sun Mar 15 21:47:33 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
    ORA-00600: internal error code, arguments: [6006], [1], [], [], [], [], [], []
    QMNC started with pid=32, OS id=520520
    Sun Mar 15 21:47:35 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
    ORA-00339: archived log does not contain any redo
    ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
    ORA-00600: internal error code, arguments: [6006], [1], [], [], [], [], [], []
    Sun Mar 15 21:47:35 2015
    ORACLE Instance xxxx2 (pid = 22) - Error 600 encountered while recovering transaction (44, 26) on object 47098.
    Sun Mar 15 21:47:35 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
    ORA-00600: internal error code, arguments: [6006], [1], [], [], [], [], [], []
    Sun Mar 15 21:47:36 2015
    LOGSTDBY: Validating controlfile with logical metadata
    Sun Mar 15 21:47:36 2015
    LOGSTDBY: Validation complete
    Sun Mar 15 21:47:36 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
    ORA-00339: archived log does not contain any redo
    ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
    Sun Mar 15 21:47:36 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
    ORA-00339: archived log does not contain any redo
    ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
    Sun Mar 15 21:47:36 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
    ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
    Sun Mar 15 21:47:37 2015
    ORACLE Instance xxxx2 (pid = 22) - Error 600 encountered while recovering transaction (48, 25).
    Sun Mar 15 21:47:37 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
    ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
    Sun Mar 15 21:47:39 2015
    Completed: alter database open
    Sun Mar 15 21:47:39 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
    ORA-00339: archived log does not contain any redo
    ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
    Sun Mar 15 21:47:39 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
    ORA-00339: archived log does not contain any redo
    ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
    Sun Mar 15 21:47:39 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
    ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
    Sun Mar 15 21:47:40 2015
    ORACLE Instance xxxx2 (pid = 22) - Error 600 encountered while recovering transaction (65, 7).
    Sun Mar 15 21:47:40 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
    ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
    Sun Mar 15 21:47:40 2015
    Trace dumping is performing id=[cdmp_20150315214740]
    Sun Mar 15 21:47:41 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j009_790778.trc:
    ORA-12012: error on auto execute of job 524
    ORA-01552: cannot use system rollback segment for non-system tablespace 'xxx_ADMIN'
    ORA-06512: at "SYS.xxx_LOGINHISTORY", line 3
    ORA-06512: at line 1
    Sun Mar 15 21:47:41 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j009_790778.trc:
    ORA-12012: error on auto execute of job 524
    ORA-01552: cannot use system rollback segment for non-system tablespace 'xxx_ADMIN'
    ORA-06512: at "SYS.xxx_LOGINHISTORY", line 3
    ORA-06512: at line 1
    Sun Mar 15 21:47:41 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j002_475534.trc:
    ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
    Sun Mar 15 21:47:41 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
    ORA-00339: archived log does not contain any redo
    ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
    Sun Mar 15 21:47:41 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
    ORA-00339: archived log does not contain any redo
    ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
    Sun Mar 15 21:47:41 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
    ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
    Sun Mar 15 21:47:42 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j008_586068.trc:
    ORA-12012: error on auto execute of job 526
    ORA-01552: cannot use system rollback segment for non-system tablespace 'xxx_ADMIN'
    ORA-06512: at "SYS.xxx_SEG_xxx", line 3
    ORA-06512: at line 1
    Sun Mar 15 21:47:42 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j009_790778.trc:
    ORA-00339: archived log does not contain any redo
    ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
    ORA-12012: error on auto execute of job 524
    ORA-01552: cannot use system rollback segment for non-system tablespace 'xxx_ADMIN'
    ORA-06512: at "SYS.xxx_LOGINHISTORY", line 3
    ORA-06512: at line 1
    Sun Mar 15 21:47:42 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j002_475534.trc:
    ORA-00339: archived log does not contain any redo
    ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
    ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
    Sun Mar 15 21:47:43 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j009_790778.trc:
    ORA-00339: archived log does not contain any redo
    ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
    ORA-12012: error on auto execute of job 524
    ORA-01552: cannot use system rollback segment for non-system tablespace 'XXXX_ADMIN'
    ORA-06512: at "SYS.XXXX_LOGINHISTORY", line 3
    ORA-06512: at line 1
    Sun Mar 15 21:47:43 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j009_790778.trc:
    ORA-00604: error occurred at recursive SQL level 1
    ORA-08102: index key not found, obj# 239, file 1, block 1674 (2)
    Sun Mar 15 21:47:43 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j002_475534.trc:
    ORA-00339: archived log does not contain any redo
    ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
    ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
    Sun Mar 15 21:47:43 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j002_475534.trc:
    ORA-00600: internal error code, arguments: [], [], [], [], [], [], [], []
    ORA-06512: at "xxxx.PKG_XXXXX", line 126
    ORA-06512: at line 3
    Sun Mar 15 21:47:43 2015
    Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_pmon_565700.trc:
    ORA-00474: SMON process terminated with error
    Sun Mar 15 21:47:43 2015
    PMON: terminating instance due to error 474
    Sun Mar 15 21:47:47 2015
    Dump system state for local instance only
    System State dumped to trace file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_diag_377122.trc
    Sun Mar 15 21:47:48 2015
    Instance terminated by PMON, pid = 565700

    从上述日志信息来看,主要出现了如下几个错误:

    ORA-00600 [6006],ORA-00600 [4137],ORA-00600 [kdsgrp1]

    对于前面2个错误,很明显是Oracle SMON进程在进行利用回滚段进行事务rollback时失败导致,如下:
    ORACLE Instance xxxx2 (pid = 22) – Error 600 encountered while recovering transaction (44, 26) on object 47098.
    ORACLE Instance xxxx2 (pid = 22) – Error 600 encountered while recovering transaction (48, 25).

    因此,不难看出,数据库中还有部分的回滚段存在活跃事务。

    对于ORA-00600 [kdsgrp1]错误,通常是出现在Index上,即Index数据和表的数据不一致,一般来说可以通过重建解决。

    其次,对于后面的ORA-08102: index key not found, obj# 239, file 1, block 1674 (2) 错误,主要是job调用出现,因此
    我们可以暂时屏蔽job的调度。

    对于ORA-08102错误,我的博客几年前也写过相关的文章,本质上也是Index block中的相关键值不存在导致。

    与其如此,最后我感觉将数据库的所有回滚段都屏蔽掉,并重建数据库undo 表空间,如下是获取回滚段的命令:

    strings system01.dbf | grep _SYSSMU | cut -d $ -f 1 | sort -u

    经过整理,发现该库存在大约2600个回滚段,我了个去,先不管这么多,重启实例后,重建undo表空间:

    SQL> conn /as sysdba
    Connected to an idle instance.
    SQL> startup upgrade pfile='/tmp/gb2.ora';
    ORACLE instance started.
    
    Total System Global Area 1.5032E+10 bytes
    Fixed Size                  2110096 bytes
    Variable Size            5704256880 bytes
    Database Buffers         9311354880 bytes
    Redo Buffers               14663680 bytes
    Database mounted.
    Database opened.
    SQL> create undo tablespace undotbs11 datafile '/xxx/xxxx/undotbs11_01.dbf' size 100m;
    
    Tablespace created.
    
    SQL> create undo tablespace undotbs22 datafile '/xxx/xxxx/undotbs22_01.dbf' size 100m;
    
    Tablespace created.
    
    SQL> drop tablespace undotbs1 xxxluding contents and datafiles;
    
    Tablespace dropped.
    
    SQL> drop tablespace undotbs2 xxxluding contents and datafiles;
    
    Tablespace dropped.
    

    最后重启数据库实例,让客户将关键核心的配置表导出,先进行业务恢复,如果需要数据,直接从库中抽取。
    这里要补充一点,该库约为5TB多一点,虽然有备份,但是恢复时间太长,如果有个dataguard是多么的重要啊!

    Related posts:

    1. BUG 10008092 caused instance crash
    2. 非归档恢复遭遇ORA-01190 和 ORA-600 [krhpfh_03-1202]–恢复小记
    3. 一次TB级ERP(ASM RAC)库的恢复
    4. 1.4TB ASM(RAC) 磁盘损坏恢复小记
    5. 朋友的一个问题:9i的库open之后大量ora-00600错误


沪ICP备19023445号-2号
友情链接