IT博客汇
  • 首页
  • 精华
  • 技术
  • 设计
  • 资讯
  • 扯淡
  • 权利声明
  • 登录 注册

    数据库重启之后无法open的恢复案例

    roger发表于 2016-08-04 05:25:05
    love 0

    本站文章除注明转载外,均为本站原创: 转载自love wife & love life —Roger 的Oracle技术博客

    本文链接地址: 数据库重启之后无法open的恢复案例

    这是某网友的维护的一套数据库,据说是正常重启之后就无法启动数据库了。那么我们先来看看日志是什么样的:

    Errors in file /u01/app/oracle/admin/orcl/bdump/orcl1_p012_18165.trc:
    ORA-27090: Message 27090 not found;  product=RDBMS; facility=ORA
    Linux-x86_64 Error: 4: Interrupted system call
    Additional information: 3
    Additional information: 128
    Additional information: 65536
    .....
    Errors in file /u01/app/oracle/admin/orcl/bdump/orcl1_p007_18153.trc:
    ORA-27090: Message 27090 not found;  product=RDBMS; facility=ORA
    Linux-x86_64 Error: 4: Interrupted system call
    Additional information: 3
    Additional information: 128
    Additional information: 65536
    Starting ORACLE instance (normal)
    LICENSE_MAX_SESSION = 0
    .....
    SMON: enabling cache recovery
    Errors in file /u01/app/oracle/admin/orcl/udump/orcl1_ora_8858.trc:
    ORA-00600: internal error code, arguments: [16703], [1403], [20], [], [], [], [], []
    .....
    Errors in file /u01/app/oracle/admin/orcl/udump/orcl1_ora_8858.trc:
    ORA-00704: bootstrap process failure
    ORA-00704: bootstrap process failure
    ORA-00600: internal error code, arguments: [16703], [1403], [20], [], [], [], [], []
    .....
    Error 704 happened during db open, shutting down database
    USER: terminating instance due to error 704

    我们可以看到,节点1在9:48:52秒被强行终止重启了实例。而且我们还可以看出该节点从9:42开始就出现ORA-27090 错误。而该错误通常跟操作系统有关系,通过后面的Linux-x86_64 Error: 4: Interrupted system call 错误也验证了这一点。

    Thread 2 advanced to log sequence 334685 (LGWR switch)
      Current log# 4 seq# 334685 mem# 0: +DATA/orcl/onlinelog/group_4.log
    .....
    Starting ORACLE instance (normal)
    LICENSE_MAX_SESSION = 0
    .....
    Errors in file /u01/app/oracle/admin/orcl/bdump/orcl2_mmon_9401.trc:
    ORA-07445: Message 7445 not found; No message file for product=RDBMS, facility=ORA; arguments: [kgghteFindCB()+188] [SIGSEGV] [Address not mapped to object] [0x00000010B]
    .....
    Errors in file /u01/app/oracle/admin/orcl/udump/orcl2_ora_9475.trc:
    ORA-07445: exception encountered: core dump [kglsget()+490] [SIGSEGV] [Address not mapped to object] [0x000000008] [] []
    .....
    Error 0 in kwqmnpartition(), aborting txn
    .....
    Errors in file /u01/app/oracle/admin/orcl/udump/orcl2_ora_9943.trc:
    ORA-00600: internal error code, arguments: [kggfaAllocFunc1], [], [], [], [], [], [], []
    .....
    ORA-00600: internal error code, arguments: [kggfaAllocFunc1], [], [], [], [], [], [], []
    .....
    ORA-65535 encountered when generating server alert SMG-4131
    .....
    Errors in file /u01/app/oracle/admin/orcl/udump/orcl2_ora_9943.trc:
    ORA-00600: internal error code, arguments: [kggfaAllocFunc1], [], [], [], [], [], [], []
    .....
    Reconfiguration started (old inc 2, new inc 4)
    List of nodes:
     0 1
     Global Resource Directory frozen
     Communication channels reestablished
     * domain 0 valid = 1 according to instance 0
    ......
    Completed redo application
    .....
    Errors in file /u01/app/oracle/admin/orcl/udump/orcl2_ora_10551.trc:
    ORA-00600: internal error code, arguments: [kggfaAllocFunc1], [], [], [], [], [], [], []
    .....
    Errors in file /u01/app/oracle/admin/orcl/bdump/orcl2_smon_9395.trc:
    ORA-00600: Message 600 not found; No message file for product=RDBMS, facility=ORA; arguments: [16659] [kqldtu] [D] [0] [65]
    .....
    Redo Shipping Client Connected as PUBLIC
    -- Connected User is Valid
    ......
    Non-fatal internal error happenned while SMON was doing cursor transient type cleanup.
    SMON encountered 1 out of maximum 100 non-fatal internal errors.
    .....
    Trace dumping is performing id=[cdmp_20160802094934]
    .....
    Errors in file /u01/app/oracle/admin/orcl/bdump/orcl2_smon_9395.trc:
    ORA-00600: Message 600 not found; No message file for product=RDBMS, facility=ORA; arguments: [16659] [kqldtu] [D] [0] [195]
    .....
    Non-fatal internal error happenned while SMON was doing cursor transient type cleanup.
    SMON encountered 2 out of maximum 100 non-fatal internal errors.
    ......
    Errors in file /u01/app/oracle/admin/orcl/bdump/orcl2_pmon_9349.trc:
    ORA-00474: Message 474 not found; No message file for product=RDBMS, facility=ORA
    ......
    PMON: terminating instance due to error 474
    ......
    Errors in file /u01/app/oracle/admin/orcl/bdump/orcl2_lmon_9355.trc:
    ORA-00474: Message 474 not found; No message file for product=RDBMS, facility=ORA
    .....
    System state dump is made for local instance
    System State dumped to trace file /u01/app/oracle/admin/orcl/bdump/orcl2_diag_9351.trc
    ......
    Shutting down instance (abort)
    License high water mark = 72
    ......
    Trace dumping is performing id=[cdmp_20160802095002]
    .....
    Instance terminated by PMON, pid = 9349
    .....
    Instance terminated by USER, pid = 11027
    .....
    Starting ORACLE instance (normal)

    这里我们无论是看节点1还是节点2的alert log日志都会发现,由于smon进程在进程事务恢复时失败之后,导致数据库实例最终宕掉。宕掉之后就再也无法正常启动了。很明显这是强行关库之后带来的蝴蝶效应。
    这里我们来看看其中节点2的这个ORA-00600 [16559]是什么含义?

    ERROR: ORA-600 [16659] [a] [b] [c] [d] [e]
    VERSIONS: versions 8.0 to 8.1
    DESCRIPTION: We are attempting to update a tab$ row and fail to update the dictionary information correctly.
    FUNCTIONALITY: DATA DICTIONARY IMPACT: PROCESS FAILURE POSSIBLE DATA DICTIONARY INCONSISTENCY
    

    从解释来看,这是Oracle 数据字典表tab$出现了不一致的情况。比较郁闷的是,客户的dataguard也坏掉了,也是一样的错误。
    那么看来只能进行恢复了。这里首先要明白,节点1的ora-00600 [16703]本质上来讲跟ora-00600 [16559]是一回事。
    从具体的错误来看,Oracle在open时,进行bootstrap初始化的过程就失败了,因此报错ORA-00704: bootstrap process failure.
    处理思路也很简单,我们首先通过10046 trace跟踪open的过程,来看看Oracle 在bootstrap初始化的时候在进行什么操作时报错的?

    PARSING IN CURSOR #2 len=106 dep=1 uid=0 oct=3 lid=0 tim=1435661277781458 hv=3628073639 ad='cfd74588'
    select rowcnt,blkcnt,empcnt,avgspc,chncnt,avgrln,nvl(degree,1), nvl(instances,1) from tab$ where obj# = :1
    END OF STMT
    PARSE #2:c=999,e=676,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=1435661277781453
    =====================
    .....
    BINDS #2:
    kkscoacd
     Bind#0
      oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
      oacflg=08 fl2=0001 frm=00 csi=00 siz=24 off=0
      kxsbbbfp=2b5f7bae1b00  bln=22  avl=02  flg=05
      value=20
    EXEC #2:c=999,e=1015,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=1435661277798049
    WAIT #2: nam='db file sequential read' ela= 1086 file#=1 block#=50 blocks=1 obj#=-1 tim=1435661277799253
    WAIT #2: nam='db file sequential read' ela= 568 file#=1 block#=51 blocks=1 obj#=-1 tim=1435661277799931
    WAIT #2: nam='db file sequential read' ela= 1804 file#=1 block#=26 blocks=1 obj#=-1 tim=1435661277801863
    FETCH #2:c=0,e=3833,p=3,cr=3,cu=0,mis=0,r=0,dep=1,og=4,tim=1435661277801922
    *** 2016-08-02 13:52:28.469
    ksedmp: internal or fatal error
    ORA-00600: internal error code, arguments: [16703], [1403], [20], [], [], [], [], []
    Current SQL statement for this session:
    alter database open

    从上面的错误不难看出就是在访问tab$ 的时候报错的,而且是访问的obj#=20的这个对象。那么这个对象是什么呢?

    SQL> select owner,object_name,object_type from dba_objects where object_id=20;
    
    OWNER                          OBJECT_NAME                    OBJECT_TYPE
    ------------------------------ ------------------------------ -------------------
    SYS                            ICOL$                          TABLE
    
    SQL> !oerr ora 1403
    01403, 00000, "no data found"
    // *Cause:
    // *Action:

    根据我们的查询以及对ORA-00600 [16703],[1403],[20] 这个错误的理解,那么我这里可以大致判断这个错误后的几个数字的含义:
    16703: 错误代码,表示数据字典基表存在不一致1403: 表示数据没找到或者不匹配,即not data found.20: 表示访问的对象号,即object_id.
    同时我们从前面的10046 trace跟踪来看,报错的SQL语句访问了3个block,然后报错,分别是file 1 block 50,51,26。
    这我们分别dump 上面的3个block发现其中block 51,26 的dump 内容如下:

    ---block 51
    Object id on Block? Y
     seg/obj: 0x3  csc: 0x00.64205  itc: 2  flg: -  typ: 2 - INDEX
         fsl: 0  fnx: 0x0 ver: 0x01
    
     Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
    0x01   0x0001.008.0000000d  0x0080118a.0010.01  CB--    0  scn 0x0000.000105c7
    0x02   0x000a.01e.000005e0  0x008074fc.007a.36  --U-    1  fsc 0x0000.00064208
    
    Leaf block dump
    ===============
    header address 185777756=0xb12be5c
    kdxcolev 0
    KDXCOLEV Flags = - - -
    kdxcolok 0
    kdxcoopc 0x80: opcode=0: iot flags=--- is converted=Y
    kdxconco 1
    kdxcosdc 2
    kdxconro 258
    kdxcofbo 552=0x228
    kdxcofeo 4484=0x1184
    kdxcoavs 3932
    kdxlespl 0
    kdxlende 0
    kdxlenxt 4194360=0x400038
    kdxleprv 0=0x0
    kdxledsz 8
    kdxlebksz 8032
    ...
    row#93[5724] flag: ------, lock: 2, len=14, data:(8):  00 40 38 8e 00 06 02 00
    col 0; len 3; (3):  c2 03 03
    ......
    
    ---block 26
    Block header dump:  0x0040001a
     Object id on Block? Y
     seg/obj: 0x2  csc: 0x05.e0568950  itc: 2  flg: -  typ: 1 - DATA
         fsl: 0  fnx: 0x0 ver: 0x01
    
     Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
    0x01   0x0000.05f.00000002  0x00400012.0001.16  C---    0  scn 0x0000.000000fd
    0x02   0x000b.00d.00048164  0x0100982b.b8bd.4d  --U-    7  fsc 0x039f.e0568c3e
    
    data_block_dump,data header at 0xb12be5c
    ===============
    tsiz: 0x1fa0
    hsiz: 0x140
    pbl: 0x0b12be5c
    bdba: 0x0040001a
         76543210
    flag=--------
    ntab=6
    nrow=141
    frre=-1
    fsbo=0x140
    fseo=0x160
    avsp=0x20
    tosp=0x3db
    0xe:pti[0]      nrow=8  offs=0
    0x12:pti[1]     nrow=7  offs=8
    0x16:pti[2]     nrow=1  offs=15
    0x1a:pti[3]     nrow=10 offs=16
    0x1e:pti[4]     nrow=15 offs=26
    0x22:pti[5]     nrow=100        offs=41
    .....
    col  0: [ 2]  c1 14
    tab 1, row 0, @0x1e65
    tl: 4 fb: -CHDFL-- lb: 0x2  cc: 0 cki: 0
    tab 1, row 1, @0x1ddb
    tl: 4 fb: -CHDFL-- lb: 0x2  cc: 0 cki: 1
    tab 1, row 2, @0x1d51
    tl: 4 fb: -CHDFL-- lb: 0x2  cc: 0 cki: 2
    tab 1, row 3, @0x1ccd
    tl: 4 fb: -CHDFL-- lb: 0x2  cc: 0 cki: 4
    tab 1, row 4, @0x1c45
    tl: 4 fb: -CHDFL-- lb: 0x2  cc: 0 cki: 5
    tab 1, row 5, @0x1bc0
    tl: 4 fb: -CHDFL-- lb: 0x2  cc: 0 cki: 6
    tab 1, row 6, @0x1b35
    tl: 4 fb: -CHDFL-- lb: 0x2  cc: 0 cki: 7
    tab 2, row 0, @0x160
    tl: 48 fb: -CH-FL-- lb: 0x0  cc: 19 cki: 3
    col  0: [ 2]  c1 1e
    col  1: [ 1]  80
    .....
    col 17: *NULL*
    col 18: [ 1]  80
    tab 3, row 0, @0x1ad0
    .....

    看到这里,我就想是否可以通过bbed先把这2个block 给修复了,看看是否能够起来。如下是简单的修复过程:
    对于51号block 由于是Index 修改非常简单,这里不多说。26号block 是cluster table,这个相对复杂的多。首先提交事务、修改lock flag之后verify还是报错,如下:

    BBED> verify
    DBVERIFY - Verification starting
    FILE = /u01/fix/SYSTEM_OLD.dbf
    BLOCK = 26
    
    Block Checking: DBA = 4194330, Block Type = KTB-managed data block
    data header at 0x105d485c
    kdbchk:  key comref count wrong
             keyslot=7
    Block 26 failed with check code 6121
    
    DBVERIFY - Verification complete
    
    Total Blocks Examined         : 1
    Total Blocks Processed (Data) : 1
    Total Blocks Failing   (Data) : 1
    Total Blocks Processed (Index): 0
    Total Blocks Failing   (Index): 0
    Total Blocks Empty            : 0
    Total Blocks Marked Corrupt   : 0
    Total Blocks Influx           : 0

    这里继续修改聚簇对应的kdbr信息(这里以其中一个kdbr为例):

    BBED> p *kdbr[7]
    rowdata[7568]
    -------------
    ub1 rowdata[7568]                           @8012     0xac
    
    BBED> x /rcccccc
    rowdata[7568]                               @8012
    -------------
    flag@8012: 0xac (KDRHFL, KDRHFF, KDRHFH, KDRHFK)
    lock@8013: 0x00
    cols@8014:    1
    kref@8015:   31
    mref@8017:   30
    hrid@8019:0x0040001b.0
    nrid@8025:0x0040001b.0
    
    BBED> modify /x 1f offset 8017
     File: /u01/fix/SYSTEM_OLD.dbf (1)
     Block: 26               Offsets: 8017 to 8020           Dba:0x0040001a
    ------------------------------------------------------------------------
     1f000040 
    
     <32 bytes per line
    
    BBED> verify
    DBVERIFY - Verification starting
    FILE = /u01/fix/SYSTEM_OLD.dbf
    BLOCK = 26
    
    Block Checking: DBA = 4194330, Block Type = KTB-managed data block
    data header at 0x105d485c
    kdbchk: space available on commit is incorrect
            tosp=987 fsc=0 stb=0 avsp=32
    Block 26 failed with check code 6111
    
    DBVERIFY - Verification complete
    
    Total Blocks Examined         : 1
    Total Blocks Processed (Data) : 1
    Total Blocks Failing   (Data) : 1
    Total Blocks Processed (Index): 0
    Total Blocks Failing   (Index): 0
    Total Blocks Empty            : 0
    Total Blocks Marked Corrupt   : 0
    Total Blocks Influx           : 0
    
    BBED> p kdbh
    struct kdbh, 14 bytes                       @92
       ub1 kdbhflag                             @92       0x00 (NONE)
       b1 kdbhntab                              @93       6
       b2 kdbhnrow                              @94       141
       sb2 kdbhfrre                             @96      -1
       sb2 kdbhfsbo                             @98       320
       sb2 kdbhfseo                             @100      352
       b2 kdbhavsp                              @102      32
       b2 kdbhtosp                              @104      987
    
    BBED> d /v offset 104 count 2
     File: /u01/fix/SYSTEM_OLD.dbf (1)
     Block: 26      Offsets:  104 to  105  Dba:0x0040001a
    -------------------------------------------------------
     db03                                l ..
    
     <16 bytes per line>
    
    BBED> modify /x 2000 offset 104
     File: /u01/fix/SYSTEM_OLD.dbf (1)
     Block: 26               Offsets:  104 to  105           Dba:0x0040001a
    ------------------------------------------------------------------------
     2000 
    
     <32 bytes per line>
    
    BBED> sum apply
    Check value for File 1, Block 26:
    current = 0x87ce, required = 0x87ce
    BBED> verify
    DBVERIFY - Verification starting
    FILE = /u01/fix/SYSTEM_OLD.dbf
    BLOCK = 26
    
    DBVERIFY - Verification complete
    
    Total Blocks Examined         : 1
    Total Blocks Processed (Data) : 1
    Total Blocks Failing   (Data) : 0
    Total Blocks Processed (Index): 0
    Total Blocks Failing   (Index): 0
    Total Blocks Empty            : 0
    Total Blocks Marked Corrupt   : 0
    Total Blocks Influx           : 0

    我们经过几处简单修改之后,再次verify校验已经不再报错了;不过再次open数据库时,发现报另外一个错误了:

    Errors in file /u01/app/oracle/admin/orcl/udump/orcl1_ora_18955.trc:
    ORA-00704: bootstrap process failure
    ORA-00704: bootstrap process failure
    ORA-00600: internal error code, arguments: [kdoirp-3], [139], [0], [], [], [], [], []

    从错误来看,bootstrap的初始化过程仍然有问题。通过10046 trace跟踪发现还是那几个block。
    回想前面这个block的dump时,看到的几行操作是delete,如下:tl: 4 fb: -CHDFL– lb: 0×2  cc: 0 cki: 0
    那么我们这里试做将这几个被删除的操作进行还原是否ok 呢? 也就是用bbed来恢复这7个delete操作。
    由于是cluster table 的block,操作相对麻烦一些。不过我尝试修改之后,最后发现错误仍然一样。
    其中[kdoirp-3]是什么含义呢? 我们来看下Oracle 文档的描述:

    Layer 11:  KCOCODRW -  Row
    opcode 1 :  KDOIUR  - interpret undo redo
    opcode 2 :  KDOIRP  - insert row  piece
    opcode 3 :  KDODRP  - drop row piece
    opcode 4 :  KDOLKR  - lock row  piece
    opcode 5 :  KDOURP  - update row piece
    opcode 6 :  KDOORP  - overwrite row piece
    opcode 7 :  KDOMFC  - manipulate first column
    opcode 8 :  KDOCFA  - change forwarding address
    opcode 9 :  KDOCKI  - change cluster key index
    opcode 10 :  KDOSKL  - set key links
    opcode 11 :  KDOQMI  - quick multi-insert (ex. insert as select...)
    opcode 12 :  KDOQMD  - quick multi-delete
    opcode 13 :  KDOTBF  - toggle block header flags

    很明显,这表示insert row piece。 看来我们单纯的修改这2个block 并不能绕过这个问题。 实际上后面我dump分析发现又涉及到_next_object,又将问题复杂化了。
    虽然我相信多折腾几次可以解决这个问题。但是操作确实麻烦,费劲。不过此时通过之前的备份restore出来的system文件已经ok了。这里我用bbed 将涉及到的几个block 进行替换,最后再修改resetlogs信息,重建控制文件之后,进行recover。
    非常顺利的打开了数据库。最后检查alert log 还涉及到smon 回滚某个事务失败。那么如何完美处理呢?
    首先dump undo header,然后获取该事务涉及的操作对象,然后使用如下参数屏蔽回滚段后,将undo表空间重建即可。
    如下是dump undo header获取的信息:
    针对这部分对象,由于破坏了事务的完整性,那么建议对表进行分析,其中Index进行重建。

    Related posts:

    1. 一次远程协助的恢复 遇到异灵事件
    2. 一次TB级ERP(ASM RAC)库的恢复
    3. ora-00600 kccpb_sanity_check_2和kclchkblk_4的恢复case
    4. sysaux大面积坏块的例子
    5. Deep in ora-00600 [4193]


沪ICP备19023445号-2号
友情链接