IT博客汇
  • 首页
  • 精华
  • 技术
  • 设计
  • 资讯
  • 扯淡
  • 权利声明
  • 登录 注册

    记一次TB级别的Exadata数据库恢复

    admin发表于 2015-10-21 11:44:58
    love 0

    本站文章除注明转载外,均为本站原创: 转载自love wife & love life —Roger 的Oracle技术博客

    本文链接地址: 记一次TB级别的Exadata数据库恢复

    某客户的exadata环境,由于大量硬件故障,导致数据库宕机。经过分析确认是由于某个cell节点的某个disk出现异常,导致部分

    cell节点的disk 离线后无法进行asm reblance;最终导致diskgroup 被强制dismount了。如下是alert log:

    Fri Oct 09 15:44:26 2015
    NOTE: process _user4655_+asm1 (4655) initiating offline of disk 46.3915916191 (DATA_KLYX_CD_10_KLYXCEL02) with mask 0x7e in group 1
    NOTE: checking PST: grp = 1
    ERROR: Disk 18 cannot be offlined, since all the disks [18, 49] with mirrored data would be offline.
    ERROR: too many offline disks in PST (grp 2)
    GMON checking disk modes for group 1 at 55 for pid 41, osid 4655
    Fri Oct 09 15:44:26 2015
    NOTE: cache dismounting (not clean) group 2/0x4AD8D253 (DBFS_DG)
    ERROR: too many offline disks in PST (grp 1)
    NOTE: checking PST for grp 1 done.
    WARNING: Disk 46 (DATA_KLYX_CD_10_KLYXCEL02) in group 1 in mode 0x7f is now being taken offline on ASM inst 1
    NOTE: initiating PST update: grp = 1, dsk = 46/0xe968239f, mode = 0x6a, op = 4
    GMON updating disk modes for group 1 at 56 for pid 41, osid 4655
    ERROR: Disk 46 cannot be offlined, since all the disks [46, 72] with mirrored data would be offline.
    ERROR: too many offline disks in PST (grp 1)
    WARNING: Offline of disk 46 (DATA_KLYX_CD_10_KLYXCEL02) in group 1 and mode 0x7f failed on ASM inst 1
    WARNING: Disk 18 (DBFS_DG_CD_10_KLYXCEL02) in group 2 mode 0x7f offline is being aborted
    Fri Oct 09 15:44:26 2015
    NOTE: process _user4663_+asm1 (4663) initiating offline of disk 46.3915916191 (DATA_KLYX_CD_10_KLYXCEL02) with mask 0x7e in group 1
    NOTE: checking PST: grp = 1
    WARNING: Offline of disk 18 (DBFS_DG_CD_10_KLYXCEL02) in group 2 and mode 0x7f failed on ASM inst 1
    Fri Oct 09 15:44:26 2015
    NOTE: halting all I/Os to diskgroup 2 (DBFS_DG)
    NOTE: unable to offline disks after getting write error for diskgroup DBFS_DG
    NOTE: write to disk 32 succeeded
    NOTE: disk 18 had IO error
    NOTE: write to disk 6 succeeded
    NOTE: halting all I/Os to diskgroup 1 (DATA_KLYX)
    NOTE: LGWR doing non-clean dismount of group 1 (DATA_KLYX)
    NOTE: LGWR sync ABA=21.5465 last written ABA 21.5465
    GMON checking disk modes for group 1 at 57 for pid 44, osid 4663
    ERROR: too many offline disks in PST (grp 1)
    NOTE: checking PST for grp 1 done.

    我们可以看到,cell 2节点的第10号disk存在异常,通过cellcli命令行也可以进一步确认:

    CellCLI> list physicaldisk;
             20:0            E07Q5N          normal
             20:1            E05SMD          normal
             20:2            E07GV6          normal
             20:3            E060YD          normal
             20:4            E05ZQE          normal
             20:5            E07L9M          normal
             20:6            E07FED          normal
             20:7            E05N88          normal
             20:8            E08DY5          normal
             20:9            E06ADQ          normal
             20:10           E08E5C          critical
             20:11           E08DYF          normal

    通过对该磁盘进行修复检查,确实也发现了坏道,如下所示:

    SEAGATE ST360057SSUN600G --> SEAGATE ST360057SSUN600G
    Sector 0 --> Sector 0
    1,172,123,568 Sectors
    Sectors that could not be read:
    102,578,137
    102,578,139
    102,578,140
    102,578,141
    102,578,142
    102,578,149
    102,578,152
    ......
    1,172,123,561 sector(s) successfully copied.
    7 bad source sectors encountered.
    Corresponding destination sectors filled with: UNREADABLESECTOR

    修复完毕后,成功mount diskgroup。但是最后检查数据库存在部分数据坏块,如下所示:

    SQL>  select * from  V$DATABASE_BLOCK_CORRUPTION;
    
         FILE#     BLOCK#     BLOCKS CORRUPTION_CHANGE# CORRUPTION_TYPE
    ---------- ---------- ---------- ------------------ -----------------
             1    5186816          3                  0 CORRUPT
             1    5186813          1                  0 CORRUPT
             1    3343734          2                  0 CORRUPT
            25    2391372          1         1.4561E+13 CORRUPT
            35     538916          1                  0 CORRUPT
            36     543804          3                  0 CORRUPT

    对于数据坏块的修复相对比较简单,只是由于是asm环境处理起来相对麻烦一些,加上客户没有备份,所以无法进行blockcover。对于system的几个数据坏块,经过检查发现为审计对象,因此通过truncate然后insert数据即可格式化掉;

    另外第25号文件的数据坏块为逻辑坏块,通过查询确认为index,直接rebuild index即可。

    至于说最后第第35,36号文件,由于均为数据文件,涉及的对象均为分区表的partiiton,因此处理相对麻烦一些。

    这里我通过创建测试表,然后清理掉数据,构造一个数据空块,然后将该数据块复制到asm diskgroup中替换到这几个坏块即可,注意替换之前,需要修改数据坏块的rdba地址和obj id信息,如下是数据块的复制替换过程:

    SQL> @copy_block_to_asm.sql
    Enter value for file_with_patched_block: /u01/.../blk.dbf
    old  15: v_FsFileName := '&file_with_patched_block';
    new  15: v_FsFileName := '/u01/.../blk.dbf';
    Enter value for file_to_patch_in_asm: +DATA_KLYX/xxxx1/datafile/epmrange107.dbf
    old  16: v_AsmFileName := '&file_to_patch_in_ASM';
    new  16: v_AsmFileName := '+DATA_KLYX/xxxx1/datafile/epmrange107.dbf';
    Enter value for block_to_patch: 543804
    old  17: v_offstart := '&block_to_patch';
    new  17: v_offstart := '543804';
    PL/SQL procedure successfully completed.

    最后通过rman检查,确认一切正常。如下是rman的检测结果:

    .......
    RMAN> backup validate datafile 36;
    
    Starting backup at 16-OCT-15
    using target database control file instead of recovery catalog
    allocated channel: ORA_DISK_1
    channel ORA_DISK_1: SID=1337 instance=xxx1 device type=DISK
    channel ORA_DISK_1: starting full datafile backup set
    channel ORA_DISK_1: specifying datafile(s) in backup set
    input datafile file number=00036 name=+DATA_KLYX/xxx1/datafile/epmrange107.dbf
    channel ORA_DISK_1: backup set complete, elapsed time: 00:00:15
    List of Datafiles
    =================
    File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
    ---- ------ -------------- ------------ --------------- ----------
    36   OK     0              315258       3276800         14561048845049
      File Name: +DATA_KLYX/xxx1/datafile/epmrange107.dbf
      Block Type Blocks Failing Blocks Processed
      ---------- -------------- ----------------
      Data       0              2926354
      Index      0              20
      Other      0              35168
    ......

    Related posts:

    1. 最近迁移恢复中遇到的几个小问题
    2. 1.4TB ASM(RAC) 磁盘损坏恢复小记
    3. Oracle ASM crash & ORA-15196
    4. ORA-15196: invalid ASM block header [kfc.c:26076] [hard_kfbh]
    5. Oracle Recover Case: 50TB ASM crash case


沪ICP备19023445号-2号
友情链接