IT博客汇
  • 首页
  • 精华
  • 技术
  • 设计
  • 资讯
  • 扯淡
  • 权利声明
  • 登录 注册

    datafile auto offlile due to i/o error

    admin发表于 2016-10-30 16:28:35
    love 0

    本站文章除注明转载外,均为本站原创: 转载自love wife & love life —Roger 的Oracle技术博客

    本文链接地址: datafile auto offlile due to i/o error

    刚到酒店,就接到客户电话说某数据库的一个数据文件报IO错误,通过vpn查看发现如下:

    Sun Oct 30 23:19:27 BEIST 2016
    Trace dumping is performing id=[cdmp_20161030231927]
    Sun Oct 30 23:19:27 BEIST 2016
    Errors in file /oracle/app/10.2/admin/xxxx/bdump/xxxx2_smon_11863216.trc:
    ORA-00376: file 595 cannot be read at this time
    ORA-01110: data file 595: '/dev/rdata05vg_8g_48'
    Sun Oct 30 23:19:31 BEIST 2016
    ORACLE Instance xxxx2 (pid = 22) - Error 376 encountered while recovering transaction (160, 1).
    Sun Oct 30 23:19:31 BEIST 2016
    Errors in file /oracle/app/10.2/admin/xxxx/bdump/xxxx2_smon_11863216.trc:
    ORA-00376: file 595 cannot be read at this time
    ORA-01110: data file 595: '/dev/rdata05vg_8g_48'
    Sun Oct 30 23:19:32 BEIST 2016
    Errors in file /oracle/app/10.2/admin/xxxx/bdump/xxxx2_smon_11863216.trc:
    ORA-00376: file 595 cannot be read at this time
    ORA-01110: data file 595: '/dev/rdata05vg_8g_48'
    Sun Oct 30 23:19:33 BEIST 2016

    我们不难看出,报错文件无法读;实际上我登录2个节点ls -tr检查发现权限都是正确的,同时通过dbv 检查该文件发现也无坏块;因此我断定直接recover该文件即可。在recover时,发现居然报错nfs挂载有问题:

    SQL> recover datafile 595;
    ORA-00279: change 15125505612642 generated at 10/30/2016 18:06:07 needed for thread 1
    ORA-00289: suggestion : /arch2/1_108445_815539661.dbf
    ORA-00280: change 15125505612642 for thread 1 is in sequence #108445
    
    Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
    /arch1/1_108445_815539661.dbf
    ORA-00308: cannot open archived log '/arch1/1_108445_815539661.dbf'
    ORA-27054: NFS file system where the file is created or resides is not mounted with correct options
    Additional information: 6

    由此可见该环境问题还不少。既然本地节点无法读取,为了短时间内恢复正常,直接将部分归档cp到相应的归档目录中,再次进行recover即可成功online该文件,如下:

    Sun Oct 30 23:20:23 BEIST 2016
    alter database datafile 595 online
    Sun Oct 30 23:20:23 BEIST 2016
    Completed: alter database datafile 595 online
    Sun Oct 30 23:20:28 BEIST 2016
    SMON: Parallel transaction recovery tried
    Sun Oct 30 23:23:02 BEIST 2016
    Thread 2 advanced to log sequence 164113 (LGWR switch)
      Current log# 7 seq# 164113 mem# 0: /dev/rora_redo2_01

    我们可以看到,确实顺利online文件了,还好是归档的数据库。那么我们继续来分析一下,为什么会出现这个问题呢? 进一步搜索alert log发现该在下午18:15分开始出现I欧错误:

    Sun Oct 30 18:15:04 BEIST 2016
    KCF: write/open error block=0x29790 online=1
         file=595 /dev/rdata05vg_8g_48
         error=27063 txt: 'IBM AIX RISC System/6000 Error: 5: I/O error
    Additional information: -1
    Additional information: 131072'
    Automatic datafile offline due to write error on
    file 595: /dev/rdata05vg_8g_48
    Sun Oct 30 18:15:28 BEIST 2016
    Thread 2 advanced to log sequence 164100 (LGWR switch)
      Current log# 12 seq# 164100 mem# 0: /dev/rora_redo2_06
    Sun Oct 30 18:15:28 BEIST 2016
    Errors in file /oracle/app/10.2/admin/xxxx/udump/xxxx2_ora_28705020.trc:
    ORA-00372: file 595 cannot be modified at this time
    ORA-01110: data file 595: '/dev/rdata05vg_8g_48'
    ORA-00372: file 595 cannot be modified at this time
    ORA-01110: data file 595: '/dev/rdata05vg_8g_48'

    我们可以看到,因为出现错误,Oracle自动将数据文件offline了,这其实是数据库的一直保护机制(没有相关隐含参数来控制)。到这里我怀疑多半是操作系统哪儿出问题了,果然errpt 查看发现在18:15出现了path error错误。

    oracle:xxx$(/oracle)errpt
    IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
    3D32B80D   1030181516 P S topsvcs        NIM thread blocked
    3D32B80D   1030181516 P S topsvcs        NIM thread blocked
    E86653C3   1030181516 P H LVDD           I/O ERROR DETECTED BY LVM
    B6267342   1030181516 P H hdisk46        DISK OPERATION ERROR
    DE3B8540   1030181516 P H hdisk46        PATH HAS FAILED
    DE3B8540   1030181416 P H hdisk46        PATH HAS FAILED
    
    oracle:xxxx$(/oracle/app/10.2/admin/xxxx/bdump)errpt -aj DE3B8540
    ---------------------------------------------------------------------------
    LABEL:          SC_DISK_ERR7
    IDENTIFIER:     DE3B8540
    
    Date/Time:       Sun Oct 30 18:15:00 BEIST 2016
    Sequence Number: 921
    Machine Id:      00F7A4904C00
    Node Id:         sti50l02
    Class:           H
    Type:            PERM
    WPAR:            Global
    Resource Name:   hdisk46
    Resource Class:  disk
    Resource Type:   Hitachi
    ......
    ......
    
    Description
    PATH HAS FAILED
    
    Probable Causes
    ADAPTER HARDWARE OR CABLE
    DASD DEVICE
    
    Failure Causes
    UNDETERMINED
    
            Recommended Actions
            PERFORM PROBLEM DETERMINATION PROCEDURES
            CHECK PATH
    
    Detail Data
    PATH ID

    不难看出,由于下午出现了相关错误,导致数据库出现了IO异常,oracle自动将文件offline了。然而我在刚刚lspath检查发现都ok,实际上也应该这样,否则recover datafile还会继续报IO错误。夜深了,到这里结束吧!简单记录一下!

     

    PS:

    1) NFS挂载的相关参数说明

    Operating System Mount options for Binaries ## Mount options for Oracle Datafiles Mount options for CRS Voting Disk and OCR
    Sun Solaris * rw,bg,hard,nointr,rsize=32768,
    wsize=32768,proto=tcp,noac,

    vers=3,suid

    rw,bg,hard,nointr,rsize=32768,
    wsize=32768,proto=tcp,noac,
    forcedirectio, vers=3
    rw,bg,hard,nointr,rsize=32768,
    wsize=32768,proto=tcp,vers=3,
    noac,forcedirectio
    AIX (5L) ** rw,bg,hard,nointr,rsize=32768,
    wsize=32768,proto=tcp,

    vers=3,timeo=600

    cio,rw,bg,hard,nointr,rsize=32768,
    wsize=32768,proto=tcp,noac,
    vers=3,timeo=600
    cio,rw,bg,hard,intr,rsize=32768,
    wsize=32768,tcp,noac,
    vers=3,timeo=600
    HPUX 11.23 ***  – rw,bg,vers=3,proto=tcp,noac,
    hard,nointr,timeo=600,
    rsize=32768,wsize=32768,suid
    rw,bg,vers=3,proto=tcp,noac,
    forcedirectio,hard,nointr,timeo=600,
    rsize=32768,wsize=32768
    rw,bg,vers=3,proto=tcp,noac,
    forcedirectio,hard,nointr,timeo=600
    ,rsize=32768,wsize=32768
    Windows Not Supported Not Supported Not Supported
    Linux x86
    #
    ****
    rw,bg,hard,nointr,rsize=32768,
    wsize=32768,tcp, vers=3,
    timeo=600, actimeo=0
    rw,bg,hard,nointr,rsize=32768,
    wsize=32768,tcp,actimeo=0,
    vers=3,timeo=600
    rw,bg,hard,nointr,rsize=32768,
    wsize=32768,tcp,noac,actimeo=0,
    vers=3,timeo=600
    Linux x86-64 #
    ****
    rw,bg,hard,nointr,rsize=32768,
    wsize=32768,tcp,vers=3,
    timeo=600, actimeo=0
    rw,bg,hard,nointr,rsize=32768,
    wsize=32768,tcp,actimeo=0,
    vers=3,timeo=600
    rw,bg,hard,nointr,rsize=32768,
    wsize=32768,tcp,noac,vers=3,
    timeo=600,actimeo=0
    Linux – Itanium rw,bg,hard,nointr,rsize=32768,
    wsize=32768,tcp,vers=3,
    timeo=600, actimeo=0
    rw,bg,hard,nointr,rsize=32768,
    wsize=32768,tcp,actimeo=0,
    vers=3,timeo=600
    rw,bg,hard,nointr,rsize=32768,
    wsize=32768,tcp,noac,vers=3,
    timeo=600,actimeo=0

    * NFS mount option “forcedirectio” is required on Solaris platforms when mounting the OCR/CRS files when using Oracle 10.1.0.4 or 10.2.0.2 or later (Oracle unpublished bug 4466428)
    ** AIX is only supported with NAS on AIX 5.3 TL04 and higher with Oracle 10.2.0.1 and later
    *** NAS devices are only supported with HPUX 11.23 or higher ONLY
    # These mount options are for Linux kernels 2.6 and above. For older kernels please check Note 279393.1

    ## The stated mount options for binaries are applicable only if the ORACLE HOME is shared.

    Due to Unpublished bug 5856342, it is necessary to use the following init.ora parameter when using NAS with all versions of RAC on Linux (x86 & X86-64 platforms) until 10.2.0.4. This bug is fixed and included in 10.2.0.4 patchset.
    filesystemio_options = DIRECTIO

     

    2) 如果是Oracle 11.2.0.2版本开始,在没有安装Patch 7691270的情况之下,Oracle在遇到IO错误之后,会自动将数据库crash掉,其中有个相关的隐含参数:_datafile_write_errors_crash_instance

    该参数在11.2.0.2版本之后默认为true,包含最新的11.2.0.4版本。

    Enter value for par: datafile_write
    old   3:  WHERE x.indx = y.indx AND x.ksppinm LIKE '%&par%'
    new   3:  WHERE x.indx = y.indx AND x.ksppinm LIKE '%datafile_write%'
    
    NAME                                               VALUE                DESCRIB
    -------------------------------------------------- -------------------- ------------------------------------------------------------
    _datafile_write_errors_crash_instance              TRUE                 datafile write errors crash instance

    Related posts:

    1. win 环境 O/S-Error: (OS 23) 数据错误(循环冗余检查) —恢复
    2. Archivelog 模式下,datafile header损坏,如何恢复?
    3. datafile 也能跨resetlogs ?
    4. 11.2.0.4 RAC CRS diskgroup auto dismount问题


沪ICP备19023445号-2号
友情链接