IT博客汇 | datafile auto offlile due to i/o error

datafile auto offlile due to i/o error

admin发表于 2016-10-30 16:28:35

本站文章除注明转载外，均为本站原创： 转载自love wife & love life —Roger 的Oracle技术博客

本文链接地址: datafile auto offlile due to i/o error

刚到酒店，就接到客户电话说某数据库的一个数据文件报IO错误，通过vpn查看发现如下：

Sun Oct 30 23:19:27 BEIST 2016
Trace dumping is performing id=[cdmp_20161030231927]
Sun Oct 30 23:19:27 BEIST 2016
Errors in file /oracle/app/10.2/admin/xxxx/bdump/xxxx2_smon_11863216.trc:
ORA-00376: file 595 cannot be read at this time
ORA-01110: data file 595: '/dev/rdata05vg_8g_48'
Sun Oct 30 23:19:31 BEIST 2016
ORACLE Instance xxxx2 (pid = 22) - Error 376 encountered while recovering transaction (160, 1).
Sun Oct 30 23:19:31 BEIST 2016
Errors in file /oracle/app/10.2/admin/xxxx/bdump/xxxx2_smon_11863216.trc:
ORA-00376: file 595 cannot be read at this time
ORA-01110: data file 595: '/dev/rdata05vg_8g_48'
Sun Oct 30 23:19:32 BEIST 2016
Errors in file /oracle/app/10.2/admin/xxxx/bdump/xxxx2_smon_11863216.trc:
ORA-00376: file 595 cannot be read at this time
ORA-01110: data file 595: '/dev/rdata05vg_8g_48'
Sun Oct 30 23:19:33 BEIST 2016

我们不难看出，报错文件无法读；实际上我登录2个节点ls -tr检查发现权限都是正确的，同时通过dbv 检查该文件发现也无坏块；因此我断定直接recover该文件即可。在recover时，发现居然报错nfs挂载有问题：

SQL> recover datafile 595;
ORA-00279: change 15125505612642 generated at 10/30/2016 18:06:07 needed for thread 1
ORA-00289: suggestion : /arch2/1_108445_815539661.dbf
ORA-00280: change 15125505612642 for thread 1 is in sequence #108445

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
/arch1/1_108445_815539661.dbf
ORA-00308: cannot open archived log '/arch1/1_108445_815539661.dbf'
ORA-27054: NFS file system where the file is created or resides is not mounted with correct options
Additional information: 6

由此可见该环境问题还不少。既然本地节点无法读取，为了短时间内恢复正常，直接将部分归档cp到相应的归档目录中，再次进行recover即可成功online该文件，如下：

Sun Oct 30 23:20:23 BEIST 2016
alter database datafile 595 online
Sun Oct 30 23:20:23 BEIST 2016
Completed: alter database datafile 595 online
Sun Oct 30 23:20:28 BEIST 2016
SMON: Parallel transaction recovery tried
Sun Oct 30 23:23:02 BEIST 2016
Thread 2 advanced to log sequence 164113 (LGWR switch)
  Current log# 7 seq# 164113 mem# 0: /dev/rora_redo2_01

我们可以看到，确实顺利online文件了，还好是归档的数据库。那么我们继续来分析一下，为什么会出现这个问题呢？进一步搜索alert log发现该在下午18：15分开始出现I欧错误：

Sun Oct 30 18:15:04 BEIST 2016
KCF: write/open error block=0x29790 online=1
     file=595 /dev/rdata05vg_8g_48
     error=27063 txt: 'IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: -1
Additional information: 131072'
Automatic datafile offline due to write error on
file 595: /dev/rdata05vg_8g_48
Sun Oct 30 18:15:28 BEIST 2016
Thread 2 advanced to log sequence 164100 (LGWR switch)
  Current log# 12 seq# 164100 mem# 0: /dev/rora_redo2_06
Sun Oct 30 18:15:28 BEIST 2016
Errors in file /oracle/app/10.2/admin/xxxx/udump/xxxx2_ora_28705020.trc:
ORA-00372: file 595 cannot be modified at this time
ORA-01110: data file 595: '/dev/rdata05vg_8g_48'
ORA-00372: file 595 cannot be modified at this time
ORA-01110: data file 595: '/dev/rdata05vg_8g_48'

我们可以看到，因为出现错误，Oracle自动将数据文件offline了，这其实是数据库的一直保护机制（没有相关隐含参数来控制）。到这里我怀疑多半是操作系统哪儿出问题了，果然errpt 查看发现在18：15出现了path error错误。

oracle:xxx$(/oracle)errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
3D32B80D   1030181516 P S topsvcs        NIM thread blocked
3D32B80D   1030181516 P S topsvcs        NIM thread blocked
E86653C3   1030181516 P H LVDD           I/O ERROR DETECTED BY LVM
B6267342   1030181516 P H hdisk46        DISK OPERATION ERROR
DE3B8540   1030181516 P H hdisk46        PATH HAS FAILED
DE3B8540   1030181416 P H hdisk46        PATH HAS FAILED

oracle:xxxx$(/oracle/app/10.2/admin/xxxx/bdump)errpt -aj DE3B8540
---------------------------------------------------------------------------
LABEL:          SC_DISK_ERR7
IDENTIFIER:     DE3B8540

Date/Time:       Sun Oct 30 18:15:00 BEIST 2016
Sequence Number: 921
Machine Id:      00F7A4904C00
Node Id:         sti50l02
Class:           H
Type:            PERM
WPAR:            Global
Resource Name:   hdisk46
Resource Class:  disk
Resource Type:   Hitachi
......
......

Description
PATH HAS FAILED

Probable Causes
ADAPTER HARDWARE OR CABLE
DASD DEVICE

Failure Causes
UNDETERMINED

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES
        CHECK PATH

Detail Data
PATH ID

不难看出，由于下午出现了相关错误，导致数据库出现了IO异常，oracle自动将文件offline了。然而我在刚刚lspath检查发现都ok，实际上也应该这样，否则recover datafile还会继续报IO错误。夜深了，到这里结束吧！简单记录一下！

PS：

1) NFS挂载的相关参数说明

Operating System	Mount options for Binaries ##	Mount options for Oracle Datafiles	Mount options for CRS Voting Disk and OCR
Sun Solaris *	rw,bg,hard,nointr,rsize=32768, wsize=32768,proto=tcp,noac, vers=3,suid	rw,bg,hard,nointr,rsize=32768, wsize=32768,proto=tcp,noac, forcedirectio, vers=3	rw,bg,hard,nointr,rsize=32768, wsize=32768,proto=tcp,vers=3, noac,forcedirectio
AIX (5L) **	rw,bg,hard,nointr,rsize=32768, wsize=32768,proto=tcp, vers=3,timeo=600	cio,rw,bg,hard,nointr,rsize=32768, wsize=32768,proto=tcp,noac, vers=3,timeo=600	cio,rw,bg,hard,intr,rsize=32768, wsize=32768,tcp,noac, vers=3,timeo=600
HPUX 11.23 *** –	rw,bg,vers=3,proto=tcp,noac, hard,nointr,timeo=600, rsize=32768,wsize=32768,suid	rw,bg,vers=3,proto=tcp,noac, forcedirectio,hard,nointr,timeo=600, rsize=32768,wsize=32768	rw,bg,vers=3,proto=tcp,noac, forcedirectio,hard,nointr,timeo=600 ,rsize=32768,wsize=32768
Windows	Not Supported	Not Supported	Not Supported
Linux x86 # ****	rw,bg,hard,nointr,rsize=32768, wsize=32768,tcp, vers=3, timeo=600, actimeo=0	rw,bg,hard,nointr,rsize=32768, wsize=32768,tcp,actimeo=0, vers=3,timeo=600	rw,bg,hard,nointr,rsize=32768, wsize=32768,tcp,noac,actimeo=0, vers=3,timeo=600
Linux x86-64 # ****	rw,bg,hard,nointr,rsize=32768, wsize=32768,tcp,vers=3, timeo=600, actimeo=0	rw,bg,hard,nointr,rsize=32768, wsize=32768,tcp,actimeo=0, vers=3,timeo=600	rw,bg,hard,nointr,rsize=32768, wsize=32768,tcp,noac,vers=3, timeo=600,actimeo=0
Linux – Itanium	rw,bg,hard,nointr,rsize=32768, wsize=32768,tcp,vers=3, timeo=600, actimeo=0	rw,bg,hard,nointr,rsize=32768, wsize=32768,tcp,actimeo=0, vers=3,timeo=600	rw,bg,hard,nointr,rsize=32768, wsize=32768,tcp,noac,vers=3, timeo=600,actimeo=0

* NFS mount option “forcedirectio” is required on Solaris platforms when mounting the OCR/CRS files when using Oracle 10.1.0.4 or 10.2.0.2 or later (Oracle unpublished bug 4466428)
** AIX is only supported with NAS on AIX 5.3 TL04 and higher with Oracle 10.2.0.1 and later
*** NAS devices are only supported with HPUX 11.23 or higher ONLY
# These mount options are for Linux kernels 2.6 and above. For older kernels please check Note 279393.1

## The stated mount options for binaries are applicable only if the ORACLE HOME is shared.

Due to Unpublished bug 5856342, it is necessary to use the following init.ora parameter when using NAS with all versions of RAC on Linux (x86 & X86-64 platforms) until 10.2.0.4. This bug is fixed and included in 10.2.0.4 patchset.
filesystemio_options = DIRECTIO

2) 如果是Oracle 11.2.0.2版本开始，在没有安装Patch 7691270的情况之下，Oracle在遇到IO错误之后，会自动将数据库crash掉，其中有个相关的隐含参数：_datafile_write_errors_crash_instance

该参数在11.2.0.2版本之后默认为true，包含最新的11.2.0.4版本。

Enter value for par: datafile_write
old   3:  WHERE x.indx = y.indx AND x.ksppinm LIKE '%&par%'
new   3:  WHERE x.indx = y.indx AND x.ksppinm LIKE '%datafile_write%'

NAME                                               VALUE                DESCRIB
-------------------------------------------------- -------------------- ------------------------------------------------------------
_datafile_write_errors_crash_instance              TRUE                 datafile write errors crash instance