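A colleague phoned in the early hours of the morning to report that an application was getting errors when accessing the Oracle database. The symptoms confirmed on site at the time were:
1. The application could not reach the database, and a test with SQL Developer failed as well, reporting ORA-12570 TNS:packet reader failure.
2. Checking the listener with lsnrctl status got no response at all, which is extremely rare.
3. The database status was OPEN. I checked system resources with nmon: as the screenshot shows, CPU utilization was not high, but CPU Wait% was very high. That points to abnormal I/O, possibly IO waits and contention.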
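(CPU Wait%: the proportion of the sampling interval during which the CPUs were idle while waiting for I/O to complete. Wait% is a form of CPU idle time: when a CPU is idle and some process is in state D (uninterruptible sleep), the system counts that time toward Wait%. Wait% is a ratio, not a duration, so for the same amount of I/O wait time, the more CPUs a server has, the lower Wait% will be. It reflects the balance between I/O and computation, and I/O-intensive applications generally show a higher Wait%.)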
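4. My mailbox held a large number of emails sent by the job that monitors the alert log. The alert log itself contained a large number of ORA errors, including ORA-00494 and ORA-00239.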
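To make the Wait% definition in item 3 concrete, here is a minimal Python sketch of how an iowait percentage can be derived from two samples of the aggregate cpu line in Linux /proc/stat. It is not how nmon itself is implemented. The two sample lines are invented for illustration, and the field order (user, nice, system, idle, iowait, irq, softirq, steal) is assumed to be that of a reasonably recent Linux kernel.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counters."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in fields[1:]]


def iowait_percent(before, after):
    """Share of elapsed CPU ticks spent idle while waiting for I/O, in percent."""
    deltas = [b - a for a, b in zip(before, after)]
    # Only the first eight counters are summed; guest time is already
    # included in user/nice, so adding it again would double count.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    # Index 4 is the iowait counter (assumed field order, see lead-in).
    return 100.0 * deltas[4] / total


# Invented samples, taken "a few seconds apart" (hypothetical values).
BEFORE = "cpu  1000 0 500 8000 500 0 0 0 0 0"
AFTER = "cpu  1100 0 550 8050 1300 0 0 0 0 0"

pct = iowait_percent(parse_cpu_line(BEFORE), parse_cpu_line(AFTER))
print(f"{pct:.1f}%")  # prints 80.0%
```

Because the result is a ratio over all CPU ticks, the same absolute amount of I/O wait produces a smaller percentage on a machine with more CPUs, which is the point made in the note above.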
As for the error "ORA-00494: enqueue [CF] held for too long (more than 900 seconds).....", let us first look at how it is described.
When ORA-00494 appears like this, it means the instance has crashed. See the official document Database Crashes With ORA-00494 (Doc ID 753290.1):
This error can also be accompanied by ORA-600 [2103] which is basically the same problem - a process was unable to obtain the CF enqueue within the specified timeout (default 900 seconds).
This behavior can be correlated with server high load and high concurrency on resources, IO waits and contention, which keep the Oracle background processes from receiving the necessary resources.
Cause#1: The lgwr has killed the ckpt process, causing the instance to crash.
From the alert.log we can see:
The database has waited too long for a CF enqueue, so the next error is reported:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 38356'
Then the LGWR killed the blocker, which was in this case the CKPT process which then causes the instance to crash.
Checking the alert.log further we can see that the frequency of redo log files switch is very high (almost every 1 min).
Cause#2: Checking the I/O State in the AWR report we find that:
Average Read per ms (Av Rd(ms)) for the database files which are located on this mount point " /oracle/oa1l/data/" is facing I/O issue as per the data collection which was perform
Cause#3: The problem has been investigated in Bug 7692631 - 'DATABASE CRASHES WITH ORA-494 AFTER UPGRADE TO 10.2.0.4'
and unpublished Bug 7914003 'KILL BLOCKER AFTER ORA-494 LEADS TO FATAL BG PROCESS BEING KILLED'
The ORA-00494 error occurs during periods of super-high stress, activity to the point there the server becomes unresponsive due to overloaded disk I/O, CPU or RAM.
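From the analysis above, all three causes are possible, but more information and evidence are needed to confirm which one actually triggered the ORA-00494 error and crashed the database instance.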
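1: The alert log contains the error "ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'". CF refers to the Control file schema global enqueue. If a process cannot obtain the CF enqueue within the specified time (900 seconds by default), the process holding the CF enqueue is killed. The timeout is controlled by the parameter _controlfile_enqueue_timeout.
Checking the redo log switch frequency shows that it was very low both from 22:00 to 24:00 on 2016-11-08 and from 00:00 to 02:00 on 2016-11-09, which rules out heavy DML activity. Even so, this analysis alone does not let us completely rule out Cause#1, so we move on to the other evidence.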
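The redo log switch check can be sketched as a simple per-hour count. The Python below is only an illustration of the idea: it assumes the switch timestamps have already been collected (for example from the FIRST_TIME column of V$LOG_HISTORY, or from the alert log), and the timestamps in the sample are invented, not taken from this incident.

```python
from collections import Counter
from datetime import datetime


def switches_per_hour(switch_times):
    """Count redo log switches per clock hour, returned in chronological order."""
    counts = Counter(
        t.replace(minute=0, second=0, microsecond=0) for t in switch_times
    )
    return dict(sorted(counts.items()))


# Invented sample timestamps (hypothetical, for illustration only).
SAMPLE = [
    datetime(2016, 11, 8, 22, 5),
    datetime(2016, 11, 8, 22, 40),
    datetime(2016, 11, 8, 23, 30),
    datetime(2016, 11, 9, 0, 15),
]

for hour, n in switches_per_hour(SAMPLE).items():
    print(hour.strftime("%Y-%m-%d %H:00"), n)
```

A handful of switches per hour, as in this made-up sample, is the kind of pattern that argues against a burst of DML. A switch roughly every minute, as described in the Oracle note, would show up as counts near 60.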
2: Regarding the statement that the problem has been investigated in Bug 7692631 - 'DATABASE CRASHES WITH ORA-494 AFTER UPGRADE TO 10.2.0.4'
and unpublished Bug 7914003 'KILL BLOCKER AFTER ORA-494 LEADS TO FATAL BG PROCESS BEING KILLED'
The alert log shows ORA-00239, but no errors such as ORA-603 or ORA-00470. The official document Disk I/O Contention/Slow Can Lead to ORA-239 and Instance Crash (Doc ID 1068799.1) describes this situation:
I/O contention or slowness leads to control file enqueue timeout.
One particular situation that can be seen is LGWR timeout while waiting for control file enqueue, and the blocker is CKPT :
From the AWR:
1) high "log file parallel write" and "control file sequential read" waits
2) Very slow Tablespace I/O, Av Rd(ms) of 1000-4000 ms (when lower than 20 ms is acceptable)
3) very high %iowait : 98.57%.
4) confirmed IO peak during that time
Please note: Remote archive destination is also a possible cause. Networking issues can also cause this type of issue when a remote archive destination is in use for a standby database.
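This server has been running normally for many years, so we lean toward an I/O problem as the cause rather than a software bug. Combined with the very high CPU Wait% observed at the time, this suggests severe IO waits and contention.
3: Next, let us look at the reports that the monitoring tool OSWatcher generated for this period. They show that CPU resources were largely idle.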
Operating System CPU Utilization
The CPU time spent waiting on I/O (Wait IO) also starts to climb after 10:45 PM (22:45). CPU utilization stays low throughout, peaking at a little over 20%.
Operating System CPU Other
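Next, Operating System I/O. The screenshot shows that the system's I/O devices were extremely busy from around 11 PM onward. From this we judge it very likely that the I/O anomaly caused the ORA-00494 error.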
Operating System I/O Throughput
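Then we check the operating system log:
As the screenshot shows, messages of the form "INFO: task kjournald:xxx blocked for more than 120 seconds." begin at 23:22, and a large number of them were logged before the crash. They were caused by the PlateSpin replication job (we later confirmed that the job started at 22:40). At this point we lean toward the second cause as the reason for the database instance crash. The system administrator later confirmed that the PlateSpin replication job had itself failed. Taken together, the evidence strongly suggests that the PlateSpin job triggered the I/O anomaly, and that the database instance crashed once I/O stopped responding, whether briefly or for an extended period.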
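Picking the hung-task warnings out of a large system log is easy to script. The Python sketch below matches the kernel message pattern quoted above. The sample lines, host name, and PIDs are invented for illustration and are not the actual log from this server, and the timestamp prefix is assumed to follow the usual syslog layout.

```python
import re
from collections import Counter

# Matches e.g. "INFO: task kjournald:1234 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<task>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def hung_tasks(lines):
    """Return (task, pid, seconds) for every hung-task warning found in lines."""
    hits = []
    for line in lines:
        m = HUNG_TASK.search(line)
        if m:
            hits.append((m.group("task"), int(m.group("pid")), int(m.group("secs"))))
    return hits


# Invented sample lines (hypothetical host name and PIDs).
SAMPLE_LOG = [
    "Nov  8 23:22:14 dbhost kernel: INFO: task kjournald:1234 blocked for more than 120 seconds.",
    "Nov  8 23:22:14 dbhost kernel: some unrelated message",
    "Nov  8 23:24:14 dbhost kernel: INFO: task kjournald:1234 blocked for more than 120 seconds.",
    "Nov  8 23:24:15 dbhost kernel: INFO: task oracle:4321 blocked for more than 120 seconds.",
]

# Count the warnings per task name.
per_task = Counter(task for task, _pid, _secs in hung_tasks(SAMPLE_LOG))
print(per_task.most_common())
```

Grouping by task name makes it quick to see whether the stalls are confined to filesystem journaling threads such as kjournald or have spread to the database processes.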
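Follow-up and resolution
At this point shutdown immediate could not shut the database down; it hung with no response. The only option was shutdown abort followed by starting the instance, but startup failed with errors.
I will not go into that error here; see ORA-01102: cannot mount database in EXCLUSIVE mode. After killing most of the leftover processes, three processes remained that could not be killed even with kill -9, as the screenshot shows: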
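kill -9 sends the SIGKILL signal to terminate a process, but it has no effect in the following two cases:
a. The process is a "Zombie" (shown as defunct by ps). It has already released all of its resources but has not yet been acknowledged by its parent. A zombie disappears once its parent reaps it, or once the parent exits and init reaps it; if neither happens, it stays until the next reboot. Its presence does not affect system performance.
b. The process is in "kernel mode" and waiting for a resource it cannot obtain (uninterruptible sleep, state D). A process in this state does not handle any signals until the wait ends, so if it never wakes up, the only way to get rid of it is to reboot the system. A Linux process runs in one of two modes, user mode and kernel mode, and a signal sent with "kill" takes effect only when the process is able to return to user mode.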
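Because these processes were stuck in kernel mode, were unlikely to wake up on their own, and did not respond to signals, there was no choice but to reboot the system. The reboot resolved the problem. Afterwards we agreed with the system administrator to suspend the PlateSpin job for now, take a fresh full backup on Sunday, and then continue observing the I/O impact.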
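The two cases in which kill -9 has no effect can be recognised from the process state that Linux exposes in /proc/<pid>/stat (the same letter that ps shows in its STAT column). The Python sketch below parses that state from a stat line. The sample lines and PIDs are invented for illustration, and the helper names are hypothetical.

```python
def process_state(stat_line):
    """Extract the one-letter state from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the LAST closing parenthesis.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    return rest[0]


def kill_has_no_effect(state):
    """True for the two states described above: D (uninterruptible sleep) and Z (zombie)."""
    return state in ("D", "Z")


# Invented sample lines, truncated after the first few fields (hypothetical PIDs).
SAMPLES = [
    "4242 (oracle) D 1 4242 4242 0 -1",
    "4243 (oracle) Z 1 4243 4243 0 -1",
    "4244 (bash) S 4000 4244 4244 0 -1",
]

for line in SAMPLES:
    state = process_state(line)
    print(line.split()[0], state, kill_has_no_effect(state))
```

On a live system the same function could be fed the text read from /proc/<pid>/stat for each leftover process, which shows whether waiting, killing the parent, or a reboot is the realistic option.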