IT博客汇
  • 首页
  • 精华
  • 技术
  • 设计
  • 资讯
  • 扯淡
  • 权利声明
  • 登录 注册

    自动精简配置(Thin Provisioning)导致asm异常

    惜分飞发表于 2016-12-16 15:34:33
    love 0

    联系:手机(13429648788) QQ(107644445)

    链接:http://www.orasos.com/%e8%87%aa%e5%8a%a8%e7%b2%be%e7%ae%80%e9%85%8d%e7%bd%aethin-provisioning%e5%af%bc%e8%87%b4asm%e5%bc%82%e5%b8%b8.html

    标题:自动精简配置(Thin Provisioning)导致asm异常

    作者:惜分飞©版权所有[文章允许转载,但必须以链接方式注明源地址,否则追究法律责任.]

    有朋友在一个存储空间给asm使用,发生空间不足,然后使用另外一个存储中的lun给asm的data磁盘组增加asm disk,运行了大概1天之后,asm磁盘组直接dismount,数据库crash.然后就无法正常mount.包括这个存储上的几个其他磁盘组也无法正常mount.
    数据库异常日志

    Sun Oct 23 08:43:59 2016
    SUCCESS: diskgroup DATA was dismounted
    SUCCESS: diskgroup DATA was dismounted
    Sun Oct 23 08:44:00 2016
    Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_lmon_79128.trc:
    ORA-00202: control file: '+DATA/orcl/controlfile/current.278.892363163'
    ORA-15078: ASM diskgroup was forcibly dismounted
    Sun Oct 23 08:44:00 2016
    Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_lgwr_79174.trc:
    ORA-00345: redo log write error block 15924 count 2
    ORA-00312: online log 2 thread 1: '+DATA/orcl/onlinelog/group_2.274.892363167'
    ORA-15078: ASM diskgroup was forcibly dismounted
    ORA-15078: ASM diskgroup was forcibly dismounted
    Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_lgwr_79174.trc:
    ORA-00202: control file: '+DATA/orcl/controlfile/current.278.892363163'
    ORA-15078: ASM diskgroup was forcibly dismounted
    Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_lgwr_79174.trc:
    ORA-00204: error in reading (block 1, # blocks 1) of control file
    ORA-00202: control file: '+DATA/orcl/controlfile/current.278.892363163'
    ORA-15078: ASM diskgroup was forcibly dismounted
    Sun Oct 23 08:44:00 2016
    LGWR (ospid: 79174): terminating the instance due to error 204
    Sun Oct 23 08:44:00 2016
    opiodr aborting process unknown ospid (79742) as a result of ORA-1092
    Sun Oct 23 08:44:01 2016
    ORA-1092 : opitsk aborting process
    Sun Oct 23 08:44:01 2016
    ORA-1092 : opitsk aborting process
    System state dump requested by (instance=1, osid=79174 (LGWR)), summary=[abnormal instance termination].
    System State dumped to trace file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_diag_79118.trc
    Instance terminated by LGWR, pid = 79174
    

    很明显,数据库异常是由于asm diskgroup dismount,因此分析asm 日志

    asm 日志

    Sun Oct 23 07:00:31 2016
    Time drift detected. Please check VKTM trace file for more details.
    Sun Oct 23 08:43:55 2016
    Errors in file /oracle/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_8755.trc:
    ORA-27061: waiting for async I/Os failed
    Linux-x86_64 Error: 5: Input/output error
    Additional information: -1
    Additional information: 1048576
    WARNING: Write Failed. group:1 disk:2 AU:1222738 offset:0 size:1048576
    ERROR: failed to copy file +DATA.524, extent 15030
    ERROR: ORA-15080 thrown in ARB0 for group number 1
    Errors in file /oracle/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_8755.trc:
    ORA-15080: synchronous I/O operation to a disk failed
    Sun Oct 23 08:43:55 2016
    NOTE: stopping process ARB0
    NOTE: rebalance interrupted for group 1/0xec689cdd (DATA)
    NOTE: ASM did background COD recovery for group 1/0xec689cdd (DATA)
    NOTE: starting rebalance of group 1/0xec689cdd (DATA) at power 1
    Starting background process ARB0
    Sun Oct 23 08:43:56 2016
    ARB0 started with pid=24, OS id=103554 
    NOTE: assigning ARB0 to group 1/0xec689cdd (DATA) with 1 parallel I/O
    Errors in file /oracle/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_103554.trc:
    ORA-27061: waiting for async I/Os failed
    Linux-x86_64 Error: 5: Input/output error
    Additional information: -1
    Additional information: 1048576
    WARNING: Write Failed. group:1 disk:2 AU:1222738 offset:0 size:1048576
    ERROR: failed to copy file +DATA.256, extent 6570
    ERROR: ORA-15080 thrown in ARB0 for group number 1
    Errors in file /oracle/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_103554.trc:
    ORA-15080: synchronous I/O operation to a disk failed
    NOTE: stopping process ARB0
    Sun Oct 23 08:43:58 2016
    Errors in file /oracle/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_dbw0_8521.trc:
    ORA-27061: waiting for async I/Os failed
    Linux-x86_64 Error: 5: Input/output error
    Additional information: -1
    Additional information: 4096
    WARNING: Write Failed. group:1 disk:3 AU:6789 offset:24576 size:4096
    NOTE: cache initiating offline of disk 3 group DATA
    NOTE: process _dbw0_+asm1 (8521) initiating offline of disk 3.3915934787 (DATA_0003) with mask 0x7e in group 1
    Sun Oct 23 08:43:58 2016
    WARNING: Disk 3 (DATA_0003) in group 1 mode 0x7f is now being offlined
    WARNING: Disk 3 (DATA_0003) in group 1 in mode 0x7f is now being taken offline on ASM inst 1
    NOTE: initiating PST update: grp = 1, dsk = 3/0xe9686c43, mask = 0x6a, op = clear
    GMON updating disk modes for group 1 at 14 for pid 14, osid 8521
    ERROR: Disk 3 cannot be offlined, since diskgroup has external redundancy.
    ERROR: too many offline disks in PST (grp 1)
    Sun Oct 23 08:43:58 2016
    NOTE: cache dismounting (not clean) group 1/0xEC689CDD (DATA) 
    NOTE: messaging CKPT to quiesce pins Unix process pid: 103577, image: oracle@node1 (B000)
    WARNING: Disk 3 (DATA_0003) in group 1 mode 0x7f offline is being aborted
    WARNING: Offline of disk 3 (DATA_0003) in group 1 and mode 0x7f failed on ASM inst 1
    NOTE: halting all I/Os to diskgroup 1 (DATA)
    Sun Oct 23 08:43:59 2016
    NOTE: LGWR doing non-clean dismount of group 1 (DATA)
    NOTE: LGWR sync ABA=160.10145 last written ABA 160.10145
    

    错误信息很明显,由于Write Failed导致asm diskgroup dismount.

    系统日志

    Oct 23 08:43:55 node1 kernel: sd 6:0:12:1: [sdd]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    Oct 23 08:43:55 node1 kernel: sd 6:0:12:1: [sdd]  Sense Key : Data Protect [current] 
    Oct 23 08:43:55 node1 kernel: sd 6:0:12:1: [sdd]  Add. Sense: Space allocation failed write protect
    Oct 23 08:43:55 node1 kernel: sd 6:0:12:1: [sdd] CDB: Write(16): 8a 00 00 00 00 02 e7 18 37 f9 00 00 00 07 00 00
    Oct 23 08:43:55 node1 kernel: end_request: critical space allocation error, dev sdd, sector 12467058681
    Oct 23 08:43:55 node1 kernel: end_request: critical space allocation error, dev dm-3, sector 12467058681
    Oct 23 08:43:55 node1 kernel: sd 8:0:6:1: [sdh]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    Oct 23 08:43:55 node1 kernel: sd 8:0:6:1: [sdh]  Sense Key : Data Protect [current] 
    Oct 23 08:43:55 node1 kernel: sd 8:0:6:1: [sdh]  Add. Sense: Space allocation failed write protect
    Oct 23 08:43:55 node1 kernel: sd 8:0:6:1: [sdh] CDB: Write(16): 8a 00 00 00 00 02 e7 18
    Oct 23 08:43:55 node1 kernel: sd 6:0:4:1: [sdb]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    Oct 23 08:43:55 node1 kernel: sd 6:0:4:1: [sdb]  Sense Key : Data Protect [current] 
    Oct 23 08:43:55 node1 kernel: sd 6:0:4:1: [sdb]   33Add. Sense: Space allocation failed write protect
    Oct 23 08:43:55 node1 kernel: sd 6:0:4:1: [sdb] CDB: Write(16): 8a 00 00 00 00 02 e7 18 30 00 00 00 03 f9 00 00
    Oct 23 08:43:55 node1 kernel: end_request: critical space allocation error, dev sdb, sector 12467056640
    Oct 23 08:43:55 node1 kernel: f9 00 00 04 00
    Oct 23 08:43:55 node1 kernel: end_request: critical space allocation error, dev dm-3, sector 12467056640
    Oct 23 08:43:55 node1 kernel: 00 00
    Oct 23 08:43:55 node1 kernel: end_request: critical space allocation error, dev sdh, sector 12467057657
    Oct 23 08:43:55 node1 kernel: end_request: critical space allocation error, dev dm-3, sector 12467057657
    Oct 23 08:43:57 node1 kernel: sd 8:0:6:1: [sdh]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    Oct 23 08:43:57 node1 kernel: sd 8:0:6:1: [sdh]  Sense Key : Data Protect [current] 
    Oct 23 08:43:57 node1 kernel: sd 8:0:6:1: [sdh]  Add. Sense: Space allocation failed write protect
    Oct 23 08:43:57 node1 kernel: sd 8:0:6:1: [sdh] CDB: Write(16): 8a 00 00 00 00 02 e7 18 37 f9 00 00 00 07 00 00
    Oct 23 08:43:57 node1 kernel: end_request: critical space allocation error, dev sdh, sector 12467058681
    Oct 23 08:43:57 node1 kernel: end_request: critical space allocation error, dev dm-3, sector 12467058681
    Oct 23 08:43:57 node1 kernel: sd 8:0:12:1: [sdj]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    Oct 23 08:43:57 node1 kernel: sd 8:0:12:1: [sdj]  Sense Key : Data Protect [current] 
    Oct 23 08:43:57 node1 kernel: sd 8:0:12:1: [sdj]  Add. Sense: Space allocation failed write protect
    Oct 23 08:43:57 node1 kernel: sd 8:0:12:1: [sdj] CDB: Write(16): 8a 00 00 00 00 02 e7 18 30 00 00 00 03 f9 00 00
    Oct 23 08:43:57 node1 kernel: end_request: critical space allocation error, dev sdj, sector 12467056640
    Oct 23 08:43:57 node1 kernel: end_request: critical space allocation error, dev dm-3, sector 12467056640
    Oct 23 08:43:57 node1 kernel: sd 6:0:4:1: [sdb]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    Oct 23 08:43:57 node1 kernel: sd 6:0:4:1: [sdb]  Sense Key : Data Protect [current] 
    Oct 23 08:43:57 node1 kernel: sd 6:0:4:1: [sdb]  Add. Sense: Space allocation failed write protect
    Oct 23 08:43:57 node1 kernel: sd 6:0:4:1: [sdb] CDB: Write(16): 8a 00 00 00 00 02 e7 18 33 f9 00 00 04 00 00 00
    Oct 23 08:43:58 node1 kernel: sd 6:0:4:1: [sdb]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    Oct 23 08:43:58 node1 kernel: sd 6:0:4:1: [sdb]  Sense Key : Data Protect [current] 
    Oct 23 08:43:58 node1 kernel: sd 6:0:4:1: [sdb]  Add. Sense: Space allocation failed write protect
    Oct 23 08:43:58 node1 kernel: sd 6:0:4:1: [sdb] CDB: Write(16): 8a 00 00 00 00 03 3b 7e 78 30 00 00 00 08 00 00
    Oct 23 10:50:59 node1 init: oracle-ohasd main process (6150) killed by TERM signal
    

    错误信息为:critical space allocation error,严重空间分配错误.也就是linux在分配空间之时发生错误.在换而言之,由于分配空间错误导致asm 磁盘组dismount.

    查看多路径信息

    [root@node1 ~]# multipath -ll
    36000d31003190c000000000000000003 dm-3 COMPELNT,Compellent Vol
    size=80T features='1 queue_if_no_path' hwhandler='0' wp=rw
    `-+- policy='round-robin 0' prio=1 status=active
      |- 6:0:9:1  sdd 8:48  active ready running
      `- 8:0:9:1  sdi 8:128 active ready running
    delldisk2 (36000d310031908000000000000000003) dm-4 COMPELNT,Compellent Vol
    size=8.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
    `-+- policy='round-robin 0' prio=1 status=active
      |- 6:0:12:1 sde 8:64  active ready running
      |- 8:0:6:1  sdh 8:112 active ready running
      |- 6:0:4:1  sdb 8:16  active ready running
      `- 8:0:12:1 sdj 8:144 active ready running
    delldisk1 (36000d31003190a000000000000000007) dm-2 COMPELNT,Compellent Vol
    size=12T features='1 queue_if_no_path' hwhandler='0' wp=rw
    `-+- policy='round-robin 0' prio=1 status=active
      |- 6:0:1:1  sda 8:0   active ready running
      |- 8:0:2:1  sdf 8:80  active ready running
      |- 6:0:7:1  sdc 8:32  active ready running
      `- 8:0:3:1  sdg 8:96  active ready running
    

    很明显报错的都是同一个lun(delldisk2),也就是存储空间使用完的存储.也就是说,由于delldisk2存储的空间使用尽了导致系统出现分配空间错误,从而导致asm 写失败,进而导致数据库异常.这种问题的本质其实就是存储给系统分配了8T,但是实际存储可以使用的空间不足8T,而os按照8T来使用从而出现该问题.专业名字叫做”存储精简卷”.因此各位在存储配置之时需要注意该问题.因为这种情况的出现一般只是写io异常,读依旧正常,因此不会丢失数据.

    • ASM中磁盘组权限设置
    • root 用户操作 ORACLE 数据库导致悲剧
    • Liunx系统中Oracle使用HugePages配置
    • ORA-600 999
    • asm disk格式化恢复
    • asmlib异常报ORA-00600[kfklLibFetchNext00]
    • oracleasm createdisk重新创建asm disk后数据0丢失恢复案例
    • 因域名解析导致数据库连接延迟分析
    • Linux逻辑卷管理(LVM)
    • Heartbeat安装及简单配置
    • 分区无法识别导致asm diskgroup无法mount
    • 块修改跟踪特性(Block Change Tracking)
    • ORA-15032 ORA-15040 ORA-15042 asm故障恢复
    • ALERT: Disable Transparent HugePages on SLES11, RHEL6, OEL6 and UEK2 Kernels
    • 处理fast_recovery_area无剩余空间案例


沪ICP备19023445号-2号
友情链接