Original link: http://www.dbaleet.org/exadata_sundiag_/
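sundiag is the go-to tool for collecting disk information on Exadata. The script is very thorough and covers everything related to disks, including flash. It did not exist on the early Exadata V1, which was built in partnership with HP; as its name suggests, it dates from after the Sun acquisition. The comparable script on Exadata V1 is GetSCConf.zip, which is quite basic by comparison and can only show cell, lun, physicaldisk, celldisk, and griddisk information.
Exadata images from version 11.2.2.2.0 onward ship with this script, located at /opt/oracle.SupportTools/sundiag.sh. A corresponding version also exists for Solaris x86, and it collects essentially the same information as the Linux one.
If you suspect a disk fault on an Exadata machine, for example a drive showing an amber light, you can run sundiag on that machine. When it finishes it produces a compressed archive, which you then upload to the SR for the Exadata support engineers to analyze. In many hardware-related SRs, the SR engineer will also ask the customer to upload sundiag output for diagnosis, so knowing how to use sundiag is essential for Exadata maintenance engineers. What follows is a brief analysis of the script, to see what information sundiag collects.
The script itself is fairly simple, but it covers all disk-related information.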
#!/bin/bash
# Copyright (c) 2009, 2010, Oracle and/or its affiliates. All rights reserved.

megacli_status () {
    CONT="a0"
    STATUS=0
    MEGACLI=/opt/MegaRAID/MegaCli/MegaCli64
    echo -n "Checking RAID status on "
    hostname
    for a in $CONT
    do
        NAME=`$MEGACLI -AdpAllInfo -$a |grep "Product Name" | cut -d: -f2`
        echo "Controller $a: $NAME"
        noonline=`$MEGACLI PDList -$a | grep Online | wc -l`
        echo "No of Physical disks online : $noonline"
        DEGRADED=`$MEGACLI -AdpAllInfo -a0 |grep "Degrade"`
        echo $DEGRADED
        NUM_DEGRADED=`echo $DEGRADED |cut -d" " -f3`
        [ "$NUM_DEGRADED" -ne 0 ] && STATUS=1
        FAILED=`$MEGACLI -AdpAllInfo -a0 |grep "Failed Disks"`
        echo $FAILED
        NUM_FAILED=`echo $FAILED |cut -d" " -f4`
        [ "$NUM_FAILED" -ne 0 ] && STATUS=1
    done
    return $STATUS
}

datestamp="`date +%Y_%m_%d_%H_%M`"
mkdir -p /tmp/sundiag_$datestamp
cd /tmp/sundiag_$datestamp
cp /var/log/messages* .
/bin/dmesg > `hostname -a`_dmesg_$datestamp.out
/opt/oracle.cellos/imageinfo -all > `hostname -a`_imageinfo-all_$datestamp.out
/sbin/lspci > `hostname -a`_lspci_$datestamp.out
/sbin/lspci -xxxx > `hostname -a`_lspci-xxxx_$datestamp.out
/usr/bin/lsscsi > `hostname -a`_lsscsi_$datestamp.out
/sbin/fdisk -l > `hostname -a`_fdisk-l_$datestamp.out
/usr/bin/ipmitool sel elist > `hostname -a`_sel-list_$datestamp.out
/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL > `hostname -a`_megacli64-AdpAllInfo_$datestamp.out
/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | awk '
    /Slot Number/ { counter += 1; slot[counter] = $3 }
    /Device Id/ { device[counter] = $3 }
    /Firmware state/ { state_drive[counter] = $3 }
    /Inquiry/ { name_drive[counter] = $3 " " $4 " " $5 " " $6 }
    END {
        for (i=1; i<=counter; i+=1)
            printf ( "Slot %02d Device %02d (%s) status is: %s \n", slot[i], device[i], name_drive[i], state_drive[i]);
    }' > `hostname -a`_megacli64-PdList_short_$datestamp.out
/opt/MegaRAID/MegaCli/MegaCli64 -AdpEventLog -GetEvents -f ./`hostname -a`_megacli64-GetEvents-all_$datestamp.out -aALL
/opt/MegaRAID/MegaCli/MegaCli64 -fwtermlog -dsply -aALL > `hostname -a`_megacli64-FwTermLog_$datestamp.out
/opt/MegaRAID/MegaCli/MegaCli64 -cfgdsply -aALL > `hostname -a`_megacli64-CfgDsply_$datestamp.out
/opt/MegaRAID/MegaCli/MegaCli64 -adpbbucmd -aALL > `hostname -a`_megacli64-BbuCmd_$datestamp.out
/opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -aALL > `hostname -a`_megacli64-LdPdInfo_$datestamp.out
/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL > `hostname -a`_megacli64-PdList_long_$datestamp.out
/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -LALL -aALL > `hostname -a`_megacli64-LdInfo_$datestamp.out

if [ -f /opt/oracle.cellos/ORACLE_CELL_NODE ]; then
    cellcli -e list cell detail > `hostname -a`_cell-detail_$datestamp.out
    cellcli -e list celldisk detail > `hostname -a`_celldisk-detail_$datestamp.out
    cellcli -e list lun detail > `hostname -a`_lun-detail_$datestamp.out
    cellcli -e list physicaldisk detail > `hostname -a`_physicaldisk-detail_$datestamp.out
    cellcli -e list physicaldisk where status!=normal > `hostname -a`_physicaldisk-fail_$datestamp.out
    cellcli -e list griddisk detail > `hostname -a`_griddisk-detail_$datestamp.out
    cellcli -e list flashcache detail > `hostname -a`_flashcache-detail_$datestamp.out
    cellcli -e list alerthistory > `hostname -a`_alerthistory_$datestamp.out
    sh /opt/oracle/cell/cellsrv/deploy/scripts/unix/hwadapter/diskadp/scripts_aura.sh > `hostname -a`_scripts-aura_$datestamp.out
    perl /opt/oracle/cell/cellsrv/deploy/scripts/unix/hwadapter/diskadp/get_disk_devices.pl 5042 > `hostname -a`_disk_devices_$datestamp.out
    cp /opt/oracle/cell/log/diag/asm/cell/`hostname -a`/trace/alert.log .
    cp /opt/oracle/cell/log/diag/asm/cell/`hostname -a`/trace/ms-odl.trc .
    /usr/bin/flash_dom -l > `hostname -a`_fdom-l_$datestamp.out
    #get data on the list of flash disks
    flash_list=`/usr/bin/lsscsi | grep MARVELL | awk '{print $7}'`
    for dev in $flash_list; do
        echo "aurasmart for $dev" >> `hostname -a`_aurasmart_$datestamp.out
        aurasmart -D $dev -N >> `hostname -a`_aurasmart_$datestamp.out
        aurasmart -v -d $dev -N >> `hostname -a`_aurasmart_$datestamp.out
    done
fi

if [ $# -eq 1 ]; then
    case "$1" in
        osw)
            #copy the oswatcher archive
            cp -r /opt/oracle.oswatcher/osw/archive .
            ;;
        *)
            echo "Unknown option: $1. Usage: sundiag.sh [osw]"
    esac
fi

megacli_status > `hostname -a`_megacli64-status_$datestamp.out
cd /tmp
tar -pjcvf /tmp/sundiag_$datestamp.tar.bz2 sundiag_$datestamp
echo "=============================================================================="
echo "Done the report files are in bzip2 compressed /tmp/sundiag_$datestamp.tar.bz2"
echo "=============================================================================="
/bin/rm -rf sundiag_$datestamp
exit 0
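
The megacli_status function in the script derives its return code from two counters in the -AdpAllInfo output: the number of degraded virtual drives and the number of failed disks. The sketch below reproduces that parsing on canned text, so the field arithmetic can be followed without a RAID controller. The function name raid_status_from_text and the sample lines are illustrative assumptions, not part of sundiag, and real MegaCli output may be laid out differently depending on the version.

```shell
#!/bin/bash
# Hypothetical helper (not part of sundiag): reads AdpAllInfo-style text on
# stdin and returns 0 when no degraded drives or failed disks are reported,
# 1 otherwise.
raid_status_from_text () {
    local text degraded failed num_degraded num_failed status=0
    text=$(cat)
    # Same grep patterns as megacli_status in the script.
    degraded=$(printf '%s\n' "$text" | grep "Degrade")
    failed=$(printf '%s\n' "$text" | grep "Failed Disks")
    # The unquoted echo collapses runs of whitespace, so "Degraded : 0"
    # becomes three space-separated fields and "Failed Disks : 0" four.
    num_degraded=$(echo $degraded | cut -d" " -f3)
    num_failed=$(echo $failed | cut -d" " -f4)
    [ "${num_degraded:-0}" -ne 0 ] && status=1
    [ "${num_failed:-0}" -ne 0 ] && status=1
    return $status
}

# Made-up samples that only imitate the two relevant lines.
sample_ok='  Degraded        : 0
  Failed Disks    : 0'
sample_bad='  Degraded        : 1
  Failed Disks    : 0'

for sample in "$sample_ok" "$sample_bad"; do
    if printf '%s\n' "$sample" | raid_status_from_text; then
        echo "RAID status: OK"
    else
        echo "RAID status: needs attention"
    fi
done
```

The counters are picked by position rather than by name, so the result depends on each line keeping the "label : value" shape shown in the sample.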
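1. Collects the operating system logs, /var/log/messages
2. Collects the kernel boot messages, dmesg
3. Collects the Exadata image version information, imageinfo
4. Collects the PCI hardware information and its hexadecimal configuration dump, lspci
5. Collects the SCSI device information, lsscsi
6. Collects the partition table information, fdisk
7. Collects the server's System Event Log (SEL), ipmitool
8. Collects all information from the SAS disk controller, MegaCli64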
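
Among the MegaCli64 output files, the PdList_short file is a condensed one-line-per-slot summary that the script builds by piping -PDList output through awk. A minimal sketch of that step on canned input follows. The awk program is the one from the script; the wrapper name summarize_pdlist and the sample drive data are made up for illustration, and real -PDList output contains many more fields per drive.

```shell
#!/bin/bash
# Hypothetical wrapper (not part of sundiag) around the awk program the
# script uses: reads PDList-style text on stdin, prints one line per slot.
summarize_pdlist () {
    awk '
        /Slot Number/    { counter += 1; slot[counter] = $3 }
        /Device Id/      { device[counter] = $3 }
        /Firmware state/ { state_drive[counter] = $3 }
        /Inquiry/        { name_drive[counter] = $3 " " $4 " " $5 " " $6 }
        END {
            for (i = 1; i <= counter; i += 1)
                printf("Slot %02d Device %02d (%s) status is: %s \n",
                       slot[i], device[i], name_drive[i], state_drive[i])
        }'
}

# Made-up sample describing two drives.
summarize_pdlist <<'EOF'
Slot Number: 0
Device Id: 19
Firmware state: Online, Spun Up
Inquiry Data: SEAGATE ST360057SSUN600G 0805 1047E0ABCD
Slot Number: 1
Device Id: 18
Firmware state: Failed
Inquiry Data: SEAGATE ST360057SSUN600G 0805 1047E0WXYZ
EOF
```

Each "Slot Number" line starts a new record, and the state is taken from the third whitespace-separated field only, so a state such as "Online, Spun Up" is shortened to "Online," in the summary.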
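On a cell node, the following additional information is collected:
9. Collects the detailed information for the cell
10. Collects the detailed information for all celldisks
11. Collects the detailed information for all LUNs
12. Collects the detailed information for all physical disks
13. Collects the information for all physical disks that are not in normal status
14. Collects the detailed information for all griddisks
15. Collects the detailed information for all flashcache
16. Collects the alerthistory, the alert history information
17. Collects the information for the F20 flash cards (FDOM)
18. Collects the disk device information
19. Collects the cell's alert log
20. Collects the trace information of the Management Service process on the cell node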
sundiag can also be combined with OSWatcher on Exadata: running /opt/oracle.SupportTools/sundiag.sh osw collects the local OSWatcher archive as well.
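
Because the archive name embeds a %Y_%m_%d_%H_%M timestamp, the archives in /tmp sort chronologically by name, which makes it easy to pick the most recent one before uploading it to the SR. The sketch below relies on that property; latest_sundiag_archive is a hypothetical helper, not part of sundiag.

```shell
#!/bin/bash
# Hypothetical helper (not part of sundiag): print the newest sundiag archive
# in a directory (default /tmp), relying on the sortable timestamp in the name.
latest_sundiag_archive () {
    local dir=${1:-/tmp} newest="" f
    # The shell expands the glob in sorted order, and the zero-padded
    # timestamp fields make that order chronological: the last match wins.
    for f in "$dir"/sundiag_*.tar.bz2; do
        [ -e "$f" ] && newest=$f
    done
    [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Usage: list the contents of the latest archive before uploading it.
if archive=$(latest_sundiag_archive /tmp); then
    tar -tjf "$archive"
else
    echo "no sundiag archive found in /tmp"
fi
```

Listing the archive first is a cheap way to confirm that the expected .out files, and the OSWatcher archive directory when the osw option was used, made it into the upload.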