IT博客汇 | ORA-00600: internal error code, arguments: [ktecgeb-2]

ORA-00600: internal error code, arguments: [ktecgeb-2]

admin发表于 2015-07-21 09:26:24

环境: aix 6.1 oracle 10.2.0.4

故障现象:数据库遭遇600[ktecgeb-2],open一会自动宕掉。

alert日志如下:
SMON encountered 8 out of maximum 100 non-fatal internal errors.
Tue Jul 21 14:31:32 2015
Errors in file /oracle/app/admin/gzzhjs/bdump/gzzhjs_smon_7406000.trc:
ORA-00600: internal error code, arguments: [ktecgeb-2], [603979776], [0], [], [], [], [], []
Tue Jul 21 14:31:35 2015
Errors in file /oracle/app/admin/gzzhjs/bdump/gzzhjs_pmon_20971898.trc:
ORA-00474: SMON process terminated with error
Tue Jul 21 14:31:35 2015
PMON: terminating instance due to error 474
Instance terminated by PMON, pid = 20971898

解决办法:

Ora-00600: Internal Error Code, Arguments: [Ktecgeb-2], [1572], [0], [], [], [], (文档 ID 787069.1)

In this Document

Cause

Applies to:

Oracle Database – Enterprise Edition – Version 11.1.0.7 to 11.1.0.7 [Release 11.1]
Information in this document applies to any platform.
***Checked for relevance on 26-Nov-2012***

Symptoms

Alert log keeps reporting:

Sat Feb 21 19:23:25 2009
Errors in file b:\oracle\diag\rdbms\<dbs>\<sid>\trace\<sid>_smon_4364.trc (incident=692119):
ORA-00600: internal error code, arguments: [ktecgeb-2], [1572], [0], [], [], [], [], [], [], [],[], []
Non-fatal internal error happenned while SMON was doing temporary segment drop.

Due to SMON crashed the database did crash and will crash again during startup of database.

Stacktrace of SMON is like:
… <- ktecgeb <- kteopshrink <- ktssdrbm_segment <- ktssdro_segment <- ktssdt_segs <- ktmmon <- …

Cause

There is a problem during cleaning up of temporary segment(s).
As known SMON is responsible for cleanup of temporary segments and in this specific case the bitmap of the temporary segment(s) is corrupt.

Likely cause of the error as hit is:

Oracle 11.1 and 11.2:
Bug 9495481 – ORA-600 [KTECGEB-2] WHILE SMON WAS DOING TEMPORARY SEGMENT DROP
Oracle 10.2:
Unpublished Bug 5743780 – ORA-600 [KTECGEB-2] BAD EXTENT MAP BLOCK IN ASSM APPLYING 9I REDO IN 10.2

The error did occur at moment that 9i redo was applied in 10.2/11.1 rdbms (issue hit at/during upgrade of database)

Solution

The solution of the problem as hit will consist of 2 actions:
a) actions to prevent the problem in reoccurring
b) resolve the current problem as hit

a) actions to prevent the problem in reoccurring
Bug 9495481 as referred before has been logged to address the problem as hit in Oracle 11.1.0.7.0 (but is also applicable to 11.2)
Checking this bug learns that the error can occur as well in Oracle 10.2 which is addressed in non-published Bug 5743780.

The error does occur when applying 9i redo against 10.2/11.1/11.2 version of database which will/might be the case during an upgrade.
In case of upgrade the action would be to ensure that the version being upgrading to does contain the fix for this issue before upgrade being done:
- in case of Oracle 10.2 install patchset 10.2.0.5.0
- in case of Oracle 11.1 install one-off fix for Bug 9495481 (on top of 11.1.0.7)
- in case of Oracle 11.2 install patchset 11.2.0.3.0

The fix is included in 12.1

In case there is no option to redo the upgrade the action would be to resolve the current problem and install the fix afterward.

b) resolve the current problem:
- thorough diagnose of problem as hit ==> ensuring that steps to repair are the correct ones
- repair actions

DIAGNOSE
=========
The 1st action is to startup the database without SMON trying to cleanup the temporary segment(s):
==> only ment to be used for this diagnose/repair action only!!

A PFILE will be used for all repair actions which makes it easy to back out all needed parameters

$> cp $ORACLE_HOME/dbs/init$ORACLE_SID.ora /tmp

$> sqlplus /nolog << EOF
connect / as sysdba
startup mount;
create pfile='/tmp/init$ORACLE_SID.ora' from spfile;
shutdown immediate
EOF

Add the following parameter to PFILE in order to disable SMON in cleanup of temporary segments:

event="10061 trace name context forever, level 10"

Startup the database in restricted mode using the PFILE:

$> sqlplus /nolog << EOF
connect / as sysdba
startup restrict pfile=/tmp/init$ORACLE_SID.ora
EOF

Check which temporary segments are present in the database:

$> sqlplus /nolog << EOF
connect / as sysdba
set pagesize 9999
select * from dba_segments where segment_type='TEMPORARY';
EOF

Outcome of above will be something like:

OWNER SEGMENT_NAME PARTITION_NAME SEGMENT_TYPE SEGMENT_SU TABLESPACE_NAME HEADER_FILE HEADER_BLOCK BYTES BLOCKS EXTENTS INITIAL_EXTENT NEXT_EXTENT MIN_EXTENTS MAX_EXTENTS MAX_SIZE RETENTI MINRETENTION PCT_INCREASE FREELISTS FREELIST_GROUPS RELATIVE_FNO BUFFER_

DISCH_SUM 8.140238 TEMPORARY DISCH_SUM 8 140238 8471969792 1034176 16159 524288 524288 1 2147483645 0 8 DEFAULT

DISCH_SUM 8.148174 TEMPORARY DISCH_SUM 8 148174 2564292608 313024 4891 524288 524288 1 2147483645 0 8 DEFAULT

At this moment in time we do know which temporary segment is possibly causing a problem and in which tablespace the segment is located.

Rechecking the SMON tracefile to see which blocks were last visited shows:

…
1: waited for 'db file sequential read'
file#=8, block#=223ce, blocks=1 <…

==> so the likely segment giving a problem in this case is 8.140238

Check the tablespace with dbms_space_admin to dump the bitmap:

$> sqlplus /nolog << EOF
connect / as sysdba
alter session set tracefile_identifier='dbms_space_admin';
exec dbms_space_admin.assm_tablespace_verify(upper('&tablespace_name'),20);
EOF

*tracefile_identifier being used to easy identify the tracefile as created

Above command has created a tracefile located in user_dump_dest, check the tracefile for errors
(dependent of amount of objects in tracefile it might be a large tracefile, but you can use the block# as collected earlier to quickly search for the temporary segments).
Be aware of possible false reported errors, for instance Bug 7665990 which incorrect report:
The datablock [dba: 0x002001bcf, (file 8, block 7119)] has wrong L1 parent [dba: 0x000, (file 0, block 0)]

Example of tracefile:

*** 2009-02-23 11:31:17.716
Examining segment with segment header (8,78)
*********
verifying extent map and tablespace bitmap consistency
———
Verifying consistency of segment header, L3, L2 and L1 bitmap blocks
Verifying datablocks and L1 BMB consistency
———
Verifying Segment HWM Consistency
Verifying High HWM information
Verifying Low HWM information
———
Verifying consistency of hwm information in segment header and L1 bitmap block
———
Verifying extent map and auxilliary extent map consistency in the segment
Verifying extent map and L1 bitmap blocks consistency in the segment
———
Verifying segment header and seg$ information consistency of the segment
———

The temporary segments will NOT have an entry in the tracefile as created!
The check of the tablespace was to check if errors were present and thus excluding other issues.

A rebuild of the bitmap of the tablespace will not solve this issue.

REPAIR
======
Before doing any destructive action always make sure that there is a valid backup to rely on, which can be restored in case needed.

In order to resolve this problem we need to:
- find the relative_fno of the datafile in question
- corrupt the segment(s)
- drop the segment(s)
- rebuild the bitmap of the tablespace
- recheck the bitmap of the tablespace
- check if temporary segments have been cleaned up
- restart the database without event 10061

ActionPlan:

find the relative_fno
SQL> select relative_fno from dba_data_files where file_id=&file_id;

corrupt segment(s)
SQL> exec dbms_space_admin.segment_corrupt(upper('&tablespace_name'),&header_relative_file,&header_block,3);
…

drop segment(s)
SQL> exec dbms_space_admin.segment_drop_corrupt(upper('&tablespace_name'),&header_relative_file,&header_block);
…

rebuild the bitmap
SQL> exec dbms_space_admin.tablespace_rebuild_bitmaps(upper('&tablespace_name'));
==> check the tracefile as created!!

check if temporary segments are all cleaned up:
SQL> select * from dba_segments where segment_type='TEMPORARY';

restart the database without event 10061 so without PFILE:
SQL> shutdown immediate
SQL> startup

System is up and running again!

*In the specific TAR both temporary segments have been dropped using above steps, according the SMON tracefile the problem was with segment 8.140238.
I do advice to drop the segment as identified only and check if problem has been resolved, SMON will cleanup the other segment automatical in case it is not corrupt.

这里我采取的是保险的方法，找出select owner,segment_name from dba_segments where segment_type='TEMPORARY';

找到对应文件，然后让应用移走这个表空间的内容，然后干掉这个表空间。

最后重启数据库，取消10061事件，数据库恢复。