我写关于12.2 sharding database的文章已经好久了,今天再次把环境启动了起来,启动了主机之后,依次启动了listener和shardcat数据库和shard node数据库实例。检查shard状态的时候,发现报错:
(a)问题一:status显示warnings:
GDSCTL>config shard Name Shard Group Status State Region Availability ---- ----------- ------ ----- ------ ------------ sh1 primary_shardgroup Ok Deployed region1 ONLINE sh2 primary_shardgroup Warnings Deployed region1 - GDSCTL>
遇到这样的情况,我们可以使用recover shard的命令。注意recover shard -h出来的帮助是错误的。
GDSCTL>recover shard -h Syntax RECOVER SHARD -gdspool pool -database db_name [-skip_first|-ignore_first] [-full] Purpose Executes all DDL statements on specified database starting from the one, that was previusly executed with errors. The command is intended to perform all skipped DDL changes after database administrator fixes shard issues. Usage Notes Use SKIP_FIRST to skip first DDL. This is typically required after manual fix done by database administrator. For example, if CREATE TABLE statement fails because of lack of space, database administrator fixes the issue and re-executes CREATE TABLE. To avoid ORA-39151 (table exists)in RECOVER SHARD he has to specify -SKIP_FIRST. Use IGNORE_FIRST to mark first DDL as obsolete. This is required when wrong DDL statement was specified and failed on all shards. Thus you need to mark it down as obsolete. Keywords and Parameters full: Full recovery mode. gdspool: the GDS pool (If not specified and there is only one gdspool with access granted to user, it will be used by default) ignore_first: make first failed DDL statement obsolete. shard: The name of the shard. skip_first: skip first failed DDL statement Examples RECOVER SHARD -database shd1 GDSCTL>
正确的用法应该是recover shard -shard 。
我们运行这个命令,发现是gsm没有启动:
GDSCTL>recover shard -shard sh2 GSM-45076: GSM is not running GDSCTL> GDSCTL>start gsm GSM is started successfully GDSCTL>
启动后,稍等片刻(大约10秒),检查状态就恢复成正常了(在这里,我们还没有真正用到recover shard,在下面一个问题时才正常用到):
GDSCTL>config shard Name Shard Group Status State Region Availability ---- ----------- ------ ----- ------ ------------ sh1 primary_shardgroup Ok Deployed region1 ONLINE sh2 primary_shardgroup Warnings Deployed region1 - GDSCTL> GDSCTL>--等待大约10秒 GDSCTL> GDSCTL>config shard Name Shard Group Status State Region Availability ---- ----------- ------ ----- ------ ------------ sh1 primary_shardgroup Ok Deployed region1 ONLINE sh2 primary_shardgroup Ok Deployed region1 ONLINE GDSCTL>
所以,重新启动之后,需要启动的有database,listener,和gsm。
(b)问题二:state显示ddl error
然后,在接着的实验中,我drop掉shardcat上的一个duplicate table,发现此时shard状态又不正常了。config shard的state显示ddl error。
首先,在shardcat上操作,drop掉duplicate表
SQL> conn app_schema/oracle Connected. SQL> SQL> drop table products; drop table products * ERROR at line 1: ORA-02557: cannot operate on sharded objects when shard DDL is disabled SQL> SQL> SQL> SQL> alter session enable shard ddl; Session altered. SQL> select count(*) from products; COUNT(*) ---------- 1000 SQL> SQL> drop table products; Table dropped. SQL>
发现此时数据并没有同步,在shard node上,还是可以查询到数据:
SQL> !hostname sdb2 SQL> SQL> conn app_schema/oracle Connected. SQL> SQL> select count(*) from products; COUNT(*) ---------- 1000 SQL> SQL> !hostname sdb3 SQL> SQL> conn app_schema/oracle Connected. SQL> SQL> select count(*) from products; COUNT(*) ---------- 1000 SQL>
此时检查shard状态,变成DDL error:
GDSCTL>config shard Name Shard Group Status State Region Availability ---- ----------- ------ ----- ------ ------------ sh1 primary_shardgroup Ok DDL error region1 ONLINE sh2 primary_shardgroup Ok DDL error region1 ONLINE GDSCTL>
我们可以在gsm的alertlog中看到一些提示信息:
(注:gsm的日志位置在/u01/ora12c/app/oracle/diag/gsm/sdb1/sharddirector1/trace/alert_shardcat.log)
…… 07-NOV-2016 01:35:19 * (CONNECT_DATA=(SERVICE_NAME=GDS$CATALOG.oradbcloud)(CID=(PROGRAM=oracle)(HOST=sdb3)(USER=oracle12c))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.23)(PORT=34247)) * establish * GDS$CATALOG.oradbcloud * 0 2016-11-07T01:35:22.021482+08:00 07-NOV-2016 01:35:22 * service_update * shardcat%11 * 0 07-NOV-2016 01:35:22 * service_update * shardcat%1 * 0 2016-11-07T01:35:35.543868+08:00 GSM-40135: Catalog request: "DDL REQUEST" (80). Target -"33". Pool- - "". Id="72". Payload:"A" GSM-40135: Catalog request: "DDL REQUEST" (80). Target -"34". Pool- - "". Id="73". Payload:"E" 2016-11-07T01:35:36.652686+08:00 GSM-40148: Database task failed:database:"sh1", status 4, message:"ORA-01031: insufficient privileges ORA-06512: at "SYS.EXECASUSER", line 35 ORA-06512: at "SYS.DBMS_SYS_SQL", line 1434 ORA-06512: at "SYS.EXECASUSER", line 23 ORA-06512: at "GSMADMIN_INTERNAL.EXECUTEDDL", line 68 ORA-06512: at line 1 \(ngsmoci_execute\) " GSM-40148: Database task failed:database:"sh2", status 4, message:"ORA-01031: insufficient privileges ORA-06512: at "SYS.EXECASUSER", line 35 ORA-06512: at "SYS.DBMS_SYS_SQL", line 1434 ORA-06512: at "SYS.EXECASUSER", line 23 ORA-06512: at "GSMADMIN_INTERNAL.EXECUTEDDL", line 68 ORA-06512: at line 1 \(ngsmoci_execute\) " 2016-11-07T01:35:36.700593+08:00 GSM-40157: GSM response Id=72 Type=80. Status:"Success with info". 2016-11-07T01:35:37.010475+08:00 GSM-40135: Catalog request: "RUNTIME DATABASE" (40). Target -"sh2". Pool- - "". Id="74". Payload:"(PARAMETERS=(DDLID=33)(AUTORCV=0))" GSM-40135: Catalog request: "RUNTIME DATABASE" (40). Target -"sh1". Pool- - "". Id="75". Payload:"(PARAMETERS=(DDLID=33)(AUTORCV=0))" GSM-40157: GSM response Id=73 Type=80. Status:"Success". 2016-11-07T01:35:41.834561+08:00 07-NOV-2016 01:35:41 * service_update * SHARDDIRECTOR1 * 0 ……
其实,从这个log中,我们只是看到有ORA-01031 insufficient privileges的报错,但究竟是什么权限不足,我们还是得通过config shard命令来进一步看:
GDSCTL>config shard -shard sh1 Name: sh1 Shard Group: primary_shardgroup Status: Ok State: Deployed Region: region1 Connection string: sdb2:1521/sh1:dedicated SCAN address: ONS remote port: 0 Disk Threshold, ms: 20 CPU Threshold, %: 75 Version: 12.2.0.0 Last Failed DDL: alter database link "PRODUCTSD... DDL Error: ORA-01031: insufficient privileges ORA-06512: at "SYS.EXECASUSER", line 35 ORA-06512: at "SYS.DBMS_SYS_SQL", line 1434 ORA-06512: at "SYS.EXECASUSER", line 23 ORA-06512: at "GSMADMIN_INTERNAL.EXECUTEDDL", line 68 ORA-06512: at line 1 \(ngsmoci_execute\) Failed DDL id: 33 Availability: ONLINE Supported services ------------------------ Name Preferred Status ---- --------- ------ oltp_rw_srvc Yes Enabled GDSCTL>
我们看到这个last failed ddl,是alter database link的时候出现权限不足。
所以我们找到了原因,进行修复:
--在sh1上: [oracle12c@sdb2 ~]$ sqlplus "/ as sysdba" SQL*Plus: Release 12.2.0.0.2 Beta on Mon Nov 7 01:50:25 2016 Copyright (c) 1982, 2015, Oracle. All rights reserved. Connected to: Oracle Database 12c Enterprise Edition Release 12.2.0.0.2 - 64bit Beta With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options SQL> grant alter database link to app_schema; Grant succeeded. SQL> --在sh2上 [oracle12c@sdb3 ~]$ sqlplus "/ as sysdba" SQL*Plus: Release 12.2.0.0.2 Beta on Mon Nov 7 01:51:13 2016 Copyright (c) 1982, 2015, Oracle. All rights reserved. Connected to: Oracle Database 12c Enterprise Edition Release 12.2.0.0.2 - 64bit Beta With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options SQL> grant alter database link to app_schema; Grant succeeded. SQL> --在shardcat上,gsm_env下: GDSCTL>config shard Name Shard Group Status State Region Availability ---- ----------- ------ ----- ------ ------------ sh1 primary_shardgroup Ok DDL error region1 ONLINE sh2 primary_shardgroup Ok DDL error region1 ONLINE GDSCTL> GDSCTL>recover shard -shard sh1 GDSCTL> GDSCTL>recover shard -shard sh2 GDSCTL> GDSCTL>config shard Name Shard Group Status State Region Availability ---- ----------- ------ ----- ------ ------------ sh1 primary_shardgroup Ok Deployed region1 ONLINE sh2 primary_shardgroup Ok Deployed region1 ONLINE GDSCTL>
可以看到,在修复问题的基础上,运行recover shard -shard 之后,就恢复了正常。
注意需要注意的是,如果root cause没找到,不知道是database link的权限问题,仅仅运行recover shard是没有效果的。
另外我们也要注意,除了一开始给app_schema的相关权限(一开始赋予的权限,见『创建Oracle sharding database』),我们还要给app_schema于alter database link的权限。