Greenplum的gpload工具通过可读外部表和并行化文件服务器gpfdist(或gpfdists)来加载数据。gpload处理并行化的基于文件的外部表设置,以及允许我们使用单个YAML文件来配置数据格式,外部表定义,以及gpfdist或gpfdists。
要使用gpload工具有几个前提条件必须满足:
1. 使用gpload的服务器必须安装Python 2.6.2或者以上版本,pygresql工具和pyyaml工具(数据库服务端已经安装了python和需要的python库文件)
2. 必须装gpfdist程序,并把它设置到环境变量PATH中(可以从数据库服务器端的安装目录的子目录bin中拷贝该工具)
3. 确保使用gpload工具的ETL服务器与Greenplum所有服务器的联通性,以及机器名解析正确
gpload通过它的控制文件(YAML格式控制文件)指定装载的细节信息。所以编写控制文件是用好gpload的关键所在。gpload的控制文件采用YAML1.0.0.1文档格式编写,因此它必须得是有效的YAML格式。
查看gpload参数:
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- [gpadmin@db_m2_slave1 ~]$ gpload -? gpload [options] -f configuration file
Options: -h hostname: host to connect to -p port: port to connect to -U username: user to connect as -d database: database to connect to -W: force password authentication -q: quiet mode -D: do not actually load data -v: verbose -V: very verbose -l logfile: log output to logfile --no_auto_trans: do not wrap gpload in transaction --gpfdist_timeout timeout: gpfdist timeout value --version: print version number and exit -?: help
[gpadmin@db_m2_slave1 ~]$ |
创建序列
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- create SEQUENCE gpload_audit_seq INCREMENT BY 1 MINVALUE 1 NO MAXVALUE START WITH 1; |
创建审计表,记录每一次gpload执行的开始和结束事件信息
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- create table gpload_audit( |
创建gpload加载的表
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- CREATE TABLE expenses |
建立外部表:
create writable external table t02 ( id integer, name varchar(128) ) location ( 'gpfdist://101.254.31.72:8090/gpextdata/t21.txt', 'gpfdist://101.254.31.72:8090/gpextdata/t22.txt' ) Format 'TEXT' (delimiter as E'|' null as '' escape 'OFF') ; |
创建gpload.yml配置文件
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- [gpadmin@db_m2_slave1 ~]$ more gpload.yml --- VERSION: 1.0.0.1 DATABASE: yueworld_db USER: mch HOST: 10.254.2.111 PORT: 5432 GPLOAD: INPUT: - SOURCE: LOCAL_HOSTNAME: - 10.254.3.72 PORT: 8090 FILE: - /home/gpadmin/gpdextdata/t01.txt - /home/gpadmin/gpdextdata/t02.txt
- COLUMNS: - id: int - name: text - FORMAT: text - DELIMITER: ',' - ERROR_LIMIT: 25 OUTPUT: - TABLE: t01 - MODE: INSERT SQL: - BEFORE: "INSERT INTO gpload_audit VALUES(nextval('gpload_audit_seq'), 'start', 'insert', 'zhangyun_schema.expenses', current_timestamp)" - AFTER: "INSERT INTO gpload_audit VALUES(nextval('gpload_audit_seq'), 'end', 'insert', 'zhangyun_schema.expenses', current_timestamp)" [gpadmin@db_m2_slave1 ~]$ |
在greenplum db库上赋予账号权限:
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- yueworld_db=# grant select,insert on gpload_audit to mch; GRANT yueworld_db=# grant select,insert,update,delete on t01 to mch; GRANT yueworld_db=# yueworld_db=# grant select,usage,update on gpload_audit_seq to mch; GRANT yueworld_db=# grant all on table t02 to mch; GRANT yueworld_db=# |
去gpload服务器上,执行gpload:
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- [gpadmin@db_m2_slave1 ~]$ gpload -f g2.yml 2017-05-15 17:52:30|INFO|gpload session started 2017-05-15 17:52:30 2017-05-15 17:52:30|INFO|setting schema 'public' for table 't02' 2017-05-15 17:52:30|INFO|started gpfdist -p 8090 -P 8091 -f "/data/greenplum/gpextdata/t21.txt /data/greenplum/gpextdata/t22.txt" -t 30 2017-05-15 17:52:30|INFO|running time: 0.29 seconds 2017-05-15 17:52:30|INFO|rows Inserted = 4 2017-05-15 17:52:30|INFO|rows Updated = 0 2017-05-15 17:52:30|INFO|data formatting errors = 0 2017-05-15 17:52:30|INFO|gpload succeeded [gpadmin@db_m2_slave1 ~]$ |
在greenplum服务器上查询
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- yueworld_db=# select * from t02; ERROR: it is not possible to read from a WRITABLE external table. HINT: Create the table as READABLE instead yueworld_db=# |
奇怪的是,当你再执行一次gpload -f y2.yml的时候,会持续往外部表里面写数据,而且是将原有的数据重新写入到t02里面去,而且执行第三次的时候,这个rows Inserted会变成16,如下看执行过程:
[gpadmin@db_m2_slave1 ~]$ gpload -f g2.yml 2017-05-15 17:52:30|INFO|gpload session started 2017-05-15 17:52:30 2017-05-15 17:52:30|INFO|setting schema 'public' for table 't02' 2017-05-15 17:52:30|INFO|started gpfdist -p 8090 -P 8091 -f "/data/greenplum/gpextdata/t21.txt /data/greenplum/gpextdata/t22.txt" -t 30 2017-05-15 17:52:30|INFO|running time: 0.29 seconds 2017-05-15 17:52:30|INFO|rows Inserted = 8 2017-05-15 17:52:30|INFO|rows Updated = 0 2017-05-15 17:52:30|INFO|data formatting errors = 0 2017-05-15 17:52:30|INFO|gpload succeeded [gpadmin@db_m2_slave1 ~]$ |