IT博客汇
  • 首页
  • 精华
  • 技术
  • 设计
  • 资讯
  • 扯淡
  • 权利声明
  • 登录 注册

    [原]GreenPlum 可读写外部表 实战

    mchdba发表于 2017-05-19 17:30:42
    love 0

    Greenplum的gpload工具通过可读外部表和并行化文件服务器gpfdist(或gpfdists)来加载数据。gpload处理并行化的基于文件的外部表设置,以及允许我们使用单个YAML文件来配置数据格式,外部表定义,以及gpfdist或gpfdists。


    要使用gpload工具有几个前提条件必须满足:
    1.    使用gpload的服务器必须安装Python 2.6.2或者以上版本,pygresql工具和pyyaml工具(数据库服务端已经安装了python和需要的python库文件)
    2.    必须装gpfdist程序,并把它设置到环境变量PATH中(可以从数据库服务器端的安装目录的子目录bin中拷贝该工具)
    3.    确保使用gpload工具的ETL服务器与Greenplum所有服务器的联通性,以及机器名解析正确


    gpload通过它的控制文件(YAML格式控制文件)指定装载的细节信息。所以编写控制文件是用好gpload的关键所在。gpload的控制文件采用YAML1.0.0.1文档格式编写,因此它必须得是有效的YAML格式。

     

    查看gpload参数:

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    [gpadmin@db_m2_slave1 ~]$ gpload -?

    gpload [options] -f configuration file

     

    Options:

        -h hostname: host to connect to

        -p port: port to connect to

        -U username: user to connect as

        -d database: database to connect to

        -W: force password authentication

        -q: quiet mode

        -D: do not actually load data

        -v: verbose

        -V: very verbose

        -l logfile: log output to logfile

        --no_auto_trans: do not wrap gpload in transaction

        --gpfdist_timeout timeout: gpfdist timeout value

        --version: print version number and exit

        -?: help

     

    [gpadmin@db_m2_slave1 ~]$                                                       

     

    创建序列

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    create SEQUENCE gpload_audit_seq INCREMENT BY 1 MINVALUE 1 NO MAXVALUE START WITH 1;
    select * from gpload_audit_seq ;

     

     

    创建审计表,记录每一次gpload执行的开始和结束事件信息

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    create table gpload_audit(                                                               
        id bigint,
        state text,
        mode text,
        tablename text,
        updatetime timestamp
    ) distributed by (id);

     

     

    创建gpload加载的表

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    CREATE TABLE expenses
    (
        name text,
        amount numeric,
        category text,
        des text,
        update_date date
    );

     

    建立外部表:

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    create writable  external table t02 (

    id integer,

    name varchar(128)

    )

    location (

    'gpfdist://101.254.31.72:8090/gpextdata/t21.txt',

    'gpfdist://101.254.31.72:8090/gpextdata/t22.txt'

    )

    Format 'TEXT' (delimiter as E'|' null as '' escape 'OFF')

    ;

     

     

    创建gpload.yml配置文件

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

     [gpadmin@db_m2_slave1 ~]$ more gpload.yml

    --- 

    VERSION: 1.0.0.1 

    DATABASE: yueworld_db 

    USER: mch 

    HOST: 10.254.2.111 

    PORT: 5432 

    GPLOAD: 

      INPUT: 

        - SOURCE: 

            LOCAL_HOSTNAME: 

              - 10.254.3.72 

            PORT: 8090 

            FILE: 

              - /home/gpadmin/gpdextdata/t01.txt

              - /home/gpadmin/gpdextdata/t02.txt

                  

        - COLUMNS: 

            - id: int 

            - name: text

        - FORMAT: text 

        - DELIMITER: ',' 

        - ERROR_LIMIT: 25 

      OUTPUT: 

        - TABLE: t01

        - MODE: INSERT 

      SQL:

        - BEFORE: "INSERT INTO gpload_audit VALUES(nextval('gpload_audit_seq'), 'start', 'insert', 'zhangyun_schema.expenses', current_timestamp)"

        - AFTER: "INSERT INTO gpload_audit VALUES(nextval('gpload_audit_seq'), 'end', 'insert', 'zhangyun_schema.expenses', current_timestamp)"

    [gpadmin@db_m2_slave1 ~]$

     

    在greenplum db库上赋予账号权限:

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    yueworld_db=# grant select,insert on gpload_audit to mch;

    GRANT

    yueworld_db=# grant select,insert,update,delete on t01 to mch;

    GRANT

    yueworld_db=#

    yueworld_db=# grant select,usage,update on gpload_audit_seq to mch;

    GRANT

    yueworld_db=# grant all on table t02 to mch;

    GRANT

    yueworld_db=#

     

     

     

    去gpload服务器上,执行gpload:

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    [gpadmin@db_m2_slave1 ~]$ gpload -f g2.yml

    2017-05-15 17:52:30|INFO|gpload session started 2017-05-15 17:52:30

    2017-05-15 17:52:30|INFO|setting schema 'public' for table 't02'

    2017-05-15 17:52:30|INFO|started gpfdist -p 8090 -P 8091 -f "/data/greenplum/gpextdata/t21.txt /data/greenplum/gpextdata/t22.txt" -t 30

    2017-05-15 17:52:30|INFO|running time: 0.29 seconds

    2017-05-15 17:52:30|INFO|rows Inserted          = 4

    2017-05-15 17:52:30|INFO|rows Updated           = 0

    2017-05-15 17:52:30|INFO|data formatting errors = 0

    2017-05-15 17:52:30|INFO|gpload succeeded

    [gpadmin@db_m2_slave1 ~]$

     

     

    在greenplum服务器上查询

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    yueworld_db=# select * from t02;

    ERROR:  it is not possible to read from a WRITABLE external table.

    HINT:  Create the table as READABLE instead

    yueworld_db=#

     

     

    奇怪的是,当你再执行一次gpload -f y2.yml的时候,会持续往外部表里面写数据,而且是将原有的数据重新写入到t02里面去,而且执行第三次的时候,这个rows Inserted会变成16,如下看执行过程:

    [gpadmin@db_m2_slave1 ~]$ gpload -f g2.yml

    2017-05-15 17:52:30|INFO|gpload session started 2017-05-15 17:52:30

    2017-05-15 17:52:30|INFO|setting schema 'public' for table 't02'

    2017-05-15 17:52:30|INFO|started gpfdist -p 8090 -P 8091 -f "/data/greenplum/gpextdata/t21.txt /data/greenplum/gpextdata/t22.txt" -t 30

    2017-05-15 17:52:30|INFO|running time: 0.29 seconds

    2017-05-15 17:52:30|INFO|rows Inserted          = 8

    2017-05-15 17:52:30|INFO|rows Updated           = 0

    2017-05-15 17:52:30|INFO|data formatting errors = 0

    2017-05-15 17:52:30|INFO|gpload succeeded

    [gpadmin@db_m2_slave1 ~]$

     



沪ICP备19023445号-2号
友情链接