
    [Original] Oozie 4.2.0 Installation and Configuration in Practice

    Posted by fansy1990 on 2016-01-23 23:44:39

    Software versions:

    Oozie 4.2.0, Hadoop 2.6.0, Spark 1.4.1, Hive 0.14, Pig 0.15.0, Maven 3.2, JDK 1.7, ZooKeeper 3.4.6, HBase 1.1.2, MySQL 5.6

    Cluster deployment:

    node1~4.centos.com (node1~4), 192.168.0.31~34; four virtual machines, each with 1 GB of RAM and 1 core

    node1:NameNode 、ResourceManager;

    node2:SecondaryNameNode、Master、HMaster、HistoryServer、JobHistoryServer

    node3:oozie-server(tomcat)、DataNode、NodeManager、HRegionServer、Worker、QuorumPeerMain

    node4:DataNode、NodeManager、HRegionServer、Worker、Pig client、Hive Client、HiveServer2、QuorumPeerMain、mysql

    1. Building Oozie 4.2.0

    This section draws on http://oozie.apache.org/docs/4.2.0/DG_QuickStart.html#Building_Oozie and http://blog.csdn.net/u014729236/article/details/47188631

    1.1 Build environment preparation

    1) Download oozie-4.2.0.tar.gz and extract it to /usr/local/oozie
    2) Edit pom.xml so the bundled Tomcat download URL points at Tomcat 7 rather than Tomcat 6:
    /usr/local/oozie/oozie-4.2.0/distro/pom.xml
    <get src="http://archive.apache.org/dist/tomcat/tomcat-6    ==>
    <get src="http://archive.apache.org/dist/tomcat/tomcat-7

    3) Edit Maven's settings.xml to use the OSChina mirror repository (a combined sketch of steps 2 and 3 follows this list):
    <mirror>
          <id>nexus-osc</id>
          <name>OSChina Central</name>                                                                             
          <url>http://maven.oschina.net/content/groups/public/</url>
          <mirrorOf>*</mirrorOf>
    </mirror>
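
    A minimal sketch of steps 2) and 3) above; the sed one-liner and the settings.xml locations are assumptions, not part of the original post:
    # 2) switch the bundled Tomcat download in distro/pom.xml from the 6.x to the 7.x archive
    sed -i 's#tomcat/tomcat-6#tomcat/tomcat-7#g' /usr/local/oozie/oozie-4.2.0/distro/pom.xml
    # 3) the OSChina <mirror> block goes inside the <mirrors> section of Maven's settings.xml,
    #    typically ~/.m2/settings.xml or $MAVEN_HOME/conf/settings.xml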

    1.2 Build

    From the extracted Oozie directory, run:
    bin/mkdistro.sh -DskipTests -Phadoop-2 -Dhadoop.auth.version=2.6.0 -Ddistcp.version=2.6.0 -Dspark.version=1.4.1 -Dpig.version=0.15.0 -Dtomcat.version=7.0.52 
    
    Adding HBase or Hive at a newer version makes the build fail, for example:
    #bin/mkdistro.sh -DskipTests -Phadoop-2 -Dhadoop.auth.version=2.6.0 -Ddistcp.version=2.6.0 -Dspark.version=1.4.1 -Dpig.version=0.15.0 -Dtomcat.version=7.0.52 #-Dhive.version=0.14.0 -Dhbase.version=1.1.2 ## pinning Hive and HBase to these newer versions does not compile

    1.3 Modify the HDFS configuration

     Edit Hadoop's core-site.xml and add the following:
    <property>
        <name>hadoop.proxyuser.[USER].hosts</name>
        <value>*</value>
      </property>
      <property>
        <name>hadoop.proxyuser.[USER].groups</name>
        <value>*</value>
      </property>
    Here [USER] must be replaced with the user that will later start the Oozie Tomcat (root in this deployment).
    To apply the change without restarting the Hadoop cluster:
    hdfs dfsadmin -refreshSuperUserGroupsConfiguration
    yarn rmadmin -refreshSuperUserGroupsConfiguration
      

    1.4 Configure Oozie

    (Oozie is deployed on node3, so copy the tarball below to node3.)

    1) Locate the distribution tarball:
    oozie-4.2.0/distro/target/oozie-4.2.0-distro.tar.gz
    2) Extract it:
    tar -zxf oozie-4.2.0-distro.tar.gz

    3) Create a libext directory under oozie-4.2.0, copy ext-2.2.zip into it, and copy the Hadoop jars into it as well:
    cp $HADOOP_HOME/share/hadoop/*/*.jar libext/
    cp $HADOOP_HOME/share/hadoop/*/lib/*.jar libext/

    Rename the Hadoop jars that conflict with Tomcat so they are not loaded:
    mv servlet-api-2.5.jar servlet-api-2.5.jar.bak
    mv jsp-api-2.1.jar jsp-api-2.1.jar.bak
    mv jasper-compiler-5.5.23.jar jasper-compiler-5.5.23.jar.bak
    mv jasper-runtime-5.5.23.jar jasper-runtime-5.5.23.jar.bak

    Copy the MySQL JDBC driver into the same directory (MySQL is used instead of the default Derby database):
    scp mysql-connector-java-5.1.25-bin.jar node3:/usr/oozie/oozie-4.2.0/libext/

    4) Configure the database connection in conf/oozie-site.xml:
    <property>
        <name>oozie.service.JPAService.create.db.schema</name>
        <value>true</value>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.driver</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.url</name>
        <value>jdbc:mysql://node4:3306/oozie?createDatabaseIfNotExist=true</value>
    </property>
    
    <property>
        <name>oozie.service.JPAService.jdbc.username</name>
        <value>root</value>
    </property>
    
    <property>
        <name>oozie.service.JPAService.jdbc.password</name>
        <value>root</value>
    </property>
    <property>
        <name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
        <value>*=/usr/hadoop/hadoop-2.6.0/etc/hadoop</value>
    </property>
    


    The last property is required; without it, workflows later fail with the error: File /user/root/share/lib does not exist
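
    The JDBC URL above includes createDatabaseIfNotExist=true, so the oozie database is created automatically on first use. If you prefer to create it explicitly on node4 beforehand, a sketch (using the root/root credentials from the config above) is:
    mysql -h node4 -u root -proot -e "CREATE DATABASE IF NOT EXISTS oozie DEFAULT CHARACTER SET utf8;"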


    5) Initialization before starting
    a. Build the war:
    bin/oozie-setup.sh prepare-war

    b. Initialize the database:
    bin/ooziedb.sh create -sqlfile oozie.sql -run


    c. Edit oozie-4.2.0/oozie-server/conf/server.xml and comment out the following line:
    <!--<Listener className="org.apache.catalina.mbeans.ServerLifecycleListener" />-->

    d. Upload the sharelib jars to HDFS:
    bin/oozie-setup.sh sharelib create -fs hdfs://node1:8020 
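
    To confirm the sharelib reached HDFS (assuming Oozie runs as root, so the default location is /user/root/share/lib):
    hdfs dfs -ls /user/root/share/lib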

    1.5 Start the server

    bin/oozied.sh start
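
    To verify the server is up, the Oozie admin status check should report NORMAL:
    bin/oozie admin -oozie http://node3:11000/oozie -status
    # expected output: System mode : NORMAL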



    2. Workflow examples

    The data is bank.csv, already uploaded to hdfs://node1:8020/user/root/bank.csv; the file can be downloaded from the tutorial page at http://zeppelin-project.org/docs/tutorial/tutorial.html
    (Delete the header row before running the Hive and Pig jobs.)
    All operations below run as the root user; for a different user, adjust the paths accordingly.
    Set the environment variable: export OOZIE_URL=http://node3:11000/oozie
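
    A minimal preparation sketch, assuming bank.csv has already been downloaded to the local working directory:
    sed -i '1d' bank.csv                            # drop the header row (needed for the Hive and Pig jobs)
    hdfs dfs -put -f bank.csv /user/root/bank.csv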

    2.1 MapReduce workflow

    1. job.properties :
    oozie.wf.application.path=hdfs://node1:8020/user/root/workflow/mr_demo/wf
    # Hadoop ResourceManager
    jobTracker=node1:8032
    # Hadoop fs.default.name
    nameNode=hdfs://node1:8020/
    # Hadoop mapred.queue.name
    queueName=default
    

    2. workflow.xml
    <workflow-app xmlns="uri:oozie:workflow:0.2" name="map-reduce-wf">
        <start to="mr-node"/>
        <action name="mr-node">
            <map-reduce>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <prepare>
                    <delete path="${nameNode}/user/${wf:user()}/workflow/mr_demo/output"/>
                </prepare>
                <configuration>
                    <property>
                        <name>mapred.job.queue.name</name>
                        <value>${queueName}</value>
                    </property>
                    <property>
                        <name>mapreduce.mapper.class</name>
                        <value>org.apache.hadoop.examples.WordCount$TokenizerMapper</value>
                    </property>
                    <property>
                        <name>mapreduce.reducer.class</name>
                        <value>org.apache.hadoop.examples.WordCount$IntSumReducer</value>
                    </property>
                    <property>
                        <name>mapred.map.tasks</name>
                        <value>1</value>
                    </property>
                    <property>
                        <name>mapred.input.dir</name>
                        <value>/user/${wf:user()}/bank.csv</value>
                    </property>
                    <property>
                        <name>mapred.output.dir</name>
                        <value>/user/${wf:user()}/workflow/mr_demo/output</value>
                    </property>
                </configuration>
            </map-reduce>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
        </kill>
        <end name="end"/>
    </workflow-app>
    

    3. Run:
    1) Copy workflow.xml to hdfs://node1:8020/user/root/workflow/mr_demo/wf/workflow.xml on HDFS;
    2) On node3 (node3 acts as both the Oozie server and the client), run bin/oozie job -config job.properties -run to submit the job; submission returns a job ID, for example:
    0000004-160123180442501-oozie-root-W
    3) Run bin/oozie job -info 0000004-160123180442501-oozie-root-W to check the workflow status;
    4) When the workflow finishes, check its status and look in the output directory for the results;
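
    The same steps as a shell sketch on node3 (the job ID below is the example one; yours will differ):
    hdfs dfs -mkdir -p /user/root/workflow/mr_demo/wf
    hdfs dfs -put -f workflow.xml /user/root/workflow/mr_demo/wf/
    bin/oozie job -config job.properties -run                  # prints the job ID
    bin/oozie job -info 0000004-160123180442501-oozie-root-W
    hdfs dfs -cat /user/root/workflow/mr_demo/output/part-*    # inspect the result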

    2.2 Pig workflow

    1. job.properties
    oozie.wf.application.path=hdfs://node1:8020/user/root/workflow/pig_demo/wf
    # required for the Pig workflow
    oozie.use.system.libpath=true
    # Hadoop ResourceManager
    resourceManager=node1:8032
    # Hadoop fs.default.name
    nameNode=hdfs://node1:8020/
    # Hadoop mapred.queue.name
    queueName=default

    2.  workflow.xml
    <workflow-app xmlns="uri:oozie:workflow:0.2"
    name="whitehouse-workflow">
    <start to="transform_job"/>
    	<action name="transform_job">
    		<pig>
    			<job-tracker>${resourceManager}</job-tracker>
    			<name-node>${nameNode}</name-node>
    			<prepare>
    				<delete path="/user/root/workflow/pig_demo/output"/>
    			</prepare>
    			<script>transform_job.pig</script>
    		</pig>
    		<ok to="end"/>
    		<error to="fail"/>
    	</action>
    	<kill name="fail">
    		<message>Job failed, error
    			message[${wf:errorMessage(wf:lastErrorNode())}]
    		</message>
    	</kill>
    	<end name="end"/>
    </workflow-app>
    

    3. transform_job.pig, the Pig script used by the job:
    bank_data= LOAD '/user/root/bank.csv' USING PigStorage(';') AS
    (age:int, job:chararray, marital:chararray,education:chararray,
     default:chararray,balance:int,housing:chararray,loan:chararray,
    contact:chararray,day:int,month:chararray,duration:int,campaign:int,
    pdays:int,previous:int,poutcom:chararray,y:chararray);
    
    age_gt_30 = FILTER bank_data BY age >= 30;
    
    store age_gt_30 into '/user/root/workflow/pig_demo/output' using PigStorage(',');
    
    4. Run
    1) Copy transform_job.pig and workflow.xml to the hdfs://node1:8020/user/root/workflow/pig_demo/wf/ directory
    2) Run bin/oozie job -config job.properties -run
    3) Run bin/oozie job -info jobId to check the job's progress, or view all jobs in the browser at node3:11000;
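
    The same steps as a shell sketch:
    hdfs dfs -mkdir -p /user/root/workflow/pig_demo/wf
    hdfs dfs -put -f workflow.xml transform_job.pig /user/root/workflow/pig_demo/wf/
    bin/oozie job -config job.properties -run
    bin/oozie job -info <jobId>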

    2.3 Hive workflow

    Note: after the Hive job completes, bank.csv is removed from its original location (LOAD DATA INPATH moves it into the Hive warehouse directory), so re-upload the file before running other jobs or rerunning this one.
    1. job.properties
    nameNode=hdfs://node1:8020
    jobTracker=node1:8032
    queueName=default
    maxAge=30
    input=/user/root/bank.csv
    output=/user/root/workflow/hive_demo/output
    oozie.use.system.libpath=true
    
    oozie.wf.application.path=${nameNode}/user/${user.name}/workflow/hive_demo/wf
    2. workflow.xml
    <workflow-app xmlns="uri:oozie:workflow:0.2" name="hive-wf">
        <start to="hive-node"/>
    
        <action name="hive-node">
            <hive xmlns="uri:oozie:hive-action:0.2">
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <prepare>
                    <delete path="${output}/hive"/>
                    <mkdir path="${output}"/>
                </prepare>
                <configuration>
                    <property>
                        <name>mapred.job.queue.name</name>
                        <value>${queueName}</value>
                    </property>
                </configuration>
                <script>script.hive</script>
                <param>INPUT=${input}</param>
                <param>OUTPUT=${output}/hive</param>
           		<param>maxAge=${maxAge}</param>
    	 </hive>
            <ok to="end"/>
            <error to="fail"/>
        </action>
    
        <kill name="fail">
            <message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
        </kill>
        <end name="end"/>
    </workflow-app>
    
    3. script.hive, the Hive script used by the job:
    DROP TABLE IF EXISTS bank;
    
    CREATE TABLE bank(
    	age int,
    	job string,
    	marital string,education string,
     default string,balance int,housing string,loan string,
    contact string,day int,month string,duration int,campaign int,
    pdays int,previous int,poutcom string,y string
    ) 
     ROW FORMAT DELIMITED FIELDS TERMINATED BY '\073'
     STORED AS TEXTFILE;
    
     LOAD DATA INPATH '${INPUT}' INTO TABLE bank;
    
    INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT * FROM bank where age > '${maxAge}';
    Note: '\073' is the octal escape for the semicolon field delimiter.
    4. Run (same procedure as above).
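
    A run sketch; note the re-upload of bank.csv, since LOAD DATA INPATH moves the file away (see the note at the top of this subsection):
    hdfs dfs -put -f bank.csv /user/root/bank.csv          # the Hive job consumes this file
    hdfs dfs -mkdir -p /user/root/workflow/hive_demo/wf
    hdfs dfs -put -f workflow.xml script.hive /user/root/workflow/hive_demo/wf/
    bin/oozie job -config job.properties -run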


    2.4 Hive2 (HiveServer2) workflow

    1. job.properties 
    nameNode=hdfs://node1:8020
    jobTracker=node1:8032
    queueName=default
    # JDBC URL of HiveServer2; required when using the hive2 action
    jdbcURL=jdbc:hive2://node4:10000/default
    maxAge=30
    input=/user/root/bank.csv
    output=/user/root/workflow/hive2_demo/output
    oozie.use.system.libpath=true
    
    oozie.wf.application.path=${nameNode}/user/${user.name}/workflow/hive2_demo/wf

    2. workflow.xml 
    <workflow-app xmlns="uri:oozie:workflow:0.5" name="hive2-wf">
        <start to="hive2-node"/>
    
        <action name="hive2-node">
            <hive2 xmlns="uri:oozie:hive2-action:0.1">
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <prepare>
                    <delete path="${output}/hive"/>
                    <mkdir path="${output}"/>
                </prepare>
                <configuration>
                    <property>
                        <name>mapred.job.queue.name</name>
                        <value>${queueName}</value>
                    </property>
                </configuration>
    
    	    <jdbc-url>${jdbcURL}</jdbc-url>
                <script>script2.hive</script>
                <param>INPUT=${input}</param>
                <param>OUTPUT=${output}/hive</param>
           		<param>maxAge=${maxAge}</param>
    	 </hive2>
            <ok to="end"/>
            <error to="fail"/>
        </action>
    
        <kill name="fail">
            <message>Hive2 failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
        </kill>
        <end name="end"/>
    </workflow-app>
    
    3. script2.hive, the script used by the hive2 action:
    DROP TABLE IF EXISTS bank2;
    
    CREATE TABLE bank2(
    	age int,
    	job string,
    	marital string,education string,
     default string,balance int,housing string,loan string,
    contact string,day int,month string,duration int,campaign int,
    pdays int,previous int,poutcom string,y string
    ) 
     ROW FORMAT DELIMITED FIELDS TERMINATED BY '\073'
     STORED AS TEXTFILE;
    
     LOAD DATA INPATH '${INPUT}' INTO TABLE bank2;
    
    INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT * FROM bank2 where age > '${maxAge}';
    

    4. Run (same procedure as above).
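
    A run sketch; the beeline connectivity check is an assumption (not part of the original post) but a convenient way to confirm HiveServer2 on node4 is reachable before submitting:
    beeline -u jdbc:hive2://node4:10000/default -n root -e "show databases;"
    hdfs dfs -put -f bank.csv /user/root/bank.csv
    hdfs dfs -mkdir -p /user/root/workflow/hive2_demo/wf
    hdfs dfs -put -f workflow.xml script2.hive /user/root/workflow/hive2_demo/wf/
    bin/oozie job -config job.properties -run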

    2.5 Spark workflow

    1. job.properties :
    nameNode=hdfs://node1:8020
    jobTracker=node1:8032
    #master=spark://node2:7077 
    master=spark://node2:6066
    sparkMode=cluster
    queueName=default
    oozie.use.system.libpath=true
    input=/user/root/bank.csv
    output=/user/root/workflow/spark_demo/output
    # the application jar, placed under the workflow's lib/ directory on HDFS
    jarPath=${nameNode}/user/root/workflow/spark_demo/lib/oozie-examples.jar
    oozie.wf.application.path=${nameNode}/user/${user.name}/workflow/spark_demo/wf
    Because sparkMode is cluster, the master URL must use the REST submission port 6066 shown above;
    sparkMode=client could not be made to work in these tests.

    2. workflow.xml
    <workflow-app xmlns='uri:oozie:workflow:0.5' name='SparkFileCopy'>
        <start to='spark-node' />
    
        <action name='spark-node'>
            <spark xmlns="uri:oozie:spark-action:0.1">
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <prepare>
                    <delete path="${output}"/>
                </prepare>
                <master>${master}</master>
            <mode>${sparkMode}</mode>   
                <name>Spark-FileCopy</name>
    	 <class>org.apache.oozie.example.SparkFileCopy</class>
                <jar>${jarPath}</jar>
                <arg>${input}</arg>
                <arg>${output}</arg>
            </spark>
            <ok to="end" />
            <error to="fail" />
        </action>
    
        <kill name="fail">
            <message>Workflow failed, error
                message[${wf:errorMessage(wf:lastErrorNode())}]
            </message>
        </kill>
        <end name='end' />
    </workflow-app>
    

    3. Run:
    1) The oozie-examples.jar used here comes from the examples/apps/spark/lib directory after extracting oozie-examples.tar.gz
    2) Upload oozie-examples.jar to hdfs://node1:8020/user/root/workflow/spark_demo/lib/oozie-examples.jar, and upload workflow.xml to hdfs://node1:8020/user/root/workflow/spark_demo/wf/workflow.xml;
    3) Run bin/oozie job -config job.properties -run;
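
    The upload-and-run steps as a shell sketch:
    hdfs dfs -mkdir -p /user/root/workflow/spark_demo/wf /user/root/workflow/spark_demo/lib
    hdfs dfs -put -f oozie-examples.jar /user/root/workflow/spark_demo/lib/
    hdfs dfs -put -f workflow.xml /user/root/workflow/spark_demo/wf/
    bin/oozie job -config job.properties -run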

    4. Notes:
    1) Submitted this way, the job is started through YARN and then handed to the Spark standalone cluster to run; it is not run by the Spark cluster directly, as shown below:
    First, the YARN-launched job can be seen in the 8088 UI:

    Then the job also appears in the Spark monitoring UI:


    But the timestamps do not add up until you look at the logs:

    They show that after connecting to the YARN ResourceManager, the launcher connects directly to the Spark master and submits the job; the YARN job then immediately reports success and YARN returns;
    the Spark logs show matching timestamps:

    Finally the output file is saved and the driver is shut down:

    2.6 Spark on YARN workflow

    Following the guidance in the official documentation:


    1. job.properties:
    nameNode=hdfs://node1:8020
    jobTracker=node1:8032
    #master=spark://node2:7077
    #master=spark://node2:6066
    master=yarn-cluster
    #sparkMode=cluster
    queueName=default
    oozie.use.system.libpath=true
    input=/user/root/bank.csv
    output=/user/root/workflow/sparkonyarn_demo/output
    
    jarPath=${nameNode}/user/root/workflow/sparkonyarn_demo/lib/oozie-examples.jar
    oozie.wf.application.path=${nameNode}/user/${user.name}/workflow/sparkonyarn_demo
    2. workflow.xml:
    <workflow-app xmlns='uri:oozie:workflow:0.5' name='SparkFileCopy_on_yarn'>
        <start to='spark-node' />
    
        <action name='spark-node'>
            <spark xmlns="uri:oozie:spark-action:0.1">
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <prepare>
                    <delete path="${output}"/>
                </prepare>
                <master>${master}</master>
                <name>Spark-FileCopy-on-yarn</name>
    	 <class>org.apache.oozie.example.SparkFileCopy</class>
                <jar>${jarPath}</jar>
                <spark-opts>--conf spark.yarn.historyServer.address=http://node2:18080 --conf spark.eventLog.dir=hdfs://node1:8020/spark-log --conf spark.eventLog.enabled=true</spark-opts>
    		<arg>${input}</arg>
                <arg>${output}</arg>
            </spark>
            <ok to="end" />
            <error to="fail" />
        </action>
    
        <kill name="fail">
            <message>Workflow failed, error
                message[${wf:errorMessage(wf:lastErrorNode())}]
            </message>
        </kill>
        <end name='end' />
    </workflow-app>

    3. Run:
    1) Preparation: copy workflow.xml to hdfs://node1:8020/user/root/workflow/sparkonyarn_demo/workflow.xml
    2) Copy oozie-examples.jar to hdfs://node1:8020/user/root/workflow/sparkonyarn_demo/lib/oozie-examples.jar
    3) Copy $SPARK_HOME/lib/spark-assembly-1.4.1-hadoop2.6.0.jar to hdfs://node1:8020/user/root/workflow/sparkonyarn_demo/lib/spark-assembly-1.4.1-hadoop2.6.0.jar
    4) Run bin/oozie job -config job.properties -run
    5) Check the job status:
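
    A shell sketch of steps 1) through 5):
    hdfs dfs -mkdir -p /user/root/workflow/sparkonyarn_demo/lib
    hdfs dfs -put -f workflow.xml /user/root/workflow/sparkonyarn_demo/
    hdfs dfs -put -f oozie-examples.jar /user/root/workflow/sparkonyarn_demo/lib/
    hdfs dfs -put -f $SPARK_HOME/lib/spark-assembly-1.4.1-hadoop2.6.0.jar /user/root/workflow/sparkonyarn_demo/lib/
    bin/oozie job -config job.properties -run
    bin/oozie job -info <jobId>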

    4. Notes

    1) Difference between the Spark (standalone master) submission and the Spark-on-YARN submission:
    Spark on YARN also submits through YARN, but there is no separate Spark-cluster job at all; everything runs on YARN. Compare the logs:
    In the 8088 UI:


    For the standalone submission there is only one YARN application, 0000003-160123180442501-oozie-root-W, plus a Spark application (visible at node2:8080); the timestamps match up.
    With the Spark-on-YARN submission,

    job 0000009-160123180442501-oozie-root-W is actually made up of two YARN applications.

    The Oozie job log shows the same picture:

    So with the Spark (standalone) submission, YARN launches the job, the Spark cluster runs it, and then it finishes; the Spark standalone cluster must be running (as must YARN).
    With Spark on YARN, YARN launches job A, which in turn launches another YARN job B; when B finishes, control returns to A and A finishes. The Spark standalone cluster does not need to be running (as the figure showed).






    Share, grow, enjoy

    Stay grounded, stay focused

    When reposting, please credit the original blog: http://blog.csdn.net/fansy1990



