
    [Original] Generating a unique ID for every line of a large data file

    Posted by linger2012liu on 2015-06-09 19:42:23

    Four main approaches:

    1 Single-threaded processing

    2 Plain multithreading

    3 Hive

    4 Hadoop
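As a baseline for comparison, approach 1 can be sketched in a few lines of plain Java. This is only an illustration, not code from the original post: the class name `LineNumberer`, the method `numberLines`, and the tab-separated output format are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper for illustration; not part of the original post.
class LineNumberer {

    // Streams the input line by line and writes "<id>\t<line>\n" for each one.
    // Returns the number of lines processed. Ids start at 1.
    static long numberLines(Reader in, Writer out) {
        long id = 0; // long, so very large files do not overflow an int
        try {
            BufferedReader reader = new BufferedReader(in);
            String line;
            while ((line = reader.readLine()) != null) {
                id++;
                out.write(id + "\t" + line + "\n");
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return id;
    }

    public static void main(String[] args) {
        // In-memory input keeps the sketch self-contained; a real run would
        // pass a file reader and a file writer instead.
        StringWriter out = new StringWriter();
        long count = numberLines(new StringReader("apple\nbanana\ncherry\n"), out);
        System.out.print(out);
        System.out.println(count + " lines numbered");
    }
}
```

A single counter in a single thread gives unique, gap-free ids trivially; the cost is that the whole file passes through one process, which is what motivates the other three approaches.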

    Some references I found:


    Notes on "Hadoop in Action", part 2: Hadoop input and output

    https://book.douban.com/annotation/17068812/

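    TextInputFormat: the key is the byte offset of the line within the file, and the value is the whole line.

    However, this offset is relative to the start of a single file, not global across the whole input, so the same key can show up in different files.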
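To make the per-file offset problem concrete, here is a small plain-Java sketch with no Hadoop dependency. It computes line-start byte offsets the way they are described above and shows one possible workaround: prefixing each offset with the file name. The class `OffsetKeys`, its methods, and the `name:offset` key format are assumptions for illustration only. In a real mapper the file name would typically come from the input split (for example via `FileSplit.getPath()`), which this sketch does not exercise.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the original post.
class OffsetKeys {

    // Byte offset at which each '\n'-terminated line starts within one file.
    static List<Long> lineOffsets(String content) {
        List<Long> offsets = new ArrayList<>();
        long pos = 0;
        for (String line : content.split("\n")) {
            offsets.add(pos);
            // Advance by the line's encoded length plus one byte for '\n'.
            pos += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return offsets;
    }

    // Combines the file name with each offset so keys differ across files.
    static List<String> uniqueKeys(String fileName, String content) {
        List<String> keys = new ArrayList<>();
        for (long offset : lineOffsets(content)) {
            keys.add(fileName + ":" + offset);
        }
        return keys;
    }

    public static void main(String[] args) {
        // Two different files both have a line starting at offset 0.
        System.out.println(lineOffsets("ab\ncde\n"));
        System.out.println(lineOffsets("xy\nz\n"));
        // With the file name attached, the keys no longer collide.
        System.out.println(uniqueKeys("part-0", "ab\ncde\n"));
        System.out.println(uniqueKeys("part-1", "xy\nz\n"));
    }
}
```

The resulting keys are unique but neither numeric nor consecutive, so this only helps when any unique identifier will do.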

    Generate Auto-increment Id in Map-reduce Job

    http://shzhangji.com/blog/2013/10/31/generate-auto-increment-id-in-map-reduce-job/

    Generate unique customer id / insert unique rows in hive

    http://stackoverflow.com/questions/26855003/generate-unique-customer-id-insert-unique-rows-in-hive

    Need to add auto increment column in a table using hive

    http://stackoverflow.com/questions/23082763/need-to-add-auto-increment-column-in-a-table-using-hive

    https://hadooptutorial.info/writing-custom-udf-in-hive-auto-increment-column-hive/

    Here make sure that addition of annotation @UDFType(stateful = true) is required otherwise counter value will not get increment in the Hive column, it will just return value 1 for all the rows but not the actual row number.

    In the end I went with writing a UDF in Hive.


    package hive.udf;
    /**
     * Licensed to the Apache Software Foundation (ASF) under one
     * or more contributor license agreements.  See the NOTICE file
     * distributed with this work for additional information
     * regarding copyright ownership.  The ASF licenses this file
     * to you under the Apache License, Version 2.0 (the
     * "License"); you may not use this file except in compliance
     * with the License.  You may obtain a copy of the License at
     *
     *     http://www.apache.org/licenses/LICENSE-2.0
     *
     * Unless required by applicable law or agreed to in writing, software
     * distributed under the License is distributed on an "AS IS" BASIS,
     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     * See the License for the specific language governing permissions and
     * limitations under the License.
     */
    
    import org.apache.hadoop.hive.ql.exec.Description;
    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.hive.ql.udf.UDFType;
    
    /**
     * UDFRowSequence.
     */
    @Description(name = "row_sequence",
        value = "_FUNC_() - Returns a generated row sequence number starting from 1")
    @UDFType(deterministic = false, stateful = true) // stateful = true is required; without it every row gets 1
    public class UDFRowSequence extends UDF
    {
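      // long rather than int, so the counter does not overflow after 2^31 - 1 rows
      private long result;

      public UDFRowSequence() {
        result = 0;
      }

      // Called once per row; the counter belongs to this UDF instance only.
      public long evaluate() {
        result++;
        return result;
      }
    }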
    
    // End UDFRowSequence.java
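One caveat worth illustrating: the counter lives inside a UDF instance, so if a query runs as several tasks and each task creates its own instance (an assumption here, though it is the usual behaviour), every task starts counting from 1 again and ids repeat. The sketch below uses a plain stand-in class instead of the real Hive UDF so it runs without Hive on the classpath. `LocalSequence`, `globalId`, `taskIndex` and `numTasks` are hypothetical names, and the interleaving formula is one common workaround, not something taken from the original post.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo for illustration; not part of the original post.
class RowSequenceDemo {

    // Stand-in for UDFRowSequence without the Hive dependency:
    // each instance keeps its own counter starting at 1.
    static final class LocalSequence {
        private long result = 0;

        long evaluate() {
            result++;
            return result;
        }
    }

    // Interleaves per-task sequences into one global id space.
    // Task t of n (0-based) produces t + 1, t + 1 + n, t + 1 + 2n, ...
    static long globalId(long localSeq, int taskIndex, int numTasks) {
        return (localSeq - 1) * numTasks + taskIndex + 1;
    }

    public static void main(String[] args) {
        int numTasks = 2;
        Set<Long> rawIds = new HashSet<>();
        Set<Long> combinedIds = new HashSet<>();
        for (int task = 0; task < numTasks; task++) {
            LocalSequence seq = new LocalSequence(); // one instance per task
            for (int row = 0; row < 3; row++) {
                long local = seq.evaluate();
                rawIds.add(local);
                combinedIds.add(globalId(local, task, numTasks));
            }
        }
        // Six rows were processed in total.
        System.out.println("distinct raw ids: " + rawIds.size());
        System.out.println("distinct combined ids: " + combinedIds.size());
    }
}
```

If the job is forced to run as a single task, the plain counter is already unique and none of this is needed; the interleaved scheme only matters once more than one instance is counting, and it leaves gaps when tasks process different numbers of rows.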

    Author: linger

    Original link: http://blog.csdn.net/lingerlanlan/article/details/46430747




