Generating a unique id for every line of a large data file


Four main approaches:

1 Single-threaded processing

2 Plain multithreading

3 Hive

4 Hadoop
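As a baseline for approach 1, here is a minimal sketch of single-threaded numbering. The class name `LineNumberer`, the tab separator, and the 1-based ids are illustrative choices, not something the referenced material prescribes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

class LineNumberer {
    // Copies every line from in to out, prefixed with a 1-based id and a tab.
    // Returns the number of lines written.
    static long numberLines(Reader in, Writer out) {
        try {
            BufferedReader reader = new BufferedReader(in);
            long id = 0; // long, so more than Integer.MAX_VALUE lines still get distinct ids
            String line;
            while ((line = reader.readLine()) != null) {
                id++;
                out.write(Long.toString(id));
                out.write('\t');
                out.write(line);
                out.write('\n');
            }
            out.flush();
            return id;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In-memory demo; a real run would pass a FileReader and a FileWriter.
        StringWriter out = new StringWriter();
        long count = numberLines(new StringReader("a\nb\nc\n"), out);
        System.out.print(out);
        System.out.println(count + " lines numbered");
    }
}
```

A single sequential reader makes uniqueness trivial, but throughput is limited to one core and one disk stream, which is why the other three approaches are worth considering for very large files.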

 

Some references I found:


Notes on "Hadoop in Action", part 2: Hadoop input and output

https://book.douban.com/annotation/17068812/

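TextInputFormat: key = byte offset of the line within the file, value = the whole line.

However, this offset appears to be relative to a single file, not global across all input files.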
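One possible way to turn a per-file offset into a globally unique id is to pack a file index into the high bits and the byte offset into the low bits of a `long`. The sketch below assumes that each input file has been given a stable index (how a job obtains that index is left out), that files are smaller than 1 TiB so offsets fit in 40 bits, and that there are fewer than 2^23 files. The name `GlobalLineId` and the 40/23 bit split are hypothetical.

```java
class GlobalLineId {
    // Assumption: byte offsets fit in 40 bits (files under 1 TiB).
    static final int OFFSET_BITS = 40;
    static final long MAX_OFFSET = (1L << OFFSET_BITS) - 1;
    // The remaining 23 bits of a non-negative long hold the file index.
    static final long MAX_FILE_INDEX = (1L << (63 - OFFSET_BITS)) - 1;

    // Packs (fileIndex, byteOffset) into one non-negative long.
    static long compose(long fileIndex, long byteOffset) {
        if (fileIndex < 0 || fileIndex > MAX_FILE_INDEX) {
            throw new IllegalArgumentException("file index out of range: " + fileIndex);
        }
        if (byteOffset < 0 || byteOffset > MAX_OFFSET) {
            throw new IllegalArgumentException("offset out of range: " + byteOffset);
        }
        return (fileIndex << OFFSET_BITS) | byteOffset;
    }

    // Recovers the file index from a packed id.
    static long fileIndexOf(long id) {
        return id >>> OFFSET_BITS;
    }

    // Recovers the byte offset from a packed id.
    static long offsetOf(long id) {
        return id & MAX_OFFSET;
    }

    public static void main(String[] args) {
        // The same offset in two different files yields two different ids.
        long first = compose(0, 128);
        long second = compose(1, 128);
        System.out.println(first + " vs " + second);
        System.out.println("file=" + fileIndexOf(second) + " offset=" + offsetOf(second));
    }
}
```

Ids produced this way are unique but sparse: they are not consecutive row numbers, so this only fits cases where uniqueness matters and density does not.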

 

Generate Auto-increment Id in Map-reduce Job

http://shzhangji.com/blog/2013/10/31/generate-auto-increment-id-in-map-reduce-job/

 

Generate unique customer id / insert unique rows in hive

http://stackoverflow.com/questions/26855003/generate-unique-customer-id-insert-unique-rows-in-hive

 

Need to add auto increment column in a table using hive

http://stackoverflow.com/questions/23082763/need-to-add-auto-increment-column-in-a-table-using-hive

 

 

https://hadooptutorial.info/writing-custom-udf-in-hive-auto-increment-column-hive/

Here make sure that addition of annotation @UDFType(stateful = true) is required otherwise counter value will not get increment in the Hive column, it will just return value 1 for all the rows but not the actual row number.

 

In the end I went with writing a UDF in Hive.


package hive.udf;

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;

/**
 * UDFRowSequence.
 */
@Description(name = "row_sequence",
    value = "_FUNC_() - Returns a generated row sequence number starting from 1")
@UDFType(deterministic = false, stateful = true) // the stateful parameter is required
public class UDFRowSequence extends UDF {

  // Counter kept in this UDF instance. Note: an int wraps around after
  // 2,147,483,647 rows; a long would be needed for larger inputs.
  private int result;

  public UDFRowSequence() {
    result = 0;
  }

  // Called once per row; returns 1, 2, 3, ... for this instance.
  public int evaluate() {
    result++;
    return result;
  }
}

// End UDFRowSequence.java
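One caveat worth checking before relying on this UDF: the counter lives inside the UDF instance, so, as far as I understand, if the query runs as several parallel tasks each task would count from 1 on its own and ids could repeat across tasks. The sketch below is not the method used in this post. It is a plain-Java simulation, with no Hive dependency, of one possible remedy: count the rows per task first, then give each task a starting offset through a prefix sum. `TaskIdRanges`, `LocalSequence`, and `startOffsets` are hypothetical names.

```java
class TaskIdRanges {
    // Stand-in for the UDF's state: one independent counter per task.
    static final class LocalSequence {
        private long result = 0;

        long next() {
            return ++result;
        }
    }

    // Exclusive prefix sum: starts[i] is the total number of rows
    // handled by tasks 0 .. i-1.
    static long[] startOffsets(long[] rowsPerTask) {
        long[] starts = new long[rowsPerTask.length];
        long running = 0;
        for (int i = 0; i < rowsPerTask.length; i++) {
            starts[i] = running;
            running += rowsPerTask[i];
        }
        return starts;
    }

    public static void main(String[] args) {
        // Assumption: a first pass has already counted the rows per task.
        long[] rowsPerTask = {3, 2, 4};
        long[] starts = startOffsets(rowsPerTask);

        for (int task = 0; task < rowsPerTask.length; task++) {
            LocalSequence seq = new LocalSequence(); // restarts at 1 for every task
            for (long row = 0; row < rowsPerTask[task]; row++) {
                long local = seq.next();
                long global = starts[task] + local;
                System.out.println("task " + task + " local " + local + " global " + global);
            }
        }
    }
}
```

With row counts 3, 2 and 4 the start offsets are 0, 3 and 5, so the local ids 1..3, 1..2 and 1..4 become the global ids 1 through 9 with no gaps. The cost is an extra counting pass over the data; forcing the query through a single task avoids that pass but gives up parallelism.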

 

Author: linger

Original link: http://blog.csdn.net/lingerlanlan/article/details/46430747


