Generating a unique ID for every line of a large data file
Source: Internet  Editor: 程序博客网  Time: 2024/04/30 10:07
Four main approaches:
1 Single-threaded processing
2 Plain multithreading
3 Hive
4 Hadoop
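As a point of comparison, approach 1 can be sketched in a few lines of plain Java. This is only an illustration: the class name `SingleThreadLineIds` and the tab-separated output format are assumptions, not something from the original notes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical helper for approach 1 (single-threaded processing).
class SingleThreadLineIds {

    // Streams the input line by line and writes "<id>\t<line>" for each one.
    // Returns the number of lines processed.
    static long number(BufferedReader in, Writer out) throws IOException {
        long id = 0; // long, so the counter does not overflow on very large files
        String line;
        while ((line = in.readLine()) != null) {
            id++;
            out.write(id + "\t" + line + "\n");
        }
        return id;
    }

    public static void main(String[] args) throws IOException {
        // A small in-memory input stands in for the large file.
        BufferedReader in = new BufferedReader(new StringReader("a\nb\nc"));
        StringWriter out = new StringWriter();
        long count = number(in, out);
        System.out.print(out);
        System.out.println("lines: " + count);
    }
}
```

This is simple and gives consecutive IDs, but a single reader becomes the bottleneck on very large files, which is presumably why the other three approaches are on the list.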
I found some reference material.
Notes on 《Hadoop实战》 (Hadoop in Action) - 2. Hadoop input and output
https://book.douban.com/annotation/17068812/
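TextInputFormat: key = offset within the file, value = the whole line
However, this offset appears to be the offset within a single file, not a global one.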
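The problem with per-file offsets can be shown with a small plain-Java simulation (no Hadoop dependency). The class `FileOffsetIds`, the file names `part-a` and `part-b`, and the idea of prefixing the offset with the file name are assumptions made for illustration. In a real job the file name would have to come from the input split.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of per-file line offsets.
class FileOffsetIds {

    // Byte offset at which each line starts, similar to the key described above.
    // Assumes UTF-8 text and "\n" line endings.
    static List<Long> lineOffsets(String content) {
        List<Long> offsets = new ArrayList<>();
        long pos = 0;
        for (String line : content.split("\n")) {
            offsets.add(pos);
            pos += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        }
        return offsets;
    }

    // Assumed workaround: combine the file name with the offset inside that file.
    static List<String> globalIds(String fileName, String content) {
        List<String> ids = new ArrayList<>();
        for (long offset : lineOffsets(content)) {
            ids.add(fileName + ":" + offset);
        }
        return ids;
    }

    public static void main(String[] args) {
        String a = "foo\nbar\n";
        String b = "hello\nworld\n";
        // Offsets restart at 0 in every file, so they collide across files.
        System.out.println(lineOffsets(a)); // [0, 4]
        System.out.println(lineOffsets(b)); // [0, 6]
        // Pairing them with the file name makes them distinct.
        System.out.println(globalIds("part-a", a)); // [part-a:0, part-a:4]
        System.out.println(globalIds("part-b", b)); // [part-b:0, part-b:6]
    }
}
```

The resulting IDs are unique but neither numeric nor consecutive, so this only helps when any unique key is acceptable.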
Generate Auto-increment Id in Map-reduce Job
http://shzhangji.com/blog/2013/10/31/generate-auto-increment-id-in-map-reduce-job/
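One common way to get numeric IDs from parallel tasks without coordination is to let each task start at its own index and step by the total number of tasks. The sketch below simulates that in plain Java. It is not necessarily what the linked post does, and `StridedIds`, `idFor`, and the task counts are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical simulation of strided ID assignment across parallel tasks.
class StridedIds {

    // ID of the k-th record (k = 0, 1, 2, ...) handled by task `taskIndex`
    // when `numTasks` tasks run in parallel.
    static long idFor(int taskIndex, int numTasks, long k) {
        return taskIndex + k * numTasks;
    }

    public static void main(String[] args) {
        int numTasks = 3;       // assumed number of parallel tasks
        int recordsPerTask = 4; // assumed; equal counts only keep the demo short
        Set<Long> seen = new HashSet<>();
        for (int task = 0; task < numTasks; task++) {
            for (long k = 0; k < recordsPerTask; k++) {
                long id = idFor(task, numTasks, k);
                if (!seen.add(id)) {
                    throw new IllegalStateException("duplicate id " + id);
                }
                System.out.println("task " + task + " record " + k + " -> id " + id);
            }
        }
    }
}
```

The IDs never collide because each task only produces values with its own remainder modulo `numTasks`. They are consecutive only if every task handles the same number of records; otherwise there are gaps.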
Generate unique customer id / insert unique rows in hive
http://stackoverflow.com/questions/26855003/generate-unique-customer-id-insert-unique-rows-in-hive
Need to add auto increment column in a table using hive
http://stackoverflow.com/questions/23082763/need-to-add-auto-increment-column-in-a-table-using-hive
https://hadooptutorial.info/writing-custom-udf-in-hive-auto-increment-column-hive/
Here make sure that addition of annotation @UDFType(stateful = true) is required otherwise counter value will not get increment in the Hive column, it will just return value 1 for all the rows but not the actual row number.
In the end I went with writing a UDF in Hive.
package hive.udf;

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;

/**
 * UDFRowSequence.
 */
@Description(name = "row_sequence",
    value = "_FUNC_() - Returns a generated row sequence number starting from 1")
@UDFType(deterministic = false, stateful = true) // the stateful parameter is required
public class UDFRowSequence extends UDF {

  // Counter kept between calls; this is why the UDF must be declared stateful.
  private int result;

  public UDFRowSequence() {
    result = 0;
  }

  // Called once per row: increments the counter and returns the new value.
  public int evaluate() {
    result++;
    return result;
  }
}

// End UDFRowSequence.java
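One caveat worth checking before relying on this UDF: the counter lives in the object, so it is only unique per UDF instance. The plain-Java stand-in below (no Hive dependency) illustrates this. `RowSequenceDemo` and `Counter` are hypothetical names, and the assumption that Hive creates a separate UDF instance for each parallel task should be verified against your Hive version.

```java
// Hypothetical stand-in that mirrors the counter logic of the UDF above.
class RowSequenceDemo {

    // Same logic as UDFRowSequence.evaluate(), without the Hive base class.
    static class Counter {
        private int result = 0;

        int evaluate() {
            result++;
            return result;
        }
    }

    public static void main(String[] args) {
        // One instance: the counter keeps its state between calls.
        Counter single = new Counter();
        System.out.println(single.evaluate() + " " + single.evaluate() + " " + single.evaluate());
        // prints "1 2 3"

        // Two instances (assumed: one per parallel task): both start again at 1.
        Counter taskA = new Counter();
        Counter taskB = new Counter();
        System.out.println(taskA.evaluate() + " " + taskB.evaluate());
        // prints "1 1"
    }
}
```

If the query runs as several parallel tasks, the generated numbers may therefore repeat across tasks. Forcing the work through a single task, or combining the counter with a task identifier, are possible ways around this.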
Author: linger
Article link: http://blog.csdn.net/lingerlanlan/article/details/46430747