hive transform python

Source: Internet  Editor: 程序博客网  Time: 2024/06/04 22:49
Here is a simple example I wrote to de-duplicate topic descriptions. The desc field in the table holds values such as "a-b-a-b-b-c", and the repeated items need to be removed. The Python code is as follows:

    #!/usr/bin/python
    import sys

    def quchong(desc):
        # "quchong" means de-duplicate: split on '-', drop repeats, re-join.
        # Note: set() does not preserve the original order of the items.
        a = desc.split('-')
        return '-'.join(set(a))

    while True:
        line = sys.stdin.readline()
        if line == "":
            break
        line = line.rstrip('\n')
        # your process code here
        parts = line.split('\t')
        parts[2] = quchong(parts[2])  # the third input column is de-duplicated
        print("\t".join(parts))

What follows is reposted from elsewhere and is more detailed.

II. Incrementing values inside a Hive map column (reposted)

1. Create the table

    hive> CREATE TABLE t3 (foo STRING, bar MAP<STRING,INT>)
        > ROW FORMAT DELIMITED
        > FIELDS TERMINATED BY '\t'
        > COLLECTION ITEMS TERMINATED BY ','
        > MAP KEYS TERMINATED BY ':'
        > STORED AS TEXTFILE;
    OK

2. The resulting schema

    hive> describe t3;
    OK
    foo     string
    bar     map<string,int>

3. Create test.txt (the two fields are separated by a tab)

    jeffgeng        click:13,uid:15

4. Load test.txt

    hive> LOAD DATA LOCAL INPATH 'test.txt' OVERWRITE INTO TABLE t3;
    Copying data from file:/root/src/hadoop/hadoop-0.20.2/contrib/hive-0.5.0-bin/bin/test.txt
    Loading data to table t3
    OK

After loading, the table looks like this:

    hive> select * from t3;
    OK
    jeffgeng        {"click":13,"uid":15}

5. A value in the map can be queried like this:

    hive> select bar['click'] from t3;
    ... a series of MapReduce jobs ...
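    OK
    13

6. Write add_mapper.py

    #!/usr/bin/python
    import ast
    import sys

    for line in sys.stdin:
        line = line.strip()
        foo, bar = line.split('\t')
        # bar is assumed to arrive as a dict-style literal, e.g. {"click":13,"uid":15}
        d = ast.literal_eval(bar)
        d['click'] += 1
        print('\t'.join([foo, str(d)]))

7. Run it in Hive

    hive> CREATE TABLE t4 (foo STRING, bar MAP<STRING,INT>)
        > ROW FORMAT DELIMITED
        > FIELDS TERMINATED BY '\t'
        > COLLECTION ITEMS TERMINATED BY ','
        > MAP KEYS TERMINATED BY ':'
        > STORED AS TEXTFILE;

    hive> add FILE add_mapper.py;

    hive> INSERT OVERWRITE TABLE t4
        > SELECT
        >   TRANSFORM (foo, bar)
        >   USING 'python add_mapper.py'
        >   AS (foo, bar)
        > FROM t3;
    FAILED: Error in semantic analysis: line 1:23 Cannot insert into target table because column number/types are different t4: Cannot convert column 1 from string to map<string,int>.

8. Why does this error occur? The output columns of a TRANSFORM script are typed as string unless stated otherwise, so Hive cannot treat the script's bar output as a map. It turns out that the columns listed after AS can be given explicit types:

    INSERT OVERWRITE TABLE t4
    SELECT
      TRANSFORM (foo, bar)
      USING 'python add_mapper.py'
      AS (foo string, bar map<string,int>)
    FROM t3;

9. The Python script must also strip the spaces, quotes, and left and right curly braces left over from converting the dict to a string, so that only key:value,key:value text remains:

    #!/usr/bin/python
    import ast
    import sys

    for line in sys.stdin:
        line = line.strip()
        foo, bar = line.split('\t')
        d = ast.literal_eval(bar)
        d['click'] += 1
        d['uid'] += 1
        # Drop spaces, quotes and braces so the text matches the table's
        # ',' (item) and ':' (key/value) delimiters.
        strmap = ''
        for x in str(d):
            if x in (' ', "'", '{', '}'):
                continue
            strmap += x
        print('\t'.join([foo, strmap]))

10. Result after running (as recorded in the original post)

    hive> select * from t4;
    OK
    jeffgeng        {"click":14,"uid":null}
    Time taken: 0.146 seconds

The null for uid suggests that in the recorded run a closing brace was still attached to the last value, so it could not be read as an int. With both braces stripped, uid would be expected to load as 16.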
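The character-stripping loop in step 9 depends on the exact layout of str(d). As an alternative, the key:value text can be built directly from the dict. The following is only a minimal sketch: the helper names to_hive_map and process_line are made up for illustration and are not part of Hive, the delimiters are assumed to be the ',' and ':' declared for table t4, and the map column is assumed to arrive as a dict-style literal, as add_mapper.py already assumes.

```python
import ast


def to_hive_map(d, item_sep=',', kv_sep=':'):
    """Render a dict as key:value,key:value text (hypothetical helper).

    item_sep and kv_sep are assumed to match the table's
    COLLECTION ITEMS and MAP KEYS delimiters.
    """
    # Sort the keys so the output order is deterministic.
    return item_sep.join(
        '{0}{1}{2}'.format(k, kv_sep, d[k]) for k in sorted(d)
    )


def process_line(line):
    """Increment click and uid in the map column of one tab-separated row."""
    foo, bar = line.rstrip('\n').split('\t')
    # Assumption: the map arrives as a literal such as {"click":13,"uid":15}.
    d = ast.literal_eval(bar)
    d['click'] += 1
    d['uid'] += 1
    return '\t'.join([foo, to_hive_map(d)])


sample = 'jeffgeng\t{"click":13,"uid":15}'
print(process_line(sample))  # jeffgeng<TAB>click:14,uid:16
```

In a TRANSFORM script, process_line would be applied to each line read from sys.stdin and the result printed.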
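A note on the de-duplication script at the top of this post: set() does not keep the original order, so "a-b-a-b-b-c" may come back with its items in any order. If the order of first appearance matters, something along these lines could be used instead. This is a sketch only; dedupe_ordered is a hypothetical name and not part of the original script.

```python
def dedupe_ordered(desc, sep='-'):
    """Remove repeated items while keeping first-seen order (hypothetical helper)."""
    seen = set()
    kept = []
    for item in desc.split(sep):
        if item not in seen:
            seen.add(item)
            kept.append(item)
    return sep.join(kept)


print(dedupe_ordered('a-b-a-b-b-c'))  # a-b-c
```

It can replace quchong in the script without any other change, since it takes and returns the same kind of string.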
