python连hive

来源:互联网 发布:编程语言特点 编辑:程序博客网 时间:2024/05/29 12:47

前言

最近开始尝试tensorflow,而数据存在hive里,如何读取数据是个很大问题。网上有不少例子,不过这些例子也带来了不少坑,最终于经过九九八十一难,总算连成功了。
先说说经历的坑吧

坑1
使用thrift安装,在网上查了很多相关资料,貌似年代都挺久远的,不过还是试了试,参考代码如下

# -*- coding: utf-8 -*-#!/usr/bin/env pythonimport sysimport tracebacksys.path.append('/home/hadoop/hive/lib/py')from hive_service import ThriftHivefrom hive_service.ttypes import HiveServerExceptionfrom thrift import Thriftfrom thrift.transport import TSocketfrom thrift.transport import TTransportfrom thrift.protocol import TBinaryProtocoldef hiveExe(sql):    try:        transport = TSocket.TSocket('127.0.0.1', 10000)         transport = TTransport.TBufferedTransport(transport)        protocol = TBinaryProtocol.TBinaryProtocol(transport)        client = ThriftHive.Client(protocol)        transport.open()        client.execute(sql)        print "The return value is : "         print client.fetchAll()        print "............"        transport.close()    except Thrift.TException, tx:    traceback.print_exc()        print 'exception %s' % (tx.message)if __name__ == '__main__':    hiveExe("select * from temp.")

用pip install 装好了thrift,但hive_service却不知如何装了,无法pip install, 也未Google到,只好放弃。

坑2
无意在github上发现了pythive这个神奇的东西,感觉瞬间看到了光明。参考代码如下

from pyhive import hivefrom TCLIService.ttypes import TOperationStatecursor = hive.connect('localhost').cursor()cursor.execute('SELECT * FROM my_awesome_data LIMIT 10', async=True)status = cursor.poll().operationStatewhile status in (TOperationState.INITIALIZED_STATE, TOperationState.RUNNING_STATE):    logs = cursor.fetch_logs()    for message in logs:        print message    # If needed, an asynchronous query can be cancelled at any time with:    # cursor.cancel()    status = cursor.poll().operationStateprint cursor.fetchall()

然而坑又来了,当from pyhive import hive时又报错了,报错信息为 No module named sasl,然后再pip install sasl ,又出现报错,没办法,只好根据报错一个个解决了。依稀记得得装python-dev 还有gcc相关的东西升级下,太多东西,记不清了。
装好sasl后,又无法读取数据,错误代码为='TSocket read 0 bytes。去github里的Issus聊了下,也无法得到比较好的解决,于是换了段代码。

成功代码

# -*- coding: utf-8 -*-import pyhs2class HiveClient:    def __init__(self, db_host, database, port=10000, authMechanism="NOSASL"):        """        create connection to hive server2        """        self.conn = pyhs2.connect(host=db_host,                                  port=port,                                  authMechanism=authMechanism,                                  database=database,                                  )    def query(self, sql):        """        query        """        with self.conn.cursor() as cursor:            cursor.execute(sql)            return cursor.fetch()    def close(self):        """        close connection        """        self.conn.close()class connecthive():    def __init__(self, database):        self.hive_client = HiveClient(db_host='localhost', port=10000,                             database=database, authMechanism='NOSASL')    def query(self,sql):        """        query        """        result = self.hive_client.query(sql)        self.hive_client.close()        return result

这段代码成功让我看到了hive里的数据,总之,python连hive坑太多,入坑需谨慎啊!!!