Hive2ElasticSearch


The goal is to import Hive data into Elasticsearch (ES).
Initially I read the Hive table files and pushed documents to ES with the bulk API, but this was far too slow for large data volumes.
I then found that Elastic provides an official solution, ES-Hadoop. This post records some problems encountered while using it.

ES-Hadoop provides a Hive storage handler for Elasticsearch, so a developer can define an external table and push data to ES simply by writing to that table.

Defining the table

Example:

    CREATE EXTERNAL TABLE artists (
        id      BIGINT,
        name    STRING,
        links   STRUCT<url:STRING, picture:STRING>)
    STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES('es.resource' = 'radio/artists');

STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' determines how this table is stored. The contents of TBLPROPERTIES are ES-Hadoop configuration settings; the full configuration reference is in the official documentation (reference 1).

In practice, the following settings deserve attention (a combined example follows item 2 below):
1. es.write.operation

es.write.operation (default: index)
The write operation elasticsearch-hadoop should perform. It can be any of:
- index (default): new data is added, while existing data (based on its id) is replaced (reindexed).
- create: adds new data; if the data already exists (based on its id), an exception is thrown.
- update: updates existing data (based on its id); if no data is found, an exception is thrown.
- upsert: known as merge; inserts if the data does not exist, updates (based on its id) if it does.

2. es.mapping.id

es.mapping.id (default: none)
The document field/property name containing the document id.

It is best to set a controllable id, e.g. 'es.mapping.id' = 'id', where id is a column of the external table definition.
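
Putting both settings together, here is a minimal sketch of the table definition. The index radio/artists comes from the example above; choosing upsert as the write operation is an assumption for illustration:

    CREATE EXTERNAL TABLE artists (
        id      BIGINT,
        name    STRING,
        links   STRUCT<url:STRING, picture:STRING>)
    STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES(
        'es.resource'        = 'radio/artists',
        'es.write.operation' = 'upsert',  -- update if the id exists, insert otherwise
        'es.mapping.id'      = 'id'       -- use the id column as the ES document _id
    );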

Problems encountered

1. Version compatibility
Choose the ES-Hadoop version that matches your ES version. At first I got an error about an unsupported ES version. I could not find a compatibility matrix between ES-Hadoop and ES on the Elastic website, but later found this on GitHub (reference 2):

- ES-Hadoop 6.x and higher are compatible with Elasticsearch 1.X, 2.X, 5.X, and 6.X
- ES-Hadoop 5.x and higher are compatible with Elasticsearch 1.X, 2.X and 5.X
- ES-Hadoop 2.2.x and higher are compatible with Elasticsearch 1.X and 2.X
- ES-Hadoop 2.0.x and 2.1.x are compatible with Elasticsearch 1.X only
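
Once a matching version is chosen, the ES-Hadoop jar must be available in the Hive session before the external table can be used. A minimal sketch; the path and version number are placeholders for your environment:

    -- Path and version are illustrative; substitute your own jar.
    ADD JAR /path/to/elasticsearch-hadoop-6.2.4.jar;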

2. Maybe ES was overloaded
The following error occurred when inserting data into the external table:

Caused by: org.elasticsearch.hadoop.EsHadoopException: Could not write all entries [166/166] (Maybe ES was overloaded?). Error sample (first [5] error messages):
rejected execution of org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1@3b0002f4 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@75d20902[Running, pool size = 32, active threads = 32, queued tasks = 50, completed tasks = 2119432]]

A search turned up a related question on Stack Overflow (reference 3). Following the answer there I adjusted es.batch.size.entries, but the error persisted. Since the message points to ES being overloaded, I suspected the Hadoop cluster was far more powerful than the ES cluster: too many map tasks writing concurrently overwhelmed ES. In Hive I therefore set
set mapreduce.job.running.map.limit=50
to cap the number of concurrently running map tasks. Starting from 200 the job still failed; gradually lowering the limit to 50 made the error disappear.
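
A minimal sketch of the two knobs together. The concrete values (1000 entries per bulk request, 50 concurrent maps) are the ones tried here, not general recommendations, and the artists table is from the example above:

    -- Throttle the bulk batch size on the ES side (per-table property):
    ALTER TABLE artists SET TBLPROPERTIES('es.batch.size.entries' = '1000');

    -- Throttle the write parallelism on the Hadoop side (per-session setting):
    SET mapreduce.job.running.map.limit=50;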

3. whose UTF8 encoding is longer than the max length 32766
Details:

Caused by: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Found unrecoverable error [172.28.147.73:30000] returned Bad Request(400) - Document contains at least one immense term in field="st" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[34, 45, 45, 62, 39, 45, 45, 62, 96, 45, 45, 62, 60, 33, 45, 45, 35, 83, 69, 84, 37, 50, 48, 86, 65, 82, 61, 34, 72, 57]...', original message: bytes can be at most 32766 in length; got 113469; Bailing out..
    at org.elasticsearch.hadoop.rest.RestClient.processBulkResponse(RestClient.java:251)
    at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:203)
    at org.elasticsearch.hadoop.rest.RestRepository.tryFlush(RestRepository.java:220)

The failing field's value was very long, and the field was mapped with index=not_analyzed. With that setting the value is not tokenized, so ES must index the entire string as a single term in the inverted index, and it rejects terms above 32766 bytes because they would consume too many resources.
At first I was sure the field could not possibly be that long and suspected an encoding problem. Verifying the data showed that upstream processing was faulty and had written abnormal values into this field.
Lesson learned: when something goes wrong, don't assume; let the facts speak.
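
Offending rows can be found in Hive before pushing. A minimal sketch, assuming a hypothetical source table src_table with the problem column st, and Hive 2.2+ for octet_length:

    -- Find values whose UTF-8 encoding exceeds Lucene's 32766-byte term limit.
    -- src_table and st are placeholder names; octet_length requires Hive 2.2+.
    SELECT id, octet_length(st) AS utf8_bytes
    FROM src_table
    WHERE octet_length(st) > 32766;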

I also found that pushing data to the ES external table with a complex SQL statement was extremely slow (reason unknown for now). Splitting the logic helped: run the complex SQL into an intermediate table first, then read the intermediate table and insert into the ES table, as sketched below.
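
A minimal sketch of the two-step pattern, with hypothetical names (artists_staging is the intermediate table, source_tables stands in for the real inputs; artists is the ES external table from the example above):

    -- Step 1: materialize the complex query into a plain Hive table.
    CREATE TABLE artists_staging AS
    SELECT /* complex joins, aggregations, ... */ id, name, links
    FROM source_tables;

    -- Step 2: a simple scan-and-insert into the ES external table.
    INSERT OVERWRITE TABLE artists
    SELECT id, name, links FROM artists_staging;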

References:
1.https://www.elastic.co/guide/en/elasticsearch/hadoop/master/reference.html
2.https://github.com/elastic/elasticsearch-hadoop
3.https://stackoverflow.com/questions/29843898/spark-streaming-and-elasticsearch-could-not-write-all-entries
4.http://blog.csdn.net/woshiaotian/article/details/51088156