Elasticsearch Document APIs: Bulk API

Source: Internet. Editor: 程序博客网. Date: 2024/06/06 08:25

Environment

Elasticsearch: 5.5

Bulk API

The bulk API makes it possible to perform many index/delete operations in a single API call. This can greatly increase indexing (insert) speed.

The REST API endpoint is /_bulk, and it expects the following newline-delimited JSON (NDJSON) structure:

action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
....
action_and_meta_data\n
optional_source\n

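Note: the final line of data must end with a newline character \n. Each newline character may be preceded by a carriage return \r. When sending requests to this endpoint, the Content-Type header should be set to application/x-ndjson.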

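The possible actions are index, create, delete and update. index and create expect a source (that is, the actual document data) on the next line, and have the same semantics as the op_type parameter of the standard index API: create fails if a document with the same index and type already exists, whereas index adds or replaces the document as necessary. delete does not expect a source on the following line (a delete only needs to identify the document), and has the same semantics as the standard delete API. update expects the partial doc, upsert, script and their options to be specified on the next line.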
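The line-pairing rules above can be sketched in Python. This is only a minimal illustration, not an official client: the helper name build_bulk_body and the (action, metadata, source) tuple layout are assumptions made for this example.

```python
import json

# Hypothetical helper (not part of any Elasticsearch client): serializes
# (action, metadata, source) triples into an NDJSON bulk body.
ACTIONS_WITH_SOURCE = {"index", "create", "update"}


def build_bulk_body(operations):
    lines = []
    for action, meta, source in operations:
        if action not in ACTIONS_WITH_SOURCE and action != "delete":
            raise ValueError("unknown bulk action: %s" % action)
        # json.dumps emits a single line by default (no pretty-printing).
        lines.append(json.dumps({action: meta}))
        if action in ACTIONS_WITH_SOURCE:
            if source is None:
                raise ValueError("%s requires a source line" % action)
            lines.append(json.dumps(source))
        elif source is not None:
            raise ValueError("delete must not have a source line")
    # Every line, including the last one, must end with \n.
    return "".join(line + "\n" for line in lines)


body = build_bulk_body([
    ("index", {"_index": "test", "_type": "type1", "_id": "1"}, {"field1": "value1"}),
    ("delete", {"_index": "test", "_type": "type1", "_id": "2"}, None),
])
print(body, end="")
```

The index action contributes two lines and the delete action only one, so the body above has three lines in total.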

If you are providing text file input to curl, you must use the --data-binary flag instead of plain -d. The latter does not preserve newlines. For example:

$ cat requests
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
$ curl -s -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/_bulk --data-binary "@requests"; echo
{"took":7, "errors": false, "items":[{"index":{"_index":"test","_type":"type1","_id":"1","_version":1,"result":"created","forced_refresh":false}}]}
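For readers who prefer a script to curl, the same request can be prepared with Python's standard library. This is a sketch under assumptions: the address http://localhost:9200 and the helper name make_bulk_request are placeholders for this example, and the request is only constructed here, not sent.

```python
import urllib.request

# Assumed address of a local node; adjust for your cluster.
BULK_URL = "http://localhost:9200/_bulk"


def make_bulk_request(ndjson_body, url=BULK_URL):
    # The body must end with a newline, otherwise the last line is rejected.
    if not ndjson_body.endswith("\n"):
        ndjson_body += "\n"
    return urllib.request.Request(
        url,
        data=ndjson_body.encode("utf-8"),  # bytes are sent as-is, newlines intact
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )


request = make_bulk_request(
    '{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }\n'
    '{ "field1" : "value1" }\n'
)
# To actually send it (requires a running node):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Passing the body as bytes plays the same role as --data-binary: nothing strips or rewrites the newlines before they reach the server.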

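Because this format uses literal \n characters as delimiters, make sure that the JSON actions and sources are not pretty-printed. Here is an example of a correct sequence of bulk commands: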

POST _bulk
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_type" : "type1", "_id" : "2" } }
{ "create" : { "_index" : "test", "_type" : "type1", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_type" : "type1", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }

As you can see, the delete action above takes only one line and has no document data (that is, no source).

The result of this bulk operation is:

{
   "took": 30,
   "errors": false,
   "items": [
      {
         "index": {
            "_index": "test",
            "_type": "type1",
            "_id": "1",
            "_version": 1,
            "result": "created",
            "_shards": {
               "total": 2,
               "successful": 1,
               "failed": 0
            },
            "created": true,
            "status": 201
         }
      },
      {
         "delete": {
            "found": false,
            "_index": "test",
            "_type": "type1",
            "_id": "2",
            "_version": 1,
            "result": "not_found",
            "_shards": {
               "total": 2,
               "successful": 1,
               "failed": 0
            },
            "status": 404
         }
      },
      {
         "create": {
            "_index": "test",
            "_type": "type1",
            "_id": "3",
            "_version": 1,
            "result": "created",
            "_shards": {
               "total": 2,
               "successful": 1,
               "failed": 0
            },
            "created": true,
            "status": 201
         }
      },
      {
         "update": {
            "_index": "test",
            "_type": "type1",
            "_id": "1",
            "_version": 2,
            "result": "updated",
            "_shards": {
               "total": 2,
               "successful": 1,
               "failed": 0
            },
            "status": 200
         }
      }
   ]
}

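The endpoints are /_bulk, /{index}/_bulk, and /{index}/{type}/_bulk. When the index or the index/type is provided in the URL, it is used as the default for any bulk item that does not provide it explicitly.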
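The three endpoint forms can be illustrated with a small path builder. The function name bulk_path is hypothetical and exists only for this sketch; it simply mirrors the rule that a type can appear in the URL only together with an index.

```python
def bulk_path(index=None, doc_type=None):
    """Build the bulk endpoint path for an optional default index and type."""
    if doc_type is not None and index is None:
        raise ValueError("a type can only be given together with an index")
    parts = [part for part in (index, doc_type) if part is not None]
    return "/" + "/".join(parts + ["_bulk"])


print(bulk_path())                 # /_bulk
print(bulk_path("test"))           # /test/_bulk
print(bulk_path("test", "type1"))  # /test/type1/_bulk
```

With the last form, an action line such as { "index" : { "_id" : "1" } } can omit _index and _type, because the URL supplies them as defaults.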

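A note on the format. The idea here is to make processing as fast as possible. Because some actions will be redirected to other shards on other nodes, the receiving node parses only the action_meta_data.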

Client libraries using this protocol should try to do something similar on the client side, and reduce buffering as much as possible.

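The response to a bulk request is a large JSON structure containing the individual result of each action that was performed. The failure of a single action does not affect the remaining actions.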
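Since one failed action does not abort the others, callers usually need to scan the items array themselves. The following is a minimal sketch: the function name failed_items is hypothetical, and sample_response is hand-written illustrative data (not captured server output) in which the second action is assumed to have hit a version conflict. It relies on failed items carrying an error object.

```python
def failed_items(bulk_response):
    """Return (position, action, result) for every item that reports an error."""
    failures = []
    for position, item in enumerate(bulk_response["items"]):
        # Each item is a single-key object: {"index": {...}}, {"delete": {...}}, ...
        (action, result), = item.items()
        if "error" in result:
            failures.append((position, action, result))
    return failures


# Hand-written sample data for illustration only.
sample_response = {
    "took": 5,
    "errors": True,
    "items": [
        {"index": {"_index": "test", "_type": "type1", "_id": "1", "status": 201}},
        {"create": {"_index": "test", "_type": "type1", "_id": "3", "status": 409,
                    "error": {"type": "version_conflict_engine_exception",
                              "reason": "document already exists"}}},
        {"delete": {"_index": "test", "_type": "type1", "_id": "2", "status": 404}},
    ],
}

for position, action, result in failed_items(sample_response):
    print(position, action, result["status"], result["error"]["type"])
```

A cheap first check is the top-level errors flag: when it is false, no item needs to be inspected. A delete of a missing document (status 404, result not_found) carries no error object, so this sketch does not count it as a failure.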

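There is no "correct" number of actions to perform in a single bulk call. You should experiment with different settings to find the optimum size for your particular workload.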
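One way to experiment with batch sizes is to keep the size a single tunable parameter. The generator below is a generic sketch, not taken from any Elasticsearch client; the name chunked and the batch size of 2 are arbitrary choices for the example.

```python
def chunked(actions, size):
    """Yield lists of at most `size` actions, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch = []
    for action in actions:
        batch.append(action)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


document_ids = ["1", "2", "3", "4", "5"]
batches = list(chunked(document_ids, 2))
print(batches)  # [['1', '2'], ['3', '4'], ['5']]
```

Each batch would then become one bulk request, so changing the size argument is all that is needed to compare throughput for different settings.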

If you use the HTTP API, make sure that the client does not send HTTP chunks, as this slows things down.

Versioning

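Each bulk item can include a version value using the _version/version field. It automatically follows the behavior of the index/delete operation based on the _version mapping. It also supports version_type/_version_type (see versioning for details).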

Routing

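Each bulk item can include a routing value using the _routing/routing field. It automatically follows the behavior of the index/delete operation based on the _routing mapping.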

Parent

Each bulk item can include a parent value using the _parent/parent field. It automatically follows the behavior of the index/delete operation based on the _parent / _routing mapping.

Wait For Active Shards

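When making bulk calls, you can set the wait_for_active_shards parameter to require a minimum number of shard copies to be active before processing of the bulk request starts. For further details and a usage example, see the wait_for_active_shards section of the index API documentation.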

Update

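When using the update action, you can use the _retry_on_conflict field. It is a field of the action line itself (it does not go in the extra payload line) and specifies how many times the update should be retried in the case of a version conflict.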

The update action payload supports the following options: doc (partial document), upsert, doc_as_upsert, script and _source. See the update documentation for details on these options.

Example with update actions:

POST _bulk
{ "update" : {"_id" : "1", "_type" : "type1", "_index" : "index1", "_retry_on_conflict" : 3} }
{ "doc" : {"field" : "value"} }
{ "update" : { "_id" : "0", "_type" : "type1", "_index" : "index1", "_retry_on_conflict" : 3} }
{ "script" : { "inline": "ctx._source.counter += params.param1", "lang" : "painless", "params" : {"param1" : 1}}, "upsert" : {"counter" : 1}}
{ "update" : {"_id" : "2", "_type" : "type1", "_index" : "index1", "_retry_on_conflict" : 3} }
{ "doc" : {"field" : "value"}, "doc_as_upsert" : true }
{ "update" : {"_id" : "3", "_type" : "type1", "_index" : "index1", "_source" : true} }
{ "doc" : {"field" : "value"} }
{ "update" : {"_id" : "4", "_type" : "type1", "_index" : "index1"} }
{ "doc" : {"field" : "value"}, "_source": true}
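The split between the action line and the payload line can be made explicit in code. The sketch below assumes a hypothetical helper named update_lines; it places _retry_on_conflict in the action metadata and leaves doc, upsert, doc_as_upsert and script to the payload, as described above.

```python
import json


def update_lines(index, doc_type, doc_id, payload, retry_on_conflict=None):
    """Return the two NDJSON lines (action + payload) for one update action."""
    meta = {"_index": index, "_type": doc_type, "_id": doc_id}
    if retry_on_conflict is not None:
        # Belongs to the action line, not to the payload line.
        meta["_retry_on_conflict"] = retry_on_conflict
    return json.dumps({"update": meta}) + "\n" + json.dumps(payload) + "\n"


lines = update_lines(
    "index1", "type1", "2",
    {"doc": {"field": "value"}, "doc_as_upsert": True},
    retry_on_conflict=3,
)
print(lines, end="")
```

Concatenating the output of several such calls gives a bulk body equivalent in structure to the request shown above.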

Security

See: URL-based access control

Official documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
