Elasticsearch: The Definitive Guide, 01 Getting Started, 03 Data In, Data Out


https://www.elastic.co/guide/en/elasticsearch/guide/current/data-in-data-out.html

Elasticsearch is a distributed document store. It can store and retrieve complex data structures (serialized as JSON documents) in real time. In other words, as soon as a document has been stored in Elasticsearch, it can be retrieved from any node in the cluster.

In Elasticsearch, all data in every field is indexed by default. That is, every field has a dedicated inverted index for fast retrieval.

This chapter focuses on how to store our documents safely in Elasticsearch and how to get them back again.

1 What Is a Doc?

Simply put, a document is data in JSON format.
Often, we use the terms object and document interchangeably.
However, there is a distinction. An object is just a JSON object—similar to what is known as a hash, hashmap, dictionary, or associative array. Objects may contain other objects. In Elasticsearch, the term document has a specific meaning. It refers to the top-level, or root object that is serialized into JSON and stored in Elasticsearch under a unique ID.

Field names can be any valid string, but they may not contain periods.

2 Doc Metadata

doc = data + metadata

The three required metadata elements:

  • _index: where the document lives
  • _type: the class of object that the document represents
  • _id: the unique identifier for the document

_index

Tip: Actually, in Elasticsearch, our data is stored and indexed in shards, while an index is just a logical namespace that groups together one or more shards. However, this is an internal detail; our application shouldn’t care about shards at all.

From the user’s perspective: index
Under the hood: shards

Index names must be lowercase, cannot begin with an underscore (_), and cannot contain commas.
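As a rough illustration, the naming rules above can be checked on the client before a request is sent. This is a minimal Python sketch that encodes only the three rules listed here; the function name is_valid_index_name is hypothetical, and Elasticsearch itself may enforce further restrictions.

```python
def is_valid_index_name(name: str) -> bool:
    """Check a candidate index name against the three rules listed above."""
    if not name:
        return False
    if name != name.lower():   # must be lowercase
        return False
    if name.startswith("_"):   # cannot begin with an underscore
        return False
    if "," in name:            # cannot contain commas
        return False
    return True
```

For example, is_valid_index_name("website") passes, while "Website", "_website", and "web,site" are rejected.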

_type

For example, all your products may go inside a single index. But you have different categories of products.

The documents all share an identical (or very similar) schema: they have a title, description, product code, price.

Elasticsearch exposes a feature called types, which allows you to logically partition data inside of an index.
Documents in different types may have different fields, but it is best if they are highly similar.

Type names can be lowercase or uppercase, cannot begin with an underscore or a period, and are limited to 256 characters.

_id

The ID is a string that, when combined with the _index and _type, uniquely identifies a document in Elasticsearch. When creating a new document, you can either provide your own _id or let Elasticsearch generate one for you.

Other Metadata

There are several other metadata elements, which are presented in Types and Mappings.

3 Indexing a Doc

Documents are indexed—stored and made searchable—by using the index API.

Using our own ID (unique in combination with the index and type)

PUT /{index}/{type}/{id}
{
  "field": "value",
  ...
}

PUT /website/blog/123
{
  "title": "My first blog entry",
  "text":  "Just trying this out...",
  "date":  "2014/01/01"
}

Response:

{
   "_index":    "website",
   "_type":     "blog",
   "_id":       "123",
   "_version":  1,      // version; incremented on every change (including deletes)
   "created":   true    // true if newly created; false if an existing doc was replaced
}
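The same request can be assembled from application code. The sketch below uses only Python's standard library and builds the PUT request without sending it; the helper name build_index_request is hypothetical, and the base URL assumes a node listening on localhost:9200.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_type, doc_id, doc):
    """Build (but do not send) a PUT request for the index API."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_index_request(
    "http://localhost:9200",  # assumed node address
    "website", "blog", "123",
    {
        "title": "My first blog entry",
        "text": "Just trying this out...",
        "date": "2014/01/01",
    },
)
# Against a running node, urllib.request.urlopen(request) would send it.
```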

In Dealing with Conflicts, we discuss how to use the _version number to ensure that one part of your application doesn’t overwrite changes made by another part.


Autogenerating IDs

POST /website/blog/
{
  "title": "My second blog entry",
  "text":  "Still trying this out...",
  "date":  "2014/01/01"
}

Response:

{
   "_index":    "website",
   "_type":     "blog",
   "_id":       "AVFgSgVHUP18jI2wRx0w",   // autogenerated ID, 20 characters long
   "_version":  1,
   "created":   true
}

Autogenerated IDs are 20-character-long, URL-safe, Base64-encoded GUID (globally unique identifier) strings. They are generated in a way that allows multiple nodes to generate unique IDs in parallel with essentially zero chance of collision.
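To see why such an ID is 20 characters long, note that 15 bytes encode to exactly 20 Base64 characters with no padding. The Python sketch below is only an illustration of that shape using random bytes; it is not the scheme Elasticsearch uses internally, and generate_id is a hypothetical name.

```python
import base64
import os


def generate_id() -> str:
    """Return a 20-character, URL-safe, Base64-encoded random identifier."""
    # 15 bytes = 120 bits = 20 Base64 characters, so no "=" padding appears.
    return base64.urlsafe_b64encode(os.urandom(15)).decode("ascii")
```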

4 Retrieving a Doc

GET /website/blog/123?pretty

Result:

{
  "_index" :   "website",
  "_type" :    "blog",
  "_id" :      "123",
  "_version" : 1,
  "found" :    true,      // the doc was found; otherwise false (and the HTTP status is 404 instead of 200)
  "_source" :  {          // contains the original JSON doc
      "title": "My first blog entry",
      "text":  "Just trying this out...",
      "date":  "2014/01/01"
  }
}

Adding pretty to the query-string parameters of any request causes Elasticsearch to pretty-print the JSON response to make it more readable. The _source field, however, isn’t pretty-printed.

Requesting a document that doesn’t exist:

curl -i -XGET 'http://localhost:9200/website/blog/124?pretty'

HTTP/1.1 404 Not Found
Content-Type: application/json; charset=UTF-8
Content-Length: 83

{
  "_index" : "website",
  "_type" :  "blog",
  "_id" :    "124",
  "found" :  false
}

Retrieving part of a document:

GET /website/blog/123?_source=title,text

The _source parameter specifies which fields to return.

To return just the _source field without any metadata (important), use the _source endpoint:

GET /website/blog/123/_source

Result:

{
   "title": "My first blog entry",
   "text":  "Just trying this out...",
   "date":  "2014/01/01"
}

5 Checking Whether a Doc Exists

curl -i -XHEAD http://localhost:9200/website/blog/123

If the document exists:
HTTP/1.1 200 OK
If it doesn’t exist:
HTTP/1.1 404 Not Found

Of course, just because a document didn’t exist when you checked it, doesn’t mean that it won’t exist a millisecond later: another process might create the document in the meantime.

6 Updating a Whole Doc (PUT)

PUT /website/blog/123
{
  "title": "My first blog entry",
  "text":  "I am starting to get the hang of this...",
  "date":  "2014/01/02"
}

Result:

{
  "_index" :   "website",
  "_type" :    "blog",
  "_id" :      "123",
  "_version" : 2,      // version incremented by 1
  "created":   false   // not a first-time create: this index + type + id already existed
}

Internally, Elasticsearch has marked the old document as deleted and added an entirely new document.

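In essence, an update marks the old document as deleted and indexes a new one; the old document is physically removed later, during a merge.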

The old version of the document doesn’t disappear immediately, although you won’t be able to access it.

Partial updates to a document follow the same process:

  1. Retrieve the JSON from the old document
  2. Change it
  3. Delete the old document
  4. Index a new document

The update API achieves this through a single client request, instead of requiring separate get and index requests.

7 Creating a New Doc (POST)

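How can we be sure, when we index a document, that we are creating an entirely new document and not overwriting an existing one?

Remember that the combination of _index, _type, and _id uniquely identifies a document.

By letting Elasticsearch autogenerate the ID: POST /website/blog/ { ... }
By specifying the op_type parameter: PUT /website/blog/123?op_type=create { ... }
By using the _create endpoint: PUT /website/blog/123/_create { ... }

If the request succeeds in creating a new document, Elasticsearch returns 201 Created; if a document with the same _index, _type, and _id already exists, it returns 409 Conflict: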

{
   "error": {
      "root_cause": [
         {
            "type": "document_already_exists_exception",
            "reason": "[blog][123]: document already exists",
            "shard": "0",
            "index": "website"
         }
      ],
      "type": "document_already_exists_exception",
      "reason": "[blog][123]: document already exists",
      "shard": "0",
      "index": "website"
   },
   "status": 409
}
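The status codes involved can be mimicked with a small in-memory model. This Python sketch is a hypothetical stand-in for a single index and type (the TinyStore class is invented for illustration and is not Elasticsearch code); it assumes that replacing an existing document answers 200, creating a new one answers 201, and a create on an existing ID answers 409.

```python
class TinyStore:
    """Hypothetical in-memory stand-in for one index/type."""

    def __init__(self):
        self.docs = {}  # doc id -> (version, source)

    def index(self, doc_id, source, op_type="index"):
        """Store a document and return the HTTP status the request would get."""
        exists = doc_id in self.docs
        if op_type == "create" and exists:
            return 409  # Conflict: refuse to overwrite an existing document
        version = self.docs[doc_id][0] + 1 if exists else 1
        self.docs[doc_id] = (version, source)
        return 200 if exists else 201  # 200 = replaced, 201 = created
```

Calling index("123", doc, op_type="create") twice on a fresh store yields 201 and then 409, while a plain index call on the existing ID yields 200.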

Exercise:
When is 200 returned, and when is 201?

8 Deleting a Doc

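DELETE /website/blog/123

Result: 200 OK

{
  "found" :    true,
  "_index" :   "website",
  "_type" :    "blog",
  "_id" :      "123",
  "_version" : 3      // a delete is also a change, so the version is incremented
}

If the document isn’t found, we get a 404 Not Found response with a body like this:

{
  "found" :    false,   // the doc was not found
  "_index" :   "website",
  "_type" :    "blog",
  "_id" :      "123",
  "_version" : 4        // the version is still incremented
}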

Even though the document doesn’t exist (found is false), the _version number has still been incremented. This is part of the internal bookkeeping, which ensures that changes are applied in the correct order across multiple nodes.

Note: a delete only marks the document as deleted; it is physically removed later, during a merge.

9 Dealing with Conflicts

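An update is a read-delete-write cycle in which the most recent indexing request wins, so a concurrent change can be lost.

Pessimistic concurrency control: widely used by relational databases; the row or table is locked before it is read.
Optimistic concurrency control: used by Elasticsearch; nothing is locked, but if the underlying data has been modified between reading and writing, the update will fail.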

10 Optimistic Concurrency Control (_version)

When documents are created, updated, or deleted, the new version of the document has to be replicated to other nodes in the cluster.

Elasticsearch is also asynchronous and concurrent, meaning that these replication requests are sent in parallel, and may arrive at their destination out of sequence. Elasticsearch needs a way of ensuring that an older version of a document never overwrites a newer version.

PUT /website/blog/1/_create
{
  "title": "My first blog entry",
  "text":  "Just trying this out..."
}

Reindexing with a specific version:

PUT /website/blog/1?version=1
{
  "title": "My first blog entry",
  "text":  "Starting to get the hang of this..."
}

We want this update to succeed only if the current _version of this document in our index is version 1.

Success:

{
  "_index":   "website",
  "_type":    "blog",
  "_id":      "1",
  "_version": 2,       // incremented by 1
  "created":  false
}

Failure:

{
   "error": {
      "root_cause": [
         {
            "type": "version_conflict_engine_exception",
            "reason": "[blog][1]: version conflict, current [2], provided [1]",
            "index": "website",
            "shard": "3"
         }
      ],
      "type": "version_conflict_engine_exception",
      "reason": "[blog][1]: version conflict, current [2], provided [1]",
      "index": "website",
      "shard": "3"
   },
   "status": 409
}

As shown above, the current version is 2 but the request specified version 1, so the update fails.
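The version check itself is simple to model. The following Python sketch is a hypothetical in-memory store, not Elasticsearch code: a write that supplies a version succeeds only if that version equals the stored one, and every successful write increments the version.

```python
class VersionConflict(Exception):
    """Raised when the supplied version does not match the stored version."""


class VersionedStore:
    """Hypothetical in-memory model of internal versioning."""

    def __init__(self):
        self.docs = {}  # doc id -> {"_version": int, "_source": dict}

    def index(self, doc_id, source, version=None):
        """Write a document; if version is given, it must match the current one."""
        current = self.docs.get(doc_id)
        current_version = current["_version"] if current else 0
        if version is not None and version != current_version:
            raise VersionConflict(
                f"version conflict, current [{current_version}], provided [{version}]"
            )
        new_version = current_version + 1
        self.docs[doc_id] = {"_version": new_version, "_source": source}
        return new_version
```

Two writers that both read version 1 and both send version=1 cannot both succeed: the first moves the document to version 2, and the second gets the conflict instead of silently overwriting it.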

Using an external version (the check changes from “equal to” to “less than”)

Instead of checking that the current _version is the same as the one specified in the request, Elasticsearch checks that the current _version is less than the specified version.
If the request succeeds, the external version number is stored as the document’s new _version.

External version numbers can be specified not only on index and delete requests, but also when creating new documents.

Creating:

PUT /website/blog/2?version=5&version_type=external
{
  "title": "My first external blog entry",
  "text":  "Starting to get the hang of this..."
}

Response:

{
  "_index":   "website",
  "_type":    "blog",
  "_id":      "2",
  "_version": 5,
  "created":  true
}

Updating:

PUT /website/blog/2?version=10&version_type=external
{
  "title": "My first external blog entry",
  "text":  "This is a piece of cake..."
}

Response:

{
  "_index":   "website",
  "_type":    "blog",
  "_id":      "2",
  "_version": 10,
  "created":  false
}
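The external-version rule can be written as a single comparison. This Python sketch is illustrative only (apply_external_version is a hypothetical helper); it follows the rule stated above: the write is accepted when the stored version is lower than the supplied one, and the supplied number then becomes the new _version.

```python
def apply_external_version(current_version, external_version):
    """Return the new _version, or raise if the external version is not newer.

    current_version is None when the document does not exist yet.
    """
    if current_version is not None and current_version >= external_version:
        raise ValueError(
            f"version conflict, current [{current_version}], "
            f"provided [{external_version}]"
        )
    return external_version
```

With the requests above, creating with version 5 stores 5, updating with version 10 stores 10, and repeating the update with version 10 would be rejected.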

11 Partial Updates to Doc

Documents are immutable: they cannot be changed, only replaced.
The update API must obey the same rules.

Externally an update looks like a change in place, but internally it is the same retrieve-change-reindex process.
The difference is that this process happens within a shard, thus avoiding the network overhead of multiple requests.

A partial update: note the doc parameter, which here adds the tags and views fields:

POST /website/blog/1/_update
{
   "doc" : {
      "tags" : [ "testing" ],
      "views": 0
   }
}

Success:

{
   "_index" :   "website",
   "_id" :      "1",
   "_type" :    "blog",
   "_version" : 3
}
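What the doc parameter does to the stored _source can be sketched as a merge. The Python below is a minimal illustration, assuming merge semantics in which nested objects are merged recursively while scalars and arrays are overwritten; merge_partial is a hypothetical helper, not part of any client library.

```python
import copy


def merge_partial(source, partial):
    """Return a new document: partial merged into a copy of source."""
    merged = copy.deepcopy(source)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested objects are merged field by field.
            merged[key] = merge_partial(merged[key], value)
        else:
            # Scalars and arrays are overwritten; new fields are added.
            merged[key] = copy.deepcopy(value)
    return merged


original = {
    "title": "My first blog entry",
    "text": "Starting to get the hang of this...",
}
updated = merge_partial(original, {"tags": ["testing"], "views": 0})
```

The original document is left untouched, which mirrors the fact that the old version is replaced rather than modified in place.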

Partial updates with scripts (ctx._source)

POST /website/blog/1/_update
{
   "script" : "ctx._source.views+=1"
}

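Note:
Scripting is supported in many APIs including search, sorting, aggregations, and document updates.
The default scripting language is Groovy, a fast and expressive language with syntax similar to JavaScript. It was introduced in v1.3.0 to run in a sandbox, but a vulnerability was found in the sandbox, so in v1.3.8, v1.4.3, and v1.5.0 and newer it has been disabled by default.

Scripts can be supplied from three places:

  • .scripts index
  • config/scripts/ directory
  • request parameter

To disable it: script.groovy.sandbox.enabled: false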

Using a script to add a new element to an array field:

POST /website/blog/1/_update
{
   "script" : "ctx._source.tags+=new_tag",
   "params" : {
      "new_tag" : "search"
   }
}

Passing the new tag as a parameter, rather than hardcoding it in the script, allows Elasticsearch to reuse the script in the future, without having to compile a new script every time we want to add another tag.

Deleting via a script (ctx.op = 'delete'):

POST /website/blog/1/_update
{
   "script" : "ctx.op = ctx._source.views == count ? 'delete' : 'none'",
   "params" : {
      "count": 1
   }
}

Updating or creating via a script (upsert):

POST /website/pageviews/1/_update
{
   "script" : "ctx._source.views+=1",   // existing doc: increment by 1
   "upsert": {
       "views": 1                       // new doc: initial value 1
   }
}

Updates and Conflicts: the retry_on_conflict parameter sets how many times the update is retried before it fails.

POST /website/pageviews/1/_update?retry_on_conflict=5
{
   "script" : "ctx._source.views+=1",
   "upsert": {
       "views": 0
   }
}

This works well for operations such as incrementing a counter, where the order of increments does not matter, but in other situations the order of changes is important.

Like the index API, the update API adopts a last-write-wins approach by default, but it also accepts a version parameter that allows you to use optimistic concurrency control to specify which version of the document you intend to update.
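The effect of retry_on_conflict can be illustrated with a small simulation. This Python sketch is not Elasticsearch code: CounterDoc, write_if_version, and the before_write hook are hypothetical names used to model a retrieve-change-reindex loop in which a concurrent writer sneaks in between the read and the write.

```python
class Conflict(Exception):
    """Stands in for a 409 version conflict."""


class CounterDoc:
    """Hypothetical in-memory document holding a view counter."""

    def __init__(self):
        self.version = 1
        self.views = 0


def write_if_version(doc, expected_version, new_views):
    """Write only if nobody changed the document since it was read."""
    if doc.version != expected_version:
        raise Conflict()
    doc.views = new_views
    doc.version += 1


def increment_views(doc, retry_on_conflict=0, before_write=None):
    """Retrieve-change-reindex, retried up to retry_on_conflict times.

    before_write is a hook that simulates a concurrent writer.
    Returns the number of retries that were needed.
    """
    for attempt in range(retry_on_conflict + 1):
        seen_version, seen_views = doc.version, doc.views  # retrieve
        if before_write is not None:
            before_write(attempt)
        try:
            write_if_version(doc, seen_version, seen_views + 1)  # reindex
            return attempt
        except Conflict:
            if attempt == retry_on_conflict:
                raise  # out of retries: surface the conflict
```

Because each retry rereads the current value before incrementing, no increment is lost, which is why this approach suits counters where the order of increments does not matter.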

Reference:
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/modules-scripting.html

12 Retrieving Multiple Docs (mget)

Combining multiple requests into one avoids the network overhead of processing each request individually.

Parameter: docs

GET /_mget
{
   "docs" : [
      {
         "_index" : "website",
         "_type" :  "blog",
         "_id" :    2
      },
      {
         "_index" : "website",
         "_type" :  "pageviews",
         "_id" :    1,
         "_source": "views"
      }
   ]
}

Response:

{
   "docs" : [
      {
         "_index" :   "website",
         "_id" :      "2",
         "_type" :    "blog",
         "found" :    true,
         "_source" : {
            "text" :  "This is a piece of cake...",
            "title" : "My first external blog entry"
         },
         "_version" : 10
      },
      ...
   ]
}

Specifying a default index/type in the URL:

GET /website/blog/_mget
{
   "docs" : [
      { "_id" : 2 },
      { "_type" : "pageviews", "_id" :   1 }
   ]
}

If all the documents share the same _index and _type, only the ids parameter is needed:

GET /website/blog/_mget
{
   "ids" : [ "2", "1" ]
}

Response (one document exists, the other does not):

{
  "docs" : [
    {
      "_index" :   "website",
      "_type" :    "blog",
      "_id" :      "2",
      "_version" : 10,
      "found" :    true,
      "_source" : {
        "title":   "My first external blog entry",
        "text":    "This is a piece of cake..."
      }
    },
    {
      "_index" :   "website",
      "_type" :    "blog",
      "_id" :      "1",
      "found" :    false   // not found: doc 1 lives in the pageviews type; this doesn’t affect doc 2
    }
  ]
}

Each doc is retrieved and reported on individually.

Note: The HTTP status code for the preceding request is 200, even though one document wasn’t found. In fact, it would still be 200 if none of the requested documents were found, because the mget request itself completed successfully. To determine the success or failure of the individual documents, you need to check the found flag.
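In client code this means looping over docs and inspecting each found flag rather than relying on the HTTP status. A minimal Python sketch, assuming the response has already been parsed into a dict (split_mget_response is a hypothetical helper name):

```python
def split_mget_response(response):
    """Separate an mget response into found sources and missing keys."""
    found, missing = {}, []
    for doc in response["docs"]:
        key = (doc["_index"], doc["_type"], doc["_id"])
        if doc.get("found"):
            found[key] = doc["_source"]
        else:
            missing.append(key)
    return found, missing


response = {
    "docs": [
        {
            "_index": "website", "_type": "blog", "_id": "2",
            "_version": 10, "found": True,
            "_source": {
                "title": "My first external blog entry",
                "text": "This is a piece of cake...",
            },
        },
        {"_index": "website", "_type": "blog", "_id": "1", "found": False},
    ]
}
found, missing = split_mget_response(response)
```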

13 Cheaper in Bulk

The bulk API allows us to make multiple create, index, update, or delete requests in a single step.

Format:

{ action: { metadata }}\n
{ request body        }\n
{ action: { metadata }}\n
{ request body        }\n
...

action: create, index (create or replace), update (partial update), or delete.
metadata: _index, _type, _id, routing, and so on.

Note:

  • Every line must end with a newline character (\n), including the last line.
  • The lines cannot contain unescaped newline characters, as they would interfere with parsing.

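Why this format? With the requests wrapped in a JSON array, the receiving node would have to do all of the following; with one request per line, it can instead forward the raw lines straight to the right shard: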

  • Parse the JSON into an array (including the document data, which can be very large)
  • Look at each request to determine which shard it should go to
  • Create an array of requests for each shard
  • Serialize these arrays into the internal transport format
  • Send the requests to each shard

A delete action has no request body:

{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}

A create action is followed by the document to create:

{ "create":  { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "My first blog post" }

A complete bulk request:

POST /_bulk
{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "My first blog post" }
{ "index":  { "_index": "website", "_type": "blog" }}
{ "title":    "My second blog post" }
{ "update": { "_index": "website", "_type": "blog", "_id": "123", "_retry_on_conflict" : 3} }
{ "doc" : {"title" : "My updated blog post"} }

  • The delete action does not have a request body; it is followed immediately by the next action.
  • Remember the final newline character.
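Building this body by hand is error-prone, so a small helper is useful. The Python sketch below (build_bulk_body is a hypothetical name) serializes each action and optional body onto its own line; json.dumps escapes any newline inside a string value, and the function always appends the final newline.

```python
import json


def build_bulk_body(operations):
    """Serialize (action, metadata, body_or_None) tuples into a bulk body."""
    lines = []
    for action, metadata, body in operations:
        lines.append(json.dumps({action: metadata}))
        if body is not None:  # delete actions carry no request body
            lines.append(json.dumps(body))
    return "\n".join(lines) + "\n"  # every line, including the last, ends in \n


bulk_body = build_bulk_body([
    ("delete", {"_index": "website", "_type": "blog", "_id": "123"}, None),
    ("create", {"_index": "website", "_type": "blog", "_id": "123"},
     {"title": "My first blog post"}),
    ("index", {"_index": "website", "_type": "blog"},
     {"title": "My second blog post"}),
])
```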

Response:

{
   "took": 4,
   "errors": false,
   "items": [             // the items array lists the result of each request
      {  "delete": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 2,
            "status":   200,   // deleted: 200
            "found":    true
      }},
      {  "create": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 3,
            "status":   201    // created: 201
      }},
      {  "create": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "EiwfApScQiiy7TIKFxRCTw",
            "_version": 1,
            "status":   201
      }},
      {  "update": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 4,
            "status":   200    // updated: 200
      }}
   ]
}

Each subrequest is executed independently, so the failure of one subrequest won’t affect the success of the others.

If any of the requests fail, the top-level errors flag is set to true and the error details will be reported under the relevant request:

{
   "took": 3,
   "errors": true,        // at least one request failed (possibly some, possibly all)
   "items": [
      {  "create": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "status":   409,   // failed
            "error":    "DocumentAlreadyExistsException[[website][4] [blog][123]: document already exists]"
      }},
      {  "index": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 5,
            "status":   200    // succeeded
      }}
   ]
}

Bulk requests are not atomic: they cannot be used to implement transactions.
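Because subrequests succeed or fail independently, a client usually walks the items array and collects the failures. A minimal Python sketch, assuming the response has already been parsed into a dict (failed_items is a hypothetical helper name):

```python
def failed_items(bulk_response):
    """Return (position, action, status, error) for each failed subrequest."""
    if not bulk_response.get("errors"):
        return []  # fast path: the top-level flag says nothing failed
    failures = []
    for position, item in enumerate(bulk_response["items"]):
        # Each item is a single-key dict: {action: result}.
        action, result = next(iter(item.items()))
        if "error" in result:
            failures.append((position, action, result["status"], result["error"]))
    return failures


bulk_response = {
    "took": 3,
    "errors": True,
    "items": [
        {"create": {"_index": "website", "_type": "blog", "_id": "123",
                    "status": 409,
                    "error": "DocumentAlreadyExistsException[[website][4] "
                             "[blog][123]: document already exists]"}},
        {"index": {"_index": "website", "_type": "blog", "_id": "123",
                   "_version": 5, "status": 200}},
    ],
}
failures = failed_items(bulk_response)
```

Only the failed subrequests need to be retried or reported; the successful ones have already been applied.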

Don’t Repeat Yourself (default values in the URL):

POST /website/_bulk
{ "index": { "_type": "log" }}
{ "event": "User logged in" }

Overriding the defaults:

POST /website/log/_bulk
{ "index": {}}
{ "event": "User logged in" }
{ "index": { "_type": "blog" }}
{ "title": "Overriding the default type" }

How Big Is Too Big?
The entire bulk request needs to be loaded into memory by the node that receives our request, so the bigger the request, the less memory available for other requests.

The optimal size, however, is not a fixed number. It depends entirely on your hardware, your document size and complexity, and your indexing and search load.

A good place to start is with batches of 1,000 to 5,000 documents or, if your documents are very large, with even smaller batches.

A good bulk size to start playing with is around 5-15MB in size.

By default, HTTP requests are limited to a maximum of 100MB; this limit can be changed.
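One way to apply these guidelines is to cap each batch by both document count and approximate serialized size. The Python sketch below is illustrative only: the batches helper is hypothetical, and its default limits (1,000 documents, 5MB) are simply the low ends of the starting ranges mentioned above, to be tuned by experiment.

```python
import json


def batches(docs, max_docs=1000, max_bytes=5 * 1024 * 1024):
    """Yield lists of docs, each within the document-count and byte limits."""
    batch, batch_bytes = [], 0
    for doc in docs:
        # Approximate size of the serialized document plus its newline.
        size = len(json.dumps(doc).encode("utf-8")) + 1
        if batch and (len(batch) >= max_docs or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += size
    if batch:
        yield batch  # final, possibly smaller, batch
```

A single document larger than max_bytes still gets a batch of its own, so nothing is dropped.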
