【工作笔记】ElasticSearch从零开始学（二）—— 入门（搜索）

来源：互联网发布：网络大病众筹编辑：程序博客网时间：2024/05/22 12:56

建立一个员工目录

假设我们刚好在Megacorp工作，这时人力资源部门出于某种目的需要让我们创建一个员工目录，这个目录用于促进人文关怀和用于实时协同工作，所以它有以下不同的需求

数据能够包含多个值的标签、数字和纯文本。
检索任何员工的所有信息。
支持结构化搜索，例如查找30岁以上的员工。
支持简单的全文搜索和更复杂的短语(phrase)搜索
高亮搜索结果中的关键字
能够利用图表管理分析这些数据

索引员工文档

首先要做的是存储员工数据，每个文档代表一个员工。

在Elasticsearch中存储数据的行为就叫做索引(indexing)，不过在索引之前，我们需要明确数据应该存储在哪里。

在Elasticsearch中，文档归属于一种类型(type),而这些类型存在于索引(index)中，我们可以画一些简单的对比图来类比传统关系型数据库

Relational DB -> Databases -> Tables -> Rows -> ColumnsElasticsearch -> Indices   -> Types  -> Documents -> Fields

Elasticsearch集群可以包含多个索引(indices)（数据库），每一个索引可以包含多个类型(types)（表），每一个类型包含多个文档(documents)（行），然后每个文档包含多个字段(Fields)（列）

「索引」含义的区分
你可能已经注意到索引(index)这个词在Elasticsearch中有着不同的含义，所以有必要在此做一下区分:
索引（名词）如上文所述，一个索引(index)就像是传统关系数据库中的数据库，它是相关文档存储的地方，index的复数是indices 或indexes。
索引（动词） 「索引一个文档」表示把一个文档存储到索引（名词）里，以便它可以被检索或者查询。这很像SQL中的INSERT关键字，差别是，如果文档已经存在，新的文档将覆盖旧的文档。
倒排索引传统数据库为特定列增加一个索引，例如B-Tree索引来加速检索。Elasticsearch和Lucene使用一种叫做倒排索引(inverted index)的数据结构来达到相同目的。

注：默认情况下，文档中的所有字段都会被索引（拥有一个倒排索引），只有这样他们才是可被搜索的

为了创建员工目录，我们将进行如下操作

为每个员工的文档(document)建立索引，每个文档包含了相应员工的所有信息。
每个文档的类型为employee。
employee类型归属于索引megacorp。
megacorp索引存储在Elasticsearch集群中

//新增  /索引名/类型名/员工IDPUT /megacorp/employee/1{    "first_name" : "John",    "last_name" :  "Smith",    "age" :        25,    "about" :      "I love to go rock climbing",    "interests": [ "sports", "music" ]}PUT /megacorp/employee/3{    "first_name" :  "Douglas",    "last_name" :   "Fir",    "age" :         35,    "about":        "I like to build cabinets",    "interests":  [ "forestry" ]}

检索文档

第一个需求是能够检索单个员工的信息

GET /megacorp/employee/1

响应的内容中包含一些文档的元信息，John Smith的原始JSON文档包含在_source字段中

{  "_index" :   "megacorp",  "_type" :    "employee",  "_id" :      "1",  "_version" : 1,  "found" :    true,  "_source" :  {      "first_name" :  "John",      "last_name" :   "Smith",      "age" :         25,      "about" :       "I love to go rock climbing",      "interests":  [ "sports", "music" ]  }}

我们通过HTTP方法GET来检索文档，同样的，我们可以使用DELETE方法删除文档，使用HEAD方法检查某文档是否存在。如果想更新已存在的文档，我们只需再PUT一次

简单搜索

尝试一个最简单的搜索全部员工的请求

GET /megacorp/employee/_search

响应内容的hits数组中包含了我们所有的三个文档。默认情况下搜索会返回前10个结果

{   "took":      6,   "timed_out": false,   "_shards": { ... },   "hits": {      "total":      3,      "max_score":  1,      "hits": [         {            "_index":         "megacorp",            "_type":          "employee",            "_id":            "3",            "_score":         1,            "_source": {               "first_name":  "Douglas",               "last_name":   "Fir",               "age":         35,               "about":       "I like to build cabinets",               "interests": [ "forestry" ]            }         },         {            "_index":         "megacorp",            "_type":          "employee",            "_id":            "1",            "_score":         1,            "_source": {               "first_name":  "John",               "last_name":   "Smith",               "age":         25,               "about":       "I love to go rock climbing",               "interests": [ "sports", "music" ]            }         },         {            "_index":         "megacorp",            "_type":          "employee",            "_id":            "2",            "_score":         1,            "_source": {               "first_name":  "Jane",               "last_name":   "Smith",               "age":         32,               "about":       "I like to collect rock albums",               "interests": [ "music" ]            }         }      ]   }}

搜索姓氏中包含“Smith”的员工。要做到这一点，我们将在命令行中使用轻量级的搜索方法。这种方法常被称作查询字符串(query string)搜索

//在请求中依旧使用_search关键字，然后将查询语句传递给参数q=GET /megacorp/employee/_search?q=last_name:Smith

{   ...   "hits": {      "total":      2,      "max_score":  0.30685282,      "hits": [         {            ...            "_source": {               "first_name":  "John",               "last_name":   "Smith",               "age":         25,               "about":       "I love to go rock climbing",               "interests": [ "sports", "music" ]            }         },         {            ...            "_source": {               "first_name":  "Jane",               "last_name":   "Smith",               "age":         32,               "about":       "I like to collect rock albums",               "interests": [ "music" ]            }         }      ]   }}

使用DSL语句查询

查询字符串搜索便于通过命令行完成特定(ad hoc)的搜索，但是它也有局限性（参阅简单搜索章节）。Elasticsearch提供丰富且灵活的查询语言叫做DSL查询(Query DSL),它允许你构建更加复杂、强大的查询

DSL(Domain Specific Language特定领域语言)以JSON请求体的形式出现。我们可以这样表示之前关于“Smith”的查询

//相当于/megacorp/employee/_search?q=last_name:"Smith"GET /megacorp/employee/_search{    "query" : { //查询        "match" : { //匹配            "last_name" : "Smith" //条件...        }    }}

我们不再使用查询字符串(query string)做为参数，而是使用请求体代替。这个请求体使用JSON表示，其中使用了match语句

更复杂的搜索

让搜索稍微再变的复杂一些。我们依旧想要找到姓氏为“Smith”的员工，但是我们只想得到年龄大于30岁的员工。我们的语句将添加过滤器(filter),它使得我们高效率的执行一个结构化搜索

GET /megacorp/employee/_search{    "query" : {        "filtered" : {            "filter" : {                "range" : {                    "age" : { "gt" : 30 } <1>                }            },            "query" : {                "match" : {                    "last_name" : "smith" <2>                }            }        }    }}

<1> 这部分查询属于区间过滤器(range filter),它用于查找所有年龄大于30岁的数据——gt为”greater than”的缩写。
<2> 这部分查询与之前的match语句(query)一致

{   ...   "hits": {      "total":      1,      "max_score":  0.30685282,      "hits": [         {            ...            "_source": {               "first_name":  "Jane",               "last_name":   "Smith",               "age":         32,               "about":       "I like to collect rock albums",               "interests": [ "music" ]            }         }      ]   }}

全文搜索

到目前为止搜索都很简单：搜索特定的名字，通过年龄筛选。让我们尝试一种更高级的搜索，全文搜索——一种传统数据库很难实现的功能。
将会搜索所有喜欢“rock climbing”的员工

GET /megacorp/employee/_search{    "query" : {        "match" : {            "about" : "rock climbing"        }    }}

使用了之前的match查询，从about字段中搜索”rock climbing”，我们得到了两个匹配文档

{   ...   "hits": {      "total":      2,      "max_score":  0.16273327,      "hits": [         {            ...            "_score": 0.16273327, <1>            "_source": {               "first_name":  "John",               "last_name":   "Smith",               "age":         25,               "about":       "I love to go rock climbing",               "interests": [ "sports", "music" ]            }         },         {            ...            "_score": 0.016878016, <2>            "_source": {               "first_name":  "Jane",               "last_name":   "Smith",               "age":         32,               "about":       "I like to collect rock albums",               "interests": [ "music" ]            }         }      ]   }}

<1><2> 结果相关性评分

默认情况下，Elasticsearch根据结果相关性评分来对结果集进行排序，所谓的「结果相关性评分」就是文档与查询条件的匹配程度。很显然，排名第一的John Smith的about字段明确的写到“rock climbing”。
但是为什么Jane Smith也会出现在结果里呢？原因是“rock”在她的abuot字段中被提及了。因为只有“rock”被提及而“climbing”没有，所以她的_score要低于John

这个例子很好的解释了Elasticsearch如何在各种文本字段中进行全文搜索，并且返回相关性最大的结果集。相关性(relevance)的概念在Elasticsearch中非常重要，而这个概念在传统关系型数据库中是不可想象的，因为传统数据库对记录的查询只有匹配或者不匹配

短语搜索

目前我们可以在字段中搜索单独的一个词，这挺好的，但是有时候你想要确切的匹配若干个单词或者短语(phrases)。例如我们想要查询同时包含”rock”和”climbing”（并且是相邻的）的员工记录

只要将match查询变更为match_phrase查询即可

GET /megacorp/employee/_search{    "query" : {        "match_phrase" : {            "about" : "rock climbing"        }    }}

{   ...   "hits": {      "total":      1,      "max_score":  0.23013961,      "hits": [         {            ...            "_score":         0.23013961,            "_source": {               "first_name":  "John",               "last_name":   "Smith",               "age":         25,               "about":       "I love to go rock climbing",               "interests": [ "sports", "music" ]            }         }      ]   }}

高亮搜索

很多应用喜欢从每个搜索结果中高亮(highlight)匹配到的关键字，这样用户可以知道为什么这些文档和查询相匹配。在Elasticsearch中高亮片段是非常容易的

在之前的语句上增加highlight参数

GET /megacorp/employee/_search{    "query" : {        "match_phrase" : {            "about" : "rock climbing"        }    },    "highlight": {        "fields" : {            "about" : {}        }    }}

当我们运行这个语句时，会命中与之前相同的结果，但是在返回结果中会有一个新的部分叫做highlight，这里包含了来自about字段中的文本，并且用来标识匹配到的单词

{   ...   "hits": {      "total":      1,      "max_score":  0.23013961,      "hits": [         {            ...            "_score":         0.23013961,            "_source": {               "first_name":  "John",               "last_name":   "Smith",               "age":         25,               "about":       "I love to go rock climbing",               "interests": [ "sports", "music" ]            },            "highlight": {               "about": [                  "I love to go <em>rock</em> <em>climbing</em>" <1>               ]            }         }      ]   }}

<1> 原有文本中高亮的片段

以上转载
https://es.xiaoleilu.com/010_Intro/30_Tutorial_Search.html

0 0