Introduction

Overview

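Applications often need search, and the open-source Elasticsearch is currently a leading choice among full-text search engines. It can store, search, and analyze large volumes of data quickly. Spring Boot integrates Spring Data Elasticsearch to provide convenient search support.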

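Elasticsearch is a distributed search service that exposes a RESTful API and is built on Lucene. It splits each index into multiple shards and keeps replica copies of them to protect the data, and it automatically rebalances shards across nodes as the cluster grows or shrinks. Large sites such as GitHub have adopted Elasticsearch as their search service.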

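Elasticsearch official website | Elasticsearch: The Definitive Guide | Follow the WeChat official account at the top of this article and send #141_elasticsearch_doc to get an offline copy of the Definitive Guide
For installation, see [Installing ElasticSearch with Docker].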

Core Concepts

The following passage is quoted from the Definitive Guide:

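Objects in an application are seldom just a simple list of keys and values. More often they are complex data structures that may contain dates, geo locations, other objects, or arrays.
Sooner or later you will want to store these objects in a database. Doing so with the rows and columns of a relational database is like squeezing a rich, expressive object into a very big spreadsheet: you have to flatten the object to fit the table schema (usually one field per column) and then reconstruct it every time you query it.
Elasticsearch is document oriented, meaning that it stores entire objects or documents. It does not only store them; it also indexes the contents of each document to make them searchable. In Elasticsearch you index, search, sort, and filter documents rather than rows of columnar data. This is a fundamentally different way of thinking about data, and it is one of the reasons Elasticsearch can perform complex full-text search.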

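The key concepts involved are:

Document: usually one object is represented by one document. It is the smallest unit of data stored in Elasticsearch and is stored as JSON, so a document is also a JSON object.
Index: the word has two senses in Elasticsearch. (1) As a verb, to index a document means to store it in Elasticsearch. (2) As a noun, an index is the place where documents are stored, comparable to a database in a traditional relational database.
Type: when indexing a document you specify where in the index it is stored, and that location is described by its type. Types belong to older Elasticsearch releases: they were deprecated in 6.x and removed in later major versions, and the typed URLs in the examples below follow the older form.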

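Usage Examples

We can operate the Elasticsearch service entirely through HTTP requests.

Indexing a Document

Take indexing an employee object (a document) as an example. We only need to send Elasticsearch a RESTful PUT request like the following: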

PUT /megacorp/employee/1
{
    "first_name": "John",
    "last_name": "Smith",
    "age": 25,
    "about": "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}

Response:

{
    "_index": "megacorp",
    "_type": "employee",
    "_id": "1",
    "_version": 1,
    "result": "created",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    },
    "created": true
}

The path /megacorp/employee/1 contains three pieces of information:

Name	Description
megacorp	Index name
employee	Type name
1	Employee ID
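The three-part path described above can be split and rebuilt mechanically. The following is a minimal sketch in Python; `DocumentPath`, `parse_document_path`, and `build_document_path` are hypothetical helper names introduced here for illustration and are not part of any Elasticsearch client.

```python
from typing import NamedTuple


class DocumentPath(NamedTuple):
    index: str      # e.g. "megacorp"
    doc_type: str   # e.g. "employee"
    doc_id: str     # e.g. "1"


def parse_document_path(path: str) -> DocumentPath:
    """Split '/index/type/id' into its three named parts."""
    parts = path.strip("/").split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected /index/type/id, got {path!r}")
    return DocumentPath(*parts)


def build_document_path(index: str, doc_type: str, doc_id) -> str:
    """Join the three parts back into a request path."""
    return f"/{index}/{doc_type}/{doc_id}"


parsed = parse_document_path("/megacorp/employee/1")
print(parsed.index, parsed.doc_type, parsed.doc_id)
```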

We can go on to save the employees with IDs 2 and 3:

PUT /megacorp/employee/2
{
    "first_name": "Jane",
    "last_name": "Smith",
    "age": 32,
    "about": "I like to collect rock albums",
    "interests": [ "music" ]
}

PUT /megacorp/employee/3
{
    "first_name": "Douglas",
    "last_name": "Fir",
    "age": 35,
    "about": "I like to build cabinets",
    "interests": [ "forestry" ]
}

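To update an existing document, send the same kind of PUT request to its path: the stored document is replaced as a whole and its _version is incremented.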
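For readers who prefer code to raw HTTP, here is a hedged sketch of how the PUT request could be built with Python's standard library. The base URL `http://localhost:9200` and the helper name `build_index_request` are assumptions made for illustration; the request is only constructed, not sent, so the snippet runs without a server.

```python
import json
import urllib.request

# Assumption: a local single-node Elasticsearch; adjust to your deployment.
BASE_URL = "http://localhost:9200"


def build_index_request(index, doc_type, doc_id, document, base_url=BASE_URL):
    """Build (without sending) the PUT request that indexes one document."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


john = {
    "first_name": "John",
    "last_name": "Smith",
    "age": 25,
    "about": "I love to go rock climbing",
    "interests": ["sports", "music"],
}
request = build_index_request("megacorp", "employee", 1, john)
print(request.get_method(), request.full_url)

# To actually send it against a running server:
#     with urllib.request.urlopen(request) as response:
#         print(json.load(response))
```

Sending the same request again with a changed body would overwrite the document, which is the update behaviour described above.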

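Retrieving a Document

Take querying the employee with id 1 under the employee type of the megacorp index as an example. We only need to send a RESTful GET request like the following: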

GET /megacorp/employee/1

Response:

{
    "_index": "megacorp",
    "_type": "employee",
    "_id": "1",
    "_version": 1,
    "found": true,
    "_source": {
        "first_name": "John",
        "last_name": "Smith",
        "age": 25,
        "about": "I love to go rock climbing",
        "interests": [
            "sports",
            "music"
        ]
    }
}

Checking Whether a Document Exists

To check whether the employee with id 1 exists, we only need to send a RESTful HEAD request:

HEAD /megacorp/employee/1

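This request has no response body; the result is signaled by the response status code. If the employee exists the status code is 200, otherwise it is 404.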
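The status-code convention for the HEAD request can be captured in a few lines. This is a sketch only: `document_exists` and `head_status` are hypothetical helper names, and `head_status` needs a reachable server, so it is defined here but not called.

```python
import urllib.error
import urllib.request


def document_exists(status_code: int) -> bool:
    """Interpret the status code of a HEAD request for a document."""
    if status_code == 200:
        return True
    if status_code == 404:
        return False
    raise RuntimeError(f"unexpected status code from HEAD request: {status_code}")


def head_status(url: str) -> int:
    """Send a HEAD request and return the status code (needs a running server)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx responses; the code is still on the error.
        return error.code


print(document_exists(200), document_exists(404))
```

Against a live server, the two helpers would be combined as `document_exists(head_status(url))`, where `url` is the full address of the document.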

Deleting a Document

To delete the employee with id 1, we only need to send a RESTful DELETE request:

DELETE /megacorp/employee/1

Response:

{
    "found": true,
    "_index": "megacorp",
    "_type": "employee",
    "_id": "1",
    "_version": 2,
    "result": "deleted",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    }
}

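Lightweight Search

Simple Search

We already know how to get a document by its id. We can also search all documents under a given type of a given index as follows: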

GET /megacorp/employee/_search

Response:

{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 3,
        "max_score": 1,
        "hits": [
            {
                "_index": "megacorp",
                "_type": "employee",
                "_id": "2",
                "_score": 1,
                "_source": {
                    "first_name": "Jane",
                    "last_name": "Smith",
                    "age": 32,
                    "about": "I like to collect rock albums",
                    "interests": [
                        "music"
                    ]
                }
            },
            {
                "_index": "megacorp",
                "_type": "employee",
                "_id": "1",
                "_score": 1,
                "_source": {
                    "first_name": "John",
                    "last_name": "Smith",
                    "age": 25,
                    "about": "I love to go rock climbing",
                    "interests": [
                        "sports",
                        "music"
                    ]
                }
            },
            {
                "_index": "megacorp",
                "_type": "employee",
                "_id": "3",
                "_score": 1,
                "_source": {
                    "first_name": "Douglas",
                    "last_name": "Fir",
                    "age": 35,
                    "about": "I like to build cabinets",
                    "interests": [
                        "forestry"
                    ]
                }
            }
        ]
    }
}

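Documents can also be searched by a field value passed as a URL parameter. For example, to search for employees whose last name contains Smith: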

GET /megacorp/employee/_search?q=last_name:Smith

Response:

{
    "took": 18,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 0.2876821,
        "hits": [
            {
                "_index": "megacorp",
                "_type": "employee",
                "_id": "2",
                "_score": 0.2876821,
                "_source": {
                    "first_name": "Jane",
                    "last_name": "Smith",
                    "age": 32,
                    "about": "I like to collect rock albums",
                    "interests": [
                        "music"
                    ]
                }
            },
            {
                "_index": "megacorp",
                "_type": "employee",
                "_id": "1",
                "_score": 0.2876821,
                "_source": {
                    "first_name": "John",
                    "last_name": "Smith",
                    "age": 25,
                    "about": "I love to go rock climbing",
                    "interests": [
                        "sports",
                        "music"
                    ]
                }
            }
        ]
    }
}
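When the field name or value comes from user input, the q parameter should be URL-encoded rather than concatenated by hand. A minimal sketch using Python's standard library follows; `build_uri_search_path` is a hypothetical helper name. Note that the colon is percent-encoded as %3A in the output, which the server decodes back to `last_name:Smith`.

```python
from urllib.parse import urlencode


def build_uri_search_path(index, doc_type, field, value):
    """Build the path and query string for a URL-parameter search."""
    query_string = urlencode({"q": f"{field}:{value}"})
    return f"/{index}/{doc_type}/_search?{query_string}"


print(build_uri_search_path("megacorp", "employee", "last_name", "Smith"))
```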

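Query DSL

Elasticsearch provides a rich, flexible query language called the Query DSL (domain-specific language), which can build more complex and powerful queries. The DSL is expressed as a JSON request body.

The earlier Smith query can be expressed like this: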

GET /megacorp/employee/_search
{
    "query": {
        "match": {
            "last_name": "Smith"
        }
    }
}

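Let's make the search slightly more complex. We still want to find employees whose last name is Smith, but only those older than 30. The query adds a filter clause, which runs the structured part of the search efficiently because filter clauses do not contribute to the relevance score and can be cached: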

GET /megacorp/employee/_search
{
    "query": {
        "bool": {
            "filter": {
                "range": {
                    "age": {
                        "gt": 30
                    }
                }
            },
            "must": {
                "match": {
                    "last_name": "smith"
                }
            }
        }
    }
}

Response:

{
    "took": 42,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 0.2876821,
        "hits": [
            {
                "_index": "megacorp",
                "_type": "employee",
                "_id": "2",
                "_score": 0.2876821,
                "_source": {
                    "first_name": "Jane",
                    "last_name": "Smith",
                    "age": 32,
                    "about": "I like to collect rock albums",
                    "interests": [
                        "music"
                    ]
                }
            }
        ]
    }
}
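Because the DSL is plain JSON, a request body like the bool query above can be assembled from ordinary dictionaries. The sketch below shows one way to do that in Python; `build_filtered_match_query` is a hypothetical helper name, and the output is only the request body, not a call to Elasticsearch.

```python
import json


def build_filtered_match_query(field, text, range_field, greater_than):
    """Combine a scored match clause with a non-scoring range filter."""
    return {
        "query": {
            "bool": {
                "must": {"match": {field: text}},
                "filter": {"range": {range_field: {"gt": greater_than}}},
            }
        }
    }


body = build_filtered_match_query("last_name", "smith", "age", 30)
print(json.dumps(body, indent=4))
```

Keeping the age condition in filter rather than must means it narrows the result set without changing each hit's _score.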

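Full-Text Search

Let's try a more advanced search: full-text search, something traditional databases struggle to do well. We will search for all employees who enjoy rock climbing: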

GET /megacorp/employee/_search
{
    "query": {
        "match": {
            "about": "rock climbing"
        }
    }
}

Response:

{
    "took": 12,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 0.53484553,
        "hits": [
            {
                "_index": "megacorp",
                "_type": "employee",
                "_id": "1",
                "_score": 0.53484553,
                "_source": {
                    "first_name": "John",
                    "last_name": "Smith",
                    "age": 25,
                    "about": "I love to go rock climbing",
                    "interests": [
                        "sports",
                        "music"
                    ]
                }
            },
            {
                "_index": "megacorp",
                "_type": "employee",
                "_id": "2",
                "_score": 0.26742277,
                "_source": {
                    "first_name": "Jane",
                    "last_name": "Smith",
                    "age": 32,
                    "about": "I like to collect rock albums",
                    "interests": [
                        "music"
                    ]
                }
            }
        ]
    }
}

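By default, Elasticsearch sorts results by relevance score, that is, by how well each document matches the query. Unsurprisingly, the top-ranked result is John Smith, whose about field explicitly says rock climbing.

But why does Jane Smith also appear in the results? Because rock is mentioned in her about field. Since only rock is mentioned and climbing is not, her _score is lower than John's.

This example illustrates how Elasticsearch performs full-text search across text fields and returns the most relevant results first. The concept of relevance is central to Elasticsearch and is largely foreign to traditional relational databases, where a record either matches a query or does not.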
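To make the ranking concrete, the sketch below reads the names and scores out of a search response and orders them by _score. The `response` dictionary is an abridged copy of the full-text search response shown earlier in this section, keeping only the fields used here; `ranked_hits` is a hypothetical helper name.

```python
def ranked_hits(search_response):
    """Return (full name, score) pairs, highest score first."""
    hits = search_response["hits"]["hits"]
    ordered = sorted(hits, key=lambda hit: hit["_score"], reverse=True)
    return [
        (
            f'{hit["_source"]["first_name"]} {hit["_source"]["last_name"]}',
            hit["_score"],
        )
        for hit in ordered
    ]


# Abridged from the "rock climbing" match response in this article.
response = {
    "hits": {
        "total": 2,
        "max_score": 0.53484553,
        "hits": [
            {
                "_id": "1",
                "_score": 0.53484553,
                "_source": {"first_name": "John", "last_name": "Smith"},
            },
            {
                "_id": "2",
                "_score": 0.26742277,
                "_source": {"first_name": "Jane", "last_name": "Smith"},
            },
        ],
    }
}

for name, score in ranked_hits(response):
    print(name, score)
```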

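Phrase Search

The full-text search above matches documents that contain any of the analyzed query terms. To match only employee records whose about field contains the exact phrase rock climbing, with the words adjacent and in that order, simply change the match query to match_phrase: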

GET /megacorp/employee/_search
{
    "query": {
        "match_phrase": {
            "about": "rock climbing"
        }
    }
}

Response:

{
    "took": 24,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 0.53484553,
        "hits": [
            {
                "_index": "megacorp",
                "_type": "employee",
                "_id": "1",
                "_score": 0.53484553,
                "_source": {
                    "first_name": "John",
                    "last_name": "Smith",
                    "age": 25,
                    "about": "I love to go rock climbing",
                    "interests": [
                        "sports",
                        "music"
                    ]
                }
            }
        ]
    }
}
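The difference between the two query types can be illustrated with a toy model in plain Python. This is a deliberate simplification and an assumption of this sketch: it lowercases and splits on non-alphanumeric characters, which is not Elasticsearch's real analyzer, and it returns only true or false with no scoring. The function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (toy analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_match(text, query):
    """True if the text shares at least one term with the query."""
    terms = set(tokenize(text))
    return any(term in terms for term in tokenize(query))


def toy_match_phrase(text, query):
    """True if the query terms appear adjacent and in order in the text."""
    tokens, phrase = tokenize(text), tokenize(query)
    if not phrase:
        return False
    width = len(phrase)
    return any(
        tokens[i:i + width] == phrase
        for i in range(len(tokens) - width + 1)
    )


abouts = {
    "John": "I love to go rock climbing",
    "Jane": "I like to collect rock albums",
    "Douglas": "I like to build cabinets",
}
for name, about in abouts.items():
    print(name, toy_match(about, "rock climbing"), toy_match_phrase(about, "rock climbing"))
```

In this model Jane passes the term match because of rock but fails the phrase match, mirroring the two responses shown in the article.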

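Highlighting Search Results

Highlighting the matched keywords in each search result lets users see why a document matched the query. Highlighting fragments is easy in Elasticsearch. Let's add a highlight parameter to the previous query: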

GET /megacorp/employee/_search
{
    "query": {
        "match_phrase": {
            "about": "rock climbing"
        }
    },
    "highlight": {
        "fields": {
            "about": {}
        }
    }
}

Response:

{
    "took": 1305,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 0.53484553,
        "hits": [
            {
                "_index": "megacorp",
                "_type": "employee",
                "_id": "1",
                "_score": 0.53484553,
                "_source": {
                    "first_name": "John",
                    "last_name": "Smith",
                    "age": 25,
                    "about": "I love to go rock climbing",
                    "interests": [
                        "sports",
                        "music"
                    ]
                },
                "highlight": {
                    "about": [
                        "I love to go <em>rock</em> <em>climbing</em>"
                    ]
                }
            }
        ]
    }
}

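When we run this query it hits the same result as before, but the response contains a new section called highlight, which holds the text from the about field with the matched words wrapped in <em></em> tags.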
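On the client side, the highlighted fragment is just a string with <em> tags in it. The sketch below, using hypothetical helper names, pulls out the matched words and recovers the plain text with regular expressions. It assumes the default <em> tags shown in the response above.

```python
import re


def highlighted_terms(fragment):
    """Return the words wrapped in <em>...</em>, in order of appearance."""
    return re.findall(r"<em>(.*?)</em>", fragment)


def strip_highlight_tags(fragment):
    """Remove the <em> markers and return the plain text."""
    return re.sub(r"</?em>", "", fragment)


fragment = "I love to go <em>rock</em> <em>climbing</em>"
print(highlighted_terms(fragment))
print(strip_highlight_tags(fragment))
```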
