Elasticsearch queries: comparing three ways to fetch ids

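1. Use case

Beyond ordinary full-text search, Elasticsearch is used in many business scenarios: each business module sets query conditions according to its own needs, runs them through Elasticsearch, and gets back the ids of all matching records. When the number of matching records reaches the tens of thousands, query performance drops noticeably, especially when very large documents are hit.

There are currently three ways to get the record ids:

Request only the id with _source: ["id"].

Set _source: false and extract the device id from the _id metadata that Elasticsearch returns.

Store the device id separately with store=true and request it with stored_fields=['id'] at query time.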
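
As a rough sketch, the three approaches differ only in a couple of keys of the search request body. The helper name build_fetch_id_body and the mode labels are invented for illustration, the field name id follows this article, and the match_all query is only a placeholder for the real business query.

```python
def build_fetch_id_body(mode, size=40000):
    """Return a search request body for one of the three id-fetching modes.

    The mode labels are made up for this sketch:
      "source" - read id from a filtered _source
      "meta"   - disable _source and parse the id out of the _id metadata
      "stored" - disable _source and read id from stored_fields
    """
    body = {"query": {"match_all": {}}, "size": size}
    if mode == "source":
        body["_source"] = ["id"]
    elif mode == "meta":
        body["_source"] = False
    elif mode == "stored":
        # Only works if the id field was mapped with "store": true.
        body["_source"] = False
        body["stored_fields"] = ["id"]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return body


for mode in ("source", "meta", "stored"):
    print(mode, build_fetch_id_body(mode, size=10))
```

In all three cases the hit metadata (_index, _id, _score) is still returned; only the payload that has to be loaded per hit changes.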

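2. The store mapping parameter

By default, field values are indexed to make them searchable, but they are not stored. This means the field can be queried, but its original value cannot be retrieved.

Usually this does not matter. The field value is already part of the _source field, which is stored by default. If you only want to retrieve the value of a single field or a few fields, rather than the whole _source, you can do that with _source filtering.

In some situations it makes sense to store a field. For example, if you have a document with a title, a date, and a very large content field, you may want to retrieve just the title and the date without having to extract them from a large _source field.

Set the store parameter of the relevant fields to true and create the mapping: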

PUT my_store_test
{
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "store": true 
        },
        "date": {
          "type": "date",
          "store": true 
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}



{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "my_store_test"
}

Index a document with PUT:

PUT my_store_test/_doc/1
{
  "title":   "Some short title",
  "date":    "2015-01-01",
  "content": "A very long content field..."
}

{
  "_index" : "my_store_test",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

Set stored_fields in the query to choose which fields to return; the fields object in the Elasticsearch response then contains the corresponding values:

GET my_store_test/_search
{
  "stored_fields": [ "title", "date" ] 
}


{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_store_test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "fields" : {
          "date" : [
            "2015-01-01T00:00:00.000Z"
          ],
          "title" : [
            "Some short title"
          ]
        }
      }
    ]
  }
}
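
Stored field values come back as arrays even when a field holds a single value, as the response above shows. A minimal sketch of unwrapping them on the client side follows; unwrap_stored_fields is a hypothetical helper name, and the sample hit is a trimmed copy of the hit above.

```python
def unwrap_stored_fields(hit):
    """Map each stored field of a raw hit to a scalar when it holds one value."""
    fields = hit.get("fields", {})  # key is absent when no stored field was returned
    return {
        name: values[0] if len(values) == 1 else values
        for name, values in fields.items()
    }


# A hit shaped like the one in the response above.
sample_hit = {
    "_index": "my_store_test",
    "_id": "1",
    "fields": {
        "date": ["2015-01-01T00:00:00.000Z"],
        "title": ["Some short title"],
    },
}
print(unwrap_stored_fields(sample_hit))
```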

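3. Test setup

The test runs against the index my_store_index, which contains about 500,000 documents, some of them very large.

The test is driven by the function fetch_ids_query:

By default, the record id is read from the _source field returned by the Elasticsearch query.

take_from__id switches to parsing the record id out of the _id metadata returned by the query.

task_stored_fields switches to reading the record id from the fields object returned by the query.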

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q
import time


def fetch_ids_query(client, take_from__id=False, task_stored_fields=False):
    start = time.time()
    s = Search(using=client, index="my_store_index")
    s = s.params(http_auth=["test", "test"], request_timeout=50)
    # Exclude documents whose name matches the phrase prefix "us".
    q = Q('bool',
          must_not=[Q('match_phrase_prefix', name='us')]
          )
    s = s.query(q)

    # Request _source.id only when the id is not taken from _id or a stored field.
    s = s.source(False) if take_from__id else s.source(['id'])
    if task_stored_fields:
        s = s.extra(stored_fields=['id'])
        s = s.source(False)

    # from + size above 10000 needs index.max_result_window raised from its default.
    s = s[0:40000]
    response = s.execute()

    print(f'hit total {response.hits.total}')
    print(f'fetch total {len(response.hits.hits)}')

    ids = []
    if take_from__id:
        for hit in response.hits.hits:
            # Drop the 37-character prefix of _id; the rest is the record id.
            ids.append(hit['_id'][37:])
    elif task_stored_fields:
        for hit in response.hits.hits:
            # Stored field values are always returned as arrays.
            ids.append(hit.fields['id'][0])
    else:
        for hit in response.hits.hits:
            ids.append(hit._source['id'])

    end = time.time()
    print(f"all execute time {end - start}s")
    

client = Elasticsearch(hosts=['http://127.0.0.1:9200'], http_auth=["test", "test"])

print('fetch id from source')
fetch_ids_query(client)
print()
print('fetch id from _id and set source = false')
fetch_ids_query(client, take_from__id=True)
print()
print('fetch id from stored id and set source = false')
fetch_ids_query(client, task_stored_fields=True)
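
The slice hit['_id'][37:] in the function above implies that _id consists of a 37-character prefix followed by the record id. That layout is specific to this index and is not spelled out here, so the following is only a sketch under that assumption: record_id_from_meta and PREFIX_LEN are hypothetical names, and the example _id is made up. The guard turns a malformed _id into an error instead of a silently empty id.

```python
PREFIX_LEN = 37  # taken from the [37:] slice above; the prefix content is an assumption


def record_id_from_meta(meta_id, prefix_len=PREFIX_LEN):
    """Return the part of _id that follows the fixed-length prefix."""
    if len(meta_id) <= prefix_len:
        raise ValueError(f"_id too short to contain a record id: {meta_id!r}")
    return meta_id[prefix_len:]


# Made-up _id: 36 filler characters, one separator, then the record id.
example_meta_id = "x" * 36 + "_" + "device-42"
print(record_id_from_meta(example_meta_id))
```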

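4. Results

In the test, with 484,970 hits and 40,000 records fetched, the latter two approaches ran faster (about 11.3 s and 13.9 s, versus 28.7 s when reading from _source). Parsing the id out of the _id metadata is the friendlier of the two: it saves storage space and also avoids swings in memory and CPU usage at query time.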

fetch id from source
hit total 484970
fetch total 40000
all execute time 28.691869497299194s

fetch id from _id and set source = false
hit total 484970
fetch total 40000
all execute time 11.315539121627808s

fetch id from stored id and set source = false
hit total 484970
fetch total 40000
all execute time 13.930094957351685s
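
To put the three timings side by side, the short calculation below derives the speedup of each approach relative to reading from _source. The numbers are copied from the output above; they come from a single run, so the ratios (roughly 2.5 times and 2.1 times) are indicative only.

```python
# Wall-clock times in seconds, copied from the output above.
timings = {
    "_source": 28.691869497299194,
    "_id metadata": 11.315539121627808,
    "stored_fields": 13.930094957351685,
}

baseline = timings["_source"]
speedups = {name: baseline / seconds for name, seconds in timings.items()}

for name, seconds in timings.items():
    print(f"{name:>14}: {seconds:6.2f}s  {speedups[name]:.2f}x vs _source")
```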