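1. Introduction to intelligent recommendation with Elasticsearch

As a popular search engine, Elasticsearch finds the documents that match the keywords a user types in, so that the user can reach the information they want. A recommender system works in a similar way: it starts from a data record that represents a user or an item, then finds the records closest to it and recommends them to the user.

The more like this query finds documents that are similar to a given document. It first selects some keywords that represent the input document, then builds a query from those keywords, and finally searches the index for similar documents.

The more like this query provided by Elasticsearch is therefore a simple recommender implementation based on document similarity, built on Elasticsearch's underlying inverted index and its document relevance scoring.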

2. Preparing the data

Elasticsearch version: 6.8
Index: books

PUT books
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "ik_smart"
        }
      }
    }
  }
}

Book data model

{
    "bookId":"23303789",
    "title":"罪与罚",
    "author":"陀思妥耶夫斯基",
    "version":231868826,
    "format":"epub",
    "type":0,
    "price":21,
    "originalPrice":0,
    "soldout":0,
    "bookStatus":1,
    "payType":4097,
    "intro":"",
    "centPrice":2100,
    "finished":1,
    "maxFreeChapter":9,
    "free":0,
    "mcardDiscount":0,
    "ispub":1,
    "cpid":2571052,
    "publishTime":"2016-10-12 00:00:00",
    "category":"精品小说-世界名著",
    "hasLecture":1,
    "lastChapterIdx":47,
    "paperBook":{
        "skuId":"12075198"
    },
    "newRating":917,
    "newRatingCount":4017,
    "newRatingDetail":{
        "good":3685,
        "fair":280,
        "poor":52,
        "recent":362,
        "title":"神作"
    },
    "finishReading":0
}

Index the data into Elasticsearch

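import requests

# Browser-like User-Agent sent to the book-list endpoint.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
}
PAGE_SIZE = 20  # a full page of the book list holds 20 books

max_index = 0
while max_index < 500:
    # Fetch one page of the book list, starting at offset max_index.
    response = requests.get(
        url=f'https://127.0.0.1/web/bookListInCategory/all?maxIndex={max_index}',
        headers=headers,
    )
    response.raise_for_status()
    books = response.json()['books']

    for book in books:
        info = book['bookInfo']
        book_id = info['bookId']
        # Use bookId as the document id, so re-running the script overwrites instead of duplicating.
        r = requests.post(url=f'http://127.0.0.1:9200/books/_doc/{book_id}', json=info)
        r.raise_for_status()

    # Anything other than a full page means there is nothing left to fetch.
    if len(books) != PAGE_SIZE:
        break
    max_index += len(books)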
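
The script above sends one HTTP request per book. As a rough sketch of an alternative, the documents of one page could be serialized into a single newline-delimited body for the Elasticsearch _bulk API. The helper name build_bulk_body and the two sample records are made up for illustration and are not part of the original script.

```python
import json


def build_bulk_body(books):
    """Serialize book dicts into an NDJSON body for the _bulk API."""
    lines = []
    for info in books:
        # Action line: index the document under its own bookId.
        lines.append(json.dumps({"index": {"_id": info["bookId"]}}))
        # Source line: the document itself; keep Chinese text readable.
        lines.append(json.dumps(info, ensure_ascii=False))
    # The _bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical sample records, trimmed to two fields.
sample = [
    {"bookId": "23303789", "title": "罪与罚"},
    {"bookId": "42752779", "title": "入世20年"},
]
body = build_bulk_body(sample)
print(body)
```

Under these assumptions the body would be sent with something like requests.post('http://127.0.0.1:9200/books/_doc/_bulk', data=body.encode('utf-8'), headers={'Content-Type': 'application/x-ndjson'}), which replaces twenty round trips per page with one.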

3. Book recommendation with more like this

We can pass in a long piece of text and use the books' intro field to find similar books:

GET books/_search
{
    "_source": ["bookId","title","author","intro","category","publishTime"], 
    "query": {
        "more_like_this" : {
            "fields" : ["intro"],
            "like" : "入世20年,世界给中国带来了什么?中国给世界带去了什么?从一开始的“狼来了”,忧虑中国的工业内环境会受到致命冲击,到在一个充分竞争的开放市场,中国在全球化中获益良多。当美元的镰刀划过世界的血管,却没能造就一个更强大的美利坚。中国正在逐步融入全球市场,重塑外贸版图,如今已是全球第二大经济体,并且在更多的方面展现出了领导者的姿态。对于领航者而言,前方只有无人区。过去,跟随、复制、拿来主义的追赶模式正在崩坏,创新增量的时代正在到来。中国经济的未来20年,指向何方?"
        }
        
    },
    "size": 3
    
}

We can see that Elasticsearch returns the three best-matching books; the first one is the book whose intro we passed in as the like text:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 320,
    "max_score" : 25.865322,
    "hits" : [
      {
        "_index" : "books",
        "_type" : "_doc",
        "_id" : "42752779",
        "_score" : 25.865322,
        "_source" : {
          "publishTime" : "2021-12-01 00:00:00",
          "author" : "《商界》杂志社",
          "intro" : "",
          "title" : "入世20年:中国经济进入“无人区”(《商界》2021年第12期)",
          "category" : "期刊专栏-财经",
          "bookId" : "42752779"
        }
      },
      {
        "_index" : "books",
        "_type" : "_doc",
        "_id" : "33810600",
        "_score" : 9.83586,
        "_source" : {
          "publishTime" : "2020-05-01 00:00:00",
          "author" : "史蒂芬·柯维",
          "intro" : "",
          "title" : "高效能人士的七个习惯(30周年纪念版)(全新增订版)",
          "category" : "个人成长-认知思维",
          "bookId" : "33810600"
        }
      },
      {
        "_index" : "books",
        "_type" : "_doc",
        "_id" : "42824214",
        "_score" : 8.847967,
        "_source" : {
          "publishTime" : "2021-02-01 00:00:00",
          "author" : "傅莹",
          "intro" : "",
          "title" : "看世界2:百年变局下的挑战和抉择",
          "category" : "",
          "bookId" : "42824214"
        }
      }
    ]
  }
}

We can also reference a specific book directly in like, here [入世20年:中国经济进入“无人区”(《商界》2021年第12期)], to find books similar to it:

GET books/_search
{
    "_source": ["bookId","title","author","intro","category","publishTime"], 
    "query":{
        "more_like_this":{
            "fields":[
                "intro"
            ],
            "like":[
                {
                    "_index":"books",
                    "_id":"42752779"
                }
            ]
        }
    },
    "size":3
}

We can see that Elasticsearch has automatically excluded the input document 42752779 from the results:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 319,
    "max_score" : 9.83586,
    "hits" : [
      {
        "_index" : "books",
        "_type" : "_doc",
        "_id" : "33810600",
        "_score" : 9.83586,
        "_source" : {
          "publishTime" : "2020-05-01 00:00:00",
          "author" : "史蒂芬·柯维",
          "intro" : "",
          "title" : "高效能人士的七个习惯(30周年纪念版)(全新增订版)",
          "category" : "个人成长-认知思维",
          "bookId" : "33810600"
        }
      },
      {
        "_index" : "books",
        "_type" : "_doc",
        "_id" : "42824214",
        "_score" : 8.847967,
        "_source" : {
          "publishTime" : "2021-02-01 00:00:00",
          "author" : "傅莹",
          "intro" : "",
          "title" : "看世界2:百年变局下的挑战和抉择",
          "category" : "",
          "bookId" : "42824214"
        }
      },
      {
        "_index" : "books",
        "_type" : "_doc",
        "_id" : "42867305",
        "_score" : 8.118775,
        "_source" : {
          "publishTime" : "2022-01-01 00:00:00",
          "author" : "林毅夫",
          "intro" : "",
          "title" : "中国经济的前景",
          "category" : "经济理财-财经",
          "bookId" : "42867305"
        }
      }
    ]
  }
}
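
The two request forms shown in this section differ only in the value of like. As an illustrative sketch, both could be produced by one small Python helper; the name build_mlt_body and its parameters are assumptions made for this example, not an Elasticsearch API.

```python
def build_mlt_body(like_text=None, like_doc_id=None, fields=("intro",), size=3):
    """Build a more_like_this search body from free text or from a document id."""
    if (like_text is None) == (like_doc_id is None):
        raise ValueError("pass exactly one of like_text or like_doc_id")
    if like_text is not None:
        # Free text: Elasticsearch analyzes the string itself.
        like = like_text
    else:
        # Existing document: Elasticsearch loads its fields from the books index.
        like = [{"_index": "books", "_id": like_doc_id}]
    return {
        "_source": ["bookId", "title", "author", "intro", "category", "publishTime"],
        "query": {"more_like_this": {"fields": list(fields), "like": like}},
        "size": size,
    }


by_text = build_mlt_body(like_text="中国经济的未来20年,指向何方?")
by_doc = build_mlt_body(like_doc_id="42752779")
print(by_doc["query"]["more_like_this"]["like"])
```

Either body could then be posted to http://127.0.0.1:9200/books/_search, for example with requests.post(url, json=by_doc).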

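4. How more like this works

The whole point of a recommender system is to compute which items are most similar and recommend them to the user. Elasticsearch's more like this query builds on exactly this simple idea: it uses the underlying inverted index and its tf-idf based relevance model to compute how similar two documents are; the higher the relevance score, the more similar the documents.

At query time, the more like this query first tokenizes the input string, or the relevant fields of the input document, with the analyzer of the specified fields. Based on its configuration it then selects the top n terms that best characterize the document, and finally combines those terms into a query to find similar documents.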
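
To make the steps above concrete, here is a deliberately simplified sketch of tf-idf based term selection followed by query construction. It is a toy model and not Lucene's actual implementation: the smoothed idf formula, the function names select_terms and to_should_query, and the tiny English corpus are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def select_terms(doc_tokens, corpus, top_n=3):
    """Rank the tokens of one document by tf-idf and return the top_n terms."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, freq in tf.items():
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for tokens in corpus if term in tokens)
        # Smoothed idf (an assumption of this sketch), always greater than 0.
        idf = math.log((n_docs + 1) / (df + 1)) + 1
        scores[term] = freq * idf
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_n]


def to_should_query(field, terms):
    """Combine the selected terms into a bool query of optional term clauses."""
    return {"bool": {"should": [{"term": {field: t}} for t in terms]}}


doc = ["china", "world", "china", "global", "market", "world", "china"]
corpus = [
    doc,
    ["china", "economy", "world", "global"],
    ["china", "city", "economy"],
    ["habit", "people", "world"],
    ["novel", "crime", "punishment"],
]
terms = select_terms(doc, corpus, top_n=3)
print(terms)
print(to_should_query("intro", terms))
```

In this toy corpus "market" outranks "global" even though both occur once in the input, because "market" appears in fewer documents and so carries a higher idf.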

We can run the following query, with profiling enabled, to see how Elasticsearch does this:

GET books/_search
{
    "profile": "true", 
    "_source": ["bookId","title","author","intro","category","publishTime"], 
    "query":{
        "more_like_this":{
            "fields":[
                "intro"
            ],
            "like":[
                {
                    "_index":"books",
                    "_id":"42752779"
                }
            ]
        }
    },
    "size":3
}

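In the profile output we can see the terms that the more like this query finally used, as well as the query executed on each shard to find similar documents: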

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "profile" : {
    "shards" : [
      {
        "id" : "[OBkTpZcTQJ25kmlNZ6xyLg][books][0]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "BooleanQuery",
                "description" : "((intro:一个 intro:什么 intro:世界 intro:正在 intro:中国)~1) -ConstantScore(_id:[fe 42 75 27 79])",
                "time_in_nanos" : 524949
              }
            ]
          }
        ]
      },
      {
        "id" : "[OBkTpZcTQJ25kmlNZ6xyLg][books][1]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "BooleanQuery",
                "description" : "((intro:一个 intro:什么 intro:世界 intro:全球 intro:中国)~1) -ConstantScore(_id:[fe 42 75 27 79])",
                "time_in_nanos" : 444670
              }
            ]
          }
        ]
      },
      {
        "id" : "[OBkTpZcTQJ25kmlNZ6xyLg][books][2]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "BooleanQuery",
                "description" : "((intro:一个 intro:什么 intro:全球 intro:世界 intro:带 intro:中国)~1) -ConstantScore(_id:[fe 42 75 27 79])",
                "time_in_nanos" : 454063                
              }
            ]
          }
        ]
      },
      {
        "id" : "[OBkTpZcTQJ25kmlNZ6xyLg][books][3]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "BooleanQuery",
                "description" : "((intro:一个 intro:什么 intro:全球 intro:带 intro:世界 intro:正在 intro:中国)~2) -ConstantScore(_id:[fe 42 75 27 79])",
                "time_in_nanos" : 430971
              }
            ]
          }
        ]
      },
      {
        "id" : "[OBkTpZcTQJ25kmlNZ6xyLg][books][4]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "BooleanQuery",
                "description" : "((intro:一个 intro:什么 intro:世界 intro:全球 intro:中国)~1) -ConstantScore(_id:[fe 42 75 27 79])",
                "time_in_nanos" : 745984
              }
            ]
          }
        ]
      }
    ]
  }
}

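From the output above we can see that, for performance reasons, Elasticsearch selects rather few keywords by default. Representing a document with so few keywords is coarse-grained and loses a lot of information, so the recommendations are not ideal.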
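
Reading the selected terms out of the profile output by eye gets tedious. The following sketch pulls them out of each shard's description string with a regular expression. It assumes the description keeps the textual shape shown above, which is Lucene's debug rendering of the query rather than a stable API; the function name terms_per_shard is hypothetical.

```python
import re


def terms_per_shard(response, field="intro"):
    """Map each shard id to the terms found in its profiled query descriptions."""
    # Matches "intro:<term>" up to the next whitespace or closing parenthesis.
    pattern = re.compile(re.escape(field) + r":([^\s)]+)")
    result = {}
    for shard in response["profile"]["shards"]:
        terms = []
        for search in shard["searches"]:
            for query in search["query"]:
                terms.extend(pattern.findall(query["description"]))
        result[shard["id"]] = terms
    return result


# One shard copied from the profile output above.
sample = {
    "profile": {"shards": [{
        "id": "[OBkTpZcTQJ25kmlNZ6xyLg][books][0]",
        "searches": [{"query": [{
            "type": "BooleanQuery",
            "description": "((intro:一个 intro:什么 intro:世界 intro:正在 intro:中国)~1) -ConstantScore(_id:[fe 42 75 27 79])",
        }]}],
    }]}
}
for shard_id, terms in terms_per_shard(sample).items():
    print(shard_id, terms)
```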

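5. Tuning recommendations with more like this parameters

As the analysis in section 4 shows, Elasticsearch currently extracts only a few keywords from the input document. Elasticsearch provides the following parameters to control which of the terms extracted from the input document take part in the query: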

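min_term_freq: the minimum term frequency a term must have in the input document to take part in the query; defaults to 2.

min_doc_freq: the minimum document frequency a term must have to take part in the query; defaults to 5.

max_doc_freq: the maximum document frequency a term may have to take part in the query; unbounded by default.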
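
The effect of these three thresholds can be sketched as a simple filter over candidate terms. This is an illustration of the documented semantics, not Elasticsearch's code; the function name filter_terms and the document-frequency numbers in the example are invented.

```python
from collections import Counter


def filter_terms(doc_tokens, doc_freqs, min_term_freq=2, min_doc_freq=5, max_doc_freq=None):
    """Keep the terms of one document that pass the three frequency thresholds."""
    kept = []
    for term, tf in Counter(doc_tokens).items():
        df = doc_freqs.get(term, 0)
        if tf < min_term_freq:
            continue  # occurs too rarely in the input document
        if df < min_doc_freq:
            continue  # occurs in too few indexed documents
        if max_doc_freq is not None and df > max_doc_freq:
            continue  # occurs in too many indexed documents
        kept.append(term)
    return kept


# Hypothetical tokens of one intro and made-up document frequencies.
tokens = ["中国", "中国", "世界", "世界", "时代", "未来"]
doc_freqs = {"中国": 120, "世界": 25, "时代": 12, "未来": 3}

print(filter_terms(tokens, doc_freqs))
print(filter_terms(tokens, doc_freqs, min_term_freq=1, max_doc_freq=30))
```

With the defaults only terms that occur at least twice in the input survive; lowering min_term_freq to 1 lets "时代" through, while max_doc_freq=30 drops the very common "中国" and min_doc_freq still drops the rare "未来".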

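Because the text in our intro field is fairly short, we set min_term_freq=1 so that more keywords take part in the query, and max_doc_freq=30 to exclude terms that are too common to be meaningful: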

GET books/_search
{
    "profile": "true", 
    "_source": ["bookId","title","author","intro","category","publishTime"], 
    "query":{
        "more_like_this":{
            "fields":[
                "intro"
            ],
            "like":[
                {
                    "_index":"books",
                    "_id":"42752779"
                }
            ],
            "min_term_freq": 1, 
            "max_doc_freq": 30
        }
    },
    "size":3
}

Looking at the Elasticsearch response, we can see that both the terms used by the query and the resulting hits have improved considerably:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 31,
    "max_score" : 19.660702,
    "hits" : [
      {
        "_index" : "books",
        "_type" : "_doc",
        "_id" : "42867305",
        "_score" : 19.660702,
        "_source" : {
          "publishTime" : "2022-01-01 00:00:00",
          "author" : "林毅夫",
          "intro" : "",
          "title" : "中国经济的前景",
          "category" : "经济理财-财经",
          "bookId" : "42867305"
        }
      },
      {
        "_index" : "books",
        "_type" : "_doc",
        "_id" : "33396343",
        "_score" : 14.927922,
        "_source" : {
          "publishTime" : "2020-01-01 00:00:00",
          "author" : "黄汉城 史哲 林小琬",
          "intro" : "",
          "title" : "中国城市大洗牌",
          "category" : "经济理财-财经",
          "bookId" : "33396343"
        }
      },
      {
        "_index" : "books",
        "_type" : "_doc",
        "_id" : "31231802",
        "_score" : 14.390678,
        "_source" : {
          "publishTime" : "2018-02-01 00:00:00",
          "author" : "李光耀",
          "intro" : "",
          "title" : "李光耀观天下(精装版)",
          "category" : "",
          "bookId" : "31231802"
        }
      }
    ]
  },
  "profile" : {
    "shards" : [
      {
        "id" : "[OBkTpZcTQJ25kmlNZ6xyLg][books][0]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "BooleanQuery",
                "description" : "((intro:时代 intro:开始 intro:更 intro:未来 intro:会 intro:展现 intro:出了 intro:什么 intro:世界 intro:正在 intro:中国)~3) -ConstantScore(_id:[fe 42 75 27 79])",
                "time_in_nanos" : 536546
              }
            ]
          }
        ]
      },
      {
        "id" : "[OBkTpZcTQJ25kmlNZ6xyLg][books][1]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "BooleanQuery",
                "description" : "((intro:展现 intro:更 intro:未来 intro:会 intro:出了 intro:开始 intro:只有 intro:时代 intro:来了 intro:方面 intro:什么 intro:世界 intro:全球 intro:中国)~4) -ConstantScore(_id:[fe 42 75 27 79])",
                "time_in_nanos" : 713527
              }
            ]
          }
        ]
      },
      {
        "id" : "[OBkTpZcTQJ25kmlNZ6xyLg][books][2]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "BooleanQuery",
                "description" : "((intro:开始 intro:会 intro:更 intro:只有 intro:对于 intro:来了 intro:未来 intro:时代 intro:过去 intro:出了 intro:模式 intro:什么 intro:全球 intro:世界 intro:带 intro:中国)~4) -ConstantScore(_id:[fe 42 75 27 79])",
                "time_in_nanos" : 554927
              }
            ]
          }
        ]
      },
      {
        "id" : "[OBkTpZcTQJ25kmlNZ6xyLg][books][3]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "BooleanQuery",
                "description" : "((intro:时代 intro:会 intro:更 intro:开始 intro:对于 intro:出了 intro:过去 intro:来了 intro:展现 intro:什么 intro:全球 intro:带 intro:世界 intro:正在 intro:中国)~4) -ConstantScore(_id:[fe 42 75 27 79])",
                "time_in_nanos" : 678723
              }
            ]
          }
        ]
      },
      {
        "id" : "[OBkTpZcTQJ25kmlNZ6xyLg][books][4]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "BooleanQuery",
                "description" : "((intro:更 intro:会 intro:时代 intro:出了 intro:开始 intro:只有 intro:展现 intro:未来 intro:强大 intro:方面 intro:过去 intro:来了 intro:对于 intro:什么 intro:世界 intro:全球 intro:中国)~5) -ConstantScore(_id:[fe 42 75 27 79])",
                "time_in_nanos" : 877326
              }
            ]
          }
        ]
      }
    ]
  }
}

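6. Improving accuracy with business-specific knowledge

Among a book's fields, the category is clearly an important dimension. Let's modify the query to boost records that share the same category: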

GET books/_search
{
    "profile":"true",
    "_source":[
        "bookId",
        "title",
        "author",
        "intro",
        "category",
        "publishTime"
    ],
    "query":{
        "bool":{
            "must":[
                {
                    "more_like_this":{
                        "fields":[
                            "intro"
                        ],
                        "like":[
                            {
                                "_index":"books",
                                "_id":"42752779"
                            }
                        ],
                        "min_term_freq":1,
                        "max_doc_freq":30
                    }
                }
            ],
            "should": [
              {
                "match": {
                  "category": {
                    "query": "财经",
                    "boost": 3
                  }
                }
              }
            ],
            "minimum_should_match": 0
        }
    },
    "size":3
}

In the Elasticsearch response, all of the hits are now finance books (category 财经):

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 31,
    "max_score" : 30.677063,
    "hits" : [
      {
        "_index" : "books",
        "_type" : "_doc",
        "_id" : "42867305",
        "_score" : 30.677063,
        "_source" : {
          "publishTime" : "2022-01-01 00:00:00",
          "author" : "林毅夫",
          "intro" : "",
          "title" : "中国经济的前景",
          "category" : "经济理财-财经",
          "bookId" : "42867305"
        }
      },
      {
        "_index" : "books",
        "_type" : "_doc",
        "_id" : "33396343",
        "_score" : 24.42368,
        "_source" : {
          "publishTime" : "2020-01-01 00:00:00",
          "author" : "黄汉城 史哲 林小琬",
          "intro" : "",
          "title" : "中国城市大洗牌",
          "category" : "经济理财-财经",
          "bookId" : "33396343"
        }
      },
      {
        "_index" : "books",
        "_type" : "_doc",
        "_id" : "32440996",
        "_score" : 17.54503,
        "_source" : {
          "publishTime" : "2019-06-12 00:00:00",
          "author" : "肖星",
          "intro" : "",
          "title" : "一本书读懂财报(全新修订版)",
          "category" : "经济理财-财经",
          "bookId" : "32440996"
        }
      }
    ]
  }
}
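
In practice the category would come from the source book rather than being hard-coded. As a rough sketch, the query from this section could be parameterized like this; the helper name build_boosted_mlt_body and its arguments are assumptions for illustration only.

```python
import json


def build_boosted_mlt_body(doc_id, category, boost=3, size=3):
    """Combine more_like_this (required) with an optional, boosted category match."""
    return {
        "query": {
            "bool": {
                # Candidates must be similar to the source book's intro.
                "must": [{
                    "more_like_this": {
                        "fields": ["intro"],
                        "like": [{"_index": "books", "_id": doc_id}],
                        "min_term_freq": 1,
                        "max_doc_freq": 30,
                    }
                }],
                # Matching the category is optional but raises the score.
                "should": [{
                    "match": {"category": {"query": category, "boost": boost}}
                }],
            }
        },
        "size": size,
    }


body = build_boosted_mlt_body("42752779", "财经")
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The sketch leaves out minimum_should_match because a bool query that has a must clause already treats its should clauses as optional; they only contribute to the score.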

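Note: because of this platform's content review policy, the values of the books' intro field have been blanked out; for the full content, please go to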