Overview

The difference between the Bool Query and the Dis Max Query was fuzzy to me, so I looked into it properly.

Environment

Ubuntu 14.04
Elasticsearch 2.2.0

Loading the data

curl -s -XPOST localhost:9200/my_index/my_type/_bulk -d '
{"index": {"_id": "1"}}
{"title": "Quick brown rabbits", "body": "Brown rabbits are commonly seen."}
{"index": {"_id": "2"}}
{"title": "Keeping pets healthy", "body": "My quick brown fox eats rabbits on a regular basis."}
'

We run the query "Brown fox" against the data above.
Intuitively, document 2 should score higher, because its body field contains both "brown" and "fox".
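
To make that intuition concrete, the sketch below lists which query terms occur in which field of each document. The tokenizer (lowercase, split on non-letters) is only a rough stand-in for Elasticsearch's standard analyzer, and `matched_terms` is a hypothetical helper written for this post, not part of any Elasticsearch API.

```python
import re

# The two documents indexed above.
DOCS = {
    "1": {"title": "Quick brown rabbits",
          "body": "Brown rabbits are commonly seen."},
    "2": {"title": "Keeping pets healthy",
          "body": "My quick brown fox eats rabbits on a regular basis."},
}


def tokenize(text):
    # Rough approximation of the standard analyzer (an assumption):
    # lowercase the text and split it on anything that is not a letter.
    return set(re.findall(r"[a-z]+", text.lower()))


def matched_terms(doc, query):
    # Hypothetical helper: for each field, which query terms does it contain?
    terms = tokenize(query)
    return {field: sorted(terms & tokenize(value)) for field, value in doc.items()}


for doc_id, doc in DOCS.items():
    print(doc_id, matched_terms(doc, "Brown fox"))
```

Document 1 contains only "brown" (once in title, once in body), while document 2 contains both "brown" and "fox" in body and nothing in title.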

Bool Query

How it works

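The bool query computes its score as follows:

Run each clause
Add up the scores of the clauses that matched
Multiply by the number of matching clauses
Divide by the total number of clauses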
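
The four steps above can be sketched in a few lines of Python. This is a simplified model, not Elasticsearch's actual implementation: real scores also include normalization factors that are ignored here, `bool_should_score` is a hypothetical name, and the per-clause scores are made-up values chosen only to illustrate the arithmetic.

```python
def bool_should_score(clause_scores):
    """Combine per-clause scores following the four steps above.

    clause_scores has one entry per should clause;
    None means the clause did not match the document.
    """
    matched = [s for s in clause_scores if s is not None]
    if not matched:
        return 0.0
    total = sum(matched)  # add up the scores of the matching clauses
    # multiply by the number of matching clauses, divide by the total number
    return total * len(matched) / len(clause_scores)


# Hypothetical per-clause scores (NOT taken from Elasticsearch):
# document 1 matches title and body weakly, document 2 matches only body, strongly.
doc1 = bool_should_score([0.25, 0.25])   # (0.25 + 0.25) * 2 / 2
doc2 = bool_should_score([None, 0.375])  # 0.375 * 1 / 2
print(doc1, doc2)
```

With these assumed inputs, document 1 ends up ahead even though document 2 has the strongest single clause, because document 2's score is halved for matching only one of the two clauses.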

Query

curl 'localhost:9200/my_index/my_type/_search?pretty' -d '
{
    "query": {
        "bool": {
            "should": [
                { "match": { "title": "Brown fox" } },
                { "match": { "body": "Brown fox" } }
            ]
        }
    },
    "size": 5,
    "from": 0
}
'

Incidentally, the query above is equivalent to the following multi_match query with type most_fields.

curl 'localhost:9200/my_index/my_type/_search?pretty' -d '
{
  "query": {
    "multi_match": {
      "query": "Brown fox",
      "type": "most_fields",
      "fields": ["title", "body"]
    }
  },
  "size": 5,
  "from": 0
}
'

Result

Document 1 scores higher.

{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.029836398,
    "hits" : [ {
      "_index" : "my_index",
      "_type" : "my_type",
      "_id" : "1",
      "_score" : 0.029836398,
      "_source" : {
        "title" : "Quick brown rabbits",
        "body" : "Brown rabbits are commonly seen."
      }
    }, {
      "_index" : "my_index",
      "_type" : "my_type",
      "_id" : "2",
      "_score" : 0.01989093,
      "_source" : {
        "title" : "Keeping pets healthy",
        "body" : "My quick brown fox eats rabbits on a regular basis."
      }
    } ]
  }
}

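Document 1 matches both clauses, so the scores of both are added together. Document 2 has the higher score for its single match, but it matches only one of the two clauses, so the division lowers its score. As a result, document 1 ends up with the higher score.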

Dis Max Query

How it works

Run each clause
Return the score of the best-matching clause
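
As a minimal sketch of these two steps, the function below takes the maximum of the per-clause scores. It assumes the optional tie_breaker parameter is left at its default of 0, so non-best clauses contribute nothing. `dis_max_score` is a hypothetical name and the input scores are made-up values, not output from Elasticsearch.

```python
def dis_max_score(clause_scores):
    """Return the score of the best-matching clause.

    clause_scores has one entry per query in "queries";
    None means that query did not match the document.
    """
    matched = [s for s in clause_scores if s is not None]
    return max(matched) if matched else 0.0


# Hypothetical per-clause scores (NOT taken from Elasticsearch):
# document 1 matches title and body weakly, document 2 matches only body, strongly.
doc1 = dis_max_score([0.25, 0.25])   # best clause: 0.25
doc2 = dis_max_score([None, 0.375])  # best clause: 0.375
print(doc1, doc2)
```

With these assumed inputs, document 2 ranks first: only its strongest clause counts, and it is not penalized for the clause that did not match.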

Query

curl 'localhost:9200/my_index/my_type/_search?pretty' -d '
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Brown fox" } },
                { "match": { "body": "Brown fox" } }
            ]
        }
    },
    "size": 5,
    "from": 0
}
'

Incidentally, the query above is equivalent to the following multi_match query with type best_fields.

curl 'localhost:9200/my_index/my_type/_search?pretty' -d '
{
  "query": {
    "multi_match": {
      "query": "Brown fox",
      "type": "best_fields",
      "fields": ["title", "body"]
    }
  },
  "size": 5,
  "from": 0
}
'

Result

Document 2 scores higher.

{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.04161264,
    "hits" : [ {
      "_index" : "my_index",
      "_type" : "my_type",
      "_id" : "2",
      "_score" : 0.04161264,
      "_source" : {
        "title" : "Keeping pets healthy",
        "body" : "My quick brown fox eats rabbits on a regular basis."
      }
    }, {
      "_index" : "my_index",
      "_type" : "my_type",
      "_id" : "1",
      "_score" : 0.02250402,
      "_source" : {
        "title" : "Quick brown rabbits",
        "body" : "Brown rabbits are commonly seen."
      }
    } ]
  }
}

Unlike bool, dis_max does not add or divide the per-clause scores. It returns the score of the single best-matching clause, so document 2 came out on top, as expected.

