2-6-06. Ignoring TF/IDF


drscg 2017. 9. 24. 19:24

Sometimes we just don’t care about TF/IDF. All we want to know is whether a certain word appears in a field. Perhaps we are searching for a vacation home and want to find houses that have as many of these features as possible:

WiFi
Garden
Pool

The vacation home documents look something like this:

{ "description": "A delightful four-bedroomed house with ... " }

We could use a simple match query:

GET /_search
{
  "query": {
    "match": {
      "description": "wifi garden pool"
    }
  }
}

However, this isn’t really full-text search. In this case, TF/IDF just gets in the way. We don’t care whether wifi is a common term, or how often it appears in the document. All we care about is that it does appear. In fact, we just want to rank houses by the number of features they have: the more, the better. If a feature is present, it should score 1, and if it isn’t, 0.
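The scoring we are after can be sketched in a few lines of Python. This only illustrates the target behaviour, not how Elasticsearch computes scores; the function name feature_count_score, the sample descriptions, and the naive whitespace tokenization are all assumptions made for the example.

```python
def feature_count_score(description, wanted_features):
    """Score 1 for every wanted feature present in the text, 0 otherwise."""
    # Naive tokenization (assumption): split on whitespace, strip common
    # punctuation, and lowercase. A set drops repeated words, so how often
    # a feature is mentioned makes no difference.
    tokens = {word.strip(".,;:!?").lower() for word in description.split()}
    return sum(1 for feature in wanted_features if feature.lower() in tokens)


# Hypothetical sample documents.
homes = {
    "a": "Garden, pool and free wifi. The wifi covers the garden too.",
    "b": "Quiet cottage with a large garden.",
}
wanted = ["wifi", "garden", "pool"]
scores = {name: feature_count_score(text, wanted) for name, text in homes.items()}
print(scores)
```

Home "a" mentions wifi and garden twice, yet each feature still counts only once, which is exactly the behaviour TF/IDF would not give us.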

constant_score Query

Enter the constant_score query. This query can wrap either a query or a filter, and assigns a score of 1 to any document that matches, regardless of TF/IDF:

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "constant_score": {
          "query": { "match": { "description": "wifi" }}
        }},
        { "constant_score": {
          "query": { "match": { "description": "garden" }}
        }},
        { "constant_score": {
          "query": { "match": { "description": "pool" }}
        }}
      ]
    }
  }
}

Perhaps not all features are equally important: some have more value to the user than others. If the most important feature is the pool, we could boost that clause to make it count for more:

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "constant_score": {
          "query": { "match": { "description": "wifi" }}
        }},
        { "constant_score": {
          "query": { "match": { "description": "garden" }}
        }},
        { "constant_score": {
          "boost":   2,
          "query": { "match": { "description": "pool" }}
        }}
      ]
    }
  }
}

A matching pool clause adds a score of 2, while each of the other clauses adds a score of only 1.
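Writing one constant_score clause per feature by hand gets repetitive, so the request body can also be generated. The following is a minimal sketch that only builds the JSON body shown above. The helper name build_feature_query and its parameters are hypothetical, and sending the body to a cluster is left out.

```python
import json


def build_feature_query(feature_boosts, field="description"):
    """Build a bool/should body with one constant_score clause per feature."""
    should = []
    for feature, boost in feature_boosts.items():
        clause = {"query": {"match": {field: feature}}}
        # Only emit "boost" when it differs from the default of 1.
        if boost != 1:
            clause["boost"] = boost
        should.append({"constant_score": clause})
    return {"query": {"bool": {"should": should}}}


# Same features and weights as the request above.
body = build_feature_query({"wifi": 1, "garden": 1, "pool": 2})
print(json.dumps(body, indent=2))
```

Keeping the weights in a single mapping makes it easy to change which feature counts for more without touching the query structure.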

The final score for each result is not simply the sum of the scores of all matching clauses. The coordination factor and the query normalization factor are still taken into account.
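As a rough illustration of why the sum alone is not the final score, the sketch below applies only the coordination factor, taken here as the fraction of should clauses that matched. Query normalization is deliberately left out (it is the same for every document within a single query), so the numbers will not equal real _score values. The function name coord_scaled_score is hypothetical.

```python
def coord_scaled_score(matched_boosts, total_clauses):
    """Sum the matched clause scores, then scale by matched / total clauses."""
    # Simplification (assumption): each constant_score clause contributes
    # exactly its boost, and query normalization is ignored.
    coord = len(matched_boosts) / total_clauses
    return sum(matched_boosts) * coord


# A home with pool (boost 2) and wifi (boost 1), out of three clauses.
pool_and_wifi = coord_scaled_score([2, 1], total_clauses=3)
# A home with all three features.
all_three = coord_scaled_score([1, 1, 2], total_clauses=3)
print(pool_and_wifi, all_three)
```

The home matching every clause is rewarded twice: once through the larger sum and again through the larger coordination factor.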

We could improve our vacation home documents by adding a not_analyzed features field:

{ "features": [ "wifi", "pool", "garden" ] }

By default, a not_analyzed field has field-length norms disabled and has index_options set to docs, which disables term frequencies. But the problem remains: the inverse document frequency of each term is still taken into account.
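To see why IDF alone can still skew the ranking, the sketch below uses the classic Lucene formula idf(t) = 1 + ln(numDocs / (docFreq + 1)), which to my understanding is what the default similarity applies in the Elasticsearch versions this guide covers. The document counts are invented for illustration and the function name classic_idf is an assumption.

```python
import math


def classic_idf(num_docs, doc_freq):
    """Classic Lucene inverse document frequency for a single term."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


num_docs = 1000  # made-up index size
idf_wifi = classic_idf(num_docs, doc_freq=900)  # almost every home has wifi
idf_pool = classic_idf(num_docs, doc_freq=50)   # pools are rare
print(round(idf_wifi, 3), round(idf_pool, 3))
```

With a plain match query on the features field, the rarer term would weigh more than the common one, even though we want every feature to count the same unless we boost it ourselves.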

We could use the same approach as before, with the constant_score query:

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "constant_score": {
          "query": { "match": { "features": "wifi" }}
        }},
        { "constant_score": {
          "query": { "match": { "features": "garden" }}
        }},
        { "constant_score": {
          "boost":   2,
          "query": { "match": { "features": "pool" }}
        }}
      ]
    }
  }
}

Really, though, each of these features should be treated like a filter. A vacation home either has the feature or it doesn’t, so a filter is a natural fit. On top of that, if we use filters, we can benefit from filter caching.

The problem is this: filters don’t score. What we need is a way of bridging the gap between filters and queries. The function_score query does this and a whole lot more.
