3-5-4. Divide and Conquer

drscg 2017. 9. 24. 12:58

The terms in a query string can be divided into more-important (low-frequency) and less-important (high-frequency) terms. Documents that match only the less-important terms are probably of very little interest. What we really want are documents that match as many of the more-important terms as possible.

The match query accepts a cutoff_frequency parameter, which allows it to divide the terms in the query string into a low-frequency group and a high-frequency group. The low-frequency group (more-important terms) forms the bulk of the query, while the high-frequency group (less-important terms) is used only for scoring, not for matching. By treating these two groups differently, we can gain a real speed boost on previously slow queries.
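The split can be sketched in a few lines of Python. This is only an illustration of the idea, not Elasticsearch's implementation: the function name split_by_frequency and the document frequencies are made up, and the strict "greater than" comparison is an assumption.

```python
def split_by_frequency(terms, doc_freq, total_docs, cutoff_frequency):
    """Split query terms into (low_frequency, high_frequency) lists."""
    low, high = [], []
    for term in terms:
        # Fraction of documents in the index that contain this term.
        ratio = doc_freq.get(term, 0) / total_docs
        if ratio > cutoff_frequency:
            high.append(term)  # less important: used only for scoring
        else:
            low.append(term)   # more important: decides what matches
    return low, high


# Hypothetical document frequencies for an index of 1,000 documents.
doc_freq = {"quick": 4, "and": 870, "the": 950, "dead": 7}
low, high = split_by_frequency(
    ["quick", "and", "the", "dead"], doc_freq, total_docs=1000, cutoff_frequency=0.01
)
print(low)   # ['quick', 'dead']
print(high)  # ['and', 'the']
```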

Domain-Specific Stopwords

One of the benefits of cutoff_frequency is that you get domain-specific stopwords for free. For instance, a website about movies may use the words movie, color, black, and white so often that they could be considered almost meaningless. With the stop token filter, these domain-specific terms would have to be added to the stopwords list manually. However, because cutoff_frequency looks at the actual frequency of terms in the index, these words would be classified as high frequency automatically.
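To see why no hand-written stopword list is needed, here is a toy sketch that counts document frequencies over a tiny, made-up movie corpus. The corpus, the 0.8 cutoff, and the helper name document_frequencies are assumptions for illustration; a real index computes these statistics itself.

```python
from collections import Counter

# A tiny, made-up corpus standing in for a movie website's index.
docs = [
    "a black and white movie about a detective",
    "a color movie about a circus",
    "a silent movie in black and white",
    "a movie about a heist",
]


def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.split()))  # set(): count each doc once per term
    return counts


df = document_frequencies(docs)
cutoff = 0.8  # terms in more than 80% of documents count as high frequency
high_frequency = sorted(t for t, n in df.items() if n / len(docs) > cutoff)
print(high_frequency)  # ['a', 'movie']
```

The domain word movie lands in the high-frequency group purely because of how often it occurs, alongside the ordinary stopword a.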

Take this query as an example:

{
  "match": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01
    }
  }
}

Any term that occurs in more than 1% of documents is considered high frequency. The cutoff_frequency can be specified as a fraction (0.01) or as an absolute number of documents (5).
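The two ways of specifying the cutoff can be sketched as follows. This assumes that values below 1 are read as a fraction of all documents and values of 1 or more as an absolute document count; the function name is_high_frequency is hypothetical.

```python
def is_high_frequency(doc_count, total_docs, cutoff_frequency):
    """Decide whether a term that appears in doc_count documents is high frequency."""
    if cutoff_frequency >= 1:
        # Absolute form: the cutoff is a number of documents, e.g. 5.
        return doc_count > cutoff_frequency
    # Fractional form: the cutoff is a share of all documents, e.g. 0.01 = 1%.
    return doc_count / total_docs > cutoff_frequency


print(is_high_frequency(20, 1000, 0.01))  # True: 2% of documents
print(is_high_frequency(4, 1000, 5))      # False: only 4 documents
print(is_high_frequency(6, 1000, 5))      # True: more than 5 documents
```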

This query uses the cutoff_frequency to first divide the query terms into a low-frequency group (quick, dead) and a high-frequency group (and, the). Then, the query is rewritten to produce the following bool query:

{
  "bool": {
    "must": { 
      "bool": {
        "should": [
          { "term": { "text": "quick" }},
          { "term": { "text": "dead"  }}
        ]
      }
    },
    "should": { 
      "bool": {
        "should": [
          { "term": { "text": "and" }},
          { "term": { "text": "the" }}
        ]
      }
    }
  }
}

	must: at least one low-frequency (high-importance) term must match.
	should: the high-frequency (low-importance) terms are entirely optional.

The must clause means that at least one of the low-frequency terms (quick or dead) must be present for a document to be considered a match. All other documents are excluded. The should clause then looks for the high-frequency terms "and" and "the", but only in the documents collected by the must clause. The sole job of the should clause is to score a document like "Quick and the dead" higher than "The quick but dead". This approach greatly reduces the number of documents that need to be examined and scored.
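A toy evaluator makes the division of labor between the two clauses concrete. It is a sketch only: the score here is a plain count of matching terms, not the relevance formula Elasticsearch uses, and the function name evaluate is made up.

```python
LOW_FREQUENCY = {"quick", "dead"}   # must group: decides whether a doc matches
HIGH_FREQUENCY = {"and", "the"}     # should group: only adds to the score


def evaluate(text):
    """Return None for a non-matching document, otherwise a toy score."""
    tokens = set(text.lower().split())
    low_hits = len(LOW_FREQUENCY & tokens)
    if low_hits == 0:
        return None  # excluded: no low-frequency term is present
    # High-frequency terms are consulted only for documents that already match.
    return low_hits + len(HIGH_FREQUENCY & tokens)


print(evaluate("Quick and the dead"))      # 4
print(evaluate("The quick but dead"))      # 3
print(evaluate("And the band played on"))  # None
```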

Setting the operator parameter to and would make all low-frequency terms required, and would score documents that contain all high-frequency terms higher. However, matching documents would not be required to contain all high-frequency terms. If you would prefer all low- and high-frequency terms to be required, you should use a bool query instead. As we saw in "and Operator", this is already an efficient query.
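The effect of the operator on the rewritten query might look like the sketch below. The rewrite_match helper is hypothetical and mirrors only the behavior described in the text: with and, every term inside a group becomes required within that group, while the high-frequency group as a whole stays optional.

```python
import json


def rewrite_match(field, low_terms, high_terms, operator="or"):
    """Sketch of the rewritten bool query for a given operator."""
    # "and" makes every term in a group required within that group;
    # "or" leaves the terms of a group as alternatives.
    occur = "must" if operator == "and" else "should"

    def group(terms):
        return {"bool": {occur: [{"term": {field: t}} for t in terms]}}

    return {
        "bool": {
            # A document must satisfy the low-frequency group to match at all.
            "must": group(low_terms),
            # The high-frequency group stays optional and only adds to the score.
            "should": group(high_terms),
        }
    }


rewritten = rewrite_match("text", ["quick", "dead"], ["and", "the"], operator="and")
print(json.dumps(rewritten, indent=2))
```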

Controlling Precision

The minimum_should_match parameter can be combined with cutoff_frequency, but it applies only to the low-frequency terms. This query:

{
  "match": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01,
      "minimum_should_match": "75%"
    }
  }
}

would be rewritten as follows:

{
  "bool": {
    "must": {
      "bool": {
        "should": [
          { "term": { "text": "quick" }},
          { "term": { "text": "dead"  }}
        ],
        "minimum_should_match": 1 
      }
    },
    "should": { 
      "bool": {
        "should": [
          { "term": { "text": "and" }},
          { "term": { "text": "the" }}
        ]
      }
    }
  }
}

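	minimum_should_match: because there are only two low-frequency terms, the original 75% becomes 1. That is, one of the two low-frequency terms must match.
	should: the high-frequency terms are still optional and are used only for scoring.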
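The arithmetic behind that conversion can be sketched as follows, assuming a percentage is applied to the number of low-frequency clauses and rounded down. The helper name resolve_minimum_should_match is hypothetical, and only positive percentages and plain integers are handled.

```python
def resolve_minimum_should_match(spec, clause_count):
    """Turn a value such as "75%" or 2 into a required number of clauses."""
    if isinstance(spec, str) and spec.endswith("%"):
        percent = int(spec[:-1])
        # Integer division rounds down: 75% of 2 clauses is 1.5, which becomes 1.
        return clause_count * percent // 100
    return int(spec)


print(resolve_minimum_should_match("75%", 2))  # 1
print(resolve_minimum_should_match("75%", 4))  # 3
print(resolve_minimum_should_match(2, 4))      # 2
```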

Only High-Frequency Terms

An or query for high-frequency terms only ("To be, or not to be") is the worst case for performance. It is pointless to score all the documents that contain only one of these terms in order to return just the top 10 matches. We are really interested only in documents in which the terms all occur together, so when there are no low-frequency terms, the query is rewritten to make all high-frequency terms required:

{
  "bool": {
    "must": [
      { "term": { "text": "to" }},
      { "term": { "text": "be" }},
      { "term": { "text": "or" }},
      { "term": { "text": "not" }},
      { "term": { "text": "to" }},
      { "term": { "text": "be" }}
    ]
  }
}
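This special case fits into a rewrite sketch as a single extra branch. The rewrite_match helper is hypothetical and follows only the behavior described here: with no low-frequency terms, every high-frequency term becomes a must clause.

```python
def rewrite_match(field, low_terms, high_terms):
    """Sketch of the rewrite, including the high-frequency-only case."""

    def clauses(terms):
        return [{"term": {field: t}} for t in terms]

    if not low_terms:
        # No low-frequency terms: require every high-frequency term instead.
        return {"bool": {"must": clauses(high_terms)}}
    return {
        "bool": {
            "must": {"bool": {"should": clauses(low_terms)}},
            "should": {"bool": {"should": clauses(high_terms)}},
        }
    }


terms = "to be or not to be".split()
rewritten = rewrite_match("text", [], terms)
print(len(rewritten["bool"]["must"]))  # 6
```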

More Control with Common Terms

While the high/low frequency functionality in the match query is useful, sometimes you want more control over how the high- and low-frequency groups are handled. The match query exposes only a subset of the functionality available in the common terms query.

For instance, we could make all low-frequency terms required, and score only documents that contain at least 75% of the high-frequency terms, with a query like this:

{
  "common": {
    "text": {
      "query":                  "Quick and the dead",
      "cutoff_frequency":       0.01,
      "low_freq_operator":      "and",
      "minimum_should_match": {
        "high_freq":            "75%"
      }
    }
  }
}

See the common terms query reference page for more options.

License: Attribution, NonCommercial, NoDerivatives
