2-6-02. Lucene’s Practical Scoring Function

drscg 2017. 9. 24. 19:36

For multiterm queries, Lucene takes the Boolean model, TF/IDF, and the vector space model and combines them in a single efficient package that collects matching documents and scores them as it goes.

A multiterm query like

GET /my_index/doc/_search
{
  "query": {
    "match": {
      "text": "quick fox"
    }
  }
}

is rewritten internally to look like this:

GET /my_index/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {"term": { "text": "quick" }},
        {"term": { "text": "fox"   }}
      ]
    }
  }
}

The bool query implements the Boolean model and, in this example, will include only documents that contain either the term quick or the term fox or both.

As soon as a document matches a query, Lucene calculates its score for that query, combining the scores of each matching term. The formula used for scoring is called the practical scoring function. It looks intimidating, but don’t be put off—most of the components you already know. It introduces a few new elements that we discuss next.

score(q,d)  =  
            queryNorm(q)  
          · coord(q,d)    
          · ∑ (           
                tf(t in d)   
              · idf(t)²      
              · t.getBoost() 
              · norm(t,d)    
            ) (t in q)

	`score(q, d)` is the relevance score of document `d` for query `q`.
	`queryNorm(q)` is the query normalization factor. (new)
	`coord(q, d)` is the coordination factor. (new)
	The sum (∑) is the sum of the weights for each term `t` in the query `q` for document `d`.
	`tf(t in d)` is the term frequency for term `t` in document `d`.
	`idf(t)` is the inverse document frequency for term `t`.
	`t.getBoost()` is the boost that has been applied to the query. (new)
	`norm(t, d)` is the field-length norm, combined with the index-time field-level boost, if any. (new)
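
As a rough illustration of how the pieces of the formula fit together, here is a minimal Python sketch. It is not Lucene's implementation: the function name `practical_score` and the shape of its inputs are hypothetical, and it assumes that tf, idf, and norm have already been computed for each term.

```python
def practical_score(query_boosts, doc_term_stats, query_norm=1.0):
    """Sketch of score(q,d) = queryNorm(q) * coord(q,d) * sum(tf * idf^2 * boost * norm).

    query_boosts:   dict of term -> query-time boost (t.getBoost())
    doc_term_stats: dict of term -> (tf, idf, norm) for terms present in the document
    query_norm:     precomputed queryNorm(q); assumed to be 1.0 if not supplied
    """
    matching = [t for t in query_boosts if t in doc_term_stats]
    if not matching:
        return 0.0  # the document does not match any query term

    # coord(q,d): fraction of the query terms found in the document
    coord = len(matching) / len(query_boosts)

    # Sum of the per-term weights over the matching terms
    total = 0.0
    for term in matching:
        tf, idf, norm = doc_term_stats[term]
        total += tf * idf ** 2 * query_boosts[term] * norm

    return query_norm * coord * total
```

For example, with the hypothetical values `practical_score({"quick": 1.0, "fox": 1.0}, {"fox": (1.0, 2.0, 0.5)})`, only one of two terms matches, so coord is 0.5, the sum is 1.0 · 2.0² · 1.0 · 0.5 = 2.0, and the result is 1.0.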

You should recognize score, tf, and idf. The queryNorm, coord, t.getBoost, and norm are new.

We will talk more about query-time boosting later in this chapter, but first let’s get query normalization, coordination, and index-time field-level boosting out of the way.

Query Normalization Factor

The query normalization factor (queryNorm) is an attempt to normalize a query so that the results from one query may be compared with the results of another.

Even though the intent of the query norm is to make results from different queries comparable, it doesn’t work very well. The only purpose of the relevance _score is to sort the results of the current query in the correct order. You should not try to compare the relevance scores from different queries.

This factor is calculated at the beginning of the query. The actual calculation depends on the queries involved, but a typical implementation is as follows:

queryNorm = 1 / √sumOfSquaredWeights

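The sumOfSquaredWeights is calculated by squaring the IDF of each term in the query and adding the squares together. The square root is taken afterward, in the queryNorm formula itself.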
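
The typical calculation described here can be sketched in a few lines of Python. The function name `query_norm` is hypothetical, and the sketch assumes the only weights involved are the IDF values of the query terms, as in the description above.

```python
import math

def query_norm(idfs):
    """queryNorm = 1 / sqrt(sumOfSquaredWeights), using only term IDFs as weights."""
    # Square each term's IDF, then add the squares together
    sum_of_squared_weights = sum(idf ** 2 for idf in idfs)
    # The square root is applied once, to the total
    return 1.0 / math.sqrt(sum_of_squared_weights)
```

With two terms whose IDFs are 3.0 and 4.0, the sum of squares is 25.0, so `query_norm([3.0, 4.0])` gives 1 / 5 = 0.2. Because the same value multiplies the score of every document, it does not change their relative order.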

The same query normalization factor is applied to every document, and you have no way of changing it. For all intents and purposes, it can be ignored.

Query Coordination

The coordination factor (coord) is used to reward documents that contain a higher percentage of the query terms. The more query terms that appear in the document, the greater the chances that the document is a good match for the query.

Imagine that we have a query for quick brown fox, and that the weight for each term is 1.5. Without the coordination factor, the score would just be the sum of the weights of the terms in a document. For instance:

Document with fox → score: 1.5
Document with quick fox → score: 3.0
Document with quick brown fox → score: 4.5

The coordination factor multiplies the score by the number of matching terms in the document, and divides it by the total number of terms in the query. With the coordination factor, the scores would be as follows:

Document with fox → score: 1.5 * 1 / 3 = 0.5
Document with quick fox → score: 3.0 * 2 / 3 = 2.0
Document with quick brown fox → score: 4.5 * 3 / 3 = 4.5

The coordination factor results in the document that contains all three terms being much more relevant than the document that contains just two of them.
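
The arithmetic in this example can be reproduced with a short Python sketch. The function name `coord_score` and the `use_coord` flag are hypothetical, and the sketch assumes each document is simply a set of terms with a fixed weight of 1.5 per matching term.

```python
def coord_score(doc_terms, query_weights, use_coord=True):
    """Sum the weights of matching terms, optionally scaled by the coordination factor."""
    matched = [t for t in query_weights if t in doc_terms]
    raw = sum(query_weights[t] for t in matched)
    if not use_coord:
        return raw
    # coord = matching terms in the document / total terms in the query
    return raw * len(matched) / len(query_weights)

weights = {"quick": 1.5, "brown": 1.5, "fox": 1.5}
docs = [{"fox"}, {"quick", "fox"}, {"quick", "brown", "fox"}]
for doc in docs:
    # Scores with coordination: 0.5, 2.0, 4.5
    print(sorted(doc), coord_score(doc, weights))
```

Passing `use_coord=False` returns the plain sums 1.5, 3.0, and 4.5 from the first list.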

Remember that the query for quick brown fox is rewritten into a bool query like this:

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "text": "quick" }},
        { "term": { "text": "brown" }},
        { "term": { "text": "fox"   }}
      ]
    }
  }
}

The bool query uses query coordination by default for all should clauses, but it does allow you to disable coordination. Why might you want to do this? Well, usually the answer is, you don’t. Query coordination is usually a good thing. When you use a bool query to wrap several high-level queries like the match query, it also makes sense to leave coordination enabled. The more clauses that match, the higher the degree of overlap between your search request and the documents that are returned.

However, in some advanced use cases, it might make sense to disable coordination. Imagine that you are looking for the synonyms jump, leap, and hop. You don’t care how many of these synonyms are present, as they all represent the same concept. In fact, only one of the synonyms is likely to be present. This would be a good case for disabling the coordination factor:

GET /_search
{
  "query": {
    "bool": {
      "disable_coord": true,
      "should": [
        { "term": { "text": "jump" }},
        { "term": { "text": "hop"  }},
        { "term": { "text": "leap" }}
      ]
    }
  }
}

When you use synonyms (see Synonyms), this is exactly what happens internally: the rewritten query disables coordination for the synonyms. Most use cases for disabling coordination are handled automatically; you don’t need to worry about it.

Index-Time Field-Level Boosting

We will talk about boosting a field—making it more important than other fields—at query time in Query-Time Boosting. It is also possible to apply a boost to a field at index time. Actually, this boost is applied to every term in the field, rather than to the field itself.

To store this boost value in the index without using more space than necessary, this field-level index-time boost is combined with the field-length norm (see Field-length norm) and stored in the index as a single byte. This is the value returned by norm(t,d) in the preceding formula.

We strongly recommend against using field-level index-time boosts for a few reasons:

Combining the boost with the field-length norm and storing it in a single byte means that the field-length norm loses precision. The result is that Elasticsearch is unable to distinguish between a field containing three words and a field containing five words.
To change an index-time boost, you have to reindex all your documents. A query-time boost, on the other hand, can be changed with every query.
If a field with an index-time boost has multiple values, the boost is multiplied by itself for every value, dramatically increasing the weight for that field.
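
To see how quickly the last point compounds, here is a minimal Python sketch. The function name `unencoded_norm` is hypothetical; the sketch assumes the field-length norm is 1 / √numTerms, as described in Field-length norm, and it deliberately ignores the single-byte encoding, so it shows only the multiplication effect and not the precision loss.

```python
import math

def unencoded_norm(field_boost, num_values, num_terms):
    """Index-time boost combined with the field-length norm, before byte encoding."""
    # The boost is multiplied by itself once per value in the field
    effective_boost = field_boost ** num_values
    # Field-length norm: 1 / sqrt(number of terms in the field)
    return effective_boost / math.sqrt(num_terms)
```

With a boost of 2.0 and four terms in the field, a single value gives `unencoded_norm(2.0, 1, 4)` = 1.0, while three values give `unencoded_norm(2.0, 3, 4)` = 4.0. The weight has grown fourfold even though the field is the same length.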

Query-time boosting is a much simpler, cleaner, more flexible option.

With query normalization, coordination, and index-time boosting out of the way, we can now move on to the most useful tool for influencing the relevance calculation: query-time boosting.
