2-3-10. cross-fields Queries

drscg 2017. 9. 28. 21:46

The custom _all approach is a good solution, as long as you set it up before indexing your documents. However, Elasticsearch also provides a search-time solution to the problem: the multi_match query with type cross_fields. The cross_fields type takes a term-centric approach, quite different from the field-centric approach taken by best_fields and most_fields. It treats all of the fields as one big field and looks for each term in any field.

To illustrate the difference between field-centric and term-centric queries, look at the explanation for this field-centric most_fields query:

GET /_validate/query?explain
{
    "query": {
        "multi_match": {
            "query":       "peter smith",
            "type":        "most_fields",
            "operator":    "and", 
            "fields":      [ "first_name", "last_name" ]
        }
    }
}

With "operator": "and", all terms are required.

For a document to match, both peter and smith must appear in the same field, either first_name or last_name:

(+first_name:peter +first_name:smith)
(+last_name:peter  +last_name:smith)

A term-centric approach would use this logic instead:

+(first_name:peter last_name:peter)
+(first_name:smith last_name:smith)

In other words, the term peter must appear in either field, and the term smith must appear in either field.
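The two matching rules can be sketched in a few lines of Python. This is only a toy model of the boolean logic shown above, not how Elasticsearch evaluates queries internally; the function names and the sample documents are hypothetical.

```python
from typing import Dict, List


def field_centric_match(doc: Dict[str, List[str]], terms: List[str], fields: List[str]) -> bool:
    """Field-centric with operator "and": some single field must contain every term."""
    return any(all(term in doc.get(field, []) for term in terms) for field in fields)


def term_centric_match(doc: Dict[str, List[str]], terms: List[str], fields: List[str]) -> bool:
    """Term-centric with operator "and": every term must appear in at least one field."""
    return all(any(term in doc.get(field, []) for field in fields) for term in terms)


fields = ["first_name", "last_name"]
terms = ["peter", "smith"]

# Hypothetical documents, already split into tokens per field.
split_doc = {"first_name": ["peter"], "last_name": ["smith"]}
same_field_doc = {"first_name": ["peter", "smith"], "last_name": ["jones"]}

print(field_centric_match(split_doc, terms, fields))  # False
print(term_centric_match(split_doc, terms, fields))   # True
```

The document with peter in first_name and smith in last_name is rejected by the field-centric rule but accepted by the term-centric one, which is the behavior a name search usually wants.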

The cross_fields type first analyzes the query string to produce a list of terms, and then it searches for each term in any field. That difference alone solves two of the three problems that we listed in Field-Centric Queries, leaving us just with the issue of differing inverse document frequencies (IDF).

Fortunately, the cross_fields type solves this too, as can be seen from this validate-query request:

GET /_validate/query?explain
{
    "query": {
        "multi_match": {
            "query":       "peter smith",
            "type":        "cross_fields", 
            "operator":    "and",
            "fields":      [ "first_name", "last_name" ]
        }
    }
}

The cross_fields type uses term-centric matching.

It solves the term-frequency problem (the differing IDFs noted above) by blending inverse document frequencies across fields:

+blended("peter", fields: [first_name, last_name])
+blended("smith", fields: [first_name, last_name])

In other words, it looks up the IDF of smith in both the first_name and the last_name fields and uses the minimum of the two as the IDF for both fields. Because smith is a common last name, it is treated as a common first name too.
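A small numeric sketch makes the blending step concrete. It assumes a classic TF/IDF-style formula, idf = 1 + ln(num_docs / (doc_freq + 1)), and invented document frequencies; both are assumptions for illustration, and the real blending happens inside Elasticsearch.

```python
import math
from typing import Dict


def idf(doc_freq: int, num_docs: int) -> float:
    """Classic TF/IDF-style inverse document frequency (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def blended_idf(doc_freqs: Dict[str, int], num_docs: int) -> float:
    """Use the minimum per-field IDF as the IDF for every field in the group."""
    return min(idf(df, num_docs) for df in doc_freqs.values())


# Hypothetical numbers: smith is rare as a first name, common as a last name.
num_docs = 1000
smith_doc_freqs = {"first_name": 2, "last_name": 300}

per_field = {field: idf(df, num_docs) for field, df in smith_doc_freqs.items()}
blended = blended_idf(smith_doc_freqs, num_docs)

for field, value in per_field.items():
    print(f"{field}: own idf={value:.3f}, blended idf={blended:.3f}")
```

Without blending, a stray smith in first_name would score far higher than smith in last_name simply because it is rare there. With the blended value, both fields weight the term the same way.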

For the cross_fields query type to work optimally, all fields should have the same analyzer. Fields that share an analyzer are grouped together as blended fields.

If you include fields with a different analysis chain, they will be added to the query in the same way as for best_fields. For instance, if we added the title field to the preceding query (assuming it uses a different analyzer), the explanation would be as follows:

(+title:peter +title:smith)
(
  +blended("peter", fields: [first_name, last_name])
  +blended("smith", fields: [first_name, last_name])
)

This grouping is particularly important when using the minimum_should_match and operator parameters, because they are then applied within each group rather than across all fields.
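The grouping behavior can be imitated with a short Python sketch that buckets fields by analyzer and prints one clause per bucket. The analyzer names and helper functions are hypothetical, and the output only mimics the shape of the explanation shown above.

```python
from collections import defaultdict
from typing import Dict, List


def group_fields_by_analyzer(field_analyzers: Dict[str, str]) -> List[List[str]]:
    """Bucket field names by analyzer, keeping first-seen order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for field, analyzer in field_analyzers.items():
        groups[analyzer].append(field)
    return list(groups.values())


def explain_clauses(terms: List[str], field_analyzers: Dict[str, str]) -> List[str]:
    """Build one required-terms clause per analyzer group (illustrative only)."""
    clauses = []
    for group in group_fields_by_analyzer(field_analyzers):
        if len(group) == 1:
            # A lone field is queried on its own, as with best_fields.
            parts = [f"+{group[0]}:{term}" for term in terms]
        else:
            names = ", ".join(group)
            parts = [f'+blended("{term}", fields: [{names}])' for term in terms]
        clauses.append("(" + " ".join(parts) + ")")
    return clauses


# Assumed mapping: title uses a different analyzer than the name fields.
field_analyzers = {"title": "english", "first_name": "standard", "last_name": "standard"}
clauses = explain_clauses(["peter", "smith"], field_analyzers)
for clause in clauses:
    print(clause)
```

Each clause requires both terms inside its own group, so a document with peter only in title and smith only in last_name would satisfy neither clause.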

Per-Field Boosting

One of the advantages of using the cross_fields query over custom _all fields is that you can boost individual fields at query time.

For fields of equal value like first_name and last_name, this generally isn’t required, but if you were searching for books using the title and description fields, you might want to give more weight to the title field. This can be done as described before with the caret (^) syntax:

GET /books/_search
{
    "query": {
        "multi_match": {
            "query":       "peter smith",
            "type":        "cross_fields",
            "fields":      [ "title^2", "description" ] 
        }
    }
}

The title field has a boost of 2, while the description field has the default boost of 1.
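If the boosts come from application settings, the request body can be assembled programmatically. The following is a minimal sketch that only builds the JSON body; the helper names are hypothetical, and sending the request to a cluster is left out.

```python
import json
from typing import Dict, List


def boosted_fields(boosts: Dict[str, float]) -> List[str]:
    """Render field names with caret boosts, omitting the default boost of 1."""
    return [name if boost == 1 else f"{name}^{boost}" for name, boost in boosts.items()]


def cross_fields_query(text: str, boosts: Dict[str, float]) -> dict:
    """Build a multi_match request body of type cross_fields."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": "cross_fields",
                "fields": boosted_fields(boosts),
            }
        }
    }


body = cross_fields_query("peter smith", {"title": 2, "description": 1})
print(json.dumps(body, indent=4))
```

Keeping the boosts in a plain mapping makes it easy to tune them per request, which is exactly what a single custom _all field cannot offer.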

The advantage of being able to boost individual fields should be weighed against the cost of querying multiple fields instead of querying a single custom _all field. Use whichever of the two solutions delivers the most bang for your buck.

License: Attribution, Non-Commercial, No Derivatives
