2-2-7. Controlling Analysis

drscg 2017. 9. 30. 01:26

Queries can find only terms that actually exist in the inverted index, so it is important to ensure that the same analysis process is applied both to the document at index time and to the query string at search time. That way, the terms in the query match the terms in the inverted index.

Although we say document, analyzers are determined per field. Each field can have a different analyzer, either by configuring a specific analyzer for that field or by falling back on the type, index, or node defaults. At index time, a field’s value is analyzed by using the configured or default analyzer for that field.

For instance, let’s add a new field to my_index:

PUT /my_index/_mapping/my_type
{
    "my_type": {
        "properties": {
            "english_title": {
                "type":     "string",
                "analyzer": "english"
            }
        }
    }
}

Now we can compare how values in the english_title field and the title field are analyzed at index time by using the analyze API to analyze the word Foxes:

GET /my_index/_analyze
{
  "field": "title",   
  "text": "Foxes"
}

GET /my_index/_analyze
{
  "field": "english_title",   
  "text": "Foxes"
}

The `title` field uses the default `standard` analyzer, so the first request returns the term `foxes`.
The `english_title` field uses the `english` analyzer, so the second request returns the term `fox`.

This means that, were we to run a low-level term query for the exact term fox, the english_title field would match but the title field would not.

High-level queries like the match query understand field mappings and can apply the correct analyzer for each field being queried. We can see this in action with the validate-query API:

GET /my_index/my_type/_validate/query?explain
{
    "query": {
        "bool": {
            "should": [
                { "match": { "title":         "Foxes"}},
                { "match": { "english_title": "Foxes"}}
            ]
        }
    }
}

This query returns the following explanation:

(title:foxes english_title:fox)

The match query uses the appropriate analyzer for each field to ensure that it looks for each term in the correct format for that field.
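
The difference between a term lookup and a match lookup can be sketched with a toy inverted index. This is only an illustration: `standard_like` and `english_like` are hypothetical stand-ins written for this example (the second one just strips a plural ending), not the real Elasticsearch analyzers, and none of the function names below come from Elasticsearch.

```python
import re

def standard_like(text):
    """Toy stand-in for the standard analyzer: split into words and lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

def english_like(text):
    """Toy stand-in for the english analyzer: lowercase, then crude plural stripping."""
    terms = []
    for term in standard_like(text):
        if term.endswith("es") and len(term) > 3:
            term = term[:-2]
        elif term.endswith("s") and len(term) > 2:
            term = term[:-1]
        terms.append(term)
    return terms

# One analyzer per field, mirroring the mapping used in the text.
FIELD_ANALYZERS = {"title": standard_like, "english_title": english_like}

def index_document(doc_id, doc, inverted_index):
    """Analyze each field with its own analyzer and record (field, term) -> doc ids."""
    for field, value in doc.items():
        for term in FIELD_ANALYZERS[field](value):
            inverted_index.setdefault((field, term), set()).add(doc_id)

def term_lookup(inverted_index, field, term):
    """Exact lookup with no analysis, in the spirit of a low-level term query."""
    return inverted_index.get((field, term), set())

def match_lookup(inverted_index, field, text):
    """Analyze the query text with the field's analyzer first, like a match query."""
    hits = set()
    for term in FIELD_ANALYZERS[field](text):
        hits |= term_lookup(inverted_index, field, term)
    return hits

index = {}
index_document(1, {"title": "Foxes", "english_title": "Foxes"}, index)

print(term_lookup(index, "title", "fox"))             # set()
print(term_lookup(index, "english_title", "fox"))     # {1}
print(match_lookup(index, "title", "Foxes"))          # {1}
print(match_lookup(index, "english_title", "Foxes"))  # {1}
```

The exact term fox is found only in english_title, while the analyzed lookup for Foxes finds the document in both fields because the query text goes through the same analyzer that was used at index time.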

Default Analyzers

While we can specify an analyzer at the field level, how do we determine which analyzer is used for a field if none is specified at the field level?

Analyzers can be specified at three levels: per-field, per-index, or the global default. Elasticsearch works through each level until it finds an analyzer that it can use. At index time, the order is as follows:

The analyzer defined in the field mapping, else
The analyzer named default in the index settings, which defaults to
The standard analyzer

At search time, the sequence is slightly different:

The analyzer defined in the query itself, else
The analyzer defined in the field mapping, else
The analyzer named default in the index settings, which defaults to
The standard analyzer

Occasionally, it makes sense to use a different analyzer at index and search time. For instance, at index time we may want to index synonyms (for example, for every occurrence of quick, we also index fast, rapid, and speedy). But at search time, we do not need to search for all of these synonyms. Instead we can just look up the single word that the user has entered, be it quick, fast, rapid, or speedy.

To enable this distinction, Elasticsearch also supports an optional search_analyzer mapping parameter that is used only at search time (analyzer is still used for indexing). There is also an equivalent default_search analyzer in the index settings for configuring the default at the index level.
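
As a rough sketch of the synonym case, the request body below is written as a Python dict so it can be printed as JSON. It follows the 2.x-style mapping syntax used earlier in this post. The index name `synonym_demo`, the filter name `speed_synonyms`, and the analyzer name `index_with_synonyms` are made up for this example, and the body has not been run against a cluster here.

```python
import json

# Hypothetical body for `PUT /synonym_demo` (all names are illustrative).
create_index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Expands quick/fast/rapid/speedy into each other at index time.
                "speed_synonyms": {
                    "type": "synonym",
                    "synonyms": ["quick,fast,rapid,speedy"],
                }
            },
            "analyzer": {
                "index_with_synonyms": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "speed_synonyms"],
                }
            },
        }
    },
    "mappings": {
        "my_type": {
            "properties": {
                "title": {
                    "type": "string",
                    # Used when documents are indexed.
                    "analyzer": "index_with_synonyms",
                    # Used only when the field is searched.
                    "search_analyzer": "standard",
                }
            }
        }
    },
}

print(json.dumps(create_index_body, indent=2))
```

With a mapping like this, synonyms are written into the inverted index once, and the search side only has to look up the single word the user typed.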

Taking these extra parameters into account, the full sequence at search time is actually as follows:

The analyzer defined in the query itself, else
The search_analyzer defined in the field mapping, else
The analyzer defined in the field mapping, else
The analyzer named default_search in the index settings, which defaults to
The analyzer named default in the index settings, which defaults to
The standard analyzer
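
The two fallback orders can be written down as a small lookup function. This is a sketch of the ordering described above, not Elasticsearch's actual implementation. The function names and the plain-dict inputs (`query_params`, `field_mapping`, `index_analyzers`) are assumptions made for this example.

```python
def resolve_index_analyzer(field_mapping, index_analyzers):
    """Index-time order: field mapping, then index-level `default`, then standard."""
    if field_mapping.get("analyzer"):
        return field_mapping["analyzer"]
    if "default" in index_analyzers:
        return "default"
    return "standard"

def resolve_search_analyzer(query_params, field_mapping, index_analyzers):
    """Search-time order, including search_analyzer and default_search."""
    candidates = [
        query_params.get("analyzer"),          # analyzer given in the query itself
        field_mapping.get("search_analyzer"),  # search_analyzer in the field mapping
        field_mapping.get("analyzer"),         # analyzer in the field mapping
        "default_search" if "default_search" in index_analyzers else None,
        "default" if "default" in index_analyzers else None,
    ]
    for name in candidates:
        if name:
            return name
    return "standard"

# Hypothetical inputs: a field with both analyzers, and an index with a default.
field = {"type": "string", "analyzer": "english", "search_analyzer": "simple"}
index_level = {"default": {"type": "whitespace"}}

print(resolve_index_analyzer(field, index_level))                           # english
print(resolve_search_analyzer({}, field, index_level))                      # simple
print(resolve_search_analyzer({"analyzer": "stop"}, field, index_level))    # stop
print(resolve_search_analyzer({}, {"type": "string"}, index_level))         # default
print(resolve_search_analyzer({}, {"type": "string"}, {}))                  # standard
```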

Configuring Analyzers in Practice

The sheer number of places where you can specify an analyzer is quite overwhelming. In practice, though, it is pretty simple.

Keep it simple

Most of the time, you will know what fields your documents will contain ahead of time. The simplest approach is to set the analyzer for each full-text field when you create your index or add type mappings. While this approach is slightly more verbose, it enables you to easily see which analyzer is being applied to each field.

Typically, most of your string fields will be exact-value not_analyzed fields such as tags or enums, plus a handful of full-text fields that will use some default analyzer like standard, english, or another language analyzer. Then you may have one or two fields that need custom analysis: perhaps the title field needs to be indexed in a way that supports find-as-you-type.

You can set the default analyzer in the index to the analyzer you want to use for almost all full-text fields, and just configure the specialized analyzer on the one or two fields that need it.
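
A layout along these lines might look like the following, again written as a Python dict in the 2.x-style syntax of this post. The index name `blog_index`, the type `blog_post`, the field names, and the `autocomplete` analyzer with its edge n-gram filter are all assumptions chosen for illustration.

```python
# Hypothetical body for `PUT /blog_index` (all names are illustrative).
create_index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Assumed edge n-gram filter to support find-as-you-type.
                "autocomplete_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20,
                }
            },
            "analyzer": {
                # Index-level default for fields without their own analyzer.
                "default": {"type": "english"},
                # Specialized analyzer for the one field that needs it.
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "autocomplete_filter"],
                },
            },
        }
    },
    "mappings": {
        "blog_post": {
            "properties": {
                "tags": {"type": "string", "index": "not_analyzed"},
                "body": {"type": "string"},
                "title": {"type": "string", "analyzer": "autocomplete"},
            }
        }
    },
}

def index_analyzer_for(field_mapping):
    """Name of the analyzer a field would use at index time (None if not analyzed)."""
    if field_mapping.get("index") == "not_analyzed":
        return None
    return field_mapping.get("analyzer", "default")

properties = create_index_body["mappings"]["blog_post"]["properties"]
for name in sorted(properties):
    print(name, "->", index_analyzer_for(properties[name]))
```

Only title names an analyzer explicitly. The body field picks up the index-level default, and tags is stored as an exact value.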

A common workflow for time-based data like logging is to create a new index per day on the fly by just indexing into it. While this workflow prevents you from creating your index up front, you can still use index templates to specify the settings and mappings that a new index should have.
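
A minimal sketch of such a template, assuming the 2.x-style template body in which the `template` key holds the index name pattern. The template name `daily_logs`, the `logs-` prefix, the type `log_event`, and its fields are hypothetical. The `fnmatch` check is only a local approximation of how the pattern would be matched against a new daily index name.

```python
import fnmatch
from datetime import date

# Hypothetical body for `PUT /_template/daily_logs` (all names are illustrative).
template_body = {
    "template": "logs-*",  # index name pattern the template applies to
    "settings": {
        "analysis": {"analyzer": {"default": {"type": "english"}}}
    },
    "mappings": {
        "log_event": {
            "properties": {
                "level": {"type": "string", "index": "not_analyzed"},
                "message": {"type": "string"},
            }
        }
    },
}

def daily_index_name(day, prefix="logs-"):
    """Build the per-day index name, e.g. logs-2017.09.30."""
    return prefix + day.strftime("%Y.%m.%d")

def template_applies(index_name, template):
    """Approximate the wildcard match between an index name and the pattern."""
    return fnmatch.fnmatchcase(index_name, template["template"])

name = daily_index_name(date(2017, 9, 30))
print(name, template_applies(name, template_body))  # logs-2017.09.30 True
```

Once a template like this is registered, the first document indexed into a new daily index would create that index with the template's settings and mappings, so the analyzers do not have to be configured by hand each day.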
