2-5-7. Index-Time Search-as-You-Type


drscg 2017. 9. 24. 20:58

The first step in setting up index-time search-as-you-type is to define our analysis chain. We discussed this in Configuring Analyzers, but we will go over the steps again here.

Preparing the Index

The first step is to configure a custom edge_ngram token filter, which we will call the autocomplete_filter:

{
    "filter": {
        "autocomplete_filter": {
            "type":     "edge_ngram",
            "min_gram": 1,
            "max_gram": 20
        }
    }
}

This configuration says that, for any term this token filter receives, it should produce n-grams anchored to the start of the word, with a minimum length of 1 and a maximum length of 20.
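To make the behavior of the filter concrete, here is a minimal Python sketch of what an edge n-gram filter with these settings does to a single term. The function name `edge_ngrams` is hypothetical and only mirrors the min_gram and max_gram settings above; it is an illustration, not the Elasticsearch implementation.

```python
def edge_ngrams(term, min_gram=1, max_gram=20):
    """Return the prefixes of `term` with lengths from min_gram to max_gram."""
    # A term shorter than max_gram simply yields fewer n-grams.
    longest = min(len(term), max_gram)
    return [term[:n] for n in range(min_gram, longest + 1)]


print(edge_ngrams("quick"))  # ['q', 'qu', 'qui', 'quic', 'quick']
```

Every n-gram starts at the first character of the term, which is what "anchored to the start of the word" means here.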

Then we need to use this token filter in a custom analyzer, which we will call the autocomplete analyzer:

{
    "analyzer": {
        "autocomplete": {
            "type":      "custom",
            "tokenizer": "standard",
            "filter": [
                "lowercase",
                "autocomplete_filter" 
            ]
        }
    }
}

The last entry in the filter list is our custom edge-ngram token filter.

This analyzer will tokenize a string into individual terms by using the standard tokenizer, lowercase each term, and then produce edge n-grams of each term, thanks to our autocomplete_filter.

The full request to create the index and instantiate the token filter and analyzer looks like this:

PUT /my_index
{
    "settings": {
        "number_of_shards": 1, 
        "analysis": {
            "filter": {
                "autocomplete_filter": { 
                    "type":     "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter" 
                    ]
                }
            }
        }
    }
}


	number_of_shards is set to 1; see Relevance Is Broken!
	First, we define our custom token filter.
	Then, we use it in an analyzer.

You can test this new analyzer with the analyze API to make sure it behaves correctly:

GET /my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "quick brown"
}


The results show us that the analyzer is working correctly. It returns these terms:

q
qu
qui
quic
quick
b
br
bro
brow
brown
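The same term list can be reproduced offline with a rough Python model of the analysis chain. This is only a sketch: the regular expression is a simplified stand-in for the standard tokenizer (which follows Unicode word-boundary rules), and the function names are hypothetical.

```python
import re


def edge_ngrams(term, min_gram=1, max_gram=20):
    """Prefixes of `term` with lengths from min_gram to max_gram."""
    return [term[:n] for n in range(min_gram, min(len(term), max_gram) + 1)]


def autocomplete_analyze(text):
    """Tokenize, lowercase, then emit edge n-grams for each token."""
    # Assumption: splitting on non-word characters approximates the
    # standard tokenizer well enough for simple English text.
    terms = []
    for token in re.findall(r"\w+", text):
        terms.extend(edge_ngrams(token.lower()))
    return terms


print(autocomplete_analyze("quick brown"))
# ['q', 'qu', 'qui', 'quic', 'quick', 'b', 'br', 'bro', 'brow', 'brown']
```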

To use the analyzer, we need to apply it to a field, which we can do with the update-mapping API:

PUT /my_index/_mapping/my_type
{
    "my_type": {
        "properties": {
            "name": {
                "type":     "string",
                "analyzer": "autocomplete"
            }
        }
    }
}


Now we can index some test documents:

POST /my_index/my_type/_bulk
{ "index": { "_id": 1            }}
{ "name": "Brown foxes"    }
{ "index": { "_id": 2            }}
{ "name": "Yellow furballs" }


Querying the Field

If you test out a query for "brown fo" by using a simple match query

GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "name": "brown fo"
        }
    }
}


you will see that both documents match, even though the Yellow furballs document contains neither brown nor fo:

{

  "hits": [
     {
        "_id": "1",
        "_score": 1.5753809,
        "_source": {
           "name": "Brown foxes"
        }
     },
     {
        "_id": "2",
        "_score": 0.012520773,
        "_source": {
           "name": "Yellow furballs"
        }
     }
  ]
}

As always, the validate-query API sheds some light:

GET /my_index/my_type/_validate/query?explain
{
    "query": {
        "match": {
            "name": "brown fo"
        }
    }
}


The explanation shows us that the query is looking for edge n-grams of every word in the query string:

name:b name:br name:bro name:brow name:brown name:f name:fo

The name:f condition is satisfied by the second document because furballs has been indexed as f, fu, fur, and so forth. In retrospect, this is not surprising. The same autocomplete analyzer is being applied both at index time and at search time, which in most situations is the right thing to do. This is one of the few occasions when it makes sense to break that rule.

We want to ensure that our inverted index contains edge n-grams of every word, but we want to match only the full words that the user has entered (brown and fo). We can do this by using the autocomplete analyzer at index time and the standard analyzer at search time. One way to change the search analyzer is to specify it in the query:

GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "name": {
                "query":    "brown fo",
                "analyzer": "standard" 
            }
        }
    }
}


The analyzer parameter in the query overrides the analyzer setting on the name field.

Alternatively, we can specify the analyzer and search_analyzer in the mapping for the name field itself. Because we want to change only the search_analyzer, we can update the existing mapping without having to reindex our data:

PUT /my_index/my_type/_mapping
{
    "my_type": {
        "properties": {
            "name": {
                "type":            "string",
                "analyzer":  "autocomplete", 
                "search_analyzer": "standard" 
            }
        }
    }
}


	Use the `autocomplete` analyzer at index time to produce edge n-grams of every term.
	Use the `standard` analyzer at search time to search only on the terms that the user has entered.

If we were to repeat the validate-query request, it would now give us this explanation:

name:brown name:fo

Repeating our query now correctly returns just the Brown foxes document.

Because most of the work has been done at index time, all this query needs to do is look up the two terms brown and fo. That is much more efficient than the match_phrase_prefix approach, which has to find all terms beginning with fo.
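The difference between analyzing the query with the autocomplete analyzer and with the standard analyzer can be sketched with a toy inverted index in Python. Everything here is an assumption made for illustration: the dictionary-based index, the helper names, and the regular-expression tokenizer only approximate what Elasticsearch does, and the match query is modeled with its default OR behavior and no scoring.

```python
import re

DOCS = {1: "Brown foxes", 2: "Yellow furballs"}


def standard_analyze(text):
    """Rough stand-in for the standard analyzer: split and lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def autocomplete_analyze(text, max_gram=20):
    """Standard analysis followed by edge n-grams of each term."""
    terms = []
    for term in standard_analyze(text):
        terms.extend(term[:n] for n in range(1, min(len(term), max_gram) + 1))
    return terms


def build_index(docs, analyzer):
    """Map each term to the set of document IDs that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyzer(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def match(index, query, search_analyzer):
    """A document matches if it contains any of the analyzed query terms."""
    hits = set()
    for term in search_analyzer(query):
        hits |= index.get(term, set())
    return sorted(hits)


index = build_index(DOCS, autocomplete_analyze)

# Same analyzer at search time: the query term "f" also matches furballs.
print(match(index, "brown fo", autocomplete_analyze))  # [1, 2]

# Standard analyzer at search time: only "brown" and "fo" are looked up.
print(match(index, "brown fo", standard_analyze))  # [1]
```

In the second call, each query term is a single dictionary lookup, which is the reason the index-time approach is cheap at query time.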

Completion Suggester

Using edge n-grams for search-as-you-type is easy to set up, flexible, and fast. However, sometimes it is not fast enough. Latency matters, especially when you are trying to provide instant feedback. Sometimes the fastest way of searching is not to search at all.

The completion suggester in Elasticsearch takes a completely different approach. You feed it a list of all possible completions, and it builds them into a finite state transducer, an optimized data structure that resembles a big graph. To search for suggestions, Elasticsearch starts at the beginning of the graph and moves character by character along the matching path. Once it has run out of user input, it looks at all possible endings of the current path to produce a list of suggestions.
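The character-by-character walk described above can be illustrated with a plain trie in Python. This is a deliberately simplified stand-in: a real finite state transducer also shares suffixes and carries outputs such as weights, and the class names and sample completions below (other than "Johnny Rotten") are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> TrieNode
        self.is_end = False  # True if a completion ends at this node


class PrefixCompleter:
    """Toy prefix completer; stores completions lowercased."""

    def __init__(self, completions):
        self.root = TrieNode()
        for completion in completions:
            node = self.root
            for ch in completion.lower():
                node = node.children.setdefault(ch, TrieNode())
            node.is_end = True

    def suggest(self, prefix):
        prefix = prefix.lower()
        # Follow the matching path one character at a time.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []  # no stored completion starts with this prefix
        # User input is exhausted: collect every ending below this node.
        results = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_end:
                results.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(results)


completer = PrefixCompleter(["Johnny Rotten", "Johnny Cash", "Joan Jett"])
print(completer.suggest("john"))  # ['johnny cash', 'johnny rotten']
print(completer.suggest("rot"))   # []
```

The empty result for "rot" shows why this kind of structure suits inputs with a predictable word order: only prefixes of the whole completion are reachable.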

This data structure lives in memory and makes prefix lookups extremely fast, much faster than any term-based query could be. It is an excellent match for autocompletion of names and brands, whose words are usually organized in a common order: "Johnny Rotten" rather than "Rotten Johnny".

When word order is less predictable, edge n-grams can be a better solution than the completion suggester. This particular cat may be skinned in myriad ways.

Edge n-grams and Postcodes

The edge n-gram approach can also be used for structured data, such as the postcodes example from earlier in this chapter. Of course, the postcode field would need to be analyzed instead of not_analyzed, but you could use the keyword tokenizer to treat the postcodes as if they were not_analyzed.

The keyword tokenizer is the no-operation tokenizer, the tokenizer that does nothing. Whatever string it receives as input, it emits exactly the same string as a single token. It can therefore be used for values that we would normally treat as not_analyzed but that require some other analysis transformation, such as lowercasing.

This example uses the keyword tokenizer to convert the postcode string into a token stream, so that we can use the edge n-gram token filter:

{
    "analysis": {
        "filter": {
            "postcode_filter": {
                "type":     "edge_ngram",
                "min_gram": 1,
                "max_gram": 8
            }
        },
        "analyzer": {
            "postcode_index": { 
                "tokenizer": "keyword",
                "filter":    [ "postcode_filter" ]
            },
            "postcode_search": { 
                "tokenizer": "keyword"
            }
        }
    }
}


	The `postcode_index` analyzer uses the `postcode_filter` to turn postcodes into edge n-grams.
	The `postcode_search` analyzer treats search terms as if they were `not_analyzed`.
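As a rough illustration of how these two analyzers differ, the following Python sketch models the keyword tokenizer and the edge n-gram filter with max_gram 8. The function names mirror the analyzer names above but are otherwise hypothetical, and the sample postcode is just an illustrative value.

```python
def keyword_tokenize(text):
    """Emit the whole input, unchanged, as a single token."""
    return [text]


def postcode_index(text, min_gram=1, max_gram=8):
    """Index-time analysis: the single token is expanded into edge n-grams."""
    terms = []
    for token in keyword_tokenize(text):
        longest = min(len(token), max_gram)
        terms.extend(token[:n] for n in range(min_gram, longest + 1))
    return terms


def postcode_search(text):
    """Search-time analysis: the query is kept as one exact token."""
    return keyword_tokenize(text)


print(postcode_index("W1V 3DG"))
# ['W', 'W1', 'W1V', 'W1V ', 'W1V 3', 'W1V 3D', 'W1V 3DG']
print(postcode_search("W1V"))  # ['W1V']
```

Because the tokenizer does not split on whitespace, the space inside the postcode is part of the n-grams, and a partial postcode typed by the user matches only if it is a prefix of the full indexed value.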
