2.X/3. Dealing with Human Language

3-5-2. Using Stopwords

Stopword removal is handled by the stop token filter, which can be used when creating a custom analyzer (see Using the stop Token Filter). However, some out-of-the-box analyzers come with the stop filter pre-integrated: Language a..
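As a rough illustration of what a stop filter does to a token stream, the sketch below drops every token found in a stopword set. The function name remove_stopwords and the tiny stopword set are made up for this example; they are not Elasticsearch's implementation or its default English list.

```python
# Hypothetical stopword set for illustration only;
# this is NOT Elasticsearch's default English stopword list.
STOPWORDS = {"a", "an", "and", "the", "of", "is"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens that are not stopwords, preserving their order."""
    return [t for t in tokens if t.lower() not in stopwords]


tokens = "The quick and the dead".split()
print(remove_stopwords(tokens))
```

Only the tokens that carry meaning for the search survive; everything in the stopword set is discarded before indexing.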

3-5-3. Stopwords and Performance

The biggest disadvantage of keeping stopwords is performance. When Elasticsearch performs a full-text search, it has to calculate the relevance _score on all matching documents in order to return the top 10 matches. While most words typically occur in m..

3-5-4. Divide and Conquer

The terms in a query string can be divided into more-important (low-frequency) and less-important (high-frequency) terms. Documents that match only the less-important terms are probably of very little interest. What we really want are documents that match as many of the more-important terms as possible.
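The split described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function split_by_frequency, the 1% cutoff, and the document-frequency numbers are all hypothetical and are not taken from Elasticsearch.

```python
def split_by_frequency(terms, doc_freq, total_docs, cutoff=0.01):
    """Split query terms into (low_frequency, high_frequency) lists.

    A term is treated as high-frequency when it appears in more than
    `cutoff` (a fraction) of all documents. The cutoff is an assumed value.
    """
    low, high = [], []
    for term in terms:
        ratio = doc_freq.get(term, 0) / total_docs
        (high if ratio > cutoff else low).append(term)
    return low, high


# Made-up document frequencies for a hypothetical index of 1,000,000 documents.
doc_freq = {"the": 900_000, "and": 850_000, "quick": 5_000, "dead": 8_000}
low, high = split_by_frequency(["quick", "and", "the", "dead"], doc_freq, 1_000_000)
print("more important:", low)
print("less important:", high)
```

With the terms separated this way, a search can require the low-frequency terms to match and use the high-frequency terms only to adjust the score of documents that already matched.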

3-5-6. common_grams Token Filter

The common_grams token filter is designed to make phrase queries with stopwords more efficient. It is similar to the shingles token filter (see Finding Associated Words), which creates bigrams out of every pair of adjacent words. It is most easily explained by example.
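To make the "bigrams out of every pair of adjacent words" idea concrete, here is a minimal sketch of bigram generation. It illustrates only the shingle-style pairing mentioned above, not the common_grams filter itself; the function name bigrams and the underscore separator are choices made for this example.

```python
def bigrams(tokens, separator="_"):
    """Join each pair of adjacent tokens into a single bigram token."""
    return [f"{a}{separator}{b}" for a, b in zip(tokens, tokens[1:])]


print(bigrams(["the", "quick", "brown", "fox"]))
```

A list of n tokens yields n - 1 bigrams, so a single token produces none.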

3-5-7. Stopwords and Relevance

The last topic to cover before moving on from stopwords is relevance. Leaving stopwords in your index could make the relevance calculation less accurate, especially if your documents are very long. As we have already discussed in Term-frequency saturation, the reason for this is that ..

3-6. Synonyms

While stemming helps to broaden the scope of search by simplifying inflected words to their root form, synonyms broaden the scope by relating concepts and ideas. Perhaps no documents match a query for "English queen", but documents that contain "British monarch" would probably be considered a good match.

3-6-1. Using Synonyms

Synonyms can replace existing tokens or be added to the token stream by using the synonym token filter:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [ "british,english", "queen,monarch" ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [ ..

3-6-2. Formatting Synonyms

In their simplest form, synonyms are listed as comma-separated values:

    "jump,leap,hop"

If any of these terms is encountered, it is replaced by all of the listed synonyms. For instance:

    Original terms:   Replaced by:
    ────────────────────────────────
    jump            → (jump,leap,hop)
    leap            → (jump,leap,hop)
    hop             → (jump,leap..
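The replacement rule just described (any listed term is replaced by all of the listed synonyms) can be sketched as follows. This is an illustrative model only; the helper names build_expansion_map and expand are hypothetical and do not correspond to any Elasticsearch API.

```python
def build_expansion_map(rules):
    """Map every term in a comma-separated rule to the full synonym group."""
    mapping = {}
    for rule in rules:
        group = [term.strip() for term in rule.split(",")]
        for term in group:
            mapping[term] = group
    return mapping


def expand(tokens, mapping):
    """Replace each token with its synonym group; unknown tokens stay as-is."""
    return [mapping.get(token, [token]) for token in tokens]


mapping = build_expansion_map(["jump,leap,hop"])
for group in expand(["leap", "high"], mapping):
    print(group)
```

Each position in the token stream ends up holding every synonym in the group, which is why a search for any one of the terms can match a document containing any other.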

3-6-3. Expand or contract

In Formatting Synonyms, we saw that synonyms can be replaced by simple expansion, simple contraction, or generic expansion. This section looks at the trade-offs of each technique. This section deals with si..