Stopwords (8)

3-5-1. Pros and Cons of Stopwords

We have more disk space, more RAM, and better compression algorithms than we did in the past. Excluding the preceding 33 common words from the index will save only about 4MB per million documents. Reducing index size is no longer a valid reason to use stopwords. (However, there is one caveat to this statement, which we discuss in Stopwords and Phrase Queries.)

3-5-2. Using Stopwords

Stopword removal is handled by the stop token filter, which can be used when creating a custom analyzer (see Using the stop Token Filter). However, some out-of-the-box analyzers come with the stop filter pre-integrated: Language a..
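As a rough illustration of the custom-analyzer setup described above, the sketch below builds an index-settings body as a Python dict and approximates locally what the stop filter does. The names my_stop and my_analyzer, the three-word stopword list, and the helper simulate_stop_filter are all hypothetical; the local helper splits on whitespace only and is not the real analysis chain.

```python
import json

# Hypothetical names: "my_stop" and "my_analyzer" are illustrative only.
# The dict mirrors the shape of an index-creation settings body.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "my_stop": {
                    "type": "stop",
                    "stopwords": ["and", "is", "the"],
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "my_stop"],
                }
            },
        }
    }
}


def simulate_stop_filter(text, stopwords):
    """Rough local approximation: lowercase, split on whitespace, drop stopwords."""
    return [token for token in text.lower().split() if token not in stopwords]


stopwords = set(
    index_settings["settings"]["analysis"]["filter"]["my_stop"]["stopwords"]
)

# Print the body that would be sent when creating the index.
print(json.dumps(index_settings, indent=2))

# Show which tokens survive the (simulated) stop filter.
print(simulate_stop_filter("The quick and the dead", stopwords))
```

The lowercase filter is listed before the stop filter so that "The" and "the" are treated the same way, which is why the helper lowercases first as well.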

3-5-3. Stopwords and Performance

The biggest disadvantage of keeping stopwords is performance. When Elasticsearch performs a full-text search, it has to calculate the relevance _score on all matching documents in order to return the top 10 matches. While most words typically occur in m..

3-5-4. Divide and Conquer

The terms in a query string can be divided into more-important (low-frequency) and less-important (high-frequency) terms. Documents that match only the less important terms are probably of very little interest. What we really want are documents that match as many of the more important terms as possible.
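The split into low-frequency and high-frequency terms can be sketched in a few lines. This is only an illustration of the idea, not how Elasticsearch implements it: the function split_by_frequency, the 0.01 cutoff, and the document-frequency figures are all made up for the example.

```python
def split_by_frequency(terms, doc_freq, total_docs, cutoff=0.01):
    """Split terms into (low_frequency, high_frequency) lists.

    A term counts as high-frequency when the fraction of documents
    containing it exceeds `cutoff` (an assumed threshold).
    """
    low, high = [], []
    for term in terms:
        ratio = doc_freq.get(term, 0) / total_docs
        if ratio > cutoff:
            high.append(term)
        else:
            low.append(term)
    return low, high


# Made-up document frequencies for an index of one million documents.
doc_freq = {"the": 950_000, "quick": 4_000, "and": 900_000, "dead": 2_500}

low, high = split_by_frequency(
    ["the", "quick", "and", "dead"], doc_freq, total_docs=1_000_000
)
print(low)   # the more important terms
print(high)  # the less important terms
```

With these numbers, "quick" and "dead" fall into the low-frequency group that should drive matching, while "the" and "and" fall into the high-frequency group.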

3-5-6. common_grams Token Filter

The common_grams token filter is designed to make phrase queries with stopwords more efficient. It is similar to the shingles token filter (see Finding Associated Words), which creates bigrams out of every pair of adjacent words. It is most easily explained by example.
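A small local simulation may help here. The function below is a simplified approximation of the filter's index-time output, assuming whitespace-split, already-lowercased tokens and ignoring token positions: every token is kept, and an underscore-joined bigram is added whenever either word of an adjacent pair is a common word. The function name and the common-word set are chosen for illustration.

```python
def common_grams(tokens, common_words):
    """Emit each token, plus a bigram for any adjacent pair that
    contains at least one common word (simplified; positions ignored)."""
    output = []
    for i, token in enumerate(tokens):
        output.append(token)
        if i + 1 < len(tokens):
            next_token = tokens[i + 1]
            if token in common_words or next_token in common_words:
                output.append(f"{token}_{next_token}")
    return output


tokens = "the quick and brown fox".split()
print(common_grams(tokens, {"the", "and"}))
# ['the', 'the_quick', 'quick', 'quick_and', 'and', 'and_brown', 'brown', 'fox']
```

Unlike shingles, no bigram is produced for "brown fox", because neither word is in the common-word set.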

3-5-7. Stopwords and Relevance

The last topic to cover before moving on from stopwords is relevance. Leaving stopwords in your index could make the relevance calculation less accurate, especially if your documents are very long. As we have already discussed in Term-frequency saturation, the reason for this is that ..