Posts tagged '불용어' (stopwords)

Stopwords 8

3-5. Stopwords: Performance Versus Precision

Back in the early days of information retrieval, disk space and memory were limited to a tiny fraction of what we are accustomed to today. It was essential to make your index as small as possible. Every kilobyte saved meant a significant improvement in performance. Stemming (see Reducing Words to Their Root Form) was important, not just for making searches broader and increasing retrieval in the..

2.X/3. Dealing with Human Language 2017.09.24

3-5-1. Pros and Cons of Stopwords

We have more disk space, more RAM, and better compression algorithms than existed back in the day. Excluding the preceding 33 common words from the index will save only about 4MB per million documents. Using stopwords for the sake of reducing index size is no longer a valid reason. (However, there is one caveat to this statement, which we discuss in Stopwords and Phrase Queries.) ..

2.X/3. Dealing with Human Language 2017.09.24

3-5-2. Using Stopwords

The removal of stopwords is handled by the stop token filter, which can be used when creating a custom analyzer (see Using the stop Token Filter). However, some out-of-the-box analyzers come with the stop filter pre-integrated: Language a..

2.X/3. Dealing with Human Language 2017.09.24
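As a rough illustration of the custom-analyzer route mentioned in the excerpt above, here is a minimal sketch of index settings written as a Python dict. The names `my_stop` and `my_analyzer` and the three-word stopword list are made up for this example; only the `stop` filter type, the `custom` analyzer type, and the `standard` tokenizer and `lowercase` filter come from Elasticsearch itself.

```python
import json

# Hypothetical index settings: a custom analyzer that lowercases tokens
# and then removes a small, illustrative list of stopwords.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "my_stop": {                      # assumed filter name
                    "type": "stop",
                    "stopwords": ["and", "is", "the"],
                }
            },
            "analyzer": {
                "my_analyzer": {                  # assumed analyzer name
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "my_stop"],
                }
            },
        }
    }
}

# The JSON body that would be sent when creating the index.
request_body = json.dumps(index_settings, indent=2)
print(request_body)
```

Placing `lowercase` before the stop filter means "The" and "the" are treated alike, which is usually what you want with a lowercase stopword list.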

3-5-3. Stopwords and Performance

The biggest disadvantage of keeping stopwords is that of performance. When Elasticsearch performs a full-text search, it has to calculate the relevance _score on all matching documents in order to return the top 10 matches. While most words typically occur in m..

2.X/3. Dealing with Human Language 2017.09.24

3-5-4. Divide and Conquer

The terms in a query string can be divided into more-important (low-frequency) and less-important (high-frequency) terms. Documents that match only the less important terms are probably of very little interest. Really, we want documents that match as many of the more important terms as possible. ..

2.X/3. Dealing with Human Language 2017.09.24
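The split described in the excerpt above can be pictured with a small sketch in plain Python. It is not how Elasticsearch implements it; the function name `split_by_frequency`, the 1% cutoff, and the document-frequency numbers are all invented for illustration.

```python
def split_by_frequency(terms, doc_freq, total_docs, cutoff=0.01):
    """Split query terms into (low_frequency, high_frequency) groups.

    A term counts as high-frequency when the fraction of documents
    containing it is above `cutoff` (an assumed threshold).
    """
    low, high = [], []
    for term in terms:
        ratio = doc_freq.get(term, 0) / total_docs
        if ratio > cutoff:
            high.append(term)
        else:
            low.append(term)
    return low, high


# Made-up document frequencies for a hypothetical index of 1,000,000 docs.
doc_freq = {"quick": 1200, "and": 890000, "the": 950000, "dead": 3100}
low, high = split_by_frequency(
    ["quick", "and", "the", "dead"], doc_freq, total_docs=1_000_000
)
print(low)   # terms that should decide which documents match
print(high)  # terms that would only contribute to scoring
```

With this split, only the low-frequency terms need to select candidate documents, and the high-frequency terms can be consulted afterwards for scoring.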

3-5-5. Stopwords and Phrase Queries

About 5% of all queries are phrase queries (see Phrase Matching), but they often account for the majority of slow queries. Phrase queries can perform poorly, especially if the phrase includes very common words; a phrase like "To be, or not to be" could be considered pathological. The reason for this has to do with the amount of data that is necessary to support proximity matching. ..

2.X/3. Dealing with Human Language 2017.09.24
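To see why proximity matching needs extra data, here is a toy positional index in plain Python. It is a simplified sketch, not Lucene's data structure; `build_positional_index`, `phrase_match`, and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> doc_id -> list of positions where the term occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def phrase_match(index, phrase):
    """Return sorted doc ids where the terms appear at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Only documents containing every term can match the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        for start in index[terms[0]][doc_id]:
            if all(start + off in index[t][doc_id]
                   for off, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits


docs = {
    1: "to be or not to be",
    2: "the quick brown fox",
    3: "not to be trifled with",
}
index = build_positional_index(docs)
print(phrase_match(index, "to be"))
print(index["to"][1])  # every occurrence of a common word must be stored
```

Every position of every term has to be stored and then compared, so very common words, which occur in most documents and many times per document, make phrase queries disproportionately expensive.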

3-5-6. common_grams Token Filter

The common_grams token filter is designed to make phrase queries with stopwords more efficient. It is similar to the shingles token filter (see Finding Associated Words), which creates bigrams out of every pair of adjacent words. It is most easily explained by example. ..

2.X/3. Dealing with Human Language 2017.09.24
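The idea can be approximated in a few lines of plain Python. This is a simplified simulation that ignores token positions and the filter's other options, not the actual Lucene filter; the function name `common_grams` and the common-word list are assumptions for this example.

```python
def common_grams(tokens, common_words):
    """Emit every unigram, plus a bigram for each adjacent pair in which
    at least one of the two words is a common word."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if tok in common_words or nxt in common_words:
                out.append(f"{tok}_{nxt}")
    return out


tokens = ["the", "quick", "and", "brown", "fox"]
print(common_grams(tokens, common_words={"the", "and"}))
```

Unlike shingles, no bigram is produced for "brown fox", because neither word is in the common-word list. A bigram such as `quick_and` is far rarer than `and` on its own, which is what makes phrase lookups involving stopwords cheaper.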

3-5-7. Stopwords and Relevance

The last topic to cover before moving on from stopwords is that of relevance. Leaving stopwords in your index could make the relevance calculation less accurate, especially if your documents are very long. As we have already discussed in Term-frequency saturation, the reason for this is that ..

2.X/3. Dealing with Human Language 2017.09.24

elasticsearch, definitive guide
