Posts in the '2.X/3. Dealing with Human Language' category (Page 3)

3-3-6. Sorting and Collations

So far in this chapter, we have looked at how to normalize tokens for the purposes of search. The final use case to consider in this chapter is that of string sorting. In String Sorting and Multifields, we explained that Elasticsearch cannot sort on an analyzed string field, and demonstrated how to use multifields to ..

2.X/3. Dealing with Human Language 2017.09.24

3-4. Reducing Words to Their Root Form

Most languages of the world are inflected, meaning that words can change their form to express differences in the following:

Number: fox, foxes
Tense: pay, paid, paying
Gender: waiter, waitress
Person: hear, hears
Case: I, me, my
Aspect: ate, eaten
Mood: so be it, were it so

While inflection aids expressivity, it interferes with retrie..

2.X/3. Dealing with Human Language 2017.09.24

3-4-1. Algorithmic Stemmers

Most of the stemmers available in Elasticsearch are algorithmic in that they apply a series of rules to a word in order to reduce it to its root form, such as stripping the final s or es from plurals. They don't have to know anything about individual words in order to stem them. ..

2.X/3. Dealing with Human Language 2017.09.24
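
The rule-based idea in the Algorithmic Stemmers excerpt can be illustrated with a minimal sketch. This is not the code of any Elasticsearch or Lucene stemmer; the function name `strip_plural` and its two rules are assumptions chosen only to show how a stemmer can work from suffix rules without knowing any individual word.

```python
def strip_plural(word: str) -> str:
    """Toy rule-based stemmer (hypothetical): strips plural suffixes by rule only."""
    w = word.lower()
    # Rule 1: drop "es" after a sibilant ending (foxes -> fox, dishes -> dish).
    if len(w) > 4 and w.endswith(("xes", "ches", "shes", "sses", "zes")):
        return w[:-2]
    # Rule 2: drop a final "s", unless the word ends in "ss" or "us".
    if len(w) > 3 and w.endswith("s") and not w.endswith(("ss", "us")):
        return w[:-1]
    # No rule matched: return the lowercased word unchanged.
    return w


for word in ("foxes", "cats", "glass", "skies"):
    print(word, "->", strip_plural(word))
```

Because the rules know nothing about vocabulary, `skies` becomes `skie` rather than `sky`. Real algorithmic stemmers use far larger rule sets, but they remain rule-driven in the same way.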

3-4-2. Dictionary Stemmers

Dictionary stemmers work quite differently from algorithmic stemmers. Instead of applying a standard set of rules to each word, they simply look up the word in the dictionary. Theoretically, they could produce much better results than an algorithmic stemmer. A dictionary stemmer should be able to do the following: ..

2.X/3. Dealing with Human Language 2017.09.24
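
The lookup approach in the Dictionary Stemmers excerpt can be sketched as follows. The `ROOTS` table and the function `dictionary_stem` are hypothetical; a real dictionary stemmer relies on a full dictionary rather than a hand-written mapping.

```python
# Hypothetical mini-dictionary mapping inflected forms to their root forms.
ROOTS = {
    "feet": "foot",
    "mice": "mouse",
    "paid": "pay",
    "foxes": "fox",
}


def dictionary_stem(word: str) -> str:
    """Look the word up; fall back to the lowercased word if it is not listed."""
    key = word.lower()
    return ROOTS.get(key, key)


for word in ("feet", "Paid", "organization"):
    print(word, "->", dictionary_stem(word))
```

Irregular forms such as `feet` resolve correctly because they are listed, while a word missing from the dictionary passes through unstemmed. The quality of the result therefore depends entirely on the dictionary.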

3-4-3. Hunspell Stemmer

Elasticsearch provides dictionary-based stemming via the hunspell token filter. Hunspell (hunspell.github.io) is the spell checker used by OpenOffice, LibreOffice, Chrome, Firefox, Thunderbird, and many other open and closed source projects. ..

2.X/3. Dealing with Human Language 2017.09.24
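
As a rough sketch of how the hunspell token filter from the excerpt above might be wired into a custom analyzer, the snippet below builds an index settings body as a Python dictionary. The filter and analyzer name `en_US` is illustrative, and the sketch assumes that a Hunspell dictionary for that locale is already installed on the node. Check the parameter names against the documentation for your Elasticsearch version.

```python
import json

# Assumed settings body: a hunspell token filter plus an analyzer that uses it.
hunspell_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Illustrative filter name; "language" selects the dictionary locale.
                "en_US": {"type": "hunspell", "language": "en_US"}
            },
            "analyzer": {
                # Lowercase first so that lookups match lowercase dictionary entries.
                "en_US": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "en_US"],
                }
            },
        }
    }
}

# Serialize to JSON, for example to send as the body of an index-creation request.
print(json.dumps(hunspell_settings, indent=2))
```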

3-4-4. Choosing a Stemmer

The documentation for the stemmer token filter lists multiple stemmers for some languages. For English we have the following:

english: The porter_stem token filter.
light_english: The kstem token filter.
minimal_english: The EnglishMinimalStemmer in Lucene, which removes plurals.
lovins: Th..

2.X/3. Dealing with Human Language 2017.09.24

3-4-5. Controlling Stemming

Out-of-the-box stemming solutions are never perfect. Algorithmic stemmers, especially, will blithely apply their rules to any words they encounter, perhaps conflating words that you would prefer to keep separate. Maybe, for your use case, it is important to keep "skies" and "skiing" as distinct words rather than stemming them both down to "ski" (as would happen with the english analyzer). ..

2.X/3. Dealing with Human Language 2017.09.24

3-4-6. Stemming in situ

For the sake of completeness, we will finish this chapter by explaining how to index stemmed words into the same field as unstemmed words. As an example, analyzing the sentence "The quick foxes jumped" would produce the following terms: Pos 1: (the) Pos..

2.X/3. Dealing with Human Language 2017.09.24
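
The idea in the Stemming in situ excerpt, keeping the original and the stemmed term at the same position, can be sketched in plain Python. This does not use Elasticsearch; `toy_stem` is a hypothetical stand-in for a real stemmer, and `analyze_in_situ` is an illustrative name.

```python
def toy_stem(token: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a final 'es' or 'ed'."""
    for suffix in ("es", "ed"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze_in_situ(text: str) -> list[tuple[str, ...]]:
    """Return one tuple of terms per token position: (original[, stem])."""
    positions = []
    for token in text.lower().split():
        stem = toy_stem(token)
        # Emit the stem alongside the original only when it actually differs.
        positions.append((token,) if stem == token else (token, stem))
    return positions


for pos, terms in enumerate(analyze_in_situ("The quick foxes jumped"), start=1):
    print(f"Pos {pos}: {terms}")
```

Because both forms share a position, a query can match either the exact word or its stem in the same field, and phrase positions are preserved.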

3-5. Stopwords: Performance Versus Precision

Back in the early days of information retrieval, disk space and memory were limited to a tiny fraction of what we are accustomed to today. It was essential to make your index as small as possible. Every kilobyte saved meant a significant improvement in performance. Stemming (see Reducing Words to Their Root Form) was important, not just for making searches broader and increasing retrieval in the..

2.X/3. Dealing with Human Language 2017.09.24

3-5-1. Pros and Cons of Stopwords

We have more disk space, more RAM, and better compression algorithms than existed back in the day. Excluding the preceding 33 common words from the index will save only about 4MB per million documents. Using stopwords for the sake of reducing index size is no longer a valid reason. (However, there is one caveat to this statement, which we discuss in Stopwords and Phrase Queries.) ..

2.X/3. Dealing with Human Language 2017.09.24
