2.x / 3. Dealing with Human Language

3-2-2. standard Tokenizer

A tokenizer accepts a string as input, processes the string to break it into individual words, or tokens (perhaps discarding some characters like punctuation), and emits a token stream as output. What is interesting is the algorithm that is used to identify words. The whitespace tokenizer simply breaks on whitespace…
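The difference between the two approaches can be sketched in Python. This is only an illustration of the idea, not the real implementation: the actual standard tokenizer implements the Unicode Text Segmentation algorithm, which is far more nuanced than the regular expression used here.

```python
import re

text = "You're the 1st runner home!"

# A naive whitespace tokenizer: splits only on spaces, so trailing
# punctuation stays attached to the token.
whitespace_tokens = text.split()

# A rough word-oriented tokenizer: keep runs of word characters and
# apostrophes, discard other punctuation (illustrative only).
word_tokens = re.findall(r"[\w']+", text)

print(whitespace_tokens)  # ["You're", 'the', '1st', 'runner', 'home!']
print(word_tokens)        # ["You're", 'the', '1st', 'runner', 'home']
```

Note how the whitespace approach emits `home!` with the exclamation mark attached, which would never match a search for `home`.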

3-2-5. Tidying Up Input Text

Tokenizers produce the best results when the input text is clean, valid text, where valid means that it follows the punctuation rules that the Unicode algorithm expects. Quite often, though, the text we need to process is anything but clean. Cleaning it up before tokenization improves the quality of the output…
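Elasticsearch does this pre-tokenization cleanup with character filters such as `html_strip`, which removes HTML tags and decodes HTML entities before the tokenizer runs. A crude stand-in for that behavior, as a sketch only, can be written with the standard library:

```python
import re
from html import unescape

def strip_html(raw: str) -> str:
    """Crude stand-in for the `html_strip` character filter:
    remove tags, then decode HTML entities, before tokenization."""
    no_tags = re.sub(r"<[^>]+>", "", raw)
    return unescape(no_tags)

print(strip_html("<p>Some d&eacute;j&agrave; vu <a href='#'>website</a></p>"))
# Some déjà vu website
```

Without this step the tokenizer would see `&eacute;` and the tag contents as text, producing garbage tokens.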

3-3-1. In That Case

The most frequently used token filter is the lowercase filter, which does exactly what you would expect; it transforms each token into its lowercase form:

GET /_analyze?tokenizer=standard&filters=lowercase
The QUICK Brown FOX!

This emits the tokens: the, quick, brown, fox. It doesn't matter whether users search for fox or…

3-3-2. You Have an Accent

English uses diacritics (like ´, ^, and ¨) only for imported words, like rôle, déjà, and däis, and even then they are usually optional. Other languages require diacritics in order to be correct. Of course, just because words are spelled correctly in your index doesn't mean that the user will search for the correct spelling…
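This is the problem the `asciifolding` token filter solves: it flattens accented characters to their plain ASCII equivalents so that `déjà` matches a search for `deja`. The core idea can be approximated with Unicode decomposition; note that the real filter handles many more cases (ligatures, symbols, and so on) than this sketch:

```python
import unicodedata

def ascii_fold(token: str) -> str:
    # Approximation of the `asciifolding` token filter: decompose each
    # character (NFD splits "é" into "e" + combining acute accent),
    # then drop the combining diacritical marks.
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

for word in ["rôle", "déjà", "däis"]:
    print(word, "->", ascii_fold(word))
# rôle -> role
# déjà -> deja
# däis -> dais
```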

3-3-3. Living in a Unicode World

When Elasticsearch compares one token with another, it does so at the byte level. In other words, for two tokens to be considered the same, they need to consist of exactly the same bytes. Unicode, however, allows you to write the same letter in different ways…
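For example, "é" can be a single precomposed code point or the letter "e" followed by a combining accent. The two forms render identically but have different bytes, which is why normalization (e.g., with the ICU plugin's `icu_normalizer`) matters. A small demonstration:

```python
import unicodedata

# The same word written two ways:
composed = "caf\u00e9"     # café with the single code point U+00E9
decomposed = "cafe\u0301"  # café as "e" + combining acute accent U+0301

# The strings look identical on screen, but their bytes differ, so a
# byte-level comparison (as Elasticsearch does) treats them as different.
print(composed == decomposed)                        # False
print(composed.encode("utf-8"))                      # b'caf\xc3\xa9'
print(decomposed.encode("utf-8"))                    # b'cafe\xcc\x81'

# Normalizing both to the same form (here NFC) makes them comparable.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```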

3-3-4. Unicode Case Folding

Humans are nothing if not inventive, and human language reflects that. Changing the case of a word seems like such a simple task, until you have to deal with multiple languages. Take, for example, the lowercase German letter ß. Converting that to uppercase gives you SS, which converted back to lowerca…
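Python's string methods expose exactly this round-trip problem, and also the Unicode case-folding answer to it, which `str.casefold()` implements:

```python
# Uppercasing the German letter ß gives "SS"; lowercasing that gives
# "ss", so the round trip does not return the original character.
print("ß".upper())          # SS
print("ß".upper().lower())  # ss

# Unicode case folding maps ß directly to "ss", so "weiß" and "WEISS"
# compare equal after folding, regardless of how they were written.
print("weiß".casefold())    # weiss
print("WEISS".casefold())   # weiss
```

Case folding produces tokens that may not be "correct" spellings, but since the same folding is applied at index time and at search time, the tokens still match.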

3-3-5. Unicode Character Folding

In the same way as the lowercase token filter is a good starting point for many languages but falls short when exposed to the entire tower of Babel, so the asciifolding token filter requires a more effective Unicode character-folding counterpart for dealing with the many languages of the world…
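That counterpart in Elasticsearch is the ICU plugin's `icu_folding` filter. As a rough sketch of what character folding does, assuming only the standard library rather than ICU itself, one can combine compatibility decomposition, diacritic stripping, and case folding:

```python
import unicodedata

def character_fold(text: str) -> str:
    # Rough stand-in for ICU character folding (`icu_folding`):
    # compatibility-decompose (NFKD), drop combining marks, case-fold.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return stripped.casefold()

print(character_fold("Ⅻ"))        # the Roman numeral sign -> xii
print(character_fold("ΚΑΛΗΜΈΡΑ"))  # Greek, with accent -> καλημερα
```

Unlike `asciifolding`, character folding works across all scripts, not just Latin: Greek, Cyrillic, and other alphabets keep their letters while losing case and diacritic distinctions.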