2.x / 3. Dealing with Human Language

3-2-2. standard Tokenizer

A tokenizer accepts a string as input, processes the string to break it into individual words, or tokens (perhaps discarding some characters like punctuation), and emits a token stream as output. What is interesting is the algorithm that is used to identify words. The whitespace tokenizer simply breaks on whitespace…
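The difference between the two approaches can be sketched in Python. This is only an illustration of the idea, not the real implementation: the actual standard tokenizer implements the Unicode Text Segmentation algorithm, which is far more nuanced than the regular expression used here.

```python
import re

text = "You're the 1st runner home!"

# A naive whitespace tokenizer: splits only on spaces, so trailing
# punctuation stays attached to the token.
whitespace_tokens = text.split()

# A rough word-oriented tokenizer: keep runs of word characters and
# apostrophes, discard other punctuation (illustrative only).
word_tokens = re.findall(r"[\w']+", text)

print(whitespace_tokens)  # ["You're", 'the', '1st', 'runner', 'home!']
print(word_tokens)        # ["You're", 'the', '1st', 'runner', 'home']
```

Note how the whitespace approach emits `home!` with the exclamation mark attached, which would never match a search for `home`.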

3-2-5. Tidying Up Input Text

Tokenizers produce the best results when the input text is clean, valid text, where valid means that it follows the punctuation rules that the Unicode algorithm expects. Quite often, though, the text we need to process is anything but clean. Cleaning it up before tokenization improves the quality of the output…
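Elasticsearch does this pre-tokenization cleanup with character filters such as `html_strip`, which removes HTML tags and decodes HTML entities before the tokenizer runs. A crude stand-in for that behavior, as a sketch only, can be written with the standard library:

```python
import re
from html import unescape

def strip_html(raw: str) -> str:
    """Crude stand-in for the `html_strip` character filter:
    remove tags, then decode HTML entities, before tokenization."""
    no_tags = re.sub(r"<[^>]+>", "", raw)
    return unescape(no_tags)

print(strip_html("<p>Some d&eacute;j&agrave; vu <a href='#'>website</a></p>"))
# Some déjà vu website
```

Without this step the tokenizer would see `&eacute;` and the tag contents as text, producing garbage tokens.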

3-3-1. In That Case

The most frequently used token filter is the lowercase filter, which does exactly what you would expect; it transforms each token into its lowercase form:

GET /_analyze?tokenizer=standard&filters=lowercase
The QUICK Brown FOX!

This emits the tokens: the, quick, brown, fox. It doesn't matter whether users search for fox or…

3-3-2. You Have an Accent

English uses diacritics (like ´, ^, and ¨) only for imported words, like rôle, déjà, and däis, and even then they are usually optional. Other languages require diacritics in order to be correct. Of course, just because words are spelled correctly in your index doesn't mean that the user will search for the correct spelling…
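This is the problem the `asciifolding` token filter solves: it flattens accented characters to their plain ASCII equivalents so that `déjà` matches a search for `deja`. The core idea can be approximated with Unicode decomposition; note that the real filter handles many more cases (ligatures, symbols, and so on) than this sketch:

```python
import unicodedata

def ascii_fold(token: str) -> str:
    # Approximation of the `asciifolding` token filter: decompose each
    # character (NFD splits "é" into "e" + combining acute accent),
    # then drop the combining diacritical marks.
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

for word in ["rôle", "déjà", "däis"]:
    print(word, "->", ascii_fold(word))
# rôle -> role
# déjà -> deja
# däis -> dais
```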

3-3-3. Living in a Unicode World

When Elasticsearch compares one token with another, it does so at the byte level. In other words, for two tokens to be considered the same, they need to consist of exactly the same bytes. Unicode, however, allows you to write the same letter in different ways…
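For example, "é" can be a single precomposed code point or the letter "e" followed by a combining accent. The two forms render identically but have different bytes, which is why normalization (e.g., with the ICU plugin's `icu_normalizer`) matters. A small demonstration:

```python
import unicodedata

# The same word written two ways:
composed = "caf\u00e9"     # café with the single code point U+00E9
decomposed = "cafe\u0301"  # café as "e" + combining acute accent U+0301

# The strings look identical on screen, but their bytes differ, so a
# byte-level comparison (as Elasticsearch does) treats them as different.
print(composed == decomposed)                        # False
print(composed.encode("utf-8"))                      # b'caf\xc3\xa9'
print(decomposed.encode("utf-8"))                    # b'cafe\xcc\x81'

# Normalizing both to the same form (here NFC) makes them comparable.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```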

3-3-4. Unicode Case Folding

Humans are nothing if not inventive, and human language reflects that. Changing the case of a word seems like such a simple task, until you have to deal with multiple languages. Take, for example, the lowercase German letter ß. Converting that to uppercase gives you SS, which converted back to lowerca…
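Python's string methods expose exactly this round-trip problem, and also the Unicode case-folding answer to it, which `str.casefold()` implements:

```python
# Uppercasing the German letter ß gives "SS"; lowercasing that gives
# "ss", so the round trip does not return the original character.
print("ß".upper())          # SS
print("ß".upper().lower())  # ss

# Unicode case folding maps ß directly to "ss", so "weiß" and "WEISS"
# compare equal after folding, regardless of how they were written.
print("weiß".casefold())    # weiss
print("WEISS".casefold())   # weiss
```

Case folding produces tokens that may not be "correct" spellings, but since the same folding is applied at index time and at search time, the tokens still match.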

3-3-5. Unicode Character Folding

In the same way as the lowercase token filter is a good starting point for many languages but falls short when exposed to the entire tower of Babel, so the asciifolding token filter requires a more effective Unicode character-folding counterpart for dealing with the many languages of the world…
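That counterpart in Elasticsearch is the ICU plugin's `icu_folding` filter. As a rough sketch of what character folding does, assuming only the standard library rather than ICU itself, one can combine compatibility decomposition, diacritic stripping, and case folding:

```python
import unicodedata

def character_fold(text: str) -> str:
    # Rough stand-in for ICU character folding (`icu_folding`):
    # compatibility-decompose (NFKD), drop combining marks, case-fold.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return stripped.casefold()

print(character_fold("Ⅻ"))        # the Roman numeral sign -> xii
print(character_fold("ΚΑΛΗΜΈΡΑ"))  # Greek, with accent -> καλημερα
```

Unlike `asciifolding`, character folding works across all scripts, not just Latin: Greek, Cyrillic, and other alphabets keep their letters while losing case and diacritic distinctions.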