3-3-3. Living in a Unicode World

drscg 2017. 9. 24. 17:21

When Elasticsearch compares one token with another, it does so at the byte level. In other words, for two tokens to be considered the same, they need to consist of exactly the same bytes. Unicode, however, allows you to write the same letter in different ways.

For instance, what's the difference between é (U+00E9) and é (U+0065 followed by U+0301)? It depends on whom you ask. According to Elasticsearch, which sees the UTF-8 bytes, the first one consists of the two bytes 0xC3 0xA9, and the second one consists of the three bytes 0x65 0xCC 0x81.
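The byte sequences quoted above can be checked directly. The following is a minimal Python sketch, not part of Elasticsearch; the variable names are chosen only for this illustration.

```python
# Two spellings of the same visible letter.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

precomposed_bytes = precomposed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

print(precomposed_bytes.hex(" "))              # c3 a9
print(decomposed_bytes.hex(" "))               # 65 cc 81
print(precomposed_bytes == decomposed_bytes)   # False: a byte-level comparison fails
```

Both strings render as the same letter, yet a byte-for-byte comparison treats them as different tokens.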

According to Unicode, the difference in how they are represented as bytes is irrelevant: the two are canonically equivalent, that is, the same letter. The first one is the single precomposed letter é, while the second is a plain e followed by a combining acute accent ´.

If you get your data from more than one source, you may end up with the same letters encoded in different ways, which may result in one form of déjà not matching another!

Fortunately, a solution is at hand. There are four Unicode normalization forms (nfc, nfd, nfkc, and nfkd), each of which converts Unicode characters into a standard form, making all characters comparable at the byte level.
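To see that any of the four forms makes the two spellings comparable, here is a small sketch using Python's standard unicodedata module as a stand-in for what a normalizer does. The helper same_after is a name invented for this example.

```python
import unicodedata

precomposed = "d\u00e9j\u00e0"     # "déjà" written with precomposed letters
decomposed = "de\u0301ja\u0300"    # "déjà" written with combining accents


def same_after(form, a, b):
    """Return True if a and b are byte-identical once normalized to `form`."""
    return (unicodedata.normalize(form, a).encode("utf-8")
            == unicodedata.normalize(form, b).encode("utf-8"))


results = {form: same_after(form, precomposed, decomposed)
           for form in ("NFC", "NFD", "NFKC", "NFKD")}

print(precomposed == decomposed)   # False before normalization
print(results)                     # every form reports True
```

The forms disagree about what the normalized bytes look like, but each one maps both inputs to the same bytes, which is all that token matching needs.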

Unicode normalization forms

The composed forms, nfc and nfkc, use precomposed characters wherever they exist and so represent text in the fewest bytes possible. So é is represented as the single letter é. The decomposed forms, nfd and nfkd, represent characters by their constituent parts, that is, e + ´.
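The difference between composed and decomposed output is easy to inspect. This is an illustrative sketch with Python's unicodedata module, assuming its NFC and NFD forms correspond to the nfc and nfd forms named here.

```python
import unicodedata

composed = unicodedata.normalize("NFC", "e\u0301")   # start from e + combining accent
decomposed = unicodedata.normalize("NFD", "\u00e9")  # start from the precomposed letter

# Composed form: one code point.
print(len(composed), [unicodedata.name(c) for c in composed])
# 1 ['LATIN SMALL LETTER E WITH ACUTE']

# Decomposed form: base letter followed by the combining mark.
print(len(decomposed), [unicodedata.name(c) for c in decomposed])
# 2 ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']
```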

The canonical forms, nfc and nfd, leave a ligature like ﬃ as a single character, while the compatibility forms, nfkc and nfkd, break it down into a simpler multiletter equivalent: f + f + i. This applies only to characters for which Unicode defines a compatibility mapping. The letter œ, for example, has none, so all four forms leave it unchanged rather than turning it into o + e.
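The effect of the canonical and compatibility forms on a ligature can be sketched as follows, again with Python's unicodedata module standing in for the normalizer. The sample word is made up for the illustration.

```python
import unicodedata

with_ligature = "o\ufb03ce"   # "office" typed with the single-character ffi ligature (U+FB03)
plain = "office"

forms = ("NFC", "NFD", "NFKC", "NFKD")
normalized = {form: unicodedata.normalize(form, with_ligature) for form in forms}
matches = {form: normalized[form] == unicodedata.normalize(form, plain) for form in forms}

for form in forms:
    # Canonical forms keep the ligature (4 code points); compatibility forms expand it (6).
    print(form, len(normalized[form]), matches[form])
```

Only the compatibility forms make the ligature spelling equal to the plain spelling.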

It doesn't really matter which normalization form you choose, as long as all your text is in the same form. That way, the same tokens consist of the same bytes. That said, only the compatibility forms allow you to compare ligatures like ﬃ with their simpler representation, ffi.

You can use the icu_normalizer token filter to ensure that all of your tokens are in the same form:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "nfkc_normalizer": { 
          "type": "icu_normalizer",
          "name": "nfkc"
        }
      },
      "analyzer": {
        "my_normalizer": {
          "tokenizer": "icu_tokenizer",
          "filter":  [ "nfkc_normalizer" ]
        }
      }
    }
  }
}

The nfkc_normalizer filter normalizes all tokens into the nfkc normalization form.

Besides the icu_normalizer token filter mentioned previously, there is also an icu_normalizer character filter, which does the same job as the token filter but does so before the text reaches the tokenizer. When using the standard tokenizer or icu_tokenizer, this doesn't really matter: these tokenizers know how to deal with all forms of Unicode correctly.

However, if you plan on using a different tokenizer, such as the ngram, edge_ngram, or pattern tokenizers, it would make sense to use the icu_normalizer character filter in preference to the token filter.
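Why the order matters for an n-gram style tokenizer can be sketched in plain Python. This is an emulation under simplifying assumptions, not the behavior of Elasticsearch's own ngram tokenizer: char_ngrams is a hypothetical helper that slices a string into fixed-size grams of code points.

```python
import unicodedata


def char_ngrams(text, n=2):
    """Slice text into overlapping grams of n code points (hypothetical helper)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


decomposed = "de\u0301ja\u0300"   # "déjà" with combining accents

# Token-filter order: split first, then normalize each gram.
grams_then_normalize = [unicodedata.normalize("NFKC", g)
                        for g in char_ngrams(decomposed)]

# Character-filter order: normalize the whole text first, then split.
normalize_then_grams = char_ngrams(unicodedata.normalize("NFKC", decomposed))

print(grams_then_normalize)   # 5 grams, one of which starts with a stray accent
print(normalize_then_grams)   # 3 grams over the letters d, é, j, à
```

Splitting first can separate a base letter from its combining mark, and normalizing the resulting grams afterward cannot repair that. Normalizing before tokenizing avoids the problem.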

Usually, though, you will want to not only normalize the byte representation of tokens, but also lowercase them. This can be done with icu_normalizer, using the custom normalization form nfkc_cf, which we discuss in the next section.
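As a rough preview of what combining normalization with lowercasing looks like, here is a Python approximation. It is an assumption-laden sketch rather than the actual nfkc_cf implementation: normalize_and_fold is a name invented here, and it chains NFKC with Python's str.casefold, which may differ from ICU's nfkc_cf in edge cases (for example, in how ignorable code points are handled).

```python
import unicodedata


def normalize_and_fold(token):
    """Approximate normalize-plus-lowercase: NFKC, case fold, then NFKC again."""
    folded = unicodedata.normalize("NFKC", token).casefold()
    # Case folding can produce sequences that are no longer normalized,
    # so normalize once more to keep the result stable.
    return unicodedata.normalize("NFKC", folded)


print(normalize_and_fold("DE\u0301JA\u0300") == normalize_and_fold("d\u00e9j\u00e0"))  # True
print(normalize_and_fold("O\ufb03ce"))  # office
```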
