4-10-3. Aggregations and Analysis

drscg 2017. 9. 23. 22:33

Some aggregations, such as the terms bucket, operate on string fields. And string fields may be either analyzed or not_analyzed, which raises the question: how does analysis affect aggregations?

The answer is "a lot," for two reasons: analysis affects the tokens used in the aggregation, and doc values do not work with analyzed strings.

Let’s tackle the first problem: how the generation of analyzed tokens affects aggregations. First, let’s index some documents representing various states in the US:

POST /agg_analysis/data/_bulk
{ "index": {}}
{ "state" : "New York" }
{ "index": {}}
{ "state" : "New Jersey" }
{ "index": {}}
{ "state" : "New Mexico" }
{ "index": {}}
{ "state" : "New York" }
{ "index": {}}
{ "state" : "New York" }

We want to build a list of unique states in our dataset, complete with counts. Simple—let’s use a terms bucket:

GET /agg_analysis/data/_search
{
    "size" : 0,
    "aggs" : {
        "states" : {
            "terms" : {
                "field" : "state"
            }
        }
    }
}

This gives us these results:

{
...
   "aggregations": {
      "states": {
         "buckets": [
            {
               "key": "new",
               "doc_count": 5
            },
            {
               "key": "york",
               "doc_count": 3
            },
            {
               "key": "jersey",
               "doc_count": 1
            },
            {
               "key": "mexico",
               "doc_count": 1
            }
         ]
      }
   }
}

Oh dear, that’s not at all what we want! Instead of counting states, the aggregation is counting individual words. The underlying reason is simple: aggregations are built from the inverted index, and the inverted index is post-analysis.

When we added those documents to Elasticsearch, the string "New York" was analyzed/tokenized into ["new", "york"]. These individual tokens were then used to populate aggregation counts, and ultimately we see counts for new instead of New York.
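
To make this concrete, here is a minimal Python sketch of the idea, not how Elasticsearch is actually implemented. It assumes a simplified analyzer that only lowercases and splits on whitespace, and the helper names (`analyze`, `build_inverted_index`) are hypothetical. Each bucket's doc_count is just the number of documents in that term's postings list:

```python
from collections import defaultdict


def analyze(text):
    # Simplified stand-in for an analyzer (an assumption for illustration):
    # lowercase the text and split it on whitespace.
    return text.lower().split()


def build_inverted_index(values):
    # Map each token to the set of document ids that contain it.
    postings = defaultdict(set)
    for doc_id, value in enumerate(values):
        for token in analyze(value):
            postings[token].add(doc_id)
    return postings


values = ["New York", "New Jersey", "New Mexico", "New York", "New York"]
postings = build_inverted_index(values)

# A terms bucket over the analyzed field counts documents per token.
doc_counts = {term: len(doc_ids) for term, doc_ids in postings.items()}
for term, count in sorted(doc_counts.items(), key=lambda item: -item[1]):
    print(term, count)
```

Because the index only ever sees the tokens, the original string "New York" is no longer available to count.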

This is obviously not the behavior that we wanted, but luckily it is easily corrected.

We need to define a multifield for state and set it to not_analyzed. This will prevent New York from being analyzed, which means it will stay a single token in the aggregation. Let’s try the whole process over, but this time specify a raw multifield:

DELETE /agg_analysis/
PUT /agg_analysis
{
  "mappings": {
    "data": {
      "properties": {
        "state" : {
          "type": "string",
          "fields": {
            "raw" : {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

POST /agg_analysis/data/_bulk
{ "index": {}}
{ "state" : "New York" }
{ "index": {}}
{ "state" : "New Jersey" }
{ "index": {}}
{ "state" : "New Mexico" }
{ "index": {}}
{ "state" : "New York" }
{ "index": {}}
{ "state" : "New York" }

GET /agg_analysis/data/_search
{
  "size" : 0,
  "aggs" : {
    "states" : {
        "terms" : {
            "field" : "state.raw" 
        }
    }
  }
}

	This time we explicitly map the `state` field and include a `not_analyzed` sub-field.
	The aggregation is run on `state.raw` instead of `state`.

Now when we run our aggregation, we get results that make sense:

{
...
   "aggregations": {
      "states": {
         "buckets": [
            {
               "key": "New York",
               "doc_count": 3
            },
            {
               "key": "New Jersey",
               "doc_count": 1
            },
            {
               "key": "New Mexico",
               "doc_count": 1
            }
         ]
      }
   }
}

In practice, this kind of problem is easy to spot. Your aggregations will simply return strange buckets, and you’ll remember the analysis issue. It is a generalization, but there are not many instances where you want to use an analyzed field in an aggregation. When in doubt, add a multifield so you have the option for both.
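
The multifield idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `index_state` and `terms_agg` are hypothetical helpers, and the "analyzer" is a plain lowercase-and-split. The same source value is stored in two forms, so we can pick whichever one suits the aggregation:

```python
from collections import Counter


def index_state(value):
    # Hypothetical helper: one source value, two indexed forms.
    return {
        "state": value.lower().split(),  # analyzed: one token per word
        "state.raw": [value],            # not_analyzed: the exact string as one token
    }


def terms_agg(indexed_docs, field):
    # Count how many documents contain each distinct token of the chosen field.
    counts = Counter()
    for doc in indexed_docs:
        counts.update(set(doc[field]))
    return dict(counts)


values = ["New York", "New Jersey", "New Mexico", "New York", "New York"]
indexed = [index_state(value) for value in values]

raw_buckets = terms_agg(indexed, "state.raw")
analyzed_buckets = terms_agg(indexed, "state")
print(raw_buckets)
print(analyzed_buckets)
```

Keeping both forms costs some extra storage, but it leaves full-text search on `state` intact while `state.raw` gives sensible buckets.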

Analyzed strings and Fielddata

While the first problem relates to how data is aggregated and displayed to your user, the second problem is largely technical and behind the scenes.

Doc values do not support analyzed string fields because they are not very efficient at representing multi-valued strings. Doc values are most efficient when each document has one or several tokens, but not thousands as in the case of large, analyzed strings (imagine a PDF body, which may be several megabytes and have thousands of unique tokens).

For that reason, doc values are not generated for analyzed strings. Yet these fields can still be used in aggregations. How is that possible?

The answer is a data structure known as fielddata. Unlike doc values, fielddata is built and managed 100% in memory, living inside the JVM heap. That means it is inherently less scalable and has a lot of edge cases to watch out for. The rest of this chapter addresses the challenges of fielddata in the context of analyzed strings.
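
As a rough mental model only, fielddata can be pictured as a document-to-terms map held in memory. The text above does not spell out how it is built, so the sketch below assumes it is derived by "uninverting" the term-to-documents postings. The `uninvert` helper and the sample postings are hypothetical:

```python
def uninvert(postings):
    # Turn term -> doc ids into doc id -> terms, held entirely in memory.
    fielddata = {}
    for term, doc_ids in postings.items():
        for doc_id in doc_ids:
            fielddata.setdefault(doc_id, []).append(term)
    return fielddata


# Hypothetical postings for three documents with an analyzed "state" field.
postings = {
    "new": [0, 1, 2],
    "york": [0],
    "jersey": [1],
    "mexico": [2],
}

fielddata = uninvert(postings)
for doc_id in sorted(fielddata):
    print(doc_id, fielddata[doc_id])
```

Every term of every document ends up in this structure, which is why its memory footprint grows with both document count and the number of tokens per document.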

Historically, fielddata was the default for all fields, but Elasticsearch has been migrating towards doc values to reduce the chance of out-of-memory (OOM) errors. Analyzed strings are the last holdout where fielddata is still used. The goal is to eventually build a serialized data structure similar to doc values which can handle highly dimensional analyzed strings, obsoleting fielddata once and for all.

High-Cardinality Memory Implications

There is another reason to avoid aggregating analyzed fields: high-cardinality fields consume a large amount of memory when loaded into fielddata. The analysis process often (although not always) generates a large number of tokens, many of which are unique. This increases the overall cardinality of the field and contributes to more memory pressure.

Some types of analysis are extremely unfriendly with regard to memory. Consider an n-gram analysis process. The term New York might be n-grammed into the following tokens:

"ne"
"ew"
"w "
" y"
"yo"
"or"
"rk"

You can imagine how the n-gramming process creates a huge number of unique tokens, especially when analyzing paragraphs of text. When these are loaded into memory, you can easily exhaust your heap space.
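
A small Python sketch shows how quickly this grows. It uses a plain sliding window over the lowercased text (so grams that span the space include the space character) rather than any real Elasticsearch tokenizer. The `ngrams` helper and the sample sentence are made up for illustration:

```python
def ngrams(text, n=2):
    # Slide a window of n characters over the lowercased text.
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


tokens = ngrams("New York")
print(tokens)

# Compare the number of unique tokens for word-level analysis against
# 1-, 2- and 3-grams of the same sentence.
sentence = "New York is a state in the northeastern United States"
word_tokens = set(sentence.lower().split())
gram_tokens = set()
for n in (1, 2, 3):
    gram_tokens.update(ngrams(sentence, n))

print(len(word_tokens), len(gram_tokens))
```

Even for a single short sentence, the set of unique n-grams is much larger than the set of unique words, and that gap widens as the text grows.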

So, before aggregating string fields, assess the situation:

Is it a not_analyzed field? If yes, the field will use doc values and be memory-friendly.
Otherwise, this is an analyzed field. It will use fielddata and live in memory. Does this field have a very large cardinality caused by n-grams, shingles, and so on? If yes, it may be very memory-unfriendly.
