4-08. Approximate Aggregations

2.X/4. Aggregations

4-08. Approximate Aggregations

drscg 2017. 9. 23. 22:54

Life is easy if all your data fits on a single machine. Classic algorithms taught in CS201 will be sufficient for all your needs. But if all your data fits on a single machine, there would be no need for distributed software like Elasticsearch at all. But once you start distributing data, algorithm selection needs to be made carefully.

모든 데이터가 단일 시스템으로 충분하다면, 참 쉬울 것이다. CS201에서 배운 고전적인 알고리즘으로, 모든 요구 사항을 충족시킬 것이다. 하지만, 모든 데이터가 단일 시스템으로 충분하다면, Elasticsearch 같은 분산 소프트웨어는 전혀 불필요할 것이다. 그러나, 데이터를 분산하기 시작하면, 알고리즘의 선택은 신중해야 한다.

Some algorithms are amenable to distributed execution. All of the aggregations discussed thus far execute in a single pass and give exact results. These types of algorithms are often referred to as embarrassingly parallel, because they parallelize to multiple machines with little effort. When performing a max metric, for example, the underlying algorithm is very simple:

어떤 알고리즘은 분산 실행을 해야 한다. 지금까지 언급했던 모든 aggregation은 한 번 실행으로, 정확한 결과를 제공한다. 이런 유형의 알고리즘은 약간의 노력으로 여러 시스템에 병렬화할 수 있어, 종종 당혹스러운(embarrassingly) 병렬 이라고 한다. 예를 들어, max metric을 실행하는 경우, 기본 알고리즘은 매우 간단하다.

Broadcast the request to all shards.
모든 shard에 request를 전달(broadcast)한다.
Look at the price field for each document. If price > current_max, replace current_max with price.
각 document에 대해, price field를 확인한다. price > current_max 이면, current_max 를 price로 바꾼다.
Return the maximum price from all shards to the coordinating node.
모든 shard에서 최대 가격을 조정 node로 반환한다.
Find the maximum price returned from all shards. This is the true maximum.
모든 shard에서 반환된 price에서 최대값을 찾는다. 이것이 진정한 최대값이다.

The algorithm scales linearly with machines because the algorithm requires no coordination (the machines don’t need to discuss intermediate results), and the memory footprint is very small (a single integer representing the maximum).

알고리즘은 어떤 조정(시스템은 중간 결과를 확인할 필요가 없다.)도 필요하지 않기 때문에, 알고리즘은 시스템에 선형으로 확장하며, 메모리 공간도 매우 작다(최대 값을 나타내는 하나의 정수)

Not all algorithms are as simple as taking the maximum value, unfortunately. More complex operations require algorithms that make conscious trade-offs in performance and memory utilization. There is a triangle of factors at play: big data, exactness, and real-time latency.

유감스럽게도, 모든 알고리즘이 최대값을 가져오는 것만큼 간단하지가 않다. 더 복잡한 연산은 성능과 메모리 활용에서 의식적인 균형을 찾아야 하는, 알고리즘이 필요하다. 세 가지 요소가 있다. BigData, 정확성(exactness), 실시간 대기 시간(real-time latency)

You get to choose two from this triangle:

이 3가지 중 2개를 선택해야 한다.

Exact + real time

Your data fits in the RAM of a single machine. The world is your oyster; use any algorithm you want. Results will be 100% accurate and relatively fast.

데이터는 단일 시스템의 RAM에서 적합하다. 무엇이든 가능하다(The world is your oyster). 어떤 알고리즘을 사용해도 된다. 결과는 100% 정확하고, 비교적 빠르다.

Big data + exact

A classic Hadoop installation. Can handle petabytes of data and give you exact answers—but it may take a week to give you that answer.

고전적인 Hadoop의 설치. PB(petabytes)의 데이터를 다룰 수 있고, 정확한 답을 제공할 것이다. 그러나, 그 대답에 일주일 정도 걸릴 것이다.

Big data + real time

Approximate algorithms that give you accurate, but not exact, results.

정확한 결과를 제공할 수 있는 대략적인(approximate) 알고리즘, 그러나 100% 정확하지는 않다.

Elasticsearch currently supports two approximate algorithms (cardinality and percentiles). These will give you accurate results, but not 100% exact. In exchange for a little bit of estimation error, these algorithms give you fast execution and a small memory footprint.

현재, Elasticsearch는 2개의 approximate 알고리즘(cardinality 와 percentiles)을 지원한다. 이들은 정확한 결과를 제공하지만, 100% 정확하지는 않다. 약간의 추정 오차의 대가로, 이들 알고리즘은 빠른 실행과 작은 메모리 공간을 제공한다.

For most domains, highly accurate results that return in real time across all your data is more important than 100% exactness. At first blush, this may be an alien concept to you. "We need exact answers!" you may yell. But consider the implications of a 0.5% error:

대부분 의 영역에서, 모든 데이터 에 대해, 실시간 으로 반환하는 매우 정확한 결과는 100% 의 정확함보다 더 중요하다. 언뜻 보기에, 이상할 수도 있다. "정확한 답변이 필요해요" 라고 말할 수 있다. 그러나, 0.5%의 오차에 대한 영향을 고려해 보자.

The true 99th percentile of latency for your website is 132ms.
website 대기 시간의 진정한 99번째 백분위(percentiles)는 132ms 이다.
An approximation with 0.5% error will be within +/- 0.66ms of 132ms.
0.5% 오차를 가진 대략적인 값은 132ms의 +/- 0.6ms 이내이다.
The approximation returns in milliseconds, while the "true" answer may take seconds, or be impossible.
대략적인 값은 수 ms(milliseconds)안에 반환될 것이다. "진정한" 답은 수초가 걸리거나 불가능할 것이다.

For simply checking on your website’s latency, do you care if the approximate answer is 132.66ms instead of 132ms? Certainly, not all domains can tolerate approximations—but the vast majority will have no problem. Accepting an approximate answer is more often a cultural hurdle rather than a business or technical imperative.

website 대기 시간을 간단히 확인하는 경우, 대략적인 답이 132ms 대신 132.66ms인 것이 의미가 있는가? 물론, 모든 영역에서 대략적인 값을 허용할 수는 없겠지만, 대부분은 아무 문제가 되지 않는다. 대략적인 답이라는 것은 비즈니스나 기술 장애라기 보다는 흔한 문화적 장애물이다.

저작자표시 비영리 변경금지

'2.X > 4. Aggregations' 카테고리의 다른 글

4-07-2. Sorting by a Metric (0)	2017.09.23
4-07-3. Sorting Based on "Deep" Metrics (0)	2017.09.23
4-08-1. Finding Distinct Counts (0)	2017.09.23
4-08-2. Calculating Percentiles (0)	2017.09.23
4-09. Significant Terms (0)	2017.09.23

현재글4-08. Approximate Aggregations

elasticsearch, definitive guide

score, primary, phrase, Mapping, Size, index, Cluster, Relevance, Type, MATCH, full-text, cache, Query, json, parent, Term, Filter, replica, inverted, Shard,

Today :
Yesterday :

일	월	화	수	목	금	토
					1	2
3	4	5	6	7	8	9
10	11	12	13	14	15	16
17	18	19	20	21	22	23
24	25	26	27	28	29	30

不爲也比不能也