6-4-07. Time-Based Data

drscg 2017. 9. 23. 13:00

One of the most common use cases for Elasticsearch is for logging, so common in fact that Elasticsearch provides an integrated logging platform called the ELK stack—Elasticsearch, Logstash, and Kibana—to make the process easy.

Logstash collects, parses, and enriches logs before indexing them into Elasticsearch. Elasticsearch acts as a centralized logging server, and Kibana is a graphic frontend that makes it easy to query and visualize what is happening across your network in near real-time.

Most traditional use cases for search engines involve a relatively static collection of documents that grows slowly. Searches look for the most relevant documents, regardless of when they were created.

Logging—and other time-based data streams such as social-network activity—are very different in nature. The number of documents in the index grows rapidly, often accelerating with time. Documents are almost never updated, and searches mostly target the most recent documents. As documents age, they lose value.

We need to adapt our index design to function with the flow of time-based data.

Index per Time Frame

If we were to have one big index for documents of this type, we would soon run out of space. Logging events just keep on coming, without pause or interruption. We could delete the old events with a scroll query and bulk delete, but this approach is very inefficient. When you delete a document, it is only marked as deleted (see Deletes and Updates). It won’t be physically deleted until the segment containing it is merged away.

Instead, use an index per time frame. You could start out with an index per year (logs_2014) or per month (logs_2014-10). Perhaps, when your website gets really busy, you need to switch to an index per day (logs_2014-10-24). Purging old data is easy: just delete old indices.
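
The naming scheme above can be sketched in a few lines of Python. This is only an illustration: the helper names `index_name` and `expired_indices` and the `PATTERNS` table are hypothetical, not part of Elasticsearch. The sketch relies on the fact that zero-padded date suffixes sort lexicographically in chronological order, so finding indices old enough to delete is a plain string comparison.

```python
from datetime import date

# Hypothetical helper table: one strftime pattern per time frame,
# mirroring the example names logs_2014, logs_2014-10, logs_2014-10-24.
PATTERNS = {
    "year": "%Y",
    "month": "%Y-%m",
    "day": "%Y-%m-%d",
}


def index_name(day, frame="month", prefix="logs"):
    """Return the index name that covers `day` for the given time frame."""
    return f"{prefix}_{day.strftime(PATTERNS[frame])}"


def expired_indices(existing, keep_from, frame="month", prefix="logs"):
    """Return the names in `existing` that are older than the frame of `keep_from`.

    Only names with the same prefix and the same length as the cutoff name
    are considered, so indices of a different granularity are left alone.
    """
    cutoff = index_name(keep_from, frame, prefix)
    return sorted(
        name
        for name in existing
        if name.startswith(prefix + "_")
        and len(name) == len(cutoff)
        and name < cutoff
    )
```

Each name returned by `expired_indices` could then be removed with a single delete-index request, which drops the whole index at once instead of marking documents as deleted one by one.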

This approach has the advantage of allowing you to scale as and when you need to. You don’t have to make any difficult decisions up front. Every day is a new opportunity to change your indexing time frames to suit the current demand. Apply the same logic to how big you make each index. Perhaps all you need is one primary shard per week initially. Later, maybe you need five primary shards per day. It doesn’t matter—you can adjust to new circumstances at any time.
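
Because the number of primary shards is fixed when an index is created, each new time-frame index is a chance to pick a different value. A minimal sketch of the request body, assuming it is sent as the body of a create-index request such as `PUT /logs_2014-10-24`; the function name `create_index_body` and the default replica count are assumptions made for illustration:

```python
def create_index_body(primary_shards, replicas=1):
    """Build the settings body for a create-index request.

    `number_of_shards` and `number_of_replicas` are standard index
    settings; the values passed in here are purely illustrative.
    """
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas,
        }
    }
```

Calling `create_index_body(1)` for an early weekly index and `create_index_body(5)` for a later daily index would mirror the progression described above.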

Aliases can help make switching indices more transparent. For indexing, you can point logs_current to the index currently accepting new log events, and for searching, update last_3_months to point to all indices for the previous three months:

POST /_aliases
{
  "actions": [
    { "add":    { "alias": "logs_current",  "index": "logs_2014-10" }}, 
    { "remove": { "alias": "logs_current",  "index": "logs_2014-09" }}, 
    { "add":    { "alias": "last_3_months", "index": "logs_2014-10" }}, 
    { "remove": { "alias": "last_3_months", "index": "logs_2014-07" }}  
  ]
}

	Switch `logs_current` from September to October.
	Add October to `last_3_months` and remove July.
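
The alias switch above has to be repeated at the start of every month, so the request body lends itself to being generated. The following Python sketch builds the same `actions` list for any month; the function names (`month_index`, `shift_month`, `rollover_actions`) are hypothetical, and it assumes monthly indices named as in the example. It only constructs the body and does not send it to a cluster.

```python
def month_index(year, month, prefix="logs"):
    """Name of the monthly index, e.g. logs_2014-10."""
    return f"{prefix}_{year:04d}-{month:02d}"


def shift_month(year, month, delta):
    """Move (year, month) by `delta` months, handling year boundaries."""
    total = year * 12 + (month - 1) + delta
    return total // 12, total % 12 + 1


def rollover_actions(year, month, window=3):
    """Build the POST /_aliases body that rolls both aliases to a new month.

    `window` is the number of months the search alias should cover.
    """
    new = month_index(year, month)
    previous = month_index(*shift_month(year, month, -1))
    dropped = month_index(*shift_month(year, month, -window))
    return {
        "actions": [
            {"add": {"alias": "logs_current", "index": new}},
            {"remove": {"alias": "logs_current", "index": previous}},
            {"add": {"alias": "last_3_months", "index": new}},
            {"remove": {"alias": "last_3_months", "index": dropped}},
        ]
    }
```

`rollover_actions(2014, 10)` produces the body shown in the request above. All four actions travel in a single request, so neither alias is left pointing at the wrong set of indices in between.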
