1-03-13. Cheaper in Bulk

drscg 2017. 10. 1. 09:55

In the same way that mget allows us to retrieve multiple documents at once, the bulk API allows us to make multiple create, index, update, or delete requests in a single step. This is particularly useful if you need to index a data stream such as log events, which can be queued up and indexed in batches of hundreds or thousands.

The bulk request body has the following, slightly unusual, format:

{ action: { metadata }}\n
{ request body        }\n
{ action: { metadata }}\n
{ request body        }\n
...

This format is like a stream of valid one-line JSON documents joined together by newline (\n) characters. Two important points to note:

Every line must end with a newline character (\n), including the last line. These are used as markers to allow for efficient line separation.
The lines cannot contain unescaped newline characters, as they would interfere with parsing. This means that the JSON must not be pretty-printed.

In Why the Funny Format?, we explain why the bulk API uses this format.

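The two formatting rules can be sketched in a few lines of Python. This is only an illustration: the helper name to_bulk_body is made up here, and it relies on the standard json module, which writes compact single-line JSON when no indent is given.

```python
import json

def to_bulk_body(lines):
    """Serialize an iterable of dicts into a bulk request body string."""
    # json.dumps without `indent` never emits a raw newline: a newline inside
    # a string value is written as the two characters backslash + n, so each
    # document stays on exactly one line.
    # Every line, including the last one, is terminated by a newline.
    return "".join(json.dumps(line) + "\n" for line in lines)

body = to_bulk_body([
    {"delete": {"_index": "website", "_type": "blog", "_id": "123"}},
    {"create": {"_index": "website", "_type": "blog", "_id": "123"}},
    {"title": "My first\nblog post"},  # embedded newline gets escaped
])
```

The resulting string can be sent as the body of a bulk request as-is; pretty-printing it afterwards would break the one-document-per-line rule.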

The action/metadata line specifies what action to do to which document.

The action must be one of the following:

create

Create a document only if the document does not already exist. See Creating a New Document.

index

Create a new document or replace an existing document. See Indexing a Document and Updating a Whole Document.

update

Do a partial update on a document. See Partial Updates to Documents.

delete

Delete a document. See Deleting a Document.

The metadata should specify the _index, _type, and _id of the document to be indexed, created, updated, or deleted.

For instance, a delete request could look like this:

{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}

The request body line consists of the document _source itself—the fields and values that the document contains. It is required for index and create operations, which makes sense: you must supply the document to index.

It is also required for update operations and should consist of the same request body that you would pass to the update API: doc, upsert, script, and so forth. No request body line is required for a delete.

{ "create":  { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "My first blog post" }

If no _id is specified, an ID will be autogenerated:

{ "index": { "_index": "website", "_type": "blog" }}
{ "title":    "My second blog post" }
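
The rules for when a request body line is needed can be captured in a small helper. The following is a hedged sketch, not part of any Elasticsearch client: the function name bulk_lines and its error messages are invented for illustration.

```python
import json

ACTIONS_WITH_BODY = {"create", "index", "update"}

def bulk_lines(action, metadata, body=None):
    """Return the one or two JSON lines that make up a single subrequest."""
    if action == "delete":
        # A delete consists of the action/metadata line only.
        if body is not None:
            raise ValueError("delete takes no request body line")
        return [json.dumps({action: metadata})]
    if action not in ACTIONS_WITH_BODY:
        raise ValueError("unknown bulk action: %r" % action)
    if body is None:
        # create, index and update all need a request body line.
        raise ValueError("%s requires a request body line" % action)
    return [json.dumps({action: metadata}), json.dumps(body)]

lines = bulk_lines("index", {"_index": "website", "_type": "blog"},
                   {"title": "My second blog post"})
```

For an update, the body passed in would be the same structure the update API accepts, for example {"doc": {...}}.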

To put it all together, a complete bulk request has this form:

POST /_bulk
{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }} 
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "My first blog post" }
{ "index":  { "_index": "website", "_type": "blog" }}
{ "title":    "My second blog post" }
{ "update": { "_index": "website", "_type": "blog", "_id": "123", "_retry_on_conflict" : 3} }
{ "doc" : {"title" : "My updated blog post"} }

	Notice how the `delete` action does not have a request body; it is followed directly by another action.
	Remember the final newline character (`\n`).

The Elasticsearch response contains the items array, which lists the result of each request, in the same order as we requested them:

{
   "took": 4,
   "errors": false, 
   "items": [
      {  "delete": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 2,
            "status":   200,
            "found":    true
      }},
      {  "create": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 3,
            "status":   201
      }},
      {  "create": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "EiwfApScQiiy7TIKFxRCTw",
            "_version": 1,
            "status":   201
      }},
      {  "update": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 4,
            "status":   200
      }}
   ]
}

`"errors": false` indicates that all subrequests completed successfully.

Each subrequest is executed independently, so the failure of one subrequest won’t affect the success of the others. If any of the requests fail, the top-level errors flag is set to true and the error details are reported under the relevant request:

POST /_bulk
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "Cannot create - it already exists" }
{ "index":  { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "But we can update it" }

In the response, we can see that it failed to create document 123 because it already exists, but the subsequent index request, also on document 123, succeeded:

{
   "took": 3,
   "errors": true, 
   "items": [
      {  "create": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "status":   409,
            "error":    "DocumentAlreadyExistsException[[website][4] [blog][123]: document already exists]"
      }},
      {  "index": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 5,
            "status":   200 
      }}
   ]
}

	`"errors": true`: one or more requests has failed.
	The HTTP status code for the failed `create` request is `409 Conflict`.
	The `error` message explains why the request failed.
	The second request succeeded with an HTTP status code of `200 OK`.

That also means that bulk requests are not atomic: they cannot be used to implement transactions. Each request is processed separately, so the success or failure of one request will not interfere with the others.

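Because subrequests succeed or fail individually, a client usually has to walk the items array itself. A minimal sketch, assuming the response has already been parsed into a Python dict with the shape shown above; the helper name failed_items and the "status of 400 or higher" rule are assumptions made for this example.

```python
def failed_items(response):
    """Return (position, action, result) for every subrequest that failed."""
    failures = []
    if not response.get("errors"):
        return failures  # top-level flag says nothing failed
    for position, item in enumerate(response["items"]):
        # Each item is a one-key dict of the form {action: result}.
        for action, result in item.items():
            if "error" in result or result.get("status", 200) >= 400:
                failures.append((position, action, result))
    return failures

response = {
    "took": 3,
    "errors": True,
    "items": [
        {"create": {"_index": "website", "_type": "blog", "_id": "123",
                    "status": 409, "error": "document already exists"}},
        {"index": {"_index": "website", "_type": "blog", "_id": "123",
                   "_version": 5, "status": 200}},
    ],
}

for position, action, result in failed_items(response):
    print(position, action, result["status"])  # prints: 0 create 409
```

Since items come back in request order, the position can be used to find the original subrequest and retry or log just that one.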

Don’t Repeat Yourself

Perhaps you are batch-indexing logging data into the same index, and with the same type. Having to specify the same metadata for every document is a waste. Instead, just as for the mget API, the bulk request accepts a default /_index or /_index/_type in the URL:

POST /website/_bulk
{ "index": { "_type": "log" }}
{ "event": "User logged in" }

You can still override the _index and _type in the metadata line; any value you leave out falls back to the default given in the URL:

POST /website/log/_bulk
{ "index": {}}
{ "event": "User logged in" }
{ "index": { "_type": "blog" }}
{ "title": "Overriding the default type" }
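
The fallback rule amounts to "metadata wins, URL fills the gaps". The sketch below only models that rule for illustration; resolve_target is a hypothetical name and says nothing about how Elasticsearch implements it internally.

```python
def resolve_target(metadata, url_index=None, url_type=None):
    """Return the (_index, _type) pair a metadata line ends up addressing."""
    # Values in the metadata line take precedence; missing ones fall back
    # to the defaults taken from the request URL.
    return (metadata.get("_index", url_index),
            metadata.get("_type", url_type))

# Mirrors POST /website/log/_bulk from the example above.
first = resolve_target({}, url_index="website", url_type="log")
second = resolve_target({"_type": "blog"}, url_index="website", url_type="log")
```

Here first resolves to the website index with type log, while second keeps the index from the URL but uses the overriding type blog.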

How Big Is Too Big?

The entire bulk request needs to be loaded into memory by the node that receives our request, so the bigger the request, the less memory available for other requests. There is an optimal size of bulk request. Above that size, performance no longer improves and may even drop off. The optimal size, however, is not a fixed number. It depends entirely on your hardware, your document size and complexity, and your indexing and search load.

Fortunately, it is easy to find this sweet spot: Try indexing typical documents in batches of increasing size. When performance starts to drop off, your batch size is too big. A good place to start is with batches of 1,000 to 5,000 documents or, if your documents are very large, with even smaller batches.

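Once you have measured indexing throughput at a few batch sizes, picking the sweet spot is a simple scan for the first drop. The sketch below assumes you collected the measurements yourself; the function name and the sample numbers are invented for illustration and are not benchmark results.

```python
def largest_size_before_drop(throughput_by_size):
    """Given {batch_size: docs_per_second}, return the batch size at which
    throughput peaked before it first dropped off."""
    best_size, best_rate = None, float("-inf")
    for size in sorted(throughput_by_size):
        rate = throughput_by_size[size]
        if rate < best_rate:
            break  # performance started to drop: the previous size wins
        best_size, best_rate = size, rate
    return best_size

# Hypothetical measurements, for illustration only.
measured = {500: 4000.0, 1000: 6500.0, 2000: 8000.0, 5000: 7200.0, 10000: 6000.0}
sweet_spot = largest_size_before_drop(measured)
```

With these made-up numbers the scan stops at 5,000 documents, where throughput first falls, and reports 2,000 as the batch size to use.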

It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents. A good bulk size to start playing with is around 5-15MB in size.

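One way to keep both the document count and the physical size in check is to cut batches on whichever limit is reached first. This is a rough sketch under stated assumptions: the generator name batches and its default limits (1,000 documents, 5MB) are illustrative picks from the ranges above, and only the request body lines are counted, not the action/metadata lines.

```python
import json

def batches(docs, max_docs=1000, max_bytes=5 * 1024 * 1024):
    """Yield lists of serialized body lines, capped by count and byte size."""
    batch, batch_bytes = [], 0
    for doc in docs:
        line = json.dumps(doc) + "\n"
        size = len(line.encode("utf-8"))
        # Start a new batch if adding this line would break either limit.
        if batch and (len(batch) >= max_docs or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(line)
        batch_bytes += size
    if batch:
        yield batch  # final, possibly smaller, batch

docs = [{"event": "User logged in", "n": i} for i in range(2500)]
sizes = [len(batch) for batch in batches(docs, max_docs=1000)]
```

A single document larger than max_bytes still goes out in a batch of its own, so the byte limit is a target rather than a hard guarantee.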