2016.09.27 - Translation - A New Way To Ingest - Part 1

drscg 2019. 1. 7. 10:32

With the upcoming release of version 5.0 of the Elastic Stack, it is time we took a closer look at how to use one of the new features, Ingest Nodes.

What are Ingest Nodes?

Ingest Nodes are a new type of Elasticsearch node you can use to perform common data transformations and enrichments.

Each task is represented by a processor. Processors are configured to form pipelines.

At the time of writing, the Ingest Node had 20 built-in processors, for example grok, date, gsub, lowercase/uppercase, remove and rename. You can find a full list in the documentation.

Besides those, there are currently also three Ingest plugins:

Ingest Attachment converts binary documents such as PowerPoint presentations, Excel spreadsheets, and PDF documents to text and metadata.
Ingest Geoip looks up the geographic locations of IP addresses in an internal database.
Ingest User Agent parses and extracts information from the user agent strings that browsers and other applications send when using HTTP.

Create and Use an Ingest Pipeline

You configure a new ingest pipeline with the _ingest API endpoint.

Note: Throughout this blog post, when showing requests to Elasticsearch we are using the format of Console.

PUT _ingest/pipeline/rename_hostname
{
  "processors": [
    {
      "rename": {
        "field": "hostname",
        "target_field": "host",
        "ignore_missing": true
      }
    }
  ]
}

In this example, we configure a pipeline called rename_hostname that simply takes the field hostname and renames it to host. Because ignore_missing is set to true, the processor continues without error if the hostname field does not exist.
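As a rough illustration of these semantics, the following plain-Python sketch mimics what the rename processor does to a single document. It is not Elasticsearch code: the function name rename_field and the exception types are made up for this example, and the check for an already existing target field is our reading of the rename processor documentation.

```python
def rename_field(doc, field, target_field, ignore_missing=False):
    """Return a copy of doc with `field` renamed to `target_field`.

    Hypothetical helper that imitates the rename processor; it does not
    talk to Elasticsearch.
    """
    if field not in doc:
        if ignore_missing:
            # Nothing to rename: pass the document through unchanged.
            return dict(doc)
        raise KeyError(f"field [{field}] doesn't exist")
    if target_field in doc:
        # Assumption: the real processor also refuses to overwrite a field.
        raise ValueError(f"field [{target_field}] already exists")
    renamed = dict(doc)
    renamed[target_field] = renamed.pop(field)
    return renamed


with_hostname = rename_field({"hostname": "myserver"}, "hostname", "host", ignore_missing=True)
without_hostname = rename_field({"other": 1}, "hostname", "host", ignore_missing=True)
print(with_hostname)
print(without_hostname)
```

The second call shows the ignore_missing case: the document has no hostname field and comes back unchanged instead of raising an error.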

There are several ways to use this pipeline.

When using plain Elasticsearch APIs, you specify the pipeline parameter in the query string, e.g.:

POST server/values/?pipeline=rename_hostname
{
  "hostname": "myserver"
}
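
Outside of Console, the same request can be sent from any HTTP client. The sketch below builds it with the Python standard library only. The base URL http://localhost:9200 and the helper name build_index_request are assumptions for illustration, and actually sending the request requires a reachable cluster.

```python
import json
import urllib.parse
import urllib.request


def build_index_request(base_url, index, doc_type, doc, pipeline):
    """Build (but do not send) an index request that names an ingest pipeline."""
    query = urllib.parse.urlencode({"pipeline": pipeline})
    url = f"{base_url}/{index}/{doc_type}/?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumption: Elasticsearch listens on localhost:9200.
req = build_index_request(
    "http://localhost:9200", "server", "values",
    {"hostname": "myserver"}, "rename_hostname",
)
print(req.get_method(), req.full_url)

# To send it against a running cluster:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```

Keeping the request construction separate from the network call makes it easy to check the URL and body before anything is indexed.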

In Logstash, you add the pipeline parameter to the elasticsearch output:

output {
  elasticsearch {
    hosts => "192.168.100.39"
    index => "server"
    pipeline => "rename_hostname"
  }
}

Similarly, you add the pipeline parameter to the elasticsearch output of any Beat:

output.elasticsearch:
  hosts: ["192.168.100.39:9200"]
  index: "server"
  pipeline: "rename_hostname"

Note: In alpha versions of 5.0, you had to use parameters.pipeline in the Beats configuration.

Simulate

When configuring a new pipeline, it is often very valuable to test it before feeding it real data, rather than only discovering afterwards that it throws an error.

For that, there is the Simulate API:

POST _ingest/pipeline/rename_hostname/_simulate
{
  "docs": [
    {
      "_source": {
        "hostname": "myserver"
      }
    }
  ]
}

The result shows us that our field has been successfully renamed:

        [...]
        "_source": {
          "host": "myserver"
        },
        [...]
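
When simulating many documents, it can help to generate the request body and pull the transformed documents out of the response programmatically. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, and sample_response is a hand-trimmed stand-in that follows the docs / doc / _source nesting of the Simulate API output, not real server output.

```python
def simulate_body(sources):
    """Wrap plain documents in the structure the Simulate API expects."""
    return {"docs": [{"_source": source} for source in sources]}


def simulated_sources(response):
    """Collect the transformed _source of every successfully simulated doc."""
    return [entry["doc"]["_source"] for entry in response["docs"] if "doc" in entry]


body = simulate_body([{"hostname": "myserver"}])

# Hand-written stand-in for a Simulate API response (assumption, trimmed).
sample_response = {"docs": [{"doc": {"_source": {"host": "myserver"}}}]}
results = simulated_sources(sample_response)

print(body)
print(results)
```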

A real-world example: Web Logs

Let’s turn to something from the real world: Web logs.

This is an example of an access log in the Combined Log Format supported by both Apache httpd and nginx:

212.87.37.154 - - [12/Sep/2016:16:21:15 +0000] "GET /favicon.ico HTTP/1.1" 200 3638 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"

As you can see, it contains several pieces of information: IP address, timestamp, a user agent string, and so on.

To allow fast search and visualisation, we need to give every piece its own field in Elasticsearch. It would also be useful to know where this request is coming from. We can do all this with the following Ingest pipeline.

PUT _ingest/pipeline/access_log
{
  "description" : "Ingest pipeline for Combined Log Format",
  "processors" : [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{IPORHOST:clientip} %{USER:ident} %{USER:auth} \\[%{HTTPDATE:timestamp}\\] \"%{WORD:verb} %{DATA:request} HTTP/%{NUMBER:httpversion}\" %{NUMBER:response:int} (?:-|%{NUMBER:bytes:int}) %{QS:referrer} %{QS:agent}"]
      }
    },
    {
      "date": {
        "field": "timestamp",
        "formats": [ "dd/MMM/yyyy:HH:mm:ss Z" ]
      }
    },
    {
      "geoip": {
        "field": "clientip"
      }
    },
    {
      "user_agent": {
        "field": "agent"
      }
    }
  ]
}

It contains a total of four processors:

grok uses a regular expression to parse the whole log line into individual fields.
date parses the timestamp field and uses it as the timestamp of the document.
geoip takes the IP address of the requester and looks it up in an internal database to determine its geographical location.
user_agent takes the user agent string and splits it up into individual components.
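
To get a feel for what the grok and date steps do, here is a plain-Python approximation using only the standard library. The regular expression is a hand-written stand-in for the grok pattern, not the expansion grok actually uses, and the names COMBINED_LOG and parse_access_log are made up for this sketch. The geoip and user_agent steps need their databases and are not reproduced.

```python
import re
from datetime import datetime

# Hand-written approximation of the Combined Log Format grok pattern.
COMBINED_LOG = re.compile(
    r'(?P<clientip>\S+) (?P<ident>\S+) (?P<auth>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\w+) (?P<request>.*?) HTTP/(?P<httpversion>[\d.]+)" '
    r'(?P<response>\d+) (?:-|(?P<bytes>\d+)) '
    r'(?P<referrer>"[^"]*") (?P<agent>"[^"]*")'
)


def parse_access_log(line):
    """Split one access log line into fields, or return None if it does not match."""
    match = COMBINED_LOG.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Like the :int suffix in the grok pattern, convert the numeric fields.
    fields["response"] = int(fields["response"])
    if fields["bytes"] is not None:
        fields["bytes"] = int(fields["bytes"])
    # Counterpart of the date processor: parse the timestamp field.
    parsed = datetime.strptime(fields["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    fields["@timestamp"] = parsed.isoformat()
    return fields


SAMPLE = ('212.87.37.154 - - [12/Sep/2016:16:21:15 +0000] '
          '"GET /favicon.ico HTTP/1.1" 200 3638 "-" '
          '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 '
          '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"')

fields = parse_access_log(SAMPLE)
print(fields["clientip"], fields["verb"], fields["response"], fields["@timestamp"])
```

As with the QS pattern in grok, the referrer and agent values keep their surrounding double quotes. The @timestamp string produced here uses Python's ISO format, which is written differently from the way Elasticsearch renders the same instant.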

Since the last two processors are plugins that do not ship with Elasticsearch by default, we will have to install them first:

bin/elasticsearch-plugin install ingest-geoip
bin/elasticsearch-plugin install ingest-user-agent

To test our pipeline, we can again use the Simulate API (the double quotes inside message have to be escaped):

POST _ingest/pipeline/access_log/_simulate
{
  "docs": [
    {
      "_source": {
        "message": "212.87.37.154 - - [12/Sep/2016:16:21:15 +0000] \"GET /favicon.ico HTTP/1.1\" 200 3638 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36\""
      }
    }
  ]
}

The result from Elasticsearch shows us that this worked:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_type": "_type",
        "_id": "_id",
        "_source": {
          "request": "/favicon.ico",
          "agent": "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36\"",
          "geoip": {
            "continent_name": "Europe",
            "city_name": null,
            "country_iso_code": "DE",
            "region_name": null,
            "location": {
              "lon": 9,
              "lat": 51
            }
          },
          "auth": "-",
          "ident": "-",
          "verb": "GET",
          "message": "212.87.37.154 - - [12/Sep/2016:16:21:15 +0000] \"GET /favicon.ico HTTP/1.1\" 200 3638 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36\"",
          "referrer": "\"-\"",
          "@timestamp": "2016-09-12T16:21:15.000Z",
          "response": 200,
          "bytes": 3638,
          "clientip": "212.87.37.154",
          "httpversion": "1.1",
          "user_agent": {
            "patch": "2743",
            "major": "52",
            "minor": "0",
            "os": "Mac OS X 10.11.6",
            "os_minor": "11",
            "os_major": "10",
            "name": "Chrome",
            "os_name": "Mac OS X",
            "device": "Other"
          },
          "timestamp": "12/Sep/2016:16:21:15 +0000"
        },
        "_ingest": {
          "timestamp": "2016-09-13T14:35:58.746+0000"
        }
      }
    }
  ]
}

In the second part we will show how to set up an ingestion pipeline using Filebeat, Elasticsearch and Kibana to ingest and visualize web logs.

Original: A New Way To Ingest - Part 1
