3-7-5. Phonetic Matching

drscg 2017. 9. 24. 11:59

In a last, desperate attempt to match something, anything, we could resort to searching for words that sound similar, even if their spelling differs.

Several algorithms exist for converting words into a phonetic representation. The Soundex algorithm is the granddaddy of them all, and most other phonetic algorithms are improvements or specializations of Soundex: Metaphone and Double Metaphone (which expands phonetic matching to languages other than English), Caverphone for matching names in New Zealand, the Beider-Morse algorithm, which adopts the Soundex algorithm for better matching of German and Yiddish names, and the Kölner Phonetik for better handling of German words.
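To make "phonetic representation" concrete, here is a minimal sketch of classic American Soundex in Python. It is an illustration only, not the code the Elasticsearch plugin runs, and the function name `soundex` is our own choice. Note that the examples later in this post use Double Metaphone, which produces different codes.

```python
# Letter groups of classic American Soundex: similar-sounding consonants share a digit.
GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
CODES = {ch: digit for digit, letters in GROUPS.items() for ch in letters}


def soundex(word):
    """Return the 4-character Soundex code of a word (illustrative sketch)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()          # the first letter is kept as-is
    prev = CODES.get(letters[0], "")     # its code still suppresses a repeat
    for ch in letters[1:]:
        if ch in "hw":
            continue                     # h and w do not separate equal codes
        code = CODES.get(ch, "")         # vowels (and unknown letters) have no code
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        prev = code                      # a vowel resets prev, so a repeat counts again
    return result.ljust(4, "0")          # pad short codes with zeros


print(soundex("Smith"), soundex("Smythe"))   # S530 S530
print(soundex("John"), soundex("Jonathon"))  # J500 J535
```

Words that collapse to the same code are treated as sounding alike, which is all a phonetic algorithm does: it trades spelling detail for a coarse key.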

The takeaway from this list is that phonetic algorithms are fairly crude and very specific to the languages they were designed for, usually English or German. This limits their usefulness. Still, for certain purposes, and in combination with other techniques, phonetic matching can be a useful tool.

First, you will need to install the Phonetic Analysis plug-in from https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-phonetic.html on every node in the cluster, and restart each node.

Then, you can create a custom analyzer that uses one of the phonetic token filters and try it out:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "dbl_metaphone": { 
          "type":    "phonetic",
          "encoder": "double_metaphone"
        }
      },
      "analyzer": {
        "dbl_metaphone": {
          "tokenizer": "standard",
          "filter":    "dbl_metaphone" 
        }
      }
    }
  }
}

	First, configure a custom `phonetic` token filter that uses the `double_metaphone` encoder.
	Then use the custom token filter in a custom analyzer.

Now we can test it with the analyze API:

GET /my_index/_analyze?analyzer=dbl_metaphone
Smith Smythe

Smith and Smythe each produce two tokens in the same position: SM0 and XMT. Running John, Jon, and Johnnie through the analyzer produces the two tokens JN and AN in each case, while Jonathon results in the tokens JN0N and ANTN.
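The effect of these tokens can be sketched in a few lines of Python. The token sets below are copied from the analyzer output quoted above; the `sounds_alike` helper is hypothetical and only approximates the idea that two names match when they share at least one phonetic token.

```python
# Tokens as quoted in the text for the dbl_metaphone analyzer.
PHONETIC_TOKENS = {
    "Smith": {"SM0", "XMT"},
    "Smythe": {"SM0", "XMT"},
    "John": {"JN", "AN"},
    "Jon": {"JN", "AN"},
    "Johnnie": {"JN", "AN"},
    "Jonathon": {"JN0N", "ANTN"},
}


def sounds_alike(a, b):
    """True when the two names share at least one phonetic token."""
    return bool(PHONETIC_TOKENS[a] & PHONETIC_TOKENS[b])


print(sounds_alike("Smith", "Smythe"))   # True
print(sounds_alike("John", "Johnnie"))   # True
print(sounds_alike("John", "Jonathon"))  # False
```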

The phonetic analyzer can be used just like any other analyzer. First map a field to use it, and then index some data:

PUT /my_index/_mapping/my_type
{
  "properties": {
    "name": {
      "type": "string",
      "fields": {
        "phonetic": { 
          "type":     "string",
          "analyzer": "dbl_metaphone"
        }
      }
    }
  }
}

PUT /my_index/my_type/1
{
  "name": "John Smith"
}

PUT /my_index/my_type/2
{
  "name": "Jonnie Smythe"
}

The name.phonetic field uses the custom dbl_metaphone analyzer.

The match query can be used for searching:

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name.phonetic": {
        "query": "Jahnnie Smeeth",
        "operator": "and"
      }
    }
  }
}

This query returns both documents, demonstrating just how coarse phonetic matching is. Scoring with a phonetic algorithm is pretty much worthless. The purpose of phonetic matching is not to increase precision, but to increase recall: to spread the net wide enough to catch any documents that might possibly match.
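The trade-off between precision and recall can be made concrete with a small calculation. The document IDs below are entirely hypothetical and chosen only to show the typical pattern: a strict match finds few documents but all of them are relevant, while a phonetic match finds every relevant document along with some noise.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data, for illustration only.
relevant = {"doc1", "doc2"}                   # what the user actually wants
exact = {"doc1"}                              # strict spelling match
phonetic = {"doc1", "doc2", "doc3", "doc4"}   # sounds-alike match

print(precision_recall(exact, relevant))      # (1.0, 0.5)
print(precision_recall(phonetic, relevant))   # (0.5, 1.0)
```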

It usually makes more sense to use phonetic algorithms when retrieving results that will be consumed and post-processed by another computer, rather than by human users.
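As one possible form of such post-processing, a client could take the coarse phonetic hits and re-rank them by plain string similarity. This is a sketch under our own assumptions, not something the original guide prescribes: the `rerank` helper is hypothetical, and it uses `difflib.SequenceMatcher` from the Python standard library as the similarity measure.

```python
from difflib import SequenceMatcher


def rerank(query, candidates):
    """Sort candidate names by string similarity to the query, best first."""
    def similarity(name):
        return SequenceMatcher(None, query.lower(), name.lower()).ratio()

    return sorted(((similarity(name), name) for name in candidates), reverse=True)


# The two documents indexed earlier, as returned by the phonetic query.
candidates = ["John Smith", "Jonnie Smythe"]
for score, name in rerank("Jahnnie Smeeth", candidates):
    print(f"{score:.2f}  {name}")
```

The phonetic field does the wide retrieval, and the second pass supplies the ordering that phonetic scoring cannot.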

License: Attribution, NonCommercial, NoDerivatives
