3-7-1. Fuzziness

drscg 2017. 9. 24. 12:31

Fuzzy matching treats two words that are "fuzzily" similar as if they were the same word. First, we need to define what we mean by fuzziness.

In 1965, Vladimir Levenshtein developed the Levenshtein distance, which measures the number of single-character edits required to transform one word into another. He proposed three types of single-character edits:

Substitution of one character for another: _f_ox → _b_ox
Insertion of a new character: sic → sic_k_
Deletion of a character: b_l_ack → back

Frederick Damerau later expanded these operations to include one more:

Transposition of two adjacent characters: _st_ar → _ts_ar
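
The four operations can be made concrete by generating every string that is exactly one edit away from a word. This is only an illustrative sketch: the function name `single_edits` and the lowercase ASCII alphabet are assumptions made for the example, and it is not how Elasticsearch finds fuzzy matches.

```python
import string


def single_edits(word, alphabet=string.ascii_lowercase):
    """Return every string one substitution, insertion, deletion,
    or adjacent transposition away from `word`."""
    # Every way of cutting the word into a left part and a right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]

    # Replace the first character of the right part with a different one.
    substitutions = {
        left + c + right[1:]
        for left, right in splits if right
        for c in alphabet if c != right[0]
    }
    # Put a new character between the left and right parts.
    insertions = {left + c + right for left, right in splits for c in alphabet}
    # Drop the first character of the right part.
    deletions = {left + right[1:] for left, right in splits if right}
    # Swap the first two characters of the right part (Damerau's addition).
    transpositions = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1 and right[0] != right[1]
    }
    return substitutions | insertions | deletions | transpositions


print("box" in single_edits("fox"))     # substitution
print("sick" in single_edits("sic"))    # insertion
print("back" in single_edits("black"))  # deletion
print("tsar" in single_edits("star"))   # transposition
```

Each of the four example pairs above lands in the one-edit neighborhood of its source word.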

For example, converting the word bieber into beaver requires the following steps:

Substitute v for b: bie_b_er → bie_v_er
Substitute a for i: b_i_ever → b_a_ever
Transpose a and e: b_ae_ver → b_ea_ver

These three steps represent a Damerau-Levenshtein edit distance of 3.
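
To check a distance like this by hand, a small dynamic-programming routine is enough. The sketch below computes the "optimal string alignment" variant of the Damerau-Levenshtein distance, a common simplification in which no substring is edited more than once. The name `osa_distance` is an assumption for this example, and the code is not meant to reflect how Elasticsearch computes distances internally.

```python
def osa_distance(a, b):
    """Edit distance counting substitutions, insertions, deletions,
    and transpositions of adjacent characters (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition: the last two characters are swapped.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("star", "tsar"))      # 1
print(osa_distance("bieber", "beaver"))  # 3
```

Without the transposition branch, star → tsar would cost 2 (two substitutions), which is the difference Damerau's extra operation makes.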

Clearly, bieber is a long way from beaver: the two are too far apart to be considered a simple misspelling. Damerau observed that 80% of human misspellings have an edit distance of 1. In other words, 80% of misspellings could be corrected with a single edit to the original string.

Elasticsearch supports a maximum edit distance of 2, specified with the fuzziness parameter.

Of course, the impact that a single edit has on a string depends on the length of the string. Two edits to the word hat can produce mad, so allowing two edits on a string of length 3 is overkill. The fuzziness parameter can be set to AUTO, which results in the following maximum edit distances:

0 for strings of one or two characters
1 for strings of three, four, or five characters
2 for strings of more than five characters
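
The AUTO thresholds listed above can be written as a small lookup. This is just a restatement of the list for illustration; `auto_fuzziness` is a hypothetical helper, not an Elasticsearch function.

```python
def auto_fuzziness(term):
    """Maximum edit distance that AUTO allows for a term, per the list above."""
    length = len(term)
    if length <= 2:
        return 0  # one or two characters: exact match only
    if length <= 5:
        return 1  # three to five characters: one edit
    return 2      # more than five characters: two edits


for term in ("at", "hat", "quick", "bieber"):
    print(term, auto_fuzziness(term))
```

Under these thresholds hat gets at most one edit, so it can no longer match mad.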

Of course, you may find that an edit distance of 2 is still overkill and returns results that don't appear to be related. You may get better results, and better performance, with a maximum fuzziness of 1.
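
As a rough sketch of what capping fuzziness at 1 looks like in a request, the snippet below builds the body of a match query that carries the fuzziness parameter. The helper name `fuzzy_match_body` and the field name `text` are made up for this example. The snippet only prints the JSON; sending it to a cluster is left out.

```python
import json


def fuzzy_match_body(field, text, fuzziness=1):
    """Build a match query body with an explicit fuzziness setting."""
    # Only the values discussed in this section are accepted: 0, 1, 2, or AUTO.
    if fuzziness != "AUTO" and fuzziness not in (0, 1, 2):
        raise ValueError("fuzziness must be 0, 1, 2, or 'AUTO'")
    return {
        "query": {
            "match": {
                field: {          # hypothetical field name supplied by the caller
                    "query": text,
                    "fuzziness": fuzziness,
                }
            }
        }
    }


print(json.dumps(fuzzy_match_body("text", "bieber"), indent=2))
```

Passing `fuzziness="AUTO"` instead would leave the choice to the length-based thresholds described earlier.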

License: Attribution, NonCommercial, NoDerivatives
