2.X/3. Dealing with Human Language

3-6-4. Synonyms and The Analysis Chain

The example we showed in Formatting Synonyms used u s a as a synonym. Why did we use that instead of U.S.A.? The reason is that the synonym token filter sees only the terms that the previous token filter or tokenizer has emitted. Imagine that we ha..
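A minimal sketch of why the order of the chain matters. This is a toy model in Python, not Elasticsearch code: `tokenize`, `lowercase_filter`, `make_synonym_filter`, and the rule `usa -> usa, america` are all hypothetical names and data chosen for illustration. The synonym rule is written in lowercase, so it only fires when the lowercase filter has already run.

```python
def tokenize(text):
    # Toy tokenizer (assumption): split on whitespace only.
    return text.split()


def lowercase_filter(tokens):
    # Lowercase every token it receives.
    return [t.lower() for t in tokens]


def make_synonym_filter(rules):
    # Build a filter that expands a token only if it matches a rule exactly.
    def synonym_filter(tokens):
        out = []
        for t in tokens:
            out.extend(rules.get(t, [t]))
        return out
    return synonym_filter


def analyze(text, filters):
    # Run the tokenizer, then each filter in order on the previous output.
    tokens = tokenize(text)
    for f in filters:
        tokens = f(tokens)
    return tokens


# Hypothetical rule, written in lowercase.
synonyms = make_synonym_filter({"usa": ["usa", "america"]})

# The synonym filter sees "usa" only if lowercasing happened first.
after_lowercase = analyze("Visit USA", [lowercase_filter, synonyms])
before_lowercase = analyze("Visit USA", [synonyms, lowercase_filter])
```

With the lowercase filter first, the rule matches and the synonym is added. With the order reversed, the synonym filter receives `USA`, finds no rule for it, and passes it through unchanged.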

3-6-5. Multiword Synonyms and Phrase Queries

So far, synonyms appear to be quite straightforward. Unfortunately, this is where things start to go wrong. For phrase queries to function correctly, Elasticsearch needs to know the position that each term occupies in the original text. Multiword synonyms can play havoc with term positions, especially when the injected synonyms are of differing lengths. ..
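The position problem can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not Lucene's actual synonym filter: it assumes each word of an injected synonym is stacked on consecutive positions starting where the original term sits, and that a phrase matches when its words occupy consecutive positions. The function names and the synonym rule are hypothetical.

```python
def index_positions(tokens, synonyms):
    # Map each position to the set of terms stored there (toy model).
    positions = {}
    for pos, tok in enumerate(tokens):
        positions.setdefault(pos, set()).add(tok)
        for phrase in synonyms.get(tok, []):
            # Assumption: synonym words overlay the following positions.
            for offset, word in enumerate(phrase.split()):
                positions.setdefault(pos + offset, set()).add(word)
    return positions


def phrase_matches(positions, phrase):
    # A phrase matches if its words sit at consecutive positions.
    words = phrase.split()
    return any(
        all(w in positions.get(start + i, set()) for i, w in enumerate(words))
        for start in positions
    )


# Hypothetical rule: a one-word term with a four-word synonym.
synonyms = {"usa": ["united states of america"]}
positions = index_positions("the usa is wealthy".split(), synonyms)

expected = phrase_matches(positions, "usa is wealthy")        # sensible match
spurious = phrase_matches(positions, "united is wealthy")     # mixes both forms
missed = phrase_matches(positions, "america is wealthy")      # synonym tail is too far right
```

In this model the longer synonym spills over the positions of the words that follow, so a phrase mixing the two forms matches by accident while a phrase that should match does not.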

3-6-6. Symbol Synonyms

The final part of this chapter is devoted to symbol synonyms, which are unlike the synonyms we have discussed until now. Symbol synonyms are string aliases used to represent symbols that would otherwise be removed during tokenization. While most punctuation is seldom imp..
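As a rough illustration of the idea, the Python sketch below replaces symbols with string aliases before a toy tokenizer runs. The alias names (`emoticon_happy`, `emoticon_sad`), the helper names, and the regex tokenizer are assumptions made for this example and are not taken from the chapter.

```python
import re

# Hypothetical aliases for symbols that the tokenizer would otherwise drop.
SYMBOL_ALIASES = {":)": "emoticon_happy", ":(": "emoticon_sad"}


def map_symbols(text, aliases=SYMBOL_ALIASES):
    # Replace each symbol with its alias, padded so it becomes its own token.
    for symbol, alias in aliases.items():
        text = text.replace(symbol, f" {alias} ")
    return text


def tokenize(text):
    # Toy tokenizer (assumption): keep runs of word characters, lowercase them.
    return re.findall(r"\w+", text.lower())


without_mapping = tokenize("I am :)")
with_mapping = tokenize(map_symbols("I am :)"))
```

Without the mapping step the emoticon disappears during tokenization. With it, the alias survives as an ordinary term that can be indexed and searched.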

3-7. Typoes and Mispelings

We expect a query on structured data like dates and prices to return only documents that match exactly. However, good full-text search shouldn’t have the same restriction. Instead, we can widen the net to include words that may match, but use the relevance score to push the better matches to the top of the result set. ..

3-7-1. Fuzziness

Fuzzy matching treats two words that are "fuzzily" similar as if they were the same word. First, we need to define what we mean by fuzziness. In 1965, Vladimir Levenshtein developed the Levenshtein distance, which measures the number of single-character edits required to transform one word into t..
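To make the definition concrete, here is a short Python sketch of the classic Levenshtein distance using dynamic programming. It counts insertions, deletions, and substitutions only. It is meant as an illustration of the measure, not as the implementation Elasticsearch uses internally, and the function name is our own.

```python
def levenshtein(a, b):
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


distance = levenshtein("surprize", "surprise")
```

Here `surprize` and `surprise` differ by a single substitution, so the distance is 1.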

3-7-2. Fuzzy Query

The fuzzy query is the fuzzy equivalent of the term query. You will seldom use it directly yourself, but understanding how it works will help you to use fuzziness in the higher-level match query. To understand how it works, we will first index some documents: ..

3-7-3. Fuzzy match Query

The match query supports fuzzy matching out of the box:

    GET /my_index/my_type/_search
    {
      "query": {
        "match": {
          "text": {
            "query": "SURPRIZE ME!",
            "fuzziness": "AUTO",
            "operator": "and"
          }
        }
      }
    }

The query string is first analyzed, to produce the terms [surprize, me], and then each term is fuzzified using the specified fuzziness. ..
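For a sense of what AUTO means per term, the Python sketch below applies the length thresholds that Elasticsearch documents for AUTO fuzziness: terms of one or two characters must match exactly, terms of three to five characters allow one edit, and longer terms allow two. The helper name `auto_fuzziness` is hypothetical and the function only mirrors those documented thresholds; it is not Elasticsearch code.

```python
def auto_fuzziness(term):
    # Maximum edit distance allowed for a term under the documented AUTO rule.
    length = len(term)
    if length <= 2:
        return 0  # very short terms must match exactly
    if length <= 5:
        return 1  # one edit allowed
    return 2      # two edits allowed


# The analyzed terms from the example query above.
allowed_edits = {term: auto_fuzziness(term) for term in ["surprize", "me"]}
```

Under this rule `surprize` may differ by up to two edits, while `me` is too short to be fuzzified at all.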

3-7-4. Scoring Fuzziness

Users love fuzzy queries. They assume that these queries will somehow magically find the right combination of proper spellings. Unfortunately, the truth is somewhat more prosaic. Imagine that we have 1,000 documents containing "Schwarzenegger", and just one document with the misspelling "Schwarzenege..

3-7-5. Phonetic Matching

In a last, desperate attempt to match something, anything, we could resort to searching for words that sound similar, even if their spelling differs. Several algorithms exist for converting words into a phonetic representation. The Soundex algorithm is the granddaddy of them all, and most other phonetic algorithms are improv..
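As an illustration of what a phonetic representation looks like, here is a compact Python sketch of the commonly described American Soundex rules: keep the first letter, encode the following consonants as digits, collapse repeated codes, and pad to four characters. It is a teaching sketch under those assumptions, not the encoder used by any Elasticsearch plugin, and the names are our own.

```python
# Consonant groups and their Soundex digits.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
CODES = {ch: digit for digit, letters in _GROUPS.items() for ch in letters}


def soundex(word):
    # Keep ASCII letters only; anything else is ignored.
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")  # vowels and y have no code
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
        if len(result) == 4:
            break
    return result.ljust(4, "0")  # pad short codes with zeros


same_sound = soundex("Robert") == soundex("Rupert")
```

Words that sound alike, such as `Robert` and `Rupert`, collapse to the same code (`R163`), which is what allows a search to match on sound despite different spellings.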