2-2-8. Relevance Is Broken!

drscg 2017. 9. 30. 01:16

Before we move on to discussing more-complex queries in Multifield Search, let’s make a quick detour to explain why we created our test index with just one primary shard.

Every now and again a new user opens an issue claiming that sorting by relevance is broken and offering a short reproduction: the user indexes a few documents, runs a simple query, and finds apparently less-relevant results appearing above more-relevant results.

To understand why this happens, let’s imagine that we create an index with two primary shards and we index ten documents, six of which contain the word foo. It may happen that shard 1 contains three of the foo documents and shard 2 contains the other three. In other words, our documents are well distributed.

In What Is Relevance?, we described the default similarity algorithm used in Elasticsearch, called term frequency / inverse document frequency, or TF/IDF. Term frequency counts the number of times a term appears within the field we are querying in the current document. The more times it appears, the more relevant the document. The inverse document frequency takes into account how often a term appears as a percentage of all the documents in the index. The more documents the term appears in, the less weight it has.
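As a rough illustration of these two factors, here is a minimal Python sketch. It assumes the classic Lucene-style formulas (TF as the square root of the raw count, IDF as 1 + ln(numDocs / (docFreq + 1))); the function names are hypothetical, and the real score also includes other factors, such as field-length normalization, that are left out here.

```python
import math


def term_frequency(term: str, field_tokens: list[str]) -> float:
    """Assumed classic TF: square root of the term's count in this field."""
    return math.sqrt(field_tokens.count(term))


def inverse_document_frequency(doc_freq: int, num_docs: int) -> float:
    """Assumed classic IDF: 1 + ln(num_docs / (doc_freq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# A field that mentions "foo" twice gets a higher TF than one mentioning it once.
once = term_frequency("foo", ["foo", "bar", "baz"])
twice = term_frequency("foo", ["foo", "bar", "foo"])

# A term found in 6 of 10 documents carries less weight than one found in 1 of 10.
common = inverse_document_frequency(doc_freq=6, num_docs=10)
rare = inverse_document_frequency(doc_freq=1, num_docs=10)

print(once < twice, common < rare)
```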

However, for performance reasons, Elasticsearch doesn’t calculate the IDF across all documents in the index. Instead, each shard calculates a local IDF for the documents contained in that shard.

Because our documents are well distributed, the IDF for both shards will be the same. Now imagine instead that five of the foo documents are on shard 1, and the sixth document is on shard 2. In this scenario, the term foo is very common on one shard (and so of little importance), but rare on the other shard (and so much more important). These differences in IDF can produce incorrect results.
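The two scenarios can be worked through with a small Python sketch. It assumes the classic Lucene-style IDF formula, 1 + ln(numDocs / (docFreq + 1)), and further assumes that each shard holds five of the ten documents; neither detail is stated in the text above, and the helper name is hypothetical.

```python
import math


def local_idf(doc_freq: int, num_docs: int) -> float:
    """Assumed classic IDF, computed from one shard's own statistics."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Assumption: 5 documents per shard. Values are (foo documents, all documents).
even = {"shard 1": (3, 5), "shard 2": (3, 5)}
skewed = {"shard 1": (5, 5), "shard 2": (1, 5)}

even_idf = {name: local_idf(df, n) for name, (df, n) in even.items()}
skewed_idf = {name: local_idf(df, n) for name, (df, n) in skewed.items()}

for label, idfs in (("even", even_idf), ("skewed", skewed_idf)):
    for name, value in idfs.items():
        print(f"{label:7s} {name}: idf(foo) = {value:.4f}")
```

In the even case both shards weight foo identically. In the skewed case, a foo match on shard 2 gets a much larger IDF than the same match on shard 1, so otherwise similar documents can end up ranked differently depending only on where they live.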

In practice, this is not a problem. The differences between local and global IDF diminish as you add more documents to the index. With real-world volumes of data, the local IDFs soon even out. The problem is not that relevance is broken but that there is too little data.
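A small simulation can make this evening-out visible. This is only a sketch under stated assumptions: documents are dealt to two shards round-robin (standing in for real routing), each document contains the term with a fixed probability of 0.6, and IDF follows the classic Lucene-style formula. The function names and parameters are hypothetical.

```python
import math
import random


def local_idf(doc_freq: int, num_docs: int) -> float:
    """Assumed classic IDF: 1 + ln(num_docs / (doc_freq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def idf_gap(total_docs: int, term_probability: float = 0.6,
            num_shards: int = 2, seed: int = 42) -> float:
    """Return the spread between the largest and smallest per-shard IDF."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    doc_counts = [0] * num_shards
    doc_freqs = [0] * num_shards
    for doc_id in range(total_docs):
        shard = doc_id % num_shards  # round-robin stands in for routing
        doc_counts[shard] += 1
        if rng.random() < term_probability:
            doc_freqs[shard] += 1
    idfs = [local_idf(df, n) for df, n in zip(doc_freqs, doc_counts)]
    return max(idfs) - min(idfs)


for total in (10, 1_000, 100_000):
    print(f"{total:>7} docs: gap between shard IDFs = {idf_gap(total):.4f}")
```

With only a handful of documents the gap depends heavily on chance; with a large corpus the per-shard statistics settle close to one another.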

For testing purposes, there are two ways we can work around this issue. The first is to create an index with one primary shard, as we did in the section introducing the match query. If you have only one shard, then the local IDF is the global IDF.
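As a sketch of the first workaround, the snippet below builds (but does not send) an index-creation request with a single primary shard. The index name my_index and the helper function are assumptions for illustration; number_of_shards is the index setting that controls the primary shard count.

```python
import json


def single_shard_index_request(index_name: str) -> tuple[str, str, str]:
    """Build the method, path, and JSON body for creating a one-shard index."""
    body = {"settings": {"number_of_shards": 1}}
    return "PUT", f"/{index_name}", json.dumps(body)


# "my_index" is a hypothetical index name.
method, path, body = single_shard_index_request("my_index")
print(method, path)
print(body)
```

Sending this request with any HTTP client before indexing the test documents keeps all of them on one shard, so the local and global statistics coincide.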

The second workaround is to add ?search_type=dfs_query_then_fetch to your search requests. The dfs stands for Distributed Frequency Search, and it tells Elasticsearch to first retrieve the local IDF from each shard in order to calculate the global IDF across the whole index.
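The following sketch shows how the parameter is attached to a search path, and illustrates the idea behind the extra step by merging per-shard statistics into one global IDF. The merge is a conceptual assumption, not Elasticsearch's actual code: it sums raw document counts and document frequencies and then applies the classic Lucene-style IDF formula. The index name and function names are hypothetical.

```python
import math
from urllib.parse import urlencode


def dfs_search_path(index_name: str) -> str:
    """Build a search path that requests the extra DFS phase."""
    query = urlencode({"search_type": "dfs_query_then_fetch"})
    return f"/{index_name}/_search?{query}"


def global_idf(shard_stats: list[tuple[int, int]]) -> float:
    """Merge per-shard (doc_freq, num_docs) pairs into a single IDF.

    Assumption: statistics are summed across shards, then the classic
    formula 1 + ln(num_docs / (doc_freq + 1)) is applied once.
    """
    total_doc_freq = sum(doc_freq for doc_freq, _ in shard_stats)
    total_docs = sum(num_docs for _, num_docs in shard_stats)
    return 1.0 + math.log(total_docs / (total_doc_freq + 1))


print(dfs_search_path("my_index"))

# Skewed example: 5 of 5 documents match on one shard, 1 of 5 on the other.
print(f"global idf(foo) = {global_idf([(5, 5), (1, 5)]):.4f}")
```

Because every shard then scores with the same merged value, the skew between shards no longer affects the ranking.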

Don’t use dfs_query_then_fetch in production. It really isn’t required. Just having enough data will ensure that your term frequencies are well distributed. There is no reason to add this extra DFS step to every query that you run.
