3-4. Reducing Words to Their Root Form

drscg 2017. 9. 24. 13:23

Most languages of the world are inflected, meaning that words can change their form to express differences in the following:

Number: fox, foxes
Tense: pay, paid, paying
Gender: waiter, waitress
Person: hear, hears
Case: I, me, my
Aspect: ate, eaten
Mood: so be it, were it so

While inflection aids expressivity, it interferes with retrievability, as a single root word sense (or meaning) may be represented by many different sequences of letters. English is a weakly inflected language (you could ignore inflections and still get reasonable search results), but some other languages are highly inflected and need extra work in order to achieve high-quality search results.

Stemming attempts to remove the differences between inflected forms of a word, in order to reduce each word to its root form. For instance, foxes may be reduced to the root fox, removing the difference between singular and plural in the same way that we removed the difference between lowercase and uppercase.

The root form of a word may not even be a real word. The words jumping and jumpiness may both be stemmed to jumpi. It doesn’t matter—as long as the same terms are produced at index time and at search time, search will just work.
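
The point about index time and search time can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `toy_stem`, `SUFFIX_RULES`, `build_index`, and `search` are hypothetical names, and the suffix rules are contrived to reproduce the jumpi example rather than taken from any real stemmer.

```python
from collections import defaultdict

# Hypothetical suffix rules, tried in order. They are contrived so that
# "jumping" and "jumpiness" both become "jumpi"; this is NOT the Porter
# algorithm or any stemmer shipped with Elasticsearch.
SUFFIX_RULES = [("iness", "i"), ("ing", "i"), ("es", ""), ("s", "")]


def toy_stem(word):
    """Lowercase the word and apply the first matching suffix rule."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Leave at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


def build_index(docs):
    """Map each stemmed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Stem the query with the same function that was used at index time."""
    hits = set()
    for token in query.split():
        hits |= index.get(toy_stem(token), set())
    return hits


docs = {1: "jumping foxes", 2: "jumpiness", 3: "quick fox"}
index = build_index(docs)
print(search(index, "fox"))      # documents 1 and 3
print(search(index, "Jumping"))  # documents 1 and 2
```

The stem jumpi is not a real word, yet the query still matches, because both sides of the lookup pass through the same function.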

If stemming were easy, there would be only one implementation. Unfortunately, stemming is an inexact science that suffers from two issues: understemming and overstemming.

Understemming is the failure to reduce words with the same meaning to the same root. For example, jumped and jumps may be reduced to jump, while jumping may be reduced to jumpi. Understemming reduces recall: relevant documents are not returned.

Overstemming is the failure to keep two words with distinct meanings separate. For instance, general and generate may both be stemmed to gener. Overstemming reduces precision: irrelevant documents are returned when they shouldn’t be.
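
The two failure modes can be made concrete with a small sketch. It assumes a stemmer whose output matches the examples in the text (the `STEMS` table) and a hand-written table of intended meanings (`MEANING`); both tables and the function `stemming_errors` are hypothetical and exist only for illustration.

```python
from itertools import combinations

# Stemmer output as described in the text (assumed, not produced by a real stemmer).
STEMS = {
    "jumped": "jump",
    "jumps": "jump",
    "jumping": "jumpi",
    "general": "gener",
    "generate": "gener",
}

# Hand-labelled meaning group for each word (an assumption for this sketch).
MEANING = {
    "jumped": "jump",
    "jumps": "jump",
    "jumping": "jump",
    "general": "general",
    "generate": "generate",
}


def stemming_errors(stems, meaning):
    """Return (understemmed, overstemmed) word pairs.

    Understemmed: same meaning, different stems (hurts recall).
    Overstemmed: different meanings, same stem (hurts precision).
    """
    under, over = [], []
    for a, b in combinations(sorted(stems), 2):
        same_meaning = meaning[a] == meaning[b]
        same_stem = stems[a] == stems[b]
        if same_meaning and not same_stem:
            under.append((a, b))
        elif same_stem and not same_meaning:
            over.append((a, b))
    return under, over


under, over = stemming_errors(STEMS, MEANING)
print("understemmed:", under)
print("overstemmed:", over)
```

With these tables, jumping is separated from jumped and jumps (understemming), while general and generate collapse into one term (overstemming).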

Lemmatisation

A lemma is the canonical, or dictionary, form of a set of related words—the lemma of paying, paid, and pays is pay. Usually the lemma resembles the words it is related to but sometimes it doesn’t — the lemma of is, was, am, and being is be.

Lemmatization, like stemming, tries to group related words, but it goes one step further than stemming in that it tries to group words by their word sense, or meaning. The same word may represent two meanings: for example, wake can mean to wake up or a funeral. While lemmatization would try to distinguish these two word senses, stemming would incorrectly conflate them.

Lemmatization is a much more complicated and expensive process that needs to understand the context in which words appear in order to make decisions about what they mean. In practice, stemming appears to be just as effective as lemmatization, but with a much lower cost.
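
As a rough sketch of why context matters, the lookup below keys a tiny lemma table on both the word and a part-of-speech tag. The table `LEMMAS`, the tags, and the functions `lemmatize` and `context_free_stem` are all hypothetical; a real lemmatizer would have to infer the tag from the surrounding sentence, which is where most of the cost comes from. Here the tag is simply supplied by hand.

```python
# Hypothetical lemma table keyed by (word, part-of-speech tag).
LEMMAS = {
    ("paying", "VERB"): "pay",
    ("paid", "VERB"): "pay",
    ("pays", "VERB"): "pay",
    ("is", "VERB"): "be",
    ("was", "VERB"): "be",
    ("am", "VERB"): "be",
    ("being", "VERB"): "be",
    ("wake", "VERB"): "wake (to wake up)",
    ("wake", "NOUN"): "wake (funeral)",
}


def lemmatize(word, pos):
    """Look up the lemma for a word in a given context (here: a POS tag).

    Unknown (word, tag) pairs fall back to the lowercased word itself.
    """
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())


def context_free_stem(word):
    """Stand-in for a stemmer: it sees only the word, never its context."""
    return word.lower()


# The irregular forms of "be" share one lemma despite looking unrelated.
print({lemmatize(w, "VERB") for w in ["is", "was", "am", "being"]})

# Context separates the two senses of "wake"; the stand-in stemmer cannot.
print(lemmatize("wake", "VERB"), "|", lemmatize("wake", "NOUN"))
print(context_free_stem("wake") == context_free_stem("wake"))
```

The stand-in stemmer necessarily returns the same term for both senses of wake, since it has no access to the tag.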

First we will discuss the two classes of stemmers available in Elasticsearch, Algorithmic Stemmers and Dictionary Stemmers, and then look at how to choose the right stemmer for your needs in Choosing a Stemmer. Finally, we will discuss options for tailoring stemming in Controlling Stemming and Stemming in situ.

Attribution, NonCommercial, NoDerivatives