Lemmatization generally refers to reducing a word to its base (dictionary) form, for example:
loving  => love
helping => help
helps   => help
scaling => scale
cars    => car
cats    => cat
In this article we will use the WordNetLemmatizer from NLTK to lemmatize sentences. The lemmatizer takes an input string and tries to lemmatize it; by default, if you pass in a single word, it is treated as a noun.
So a simple lemmatization looks like this:
>>> nltk.stem.WordNetLemmatizer().lemmatize('loving')
'loving'
>>> nltk.stem.WordNetLemmatizer().lemmatize('loving', 'v')
u'love'
Here 'v' tells the lemmatizer that the word's part of speech is a verb (the u prefix in the second output simply marks a Unicode string under Python 2; under Python 3 it prints as plain 'love'). As you can see, if you don't supply the part of speech, the result may not be what you want. So, for a full sentence, how do we lemmatize correctly based on context?
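For reference, the part-of-speech values the lemmatizer accepts are the WordNet constants wordnet.NOUN ('n'), wordnet.VERB ('v'), wordnet.ADJ ('a'), and wordnet.ADV ('r'). A quick interactive check of how the tag changes the result (the example words here are my own, not from the original post):

>>> from nltk.corpus import wordnet
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wordnet.VERB    # the constants are just one-letter strings
'v'
>>> wnl.lemmatize('helping', wordnet.VERB)
'help'
>>> wnl.lemmatize('cars', wordnet.NOUN)
'car'
>>> wnl.lemmatize('better', wordnet.ADJ)
'good'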
To make lemmatization take the sentence's context into account, we need to determine the part-of-speech (POS) tag of each word (verb, noun, adjective, and so on) and pass it to the lemmatizer. The workflow is: first use NLTK's pos_tag to get the POS tag of each token, then map that tag to the corresponding WordNet POS, and finally lemmatize each token using that POS. Example Python code:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

# function to convert an NLTK POS tag to a WordNet POS tag
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

def lemmatize_sentence(sentence):
    # tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    # pairs of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            # if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:
            # else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)
print(lemmatizer.lemmatize("I am loving it"))  # Output: I am loving it
print(lemmatizer.lemmatize("loving"))          # Output: loving
print(lemmatizer.lemmatize("loving", "v"))     # Output: love
print(lemmatize_sentence("I am loving it"))    # Output: I be love it
That covers a fairly complete routine for lemmatizing a whole sentence.
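A practical note not in the original examples: the code above relies on several NLTK data packages (the tokenizer, the POS tagger, and the WordNet data). If you run into a LookupError, downloading them once is usually enough:

import nltk
nltk.download('punkt')                       # tokenizer models used by word_tokenize
nltk.download('averaged_perceptron_tagger')  # tagger model used by pos_tag
nltk.download('wordnet')                     # WordNet data used by WordNetLemmatizer
nltk.download('omw-1.4')                     # Open Multilingual WordNet, needed by some newer NLTK releases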
As a supplement, I recently found another version on cnblogs that takes a similar approach. The difference is that the version above outputs words with no mapped POS unchanged, whereas the version below lemmatizes such words as nouns before outputting them. Example:
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# map an NLTK POS tag to the corresponding WordNet POS
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

sentence = 'football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal.'
tokens = word_tokenize(sentence)  # tokenize
tagged_sent = pos_tag(tokens)     # POS-tag each token
wnl = WordNetLemmatizer()

lemmas_sent = []
for tag in tagged_sent:
    wordnet_pos = get_wordnet_pos(tag[1]) or wordnet.NOUN       # fall back to noun
    lemmas_sent.append(wnl.lemmatize(tag[0], pos=wordnet_pos))  # lemmatize
print(lemmas_sent)
Output:
['football', 'be', 'a', 'family', 'of', 'team', 'sport', 'that', 'involve', ',', 'to', 'vary', 'degree', ',', 'kick', 'a', 'ball', 'to', 'score', 'a', 'goal', '.']
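To make the noun-fallback version as easy to reuse as lemmatize_sentence above, it can be wrapped into a small function; the name lemmatize_sentence_noun_fallback is just an illustrative choice, not from the cnblogs post:

def lemmatize_sentence_noun_fallback(sentence):
    # lemmatize a sentence, treating words with no mapped POS as nouns
    wnl = WordNetLemmatizer()
    tagged = pos_tag(word_tokenize(sentence))
    lemmas = [wnl.lemmatize(word, pos=get_wordnet_pos(tag) or wordnet.NOUN)
              for word, tag in tagged]
    return " ".join(lemmas)

print(lemmatize_sentence_noun_fallback("I am loving it"))  # expected output: I be love it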
References:
[1] Lemmatize whole sentences with Python and nltk's WordNetLemmatizer
[2] NLTK WordNet Lemmatizer: Shouldn't it lemmatize all inflections of a word?