

Python WordNetLemmatizer.isdigit method code examples

This article collects typical usage examples of the WordNetLemmatizer.isdigit method from Python's nltk.stem.wordnet. If you are wondering what WordNetLemmatizer.isdigit does or how to use it, the selected code example below may help. (Strictly speaking, `isdigit` is not a method of WordNetLemmatizer at all: it is the built-in `str.isdigit` method, called on the lemma string that `lemmatize` returns.) You can also explore other usage examples of nltk.stem.wordnet.WordNetLemmatizer.


Below is 1 code example of WordNetLemmatizer.isdigit, sorted by popularity by default. You can upvote the examples you like or find useful; your feedback helps the system recommend better Python code examples.

Example 1: tokenize_document

# Required import: from nltk.stem.wordnet import WordNetLemmatizer [as alias]
# Or: from nltk.stem.wordnet.WordNetLemmatizer import isdigit [as alias]
import traceback

import nltk
from nltk.stem.wordnet import WordNetLemmatizer

def tokenize_document(doc, is_ted, is_only_nouns):
    """
    Extract tokens from a text string. The text is pre-processed and filtered,
    then run through the NLTK tokenizer; if the flag is enabled, the tokens
    are POS-tagged and only the nouns are kept; finally the tokens are
    lemmatized.
    PARAMETERS:
       1. doc: the text string to extract tokens from
       2. is_ted: whether to add the custom stopwords prepared for the TED
          talks corpus to the standard English stopword list
       3. is_only_nouns: whether to keep only the tokens tagged as nouns
    RETURNS:
       A list of strings, one per token extracted from the given text
    """
    res = []

    try:
        # First pre-process and filter the given text
        # (remove_punctuation_stopwords is defined elsewhere in base.py)
        doc2 = remove_punctuation_stopwords(doc, is_ted)
        # Tokenize the pre-processed, filtered text with NLTK
        # (the original used PunktWordTokenizer, which was removed in
        # NLTK 3.x; word_tokenize is the closest modern replacement)
        tokens = nltk.word_tokenize(' '.join(doc2))
        # If the flag is enabled, keep only the tokens tagged as nouns
        if is_only_nouns:
            tagged_tokens = nltk.pos_tag(tokens)
            tokens = [token for token, tag in tagged_tokens
                      if tag in ('NN', 'NNP', 'NNS')]
        # Lemmatize the tokens using the NLTK WordNet lemmatizer
        lemmatizer = WordNetLemmatizer()
        for token in tokens:
            lema = lemmatizer.lemmatize(token)
            # If noun lemmatization left the token unchanged, try verb
            # lemmatization instead
            if lema == token:
                lema = lemmatizer.lemmatize(token, 'v')
            # Keep lemmas longer than one character that are not pure digits
            if len(lema) > 1 and not lema.isdigit():
                # Append the lemma to the result to be returned
                res.append(lema)
    except Exception:
        print("tokenize_document")
        print("")
        traceback.print_exc()

    return res
Developer: uyjco0  Project: ted-scimu  Lines: 43  Source file: base.py
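Despite the page title, `isdigit` comes from `str`, not from WordNetLemmatizer: in the example above it is called on the lemma string returned by `lemmatize`. The filtering step can be sketched on its own with plain strings (the lemma list below is made up for illustration):

```python
# Hypothetical lemmas, as lemmatize() might return them
lemmas = ['talk', 'run', '2009', 'idea', 'a', '42']

# Keep lemmas longer than one character that are not pure digit strings,
# mirroring the `(len(lema) > 1) and (not lema.isdigit())` check above
kept = [lema for lema in lemmas if len(lema) > 1 and not lema.isdigit()]
print(kept)  # ['talk', 'run', 'idea']
```

Note that `str.isdigit` returns True only when every character in the string is a digit, so mixed tokens such as `'4g'` still pass the filter.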


Note: the nltk.stem.wordnet.WordNetLemmatizer.isdigit examples in this article were compiled by 纯净天空 from open-source code and documentation platforms such as GitHub and MSDocs. The snippets are selected from open-source projects contributed by their authors, and copyright in the source code remains with the original authors; consult the corresponding project's license before distributing or using it. Do not reproduce without permission.