This article collects typical usage examples of the Python method nltk.tokenize.punkt.PunktSentenceTokenizer.tokenize. If you have been wondering what exactly PunktSentenceTokenizer.tokenize does and how to use it, the curated code examples below should help. You can also read more about the class the method belongs to, nltk.tokenize.punkt.PunktSentenceTokenizer.
The sections below show 15 code examples of PunktSentenceTokenizer.tokenize, ordered by popularity by default.
Example 1: BasePunktWordTokenizer
# Required import: from nltk.tokenize.punkt import PunktSentenceTokenizer [as alias]
# Or: from nltk.tokenize.punkt.PunktSentenceTokenizer import tokenize [as alias]
class BasePunktWordTokenizer(BaseWordTokenizer):
    """Base class for Punkt word tokenization."""

    def __init__(self, language: str = None, sent_tokenizer: object = None):
        """
        :param language: language for sentence tokenization
        :type language: str
        """
        self.language = language
        super().__init__(language=self.language)
        if sent_tokenizer:
            self.sent_tokenizer = sent_tokenizer()
        else:
            punkt_param = PunktParameters()
            self.sent_tokenizer = PunktSentenceTokenizer(punkt_param)

    def tokenize(self, text: str):
        """
        :rtype: list
        :param text: text to be tokenized into sentences
        :type text: str
        """
        sents = self.sent_tokenizer.tokenize(text)
        tokenizer = TreebankWordTokenizer()
        return [item for sublist in tokenizer.tokenize_sents(sents) for item in sublist]
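For context, the sentence-then-word pipeline built in the default branch can be exercised with plain NLTK, without the project's BaseWordTokenizer base class. A minimal sketch (the sample text is invented for illustration):

# Minimal sketch of the default pipeline above, using only NLTK.
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer
from nltk.tokenize.treebank import TreebankWordTokenizer

text = "Punkt splits sentences. Treebank then splits each sentence into words."
sent_tokenizer = PunktSentenceTokenizer(PunktParameters())
sents = sent_tokenizer.tokenize(text)
word_tokenizer = TreebankWordTokenizer()
# Flatten the per-sentence token lists into a single token list.
tokens = [tok for sent_tokens in word_tokenizer.tokenize_sents(sents) for tok in sent_tokens]
print(tokens)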
Example 2: _split_sentences
# Required import: from nltk.tokenize.punkt import PunktSentenceTokenizer [as alias]
# Or: from nltk.tokenize.punkt.PunktSentenceTokenizer import tokenize [as alias]
def _split_sentences(self, text):
    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
    punkt_param = PunktParameters()
    punkt_param.abbrev_types = set(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc'])
    sentence_splitter = PunktSentenceTokenizer(punkt_param)
    sentences = sentence_splitter.tokenize(text)
    return sentences
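The abbrev_types set is what keeps the tokenizer from treating an abbreviation's trailing period as a sentence boundary. A quick check, with an input sentence invented for illustration:

from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

punkt_param = PunktParameters()
punkt_param.abbrev_types = set(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc'])
splitter = PunktSentenceTokenizer(punkt_param)

# Without the abbreviation list, the periods in "Mr." and "Dr." could be taken as sentence ends.
print(splitter.tokenize("Mr. Smith met Dr. Jones. They talked for an hour."))
# Likely output: ['Mr. Smith met Dr. Jones.', 'They talked for an hour.']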
Example 3: tokenize_sentences
# Required import: from nltk.tokenize.punkt import PunktSentenceTokenizer [as alias]
# Or: from nltk.tokenize.punkt.PunktSentenceTokenizer import tokenize [as alias]
def tokenize_sentences(self, untokenized_string: str):
    """Tokenize sentences by reading trained tokenizer and invoking
    ``PunktSentenceTokenizer()``.
    :type untokenized_string: str
    :param untokenized_string: A string containing one or more sentences.
    :rtype: list of strings
    """
    # load tokenizer
    assert isinstance(untokenized_string, str), \
        'Incoming argument must be a string.'
    if self.language == 'latin':
        tokenizer = super()
    elif self.language == 'greek':  # Workaround for regex tokenizer
        self.sent_end_chars = GreekLanguageVars.sent_end_chars
        self.sent_end_chars_regex = '|'.join(self.sent_end_chars)
        self.pattern = rf'(?<=[{self.sent_end_chars_regex}])\s'
    elif self.language in INDIAN_LANGUAGES:
        self.sent_end_chars = SanskritLanguageVars.sent_end_chars
        self.sent_end_chars_regex = '|'.join(self.sent_end_chars)
        self.pattern = rf'(?<=[{self.sent_end_chars_regex}])\s'
    else:
        # Warn that NLTK Punkt is being used by default???
        tokenizer = PunktSentenceTokenizer()
    # make a list of tokenized sentences
    if self.language == 'greek' or self.language in INDIAN_LANGUAGES:
        return re.split(self.pattern, untokenized_string)
    else:
        return tokenizer.tokenize(untokenized_string)
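The Greek and Indic branches fall back to a lookbehind regex split on language-specific sentence-final characters. A self-contained sketch of that pattern, with an assumed set of end characters standing in for GreekLanguageVars.sent_end_chars:

import re

# Assumed stand-in for the language-specific sentence-final characters.
sent_end_chars = ['.', ';', '?']
pattern = rf"(?<=[{'|'.join(sent_end_chars)}])\s"

# Split at whitespace that follows an end character, keeping the punctuation.
print(re.split(pattern, "First clause. Second clause; third clause?"))
# Likely output: ['First clause.', 'Second clause;', 'third clause?']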
Example 4: textrank
# Required import: from nltk.tokenize.punkt import PunktSentenceTokenizer [as alias]
# Or: from nltk.tokenize.punkt.PunktSentenceTokenizer import tokenize [as alias]
def textrank(document):
    pst = PunktSentenceTokenizer()
    sentences = pst.tokenize(document)

    # Bag of Words
    from sklearn.feature_extraction.text import CountVectorizer
    cv = CountVectorizer()
    bow_matrix = cv.fit_transform(sentences)
    from sklearn.feature_extraction.text import TfidfTransformer
    normalized_matrix = TfidfTransformer().fit_transform(bow_matrix)

    ## mirrored matrix where the rows and columns correspond to
    ## sentences, and the elements describe how similar the
    ## sentences are. A score of 1 means the sentences are identical.
    similarity_graph = normalized_matrix * normalized_matrix.T
    similarity_graph.toarray()  # dense view, only useful for inspection

    # PageRank
    import networkx as nx
    nx_graph = nx.from_scipy_sparse_matrix(similarity_graph)
    ## mapping of sentence indices to scores. use them to associate
    ## back to the original sentences and sort them
    scores = nx.pagerank(nx_graph)
    ranked = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
    print(ranked[0][1])
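Note that from_scipy_sparse_matrix was removed in NetworkX 3.0. On current NetworkX the graph-construction step would be written roughly as follows; the tiny similarity matrix here is a stand-in for the normalized_matrix * normalized_matrix.T product computed above:

import networkx as nx
from scipy.sparse import csr_matrix

# Tiny stand-in similarity matrix for three "sentences".
similarity_graph = csr_matrix([[1.0, 0.2, 0.0],
                               [0.2, 1.0, 0.5],
                               [0.0, 0.5, 1.0]])
nx_graph = nx.from_scipy_sparse_array(similarity_graph)  # NetworkX >= 3.0 replacement
print(nx.pagerank(nx_graph))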
Example 5: summarize
# Required import: from nltk.tokenize.punkt import PunktSentenceTokenizer [as alias]
# Or: from nltk.tokenize.punkt.PunktSentenceTokenizer import tokenize [as alias]
def summarize(self):
    punkt_param = PunktParameters()
    punkt_param.abbrev_types = set(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc'])
    sentence_splitter = PunktSentenceTokenizer(punkt_param)
    sentences = sentence_splitter.tokenize(self.text)

    structure = {}
    sentence_objects = []
    for idx in range(len(sentences)):
        obj = {'text': sentences[idx], 'index': idx, 'data': {}}
        sentence_objects.append(obj)
    structure['sentences'] = sentence_objects
    self.sentencecount = len(structure['sentences'])
    structure['ordered'] = []

    # Document-level word frequencies and their total mass.
    structure['weights'] = {'words': FreqDist(nltk.word_tokenize(preprocess(self.text))),
                            'total': 0, 'transformed': 0}
    structure['weights']['total'] = sum(structure['weights']['words'].values())

    # Positional transform: words appearing in sentences near the start or end of the
    # document get their frequencies boosted (the factor runs from 2 at the edges
    # down to 1 in the middle).
    self.sentenceIndex = 0
    for each_sent in structure['sentences']:
        each_sent['data']['tokens'] = nltk.word_tokenize(preprocess(each_sent['text']))
        each_sent['data']['sinTransform'] = (1 - math.sin(self.sentenceIndex * (math.pi / self.sentencecount))) + 1
        for each_word in structure['weights']['words']:
            if each_word in each_sent['data']['tokens']:
                structure['weights']['words'][each_word] *= each_sent['data']['sinTransform']
        self.sentenceIndex += 1
    structure['weights']['transformed'] = sum(structure['weights']['words'].values())

    # Score each sentence by the relative frequency of its tokens.
    self.sentenceIndex = 0
    for each_sent in structure['sentences']:
        each_sent['data']['weights'] = {'words': self.calculate_relative_frequence(each_sent['data']['tokens'],
                                                                                   structure['weights']['words']),
                                        'total': 0}
        each_sent['data']['weights']['total'] = sum(each_sent['data']['weights']['words'].values())
        self.sentenceIndex += 1

    # Keep the highest-weighted sentences, then restore document order.
    structure['ordered'] = sorted(structure['sentences'], key=lambda x: x['data']['weights']['total'], reverse=True)
    structure_keep = structure['ordered'][:self.quota]
    structure_keep.sort(key=lambda x: x['index'])
    for each_sen in structure_keep:
        self.summary.append(each_sen['text'])
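To see what the positional transform does, its values for a hypothetical 10-sentence document are easy to print (a standalone illustration, not part of the original class):

import math

sentence_count = 10
for i in range(sentence_count):
    sin_transform = (1 - math.sin(i * (math.pi / sentence_count))) + 1
    print(i, round(sin_transform, 3))
# i=0 prints 2.0, the factor falls toward 1.0 for the middle sentences and rises
# again near the end, so words seen near the start or end are weighted more heavily.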
Example 6: get_key_sentences
# Required import: from nltk.tokenize.punkt import PunktSentenceTokenizer [as alias]
# Or: from nltk.tokenize.punkt.PunktSentenceTokenizer import tokenize [as alias]
def get_key_sentences(self, n=5):
    '''
    Uses a simple implementation of TextRank to extract the top N sentences
    from a document.
    Sources:
    - Original paper: http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Mihalcea.pdf
    - Super useful blog post: http://joshbohde.com/blog/document-summarization
    - Wikipedia: http://en.wikipedia.org/wiki/Automatic_summarization#Unsupervised_keyphrase_extraction:_TextRank
    '''
    # Tokenize the document into sentences. More NLP preprocessing should also happen here.
    sentence_tokenizer = PunktSentenceTokenizer()
    sentences = sentence_tokenizer.tokenize(self.doc)
    # Calculate word counts and TFIDF vectors
    word_counts = CountVectorizer(min_df=0).fit_transform(sentences)
    normalized = TfidfTransformer().fit_transform(word_counts)
    # Normalized matrix * its transpose yields a sentence-level similarity matrix
    similarity_graph = normalized * normalized.T
    nx_graph = nx.from_scipy_sparse_matrix(similarity_graph)
    scores = nx.pagerank(nx_graph)
    # Return the top n (score, sentence) pairs, highest score first.
    return sorted(((scores[i], s) for i, s in enumerate(sentences)),
                  reverse=True)[:n]
Example 7: get_todo_items
# Required import: from nltk.tokenize.punkt import PunktSentenceTokenizer [as alias]
# Or: from nltk.tokenize.punkt.PunktSentenceTokenizer import tokenize [as alias]
def get_todo_items(text):
    all_items = list()
    tokenizer = PunktSentenceTokenizer()
    sen_tokens = tokenizer.tokenize(text)
    for sen_token in sen_tokens:
        todo_items = list()
        tokens = nltk.word_tokenize(sen_token)
        tags = tagger.tag(tokens)
        stop_words = [word for (word, tag) in tags if tag in (tagVB, tagVBP)]
        ind = -1
        for word in stop_words:
            curr_ind = tokens.index(word)
            if curr_ind != 0 and tags[curr_ind - 1][1] in (tagCC, tagRB):
                to_ind = curr_ind - 1
            else:
                to_ind = curr_ind
            if ind != -1 and abs(to_ind - ind) > 1:
                todo_items.append(' '.join(tokens[ind:get_punctuation_index(tokens, ind, to_ind)]))
            elif ind != -1 and len(todo_items) > 0:
                last_ind = len(todo_items)
                todo_items[last_ind - 1] = ' '.join([todo_items[last_ind - 1], tokens[to_ind - 1]])
            ind = curr_ind
        if ind != -1 and abs(len(tokens) - ind) > 1:
            todo_items.append(' '.join(tokens[ind:get_punctuation_index(tokens, ind, len(tokens))]))
        elif ind != -1 and len(todo_items) > 0:
            last_ind = len(todo_items)
            todo_items[last_ind - 1] = ' '.join([todo_items[last_ind - 1], tokens[len(tokens) - 1]])
        all_items.extend(todo_items)
    return all_items
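The snippet relies on module-level names defined elsewhere in that project (tagger, tagVB, tagVBP, tagCC, tagRB, get_punctuation_index). Plausible stand-ins, assuming Penn Treebank tags and NLTK's pretrained perceptron tagger, could look like this:

import nltk

# Assumed definitions; the original project supplies its own.
tagVB, tagVBP, tagCC, tagRB = 'VB', 'VBP', 'CC', 'RB'
tagger = nltk.tag.PerceptronTagger()  # requires the averaged_perceptron_tagger data

def get_punctuation_index(tokens, start, end):
    """Return the index of the first punctuation token in tokens[start:end], else end."""
    for i in range(start, end):
        if tokens[i] in {'.', ',', ';', ':', '!', '?'}:
            return i
    return end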
Example 8: fractal_representation
# Required import: from nltk.tokenize.punkt import PunktSentenceTokenizer [as alias]
# Or: from nltk.tokenize.punkt.PunktSentenceTokenizer import tokenize [as alias]
def fractal_representation(self):
    punkt_param = PunktParameters()
    for each_paragraph in self.paragraphs:
        buffer_p = paragraph()
        buffer_p.paragraph = each_paragraph
        buffer_p.tokens = nltk.word_tokenize(preprocess(each_paragraph))
        buffer_p.weights['words'] = FreqDist(buffer_p.tokens)
        buffer_p.weights['total'] = {'words': 0, 'sentences': 0}
        punkt_param.abbrev_types = set(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc'])
        sentence_splitter = PunktSentenceTokenizer(punkt_param)
        sentences = sentence_splitter.tokenize(each_paragraph)
        for each_sentence in sentences:
            self.stotal += 1
            buffer_s = sentence()
            buffer_s.sentence = each_sentence
            buffer_s.tokens = nltk.word_tokenize(preprocess(each_sentence))
            if len(buffer_s.tokens) > 0:
                buffer_s.weights['sentence'] = FreqDist(buffer_s.tokens)
                buffer_s.weights['paragraph'] = self.calculate_relative_frequence(buffer_s.tokens, buffer_p.weights['words'])
                buffer_s.weights['document'] = self.calculate_relative_frequence(buffer_s.tokens, self.fractal.weights)
                buffer_s.weights['total'] = {}
                buffer_s.weights['total']['sentence'] = 1
                buffer_s.weights['total']['paragraph'] = sum(buffer_s.weights['paragraph'].values())
                buffer_s.weights['total']['document'] = sum(buffer_s.weights['document'].values())
                self.s_weight += buffer_s.weights['total']['document']
                buffer_p.weights['total']['sentences'] += buffer_s.weights['total']['document']
                buffer_p.sentences.append(buffer_s)
        buffer_p.weights['total']['words'] = sum(buffer_p.weights['words'].values())
        self.fractal.paragraphs.append(buffer_p)
        self.pindex += 1
Example 9: preprocess
# Required import: from nltk.tokenize.punkt import PunktSentenceTokenizer [as alias]
# Or: from nltk.tokenize.punkt.PunktSentenceTokenizer import tokenize [as alias]
def preprocess(phys):
    '''
    :param phys: raw text (bytes or str)
    :return: a list of sentences, processed for searchability
    '''
    if isinstance(phys, bytes):
        phys = phys.decode('utf-8')
    phys = re.sub(r'(\n)+', '. ', phys)
    sentence_tokenizer = PunktSentenceTokenizer()
    sentences = sentence_tokenizer.tokenize(phys)
    for i in range(len(sentences)):
        sentence = sentences[i]
        sentence = sentence.replace('\n', ' ')
        sentence = re.sub(' +', ' ', sentence)
        sentence = re.sub(r'\d+', '', sentence)
        sentence = sentence.replace('-', ' ')
        exclude = string.punctuation
        sentence = ''.join(ch for ch in sentence if ch not in exclude)
        sentence = re.sub(' +', ' ', sentence)
        sentences[i] = sentence
        # sentences[i] = sentence.encode('utf-8')
    # Drop sentences that are now empty or whitespace-only.
    sentences = [s for s in sentences if s.strip()]
    # with open(fname.rstrip('txt')+'json', 'w') as outfile:
    #     json.dump(sentences, outfile)
    return sentences
Example 10: _punkt_sent_tokenize
# Required import: from nltk.tokenize.punkt import PunktSentenceTokenizer [as alias]
# Or: from nltk.tokenize.punkt.PunktSentenceTokenizer import tokenize [as alias]
def _punkt_sent_tokenize(text):
    '''
    Sentence segmentation using NLTK's PunktSentenceTokenizer.
    '''
    punkt_param = PunktParameters()
    punkt_param.abbrev_types = set(config.tokenize_abbrev)
    sentence_splitter = PunktSentenceTokenizer(punkt_param)
    return sentence_splitter.tokenize(text)
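Here config.tokenize_abbrev comes from the surrounding project's configuration module; a stand-in such as the following is enough to try the helper on its own (the abbreviation list is assumed for illustration):

from types import SimpleNamespace

# Hypothetical stand-in for the project's config module.
config = SimpleNamespace(tokenize_abbrev=['dr', 'mr', 'mrs', 'prof', 'etc'])

print(_punkt_sent_tokenize("Prof. Lee arrived late. The seminar started anyway."))
# Likely output: ['Prof. Lee arrived late.', 'The seminar started anyway.']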
Example 11: sentences
# Required import: from nltk.tokenize.punkt import PunktSentenceTokenizer [as alias]
# Or: from nltk.tokenize.punkt.PunktSentenceTokenizer import tokenize [as alias]
def sentences(self):
    try:
        return self.sentences_list
    except AttributeError:
        sentence_tokenizer = SentenceTokenizer()
        self.sentences_list = sentence_tokenizer.tokenize(self.corpus)
        return self.sentences_list
Example 12: __init__
# Required import: from nltk.tokenize.punkt import PunktSentenceTokenizer [as alias]
# Or: from nltk.tokenize.punkt.PunktSentenceTokenizer import tokenize [as alias]
class TagExtractor:
    """Extracts tags from a body of text using the NLTK toolkit."""

    def __init__(self):
        """Creates the default sentence tokenizer, chunk parser, and production list."""
        self.sentence_tokenizer = PunktSentenceTokenizer()
        self.parser = nltk.RegexpParser(GRAMMAR)
        self.productions = ['NP', 'VB', 'ADV']

    def __is_just_stop_words(self, words):
        return not any(word not in STOP_WORDS for word in words)

    def extract_tags(self, text):
        """Extract tags from the text."""
        tags = {}
        for sentence in self.sentence_tokenizer.tokenize(text):
            chunks = self.__chunk_sentence(sentence)
            for production in chunks.productions():
                tag_tokens = []
                pos = production.lhs().symbol()
                if pos in self.productions:
                    for (word, x) in production.rhs():
                        # Preprocess and, potentially, filter out the word.
                        trimmed = filter_word(trim_word(word))
                        if trimmed:
                            tag_tokens.append(trimmed.lower())
                    if len(tag_tokens) > 0:
                        tag_text = ' '.join(tag_tokens)
                        if self.__is_just_stop_words(tag_tokens):
                            continue
                        tag = self.__lookup_tag(tags, tag_text, pos)
                        tag.increment_occurs()
                        tag.set_pos(pos)
        results = sorted(tags.values(), key=tag_compare_key)
        return results

    def __lookup_tag(self, tags, text, pos):
        tag = tags.get(self.__get_tag_key(text, pos))
        if not tag:
            tag = Tag(text, 0, pos)
            tags[self.__get_tag_key(text, pos)] = tag
        return tag

    def __get_tag_key(self, text, pos):
        """I want to keep the way we look up tags flexible so that I can easily change my mind
        on what uniquely identifies a tag (e.g. just the text? the text and the part of speech?).
        That is why all the logic for looking up tags is in this one method."""
        return text

    def __chunk_sentence(self, sentence):
        """Tokenize the sentence into words with a whitespace tokenizer, so that a token
        like "couldn't" is not split into two tokens (could and n't).
        Then chunk the tokens according to GRAMMAR.
        """
        tokenizer = WhitespaceTokenizer()
        tokens = tokenizer.tokenize(sentence)
        pos_tagged = nltk.pos_tag(tokens)
        return self.parser.parse(pos_tagged)
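GRAMMAR, STOP_WORDS, Tag, filter_word, trim_word, and tag_compare_key are defined elsewhere in that project. For nltk.RegexpParser, a chunk grammar compatible with the 'NP'/'VB'/'ADV' productions used above could look roughly like this (an assumed example, not the project's actual grammar):

import nltk

# Hypothetical chunk grammar; the real GRAMMAR lives in the original project.
GRAMMAR = r"""
    NP:  {<DT>?<JJ>*<NN.*>+}   # noun phrase: optional determiner, adjectives, nouns
    VB:  {<VB.*>+}             # verb group
    ADV: {<RB.*>+}             # adverb group
"""

parser = nltk.RegexpParser(GRAMMAR)
print(parser.parse(nltk.pos_tag("the quick brown fox jumps swiftly".split())))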
Example 13: transform
# Required import: from nltk.tokenize.punkt import PunktSentenceTokenizer [as alias]
# Or: from nltk.tokenize.punkt.PunktSentenceTokenizer import tokenize [as alias]
def transform(self, documents):
    sentence_splitter = PunktSentenceTokenizer()
    for doc in documents:
        if 'sentences' not in doc.ext:
            doc.ext['sentences'] = [s.strip() for s in sentence_splitter.tokenize(doc.text)]
    # for doc in documents:
    #     if 'sentences' not in doc.ext:
    #         doc.ext['sentences'] = [s.strip() for s in doc.text.split('.') if s]
    return documents
Example 14: parse
# Required import: from nltk.tokenize.punkt import PunktSentenceTokenizer [as alias]
# Or: from nltk.tokenize.punkt.PunktSentenceTokenizer import tokenize [as alias]
def parse(text):
    """Use NLTK's PunktSentenceTokenizer to convert the text string into
    a list of English-language sentences."""
    punkt_param = PunktParameters()
    punkt_param.abbrev_types = set(ABBREVIATIONS)
    sentence_splitter = PunktSentenceTokenizer(punkt_param)
    return sentence_splitter.tokenize(preprocess(text))
Example 15: bayesSentiment
# Required import: from nltk.tokenize.punkt import PunktSentenceTokenizer [as alias]
# Or: from nltk.tokenize.punkt.PunktSentenceTokenizer import tokenize [as alias]
def bayesSentiment(self, text):
    from nltk.tokenize.punkt import PunktSentenceTokenizer
    from senti_classifier import senti_classifier
    # break up text into sentences
    stzr = PunktSentenceTokenizer()
    sents = stzr.tokenize(text)
    pos_score, neg_score = senti_classifier.polarity_scores(sents)
    # print(pos_score, neg_score)
    return [pos_score, neg_score]