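This article collects typical usage examples of nltk.PorterStemmer.strip in Python. If you are wondering how PorterStemmer.strip is used in practice, or are looking for a working example, the curated code example here may help. You can also explore further usage examples of the enclosing class, nltk.PorterStemmer.
One code example is shown below. Note that in this example strip() is called on the token string before it is passed to the stemmer, so it is the built-in str.strip rather than a method of PorterStemmer.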
Example 1: processEmail
# Required imports:
import re
from nltk import PorterStemmer
# gvl is not defined in this excerpt; it is assumed to be a local module that provides getVocabList().
def processEmail(email_contents):
    # PROCESSEMAIL preprocesses the body of an email and
    # returns a list of word_indices
    #   word_indices = processEmail(email_contents) preprocesses
    #   the body of an email and returns a list of indices of the
    #   words contained in the email.
    #

    # Load Vocabulary
    vocabList = gvl.getVocabList()

    # Init return value
    word_indices = []

    # ========================== Preprocess Email ===========================

    # Find the Headers ( \n\n and remove )
    # Uncomment the following lines if you are working with raw emails with the
    # full headers
    # hdrstart = email_contents.find("\n\n")
    # if hdrstart != -1:  # find() returns -1 (which is truthy) when there is no match
    #     email_contents = email_contents[hdrstart:]

    # Lower case
    email_contents = email_contents.lower()

    # Strip all HTML
    # Looks for any expression that starts with < and ends with >, and does
    # not have any < or > inside the tag, and replaces it with a space
    email_contents = re.sub(r'<[^<>]+>', ' ', email_contents)

    # Handle Numbers
    # Look for one or more characters between 0-9
    email_contents = re.sub(r'[0-9]+', 'number', email_contents)

    # Handle URLS
    # Look for strings starting with http:// or https://
    email_contents = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email_contents)

    # Handle Email Addresses
    # Look for strings with @ in the middle
    email_contents = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email_contents)

    # Handle $ sign
    email_contents = re.sub(r'[$]+', 'dollar', email_contents)

    # ========================== Tokenize Email ===========================

    # Output the email to screen as well
    print('\n==== Processed Email ====\n\n')

    # Process file
    l = 0

    # Slightly different order from matlab version
    # Split and also get rid of any punctuation
    # (the hyphen is escaped so that ".-:" is not read as a character range)
    email_contents = re.split(r'[@$/#.\-:&*+=\[\]?!(){},\'">_<;%\s]+', email_contents)

    # Create the stemmer once instead of once per token
    stemmer = PorterStemmer()

    for token in email_contents:
        # Remove any non alphanumeric characters
        token = re.sub(r'[^a-zA-Z0-9]', '', token)

        # Stem the word (stem() replaces stem_word(), which recent NLTK
        # releases no longer provide)
        token = stemmer.stem(token.strip())

        # Skip the word if it is too short
        if len(token) < 1:
            continue

        # Look up the word in the dictionary and add to word_indices if
        # found
        # ====================== YOUR CODE HERE ======================
        # Instructions: Fill in this function to add the index of token to
        #               word_indices if it is in the vocabulary. At this point
        #               of the code, you have a stemmed word from the email in
        #               the variable token. You should look up token in the
        #               vocabulary (vocabList). If a match exists, you
        #               should add the index of the word to the word_indices
        #               list. Concretely, if token = 'action', then you should
        #               look up the vocabulary to find the index stored for
        #               'action'. For example, if vocabList['action'] = 18,
        #               then you should add 18 to the word_indices
        #               list (e.g., word_indices.append(18)).
        #
        # Note: vocabList[word] returns the index of word in the
        #       vocabulary (vocabList maps each word to its index).
        #
        # Note: You can use == to compare two strings (str1 == str2).
        #       It is True only if the two strings are equal.
        #
        idx = vocabList[token] if token in vocabList else 0

        # only add entries which are in vocabList
        # i.e. those with idx != 0,
        # ......... part of the code is omitted here .........
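
The pipeline above can be tried without NLTK or the vocabulary file using the minimal sketch below. It is an illustration under stated assumptions, not the omitted part of processEmail: normalize_email, tokens_to_indices, toy_vocab and the stem parameter are hypothetical names introduced here, the identity default for stem stands in for PorterStemmer().stem, and the final "append when idx != 0" step is a guess based on the last comment in the example.

```python
import re


def normalize_email(text):
    """Apply the same substitutions as the 'Preprocess Email' section."""
    text = text.lower()
    text = re.sub(r'<[^<>]+>', ' ', text)                      # strip HTML tags
    text = re.sub(r'[0-9]+', 'number', text)                   # digits -> 'number'
    text = re.sub(r'(http|https)://[^\s]*', 'httpaddr', text)  # URLs
    text = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', text)         # email addresses
    text = re.sub(r'[$]+', 'dollar', text)                     # dollar signs
    return text


def tokens_to_indices(text, vocab, stem=lambda word: word):
    """Split text into tokens and return the vocab index of each known token.

    vocab is assumed to map word -> positive index, as vocabList does in the
    example. stem is a hypothetical hook; pass PorterStemmer().stem to stem.
    """
    indices = []
    for token in re.split(r'[@$/#.\-:&*+=\[\]?!(){},\'">_<;%\s]+', text):
        token = re.sub(r'[^a-zA-Z0-9]', '', token)
        token = stem(token.strip())
        if len(token) < 1:
            continue
        idx = vocab.get(token, 0)
        # Assumption: only tokens found in the vocabulary are recorded.
        if idx != 0:
            indices.append(idx)
    return indices


# Hypothetical three-word vocabulary, for illustration only.
toy_vocab = {'visit': 1, 'httpaddr': 2, 'emailaddr': 3}
sample = 'Visit <b>http://example.com</b> or mail me@example.com, only $10!'
print(tokens_to_indices(normalize_email(sample), toy_vocab))  # → [1, 2, 3]
```

Keeping normalization separate from the lookup makes each regular expression easy to check on its own. The order of substitutions matters: digits are replaced before the $ sign, so "$10" ends up as the single token "dollarnumber".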