This article collects typical usage examples of the nltk.FreqDist.hapaxes method in Python. If you have been wondering how FreqDist.hapaxes works or how to call it in practice, the curated examples here may help; see also the containing class, nltk.FreqDist.
Three code examples of the FreqDist.hapaxes method are shown below.
Example 1: percentage
# Required module import: from nltk import FreqDist [as alias]
# Or: from nltk.FreqDist import hapaxes [as alias]
from nltk.book import text4  # text4 is the Inaugural Address Corpus

def percentage(count, total):
    return 100 * count / total

def lexical_diversity(text):  # helper defined in the NLTK book, ch. 1
    return len(set(text)) / len(text)

lexical_diversity(text4)
percentage(text4.count('a'), len(text4))
# Simple statistics
from nltk import FreqDist
# Counting Words Appearing in a Text (a frequency distribution)
fdist1 = FreqDist(text4)
fdist1
vocabulary1 = list(fdist1.keys())  # list of all the distinct types in the text (dict_keys must be wrapped in list() before slicing)
vocabulary1[:3]  # look at first 3
# words that occur only once, called hapaxes
fdist1.hapaxes()[:20]
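# A minimal sketch of what hapaxes() computes: it is equivalent to filtering
# the distribution for words whose count is exactly 1 (manual_hapaxes is a
# hypothetical name used only for this illustration)
manual_hapaxes = [w for w in fdist1 if fdist1[w] == 1]
set(manual_hapaxes) == set(fdist1.hapaxes())  # True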
# Words that meet a condition, e.g. words that are long
V = set(text4)
long_words = [w for w in V if len(w) > 15]
sorted(long_words)
# Finding words that characterize a text: relatively long and frequently occurring
fdist = FreqDist(text4)
sorted([w for w in set(text4) if len(w) > 7 and fdist[w] > 7])
# Collocations and Bigrams.
# A collocation is a sequence of words that occur together unusually often.
# Built in collocations function
text4.collocations()
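# For finer control than text4.collocations(), NLTK also exposes the
# underlying collocation machinery; a minimal sketch, assuming the nltk.book
# texts are loaded, ranking bigrams by pointwise mutual information (PMI):
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
finder = BigramCollocationFinder.from_words(text4)
finder.apply_freq_filter(3)  # ignore bigrams that occur fewer than 3 times
finder.nbest(BigramAssocMeasures().pmi, 10)  # the 10 highest-scoring bigrams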
Example 2: FreqDist
# Required module import: from nltk import FreqDist [as alias]
# Or: from nltk.FreqDist import hapaxes [as alias]
from nltk.corpus import brown  # the Brown Corpus
fd = FreqDist(brown.words())
# Find the most frequent words in a text:
# http://stackoverflow.com/questions/268272/getting-key-with-maximum-value-in-dictionary
import operator
max(fd.items(), key=operator.itemgetter(1))  # items(), not the Python 2-only iteritems()
sorted(fd.items(), key=operator.itemgetter(1), reverse=True)[:10]
# Or use the wrapper function
fd.most_common(10)
# plot the most frequent words
fd.plot(10)
fd.plot(10, cumulative=True)
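# tabulate() prints the same counts as a plain-text table, a quick
# alternative when a plotting backend is unavailable
fd.tabulate(10)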
# See the words that occur only once (these words are called hapaxes)
fd.hapaxes()
# Count all the tokens in a text
from nltk.book import text1  # text1 is Moby Dick
len(text1)
# count unique words
len(set(text1))
# count unique words, irrespective of word case
len(set(w.lower() for w in text1))
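# The counts above combine into a type-token ratio, a rough measure of
# lexical diversity (type_token_ratio is a hypothetical helper name)
def type_token_ratio(text):
    return len(set(w.lower() for w in text)) / len(text)
type_token_ratio(text1)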
# Find the words that are more than 15 characters long
words = set(brown.words())
long_words = [w for w in words if len(w) > 15]
# Words that are longer than 7 characters and occur more than 7 times
long_and_frequent = sorted(w for w in set(brown.words()) if len(w) > 7 and fd[w] > 7)
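# A related sketch: the share of the Brown vocabulary that occurs exactly
# once, combining hapaxes() with the vocabulary size B()
len(fd.hapaxes()) / fd.B()  # fraction of word types that are hapaxes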
Example 3: FreqDist
# Required module import: from nltk import FreqDist [as alias]
# Or: from nltk.FreqDist import hapaxes [as alias]
import nltk
from nltk.corpus import gutenberg # import the gutenberg corpus
##################################################################
## FreqDist tracks the sample frequencies in a distribution
from nltk import FreqDist # import the FreqDist class
fd = FreqDist(gutenberg.words('austen-persuasion.txt')) # instantiate a frequency distribution over the tokens in the text
print(fd) # <FreqDist with 6132 samples and 98171 outcomes>; 6132 distinct types, 98171 tokens (matching fd.B() and fd.N() below)
print(type(fd)) # <class 'nltk.probability.FreqDist'>
print(fd['the']) # 3120; look up how often a word occurs; a FreqDist behaves like a dict
print(fd.N()) # 98171; total number of word tokens (not characters), counted with repetition
print(fd.B()) # 6132; number of bins, i.e. unique samples; identical words fall into the same bin
print(len(fd.keys()), type(fd.keys())) # 6132 <class 'dict_keys'>
print(fd.keys()) # fd.B() only gives the count; this prints the whole vocabulary
print(fd.max()) # the single most frequent word
print(fd.freq('the')) # 0.03178127960395636; relative frequency, i.e. 3120 / 98171
print(fd.hapaxes()) # ['[', 'Persuasion', 'Jane', ...]; the rare words that occur only once
# The most frequent words are mostly function words, and the extremely rare ones (hapaxes) can only be understood from context; neither the most frequent nor the least frequent words tend to characterize a text
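# A sketch of one common workaround, using NLTK's English stopword list as a
# stand-in for function words: dropping them leaves frequent words that are
# more characteristic of the text (assumes nltk.download('stopwords') has run)
from nltk.corpus import stopwords
stops = set(stopwords.words('english'))
content_fd = FreqDist(w.lower() for w in gutenberg.words('austen-persuasion.txt')
                      if w.isalpha() and w.lower() not in stops)
print(content_fd.most_common(10))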
for idx, word in enumerate(fd): # enumerate iterates the samples in first-occurrence order
    if idx == 5: break
    print(idx, word) # 0 [; 1 Persuasion; 2 by; 3 Jane; 4 Austen
##################################################################
## Frequency distribution of word lengths
fdist = FreqDist(len(w) for w in gutenberg.words('austen-persuasion.txt'))
print(fdist) # <FreqDist with 16 samples and 98171 outcomes>
print(fdist.items()) # dict_items([(1, 16274), (10, 1615), (2, 16165), (4, 15613), (6, 6538), (7, 5714), (3, 20013), (8, 3348), (13, 230), (9, 2887), (5, 8422), (11, 768), (12, 486), (14, 69), (15, 25), (16, 4)])
print(fdist.most_common(3)) # [(3, 20013), (1, 16274), (2, 16165)]
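# freq() turns a count into a proportion; checking against the items above,
# 3-letter words make up about a fifth of all tokens
print(fdist.freq(3))  # 20013 / 98171 ≈ 0.2039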
##################################################################
## Frequency distribution of English letters
fdist = nltk.FreqDist(ch.lower() for ch in gutenberg.raw('austen-persuasion.txt') if ch.isalpha()) # the generator can be passed directly; no need to wrap it in [] as a list
print(fdist.most_common(5)) # [('e', 46949), ('t', 32192), ('a', 29371), ('o', 27617), ('n', 26718)]
print([char for (char, count) in fdist.most_common()]) # the 26 letters, sorted by frequency of use