This article collects typical usage examples of the Filter.Filter.check_duplicates method in Python. If you are wondering how Filter.check_duplicates is used in practice, the selected code examples below may help. You can also look further into usage examples of the class the method belongs to, Filter.Filter.
Two code examples of the Filter.check_duplicates method are shown below, sorted by popularity by default.
Example 1: time
# Required import: from Filter import Filter [as alias]
# Or: from Filter.Filter import check_duplicates [as alias]
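from time import time
import random

# Assumes `cursor` (an iterable of documents with "text" and "_id" keys),
# `tweet_filter` (a Filter instance), `q`, `corpus` and `ids` are defined
# by the surrounding code, which this excerpt does not show.
for document in cursor:
    # collapse runs of whitespace (Python 3: str.split, no encode needed)
    text = ' '.join(document["text"].split())
    corpus.append(text)
    ids.append(document["_id"])
# filter repeated tweets
t0 = time()
i = 0
status = -1
unique_tweets = ["Dummy Tweet"]  # sentinel so the inner loop always sets status
length = len(corpus)
print("Filtering tweets may take a few minutes...")
for document in corpus:
    for tweet in unique_tweets:
        status = tweet_filter.check_duplicates(document, tweet)
        if status:
            # document duplicates a tweet that was already kept
            break
    if not status:
        unique_tweets.append(document)
        i += 1
    if i > 3000:
        break
print("done in %0.3fs." % (time() - t0))
unique_tweets.pop(0)  # drop the sentinel
corpus = unique_tweets
# draw a random sample of indices (random.sample draws without replacement)
random_indices = random.sample(range(0, len(corpus)), q.num_of_docs)
# Open file I/O streams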
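The Filter class itself is not shown in these examples, so the loop above cannot be run as-is. The following is a minimal stand-in for experimenting with it, not the original implementation. The class name SimilarityFilter, the use of difflib.SequenceMatcher, and the reading of the constructor argument as a similarity percentage are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


class SimilarityFilter:
    """Hypothetical stand-in for Filter; NOT the original implementation."""

    def __init__(self, threshold=45):
        # Assumption: threshold is a similarity percentage between 0 and 100.
        self.threshold = threshold

    def check_duplicates(self, candidate, reference):
        """Return True when the two strings are at least `threshold` percent similar."""
        ratio = SequenceMatcher(None, candidate, reference).ratio()
        return ratio * 100 >= self.threshold


tweet_filter = SimilarityFilter(45)
print(tweet_filter.check_duplicates("good morning world", "good morning world!"))
print(tweet_filter.check_duplicates("good morning world", "xyz"))
```

With such a stand-in, check_duplicates returns a truthy value for near-identical strings, which is the only behavior the loops in these examples rely on.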
Example 2: str
# Required import: from Filter import Filter [as alias]
# Or: from Filter.Filter import check_duplicates [as alias]
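import json
import os

from Filter import Filter

# Assumes `cursor`, `months`, `month` and `day` are defined by the
# surrounding code, which this excerpt does not show.
# Open file I/O streams
directory = os.path.dirname(os.getcwd())
fn = "sample_" + str(months[month]) + "_" + str(day) + ".json"
f = open(directory + "/data/" + fn, "w+")
# load tweet with id
corpus = [{"text": "dummy"}]  # sentinel so the inner loop always sets status
tweetFilter = Filter(45)
i = 0
print("Filtering Results...")
for document in cursor:
    document["_id"] = str(document["_id"])  # make the id JSON-serializable
    document["text"] = document["text"].replace('"', "'")
    for tweet in corpus:
        # A match means the document duplicates a tweet that was already kept
        status = tweetFilter.check_duplicates(document["text"], tweet["text"])
        if status:
            break
    if not status:
        # Keep the whole document: appending only the text string would make
        # tweet["text"] fail on the next pass through the inner loop
        corpus.append(document)
        i += 1
    if i >= 100:
        break
print(i)
# Remove header
corpus.pop(0)
json.dump(corpus, f, indent=1)
f.close()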
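Both examples follow the same pattern: compare each new item against everything kept so far and keep it only if nothing matches. That pattern can be sketched as a small helper that needs no dummy first element. The function name deduplicate and its parameters are hypothetical, and is_duplicate stands for any two-argument predicate, such as a bound check_duplicates method.

```python
def deduplicate(texts, is_duplicate, limit=None):
    """Keep each text only if no previously kept text matches it.

    `is_duplicate(a, b)` is any predicate returning a truthy value for a match;
    `limit` optionally caps the number of unique items collected.
    """
    unique = []
    for text in texts:
        if limit is not None and len(unique) >= limit:
            break
        # any() stops at the first match, like the `break` in the inner loop
        if not any(is_duplicate(text, kept) for kept in unique):
            unique.append(text)
    return unique


# Exact-match predicate used purely for demonstration
sample = ["a", "b", "a", "c"]
print(deduplicate(sample, lambda x, y: x == y))
print(deduplicate(sample, lambda x, y: x == y, limit=2))
```

Each item is compared against every kept item, so the cost grows quadratically with the number of unique items, which is consistent with the "may take a few minutes" warning in Example 1.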