This article collects typical usage examples of the Python method sklearn.feature_extraction.text.TfidfVectorizer.build_preprocessor. If you are unsure what TfidfVectorizer.build_preprocessor does or how to use it, the curated examples below may help. You can also explore further usage of the class it belongs to, sklearn.feature_extraction.text.TfidfVectorizer.
Below are 2 code examples of the TfidfVectorizer.build_preprocessor method, sorted by popularity by default.
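For orientation, here is a minimal self-contained sketch (the sample text is invented for illustration) of what the three builder methods return: build_preprocessor gives the lowercasing/accent-stripping step, build_tokenizer the token splitter, and build_analyzer the full pipeline including stop-word removal and n-gram generation.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words='english',
                             strip_accents='unicode')
text = "The Quick Brown Fox!"
print(vectorizer.build_preprocessor()(text))  # 'the quick brown fox!' (lowercased, accents stripped)
print(vectorizer.build_tokenizer()(text))     # ['The', 'Quick', 'Brown', 'Fox'] (raw tokens, no preprocessing)
print(vectorizer.build_analyzer()(text))      # ['quick', 'brown', 'fox', 'quick brown', 'brown fox']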
Example 1: TfidfVectorizer
# Required import: from sklearn.feature_extraction.text import TfidfVectorizer [as alias]
# Or: from sklearn.feature_extraction.text.TfidfVectorizer import build_preprocessor [as alias]
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
# nyt_data, nyt_labels, trainset_size, X_train and y_train come from the elided part of the script
X_test = np.array([''.join(el) for el in nyt_data[trainset_size + 1:len(nyt_data)]])
y_test = np.array([el for el in nyt_labels[trainset_size + 1:len(nyt_labels)]])
#print(X_train)
vectorizer = TfidfVectorizer(min_df=2,
ngram_range=(1, 2),
stop_words='english',
strip_accents='unicode',
norm='l2')
test_string = str(nyt_data[0])
print("Example string: " + test_string)
print("Preprocessed string: " + vectorizer.build_preprocessor()(test_string))
print("Tokenized string: " + str(vectorizer.build_tokenizer()(test_string)))
print("N-gram data string: " + str(vectorizer.build_analyzer()(test_string)))
X_train = vectorizer.fit_transform(X_train)  # learn the vocabulary and idf weights on the training text
X_test = vectorizer.transform(X_test)  # reuse the fitted vocabulary on the test text
svm_classifier = LinearSVC().fit(X_train, y_train)
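The original excerpt ends here; a minimal continuation sketch for scoring the classifier on the held-out split, assuming the X_test and y_test variables above:

from sklearn.metrics import accuracy_score

y_pred = svm_classifier.predict(X_test)  # predict labels for the held-out articles
print("Test accuracy: %.3f" % accuracy_score(y_test, y_pred))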
Example 2: CountVectorizer
# Required import: from sklearn.feature_extraction.text import TfidfVectorizer [as alias]
# Or: from sklearn.feature_extraction.text.TfidfVectorizer import build_preprocessor [as alias]
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
# Fragment: tail of an elided loop that assigns 0/1 sentiment labels to the test rows
test_data[i, 1] = 0
count_pos_test = count_neg_test + 1
label_test = test_data[:, 1]  # labels for the test set
#vctr = CountVectorizer(stop_words='english', min_df=1)
#vctr2 = HashingVectorizer(stop_words='english')
vctr = TfidfVectorizer(stop_words='english')  # initialising the vectorizer; TF-IDF gave about 1% better accuracy than the alternatives above
count_pos = 0
count_neg = 0
######################################################################################################
train = []
test = []
for i in range(len(train_data)):  # preprocessing of the train data
    string = train_data[i, 0]
    string = vctr.build_preprocessor()(string)  # the preprocessor already lowercases by default
    tokens = vctr.build_tokenizer()(string)
    train.append(' '.join(tokens))
for i in range(len(test_data)):  # preprocessing of the test data
    string = test_data[i, 0]
    string = vctr.build_preprocessor()(string)
    tokens = vctr.build_tokenizer()(string)
    test.append(' '.join(tokens))
######################################################################################################
train_data1 = vctr.fit_transform(train).toarray()  # fit the vocabulary and build the TF-IDF bag-of-words matrix
#X_test = vctr.transform(test).toarray()
y_train = np.asarray(label_train, dtype="|S6")  # label_train comes from the elided code above
y_train = y_train.astype(int)
clf1 = GradientBoostingClassifier(n_estimators=500)  # initialising the classifier
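The excerpt stops before the classifier is actually trained; a hedged sketch of the remaining steps, assuming the variables above (the commented-out transform of test is reused here as test_data1):

from sklearn.metrics import accuracy_score

test_data1 = vctr.transform(test).toarray()  # reuse the fitted vocabulary on the test texts
y_test = np.asarray(label_test, dtype="|S6").astype(int)
clf1.fit(train_data1, y_train)  # train the gradient-boosted trees
print("Accuracy: %.3f" % accuracy_score(y_test, clf1.predict(test_data1)))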