

Java NGramTokenizer.setDelimiters Method Code Examples

This article collects typical usage examples of the Java method weka.core.tokenizers.NGramTokenizer.setDelimiters. If you are wondering how NGramTokenizer.setDelimiters is used in practice, the curated examples below may help. You can also explore further usage examples of the enclosing class, weka.core.tokenizers.NGramTokenizer.


Below are 3 code examples of NGramTokenizer.setDelimiters, sorted by popularity.

Example 1: getStringToWordVectorFilter

import weka.core.tokenizers.NGramTokenizer; // import the package/class this method depends on
private StringToWordVector getStringToWordVectorFilter(Instances instances) throws Exception {
  StringToWordVector stringToWordVector = new StringToWordVector();
  stringToWordVector.setAttributeIndices(indicesToRangeList(new int[]{
    instances.attribute(SURFACE_TEXT_AND_POS_TAG_OF_TWO_PRECEDING_AND_FOLLOWING_TOKENS_AROUND_THE_DESC_CANDIDATE).index(),
    instances.attribute(SURFACE_TEXT_AND_POS_TAG_OF_THREE_PRECEDING_AND_FOLLOWING_TOKENS_AROUND_THE_PAIRED_MATH_EXPR).index(),
    instances.attribute(SURFACE_TEXT_OF_THE_FIRST_VERB_THAT_APPEARS_BETWEEN_THE_DESC_CANDIDATE_AND_THE_TARGET_MATH_EXPR).index(),
    instances.attribute(SURFACE_TEXT_AND_POS_TAG_OF_DEPENDENCY_WITH_LENGTH_3_FROM_IDENTIFIER).index(),
    instances.attribute(SURFACE_TEXT_AND_POS_TAG_OF_DEPENDENCY_WITH_LENGTH_3_FROM_DEFINIEN).index()}));
  stringToWordVector.setWordsToKeep(1000);
  NGramTokenizer nGramTokenizer = new NGramTokenizer();
  nGramTokenizer.setNGramMaxSize(3);
  nGramTokenizer.setNGramMinSize(1);
  nGramTokenizer.setDelimiters(nGramTokenizer.getDelimiters().replaceAll(":", ""));
  stringToWordVector.setTokenizer(nGramTokenizer);
  stringToWordVector.setInputFormat(instances);
  return stringToWordVector;
}
 
Developer ID: ag-gipp, Project: mathosphere, Lines of code: 18, Source file: WekaLearner.java
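Example 1 narrows Weka's default delimiter set by editing the string returned from getDelimiters() rather than hard-coding a new one. Note that String.replaceAll treats its first argument as a regular expression, so removing a metacharacter such as "." or "?" this way requires escaping. A minimal sketch of the string edit, with no Weka dependency (the default delimiter string below is an assumption, not taken from the Weka source):

```java
// Sketch: removing one character from a delimiter string, as Example 1
// does with the value of NGramTokenizer.getDelimiters().
public class DelimiterEdit {
    // Assumed default delimiter string; the real value comes from getDelimiters().
    static final String DEFAULT_DELIMITERS = " \r\n\t.,;:'\"()?!";

    // Remove a single delimiter character. Pattern.quote guards against
    // regex metacharacters such as '.' or '?', which a bare replaceAll
    // would misinterpret.
    static String removeDelimiter(String delimiters, String ch) {
        return delimiters.replaceAll(java.util.regex.Pattern.quote(ch), "");
    }

    public static void main(String[] args) {
        String trimmed = removeDelimiter(DEFAULT_DELIMITERS, ":");
        System.out.println(trimmed.contains(":")); // false
    }
}
```

With this guard, the same helper safely removes "." from the set, whereas the bare replaceAll(":", "") in Example 1 only works because ":" happens not to be a regex metacharacter.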

Example 2: WordNgrams

import weka.core.tokenizers.NGramTokenizer; // import the package/class this method depends on
public static StringToWordVector WordNgrams(Properties prop) throws Exception{
    final StringToWordVector filter = new StringToWordVector();
    filter.setAttributeIndices("first-last");
    filter.setOutputWordCounts(false);
    filter.setTFTransform(false);
    filter.setIDFTransform(false);
    //if (prop.getProperty("Preprocessings.removeStopWords").equalsIgnoreCase("yes")) filter.setStopwords(new File("ressources//MotsVides.txt"));
    filter.setWordsToKeep(10000);
    filter.setMinTermFreq(1);
    NGramTokenizer tok = new NGramTokenizer();
    tok.setDelimiters(" \n\t.,;'\"()?!-/<>‘’“”…«»•&{[|`^]}$*%");
    tok.setNGramMinSize(Integer.parseInt(prop.getProperty("Ngrams.min")));
    tok.setNGramMaxSize(Integer.parseInt(prop.getProperty("Ngrams.max")));
    filter.setTokenizer(tok);
    
    return filter;
}
 
Developer ID: amineabdaoui, Project: french-sentiment-classification, Lines of code: 18, Source file: Tokenisation.java
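Example 2 reads the n-gram bounds straight from a Properties object; if either key is missing, getProperty returns null and Integer.parseInt throws a NullPointerException. A small sketch of a defensive variant (helper name and keys are illustrative, not from the original project):

```java
import java.util.Properties;

public class NgramConfig {
    // Read an integer property, falling back to a default when the key is
    // absent or blank; Example 2 would fail on a missing key.
    static int intProp(Properties p, String key, int fallback) {
        String v = p.getProperty(key);
        return (v == null || v.trim().isEmpty()) ? fallback : Integer.parseInt(v.trim());
    }

    public static void main(String[] args) {
        Properties prop = new Properties();
        prop.setProperty("Ngrams.max", "3"); // "Ngrams.min" deliberately unset
        System.out.println(intProp(prop, "Ngrams.min", 1)); // 1 (fallback)
        System.out.println(intProp(prop, "Ngrams.max", 1)); // 3
    }
}
```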

Example 3: calculateWordCount

import weka.core.tokenizers.NGramTokenizer; // import the package/class this method depends on
@Override
public Map<String, Integer> calculateWordCount(final DocumentContentData documentContentData, final int maxResult) {

	final String html = documentContentData.getContent();

	final Attribute input = new Attribute(HTML, (ArrayList<String>) null);

	final ArrayList<Attribute> inputVec = new ArrayList<>();
	inputVec.add(input);

	final Instances htmlInst = new Instances(HTML, inputVec, 1);

	htmlInst.add(new DenseInstance(1));
	htmlInst.instance(0).setValue(0, html);


	final StopwordsHandler stopwordsHandler = new StopwordsHandler() {

		@Override
		public boolean isStopword(final String word) {

			return word.length() < 5;
		}
	};

	final NGramTokenizer tokenizer = new NGramTokenizer();
	tokenizer.setNGramMinSize(1);
	tokenizer.setNGramMaxSize(1);
	tokenizer.setDelimiters(TOKEN_DELIMITERS);

	final StringToWordVector filter = new StringToWordVector();
	filter.setTokenizer(tokenizer);
	filter.setStopwordsHandler(stopwordsHandler);
	filter.setLowerCaseTokens(true);
	filter.setOutputWordCounts(true);
	filter.setWordsToKeep(maxResult);

	final Map<String,Integer> result = new HashMap<>();

	try {
		filter.setInputFormat(htmlInst);
		final Instances dataFiltered = Filter.useFilter(htmlInst, filter);

		final Instance last = dataFiltered.lastInstance();

		final int numAttributes = last.numAttributes();

		for (int i = 0; i < numAttributes; i++) {
			result.put(last.attribute(i).name(), Integer.valueOf(last.toString(i)));
		}
	} catch (final Exception e) {
		LOGGER.warn("Problem calculating wordcount for : {} , exception:{}",documentContentData.getId() ,e);
	}


	return result;
}
 
Developer ID: Hack23, Project: cia, Lines of code: 58, Source file: WordCounterImpl.java
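Example 3 filters short tokens by supplying an anonymous StopwordsHandler whose only rule is a length threshold. The predicate itself is plain Java and easy to verify in isolation, as in this sketch (class name is illustrative):

```java
public class LengthStopwords {
    // Example 3's StopwordsHandler treats any token shorter than five
    // characters as a stopword; here is the same predicate as a plain method.
    static boolean isStopword(String word) {
        return word.length() < 5;
    }

    public static void main(String[] args) {
        System.out.println(isStopword("html"));       // true: length 4
        System.out.println(isStopword("delimiters")); // false: length 10
    }
}
```

This crude length cutoff discards common short words like "the" and "a" without a stopword list, but it also drops short content words, so it fits a rough word-count view better than a linguistic analysis.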


Note: The weka.core.tokenizers.NGramTokenizer.setDelimiters examples in this article were compiled by 纯净天空 from open-source code and documentation platforms such as GitHub and MSDocs. The snippets are drawn from open-source projects contributed by their respective authors, who retain copyright; consult each project's license before distributing or using the code. Do not reproduce this article without permission.