Java TokenizerFactory.getTokenizer Method Code Examples

This article collects typical usage examples of the Java method edu.stanford.nlp.process.TokenizerFactory.getTokenizer. If you are wondering how TokenizerFactory.getTokenizer is used in practice or are looking for working examples of it, the selected snippets here may help. You can also look further into usage examples of the enclosing class, edu.stanford.nlp.process.TokenizerFactory.

Three code examples of the TokenizerFactory.getTokenizer method are shown below. By default they are sorted by popularity.

Example 1: demoAPI

import edu.stanford.nlp.process.TokenizerFactory; // import the package/class the method depends on
/**
 * demoAPI demonstrates other ways of calling the parser with already
 * tokenized text, or in some cases, raw text that needs to be tokenized as
 * a single sentence. Output is handled with a TreePrint object. Note that
 * the options used when creating the TreePrint can determine what results
 * to print out. Once again, one can capture the output by passing a
 * PrintWriter to TreePrint.printTree.
 * 
 * Difference: already tokenized text.
 */
public static void demoAPI(LexicalizedParser lp) {
	// This option shows parsing a list of correctly tokenized words
	String[] sent = { "This", "is", "an", "easy", "sentence", "." };
	List<CoreLabel> rawWords = Sentence.toCoreLabelList(sent);
	Tree parse = lp.apply(rawWords);
	parse.pennPrint();
	System.out.println();

	// This option shows loading and using an explicit tokenizer
	String sent2 = "Hey @Apple, pretty much all your products are amazing. You blow minds every time you launch a new gizmo."
			+ " that said, your hold music is crap";
	TokenizerFactory<CoreLabel> tokenizerFactory = PTBTokenizer.factory(
			new CoreLabelTokenFactory(), "");
	Tokenizer<CoreLabel> tok = tokenizerFactory
			.getTokenizer(new StringReader(sent2));
	List<CoreLabel> rawWords2 = tok.tokenize();
	parse = lp.apply(rawWords2);

	TreebankLanguagePack tlp = new PennTreebankLanguagePack();
	GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
	GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
	List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
	System.out.println(tdl);
	System.out.println();

	// You can also use a TreePrint object to print trees and dependencies
	TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
	tp.printTree(parse);
}

Developer: opinion-extraction-propagation, Project: TASC-Tuples, Lines of code: 42, Source: ParserDemo.java
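The call shape used in demoAPI above (ask a factory for a tokenizer bound to a `Reader`, then drain it with `tokenize()`) can be sketched without the Stanford library on the classpath. The block below is a simplified stand-in and not the CoreNLP implementation: `MiniTokenizer`, `MiniTokenizerFactory`, `MiniTokenizerDemo`, and the regex splitting rule are all hypothetical names and assumptions introduced for illustration. Real PTB tokenization applies far richer rules.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a tokenizer: drains its input into a token list. */
interface MiniTokenizer {
  List<String> tokenize();
}

/** Hypothetical stand-in for a tokenizer factory: one tokenizer per Reader. */
interface MiniTokenizerFactory {
  MiniTokenizer getTokenizer(Reader reader);
}

final class MiniTokenizerDemo {
  // Assumed rule: a token is a run of word characters (optionally prefixed
  // by '@') or a single punctuation character.
  private static final Pattern TOKEN = Pattern.compile("@?\\w+|[^\\s\\w]");

  /** Builds a factory whose tokenizers read the whole Reader, then split it. */
  static MiniTokenizerFactory factory() {
    return reader -> () -> {
      String text = readAll(reader);
      List<String> tokens = new ArrayList<>();
      Matcher m = TOKEN.matcher(text);
      while (m.find()) {
        tokens.add(m.group());
      }
      return tokens;
    };
  }

  /** Reads every character from the Reader into a String. */
  private static String readAll(Reader reader) {
    StringBuilder sb = new StringBuilder();
    char[] buf = new char[1024];
    try {
      int n;
      while ((n = reader.read(buf)) != -1) {
        sb.append(buf, 0, n);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    MiniTokenizerFactory tf = factory();
    List<String> tokens =
        tf.getTokenizer(new StringReader("Hey @Apple, your hold music is crap.")).tokenize();
    System.out.println(tokens);
  }
}
```

In this sketch each tokenizer consumes its `Reader`, so a second `tokenize()` call on the same tokenizer returns an empty list. Requesting a fresh tokenizer from the factory for each input avoids that.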

Example 2: main

import edu.stanford.nlp.process.TokenizerFactory; // import the package/class the method depends on
/**
 * A fast, rule-based tokenizer for Modern Standard Arabic (UTF-8 encoding).
 * Performs punctuation splitting and light tokenization by default.
 * Orthographic normalization options are available, and can be enabled with
 * command line options.
 * <p>
 * Currently, this tokenizer does not do line splitting. It normalizes non-printing
 * line separators across platforms and prints the system default line splitter
 * to the output.
 * </p>
 * <p>
 * The following normalization options are provided:
 * <ul>
 * <li><code>useUTF8Ellipsis</code> : Replaces sequences of three or more full stops with \u2026</li>
 * <li><code>normArDigits</code> : Convert Arabic digits to ASCII equivalents</li>
 * <li><code>normArPunc</code> : Convert Arabic punctuation to ASCII equivalents</li>
 * <li><code>normAlif</code> : Change all alif forms to bare alif</li>
 * <li><code>normYa</code> : Map ya to alif maqsura</li>
 * <li><code>removeDiacritics</code> : Strip all diacritics</li>
 * <li><code>removeTatweel</code> : Strip tatweel elongation character</li>
 * <li><code>removeQuranChars</code> : Remove diacritics that appear in the Quran</li>
 * <li><code>removeProMarker</code> : Remove the ATB null pronoun marker</li>
 * <li><code>removeSegMarker</code> : Remove the ATB clitic segmentation marker</li>
 * <li><code>removeMorphMarker</code> : Remove the ATB morpheme boundary markers</li>
 * <li><code>atbEscaping</code> : Replace left/right parentheses with ATB escape characters</li>
 * </ul>
 * </p>
 *
 * @param args command-line normalization options, parsed with StringUtils.argsToProperties
 */
public static void main(String[] args) {
  if (args.length > 0 && args[0].contains("help")) {
    System.err.printf("Usage: java %s [OPTIONS] < file%n", ArabicTokenizer.class.getName());
    System.err.printf("%nOptions:%n");
    System.err.println("   -help : Print this message. See javadocs for all normalization options.");
    System.err.println("   -atb  : Tokenization for the parsing experiments in Green and Manning (2010)");
    System.exit(-1);
  }

  // Process normalization options
  final Properties tokenizerOptions = StringUtils.argsToProperties(args);
  final TokenizerFactory<CoreLabel> tf = tokenizerOptions.containsKey("atb") ?
      ArabicTokenizer.atbFactory() : ArabicTokenizer.factory();
  for (String option : tokenizerOptions.stringPropertyNames()) {
    tf.setOptions(option);
  }

  // Replace line separators with a token so that we can
  // count lines
  tf.setOptions("tokenizeNLs");

  // Read the file
  int nLines = 0;
  int nTokens = 0;
  final String encoding = "UTF-8";
  try {
    Tokenizer<CoreLabel> tokenizer = tf.getTokenizer(new InputStreamReader(System.in, encoding));
    boolean printSpace = false;
    while (tokenizer.hasNext()) {
      ++nTokens;
      String word = tokenizer.next().word();
      if (word.equals(ArabicLexer.NEWLINE_TOKEN)) {
        ++nLines;
        printSpace = false;
        System.out.println();
      } else {
        if (printSpace) System.out.print(" ");
        System.out.print(word);
        printSpace = true;
      }
    }
  } catch (UnsupportedEncodingException e) {
    e.printStackTrace();
  }
  System.err.printf("Done! Tokenized %d lines (%d tokens)%n", nLines, nTokens);
}

Developer: benblamey, Project: stanford-nlp, Lines of code: 77, Source: ArabicTokenizer.java
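The counting loop in main above depends on the tokenizer emitting a dedicated newline token, which is what the "tokenizeNLs" option requests. A minimal plain-Java sketch of that idea follows, with no Stanford classes involved. `NEWLINE` is a hypothetical sentinel and is not claimed to be the value of `ArabicLexer.NEWLINE_TOKEN`; `LineCountSketch`, `Result`, and the whitespace-splitting toy tokenizer are likewise assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class LineCountSketch {
  // Hypothetical sentinel; a real tokenizer defines its own newline token.
  static final String NEWLINE = "*NL*";

  /** Holds the re-joined text and the two counters from one pass. */
  static final class Result {
    final String text;
    final int lines;
    final int tokens;

    Result(String text, int lines, int tokens) {
      this.text = text;
      this.lines = lines;
      this.tokens = tokens;
    }
  }

  /** Assumed toy tokenizer: whitespace-separated words plus NEWLINE at each line break. */
  static Iterator<String> tokens(String text) {
    List<String> out = new ArrayList<>();
    String[] lines = text.split("\\R", -1);
    for (int i = 0; i < lines.length; i++) {
      for (String word : lines[i].trim().split("\\s+")) {
        if (!word.isEmpty()) out.add(word);
      }
      // Every segment except the last was terminated by a line break.
      if (i < lines.length - 1) out.add(NEWLINE);
    }
    return out.iterator();
  }

  /** Streams tokens with hasNext()/next(), re-joining words and counting lines. */
  static Result consume(Iterator<String> tokenizer) {
    StringBuilder sb = new StringBuilder();
    int nLines = 0;
    int nTokens = 0;
    boolean printSpace = false;
    while (tokenizer.hasNext()) {
      ++nTokens; // as in the loop above, newline tokens are counted too
      String word = tokenizer.next();
      if (word.equals(NEWLINE)) {
        ++nLines;
        printSpace = false;
        sb.append('\n');
      } else {
        if (printSpace) sb.append(' ');
        sb.append(word);
        printSpace = true;
      }
    }
    return new Result(sb.toString(), nLines, nTokens);
  }

  public static void main(String[] args) {
    Result r = consume(tokens("a  b\nc d e\n"));
    System.out.print(r.text);
    System.err.printf("Done! Tokenized %d lines (%d tokens)%n", r.lines, r.tokens);
  }
}
```

The sketch writes `'\n'` rather than the platform line separator so that its output is the same on every system. The original loop uses `System.out.println()`, which emits the system default.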

Example 3: parse

import edu.stanford.nlp.process.TokenizerFactory; // import the package/class the method depends on
/**
 * Will parse the text in <code>sentence</code> as if it represented
 * a single sentence by first processing it with a tokenizer.
 */
public Tree parse(String sentence) {
  TokenizerFactory<? extends HasWord> tf = op.tlpParams.treebankLanguagePack().getTokenizerFactory();
  Tokenizer<? extends HasWord> tokenizer = tf.getTokenizer(new BufferedReader(new StringReader(sentence)));
  return parse(tokenizer.tokenize());
}

Developer: benblamey, Project: stanford-nlp, Lines of code: 10, Source: LexicalizedParser.java
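The parse method above types its factory as `TokenizerFactory<? extends HasWord>`, so the calling code can rely only on what every token type has in common. A hedged plain-Java sketch of that bounded-wildcard idea follows. `Worded`, `PlainToken`, `OffsetToken`, and `WildcardSketch` are hypothetical stand-ins introduced for illustration and are not Stanford classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for an interface like HasWord: anything exposing a word. */
interface Worded {
  String word();
}

/** A token that carries only its text. */
final class PlainToken implements Worded {
  private final String word;

  PlainToken(String word) {
    this.word = word;
  }

  @Override
  public String word() {
    return word;
  }
}

/** A token that also remembers where it started in the input. */
final class OffsetToken implements Worded {
  private final String word;
  private final int begin;

  OffsetToken(String word, int begin) {
    this.word = word;
    this.begin = begin;
  }

  @Override
  public String word() {
    return word;
  }

  int begin() {
    return begin;
  }
}

final class WildcardSketch {
  /** Accepts a list of any Worded subtype and reads only the shared word(). */
  static List<String> words(List<? extends Worded> tokens) {
    List<String> out = new ArrayList<>();
    for (Worded t : tokens) {
      out.add(t.word());
    }
    return out;
  }

  public static void main(String[] args) {
    List<PlainToken> plain = Arrays.asList(new PlainToken("easy"), new PlainToken("sentence"));
    List<OffsetToken> withOffsets =
        Arrays.asList(new OffsetToken("easy", 0), new OffsetToken("sentence", 5));
    // Both lists are accepted by the same method despite different element types.
    System.out.println(words(plain));
    System.out.println(words(withOffsets));
  }
}
```

The trade-off in this sketch is that `words` cannot add elements to the list it receives or call `begin()`, because the exact token type is unknown behind the wildcard.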

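Note: The edu.stanford.nlp.process.TokenizerFactory.getTokenizer examples in this article were compiled by 纯净天空 from open-source code and documentation platforms such as GitHub and MSDocs. The snippets were selected from open-source projects contributed by their developers, and copyright in the source code belongs to the original authors. For distribution and use, refer to the License of the corresponding project. Do not reproduce without permission.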