This page collects typical usage examples of the Java method edu.stanford.nlp.process.TokenizerFactory.getTokenizer, for readers who want to see how the method is called in practice.
Three code examples of TokenizerFactory.getTokenizer are shown below, sorted by popularity by default.
Example 1: demoAPI
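import edu.stanford.nlp.process.TokenizerFactory; // import the package/class this method depends on
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.Sentence;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.Tokenizer;
import edu.stanford.nlp.trees.*;
import java.io.StringReader;
import java.util.List;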
/**
 * demoAPI demonstrates other ways of calling the parser with already
 * tokenized text, or in some cases, raw text that needs to be tokenized as
 * a single sentence. Output is handled with a TreePrint object. Note that
 * the options used when creating the TreePrint can determine what results
 * to print out. Once again, one can capture the output by passing a
 * PrintWriter to TreePrint.printTree.
 *
 * difference: already tokenized text
 */
public static void demoAPI(LexicalizedParser lp) {
  // This option shows parsing a list of correctly tokenized words
  String[] sent = { "This", "is", "an", "easy", "sentence", "." };
  List<CoreLabel> rawWords = Sentence.toCoreLabelList(sent);
  Tree parse = lp.apply(rawWords);
  parse.pennPrint();
  System.out.println();

  // This option shows loading and using an explicit tokenizer
  String sent2 = "Hey @Apple, pretty much all your products are amazing. You blow minds every time you launch a new gizmo."
      + " that said, your hold music is crap";
  TokenizerFactory<CoreLabel> tokenizerFactory = PTBTokenizer.factory(
      new CoreLabelTokenFactory(), "");
  Tokenizer<CoreLabel> tok = tokenizerFactory
      .getTokenizer(new StringReader(sent2));
  List<CoreLabel> rawWords2 = tok.tokenize();
  parse = lp.apply(rawWords2);

  TreebankLanguagePack tlp = new PennTreebankLanguagePack();
  GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
  GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
  List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
  System.out.println(tdl);
  System.out.println();

  // You can also use a TreePrint object to print trees and dependencies
  TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
  tp.printTree(parse);
}
Example 2: main
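import edu.stanford.nlp.process.TokenizerFactory; // import the package/class this method depends on
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.Tokenizer;
import edu.stanford.nlp.util.StringUtils;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.util.Properties;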
/**
 * A fast, rule-based tokenizer for Modern Standard Arabic (UTF-8 encoding).
 * Performs punctuation splitting and light tokenization by default.
 * Orthographic normalization options are available, and can be enabled with
 * command line options.
 * <p>
 * Currently, this tokenizer does not do line splitting. It normalizes non-printing
 * line separators across platforms and prints the system default line splitter
 * to the output.
 * </p>
 * <p>
 * The following normalization options are provided:
 * <ul>
 * <li><code>useUTF8Ellipsis</code> : Replaces sequences of three or more full stops with \u2026</li>
 * <li><code>normArDigits</code> : Convert Arabic digits to ASCII equivalents</li>
 * <li><code>normArPunc</code> : Convert Arabic punctuation to ASCII equivalents</li>
 * <li><code>normAlif</code> : Change all alif forms to bare alif</li>
 * <li><code>normYa</code> : Map ya to alif maqsura</li>
 * <li><code>removeDiacritics</code> : Strip all diacritics</li>
 * <li><code>removeTatweel</code> : Strip tatweel elongation character</li>
 * <li><code>removeQuranChars</code> : Remove diacritics that appear in the Quran</li>
 * <li><code>removeProMarker</code> : Remove the ATB null pronoun marker</li>
 * <li><code>removeSegMarker</code> : Remove the ATB clitic segmentation marker</li>
 * <li><code>removeMorphMarker</code> : Remove the ATB morpheme boundary markers</li>
 * <li><code>atbEscaping</code> : Replace left/right parentheses with ATB escape characters</li>
 * </ul>
 * </p>
 *
 * @param args
 */
public static void main(String[] args) {
  if (args.length > 0 && args[0].contains("help")) {
    System.err.printf("Usage: java %s [OPTIONS] < file%n", ArabicTokenizer.class.getName());
    System.err.printf("%nOptions:%n");
    System.err.println(" -help : Print this message. See javadocs for all normalization options.");
    System.err.println(" -atb : Tokenization for the parsing experiments in Green and Manning (2010)");
    System.exit(-1);
  }

  // Process normalization options
  final Properties tokenizerOptions = StringUtils.argsToProperties(args);
  final TokenizerFactory<CoreLabel> tf = tokenizerOptions.containsKey("atb") ?
      ArabicTokenizer.atbFactory() : ArabicTokenizer.factory();
  for (String option : tokenizerOptions.stringPropertyNames()) {
    tf.setOptions(option);
  }

  // Replace line separators with a token so that we can
  // count lines
  tf.setOptions("tokenizeNLs");

  // Read the file
  int nLines = 0;
  int nTokens = 0;
  final String encoding = "UTF-8";
  try {
    Tokenizer<CoreLabel> tokenizer = tf.getTokenizer(new InputStreamReader(System.in, encoding));
    boolean printSpace = false;
    while (tokenizer.hasNext()) {
      ++nTokens;
      String word = tokenizer.next().word();
      if (word.equals(ArabicLexer.NEWLINE_TOKEN)) {
        ++nLines;
        printSpace = false;
        System.out.println();
      } else {
        if (printSpace) System.out.print(" ");
        System.out.print(word);
        printSpace = true;
      }
    }
  } catch (UnsupportedEncodingException e) {
    e.printStackTrace();
  }
  System.err.printf("Done! Tokenized %d lines (%d tokens)%n", nLines, nTokens);
}
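The counting loop in Example 2 relies on the tokenizer emitting a sentinel token for each line break once tokenizeNLs is set. The following is a minimal, library-free sketch of that loop, assuming the tokens have already been collected into a list. NewlineTokenCounter, Result, and the "*NL*" sentinel value are hypothetical names chosen for illustration; the actual value of ArabicLexer.NEWLINE_TOKEN is not shown in the excerpt. As in the original loop, the sentinel itself is included in the token count.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of CoreNLP.
class NewlineTokenCounter {
  // Placeholder sentinel chosen for this sketch (assumption, see lead-in).
  static final String NEWLINE_TOKEN = "*NL*";

  // Simple holder for the two counters and the rendered text.
  static final class Result {
    final int nLines;
    final int nTokens;
    final String text;

    Result(int nLines, int nTokens, String text) {
      this.nLines = nLines;
      this.nTokens = nTokens;
      this.text = text;
    }
  }

  // Mirrors the loop in Example 2: tokens on a line are joined by single
  // spaces, and each sentinel ends the current line.
  static Result render(List<String> tokens) {
    StringBuilder out = new StringBuilder();
    int nLines = 0;
    int nTokens = 0;
    boolean printSpace = false;
    for (String word : tokens) {
      ++nTokens; // incremented before the sentinel check, so sentinels are counted too
      if (word.equals(NEWLINE_TOKEN)) {
        ++nLines;
        printSpace = false;
        out.append('\n'); // the original prints the platform line separator instead
      } else {
        if (printSpace) out.append(' ');
        out.append(word);
        printSpace = true;
      }
    }
    return new Result(nLines, nTokens, out.toString());
  }

  public static void main(String[] args) {
    Result r = render(Arrays.asList("a", "b", NEWLINE_TOKEN, "c", NEWLINE_TOKEN));
    System.out.print(r.text);
    System.err.printf("Done! Tokenized %d lines (%d tokens)%n", r.nLines, r.nTokens);
  }
}
```

For the token list a, b, *NL*, c, *NL*, the sketch reports 2 lines and 5 tokens, because the two sentinels are counted as tokens. If only content tokens should be counted, the increment would move into the else branch.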
Example 3: parse
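import edu.stanford.nlp.process.TokenizerFactory; // import the package/class this method depends on
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.process.Tokenizer;
import edu.stanford.nlp.trees.Tree;
import java.io.BufferedReader;
import java.io.StringReader;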
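/**
 * Will parse the text in <code>sentence</code> as if it represented
 * a single sentence by first processing it with a tokenizer.
 */
public Tree parse(String sentence) {
  // op is a field of the enclosing class (not shown in this excerpt);
  // the tokenizer factory comes from its configured treebank language pack.
  TokenizerFactory<? extends HasWord> tf = op.tlpParams.treebankLanguagePack().getTokenizerFactory();
  Tokenizer<? extends HasWord> tokenizer = tf.getTokenizer(new BufferedReader(new StringReader(sentence)));
  // Delegates to an overload of parse that accepts the token list.
  return parse(tokenizer.tokenize());
}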
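All three examples follow the same shape: obtain a TokenizerFactory, pass a Reader to getTokenizer, then either drain the tokenizer with tokenize() or iterate with hasNext()/next(). The sketch below mimics that shape without depending on CoreNLP, so it can be run on its own. SketchTokenizer, SketchTokenizerFactory, WhitespaceTokenizerFactory, and TokenizerFactorySketch are hypothetical stand-ins, not CoreNLP types, and whitespace splitting is a much cruder rule than what PTBTokenizer or ArabicTokenizer apply.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a tokenizer: an iterator that can also be drained.
interface SketchTokenizer<T> extends Iterator<T> {
  // Collect all remaining tokens into a list.
  default List<T> tokenize() {
    List<T> out = new ArrayList<>();
    while (hasNext()) out.add(next());
    return out;
  }
}

// Hypothetical stand-in for a tokenizer factory: one tokenizer per Reader.
interface SketchTokenizerFactory<T> {
  SketchTokenizer<T> getTokenizer(Reader r);
}

// Splits on whitespace only; a placeholder for a real language-specific tokenizer.
class WhitespaceTokenizerFactory implements SketchTokenizerFactory<String> {
  @Override
  public SketchTokenizer<String> getTokenizer(Reader r) {
    return new SketchTokenizer<String>() {
      private String pending = null; // one-token lookahead

      // Reads characters until a complete token or end of input is reached.
      private String readToken() {
        try {
          StringBuilder sb = new StringBuilder();
          int c;
          while ((c = r.read()) != -1) {
            if (Character.isWhitespace(c)) {
              if (sb.length() > 0) break; // token finished
            } else {
              sb.append((char) c);
            }
          }
          return sb.length() > 0 ? sb.toString() : null;
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      }

      @Override
      public boolean hasNext() {
        if (pending == null) pending = readToken();
        return pending != null;
      }

      @Override
      public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        String t = pending;
        pending = null;
        return t;
      }
    };
  }
}

class TokenizerFactorySketch {
  // Same call chain as Example 3: factory -> getTokenizer(Reader) -> tokenize().
  static List<String> tokenizeAll(String text) {
    SketchTokenizerFactory<String> tf = new WhitespaceTokenizerFactory();
    return tf.getTokenizer(new StringReader(text)).tokenize();
  }

  public static void main(String[] args) {
    System.out.println(tokenizeAll("This is an easy sentence ."));
  }
}
```

Taking a Reader rather than a String is what lets the same factory serve both the in-memory StringReader of Examples 1 and 3 and the InputStreamReader over System.in of Example 2.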