Java CharsetDetector Class Code Examples

This article collects typical usage examples of the com.ibm.icu.text.CharsetDetector class in Java. If you are wondering how CharsetDetector is used in practice, the selected examples below may help.

The CharsetDetector class belongs to the com.ibm.icu.text package. The 15 code examples shown below are sorted by popularity by default.

Example 1: checkCharset

import com.ibm.icu.text.CharsetDetector; // import the required package/class
import com.ibm.icu.text.CharsetMatch;
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

public static CharsetMatch checkCharset(InputStream input) {
	// setText(InputStream) relies on mark/reset, so wrap streams that lack it.
	InputStream in = input.markSupported() ? input : new BufferedInputStream(input);
	CharsetDetector cd = new CharsetDetector();
	try {
		cd.setText(in);
	} catch (IOException e) {
		e.printStackTrace();
		try {
			input.close();
		} catch (IOException e1) {
			e1.printStackTrace();
		}
		// Nothing was loaded into the detector, so there is nothing to detect.
		return null;
	}
	// The best match, or null if no charset is consistent with the input.
	return cd.detect();
}

Developer ID: iotoasis, Project: SDA, Lines of code: 24, Source: FileUtil.java

Example 2: detectEncoding

import com.ibm.icu.text.CharsetDetector; // import the required package/class
/**
     * Detects the encoding of an input stream with icu4j. Only text streams can be detected.
     * -
     * Used instead of juniversalchardet.
     *
     * @param in the text input stream to inspect; it must support mark/reset
     * @return the detected charset, or null if none could be detected
     * @throws IOException if reading from the stream fails
     */
    public static Charset detectEncoding(InputStream in) throws IOException {
        final CharsetDetector detector = new CharsetDetector();
        detector.setText(in);

        final CharsetMatch charsetMatch = detector.detect();
        if (charsetMatch == null) {
            log.info("Cannot detect source charset.");
            return null;
        }
        // The confidence is an integer from 0 to 100; the higher the value, the more reliable the match.
        int confidence = charsetMatch.getConfidence();
        final String name = charsetMatch.getName();
        log.info("CharsetMatch: {} ({}% confidence; below 50% the encoding may not be determinable.)", name, confidence);
        // Print every candidate encoding for this text.
//        CharsetMatch[] matches = detector.detectAll();
//        System.out.println("All possibilities : " + Arrays.asList(matches));
        // Charset.forName throws UnsupportedCharsetException if the JVM lacks this charset.
        return Charset.forName(name);
    }

Developer ID: h819, Project: spring-boot, Lines of code: 29, Source: MyCharsetUtils.java

Example 3: getText

import com.ibm.icu.text.CharsetDetector; // import the required package/class
/**
 * Extract text to be indexed
 */
public static String getText(String mimeType, String encoding, InputStream isContent) throws IOException {
	BufferedInputStream bis = new BufferedInputStream(isContent);
	String text = null;

	try {
		TextExtractor te = engine.get(mimeType);

		if (te == null) {
			throw new IOException("Full text indexing of '" + mimeType + "' is not supported");
		}

		if (mimeType.startsWith("text/") && encoding == null) {
			CharsetDetector detector = new CharsetDetector();
			detector.setText(bis);
			CharsetMatch cm = detector.detect();
			// detect() returns null when no charset matches; leave encoding unset in that case.
			if (cm != null) {
				encoding = cm.getName();
			}
		}

		text = te.extractText(bis, mimeType, encoding);
	} finally {
		// Close the stream even when detection or extraction throws.
		IOUtils.closeQuietly(bis);
	}

	return text;
}

Developer ID: openkm, Project: document-management-system, Lines of code: 26, Source: RegisteredExtractors.java

Example 4: showEncode

import com.ibm.icu.text.CharsetDetector; // import the required package/class
protected String showEncode(Document doc) {
  String charsetName = "";
  try {
    String convertedPlainText = doc.getText(0, doc.getLength());
    try (InputStream is = convertStringToStream(convertedPlainText)) {
      CharsetMatch charsetMatch = new CharsetDetector().setText(is).detect();
      if (charsetMatch == null) {
        return "NULL"; // detect() found no matching charset
      }
      charsetName = charsetMatch.getName();
      charsetName = charsetName != null ? charsetName : "NULL";
      if (isPoorMatch(charsetMatch.getConfidence())) {
        charsetName = verifyPossibleUtf8(charsetName, is);
      }
      charsetName += showByteOfMark(is);
    }
  } catch (BadLocationException | IOException ex) {
    Exceptions.printStackTrace(ex);
  }
  return charsetName;
}

Developer ID: maumss, Project: file-type-plugin, Lines of code: 19, Source: FileType.java
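
Example 4 appends the result of a showByteOfMark helper that the snippet does not show. As a rough illustration of what byte-order-mark sniffing can look like, here is a minimal sketch using only the JDK. The class name BomSniffer and the method detectBom are hypothetical and are not taken from the file-type-plugin project; the sketch works on a byte array rather than on the InputStream used in the example.

```java
// Hypothetical helper, not the plugin's showByteOfMark implementation.
final class BomSniffer {

    /** Returns the encoding implied by a leading byte order mark, or null if there is none. */
    static String detectBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        // The UTF-32LE mark (FF FE 00 00) begins with the UTF-16LE mark (FF FE),
        // so the longer pattern has to be tested first.
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) {
            return "UTF-32LE";
        }
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) {
            return "UTF-32BE";
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return "UTF-16LE";
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        return null;
    }

    /** Compares the first bytes of data, as unsigned values, with the given prefix. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detectBom(withBom));
        System.out.println(detectBom(new byte[] {'h', 'i'}));
    }
}
```

One caveat of this sketch: the bytes FF FE 00 00 could also be a UTF-16LE mark followed by a NUL character, so the UTF-32LE answer is a guess that a caller may want to confirm with other evidence.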

Example 5: fileAnyEncodingToString

import com.ibm.icu.text.CharsetDetector; // import the required package/class
/**
 * Read a text file detecting encoding using http://userguide.icu-project.org/conversion/detection
 * Return the file contents as a String.
 */
public static String fileAnyEncodingToString(File f) throws IOException {

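  byte[] byteData;
  // Close the file stream once its bytes have been read.
  try (FileInputStream in = new FileInputStream(f)) {
    byteData = IOUtils.toByteArray(in);
  }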

  CharsetDetector detector = new CharsetDetector();

  String unicodeData = detector.getString(byteData, null);
  // Add two newlines at the end of the file, otherwise the subtitle parser library can get confused by EOF
  unicodeData += System.getProperty("line.separator") + System.getProperty("line.separator");
  CharsetMatch match = detector.detect();
  if (match != null && match.getConfidence() > 60) {
    LOGGER.debug("{} has a detected encoding: {}", f.getName(), match.getName());
    if (match.getLanguage() != null) {
      LOGGER.debug("{} has a detected language: {}", f.getName(), match.getLanguage());
    }
  }
  return unicodeData;
}

Developer ID: juliango202, Project: jijimaku, Lines of code: 23, Source: FileManager.java

Example 6: guessEncoding

import com.ibm.icu.text.CharsetDetector; // import the required package/class
/**
 * Detect charset encoding of a byte array
 * 
 * @param bytes: the byte array to detect encoding from
 * @return the charset encoding
 */
public static String guessEncoding(byte[] bytes) {
	UniversalDetector detector = new UniversalDetector(null);

	detector.handleData(bytes, 0, bytes.length);
	detector.dataEnd();

	String encoding = detector.getDetectedCharset();
	detector.reset();

	if (encoding == null || "MACCYRILLIC".equals(encoding)) {
		// juniversalchardet incorrectly detects windows-1256 as MACCYRILLIC
		// If encoding is MACCYRILLIC or null, we use ICU4J
		CharsetMatch detected = new CharsetDetector().setText(bytes).detect();
		if (detected != null) {
			encoding = detected.getName();
		} else {
			encoding = "UTF-8";
		}
	}

	return encoding;
}

Developer ID: dnbn, Project: submerge, Lines of code: 29, Source: FileUtils.java

Example 7: getEncoding

import com.ibm.icu.text.CharsetDetector; // import the required package/class
protected String getEncoding( String requiredEncoding, File file, Log log )
    throws IOException
{
    FileInputStream fis = null;
    try
    {
        fis = new FileInputStream( file );
        CharsetDetector detector = new CharsetDetector();
        detector.setDeclaredEncoding( requiredEncoding );
        detector.setText( new BufferedInputStream( fis ) );
        CharsetMatch[] charsets = detector.detectAll();
        // detectAll() can return an empty array when nothing matches.
        if ( charsets == null || charsets.length == 0 )
        {
            return null;
        }
        else
        {
            return charsets[0].getName();
        }
    }
    finally
    {
        IOUtil.close( fis );
    }
}

Developer ID: mojohaus, Project: extra-enforcer-rules, Lines of code: 26, Source: RequireEncoding.java

Example 8: guessCharset

import com.ibm.icu.text.CharsetDetector; // import the required package/class
private Charset guessCharset(Path file, Charset charset) throws IOException {

        CharsetDetector detector = new CharsetDetector();
        byte[] data;

        try (SeekableByteChannel byteChannel = Files.newByteChannel(file, StandardOpenOption.READ)) {
            long size = byteChannel.size();

            if (size >= Integer.MAX_VALUE) {
                return guessCharsetChardet(file, charset);
            }

            int smallsize = (int) size;
            ByteBuffer buffer = ByteBuffer.allocate(smallsize);
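            // A single read() may return before the buffer is full, so read until it is full or EOF.
            while (buffer.hasRemaining() && byteChannel.read(buffer) > 0) {
                // keep reading
            }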
            data = buffer.array();
        }

        detector.setText(data);
        CharsetMatch match = detector.detect();

        return Charset.forName(match.getName());
    }

Developer ID: rvdginste, Project: todo-teamcity-plugin, Lines of code: 24, Source: TodoPatternScanner.java

Example 9: sniff

import com.ibm.icu.text.CharsetDetector; // import the required package/class
public Encoding sniff() throws IOException {
    try {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(this);
        CharsetMatch match = detector.detect();
        Encoding enc = Encoding.forName(match.getName());
        Encoding actual = enc.getActualHtmlEncoding();
        if (actual != null) {
            enc = actual;
        }
        if (enc != Encoding.WINDOWS1252 && enc.isAsciiSuperset()) {
            return enc;
        } else {
            return null;
        }
    } catch (Exception e) {
        return null;
    }
}

Developer ID: google, Project: caja, Lines of code: 20, Source: IcuDetectorSniffer.java

Example 10: parseContent

import com.ibm.icu.text.CharsetDetector; // import the required package/class
@Override
protected void parseContent(StreamLimiter streamLimiter, LanguageEnum lang)
		throws IOException {
	CharsetDetector detector = new CharsetDetector();
	BufferedInputStream bis = null;
	try {
		bis = new BufferedInputStream(streamLimiter.getNewInputStream());
		detector.setText(bis);
		CharsetMatch match = detector.detect();
		String content;
		if (match != null)
			content = match.getString();
		else
			content = IOUtils.toString(streamLimiter.getNewInputStream(), "UTF-8");
		ParserResultItem result = getNewParserResultItem();
		result.addField(ParserFieldEnum.content, content);
		result.langDetection(10000, ParserFieldEnum.content);
	} finally {
		IOUtils.close(bis);
	}
}

Developer ID: jaeksoft, Project: opensearchserver, Lines of code: 22, Source: TextParser.java

Example 11: getCharsetFromText

import com.ibm.icu.text.CharsetDetector; // import the required package/class
/**
 * Use a third party library as last resort to guess the charset from the
 * bytes.
 */
private static String getCharsetFromText(byte[] content,
        String declaredCharset, int maxLengthCharsetDetection) {
    String charset = null;
    // filter HTML tags
    CharsetDetector charsetDetector = new CharsetDetector();
    charsetDetector.enableInputFilter(true);
    // give it a hint
    if (declaredCharset != null)
        charsetDetector.setDeclaredEncoding(declaredCharset);
    // trim the content of the text for the detection
    byte[] subContent = content;
    if (maxLengthCharsetDetection != -1
            && content.length > maxLengthCharsetDetection) {
        subContent = Arrays.copyOfRange(content, 0,
                maxLengthCharsetDetection);
    }
    charsetDetector.setText(subContent);
    try {
        CharsetMatch charsetMatch = charsetDetector.detect();
        charset = validateCharset(charsetMatch.getName());
    } catch (Exception e) {
        charset = null;
    }
    return charset;
}

Developer ID: DigitalPebble, Project: storm-crawler, Lines of code: 30, Source: CharsetIdentification.java

Example 12: toReader

import com.ibm.icu.text.CharsetDetector; // import the required package/class
public static Reader toReader(InputStream input) throws IOException {
	if (!input.markSupported())
		input = new BufferedInputStream(input);
	
	CharsetDetector charsetDetector = new CharsetDetector();
	charsetDetector.setText(input);
	
	CharsetMatch m = charsetDetector.detect();
	
	Reader reader;
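	// detect() returns null when no charset matches, so guard before reading the confidence.
	if (m != null && m.getConfidence() > 50) {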
		reader = m.getReader();
	} else {
		reader = new InputStreamReader(input);
	}
	return reader;
}

Developer ID: pescuma, Project: buildhealth, Lines of code: 18, Source: EncodingHelper.java
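
Example 12 wraps the stream in a BufferedInputStream when markSupported() is false. The reason, as far as the ICU4J documentation describes it, is that setText(InputStream) samples the beginning of the stream and then rewinds it, which needs mark/reset. The following sketch shows that sample-and-rewind pattern with JDK classes only. The class PeekDemo and the method peek are hypothetical names for illustration and are not part of ICU4J.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of sampling a stream with mark/reset; not ICU4J code.
final class PeekDemo {

    /** Reads up to limit bytes from the current position, then rewinds the stream. */
    static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);
        byte[] buf = new byte[limit];
        int total = 0;
        int n;
        // read() may return fewer bytes than requested, so loop until full or EOF.
        while (total < limit && (n = in.read(buf, total, limit - total)) > 0) {
            total += n;
        }
        in.reset();
        return Arrays.copyOf(buf, total);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello world".getBytes(StandardCharsets.US_ASCII);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        byte[] sample = peek(in, 5);
        System.out.println(new String(sample, StandardCharsets.US_ASCII));
        // The stream is back at its start, so a later reader still sees every byte.
        System.out.println((char) in.read());
    }
}
```

Because the stream is rewound after sampling, the fallback branch in Example 12 that builds an InputStreamReader over the same stream can still read the text from its first byte.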

Example 13: getEncoding

import com.ibm.icu.text.CharsetDetector; // import the required package/class
public static String getEncoding(String text) {
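	// Note: getBytes() encodes with the platform default charset, so the detector
	// sees that encoding rather than whatever the text was originally stored in.
	InputStream bis = new ByteArrayInputStream(text.getBytes());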
	CharsetDetector detector = new CharsetDetector();
	try {
		detector.setText(bis);
	} catch (IOException e) {
		throw new RuntimeException(e);
	}
	String encoding = detector.detect().getName();
	return encoding;
}

Developer ID: BassJel, Project: Jouve-Project, Lines of code: 12, Source: CharSetUtils.java

Example 14: detect

import com.ibm.icu.text.CharsetDetector; // import the required package/class
/**
 * {@inheritDoc}
 */
@Override
public String detect(InputStream stream, String defaultEncoding) throws IOException {
	CharsetDetector detector = new CharsetDetector();
	detector.setText(stream);
	detector.setDeclaredEncoding(defaultEncoding);
	detector.enableInputFilter(true);
	CharsetMatch[] matches = detector.detectAll();
	String encoding = null;
	for (int i = 0; i < matches.length; i++) {
		// Ensure that the detected encoding is supported in Java.
		String candidateEncoding = matches[i].getName();
		if (isSupportedEncoding(candidateEncoding)) {
			encoding = candidateEncoding;
			break;
		}
	}
	return encoding;
}

Developer ID: WING-NUS, Project: search-engine-wrapper, Lines of code: 22, Source: ICUStreamCharsetDetector.java
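
Example 14 filters the candidates through an isSupportedEncoding method that is not shown. The check matters because a detector may report a name for which the running JVM has no Charset, and Charset.forName would then throw. One plausible way to write such a check with the JDK alone is sketched below. The class name EncodingSupport is hypothetical, and the body is an assumption about what the original helper does, not a copy of it.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical stand-in for the isSupportedEncoding helper used in Example 14.
final class EncodingSupport {

    /** Returns true if the running JVM can provide a Charset for the given name. */
    static boolean isSupportedEncoding(String name) {
        if (name == null) {
            return false; // Charset.isSupported(null) would throw IllegalArgumentException
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not legal in a charset name.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isSupportedEncoding("UTF-8"));
        System.out.println(isSupportedEncoding("x-definitely-not-a-charset"));
    }
}
```

Returning false for malformed or null names, rather than letting the exception escape, lets the loop in Example 14 simply move on to the next candidate.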

Example 15: detectEncoding

import com.ibm.icu.text.CharsetDetector; // import the required package/class
/**
 * Returns the detected encoding of the given byte array.
 *
 * @param input The data to detect the encoding for.
 * @param assume88591IfNotUtf8 True to assume that the encoding is ISO-8859-1 (the standard
 *     encoding for HTTP) if the bytes are not valid UTF-8. Only recommended if you can reasonably
 *     expect that other encodings are going to be specified. Full encoding detection is very
 *     expensive!
 * @return The detected encoding.
 */
public static Charset detectEncoding(byte[] input, boolean assume88591IfNotUtf8) {
  if (looksLikeValidUtf8(input)) {
    return UTF_8;
  }

  if (assume88591IfNotUtf8) {
    return ISO_8859_1;
  }

  // Fall back to the incredibly slow ICU. It might be better to just skip this entirely.
  CharsetDetector detector = new CharsetDetector();
  detector.setText(input);
  CharsetMatch match = detector.detect();
  return Charset.forName(match.getName().toUpperCase());
}

Developer ID: inevo, Project: shindig-1.1-BETA5-incubating, Lines of code: 26, Source: EncodingDetector.java
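
Example 15 tries a cheap looksLikeValidUtf8 check before falling back to ICU, but the helper itself is not shown. A strict JDK decoder is one way to implement such a check, sketched below. The class name Utf8Check is hypothetical, and the original project may well use a hand-written byte scanner instead.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the looksLikeValidUtf8 helper referenced in Example 15.
final class Utf8Check {

    /** Returns true if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean looksLikeValidUtf8(byte[] input) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(looksLikeValidUtf8(utf8));
        System.out.println(looksLikeValidUtf8(latin1));
    }
}
```

Note that pure ASCII input also passes this check, since ASCII is a subset of UTF-8. That is consistent with how Example 15 uses the result, where returning UTF_8 for ASCII data is harmless.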

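Note: The com.ibm.icu.text.CharsetDetector class examples in this article were compiled by 纯净天空 from open-source code and documentation platforms such as Github/MSDocs. The code snippets were selected from open-source projects contributed by their respective developers, and copyright of the source code belongs to the original authors. For distribution and use, refer to the License of the corresponding project. Do not reproduce without permission.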