

Java BodyContentHandler: Usage and Code Examples


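Apache Tika is a library that lets you extract data from different kinds of documents (.PDF, .DOCX, etc.). In this tutorial, we will extract data using BodyContentHandler. The dependency used is shown below: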

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.26</version>
</dependency>

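BodyContentHandler is a decorator class that lets you obtain everything inside the XHTML <body> tag. The <body> and </body> tags themselves are not included in the resulting value.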
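To make the "body only" behavior concrete, here is a minimal sketch that uses nothing but the JDK's SAX parser. It is not Tika's implementation: the class BodyOnlySketch, its BodyCollector handler, and the sample XHTML are hypothetical names made up for illustration. The handler simply ignores character data until a <body> element opens and stops collecting once it closes.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for illustration; this is NOT Tika's BodyContentHandler.
class BodyOnlySketch {

    // Collects character data only while the parser is inside <body>.
    static class BodyCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private int bodyDepth = 0; // greater than 0 while inside <body>

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            // The default JDK parser is not namespace-aware, so qName is used.
            if (bodyDepth > 0 || "body".equals(qName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (bodyDepth > 0) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (bodyDepth > 0) {
                text.append(ch, start, length);
            }
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    // Parses the given XHTML string and returns only the text inside <body>.
    static String bodyText(String xhtml) {
        try {
            BodyCollector handler = new BodyCollector();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Ignored title</title></head>"
                     + "<body><p>Hello</p><p>Tika</p></body></html>";
        // The title is skipped; only the two paragraphs are collected.
        System.out.println(bodyText(xhtml));
    }
}
```

As with BodyContentHandler(), the collected text is read back through toString(), and the <body> tags themselves never appear in it.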

Let us first go over the constructors of this class:

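BodyContentHandler() writes all content to an internal string buffer; call toString() to get the content. By default, the maximum content length is 100 000 characters. If this limit is reached, a SAXException is thrown.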
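BodyContentHandler(int writeLimit)

Writes all content to an internal string buffer; call toString() to get the content.

'writeLimit' is the maximum number of characters that can be read; set it to -1 to disable the limit. If this limit is reached, a SAXException is thrown.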

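BodyContentHandler(OutputStream stream) writes all content to the given output stream. No content limit applies.
BodyContentHandler(Writer writer) writes all content to the given writer. No content limit applies.
BodyContentHandler(ContentHandler handler) passes all content to the given handler.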
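The write limit and the Writer target described above can be illustrated with a short JDK-only sketch. It only approximates the behavior described in this list and is not Tika's code: WriteLimitSketch, LimitedWriterHandler, and the write helper are hypothetical names, and the rule "write what still fits, then throw a SAXException" is an assumption made for the example. Passing -1 disables the limit, as in the constructor description.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for illustration; this is NOT Tika's implementation.
class WriteLimitSketch {

    // Streams character data to a Writer and aborts once writeLimit is exceeded.
    static class LimitedWriterHandler extends DefaultHandler {
        private final Writer out;
        private final int writeLimit; // -1 disables the limit
        private int written = 0;
        boolean limitReached = false;

        LimitedWriterHandler(Writer out, int writeLimit) {
            this.out = out;
            this.writeLimit = writeLimit;
        }

        @Override
        public void characters(char[] ch, int start, int length)
            throws SAXException {
            int allowed = length;
            if (writeLimit != -1 && written + length > writeLimit) {
                allowed = writeLimit - written; // keep only what still fits
                limitReached = true;
            }
            try {
                out.write(ch, start, allowed);
            } catch (IOException e) {
                throw new SAXException("Could not write to the target", e);
            }
            written += allowed;
            if (limitReached) {
                throw new SAXException(
                    "Write limit of " + writeLimit + " characters reached");
            }
        }
    }

    // Returns true if the whole document fit, false if the limit cut it short.
    static boolean write(String xml, Writer target, int writeLimit) {
        LimitedWriterHandler handler
            = new LimitedWriterHandler(target, writeLimit);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return true;
        } catch (SAXException e) {
            if (handler.limitReached) {
                return false; // expected: the limit aborted parsing
            }
            throw new IllegalStateException("Malformed sample XML", e);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        StringWriter limited = new StringWriter();
        boolean complete = write("<doc>abcdefghij</doc>", limited, 4);
        System.out.println(limited + " complete=" + complete);

        StringWriter unlimited = new StringWriter();
        complete = write("<doc>abcdefghij</doc>", unlimited, -1);
        System.out.println(unlimited + " complete=" + complete);
    }
}
```

The practical point is that a caller who hits the limit still has the partial text in the target, so catching the exception and keeping what was written is a reasonable pattern for very large documents.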

The methods of this class are as follows:

Method                    Action performed
MatchingContentHandler    Lets you obtain data via XPath

Note: The BodyContentHandler class does not itself implement any method of the ContentHandler interface; it only describes the XPath that MatchingContentHandler uses to get the XHTML body content.
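The idea of selecting the body through an XPath can be tried out with the JDK's own javax.xml.xpath package. This is only an analogy: Tika uses its own matcher, and the expression string(/html/body), the class name BodyXPathSketch, and the namespace-free sample markup are assumptions chosen for this sketch.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical stand-in for illustration; Tika does not use javax.xml.xpath here.
class BodyXPathSketch {

    // Returns the concatenated text of the <body> element.
    // Assumes the markup declares no XML namespace.
    static String bodyText(String xhtml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate("string(/html/body)",
                                  new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Ignored title</title></head>"
                     + "<body><p>Hello</p><p>Tika</p></body></html>";
        System.out.println(bodyText(xhtml));
    }
}
```

Unlike a streaming handler, this version builds the whole document in memory before evaluating the expression, so it shows the selection logic rather than a way to process large files.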

Implementation:

Example 1: Reading everything into the internal string buffer

Java


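// Java Program to Read Everything into Inner String Buffer
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Main class
public class GFG {
    // Method 1
    // Parses the named classpath resource and returns its body text
    public String parseToStringExample(String fileName)
        throws IOException, TikaException, SAXException
    {
        // Open the resource from the classpath; try-with-resources
        // closes the stream once parsing is done
        try (InputStream stream
             = this.getClass()
                   .getClassLoader()
                   .getResourceAsStream(fileName)) {
            // getResourceAsStream returns null for a missing resource
            if (stream == null) {
                throw new FileNotFoundException(fileName);
            }
            Parser parser = new AutoDetectParser();
            ContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();
            // Parsing the document stream
            parser.parse(stream, handler, metadata, context);
            return handler.toString();
        }
    }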
    // Method 2
    // Main driver method
    public static void main(String[] args)
        throws TikaException, IOException, SAXException
    {
        // Creating object of main class in main method
        GFG example = new GFG();
        // Display message for better readability
        System.out.println("Result");
        // Calling the method 1 to parse string by
        // providing file as an argument
        System.out.println(example.parseToStringExample(
            "test-reading.pdf"));
    }
}


Output: the console shows "Result" followed by the text extracted from test-reading.pdf.

Example 2: Writing content to a file (the OutputStream constructor applies no content-length limit)

Java


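// Java Program to Write Parsed Content into a File
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Main class
// BodyContentHandlerWriteToFileExample
public class GFG {
    // Method 1
    // Main driver method
    public static void main(String[] args)
        throws TikaException, IOException, SAXException
    {
        // Creating an object of the class
        GFG example = new GFG();
        // Calling Method 2 and passing the resource name
        // and the output file path as arguments
        example.writeParsedDataToFile(
            "test-reading.pdf",
            "/Users/ali_zhagparov/Desktop/pdf-content.txt");
    }
    // Method 2
    // Writing parsed data to a file
    public void
    writeParsedDataToFile(String readFromFileName,
                          String writeToFileName)
        throws IOException, TikaException, SAXException
    {
        // Creating an object of File class
        File yourFile = new File(writeToFileName);
        // Creates the file only if it does not exist yet;
        // an existing file is left in place
        yourFile.createNewFile();
        // append = false overwrites any previous content;
        // try-with-resources closes both streams
        try (InputStream stream
             = this.getClass()
                   .getClassLoader()
                   .getResourceAsStream(readFromFileName);
             FileOutputStream fileOutputStream
             = new FileOutputStream(yourFile, false)) {
            // getResourceAsStream returns null for a missing resource
            if (stream == null) {
                throw new FileNotFoundException(readFromFileName);
            }
            Parser parser = new AutoDetectParser();
            // The OutputStream constructor applies no write limit
            ContentHandler handler
                = new BodyContentHandler(fileOutputStream);
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();
            parser.parse(stream, handler, metadata, context);
        }
    }
}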


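Output:

Nothing is printed to the console, because in this case the program writes all of the extracted content to the file at the given path.

The program generates a ".txt" file containing the content of the ".pdf" file.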





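Note: This article was selected and compiled by 纯净天空 from the original English work "BodyContentHandler Class in Java" by alijakparovkz. Unless otherwise stated, copyright of the original code belongs to the original author. Please do not reproduce or copy this translation without permission or authorization.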