

BodyContentHandler Class in Java: Usage and Code Examples


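Apache Tika is a library that lets you extract data from different document formats (.pdf, .docx, etc.). In this tutorial, we will extract data using BodyContentHandler. The dependency used throughout is shown below: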

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.26</version>
</dependency>

BodyContentHandler is a decorator class that lets you get everything inside the XHTML <body> tag. The <body> and </body> tags themselves are not included in the result.
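To make the body-only behavior concrete without pulling in Tika, here is a minimal sketch that uses only the JDK SAX parser. The names BodyTextSketch and extractBodyText are hypothetical, and the simple flag-based filter is a stand-in for illustration, not Tika's implementation.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class (not part of Tika)
class BodyTextSketch {

    // Returns only the character data found between <body> and </body>
    static String extractBodyText(String xhtml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean insideBody = false;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("body".equals(qName)) {
                    insideBody = true;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("body".equals(qName)) {
                    insideBody = false;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Text outside <body> (for example the <title>) is skipped
                if (insideBody) {
                    text.append(ch, start, length);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Ignored title</title></head>"
                     + "<body><p>Hello, Tika</p></body></html>";
        System.out.println(extractBodyText(xhtml)); // prints "Hello, Tika"
    }
}
```

The title text lives in <head>, so it never reaches the buffer, and the tag names are never written because only character data is collected.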

Let us first go over the constructors of this class:

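BodyContentHandler() writes all content to an internal string buffer; call toString() to get the content. By default, the maximum content length is 100,000 characters. If this limit is reached, a SAXException is thrown.
BodyContentHandler(int writeLimit) writes all content to an internal string buffer; call toString() to get the content. 'writeLimit' is the maximum number of characters written to the buffer; pass -1 to disable the limit. If this limit is reached, a SAXException is thrown.
BodyContentHandler(OutputStream outputStream) writes all content to the given output stream. There is no content limit.
BodyContentHandler(Writer writer) writes all content to the given writer. There is no content limit.
BodyContentHandler(ContentHandler handler) passes all content to the given handler.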
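The write-limit behavior described above can be sketched with the JDK alone. This is an assumption-laden illustration, not Tika's code: WriteLimitSketch, LimitedTextHandler, and parseWithLimit are hypothetical names, and the handler simply keeps the characters that fit and then throws a SAXException, with -1 meaning "no limit".

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class (not part of Tika)
class WriteLimitSketch {

    // Buffers character data and fails once the limit would be exceeded
    static class LimitedTextHandler extends DefaultHandler {
        private final int writeLimit; // -1 disables the limit
        private final StringBuilder buffer = new StringBuilder();
        boolean limitReached = false;

        LimitedTextHandler(int writeLimit) {
            this.writeLimit = writeLimit;
        }

        @Override
        public void characters(char[] ch, int start, int length)
            throws SAXException {
            if (writeLimit == -1 || buffer.length() + length <= writeLimit) {
                buffer.append(ch, start, length);
            } else {
                // Keep the part that still fits, then stop the parse
                buffer.append(ch, start, writeLimit - buffer.length());
                limitReached = true;
                throw new SAXException(
                    "Write limit of " + writeLimit + " characters reached");
            }
        }

        @Override
        public String toString() {
            return buffer.toString();
        }
    }

    // Parses the XML and returns whatever text was buffered
    static String parseWithLimit(String xml, int writeLimit) {
        LimitedTextHandler handler = new LimitedTextHandler(writeLimit);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // Only swallow the exception raised by the limit itself
            if (!handler.limitReached) {
                throw new IllegalStateException(e);
            }
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        System.out.println(parseWithLimit("<doc>abcdefghij</doc>", 4));  // prints "abcd"
        System.out.println(parseWithLimit("<doc>abcdefghij</doc>", -1)); // prints "abcdefghij"
    }
}
```

With the real class, the same pattern would presumably be to catch the SAXException around parser.parse(...) and still read the partial text from handler.toString(); treat that as a usage suggestion rather than documented behavior.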

The main method associated with this class is as follows:

Method                    Action performed
MatchingContentHandler    Lets you select data by XPath

Note: The BodyContentHandler class does not implement any method of the ContentHandler interface itself; it only describes the XPath that MatchingContentHandler uses to select the XHTML body content.
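The idea of matching by path and delegating can be sketched as a small decorator. Everything here is hypothetical (PathMatchingSketch, BodyOnlyDecorator, trace), and the hard-coded html/body check merely stands in for a real XPath matcher; it is not how Tika evaluates its XPath.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class (not part of Tika)
class PathMatchingSketch {

    // Forwards SAX events to the delegate only for nodes below /html/body
    static class BodyOnlyDecorator extends DefaultHandler {
        private final ContentHandler delegate;
        private final List<String> path = new ArrayList<>();

        BodyOnlyDecorator(ContentHandler delegate) {
            this.delegate = delegate;
        }

        // True while the two outermost open elements are html and body
        private boolean underBody() {
            return path.size() >= 2
                && "html".equals(path.get(0))
                && "body".equals(path.get(1));
        }

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts)
            throws SAXException {
            // Checked before pushing, so <body> itself is not forwarded
            if (underBody()) {
                delegate.startElement(uri, localName, qName, atts);
            }
            path.add(qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName)
            throws SAXException {
            path.remove(path.size() - 1);
            // Checked after popping, so </body> itself is not forwarded
            if (underBody()) {
                delegate.endElement(uri, localName, qName);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length)
            throws SAXException {
            if (underBody()) {
                delegate.characters(ch, start, length);
            }
        }
    }

    // Records the events that reach the delegate as a compact string
    static String trace(String xhtml) {
        StringBuilder events = new StringBuilder();
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                events.append('<').append(qName).append('>');
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.append("</").append(qName).append('>');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                events.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)),
                       new BodyOnlyDecorator(recorder));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return events.toString();
    }
}
```

Calling trace on a small XHTML string shows that elements nested inside the body reach the delegate while the body tags and the head content do not.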

Implementation:

Example 1: Reading everything into the internal string buffer

Java


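// Java Program to Read Everything into Inner String Buffer

import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Main class
public class GFG {
    // Method 1
    // Parses a classpath resource and returns its body text
    public String parseToStringExample(String fileName)
        throws IOException, TikaException, SAXException
    {
        // Opening the resource as an InputStream;
        // try-with-resources closes it after parsing
        try (InputStream stream
                 = this.getClass()
                       .getClassLoader()
                       .getResourceAsStream(fileName)) {
            Parser parser = new AutoDetectParser();
            // Default constructor: internal string buffer,
            // limited to 100,000 characters
            ContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();
            // Parsing the document into the handler
            parser.parse(stream, handler, metadata, context);
            // The buffered body text
            return handler.toString();
        }
    }
    // Method 2
    // Main driver method
    public static void main(String[] args)
        throws TikaException, IOException, SAXException
    {
        // Creating object of main class in main method
        GFG example = new GFG();
        // Display message for better readability
        System.out.println("Result");
        // Calling method 1 to parse the document by
        // providing the file name as an argument
        System.out.println(example.parseToStringExample(
            "test-reading.pdf"));
    }
}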


Output: the program prints "Result" followed by the text extracted from test-reading.pdf.

Example 2: Writing content to a file through an output stream

Java


// Java Program to Write Parsed Content into a File
// (the OutputStream constructor applies no content limit)

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Main class
// BodyContentHandlerWriteToFileExample
public class GFG {
    // Method 1
    // Main driver method
    public static void main(String[] args)
        throws TikaException, IOException, SAXException
    {
        // Creating an object of the class
        GFG example = new GFG();
        // Calling Method 2 and passing the source file name
        // and the destination file path as arguments
        example.writeParsedDataToFile(
            "test-reading.pdf",
            "/Users/ali_zhagparov/Desktop/pdf-content.txt");
    }
    // Method 2
    // Writing parsed data to a file
    public void
    writeParsedDataToFile(String readFromFileName,
                          String writeToFileName)
        throws IOException, TikaException, SAXException
    {
        // Creating an object of File class
        File yourFile = new File(writeToFileName);
        // Creates the file only if it does not exist yet;
        // an existing file is left in place
        yourFile.createNewFile();
        // try-with-resources closes both streams, which also
        // flushes the output file; 'false' overwrites old content
        try (InputStream stream
                 = this.getClass()
                       .getClassLoader()
                       .getResourceAsStream(readFromFileName);
             OutputStream fileOutputStream
                 = new FileOutputStream(yourFile, false)) {
            Parser parser = new AutoDetectParser();
            // The handler writes body content straight to the stream
            ContentHandler handler
                = new BodyContentHandler(fileOutputStream);
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();
            parser.parse(stream, handler, metadata, context);
        }
    }
}


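Output:

Nothing is printed to the console, because the program writes all of the extracted content to the file instead.

The program generates a ".txt" file containing the contents of the ".pdf" file.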





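Note: This article was selected and compiled by 純淨天空 from the original English work "BodyContentHandler Class in Java" by alijakparovkz. Unless otherwise stated, copyright of the original code belongs to the original author; please do not reprint or copy this translation without permission or authorization.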