

Converting a Document to PDF in Java: Usage and Code Examples


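Software projects often need to convert a given file (HTML, TXT, and so on) to PDF and, in the other direction, to convert a PDF file to HTML, TXT, or a similar format. A PDF may also need to be stored as an image such as PNG or GIF. This article walks through these conversions with a sample Maven project. Because it is a Maven project, the necessary dependencies must be added to pom.xml.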

The core libraries are PDFBox, which loads the PDF file, and Pdf2Dom, which performs the conversion:

<!-- To load the selected PDF file -->
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox-tools</artifactId>
    <version>2.0.25</version>
</dependency>

<!-- Required for conversion -->
<dependency>
    <groupId>net.sf.cssbox</groupId>
    <artifactId>pdf2dom</artifactId>
    <version>2.0.1</version>
</dependency>

A few more dependencies are needed: iText to extract text from a given PDF file, and POI to create .docx documents.

<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itextpdf</artifactId>
    <version>5.5.10</version>
</dependency>
<dependency>
    <groupId>com.itextpdf.tool</groupId>
    <artifactId>xmlworker</artifactId>
    <version>5.5.10</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.15</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>3.15</version>
</dependency>

Example Maven Project

Let us start with the project structure and pom.xml, then look at the source code needed to convert PDF to other formats and other formats back to PDF.

pom.xml

XML


<?xml version="1.0"?> 
<project xmlns="http://maven.apache.org/POM/4.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0  
                        http://maven.apache.org/xsd/maven-4.0.0.xsd"> 
    <modelVersion>4.0.0</modelVersion> 
    <artifactId>pdf</artifactId> 
    <name>pdf</name> 
    <url>http://maven.apache.org</url> 
  
    <parent> 
        <groupId>com.gfg</groupId> 
        <artifactId>parent-modules</artifactId> 
        <version>1.0.0-SNAPSHOT</version> 
    </parent> 
  
    <dependencies> 
        <dependency> 
            <groupId>org.apache.pdfbox</groupId> 
            <artifactId>pdfbox-tools</artifactId> 
            <version>${pdfbox-tools.version}</version> 
            <exclusions> 
                <exclusion> 
                    <artifactId>commons-logging</artifactId> 
                    <groupId>commons-logging</groupId> 
                </exclusion> 
            </exclusions> 
        </dependency> 
        <dependency> 
            <groupId>net.sf.cssbox</groupId> 
            <artifactId>pdf2dom</artifactId> 
            <version>${pdf2dom.version}</version> 
            <exclusions> 
                <exclusion> 
                    <artifactId>commons-logging</artifactId> 
                    <groupId>commons-logging</groupId> 
                </exclusion> 
            </exclusions> 
        </dependency> 
        <dependency> 
            <groupId>com.itextpdf</groupId> 
            <artifactId>itextpdf</artifactId> 
            <version>${itextpdf.version}</version> 
        </dependency> 
        <dependency> 
            <groupId>com.itextpdf.tool</groupId> 
            <artifactId>xmlworker</artifactId> 
            <version>${xmlworker.version}</version> 
        </dependency> 
        <dependency> 
            <groupId>org.apache.poi</groupId> 
            <artifactId>poi-scratchpad</artifactId> 
            <version>${poi-scratchpad.version}</version> 
        </dependency> 
        <dependency> 
            <groupId>org.apache.xmlgraphics</groupId> 
            <artifactId>batik-transcoder</artifactId> 
            <version>${batik-transcoder.version}</version> 
        </dependency> 
        <dependency> 
            <groupId>org.apache.poi</groupId> 
            <artifactId>poi-ooxml</artifactId> 
            <version>${poi-ooxml.version}</version> 
        </dependency> 
        <dependency> 
            <groupId>org.thymeleaf</groupId> 
            <artifactId>thymeleaf</artifactId> 
            <version>${thymeleaf.version}</version> 
        </dependency> 
        <dependency> 
            <groupId>org.xhtmlrenderer</groupId> 
            <artifactId>flying-saucer-pdf</artifactId> 
            <version>${flying-saucer-pdf.version}</version> 
        </dependency> 
        <dependency> 
            <groupId>org.xhtmlrenderer</groupId> 
            <artifactId>flying-saucer-pdf-openpdf</artifactId> 
            <version>${flying-saucer-pdf-openpdf.version}</version> 
        </dependency> 
        <dependency> 
            <groupId>org.jsoup</groupId> 
            <artifactId>jsoup</artifactId> 
            <version>${jsoup.version}</version> 
        </dependency> 
        <dependency> 
            <groupId>com.openhtmltopdf</groupId> 
            <artifactId>openhtmltopdf-core</artifactId> 
            <version>${open-html-pdf-core.version}</version> 
        </dependency> 
        <dependency> 
            <groupId>com.openhtmltopdf</groupId> 
            <artifactId>openhtmltopdf-pdfbox</artifactId> 
            <version>${open-html-pdfbox.version}</version> 
        </dependency> 
    </dependencies> 
  
    <build> 
        <finalName>pdf</finalName> 
        <resources> 
            <resource> 
                <directory>src/main/resources</directory> 
                <filtering>true</filtering> 
            </resource> 
        </resources> 
    </build> 
  
    <properties> 
        <pdfbox-tools.version>2.0.25</pdfbox-tools.version> 
        <pdf2dom.version>2.0.1</pdf2dom.version> 
        <itextpdf.version>5.5.10</itextpdf.version> 
        <xmlworker.version>5.5.10</xmlworker.version> 
        <poi-scratchpad.version>3.15</poi-scratchpad.version> 
        <batik-transcoder.version>1.8</batik-transcoder.version> 
        <poi-ooxml.version>3.15</poi-ooxml.version> 
        <thymeleaf.version>3.0.11.RELEASE</thymeleaf.version> 
        <flying-saucer-pdf.version>9.1.20</flying-saucer-pdf.version> 
        <open-html-pdfbox.version>1.0.6</open-html-pdfbox.version> 
        <open-html-pdf-core.version>1.0.6</open-html-pdf-core.version> 
        <flying-saucer-pdf-openpdf.version>9.1.22</flying-saucer-pdf-openpdf.version> 
        <jsoup.version>1.14.2</jsoup.version> 
    </properties> 
  
</project>

Let us look at the key files.

1. PDF and HTML Conversion

ConversionOfPDF2HTMLExample.java

The program below handles both directions:

a. generationOfHTMLFromPDF

Note: PDF-to-HTML conversion cannot be guaranteed to produce a pixel-perfect result. The more complex the PDF file, the more the accuracy varies.

b. generationOfPDFFromHTML

Note: Every tag in the HTML file must be properly closed; otherwise the PDF cannot be generated.
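
Since the input HTML has to have every tag closed, it can help to check the file before handing it to the converter. The following is a minimal sketch that uses only the JDK's XML parser. The class name WellFormedCheck and the method isWellFormed are hypothetical and not part of the sample project, and the sketch assumes that XML well-formedness is a reasonable stand-in for what the converter accepts. It does not deal with DOCTYPE declarations or HTML entities.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the original project.
final class WellFormedCheck {

    // Returns true when the markup parses as well-formed XML (all tags closed and nested).
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the markup, for example because of an unclosed tag
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("XML parser is not usable", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<html><body><p>ok</p></body></html>"));
        System.out.println(isWellFormed("<html><body><p>broken<br></body></html>"));
    }
}
```

A file that fails this check, such as one containing a bare `<br>`, would need to be cleaned up first. The pom.xml already lists jsoup, which could be one option for that.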

Java


import java.io.File; 
import java.io.FileInputStream; 
import java.io.FileOutputStream; 
import java.io.IOException; 
import java.io.PrintWriter; 
import java.io.Writer; 
  
import javax.xml.parsers.ParserConfigurationException; 
  
import org.apache.pdfbox.pdmodel.PDDocument; 
import org.fit.pdfdom.PDFDomTree; 
  
import com.itextpdf.text.Document; 
import com.itextpdf.text.DocumentException; 
import com.itextpdf.text.pdf.PdfWriter; 
import com.itextpdf.tool.xml.XMLWorkerHelper; 
  
public class ConversionOfPDF2HTMLExample { 
  
    private static final String PDF = "src/main/resources/pdf.pdf"; 
    private static final String HTML = "src/main/resources/html.html"; 
  
    public static void main(String[] args) { 
        try { 
            generationOfHTMLFromPDF(PDF); 
            generationOfPDFFromHTML(HTML); 
        } catch (IOException | ParserConfigurationException | DocumentException e) { 
            e.printStackTrace(); 
        } 
    } 
  
    private static void generationOfHTMLFromPDF(String filename) throws ParserConfigurationException, IOException { 
        // try-with-resources closes the document and the writer, even if the conversion fails
        try (PDDocument pdf = PDDocument.load(new File(filename));
                Writer output = new PrintWriter("src/output/pdf.html", "utf-8")) {
            PDFDomTree parser = new PDFDomTree();
            parser.writeText(pdf, output);
        }
    } 
  
    private static void generationOfPDFFromHTML(String filename) throws ParserConfigurationException, IOException, DocumentException { 
        Document document = new Document();
        PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream("src/output/html.pdf"));
        document.open();
        // try-with-resources closes the HTML input stream once parsing is done
        try (FileInputStream html = new FileInputStream(filename)) {
            XMLWorkerHelper.getInstance().parseXHtml(writer, document, html);
        }
        document.close();
    } 
} 
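
The listing above writes its results under src/output. Neither PrintWriter nor FileOutputStream creates missing parent directories, so the program fails with a FileNotFoundException when that folder does not exist yet. A minimal sketch of one way to prepare the folder first, using only the JDK, is given here. The class OutputDirs and its method prepare are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the original project.
final class OutputDirs {

    // Creates the parent directory of the output file if needed and returns the file's path.
    static Path prepare(String outputFile) {
        Path target = Paths.get(outputFile);
        Path parent = target.getParent();
        try {
            if (parent != null) {
                // Does nothing when the directory already exists
                Files.createDirectories(parent);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target;
    }

    public static void main(String[] args) {
        Path html = prepare("src/output/pdf.html");
        System.out.println("Ready to write " + html);
    }
}
```

Calling `OutputDirs.prepare("src/output/pdf.html")` once before opening the writer would be enough for all examples in this article, since they share the same output folder.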

2. PDF and Image Conversion

A PDF can be converted to images in several ways; an important one is Apache PDFBox. Images can be converted back to PDF with iText.

ConversionOfPDF2ImageExample.java

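The program below handles the following methods:

  • Generating a PDF from an image
    • The image can be a jpeg, jpg, gif, tiff, or png, and can be loaded from disk
  • Generating images from a PDF
    • Apache PDFBox is an advanced tool. Each page of the PDF must be rendered as a BufferedImage with PDFRenderer. ImageIOUtil is then used to write the image as JPEG, GIF, PNG, and so on.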
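
Before rendering, it can be useful to confirm that the running JVM is able to write the requested image type. The sketch below assumes that ImageIOUtil relies on the writers registered with javax.imageio, and it builds file names in the same pdf-<page>.<extension> pattern used by the example program. The class ImageTargets and its methods are hypothetical names, not part of the sample project.

```java
import javax.imageio.ImageIO;

// Hypothetical helper, not part of the original project.
final class ImageTargets {

    // True when the JDK has an image writer registered for the given format name.
    static boolean canWrite(String extension) {
        return ImageIO.getImageWritersByFormatName(extension).hasNext();
    }

    // Builds the output name for a zero-based page index; pages are numbered from 1 on disk.
    static String pageFileName(int pageIndex, String extension) {
        return String.format("src/output/pdf-%d.%s", pageIndex + 1, extension);
    }

    public static void main(String[] args) {
        for (String extension : new String[] { "png", "jpeg", "gif" }) {
            System.out.println(pageFileName(0, extension) + " writable: " + canWrite(extension));
        }
    }
}
```

Checking once up front avoids rendering every page at 300 DPI only to find that no writer exists for the chosen extension.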

Java


import java.awt.image.BufferedImage; 
import java.io.File; 
import java.io.FileOutputStream; 
import java.io.IOException; 
import java.net.URL; 
  
import org.apache.pdfbox.pdmodel.PDDocument; 
import org.apache.pdfbox.rendering.ImageType; 
import org.apache.pdfbox.rendering.PDFRenderer; 
import org.apache.pdfbox.tools.imageio.ImageIOUtil; 
  
import com.itextpdf.text.BadElementException; 
import com.itextpdf.text.Document; 
import com.itextpdf.text.DocumentException; 
import com.itextpdf.text.Image; 
import com.itextpdf.text.pdf.PdfWriter; 
  
public class ConversionOfPDF2ImageExample { 
  
    private static final String PDF = "src/main/resources/pdf.pdf"; 
    private static final String JPG = "http://cdn2.gfg.netdna-cdn.com/wp-content/uploads/2016/05/gfg-rest-widget-main-1.2.0"; 
    private static final String GIF = "https://media.giphy.com/media/l3V0x6kdXUW9M4ONq/giphy"; 
  
    public static void main(String[] args) { 
        try { 
            generationOfImageFromPDF(PDF, "png"); 
            generationOfImageFromPDF(PDF, "jpeg"); 
            generationOfImageFromPDF(PDF, "gif"); 
            generationOfPDFFromImage(JPG, "jpg"); 
            generationOfPDFFromImage(GIF, "gif"); 
        } catch (IOException | DocumentException e) { 
            e.printStackTrace(); 
        } 
    } 
  
    private static void generationOfImageFromPDF(String filename, String extension) throws IOException { 
        // try-with-resources closes the document even if rendering a page fails
        try (PDDocument document = PDDocument.load(new File(filename))) {
            PDFRenderer pdfRenderer = new PDFRenderer(document);
            for (int page = 0; page < document.getNumberOfPages(); ++page) {
                BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
                ImageIOUtil.writeImage(bim, String.format("src/output/pdf-%d.%s", page + 1, extension), 300);
            }
        }
    } 
  
    private static void generationOfPDFFromImage(String filename, String extension) 
            throws IOException, BadElementException, DocumentException { 
        Document document = new Document(); 
        String input = filename + "." + extension; 
        String output = "src/output/" + extension + ".pdf"; 
        FileOutputStream fos = new FileOutputStream(output); 
        PdfWriter writer = PdfWriter.getInstance(document, fos); 
        writer.open(); 
        document.open(); 
        document.add(Image.getInstance((new URL(input)))); 
        document.close(); 
        writer.close(); 
    } 
  
} 

3. PDF and Text Conversion

Here too, Apache PDFBox is needed to get the text out of a PDF file, and iText is needed for the text-to-PDF conversion.

Note: Formatting cannot be preserved in a plain-text file, because it holds text only.

ConversionOfPDF2TextExample.java

Java


import java.io.BufferedReader; 
import java.io.File; 
import java.io.FileOutputStream; 
import java.io.FileReader; 
import java.io.IOException; 
import java.io.PrintWriter; 
  
import org.apache.pdfbox.cos.COSDocument; 
import org.apache.pdfbox.io.RandomAccessFile; 
import org.apache.pdfbox.pdfparser.PDFParser; 
import org.apache.pdfbox.pdmodel.PDDocument; 
import org.apache.pdfbox.text.PDFTextStripper; 
  
import com.itextpdf.text.Document; 
import com.itextpdf.text.DocumentException; 
import com.itextpdf.text.Element; 
import com.itextpdf.text.Font; 
import com.itextpdf.text.PageSize; 
import com.itextpdf.text.Paragraph; 
import com.itextpdf.text.pdf.PdfWriter; 
  
public class ConversionOfPDF2TextExample { 
  
    private static final String PDF = "src/main/resources/pdf.pdf"; 
    private static final String TXT = "src/main/resources/txt.txt"; 
  
    public static void main(String[] args) { 
        try { 
            generationOfTxtFromPDF(PDF); 
            generationOfPDFFromTxt(TXT); 
        } catch (IOException | DocumentException e) { 
            e.printStackTrace(); 
        } 
    } 
  
    private static void generationOfTxtFromPDF(String filename) throws IOException { 
        File f = new File(filename); 
        String parsedText; 
        PDFParser parser = new PDFParser(new RandomAccessFile(f, "r")); 
        parser.parse(); 
  
        COSDocument cosDoc = parser.getDocument(); 
  
        PDFTextStripper pdfStripper = new PDFTextStripper(); 
        PDDocument pdDoc = new PDDocument(cosDoc); 
  
        parsedText = pdfStripper.getText(pdDoc); 
  
        if (cosDoc != null) 
            cosDoc.close(); 
        if (pdDoc != null) 
            pdDoc.close(); 
  
        PrintWriter pw = new PrintWriter("src/output/pdf.txt"); 
        pw.print(parsedText); 
        pw.close(); 
    } 
  
    private static void generationOfPDFFromTxt(String filename) throws IOException, DocumentException { 
        Document pdfDoc = new Document(PageSize.A4); 
        PdfWriter.getInstance(pdfDoc, new FileOutputStream("src/output/txt.pdf")) 
                .setPdfVersion(PdfWriter.PDF_VERSION_1_7); 
        pdfDoc.open(); 
          
        Font myfont = new Font(); 
        myfont.setStyle(Font.NORMAL); 
        myfont.setSize(11); 
        pdfDoc.add(new Paragraph("\n")); 
          
        BufferedReader br = new BufferedReader(new FileReader(filename)); 
        String strLine; 
        while ((strLine = br.readLine()) != null) { 
            Paragraph para = new Paragraph(strLine + "\n", myfont); 
            para.setAlignment(Element.ALIGN_JUSTIFIED); 
            pdfDoc.add(para); 
        } 
          
        pdfDoc.close(); 
        br.close(); 
    } 
  
} 
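
Both methods in the listing above use the JVM's default charset: PrintWriter(String) when writing the extracted text and FileReader(String) when reading the text file. That default can differ between machines, which may garble non-ASCII characters. A minimal sketch of reading and writing with an explicit UTF-8 charset follows. The class Utf8Text and its methods are hypothetical names, and UTF-8 is an assumption about the files involved.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original project.
final class Utf8Text {

    // Writes the lines as UTF-8, regardless of the platform default charset.
    static void write(Path file, List<String> lines) {
        try {
            Files.write(file, lines, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads all lines as UTF-8, regardless of the platform default charset.
    static List<String> read(Path file) {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = Paths.get(System.getProperty("java.io.tmpdir"), "utf8text-demo.txt");
        write(file, Arrays.asList("first line", "second line"));
        System.out.println(read(file));
    }
}
```

In the example program, the lines returned by `read` could feed the loop that creates one Paragraph per line, and the text extracted from the PDF could be passed to `write` after splitting it into lines.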

4. PDF and DOCX Conversion

Two libraries are needed, namely:

  • iText: extracts text from the PDF
  • POI: creates the .docx document

ConversionOfPDF2WordExample.java

Java


import java.io.FileOutputStream; 
import java.io.IOException; 
  
import org.apache.poi.xwpf.usermodel.BreakType; 
import org.apache.poi.xwpf.usermodel.XWPFDocument; 
import org.apache.poi.xwpf.usermodel.XWPFParagraph; 
import org.apache.poi.xwpf.usermodel.XWPFRun; 
  
import com.itextpdf.text.pdf.PdfReader; 
import com.itextpdf.text.pdf.parser.PdfReaderContentParser; 
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy; 
import com.itextpdf.text.pdf.parser.TextExtractionStrategy; 
  
public class ConversionOfPDF2WordExample { 
  
    private static final String FILENAME = "src/main/resources/pdf.pdf"; 
  
    public static void main(String[] args) { 
        try { 
            generationOfDocFromPDF(FILENAME); 
        } catch (IOException e) { 
            e.printStackTrace(); 
        } 
    } 
  
    private static void generationOfDocFromPDF(String filename) throws IOException { 
        XWPFDocument doc = new XWPFDocument(); 
  
        String pdf = filename; 
        PdfReader reader = new PdfReader(pdf); 
        PdfReaderContentParser parser = new PdfReaderContentParser(reader); 
  
        for (int i = 1; i <= reader.getNumberOfPages(); i++) { 
            TextExtractionStrategy strategy = parser.processContent(i, new SimpleTextExtractionStrategy()); 
            String text = strategy.getResultantText(); 
            XWPFParagraph p = doc.createParagraph(); 
            XWPFRun run = p.createRun(); 
            run.setText(text); 
            run.addBreak(BreakType.PAGE); 
        } 
        FileOutputStream out = new FileOutputStream("src/output/pdf.docx"); 
        doc.write(out); 
        out.close(); 
        reader.close(); 
        doc.close(); 
    } 
} 


Conclusion

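At many stages of a software project there is a need to convert text and images to PDF, and likewise to convert PDF data to text, image, and DOCX formats. The examples above show one workable way to do this in Java.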





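Note: This article was selected and compiled by 纯净天空 from the original English work "How to Convert a Document to PDF in Java?" by priyarajtt. Unless otherwise stated, copyright in the original code belongs to the original author. Do not reproduce or copy this translation without permission or authorization.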