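Software projects often need to convert a given file (HTML, TXT, etc.) to PDF and, conversely, to convert a PDF file to HTML, TXT, and other formats. A PDF may also need to be stored as an image such as PNG or GIF. Let us walk through these conversions using a sample Maven project. Because it is a Maven project, the required dependencies must be added to pom.xml.
The basic library is PDF2Dom, used together with Apache PDFBox: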
<!-- To load the selected PDF file -->
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox-tools</artifactId>
    <version>2.0.25</version>
</dependency>
<!-- Required for conversion -->
<dependency>
    <groupId>net.sf.cssbox</groupId>
    <artifactId>pdf2dom</artifactId>
    <version>2.0.1</version>
</dependency>
A few more dependencies are needed. iText is required to extract text from a given PDF file, and POI is required to create .docx documents.
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itextpdf</artifactId>
    <version>5.5.10</version>
</dependency>
<dependency>
    <groupId>com.itextpdf.tool</groupId>
    <artifactId>xmlworker</artifactId>
    <version>5.5.10</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.15</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>3.15</version>
</dependency>
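Example Maven Project
Let us start with the project structure and pom.xml, and then look at the source code needed to convert from PDF to other formats and from other formats back to PDF.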
pom.xml
XML
<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <artifactId>pdf</artifactId>
    <name>pdf</name>
    <url>http://maven.apache.org</url>
    <parent>
        <groupId>com.gfg</groupId>
        <artifactId>parent-modules</artifactId>
        <version>1.0.0-SNAPSHOT</version>
    </parent>
    <dependencies>
        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox-tools</artifactId>
            <version>${pdfbox-tools.version}</version>
            <exclusions>
                <exclusion>
                    <artifactId>commons-logging</artifactId>
                    <groupId>commons-logging</groupId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>net.sf.cssbox</groupId>
            <artifactId>pdf2dom</artifactId>
            <version>${pdf2dom.version}</version>
            <exclusions>
                <exclusion>
                    <artifactId>commons-logging</artifactId>
                    <groupId>commons-logging</groupId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>com.itextpdf</groupId>
            <artifactId>itextpdf</artifactId>
            <version>${itextpdf.version}</version>
        </dependency>
        <dependency>
            <groupId>com.itextpdf.tool</groupId>
            <artifactId>xmlworker</artifactId>
            <version>${xmlworker.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-scratchpad</artifactId>
            <version>${poi-scratchpad.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.xmlgraphics</groupId>
            <artifactId>batik-transcoder</artifactId>
            <version>${batik-transcoder.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>${poi-ooxml.version}</version>
        </dependency>
        <dependency>
            <groupId>org.thymeleaf</groupId>
            <artifactId>thymeleaf</artifactId>
            <version>${thymeleaf.version}</version>
        </dependency>
        <dependency>
            <groupId>org.xhtmlrenderer</groupId>
            <artifactId>flying-saucer-pdf</artifactId>
            <version>${flying-saucer-pdf.version}</version>
        </dependency>
        <dependency>
            <groupId>org.xhtmlrenderer</groupId>
            <artifactId>flying-saucer-pdf-openpdf</artifactId>
            <version>${flying-saucer-pdf-openpdf.version}</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>${jsoup.version}</version>
        </dependency>
        <dependency>
            <groupId>com.openhtmltopdf</groupId>
            <artifactId>openhtmltopdf-core</artifactId>
            <version>${open-html-pdf-core.version}</version>
        </dependency>
        <dependency>
            <groupId>com.openhtmltopdf</groupId>
            <artifactId>openhtmltopdf-pdfbox</artifactId>
            <version>${open-html-pdfbox.version}</version>
        </dependency>
    </dependencies>
    <build>
        <finalName>pdf</finalName>
        <resources>
            <resource>
                <directory>src/main/resources</directory>
                <filtering>true</filtering>
            </resource>
        </resources>
    </build>
    <properties>
        <pdfbox-tools.version>2.0.25</pdfbox-tools.version>
        <pdf2dom.version>2.0.1</pdf2dom.version>
        <itextpdf.version>5.5.10</itextpdf.version>
        <xmlworker.version>5.5.10</xmlworker.version>
        <poi-scratchpad.version>3.15</poi-scratchpad.version>
        <batik-transcoder.version>1.8</batik-transcoder.version>
        <poi-ooxml.version>3.15</poi-ooxml.version>
        <thymeleaf.version>3.0.11.RELEASE</thymeleaf.version>
        <flying-saucer-pdf.version>9.1.20</flying-saucer-pdf.version>
        <open-html-pdfbox.version>1.0.6</open-html-pdfbox.version>
        <open-html-pdf-core.version>1.0.6</open-html-pdf-core.version>
        <flying-saucer-pdf-openpdf.version>9.1.22</flying-saucer-pdf-openpdf.version>
        <jsoup.version>1.14.2</jsoup.version>
    </properties>
</project>
Let us look at the key source files.
1. PDF and HTML conversion
ConversionOfPDF2HTMLExample.java
The program below handles both directions:
a. generationOfHTMLFromPDF
Note: PDF-to-HTML conversion cannot guarantee a pixel-for-pixel identical result. The more complex the PDF file, the more the accuracy varies.
b. generationOfPDFFromHTML
Note: All tags in the HTML file must be properly closed; only then can the PDF be generated.
Java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Writer;

import javax.xml.parsers.ParserConfigurationException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.fit.pdfdom.PDFDomTree;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.tool.xml.XMLWorkerHelper;

public class ConversionOfPDF2HTMLExample {

    private static final String PDF = "src/main/resources/pdf.pdf";
    private static final String HTML = "src/main/resources/html.html";

    // The src/output directory must already exist before running.
    public static void main(String[] args) {
        try {
            generationOfHTMLFromPDF(PDF);
            generationOfPDFFromHTML(HTML);
        } catch (IOException | ParserConfigurationException | DocumentException e) {
            e.printStackTrace();
        }
    }

    // Loads the PDF with PDFBox and writes it out as HTML via Pdf2Dom.
    private static void generationOfHTMLFromPDF(String filename) throws ParserConfigurationException, IOException {
        // try-with-resources closes the document and the writer even if the conversion fails
        try (PDDocument pdf = PDDocument.load(new File(filename));
             Writer output = new PrintWriter("src/output/pdf.html", "utf-8")) {
            PDFDomTree parser = new PDFDomTree();
            parser.writeText(pdf, output);
        }
    }

    // Renders an HTML file (all tags properly closed) to PDF with iText's XMLWorker.
    private static void generationOfPDFFromHTML(String filename) throws IOException, DocumentException {
        try (FileInputStream html = new FileInputStream(filename);
             FileOutputStream pdf = new FileOutputStream("src/output/html.pdf")) {
            Document document = new Document();
            PdfWriter writer = PdfWriter.getInstance(document, pdf);
            document.open();
            XMLWorkerHelper.getInstance().parseXHtml(writer, document, html);
            document.close();
        }
    }
}
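The note about properly closed tags can be checked before generationOfPDFFromHTML is called. The following is a minimal sketch that uses only the JDK's XML parser; the class XhtmlCheck and the method isWellFormed are hypothetical names, not part of the original project. It only tests XML well-formedness, so HTML-only entities such as &nbsp; or a DOCTYPE that points at an external DTD may need extra handling.

```java
import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (an assumption for illustration, not from the original article).
class XhtmlCheck {

    // Returns true when the markup parses as well-formed XML,
    // i.e. every tag is closed and correctly nested.
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal parse errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Every tag is closed.
        System.out.println(isWellFormed("<html><body><p>Hello</p></body></html>"));
        // The <br> tag is never closed.
        System.out.println(isWellFormed("<html><body><p>Hello<br></p></body></html>"));
    }
}
```

A caller could read html.html into a string, run this check, and report the problem early instead of failing inside the PDF generation step.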
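2. PDF and image conversion
A PDF can be converted to images in several ways; one important option is Apache PDFBox. iText can then be used to convert images back to PDF.
ConversionOfPDF2ImageExample.java
The program below handles the following methods:
- Generating a PDF from an image
  - The image can be of type jpeg, jpg, gif, tiff, or png, and can be loaded from disk (the example below loads it from a URL)
- Generating images from a PDF
  - Apache PDFBox is an advanced tool. Each page of the PDF must be rendered as a BufferedImage using PDFRenderer. ImageIOUtil is then used to write the image as JPEG, GIF, PNG, and so on.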
Java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.apache.pdfbox.tools.imageio.ImageIOUtil;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Image;
import com.itextpdf.text.pdf.PdfWriter;

public class ConversionOfPDF2ImageExample {

    private static final String PDF = "src/main/resources/pdf.pdf";
    // Image URLs without the extension; the extension is appended in generationOfPDFFromImage.
    private static final String JPG = "http://cdn2.gfg.netdna-cdn.com/wp-content/uploads/2016/05/gfg-rest-widget-main-1.2.0";
    private static final String GIF = "https://media.giphy.com/media/l3V0x6kdXUW9M4ONq/giphy";
    private static final int DPI = 300;

    public static void main(String[] args) {
        try {
            generationOfImageFromPDF(PDF, "png");
            generationOfImageFromPDF(PDF, "jpeg");
            generationOfImageFromPDF(PDF, "gif");
            generationOfPDFFromImage(JPG, "jpg");
            generationOfPDFFromImage(GIF, "gif");
        } catch (IOException | DocumentException e) {
            e.printStackTrace();
        }
    }

    // Renders every page of the PDF to src/output/pdf-<page>.<extension>.
    private static void generationOfImageFromPDF(String filename, String extension) throws IOException {
        // try-with-resources closes the document even if rendering fails
        try (PDDocument document = PDDocument.load(new File(filename))) {
            PDFRenderer pdfRenderer = new PDFRenderer(document);
            for (int page = 0; page < document.getNumberOfPages(); ++page) {
                BufferedImage bim = pdfRenderer.renderImageWithDPI(page, DPI, ImageType.RGB);
                ImageIOUtil.writeImage(bim, String.format("src/output/pdf-%d.%s", page + 1, extension), DPI);
            }
        }
    }

    // Downloads the image and places it in a new PDF at src/output/<extension>.pdf.
    private static void generationOfPDFFromImage(String filename, String extension)
            throws IOException, DocumentException {
        String input = filename + "." + extension;
        String output = "src/output/" + extension + ".pdf";
        try (FileOutputStream fos = new FileOutputStream(output)) {
            Document document = new Document();
            PdfWriter.getInstance(document, fos);
            // Opening the document also opens the writer; closing it flushes the PDF.
            document.open();
            document.add(Image.getInstance(new URL(input)));
            document.close();
        }
    }
}
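To illustrate the write step in isolation, here is a minimal sketch that uses only the JDK's ImageIO rather than PDFBox's ImageIOUtil. A blank BufferedImage stands in for a page rendered by PDFRenderer, and the class PageImageWriter and method writePage are hypothetical names. Unlike the ImageIOUtil call above, this sketch does not take a DPI value.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import javax.imageio.ImageIO;

// Hypothetical helper (an assumption for illustration, not from the original article).
class PageImageWriter {

    // Writes the image to <dir>/pdf-<pageNumber>.<extension> and returns the file.
    static File writePage(BufferedImage image, Path dir, int pageNumber, String extension)
            throws IOException {
        Files.createDirectories(dir);
        File target = dir.resolve(String.format("pdf-%d.%s", pageNumber, extension)).toFile();
        // ImageIO.write returns false when no writer is registered for the format.
        if (!ImageIO.write(image, extension, target)) {
            throw new IOException("No ImageIO writer for format: " + extension);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a rendered page: a white RGB image without an alpha channel.
        BufferedImage page = new BufferedImage(200, 100, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = page.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, 200, 100);
        g.dispose();

        Path dir = Files.createTempDirectory("pdf-images");
        for (String extension : new String[] {"png", "jpeg", "gif"}) {
            System.out.println(writePage(page, dir, 1, extension));
        }
    }
}
```

The file naming follows the same pdf-<page>.<extension> pattern as generationOfImageFromPDF, with page numbers starting at 1.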
3. PDF and text conversion
This again uses Apache PDFBox to extract the text from the PDF file, and iText for the text-to-PDF conversion.
Note: Formatting cannot be preserved in a plain text file, because it contains text only.
ConversionOfPDF2TextExample.java
Java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Element;
import com.itextpdf.text.Font;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;

public class ConversionOfPDF2TextExample {

    private static final String PDF = "src/main/resources/pdf.pdf";
    private static final String TXT = "src/main/resources/txt.txt";

    public static void main(String[] args) {
        try {
            generationOfTxtFromPDF(PDF);
            generationOfPDFFromTxt(TXT);
        } catch (IOException | DocumentException e) {
            e.printStackTrace();
        }
    }

    // Extracts the plain text of the PDF with PDFBox and writes it to src/output/pdf.txt.
    private static void generationOfTxtFromPDF(String filename) throws IOException {
        String parsedText;
        // PDDocument.load parses the file; closing the PDDocument also releases the underlying COSDocument.
        try (PDDocument pdDoc = PDDocument.load(new File(filename))) {
            PDFTextStripper pdfStripper = new PDFTextStripper();
            parsedText = pdfStripper.getText(pdDoc);
        }
        // Write with an explicit charset so the result does not depend on the platform default.
        try (PrintWriter pw = new PrintWriter("src/output/pdf.txt", "UTF-8")) {
            pw.print(parsedText);
        }
    }

    // Reads the text file line by line and adds each line to the PDF as a justified paragraph.
    private static void generationOfPDFFromTxt(String filename) throws IOException, DocumentException {
        Document pdfDoc = new Document(PageSize.A4);
        PdfWriter.getInstance(pdfDoc, new FileOutputStream("src/output/txt.pdf"))
                .setPdfVersion(PdfWriter.PDF_VERSION_1_7);
        pdfDoc.open();
        Font myfont = new Font();
        myfont.setStyle(Font.NORMAL);
        myfont.setSize(11);
        pdfDoc.add(new Paragraph("\n"));
        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            String strLine;
            while ((strLine = br.readLine()) != null) {
                Paragraph para = new Paragraph(strLine + "\n", myfont);
                para.setAlignment(Element.ALIGN_JUSTIFIED);
                pdfDoc.add(para);
            }
        }
        pdfDoc.close();
    }
}
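The examples write into src/output, and neither PrintWriter nor FileOutputStream creates missing parent directories, so a missing folder leads to a FileNotFoundException. A minimal JDK-only sketch for writing the extracted text safely could look like this; the class TextOutput and the method writeUtf8 are hypothetical names, not part of the original project.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (an assumption for illustration, not from the original article).
class TextOutput {

    // Creates any missing parent directories, then writes the text as UTF-8.
    static Path writeUtf8(Path target, String text) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // A temporary folder stands in for src/output; the nested folder does not exist yet.
        Path dir = Files.createTempDirectory("pdf-text");
        Path written = writeUtf8(dir.resolve("output").resolve("pdf.txt"), "Text extracted from the PDF");
        System.out.println(written);
    }
}
```

In generationOfTxtFromPDF, the string returned by PDFTextStripper.getText could be passed to a helper like this in place of the PrintWriter.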
4. PDF and DocX conversion
Two libraries are needed:
- iText: to extract the text from the PDF
- POI: to create the .docx document
ConversionOfPDF2WordExample.java
Java
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.poi.xwpf.usermodel.BreakType;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

public class ConversionOfPDF2WordExample {

    private static final String FILENAME = "src/main/resources/pdf.pdf";

    public static void main(String[] args) {
        try {
            generationOfDocFromPDF(FILENAME);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Extracts the text of each PDF page with iText and writes it to a .docx file with POI.
    private static void generationOfDocFromPDF(String filename) throws IOException {
        PdfReader reader = new PdfReader(filename);
        // try-with-resources closes the Word document and the output stream even on failure
        try (XWPFDocument doc = new XWPFDocument();
             FileOutputStream out = new FileOutputStream("src/output/pdf.docx")) {
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            int pages = reader.getNumberOfPages();
            // iText page numbers start at 1
            for (int i = 1; i <= pages; i++) {
                TextExtractionStrategy strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
                String text = strategy.getResultantText();
                XWPFParagraph p = doc.createParagraph();
                XWPFRun run = p.createRun();
                run.setText(text);
                // One Word page per PDF page; skip the break after the last page to avoid a trailing blank page
                if (i < pages) {
                    run.addBreak(BreakType.PAGE);
                }
            }
            doc.write(out);
        } finally {
            reader.close();
        }
    }
}
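In the example above, the whole text of a page goes into a single run. As far as I know, newline characters inside a run's text are not rendered as line breaks in Word, so the page may appear as one long paragraph. One possible refinement, sketched here with the JDK only, is to split the extracted page text into lines first and create one paragraph per line in the POI loop. The class PageTextLines and the method nonBlankLines are hypothetical names, not part of the original project.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (an assumption for illustration, not from the original article).
class PageTextLines {

    // Splits page text on any line break and drops blank lines.
    static List<String> nonBlankLines(String pageText) {
        List<String> lines = new ArrayList<>();
        // \R matches \n, \r\n, \r and the other Unicode line terminators
        for (String line : pageText.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String pageText = "Title\r\n\r\nFirst line of text\nSecond line of text\n";
        for (String line : nonBlankLines(pageText)) {
            // In the POI loop, each line would go to doc.createParagraph().createRun().setText(line)
            System.out.println(line);
        }
    }
}
```

Dropping blank lines is a design choice made for this sketch; keeping them as empty paragraphs would preserve more of the vertical spacing.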
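Conclusion
At many stages of a software project there is a need to convert text and images to PDF, and likewise to convert PDF data to text, image, and Docx formats. The examples above show one practical way to do this in Java.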
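Note: This article was selected and compiled by 纯净天空 from the original English work "How to Convert a Document to PDF in Java?" by priyarajtt. Unless otherwise stated, copyright of the original code belongs to the original author; do not reproduce or copy this translation without permission or authorization.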