Word to HTML: Final Version

Source: Internet | Editor: 程序博客网 | Date: 2024/06/05 14:31

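This final version handles documents from Word 2007 and later, including documents that contain tables and documents that Apache POI fails to convert.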

The project is managed with Maven. The dependencies are:

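<!-- Calls OpenOffice; mainly used for Word-to-PDF conversion -->
<dependency>
    <groupId>org.jodconverter</groupId>
    <artifactId>jodconverter-spring-boot-starter</artifactId>
    <version>4.0.0-RELEASE</version>
</dependency>
<!-- Used to read the HTML -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.9.2</version>
</dependency>
<!-- JTidy is used by formatHtml in the code but was not listed in the original post;
     these coordinates are an assumption -->
<dependency>
    <groupId>net.sf.jtidy</groupId>
    <artifactId>jtidy</artifactId>
    <version>r938</version>
</dependency>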

The code is as follows:

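// Imports were not shown in the original post; the package names below
// assume JODConverter 4.0.0-RELEASE, jsoup, JTidy, SLF4J and Spring.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.util.Collections;

import org.jodconverter.OfficeDocumentConverter;
import org.jodconverter.document.DocumentFamily;
import org.jodconverter.document.DocumentFormat;
import org.jodconverter.office.OfficeException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import org.w3c.tidy.Tidy;

@Component("jodFileTypeConverter")
public class JodFileTypeConverter {

    private static final Logger logger = LoggerFactory.getLogger(JodFileTypeConverter.class);

    @Autowired
    OfficeDocumentConverter documentConverter;

    public void office2html(File officeFile, File htmlFile) {
        try {
            DocumentFormat outputFormat = loadXhtml();
            documentConverter.convert(officeFile, htmlFile, outputFormat);
        } catch (OfficeException e) {
            logger.error("File conversion error:", e);
            throw new RuntimeException("File conversion failed, please try again later", e);
        }
    }

    // Builds an XHTML output format with one export filter per document family.
    private DocumentFormat loadXhtml() {
        DocumentFormat xhtml = new DocumentFormat("XHTML", "xhtml",
                "application/xhtml+xml");
        xhtml.setStoreProperties(DocumentFamily.TEXT,
                Collections.singletonMap("FilterName", "XHTML Writer File"));
        xhtml.setStoreProperties(DocumentFamily.SPREADSHEET,
                Collections.singletonMap("FilterName", "XHTML Calc File"));
        xhtml.setStoreProperties(DocumentFamily.PRESENTATION,
                Collections.singletonMap("FilterName", "XHTML Impress File"));
        xhtml.setStoreProperties(DocumentFamily.DRAWING,
                Collections.singletonMap("FilterName", "XHTML Draw File"));
        return xhtml;
    }

    /**
     * Converts the office file and returns the resulting HTML as a String.
     * The original post left both file paths blank, so they are passed in here.
     */
    public String getHtml(File officeFile, File htmlFile) throws IOException {
        this.office2html(officeFile, htmlFile);
        Document htmlDoc = Jsoup.parse(htmlFile, "UTF-8");
        return this.formatHtml(htmlDoc.outerHtml());
    }

    public String formatHtml(String con) throws IOException {
        // Encode explicitly as UTF-8 so the bytes match Tidy's input encoding
        ByteArrayInputStream stream = new ByteArrayInputStream(con.getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream tidyOutStream = new ByteArrayOutputStream();
        // Create the Tidy instance
        Tidy tidy = new Tidy();
        // Input encoding
        tidy.setInputEncoding("UTF-8");
        // If true, comments, warnings and error messages are not printed
        tidy.setQuiet(true);
        // Output encoding
        tidy.setOutputEncoding("UTF-8");
        // Do not show warnings
        tidy.setShowWarnings(false);
        // Indent the content of appropriate tags
        tidy.setIndentContent(true);
        // Smart indentation of content
        tidy.setSmartIndent(true);
        tidy.setIndentAttributes(false);
        // false: output the whole document, not only the content of <body>
        tidy.setPrintBodyOnly(false);
        // Line-wrap length
        tidy.setWraplen(1024);
        // Output as XHTML
        tidy.setXHTML(true);
        // XML-style output
        tidy.setXmlOut(true);
        // Remove unnecessary presentational markup
        tidy.setMakeClean(true);
        // Do not add the Tidy generator meta tag
        tidy.setTidyMark(false);
        // Clean up Word 2000 markup
        tidy.setWord2000(true);
        // Where error output goes
        tidy.setErrout(new PrintWriter(System.out));
        tidy.parse(stream, tidyOutStream);
        // Decode with the same charset Tidy was told to write
        String content = tidyOutStream.toString("UTF-8");
        Document doc = Jsoup.parse(content);
        // Drop the inline style of the first <div>, if there is one
        if (!doc.getElementsByTag("div").isEmpty()) {
            doc.getElementsByTag("div").first().removeAttr("style");
        }
        Document.OutputSettings outputSettings = new Document.OutputSettings();
        outputSettings.syntax(Document.OutputSettings.Syntax.xml);
        doc.outputSettings(outputSettings);
        content = doc.html();
        stream.close();
        tidyOutStream.close();
        return content;
    }
}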
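formatHtml hands the HTML to Tidy as bytes and reads the result back as bytes, with Tidy's input and output encodings both set to UTF-8. The sketch below uses only the JDK (no Tidy) to show why the String-to-bytes conversions on both sides should name the same charset. The class and method names are hypothetical and are not part of the original code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class CharsetRoundTrip {

    /** Encodes text with one charset, copies the bytes through streams, decodes with another. */
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) throws IOException {
        try (InputStream in = new ByteArrayInputStream(text.getBytes(encodeWith));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[1024];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return new String(out.toByteArray(), decodeWith);
        }
    }

    public static void main(String[] args) throws IOException {
        // "\u8868\u683c" is the Chinese word for "table", written with escapes
        String html = "<td>\u8868\u683c</td>";
        // Same charset on both sides: the text survives
        System.out.println(html.equals(
                roundTrip(html, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));      // true
        // Mismatched charsets: non-ASCII characters are garbled
        System.out.println(html.equals(
                roundTrip(html, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1))); // false
    }
}
```

On a machine whose default charset is not UTF-8, calling getBytes() or toString() without a charset has the same effect as the mismatched case, which is why passing the charset explicitly is the safer choice.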

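Images are handled by an internal image server, which is not covered here. After uploading the image files, replace the src attribute of each image (<img src="*.png">) with the address of the corresponding uploaded file.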
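The src replacement could look roughly like the following. This is a minimal sketch under stated assumptions: it uses a JDK regular expression instead of jsoup to stay dependency-free, it assumes the converted XHTML quotes attribute values with double quotes, and the class name, the map of uploaded addresses, and the img.example.com URL are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class ImgSrcRewriter {

    // Group 1: everything from "<img" up to the opening quote of src,
    // group 2: the src value, group 3: the closing quote.
    private static final Pattern IMG_SRC = Pattern.compile(
            "(<img\\b[^>]*?\\bsrc\\s*=\\s*\")([^\"]*)(\")", Pattern.CASE_INSENSITIVE);

    /**
     * Replaces each img src whose file name is a key of uploadedUrls with the
     * mapped address; images without a mapping are left unchanged.
     */
    static String rewrite(String html, Map<String, String> uploadedUrls) {
        Matcher m = IMG_SRC.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String src = m.group(2);
            String fileName = src.substring(src.lastIndexOf('/') + 1);
            String target = uploadedUrls.getOrDefault(fileName, src);
            m.appendReplacement(out,
                    Matcher.quoteReplacement(m.group(1) + target + m.group(3)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> uploaded = new LinkedHashMap<>();
        uploaded.put("a.png", "https://img.example.com/123/a.png"); // made-up address
        String html = "<p><img alt=\"x\" src=\"a.png\"/><img src=\"b.png\"/></p>";
        System.out.println(rewrite(html, uploaded));
    }
}
```

Since jsoup is already a dependency, the same replacement can also be done on the parsed document by selecting the img elements and setting their src attribute, which is more robust than a regular expression for markup that is not well-formed.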


0 0