iText parse html with RichText and images to pdf

Source: Internet. Published by: 湖南北大青鸟学校java. Editor: 程序博客网. Time: 2024/05/16 16:17

I use itextpdf to convert RichText content to PDF and ran into many issues. Here are the three I want to talk about:

1. Tables in RichText turn into a black box when using XMLWorkerHelper.

2. Line spacing in the PDF doesn't look the same as the HTML in the UI when using the <p> tag.

3. The position of images in the PDF doesn't follow the UI when handling the <img/> tag with the Image class and treating the rest of the content as one whole HTML.

issue1: Tables in RichText turn into a black box when using XMLWorkerHelper.

former code (with issue):

    Document doc = new Document(PageSize.LETTER); // create a new doc
    PdfWriter writer = PdfWriter.getInstance(doc, os); // create a writer associated with doc
    doc.open(); // open the doc
    String content = getContent(paper.getContentId());

    // XMLWorker approach
    InputStream is = IOUtils.toInputStream(content);
    XMLWorkerHelper helper = XMLWorkerHelper.getInstance();
    helper.parseXHtml(writer, doc, is);

changed code (fixes the issue):

    Document doc = new Document(PageSize.LETTER); // create a new doc
    PdfWriter writer = PdfWriter.getInstance(doc, os); // create a writer associated with doc
    doc.open(); // open the doc
    String content = getContent(paper.getContentId());

    // HTMLWorker approach
    HTMLWorker htmlWorker = new HTMLWorker(doc);
    htmlWorker.parse(new StringReader(content));

Summary: Although HTMLWorker is deprecated and XMLWorkerHelper is newer, XMLWorkerHelper seems to handle plain text well but doesn't work well with certain content such as tables. The easiest way is to treat the content you want to convert to PDF as HTML and parse it with HTMLWorker, because the result then looks the same as the RichText does in HTML.

issue2: Line spacing in the PDF doesn't look the same as the HTML in the UI when using the <p> tag.

This issue happens because a <p> tag is taller than a <br> in HTML, while a <p> tag has the same height as a <br> in the PDF.

Solution: pad each <p> tag with one more <br>.

    private static String handlePTag(String content) {
        // drop empty paragraphs
        content = content.replaceAll("<p></p>", "").replaceAll("<P></P>", "");
        // add a <br> before every opening <p ...> tag
        content = content.replaceAll("<p", "<br><p").replaceAll("<P", "<br><P");
        // add a <br> after every closing </p> tag
        content = content.replaceAll("</p>", "</p><br>").replaceAll("</P>", "</P><br>");
        // adjacent paragraphs would now have two <br> between them; keep only one
        content = content.replaceAll("</p><br>\\s*<br><p", "</p><br><p")
                         .replaceAll("</P><br>\\s*<br><P", "</P><br><P");
        return content;
    }
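One caveat with handlePTag: the literal prefix "<p" also matches other tags that start with p, such as <pre> or <param>. A minimal sketch of the same padding with a stricter, case-insensitive match is shown here. The class name PTagSpacing and the method name addBreaks are made up for illustration, and the sketch uses only java.util.regex, not iText.

```java
import java.util.regex.Pattern;

final class PTagSpacing {
    // Exactly empty paragraphs, as in the original handlePTag.
    private static final Pattern EMPTY_P = Pattern.compile("(?i)<p></p>");
    // "<p" only when followed by whitespace or ">", so <pre> and <param> are skipped.
    private static final Pattern OPEN_P = Pattern.compile("(?i)<p(?=[\\s>])");
    private static final Pattern CLOSE_P = Pattern.compile("(?i)</p>");
    // "</p><br>", optional whitespace, "<br>", then another opening <p>.
    private static final Pattern DOUBLE_BR =
            Pattern.compile("(?i)(</p><br>)\\s*<br>(?=<p[\\s>])");

    static String addBreaks(String html) {
        String s = EMPTY_P.matcher(html).replaceAll("");
        s = OPEN_P.matcher(s).replaceAll("<br>$0");   // <br> before each <p ...>
        s = CLOSE_P.matcher(s).replaceAll("$0<br>");  // <br> after each </p>
        s = DOUBLE_BR.matcher(s).replaceAll("$1");    // keep one <br> between paragraphs
        return s;
    }

    public static void main(String[] args) {
        // prints <br><p>a</p><br><p>b</p><br>
        System.out.println(addBreaks("<p>a</p><p>b</p>"));
        // prints <pre>x</pre>
        System.out.println(addBreaks("<pre>x</pre>"));
    }
}
```

For plain <p> input this gives the same result as handlePTag; it only differs for tags like <pre>, which it leaves untouched.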

issue3: The position of images in the PDF doesn't follow the UI when handling the <img/> tag with the Image class and treating the rest of the content as one whole HTML.

Description: This issue happens because we handle the <img/> tags with the Image class and convert the rest of the RichText content to PDF as one whole HTML. As a result, all the images are added either in front of all the other content (e.g. text) or after it, and the position of the images doesn't follow what the user entered in the RichText in the UI.

Solution: ImageProvider. We provide a class that handles the <img/> tag and makes the appropriate changes while each <img/> tag is parsed, using the parameters the ImageProvider interface passes in. (We can read the src attribute of every <img/> tag, extract the image id from it, and load the Image object by that id. So we return an Image object while an <img/> tag is being parsed, and the doc adds the corresponding Image object at the right position wherever an <img/> with that id appears.)

Note: The ImageProvider approach is from the book 'iText in Action'. There is one tricky problem when trying it. The first thing you are likely to test is content like this: content = "<img src=\"a.jpg\"/>". But it doesn't work, for two reasons. First, you have to put a.jpg in the right directory (sometimes that is directly under the folder of your model project, and sometimes directly under the root drive, e.g. E:\, of your class). Second, you have to set a height and width on the img; otherwise the image never shows in your PDF when its height or width is bigger than the PDF page, e.g. content = "<img height=\"300\" width=\"300\" src=\"a.jpg\"/>".

Later you can see I set it in my ImageProvider:

image.scaleToFit(300f, 300f);
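To see what a 300 x 300 box does to an oversized image, here is a small sketch of proportional scaling in plain Java, using the 1897 x 2160 image from the example test content further down. It reflects my understanding of what scaleToFit does (one scale factor for both axes, the smaller of the two ratios), but it does not call iText. The class name FitBox and the method name fit are made up for illustration.

```java
final class FitBox {
    /** Returns {width, height} after scaling proportionally into the given box. */
    static float[] fit(float width, float height, float boxWidth, float boxHeight) {
        // Use the smaller ratio so that both sides end up inside the box.
        float scale = Math.min(boxWidth / width, boxHeight / height);
        return new float[] { width * scale, height * scale };
    }

    public static void main(String[] args) {
        // 1897 x 2160 is larger than a LETTER page (612 x 792 points).
        float[] size = fit(1897f, 2160f, 300f, 300f);
        System.out.println(size[0] + " x " + size[1]);
    }
}
```

The taller side is reduced to the box height and the width shrinks by the same factor, so the aspect ratio is kept.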

former code (with issue):

    Document doc = new Document(PageSize.LETTER); // create a new doc
    PdfWriter writer = PdfWriter.getInstance(doc, os); // create a writer associated with doc
    doc.open(); // open the doc
    String content = getContent(paper.getContentId());
    content = handleImageContent(doc, content);

    // HTMLWorker approach
    HTMLWorker htmlWorker = new HTMLWorker(doc);
    htmlWorker.parse(new StringReader(content));

some other related methods :

    private String handleImageContent(Document doc, String content)
            throws BadElementException, MalformedURLException, IOException, DocumentException {
        // String content1 = "<br/><img src=\"/mps/imageServlet?id=1131\"/><br/><br/><br/><a href=\"/mps/attachmentServlet?id=1132&name=a.jpeg\" type=\"image/jpeg\" target=\"_blank\">a.jpeg</a><br/>";
        List<String> ids = getImageIdList(content); // get all image id list
        List<Image> images = new ArrayList<Image>();
        RichTextService richTextService = (RichTextService) BeanLocator.getBean("richTextService");
        for (String id : ids) {
            logger.debug("IUploadService.handleImageContent->ids.id:" + id);
            OutputStream os = new ByteArrayOutputStream();
            richTextService.getImage(Integer.parseInt(id), os);
            images.add(getImgObjFromIO(os));
        }
        // handle <img/> tag with Image class and add all the images to doc
        addImagesToPDF(doc, images);
        // remove src attribute from <img/> tag, so <img/> won't show images
        // and doesn't show an error if the image location doesn't exist
        return removeImageFromContent(content, ids);
    }

    private void addImagesToPDF(Document doc, List<Image> images) throws DocumentException {
        for (Image img : images) {
            doc.add(img);
        }
    }

changed code (fixes the issue):

    Document doc = new Document(PageSize.LETTER); // create a new doc
    PdfWriter writer = PdfWriter.getInstance(doc, os); // create a writer associated with doc
    doc.open(); // open the doc

    // ImageProvider approach
    Paragraph p = new Paragraph();
    HashMap<String, Object> map = new HashMap<String, Object>();
    map.put(HTMLWorker.IMG_PROVIDER, new ImgProvider());
    List<Element> list = HTMLWorker.parseToList(new StringReader(handlePTag(content)), null, map);
    for (Element e : list) {
        p.add(e);
    }
    doc.add(p);

ImageProvider:

    package sg.gov.cpf.corpapp.iwr.mps.common.util;

    import com.itextpdf.text.DocListener;
    import com.itextpdf.text.Image;
    import com.itextpdf.text.html.simpleparser.ChainedProperties;
    import com.itextpdf.text.html.simpleparser.ImageProvider;

    import java.io.ByteArrayOutputStream;
    import java.io.OutputStream;
    import java.util.Map;

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;

    import sg.gov.cpf.corpapp.iwr.mps.common.base.BeanLocator;
    import sg.gov.cpf.corpapp.iwr.mps.service.richtext.RichTextService;

    public class ImgProvider implements ImageProvider {

        private static Log logger = LogFactory.getLog(ImgProvider.class);

        // called by HTMLWorker for every <img/> tag; "string" is the src attribute
        public Image getImage(String string, Map<String, String> map,
                              ChainedProperties chainedProperties,
                              DocListener docListener) {
            System.out.println("ImgProvider.getImage()->string : " + string);
            logger.debug("ImgProvider.getImage()->string" + string);
            // the image id is whatever follows "id=" in the src
            String id = string.substring(string.indexOf("id=") + 3);
            System.out.println("ImgProvider.getImage()->id : " + id);
            logger.debug("ImgProvider.getImage()->id : " + id);
            RichTextService richTextService = (RichTextService) BeanLocator.getBean("richTextService");
            OutputStream os = new ByteArrayOutputStream();
            richTextService.getImage(Integer.parseInt(id), os);
            Image image = null;
            byte by[] = ((ByteArrayOutputStream) os).toByteArray();
            try {
                image = Image.getInstance(by);
                image.scaleToFit(300f, 300f);
                os.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
            return image;
        }
    }
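The id extraction in getImage takes everything after "id=", so it assumes the id is the last thing in the src. A URL with a trailing parameter, like the attachmentServlet?id=1132&name=a.jpeg link in the commented-out sample earlier, would make Integer.parseInt throw. The following is a minimal sketch of a stricter extraction, using only the JDK. The class name ImageIdParser and the method name parseId are made up for illustration and are not part of the original project.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ImageIdParser {
    // An "id" query parameter introduced by '?', '&' or '&amp;', with a numeric value.
    private static final Pattern ID_PARAM = Pattern.compile("(?:[?&]|&amp;)id=(\\d+)");

    /** Returns the numeric id from an image src, or empty if none is found. */
    static OptionalInt parseId(String src) {
        if (src == null) {
            return OptionalInt.empty();
        }
        Matcher m = ID_PARAM.matcher(src);
        if (!m.find()) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(m.group(1)));
        } catch (NumberFormatException e) {
            // digits only, but too large for an int
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseId("/mps/imageServlet?id=1131"));
        System.out.println(parseId("/mps/attachmentServlet?id=1132&name=a.jpeg"));
        System.out.println(parseId("a.jpg"));
    }
}
```

An image provider could then return null, or a placeholder image, when parseId comes back empty instead of failing on the parse.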

example test content:

    content = "<img src=\"/mps/imageServlet?id=1131\"/>";
    content = "<p><img style=\"WIDTH: 429px; HEIGHT: 1402px\" src=\"http://intrauat.cpf.gov.sg/mps/imageServlet?id=1397\" width=\"1897\" height=\"2160\"/></p>";

Runnable Test Class for Issue 3:

    package sg.gov.cpf.corpapp.iwr.mps.service;

    import com.itextpdf.text.DocListener;
    import com.itextpdf.text.Document;
    import com.itextpdf.text.Element;
    import com.itextpdf.text.Image;
    import com.itextpdf.text.PageSize;
    import com.itextpdf.text.html.simpleparser.ChainedProperties;
    import com.itextpdf.text.html.simpleparser.HTMLWorker;
    import com.itextpdf.text.html.simpleparser.ImageProvider;
    import com.itextpdf.text.html.simpleparser.StyleSheet;
    import com.itextpdf.text.pdf.PdfWriter;

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.StringReader;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class TestPdf implements ImageProvider {

        // called for every <img/> tag; "string" is the src attribute
        public Image getImage(String string, Map<String, String> map,
                              ChainedProperties chainedProperties,
                              DocListener docListener) {
            System.out.println(string);
            Image image = null;
            try {
                image = Image.getInstance(string);
                image.scaleToFit(300f, 300f);
            } catch (Exception e) {
                e.printStackTrace();
            }
            return image;
        }

        public static void testGeneratePdf() throws Exception {
            String content =
                "Testing Img<br/><p><img src=\"a.jpg\" width='300' height='300'/></p><p><br/>middle<br/><img src=\"b.jpg\" width='300' height='300'/></p>end";
            // String content = "Testing Img<br/><p><img src=\"/mps/imageServlet?id=1397\"/>";
            Document doc = new Document(PageSize.LETTER);
            File f = new File("e:/dayna.pdf");
            System.out.println(f.getAbsolutePath());
            PdfWriter.getInstance(doc, new FileOutputStream(f));
            doc.open();
            StyleSheet ss = new StyleSheet();
            HashMap<String, Object> map = new HashMap<String, Object>();
            map.put(HTMLWorker.IMG_PROVIDER, new TestPdf());
            List<Element> list =
                HTMLWorker.parseToList(new StringReader(content), ss, map);
            for (Element e : list) {
                doc.add(e);
            }
            doc.close();
        }

        public static void main(String[] args) throws Exception {
            testGeneratePdf();
        }
    }

a.jpg and b.jpg are put at location : E:\Workspace\mywork\MPS_SG_LOCAL\CpfAppsMpsModel
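The location above is presumably the working directory of the JVM when the test ran, since a relative file name like a.jpg is normally resolved against it. That would also explain why the "right" directory differs depending on how the program is started. A quick way to check where a relative src ends up is sketched here with plain java.nio; the class name ImagePathCheck and the method name resolve are made up for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class ImagePathCheck {
    /** Resolves a relative src against the current working directory. */
    static Path resolve(String src) {
        return Paths.get(src).toAbsolutePath().normalize();
    }

    public static void main(String[] args) {
        for (String src : new String[] { "a.jpg", "b.jpg" }) {
            Path p = resolve(src);
            String state = Files.isReadable(p) ? "found" : "missing";
            System.out.println(src + " -> " + p + " (" + state + ")");
        }
    }
}
```

Running this from the same launch configuration as the PDF test shows which folder the images have to be copied to.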

Result:

0 0

iText parse html with RichText and images to pdf

I use itextpdf to convert RichText to pdf and encountered many issues. Here are the three issues I want to talk about :

issue1:Tables in RichText turns into black box while using XMLWorkerHelper.

issue2 : Line spacing in pdf doesn't look the same as html from the UI while using <p> tag.

issue3:Position of Images in pdf doesn't follow the UI while handling <img/> tag with Image Class and treating the other content as a whole html.