Analysis and Implementation of a Patent Downloader in Java under Eclipse

Source: Internet  Published by: 技术支持盘古网络0310  Editor: 程序博客网  Time: 2024/05/22 16:52

I worked on this recently and am posting it online so that I don't lose it.

Anyone who is interested is welcome to discuss it.

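I. Design Requirements

This is the initial design idea for a patent downloader requested by a company:

"Patent download" here really means downloading the patent specification. These documents can be looked up on the website of the State Intellectual Property Office (SIPO) and are free and public. Viewing them there has a serious drawback, though: they can only be read page by page as images (TIF) and cannot be saved locally. For someone who has to look up more than a hundred such documents a day, that is a nightmare. We want a small tool that solves this problem. The basic principle is:

Patent number --> work out the real download URLs of the TIF images --> merge the TIF images into a PDF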

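Approach:

1. Open the SIPO home page http://www.sipo.gov.cn in IE. There is a patent search box on the right; enter a patent number such as 200410055632.4.

2. Click the search result. At the top there is a link named "申请公开说明书" (published application specification) or "审定授权说明书" (granted patent specification); click either one.

3. The page prompts you to install a browser plug-in. Once it is installed, the specification (TIF images) can be viewed.

The page opened in step 3 (the final page that displays the patent document) has a URL of this form: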

http://211.157.104.87:8080/sipo/zljs/hyjs-yx-new.jsp?recid=CN200410055632.4&leixin=fmzl&title=LCD%C6%F7%BC%FE%BA%CDLCD%CD%B6%D3%B0%D2%C7&ipc=G02F1/136#

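What we need to do is work out the real source addresses of the images shown on this page (right-click and view the page source). They follow a regular pattern; for example, pages 1 to 3 are downloaded from: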

http://211.157.104.92:808/books/fm/2005/2107/200410055632/000001.tif

http://211.157.104.92:808/books/fm/2005/2107/200410055632/000002.tif

http://211.157.104.92:808/books/fm/2005/2107/200410055632/000003.tif

Finally, merge the downloaded TIF images into a PDF and the job is done. The idea itself is simple.
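The page addresses above differ only in a six-digit, zero-padded page number, so they can be generated rather than typed out. The following is a minimal sketch under that assumption; the class name PageUrlSketch and the method pageUrls are made up for illustration and are not part of the original program.

```java
import java.util.ArrayList;
import java.util.List;

public class PageUrlSketch {

    /**
     * Builds the URL of every page, assuming pages are named
     * 000001.tif, 000002.tif, ... under the given base directory.
     */
    public static List<String> pageUrls(String baseUrl, int pageCount) {
        List<String> urls = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            // %06d pads the page number with leading zeros to six digits
            urls.add(baseUrl + String.format("%06d", page) + ".tif");
        }
        return urls;
    }

    public static void main(String[] args) {
        String base = "http://211.157.104.92:808/books/fm/2005/2107/200410055632/";
        for (String url : pageUrls(base, 3)) {
            System.out.println(url);
        }
    }
}
```

Formatting the number in one place avoids separate handling for one-digit and two-digit page numbers.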

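II. Design Analysis

This was a tricky problem. At first I had no idea where to begin and was completely in the dark, but everything starts from zero, haha.

1. First I had to determine how to convert images into a readable PDF document in software. After some research I learned that the Java iText library can produce PDFs from both text and images. Let's first check that a PDF can be generated at all, then return to the questions above.

PS: To install iText, download iText.jar and iTextAsian.jar (Asian font support) and add both JARs to the project's build path in Eclipse; the iText classes can then be imported.

Analysis:

· Generating a PDF from text: the classic Hello World!

· Code: HelloWorld.java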

package com.keith.test;

import java.io.FileNotFoundException;
import java.io.FileOutputStream;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;

public class HelloWorld {

    private static final String URL = "D:/ITEXT/helloworld.pdf"; // output path

    public static void main(String[] args) {
        Document document = new Document(); // create the Document object
        try {
            // Attach a writer that streams the document to the output file
            PdfWriter.getInstance(document, new FileOutputStream(URL));
            document.open();
            document.add(new Paragraph("Hello World!"));
            document.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (DocumentException e) {
            e.printStackTrace();
        }
    }
}

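[Screenshot]

2. So this library really can generate a PDF automatically. Next, let's look at the address given in the requirements, which serves TIF images, and see whether several TIF images can be converted into one PDF.

Analysis:

With the plug-in installed, the images can be seen in the patent specification. Right-click to view the page source, and the following image address can be found in it:

http://211.157.104.92:808/books/fm/2005/2107/200410055632/000001.tif

This is the first image; the second is 000002.tif, and so on. The specification for this patent number has 26 images in total.

· Next, let's use these image addresses to build the PDF of the specification for patent number 200410055632.

· Code:

TifftoPdf.java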

import java.io.FileOutputStream;
import java.io.IOException;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Element;
import com.itextpdf.text.Image;
import com.itextpdf.text.pdf.PdfWriter;

public class TifftoPdf {

    private static final String BASE =
            "http://211.157.104.92:808/books/fm/2005/2107/200410055632/";
    private static final int PAGE_COUNT = 26;

    public static void main(String[] args) {
        Document pdfDoc = new Document();
        try {
            // Output stream for the PDF file to be generated
            FileOutputStream pdfFile = new FileOutputStream("D:/ITEXT/TIFtoPDF/Tifpdf.pdf");
            // Bind the Document to the output stream through a PdfWriter
            PdfWriter.getInstance(pdfDoc, pdfFile);
            pdfDoc.open(); // open the document
            for (int i = 1; i <= PAGE_COUNT; i++) {
                // %06d produces 000001, 000002, ..., so pages below and above 10
                // no longer need separate branches
                Image image = Image.getInstance(BASE + String.format("%06d", i) + ".tif");
                image.setAlignment(Element.ALIGN_CENTER);
                // In testing, the first page rendered larger than the others,
                // so it gets its own scale factor
                image.scalePercent(i == 1 ? 13 : 20);
                pdfDoc.add(image); // add one TIF image
            }
        } catch (DocumentException de) {
            System.err.println(de.getMessage());
        } catch (IOException ioe) {
            System.err.println(ioe.getMessage());
        } finally {
            if (pdfDoc.isOpen()) {
                pdfDoc.close();
            }
        }
    }
}

// Completed by 林祎澄 at 3 a.m. on 2010-10-18!

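I did not take the time to streamline the code, so it is still somewhat verbose. Screenshots:

[Screenshot: first page]

[Screenshot: last page]

The 26 TIF images of the specification were downloaded and turned into a PDF document that is very convenient to read. These two tests show that a patent downloader can be built, so the next step is to analyze systematically how to derive the network address of the specification from the patent number.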

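III. Resolving the Network Address

1. To resolve the address we first have to understand how the search works. The SIPO site (中华人民共和国国家知识产权局, http://www.sipo.gov.cn) has a search box; entering a patent number there leads to a result page.

Clicking the result leads to a second page.

Only after clicking "审定授权说明书 (26)页" (granted patent specification, 26 pages) on that page can the document be seen.

Right-clicking there shows the page source.

2. These steps can be summarized as:

(1) Enter the patent number.

(2) Use the patent number to fetch the search result page (page 1).

(3) Click the patent number to reach page 2.

(4) On page 2, click the specification link to reach the page we need.

However, this approach means driving the site's pages step by step, which is rather troublesome and beyond what I can implement at the moment.

So let's analyze further. The link of the third page is: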

http://211.157.104.87:8080/sipo/zljs/hyjs-yx-new.jsp?recid=CN200410055632.4&leixin=fmzl&title=LCD器件和LCD投影仪&ipc=G02F1/136#

Its query string has the form recid=**&**&**&, that is, several parameters, each of which can identify the page. It can be broken down into:

1、http://211.157.104.87:8080/sipo/zljs/hyjs-yx-new.jsp?recid=CN200410055632.4

2、http://211.157.104.87:8080/sipo/zljs/hyjs-yx-new.jsp?recid= leixin=fmzl

3、http://211.157.104.87:8080/sipo/zljs/hyjs-yx-new.jsp?recid=title=LCD器件和LCD投影仪

4、http://211.157.104.87:8080/sipo/zljs/hyjs-yx-new.jsp?recid=ipc=G02F1/136#

All four of these links work, so resolving link 1 is enough.
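To make the breakdown concrete, here is a small sketch that splits the query string of the full link into its parameters and rebuilds link 1 from the recid parameter alone. The class QueryParamsSketch and its methods are illustrative names, not part of the original program, and the splitting is deliberately naive (no percent-decoding).

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryParamsSketch {

    /** Splits the query part of a URL into name/value pairs, ignoring a trailing '#'. */
    public static Map<String, String> queryParams(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = url.indexOf('?');
        if (q < 0) {
            return params; // no query string at all
        }
        String query = url.substring(q + 1);
        int hash = query.indexOf('#');
        if (hash >= 0) {
            query = query.substring(0, hash);
        }
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            if (eq < 0) {
                params.put(pair, "");
            } else {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    /** Rebuilds the shortest form of the link, keeping only the recid parameter. */
    public static String recidOnly(String url) {
        String recid = queryParams(url).get("recid");
        if (recid == null) {
            throw new IllegalArgumentException("no recid parameter in " + url);
        }
        return url.substring(0, url.indexOf('?')) + "?recid=" + recid;
    }

    public static void main(String[] args) {
        String full = "http://211.157.104.87:8080/sipo/zljs/hyjs-yx-new.jsp"
                + "?recid=CN200410055632.4&leixin=fmzl"
                + "&title=LCD%C6%F7%BC%FE%BA%CDLCD%CD%B6%D3%B0%D2%C7&ipc=G02F1/136#";
        System.out.println(queryParams(full).keySet());
        System.out.println(recidOnly(full));
    }
}
```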

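3. To parse the page source of link 1, we also need a program that fetches a page's source. The Java program below, which downloads page source, was copied from the Internet. Many thanks to its author, because a lot of the source code found online does not work.

· Code: SourceHtml.java (the public class is named Zhua, so the file has to be saved as Zhua.java to compile)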

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.Authenticator; // supplies the user name and password
import java.net.HttpURLConnection;
import java.net.PasswordAuthentication;
import java.net.URL;

public class Zhua {

    // Public method: returns the page content as a string,
    // or "error open url:..." if anything goes wrong
    public static String getContent(String strUrl) {
        try {
            URL url = new URL(strUrl);
            // Open the URL and read it as UTF-8 text
            BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream(), "utf-8"));
            String s;
            StringBuilder sb = new StringBuilder();
            while ((s = br.readLine()) != null) {
                sb.append(s).append("\r\n");
            }
            br.close();
            return sb.toString();
        } catch (Exception e) {
            return "error open url:" + strUrl;
        }
    }

    public static void initProxy(String host, int port, final String username,
            final String password) {
        // setDefault() installs an Authenticator that supplies the user name and password
        Authenticator.setDefault(new Authenticator() {
            protected PasswordAuthentication getPasswordAuthentication() {
                return new PasswordAuthentication(username, password.toCharArray());
            }
        });
        System.setProperty("http.proxyType", "4");
        System.setProperty("http.proxyPort", Integer.toString(port));
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxySet", "true");
    }

    public static void main(String[] args) throws IOException {
        String url = "http://www.hao123.com";
        String proxy = "";
        int port = 8080;
        String username = "username";
        String password = "password";
        String curLine;
        StringBuilder content = new StringBuilder();
        URL server = new URL(url);
        initProxy(proxy, port, username, password);
        HttpURLConnection connection = (HttpURLConnection) server.openConnection();
        connection.connect();
        InputStream is = connection.getInputStream();
        BufferedReader reader = new BufferedReader(new InputStreamReader(is));
        // Read the response line by line, restoring a line break after each line
        while ((curLine = reader.readLine()) != null) {
            content.append(curLine).append("\r\n");
        }
        System.out.println("content= " + content);
        is.close();
    }
}

This program does the same thing as right-clicking a web page and choosing "View source".
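One detail worth noting in the program above: getContent decodes the page as UTF-8, while main relies on the platform default charset. The sketch below shows one way to read a stream with the charset passed in explicitly. The class ReadAllSketch is a hypothetical helper, and which charset the SIPO pages actually use is not stated here, so the UTF-8 in the demo is only a placeholder.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReadAllSketch {

    /** Reads the whole stream as text in the given charset, appending "\r\n" after each line. */
    public static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader (and the stream) even on failure
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append("\r\n");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // An in-memory stand-in for connection.getInputStream()
        byte[] page = "<html>\n<body>demo</body>\n</html>".getBytes(StandardCharsets.UTF_8);
        System.out.print(readAll(new ByteArrayInputStream(page), StandardCharsets.UTF_8));
    }
}
```

With a real connection, the stream returned by connection.getInputStream() would take the place of the ByteArrayInputStream.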

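4. Next, plug the URL we need into the program: http://211.157.104.87:8080/sipo/zljs/hyjs-yx-new.jsp?recid=CN200410055632.4

and set String proxy to 211.157.104.87 as the proxy.

This retrieves the source of the whole page, which happens to contain

<a href=# onclick=javascript:browsetif('fm','26','books/fm/2005/2107/200410055632/000001.tif');>

This line contains the page count '26' as well as the image path 'books/fm/2005/2107/200410055632/000001.tif' (relative to the root of the image server).

These strings have to be extracted with regular expressions. After this analysis I combined all of the code above and wrote the following program.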

· Code: GetPdf.java

(The com.tiftopdf package contains GetUrlInfomation.java, Transpdf.java, Zhze.java and ZhzeSzu.java.)

(GetUrlInfomation.java: sets the proxy, the port number and other URL-related settings;

Transpdf.java: the routine that builds the PDF;

Zhze.java: uses a regular expression to extract this line:

<a href=# onclick=javascript:browsetif('fm','26','books/fm/2005/2107/200410055632/000001.tif');>

ZhzeSzu.java: splits out

26 and books/fm/2005/2107/200410055632/ and stores them in an array

)
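The sources of Zhze.java and ZhzeSzu.java are not listed in this post. As a rough illustration of the two extraction steps they are described as performing, here is a self-contained sketch using java.util.regex. The class BrowseTifSketch and its parse method are hypothetical and are not the author's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BrowseTifSketch {

    // Step 1: match an <a href=# ...> tag, as in the page source quoted above
    private static final Pattern ANCHOR = Pattern.compile("<a\\s+href=#[^>]*>");
    // Step 2: match each single-quoted argument inside that tag
    private static final Pattern QUOTED = Pattern.compile("'(.*?)'");

    /**
     * Returns {pageCount, imageDirectory} taken from the first matching tag
     * that calls browsetif(...), or null if no such tag is found.
     */
    public static String[] parse(String html) {
        Matcher anchor = ANCHOR.matcher(html);
        while (anchor.find()) {
            String tag = anchor.group();
            if (!tag.contains("browsetif(")) {
                continue; // some other link; keep looking
            }
            List<String> args = new ArrayList<>();
            Matcher quoted = QUOTED.matcher(tag);
            while (quoted.find()) {
                args.add(quoted.group(1));
            }
            if (args.size() < 3) {
                continue; // not the three-argument call we expect
            }
            String path = args.get(2); // e.g. books/.../000001.tif
            String dir = path.substring(0, path.lastIndexOf('/') + 1);
            return new String[] { args.get(1), dir };
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<td><a href=# onclick=javascript:browsetif('fm','26',"
                + "'books/fm/2005/2107/200410055632/000001.tif');>view</a></td>";
        String[] result = parse(html);
        System.out.println(result[0] + " pages under " + result[1]);
    }
}
```

Returning null when nothing matches mirrors the way GetPdf treats a missing result as an invalid patent number.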

// GetPdf.java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

import com.itextpdf.text.Document;
import com.tiftopdf.*;

public class GetPdf {

    static String d; // image directory parsed from the page
    static int e;    // page count parsed from the page

    public static void main(String[] args) throws IOException {
        int i;
        // One reader for System.in, shared by every retry
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        System.out.println("Enter your patent number:");
        do {
            i = 0; // reset the retry flag so that a valid number ends the loop
            String j = br.readLine();
            String url = "http://211.157.104.87:8080/sipo/zljs/hyjs-yx-new.jsp?recid=CN" + j;
            String proxy = "http://211.157.104.87";
            int port = 80;
            String username = "username";
            String password = "password";
            String curLine;
            StringBuilder content = new StringBuilder();
            URL server = new URL(url);
            GetUrlInfomation.initProxy(proxy, port, username, password);
            HttpURLConnection connection = (HttpURLConnection) server.openConnection();
            connection.connect();
            // Open the connection to this URL and get an InputStream for reading from it
            InputStream is = connection.getInputStream();
            BufferedReader reader = new BufferedReader(new InputStreamReader(is));
            while ((curLine = reader.readLine()) != null) {
                content.append(curLine).append("\r\n");
            }
            reader.close();
            // Step 1: pull the <a href=# ...> tag out of the page source
            String bds = "<a\\s+href=#[^>]*>";
            Zhze a1 = new Zhze(content.toString());
            String result1 = a1.Zhengze(bds);
            // Step 2: split out the single-quoted arguments of browsetif(...)
            bds = "'(.*?)'";
            ZhzeSzu a2 = new ZhzeSzu(result1);
            a2.ZzSzu(bds);
            d = a2.Zhushi;
            e = a2.Yema;
            if (d != null) {
                Transpdf transpdf = new Transpdf();
                Document pdfDoc = new Document();
                transpdf.Ttiftopdf(d, j, e, pdfDoc);
                System.out.println("The specification has " + e + " pages in total,\n"
                        + "the PDF is ready to read!");
            } else {
                System.out.println("Invalid patent number! Please enter it again below:");
                i = -1;
            }
        } while (i == -1);
    }
}
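GetPdf sends whatever was typed straight to the server and only learns that the number is wrong when nothing can be parsed from the response. A cheap local check before the request could look like the sketch below. The expected shape (12 digits, a dot, one check character) is an assumption inferred from the single sample 200410055632.4, and PatentNumberSketch is a hypothetical helper; other number formats may well exist.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatentNumberSketch {

    // Assumed shape, inferred from the sample 200410055632.4:
    // 12 digits, a dot, then one check character (a digit or X).
    private static final Pattern NUMBER = Pattern.compile("(\\d{12})\\.([0-9Xx])");

    /** Returns true if the input has the assumed shape, ignoring surrounding whitespace. */
    public static boolean looksValid(String input) {
        return input != null && NUMBER.matcher(input.trim()).matches();
    }

    /** Returns the 12-digit part, which is also the image directory name in the sample URLs. */
    public static String digitsOnly(String input) {
        Matcher m = NUMBER.matcher(input.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("unexpected patent number: " + input);
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        String sample = "200410055632.4";
        System.out.println(looksValid(sample));
        System.out.println(digitsOnly(sample));
    }
}
```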

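The output of the run:

[Screenshot: program output]

The generated PDF file is placed in

[Screenshot: output folder]

OK, it runs successfully!

There are still a few things to improve, however:

1. Change the naming of the downloaded PDF: obtain the patent's Chinese title and name the file <Chinese title>.pdf.

2. Split the functionality into separate parts so that each piece has a clear role, is easier to call, and is easier to read. (done)

3. Simplify some of the code to make it more concise. (done)

4. Build a graphical user interface. This is the ultimate goal!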
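For item 1, one possible starting point is the title parameter of the page URL, which appears earlier in this post both percent-encoded and as "LCD器件和LCD投影仪". The sketch below decodes such a value and turns it into a file name. It assumes the bytes are GBK-encoded, which matches the sample but is not confirmed for every page, and it assumes a JVM that ships the GBK charset. TitleFileNameSketch and its methods are illustrative names only.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class TitleFileNameSketch {

    /** Decodes a percent-encoded title, assuming the bytes are GBK as in the sample URL. */
    public static String decodeTitle(String encoded) {
        try {
            return URLDecoder.decode(encoded, "GBK");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("GBK charset is not available in this JVM", e);
        }
    }

    /** Replaces characters that Windows does not allow in file names and appends .pdf. */
    public static String toPdfFileName(String title) {
        return title.replaceAll("[\\\\/:*?\"<>|]", "_") + ".pdf";
    }

    public static void main(String[] args) {
        String title = decodeTitle("LCD%C6%F7%BC%FE%BA%CDLCD%CD%B6%D3%B0%D2%C7");
        System.out.println(toPdfFileName(title));
    }
}
```

The replacement step matters because a title may contain characters such as "/" that are legal in a patent title but not in a file name.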