Scraping Web Page Information and Generating an XML File (Using Online Lottery Data as an Example)

Source: Internet  Editor: 程序博客网  Time: 2024/04/30 08:47

1. Fetching the web page

Fetch the page with Apache HttpClient, passing in the page URL. The method returns the whole page as one string.

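// Imports needed at the top of the file:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

// Fetches the page at url and returns its body as one string ("" on failure).
public static String clientTest(String url) {
    @SuppressWarnings("deprecation")
    HttpClient hc = new DefaultHttpClient();
    HttpGet get = new HttpGet(url);
    String backContent = "";
    try {
        HttpResponse response = hc.execute(get);
        HttpEntity entity = response.getEntity();
        if (entity != null) {
            // Assumption: the page is UTF-8 encoded; change the charset if it is not.
            // try-with-resources closes the reader and the underlying stream.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(entity.getContent(), "UTF-8"))) {
                StringBuilder buffer = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) {
                    // Keep the line break so that adjacent lines do not run together
                    buffer.append(line).append('\n');
                }
                // The whole page content has been read
                backContent = buffer.toString();
            }
        }
    } catch (IOException e) {
        // ClientProtocolException is a subclass of IOException
        e.printStackTrace();
    }
    return backContent;
}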
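The reading step depends on decoding the response bytes with the right charset, otherwise Chinese text in the page can come out garbled. Below is a minimal sketch of reading a stream with an explicit charset, independent of HttpClient. The class name StreamText and the method readAll are hypothetical, and UTF-8 is an assumption about the page encoding rather than something the post states.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original post.
final class StreamText {

    // Reads every line from the stream, decoding the bytes with the given charset.
    static String readAll(InputStream is, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(is, charset))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n'); // keep a line break after each line
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two CJK characters written as Unicode escapes, encoded as UTF-8 bytes
        byte[] bytes = "<td>\u7ea2\u7403</td>".getBytes(StandardCharsets.UTF_8);
        System.out.print(readAll(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8));
    }
}
```

With HttpClient, the same helper could be fed entity.getContent() in place of the in-memory stream.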

2. Extracting the required part

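Pass in the fetched page content together with the start and end marker strings (for example "<tbody id=\"cpdata\">" and "</tbody>"). The method returns the text between them, with both markers included.
// Returns the text from beginstr through the first endstr that follows it,
// or "" when either marker is missing.
public static String sub(String str, String beginstr, String endstr) {
    int b = str.indexOf(beginstr);
    if (b < 0) {
        return "";
    }
    // Search for the end marker only after the start marker
    int e = str.indexOf(endstr, b + beginstr.length());
    if (e < 0) {
        return "";
    }
    return str.substring(b, e + endstr.length());
}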
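As a worked example of the marker-based extraction, here is a small sketch run on a hand-written page fragment. The class name MarkerExtract and the sample markup are made up for illustration. The sample deliberately contains a "</tbody>" before the start marker, which is why this version searches for the end marker only after the start marker.

```java
// Hypothetical demo class, not part of the original post.
final class MarkerExtract {

    // Returns the text from beginstr through the first endstr that follows it,
    // or "" when either marker is missing.
    static String sub(String str, String beginstr, String endstr) {
        int b = str.indexOf(beginstr);
        if (b < 0) {
            return "";
        }
        int e = str.indexOf(endstr, b + beginstr.length());
        if (e < 0) {
            return "";
        }
        return str.substring(b, e + endstr.length());
    }

    public static void main(String[] args) {
        String page = "<table><tbody id=\"head\"></tbody>"
                + "<tbody id=\"cpdata\"><tr><td>2015001</td></tr></tbody></table>";
        // prints <tbody id="cpdata"><tr><td>2015001</td></tr></tbody>
        System.out.println(sub(page, "<tbody id=\"cpdata\">", "</tbody>"));
    }
}
```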

3. Replacing strings that violate XML rules

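For example, data-foldGroup=1 is an error in XML, because XML requires attribute values to be quoted. Here the attribute is simply replaced with a space:
str = str.replace("data-foldGroup=1", " ");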
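Removing each offending attribute by name works only for attributes known in advance. As an alternative, the following sketch quotes unquoted attribute values instead of deleting them. The class AttrQuoter, the method quoteAttributes, and the regular expression are assumptions for illustration and do not come from the original post. The pattern is deliberately simple: it can also rewrite name=value text outside tags or inside quoted values, so it is only suitable for simple markup such as this table.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original post.
final class AttrQuoter {

    // Whitespace + attribute name, "=", then a value with no quotes, spaces or angle brackets
    private static final Pattern UNQUOTED =
            Pattern.compile("(\\s[\\w-]+)=([^\\s\"'<>=]+)");

    // Wraps every unquoted attribute value in double quotes.
    static String quoteAttributes(String html) {
        return UNQUOTED.matcher(html).replaceAll("$1=\"$2\"");
    }

    public static void main(String[] args) {
        String row = "<tr data-foldGroup=1><td class=\"ball_red\" data-award=1>07</td></tr>";
        // prints <tr data-foldGroup="1"><td class="ball_red" data-award="1">07</td></tr>
        System.out.println(quoteAttributes(row));
    }
}
```

Attributes that are already quoted, such as class="ball_red", are left unchanged because the value part of the pattern cannot start with a quote.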


4. Finally, saving the result

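Save the string, which now conforms to XML rules, to the file at path (path is the full path of the file).
// Imports needed at the top of the file:
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;

// Writes str to the file at path, encoded as UTF-8.
public static void saveFile(String str, String path) {
    File file = new File(path);
    // try-with-resources closes the writer on every path and never touches a null writer
    try (PrintWriter pfp = new PrintWriter(file, "UTF-8")) {
        pfp.print(str);
    } catch (IOException e) {
        // Covers FileNotFoundException and UnsupportedEncodingException
        e.printStackTrace();
    }
}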
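For the saved file to be read back correctly, the bytes on disk should match the encoding named in the XML declaration. The sketch below writes the declaration and the body as UTF-8 with java.nio.file.Files and reads the file back. The class XmlSaver, the method save, and the temporary file are hypothetical and used only for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original post.
final class XmlSaver {

    static final String XML_DECLARATION = "<?xml version=\"1.0\" encoding=\"utf-8\"?>";

    // Writes the declaration, a line break, and the body to path as UTF-8 bytes.
    static void save(String body, Path path) {
        try {
            Files.write(path, (XML_DECLARATION + "\n" + body).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("hello", ".xml");
        save("<tbody id=\"cpdata\"></tbody>", tmp);
        // Read the file back with the same charset to confirm what was written
        System.out.println(new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8));
        Files.delete(tmp);
    }
}
```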

The complete call sequence


public static void main(String[] args) {
    String url = "http://trend.caipiao.163.com/ssq/?beginPeriod=2015001&endPeriod=2015118";
    String str = clientTest(url);
    // System.out.println(str);
    str = sub(str, "<tbody id=\"cpdata\">", "</tbody>");
    // Remove the unquoted attributes, which are not valid XML
    str = str.replace("data-foldGroup=1", " ");
    str = str.replace("data-foldColor=ball_red", " ");
    str = str.replace("data-award=1", " ");
    str = str.replace("data-foldColor=ball_blue", " ");
    // Prepend the XML declaration so that the saved file carries it as well
    str = "<?xml version=\"1.0\" encoding=\"utf-8\"?>" + str;
    System.out.println(str);
    String path = "D:/java/hello.xml";
    saveFile(str, path);
}
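Before saving, it can be useful to confirm that the cleaned string really is well-formed XML. The sketch below does this with the JDK's built-in DOM parser. The class XmlCheck and the method isWellFormed are hypothetical names, and the sample rows are hand-written rather than taken from the real page.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of the original post.
final class XmlCheck {

    // Returns true when the string parses as a well-formed XML document.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Unquoted attribute value: rejected by the parser
        System.out.println(isWellFormed("<tr data-foldGroup=1></tr>"));
        // Quoted attribute value: accepted
        System.out.println(isWellFormed("<tr data-foldGroup=\"1\"></tr>"));
    }
}
```

On a malformed input the default JDK parser may also print a fatal error message to standard error; the method still returns false.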