Fetching NetEase Blog Data via POST (study notes on web scraping and simulated login)

Source: Internet. Editor: 程序博客网. Time: 2024/05/18 01:22

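The blog http://www.crifan.com/ has a category named "Category Archives: Crawl_emulatelogin":

http://www.crifan.com/category/work_and_job/web/crawl_emulatelogin/

It collects a lot of study material on web page parsing, scraping, and simulated login, and it also provides a blog migration tool, BlogsToWordPress. The tool is powerful, but because it does so much it takes a good deal of time to get working. I mainly used its feature for downloading NetEase blog data. For details, search for the relevant posts by title.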

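The post directory of a NetEase blog (http://blog.163.com) is loaded dynamically. Take the post directory of Xiao Ying (肖鹰) of Tsinghua University as an example:

http://xying1962.blog.163.com/blog/ (usually displayed with a trailing "#m=0": http://xying1962.blog.163.com/blog/#m=0)

As shown in the figure:

A single HttpClient request to "http://xying1962.blog.163.com/blog/" does not return the blog data (the area in the red box in the figure). A separate POST request is needed, to

"http://api.blog.163.com/xying1962/dwr/call/plaincall/BlogBeanNew.getBlogs.dwr"

The following post analyzes how to POST for NetEase's ".dwr" data: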

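【教程】以抓取网易博客帖子中的最近读者信息为例，手把手教你如何抓取动态网页中的内容 ([Tutorial] A step-by-step guide to scraping dynamic page content, using the recent-readers list of a NetEase blog post as the example)

That post analyzes how to scrape the reader information of a NetEase blog, which means requesting VisitBeanNew.getBlogReaders.dwr. To scrape the post directory, request BlogBeanNew.getBlogs.dwr instead. Both are POST requests; the principle is the same and the settings are nearly identical.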
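Since both calls share one request skeleton, it can be sketched as a small helper. This is only an illustration, not code from BlogsToWordPress: the parameter names are the ones used by the Java code later in this post, while the class name `DwrParams` and the method `forCall` are made up here. It only assembles the parameters and sends nothing. The arguments of VisitBeanNew.getBlogReaders are not listed in this post, so the example uses getBlogs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DwrParams {
    /**
     * Builds the ordered parameter map for a single DWR call.
     * typedArgs are already type-prefixed values such as "number:0".
     */
    static Map<String, String> forCall(String scriptName, String methodName, String... typedArgs) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("callCount", "1");
        params.put("scriptSessionId", "${scriptSessionId}187");
        params.put("c0-scriptName", scriptName);
        params.put("c0-methodName", methodName);
        params.put("c0-id", "0");
        for (int i = 0; i < typedArgs.length; i++) {
            params.put("c0-param" + i, typedArgs[i]);
        }
        params.put("batchId", "1");
        return params;
    }

    public static void main(String[] args) {
        // userId, start index and count for BlogBeanNew.getBlogs
        Map<String, String> p = forCall("BlogBeanNew", "getBlogs",
                "number:138445490", "number:0", "number:20");
        p.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Each entry of the map would become one name-value pair of the POST body.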

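With the analysis covered, on to the code. If you are interested, read the full Python source of the BlogsToWordPress tool. If you only want the POST code, see this post:

【记录】用Python解析网易163博客的心情随笔FeelingCard返回的DWR-REPLY数据 ([Record] Using Python to parse the DWR-REPLY data returned for the FeelingCard mood notes of a NetEase 163 blog)

That one is still rather long-winded. For a more concise version, see:

【记录】给BlogsToWordPress添加支持导出网易的心情随笔 ([Record] Adding support to BlogsToWordPress for exporting NetEase mood notes)

Together, these three posts explain how to set up and send the POST request for NetEase blog data. Their code is written in Python. Below is the complete Java code I wrote, based on them, to request a user's blog data.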

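First, note that the directory data of a NetEase blog is loaded dynamically and requires a POST request to .dwr, but the post content itself is static and can be fetched with a plain GET request to its URL. For example, one of Xiao Ying's posts:

肖鹰：晚明文人为何发狂？ (Xiao Ying: Why Did the Late-Ming Literati Go Mad?)

Its address is:

http://xying1962.blog.163.com/blog/static/138445490201310207320529/

My goal is to get the content of this post, which takes a single GET request to its address. The address is regular: once the trailing digit string "138445490201310207320529" has been parsed out, the full address can be assembled. The format is:

http://[userName].blog.163.com/blog/static/[blogId]

Xiao Ying's username, "xying1962", can be read from the entry address "http://xying1962.blog.163.com/blog/". The blogId can only be obtained by parsing the directory data, which is why the POST request to .dwr is needed.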
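As a small illustration of that address format, the sketch below assembles a post URL from a username and a blogId, and recovers the blogId from a post URL. The class and method names (`BlogPostUrl`, `build`, `blogIdOf`) are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BlogPostUrl {
    // Trailing digit string of a post address, with or without a final slash.
    private static final Pattern BLOG_ID = Pattern.compile("/blog/static/([0-9]+)/?$");

    /** Assembles http://[userName].blog.163.com/blog/static/[blogId]/ */
    static String build(String userName, String blogId) {
        return "http://" + userName + ".blog.163.com/blog/static/" + blogId + "/";
    }

    /** Returns the blogId of a post address, or null if the address has another shape. */
    static String blogIdOf(String postUrl) {
        Matcher m = BLOG_ID.matcher(postUrl);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String url = build("xying1962", "138445490201310207320529");
        System.out.println(url);
        System.out.println(blogIdOf(url));
    }
}
```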

One more note on NetEase blog addresses. The blog directory address comes in two formats:

1. http://[username].blog.163.com/blog/

2. http://blog.163.com/[username]/blog/
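A sketch of how the username could be pulled out of either directory format, including the optional "#m=0" suffix. The class name `BlogUserName` and the two patterns are assumptions written for this example; they handle only the two formats listed above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BlogUserName {
    // Format 1: http://[username].blog.163.com/blog/
    private static final Pattern SUBDOMAIN =
            Pattern.compile("^http://([^./]+)\\.blog\\.163\\.com/blog/?(#.*)?$");
    // Format 2: http://blog.163.com/[username]/blog/
    private static final Pattern PATH =
            Pattern.compile("^http://blog\\.163\\.com/([^/]+)/blog/?(#.*)?$");

    /** Returns the username, or null if the URL matches neither format. */
    static String fromDirectoryUrl(String url) {
        Matcher m = SUBDOMAIN.matcher(url);
        if (m.matches()) {
            return m.group(1);
        }
        m = PATH.matcher(url);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(fromDirectoryUrl("http://xying1962.blog.163.com/blog/#m=0"));
        System.out.println(fromDirectoryUrl("http://blog.163.com/xying1962/blog/"));
    }
}
```

Returning null for an unrecognized address makes it easy for the caller to stop early instead of sending a request with a wrong username.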

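Before giving the Java code, I have to say that Google's Chrome browser is a great product. Its request monitoring is excellent and a real help for web page analysis; personally I find it easier to use than Wireshark. Usage is as follows:

1. Right-click anywhere on the page and choose the last item, "Inspect Element" (I believe the Chinese version calls it "审查元素"), as shown:

When the "Inspect Element" panel appears, click "Network" (probably "网络" in the Chinese version) and refresh the page to see the requests being monitored, as shown below:

You can see each HTTP request's name (Name), method (Method), status (Status), and response type (Type). Click an entry in the leftmost Name column to see its details, for example "blog/", as shown:

You can view the Headers, the returned result under "Response", and the Cookies. Simulated login sometimes needs the Cookies, but in many cases Headers and Response are enough. To clear the current records and start over, click the "Clear" button at the bottom (circled in red in the figure). If you have studied computer networking and done packet capture analysis, a quick look will make the rest clear. If not, it is worth spending some time to learn.

The following shows how to set up the POST request in Java. First, the Java code, laid out like the original Python: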

public Set<String> post163Blog(String username, String userId, int startIndex, int returnNumber) {
    /* entityBody holds the response body as a string. */
    String entityBody = null;

    /*
     * Create the HttpPost and set the dwr address. username is the blogger's
     * user name, e.g. "xying1962" for Xiao Ying.
     */
    HttpPost httppost = new HttpPost("http://api.blog.163.com/" + username
            + "/dwr/call/plaincall/BlogBeanNew.getBlogs.dwr");

    /*
     * Request parameters. All are fixed except c0-param0, c0-param1 and c0-param2.
     *   c0-param0: the blogger's userId, e.g. "138445490" for Xiao Ying
     *   c0-param1: index of the first post to return, starting at 0
     *   c0-param2: number of posts to return in one request. The maximum seems to
     *              be 500. I did not test every value, but 600 and above return no
     *              data, so I normally use 500.
     * If a blogger has written more than 500 posts, send several requests with
     * suitable c0-param1 and c0-param2 values.
     */
    List<NameValuePair> nvp = new ArrayList<NameValuePair>();
    nvp.add(new BasicNameValuePair("callCount", "1"));
    nvp.add(new BasicNameValuePair("scriptSessionId", "${scriptSessionId}187"));
    nvp.add(new BasicNameValuePair("c0-scriptName", "BlogBeanNew"));
    nvp.add(new BasicNameValuePair("c0-methodName", "getBlogs"));
    nvp.add(new BasicNameValuePair("c0-id", "0"));
    nvp.add(new BasicNameValuePair("c0-param0", "number:" + userId));
    nvp.add(new BasicNameValuePair("c0-param1", "number:" + startIndex));
    nvp.add(new BasicNameValuePair("c0-param2", "number:" + (returnNumber <= 500 ? returnNumber : 500)));
    nvp.add(new BasicNameValuePair("batchId", "1"));

    try {
        httppost.setEntity(new UrlEncodedFormEntity(nvp, "UTF8"));
        httppost.addHeader("Referer", "http://api.blog.163.com/crossdomain.html?t=20100205");
        httppost.addHeader("Content-Type", "text/plain");
        // httppost.addHeader("User-Agent", "Mozilla/5.0 Firefox/3.5.9 Chrome/26.0.1410.64");
        HttpResponse response = httpclient.execute(httppost);
        HttpEntity entity = response.getEntity();
        if (entity != null) {
            /*
             * Convert the response to a string. The encoding chosen here does not
             * really matter: I only need the blogIds, and the non-ASCII text in the
             * response comes back as Unicode escapes that would need a further
             * decoding step, which I skipped because it is not needed.
             */
            entityBody = EntityUtils.toString(entity, "UTF8");
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        /*
         * The request is finished, so abort httppost to release its resources.
         * This must happen only after the response body has been read
         * (response.getEntity()), because aborting httppost also closes the
         * response and discards its content.
         */
        httppost.abort();
    }

    /*
     * blogIdSet stores the blogIds. In the POST response a blogId appears in three forms:
     *   1. permalink="blog/static/[blogId]"
     *   2. trackbackUrl="blog/[blogId].track"
     *   3. permaSerial="[blogId]"
     * In the third form the value right after permaSerial= is exactly the blogId,
     * so it yields a clean blogId with very little extra work. I did not try the
     * other two, but they should work as well. Print entityBody to see for yourself.
     * A HashSet saves the check for whether a blogId has already been seen, and
     * with it a few lines of code. Do not use an ArrayList, because each
     * permaSerial="[blogId]" appears twice. To extract more information, such as
     * titles, consider a HashMap<String, InfoStruct> (blogId -> data).
     */
    Set<String> blogIdSet = new HashSet<String>();

    /* If the request failed there is nothing to parse. */
    if (entityBody == null) {
        return blogIdSet;
    }

    /*
     * The regular expression to match. The "?" in [0-9]+? asks for the shortest
     * match. Since [0-9] can never match the closing quote, plain [0-9]+ gives the
     * same result here; the "?" is kept as a precaution.
     */
    Pattern pattern = Pattern.compile("permaSerial=\"[0-9]+?\"");

    /*
     * Split the response into lines and match each line. Matching the whole text
     * at once also works; splitting first is a personal habit that prevents a
     * match from spanning lines.
     */
    String[] sents = entityBody.split("(\n|\r\n)+");
    for (int i = 0; i < sents.length; i++) {
        Matcher matcher = pattern.matcher(sents[i]);
        while (matcher.find()) {
            blogIdSet.add(matcher.group().replaceAll("permaSerial=|\"", ""));
        }
    }
    return blogIdSet;
}
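The extraction step can be tried without any network access. The sketch below uses a capturing group in place of the replaceAll cleanup, which is a small variation on the code above rather than the author's method. The response fragment in `main` is made up for illustration and only mimics the permaSerial form described in the comments; the class name `BlogIdExtractor` and the second blogId are hypothetical as well.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BlogIdExtractor {
    // Group 1 captures the digits between the quotes.
    private static final Pattern PERMA_SERIAL = Pattern.compile("permaSerial=\"([0-9]+)\"");

    /** Returns the distinct blogIds in order of first appearance. */
    static Set<String> extract(String responseBody) {
        Set<String> ids = new LinkedHashSet<>();
        if (responseBody == null) {
            return ids;
        }
        Matcher m = PERMA_SERIAL.matcher(responseBody);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        // Made-up fragment: each permaSerial appears twice, as noted above.
        String sample = "s0.permaSerial=\"138445490201310207320529\";\n"
                + "s1.permaSerial=\"138445490201310207320529\";\n"
                + "s2.permaSerial=\"138445490000000000000001\";\n";
        System.out.println(extract(sample));
    }
}
```

A LinkedHashSet removes the duplicates just like the HashSet above and additionally keeps the order in which the ids appeared.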

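Once the blogIds are available, you can assemble the post addresses and request the post content. [As an aside: for this write-up I reworked the code comments and added many new ones.]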
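The code comments above mention that non-ASCII text in the POST response arrives as Unicode escapes, which does not matter as long as only blogIds are needed. If titles or other text were wanted, a decoding step along the following lines could be added. It assumes the escapes are the JavaScript-style backslash-u form followed by four hex digits, which this post does not confirm; the class name `UnicodeEscapes` is invented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeEscapes {
    // A literal backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    /** Replaces every four-digit escape with the character it encodes. */
    static String decode(String text) {
        Matcher m = ESCAPE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps "$" and backslashes in the result literal.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints the text with both escapes turned into characters.
        System.out.println(decode("title=\\u0041\\u0042"));
    }
}
```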

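Of the parameters of post163Blog(String username, String userId, int startIndex, int returnNumber), startIndex and returnNumber can be set as needed, while username and userId must be passed in. Given only a blog entry address, however, we can read the username from it but not the userId, so the userId has to be extracted separately.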
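One way to choose startIndex and returnNumber for a blog of unknown size is to request page after page until a short page comes back. The sketch below shows that loop against a stand-in page source so it can run offline; the `Paging` class and the `fetchPage` callback are assumptions for illustration, and in the real crawler the callback would wrap post163Blog. The cap of 500 is the limit the author observed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

final class Paging {
    static final int MAX_PAGE_SIZE = 500;

    /**
     * Calls fetchPage(startIndex, count) repeatedly, starting at index 0,
     * and stops at the first page that holds fewer than count items.
     */
    static List<String> fetchAll(BiFunction<Integer, Integer, List<String>> fetchPage, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        int count = Math.min(pageSize, MAX_PAGE_SIZE);
        List<String> all = new ArrayList<>();
        int start = 0;
        while (true) {
            List<String> page = fetchPage.apply(start, count);
            all.addAll(page);
            if (page.size() < count) {
                break;
            }
            start += count;
        }
        return all;
    }

    public static void main(String[] args) {
        // Stand-in data source with 1234 fake ids.
        List<String> fake = new ArrayList<>();
        for (int i = 0; i < 1234; i++) {
            fake.add("id" + i);
        }
        List<String> got = fetchAll(
                (start, count) -> fake.subList(Math.min(start, fake.size()),
                        Math.min(start + count, fake.size())),
                500);
        System.out.println(got.size());
    }
}
```

The first request must start at index 0; advancing the index before the first call would silently skip the first page.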

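The userId can be found in the result of a GET request to the blog entry address. In the Xiao Ying example, the result of a GET request to "http://xying1962.blog.163.com/blog/" contains "userId:138445490", as shown in the figure below (visible under Response in Chrome, the page analysis tool described above):

The userId sits inside a <script>...</script> element. It can be extracted by parsing with HtmlCleaner or directly with a regular expression on the string, in the same way the post163Blog function above extracts the blogId. The regular expression is:

Pattern pattern = Pattern.compile("userId:[0-9]+");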
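A sketch of the pure regular-expression route, using a capturing group to return only the digits. The HTML fragment in `main` is made up for illustration and merely contains the "userId:138445490" text quoted above; the class name `UserIdExtractor` is invented. It returns the first match, so it relies on the blogger's userId being the first one in the page, which this post does not verify.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UserIdExtractor {
    // Group 1 captures the digits after "userId:".
    private static final Pattern USER_ID = Pattern.compile("userId:([0-9]+)");

    /** Returns the first userId found in the page text, or null if there is none. */
    static String extract(String html) {
        if (html == null) {
            return null;
        }
        Matcher m = USER_ID.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up page fragment for illustration only.
        String sample = "<script>window.N = {\n  userId:138445490\n ,userName:'xying1962'\n};</script>";
        System.out.println(extract(sample));
    }
}
```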

For reference, here is the code that sends a GET request to the blog directory address and parses the result to obtain the userId.

/**
 * Get the html text through a GET request, the default encoding is "UTF8"
 */
public String getText(String inputUrl) {
    return getText(inputUrl, "UTF8");
}

public String getText(String inputUrl, String encoding) {
    /* Create a new HttpGet and add a header. */
    HttpGet httpget = new HttpGet();
    httpget.addHeader("User-Agent", "Mozilla/5.0 Firefox/3.5.9 Chrome/26.0.1410.64");
    String entityBody = null;
    try {
        /* Set the address of the page to request. */
        httpget.setURI(new URI(inputUrl));
        HttpResponse response = httpclient.execute(httpget);
        /* Read the response and convert it to a string. */
        HttpEntity entity = response.getEntity();
        if (entity != null) {
            entityBody = EntityUtils.toString(entity, encoding);
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        /* Abort httpget to release resources, also when an exception occurred. */
        httpget.abort();
    }
    /*
     * Return the response body. It is usually the source of an html page, but it
     * may be something else, depending on what the server sends.
     */
    return entityBody;
}

/**
 * Parse the result of a GET request to the blog directory and extract the
 * blogger's userId, which uniquely identifies the blogger.
 * The userId is hidden in script code. This uses the HtmlCleaner library.
 * The checks here are overly cautious, because I did not analyze in detail
 * whether the response also contains other people's userIds, but they make
 * sure that the extracted userId is the blogger's.
 */
public String parseReturnHtml(String htmlText) {
    if (htmlText == null)
        return null;
    TagNode rootNode = htmlcleaner.clean(htmlText);
    try {
        /*
         * Collect the <script>...</script> elements and walk them from last to
         * first, because the userId appears in a script near the bottom of the page.
         */
        Object[] scriptNodes = rootNode.evaluateXPath("//script");
        for (int i = scriptNodes.length - 1; i >= 0; i--) {
            TagNode scriptNode = (TagNode) scriptNodes[i];
            String text = scriptNode.getText().toString().trim();
            if (!text.startsWith("window.N"))
                continue;
            if (!text.contains("userId"))
                continue;
            /* Split into lines. */
            String[] sents = text.split("\n|\r\n");
            for (int j = sents.length - 1; j >= 0; j--) {
                if (!sents[j].contains("userId"))
                    continue;
                sents[j] = sents[j].trim();
                String[] items = sents[j].split(":");
                if (items.length != 2)
                    return null;
                /* The userId is a string of digits. */
                String userId = items[1].trim();
                return userId;
            }
            break;
        }
    } catch (XPatherException e) {
        e.printStackTrace();
    }
    return null;
}

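Here getText sends a GET request to a page and returns the result; it is general-purpose. parseReturnHtml only parses the result of a GET request to a NetEase blog directory.

That is the key code for fetching NetEase blog data.

The complete runnable code follows. It needs two jar packages:

htmlcleaner

httpclient

You will most likely also need the httpcore jar, since httpclient is built on it; if the two above are not enough, add it as well. [Note: httpclient and httpcore both belong to the Apache HttpComponents project.]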

import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.http.HeaderElement;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.NameValuePair;
import org.apache.http.client.HttpClient;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;

public class WangyiBlogCrawler {

    /** For http request and html cleaning and parsing */
    private HttpClient httpclient;
    private HtmlCleaner htmlcleaner;
    private int STARTINDEX;
    private int RETURNNUMBER;

    public WangyiBlogCrawler() {
        httpclient = new DefaultHttpClient();
        htmlcleaner = new HtmlCleaner();
        STARTINDEX = 0;
        RETURNNUMBER = 100;
    }

    public static void main(String[] args) {
        String contentUrl = "http://xying1962.blog.163.com/blog/";
        WangyiBlogCrawler wyBlogCrawler = new WangyiBlogCrawler();
        wyBlogCrawler.run(contentUrl);
    }

    public void run(String contentUrl) {
        String username = contentUrl.replaceAll("http://|.?blog.163.com/?|/?blog/|#m=0", "");
        String returnEntity = getText(contentUrl);
        String userId = parseReturnHtml(returnEntity);
        if (userId == null) {
            System.err.println("Could not find the userId in " + contentUrl);
            return;
        }
        int startIndex = STARTINDEX;
        int returnNumber = RETURNNUMBER;
        Set<String> blogIdSet = new HashSet<String>();
        Set<String> temIdSet = null;
        do {
            /* Request the current page, then advance the start index for the next one. */
            temIdSet = post163Blog(username, userId, startIndex, returnNumber);
            blogIdSet.addAll(temIdSet);
            startIndex += returnNumber;
        } while (temIdSet.size() == returnNumber);
        processBlogIdSet(contentUrl, blogIdSet);
    }

    public void processBlogIdSet(String contentUrl, Set<String> blogIdSet) {
        contentUrl = contentUrl.replaceAll("#m=0", "");
        for (Iterator<String> iter = blogIdSet.iterator(); iter.hasNext();) {
            String blogId = iter.next();
            /* Assemble the address of the post content. */
            String blogUrl = contentUrl + "static/" + blogId + "/";
            /* output the blog url */
            System.out.println(blogUrl);
            /*
             * output the blog entity
             * Uncomment the two lines below to request each post and print its
             * complete html text.
             */
            // String blogEntity = getText(blogUrl, "gbk");
            // System.out.println(blogEntity);
        }
    }

    /**
     * Parsing the entry html in order to extract the unique userId.
     * The unique userId is hidden in the script codes.
     */
    public String parseReturnHtml(String htmlText) {
        if (htmlText == null)
            return null;
        TagNode rootNode = htmlcleaner.clean(htmlText);
        try {
            Object[] scriptNodes = rootNode.evaluateXPath("//script");
            for (int i = scriptNodes.length - 1; i >= 0; i--) {
                TagNode scriptNode = (TagNode) scriptNodes[i];
                String text = scriptNode.getText().toString().trim();
                if (!text.startsWith("window.N"))
                    continue;
                if (!text.contains("userId"))
                    continue;
                String[] sents = text.split("\n|\r\n");
                for (int j = sents.length - 1; j >= 0; j--) {
                    if (!sents[j].contains("userId"))
                        continue;
                    sents[j] = sents[j].trim();
                    String[] items = sents[j].split(":");
                    if (items.length != 2)
                        return null;
                    /* the userId is a sequence of digits. */
                    String userId = items[1].trim();
                    return userId;
                }
                break;
            }
        } catch (XPatherException e) {
            e.printStackTrace();
        }
        return null;
    }

    public Set<String> post163Blog(String username, String userId, int startIndex, int returnNumber) {
        String entityBody = null;
        HttpPost httppost = new HttpPost("http://api.blog.163.com/" + username
                + "/dwr/call/plaincall/BlogBeanNew.getBlogs.dwr");
        List<NameValuePair> nvp = new ArrayList<NameValuePair>();
        nvp.add(new BasicNameValuePair("callCount", "1"));
        nvp.add(new BasicNameValuePair("scriptSessionId", "${scriptSessionId}187"));
        nvp.add(new BasicNameValuePair("c0-scriptName", "BlogBeanNew"));
        nvp.add(new BasicNameValuePair("c0-methodName", "getBlogs"));
        nvp.add(new BasicNameValuePair("c0-id", "0"));
        nvp.add(new BasicNameValuePair("c0-param0", "number:" + userId));
        nvp.add(new BasicNameValuePair("c0-param1", "number:" + startIndex));
        nvp.add(new BasicNameValuePair("c0-param2", "number:" + (returnNumber <= 500 ? returnNumber : 500)));
        nvp.add(new BasicNameValuePair("batchId", "1"));
        try {
            httppost.setEntity(new UrlEncodedFormEntity(nvp, "UTF8"));
            httppost.addHeader("Referer", "http://api.blog.163.com/crossdomain.html?t=20100205");
            httppost.addHeader("Content-Type", "text/plain");
            httppost.addHeader("User-Agent", "Mozilla/5.0 Firefox/3.5.9 Chrome/26.0.1410.64");
            HttpResponse response = httpclient.execute(httppost);
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                entityBody = EntityUtils.toString(entity, "UTF8");
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            httppost.abort();
        }
        Set<String> blogIdSet = new HashSet<String>();
        /* If the request failed there is nothing to parse. */
        if (entityBody == null) {
            return blogIdSet;
        }
        Pattern pattern = Pattern.compile("permaSerial=\"[0-9]+?\"");
        String[] sents = entityBody.split("(\n|\r\n)+");
        for (int i = 0; i < sents.length; i++) {
            Matcher matcher = pattern.matcher(sents[i]);
            while (matcher.find()) {
                blogIdSet.add(matcher.group().replaceAll("permaSerial=|\"", ""));
            }
        }
        return blogIdSet;
    }

    /** Get the html text through a GET request */
    public String getText(String inputUrl) {
        return getText(inputUrl, "UTF8");
    }

    public String getText(String inputUrl, String encoding) {
        HttpGet httpget = new HttpGet();
        httpget.addHeader("User-Agent", "Mozilla/5.0 Firefox/3.5.9 Chrome/26.0.1410.64");
        String entityBody = null;
        try {
            httpget.setURI(new URI(inputUrl));
            HttpResponse response = httpclient.execute(httpget);
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                /*
                 * To detect the charset automatically, uncomment the statements below.
                 * When fetching post content, EntityUtils.toString(entity, encoding)
                 * may produce garbled text if the encoding is wrong, so when the
                 * encoding is unknown it can be detected in two stages. First,
                 * getCharset reads the charset parameter from the Content-Type of
                 * the response. If that yields nothing, getMeta searches the page
                 * text, which covers both
                 *   <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
                 * and the shorter
                 *   <meta charset="utf-8">
                 * The second stage has to convert the response to a string first,
                 * and the response content apparently can be read only once, so
                 * another GET request is needed before the final decoding. That is
                 * expensive; this clumsy approach exists only to avoid garbled
                 * text. When scraping a single site, simply hard-code the encoding.
                 */
                // String charset = getCharset(entity);
                // if (charset == null) {
                //     entityBody = EntityUtils.toString(entity);
                //     charset = getMeta(entityBody);
                //     response = httpclient.execute(httpget);
                //     entity = response.getEntity();
                // }
                // if (charset != null)
                //     encoding = charset;
                entityBody = EntityUtils.toString(entity, encoding);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            httpget.abort();
        }
        return entityBody;
    }

    /** Search the page text for charset=... and return the charset name, or null. */
    public String getMeta(String htmlEntity) {
        String charset = null;
        if (htmlEntity == null)
            return charset;
        Pattern pattern = Pattern.compile("charset=\"?.*?\"");
        String[] lines = htmlEntity.split("(\n|\r\n)+");
        for (int i = 0; i < lines.length; i++) {
            Matcher matcher = pattern.matcher(lines[i]);
            if (matcher.find()) {
                String[] items = matcher.group().split("=");
                charset = items[1].replaceAll("\"", "");
                break;
            }
        }
        return charset;
    }

    /** Read the charset parameter from the Content-Type of the entity, or return null. */
    public String getCharset(HttpEntity entity) {
        String charset = null;
        if (entity == null)
            return charset;
        if (entity.getContentType() != null) {
            HeaderElement[] values = entity.getContentType().getElements();
            if (values != null && values.length > 0) {
                for (HeaderElement value : values) {
                    NameValuePair param = value.getParameterByName("charset");
                    if (param != null) {
                        charset = param.getValue();
                        break;
                    }
                }
            }
        }
        return charset;
    }
}
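The second GET request in the charset detection could probably be avoided by reading the response body into a byte array once and decoding it twice: first as ISO-8859-1, only to look for the charset declaration, then with the charset that was found. The sketch below shows the idea with the standard library alone; in the crawler the bytes could come from EntityUtils.toByteArray(entity). The class name `CharsetSniffer` and its regular expression are assumptions written for this example and were not tested against NetEase pages.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharsetSniffer {
    // Matches charset=... inside a <meta ...> tag, quoted or unquoted.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_\\-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset name declared in a meta tag, or null if none is found. */
    static String sniff(byte[] body) {
        // ISO-8859-1 maps every byte to one char, so the ASCII markup survives.
        String markup = new String(body, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(markup);
        return m.find() ? m.group(1) : null;
    }

    /** Decodes the body with the declared charset, or with fallback if unusable. */
    static String decode(byte[] body, String fallback) {
        String name = sniff(body);
        Charset charset;
        try {
            charset = Charset.forName(name != null ? name : fallback);
        } catch (IllegalArgumentException e) {
            // Unknown or malformed charset name in the page.
            charset = Charset.forName(fallback);
        }
        return new String(body, charset);
    }

    public static void main(String[] args) {
        String html = "<meta charset=\"utf-8\"><p>caf\u00e9</p>";
        byte[] body = html.getBytes(StandardCharsets.UTF_8);
        System.out.println(sniff(body));
        System.out.println(decode(body, "ISO-8859-1").equals(html));
    }
}
```

Keeping the raw bytes means the page is downloaded once, at the cost of holding the whole body in memory.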

0 0