Web Scraping


For my data mining course, I planned the data preparation step as follows: open the URLs listed in a configuration file and save the pages locally. The saved files would then go through content parsing, text extraction, conversion to a matrix, clustering, and so on.

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public static void main(String[] args) {
    final int THREAD_COUNT = 5;
    String baseUrl = null;
    String searchBlogs = null;
    String[] blogs = null;
    String fileDir = null;
    //String category = null;

    // Load the crawl settings from config.properties on the classpath.
    InputStream inputStream = CsdnBlogMining.class.getClassLoader().getResourceAsStream("config.properties");
    Properties p = new Properties();
    try {
        p.load(inputStream);
        baseUrl = p.getProperty("baseUrl");
        fileDir = p.getProperty("fileDir");
        searchBlogs = p.getProperty("searchBlogs");
        // The original compared with != "", which tests reference identity;
        // check the string's content instead.
        if (searchBlogs != null && !searchBlogs.isEmpty()) {
            blogs = searchBlogs.split(";");
        }

        // Download the pages concurrently with a fixed-size thread pool.
        ExecutorService pool = Executors.newFixedThreadPool(THREAD_COUNT);
        if (blogs != null) {
            for (String s : blogs) {
                pool.submit(new SaveWeb(baseUrl + s, fileDir + "/" + s + ".html"));
            }
        }
        pool.shutdown();

        //category = new String(p.getProperty("category").getBytes("ISO-8859-1"), "UTF-8");
    } catch (IOException e) {
        e.printStackTrace();
    }
}
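For reference, config.properties needs to supply the three keys the code reads (baseUrl, fileDir, searchBlogs). The values below are only illustrative placeholders, not taken from the original post:

baseUrl=http://blog.example.com/
fileDir=/tmp/blogs
searchBlogs=user1;user2;user3

One small caveat: pool.shutdown() only stops the pool from accepting new tasks; if the caller needs to block until all downloads have finished, an additional pool.awaitTermination(...) call would be required.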


The module that opens a web page and saves it:

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class SaveWeb implements Runnable {
    private String url;
    private String filename;

    public SaveWeb(String url, String filename) {
        this.url = url;
        this.filename = filename;
    }

    @Override
    public void run() {
        HttpClient httpclient = new DefaultHttpClient();
        HttpGet httpGet = new HttpGet(url);
        // Send a browser-like User-Agent so the request is less likely to be rejected.
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2");
        try {
            HttpResponse response = httpclient.execute(httpGet);
            HttpEntity entity = response.getEntity();
            BufferedOutputStream outputStream = new BufferedOutputStream(new FileOutputStream(filename));
            // Only write the body when the server answered 200 OK.
            if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                if (entity != null) {
                    String res = EntityUtils.toString(entity, "UTF-8");
                    outputStream.write(res.getBytes("UTF-8"));
                    outputStream.flush();
                }
            }
            outputStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
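DefaultHttpClient belongs to the older Apache HttpClient 4.x API and is deprecated in later 4.x releases, and the stream handling above leaves the file open if an exception is thrown before close(). As a minimal sketch (not the original author's code), assuming HttpClient 4.3+ and Java 7+, the same download could be written with try-with-resources so the client and response are always released; the class name SaveWebResource is just an illustrative placeholder:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.http.HttpEntity;
import org.apache.http.HttpStatus;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class SaveWebResource implements Runnable {
    private final String url;
    private final String filename;

    public SaveWebResource(String url, String filename) {
        this.url = url;
        this.filename = filename;
    }

    @Override
    public void run() {
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2");
        // try-with-resources closes the client and the response even on errors.
        try (CloseableHttpClient httpclient = HttpClients.createDefault();
             CloseableHttpResponse response = httpclient.execute(httpGet)) {
            if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                HttpEntity entity = response.getEntity();
                if (entity != null) {
                    String res = EntityUtils.toString(entity, "UTF-8");
                    Files.write(Paths.get(filename), res.getBytes(StandardCharsets.UTF_8));
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}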

Follow-up:

The assignment is finished, but it ended up having almost nothing to do with the code above. I was going to delete all of it, but on second thought it isn't wrong, just unused, so I'll keep it here.

In the end, the Java code was used only to loop over a list of addresses, fetch them concurrently, and save them to files, while the mining itself was done in R: fetching the pages, extracting the body text, word segmentation, clustering, and outputting the results. R really saves effort; a few dozen lines of code handled all of it. The final clustering results were disappointing, though. It seems that features computed over the full text are not distinctive enough, so the resulting clusters are quite inaccurate and the approach still needs improvement.
