Scraping data from web pages on Android

Source: Internet | Published by: 上海drs数据修复中心 | Editor: 程序博客网 | Time: 2024/06/05 02:58

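1. Fetching a page's text as a string is usually done in one of the following three ways

(1) Fetching text with HttpURLConnection

① Obtain an HttpURLConnection object, httpURLConnection, from the URL.

② Read the response code to check whether the request succeeded.

③ Wrap the byte stream returned by httpURLConnection.getInputStream() in an InputStreamReader (a character stream), is.

④ Read the text with the read() method of is.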

/**
 * HttpURLConnection
 */
new Thread(new Runnable() {
    @Override
    public void run() {
        HttpURLConnection httpURLConnection = null;
        try {
            URL url = new URL("http://lol.qq.com/web201310/info-heros.shtml");
            httpURLConnection = (HttpURLConnection) url.openConnection();
            int code = httpURLConnection.getResponseCode();
            if (code == HttpURLConnection.HTTP_OK) {
                // Decode the byte stream as UTF-8 while reading it
                InputStreamReader is = new InputStreamReader(httpURLConnection.getInputStream(), "utf-8");
                StringBuilder sb = new StringBuilder();
                int i;
                while ((i = is.read()) != -1) {
                    sb.append((char) i);
                }
                is.close();
                // Hand the text to the UI thread through the Handler
                Message msg = new Message();
                Bundle bundle = new Bundle();
                bundle.putString("stringUrl", sb.toString());
                msg.setData(bundle);
                msg.what = 0x123;
                myHandler.sendMessage(msg);
            } else {
                Log.d("TAG", "httpUrlConnection : " + code);
            }
        } catch (IOException e) {
            // Also covers MalformedURLException, which is a subclass of IOException
            e.printStackTrace();
        } finally {
            if (httpURLConnection != null) {
                httpURLConnection.disconnect();
            }
        }
    }
}).start();

Result screenshot: (image not included)
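Steps ③ and ④ read the response one character at a time. As a rough sketch of an alternative, the helper below reads the stream in blocks through a BufferedReader and takes the charset as an explicit argument. The class name StreamText and the method readAll are made up for this illustration and are not part of the original code; on Android it assumes API 19 or later for StandardCharsets and try-with-resources.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original post.
class StreamText {

    /** Reads the whole stream into a String, decoding it with the given charset. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        // try-with-resources closes the reader (and the wrapped stream) when done
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n); // append only the chars actually read
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for httpURLConnection.getInputStream()
        byte[] page = "<ul><li>hero</li></ul>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(page), StandardCharsets.UTF_8));
    }
}
```

In the thread above, the read loop could then be replaced by a call such as readAll(httpURLConnection.getInputStream(), StandardCharsets.UTF_8), provided the page really is served as UTF-8.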

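(2) Fetching data with HttpClient

Note that the Apache HttpClient classes used here were removed from the Android SDK in Android 6.0 (API 23); on newer SDKs they are only available through the org.apache.http.legacy library.

① Create an HttpClient object, client.

② Create an HttpGet request object, get, from the URL.

③ Create a ResponseHandler (response handler) object that returns a String.

④ Call client.execute(get, responseHandler) to get the text as a string.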

/**
 * HttpClient
 */
new Thread(new Runnable() {
    @Override
    public void run() {
        try {
            HttpClient client = new DefaultHttpClient();
            HttpGet get = new HttpGet("http://lol.qq.com/web201310/info-heros.shtml");
            // BasicResponseHandler returns the response body as a String
            ResponseHandler<String> responseHandler = new BasicResponseHandler();
            String content = client.execute(get, responseHandler);
            if (content == null || content.isEmpty()) {
                // A Toast has to be shown from the UI thread, not from this worker thread
                DataActivity.this.runOnUiThread(new Runnable() {
                    @Override
                    public void run() {
                        Toast.makeText(DataActivity.this, "null", Toast.LENGTH_SHORT).show();
                    }
                });
            }
            // Hand the text to the UI thread through the Handler
            Message msg = new Message();
            Bundle bundle = new Bundle();
            bundle.putString("stringUrl", content);
            msg.setData(bundle);
            msg.what = 0x123;
            myHandler.sendMessage(msg);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}).start();

Result screenshot: (image not included)

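(3) Fetching data with jsoup. The data is loaded asynchronously here. Besides fetching raw text, jsoup is commonly used to extract specific pieces of data. In the example below, the strings to extract are: 所有英雄 (all heroes), 战士 (fighter), 法师 (mage), 刺客 (assassin), 坦克 (tank), 射手 (marksman), 辅助 (support).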

import android.os.AsyncTask;
import android.util.Log;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class LoadHtml extends AsyncTask<String, String, String> {
    String url;
    CallBack callBack; // callback interface
    private List<String> mListTitle = new ArrayList<>();

    public LoadHtml(CallBack callBack, String url) {
        this.url = url;
        this.callBack = callBack;
    }

    @Override
    protected String doInBackground(String... params) {
        try {
            // doc.toString() would give the HTML text of the page at this url
            Document doc = Jsoup.connect(url).timeout(5000).post();
            // The titles sit in the <ul> whose id is "seleteChecklist"; select its <li> items
            Elements elements = doc.select("#seleteChecklist li");
            if (elements.isEmpty()) {
                Log.d("TAG", "elements is empty");
            }
            for (Element item : elements) {
                String title = item.getElementsByTag("label").text();
                mListTitle.add(title); // builds the list of strings (所有英雄, 战士, ...)
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }

    @Override
    protected void onPostExecute(String s) {
        super.onPostExecute(s);
        Log.d("TAG", "onPostExecute");
        Log.d("TAG", "listSize : " + mListTitle.size());
        for (String title : mListTitle) {
            Log.d("TAG", "title : " + title);
        }
        if (!mListTitle.isEmpty()) {
            callBack.solve(mListTitle); // invoke the callback once the data list has been obtained
        }
    }
}

Result screenshot: (image not included)

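solve() is a method of the custom CallBack interface. A class that needs the data (class A) only has to implement this interface and override the method. The callBack argument passed to the constructor of LoadHtml (class B) is the instance of A, which implements CallBack. Once the data has been fetched, calling solve() on that instance hands the data back to the class that needs it.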
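The post does not show the CallBack interface itself, so the following is only a sketch of how the pieces might fit together. The solve(List<String>) signature is inferred from the call callBack.solve(mListTitle) in LoadHtml. TitleLoader and TitleScreen are hypothetical stand-ins for class B and class A, and the hard-coded list replaces the network request and jsoup parsing so that the example runs as plain Java.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Callback contract; the signature is assumed from how solve() is called in LoadHtml. */
interface CallBack {
    void solve(List<String> titles);
}

/** Hypothetical stand-in for LoadHtml (class B): passes its result to the supplied callback. */
class TitleLoader {
    private final CallBack callBack;

    TitleLoader(CallBack callBack) {
        this.callBack = callBack;
    }

    void load() {
        // Fixed sample data in place of the real page fetch and parsing
        List<String> titles = Arrays.asList("Fighter", "Mage", "Assassin");
        callBack.solve(titles);
    }
}

/** Hypothetical stand-in for class A: implements CallBack to receive the data. */
class TitleScreen implements CallBack {
    final List<String> received = new ArrayList<>();

    @Override
    public void solve(List<String> titles) {
        // Called by the loader once the data is ready
        received.addAll(titles);
    }

    public static void main(String[] args) {
        TitleScreen screen = new TitleScreen();
        new TitleLoader(screen).load();
        System.out.println(screen.received); // prints [Fighter, Mage, Assassin]
    }
}
```

In the Android version, class A would typically be the Activity, which passes this to new LoadHtml(this, url) and updates its views inside solve(). Since onPostExecute runs on the UI thread, no extra thread switching should be needed there.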

0 0