Developing a Simple Web Crawler (1) (C# Edition)

Source: Internet  Editor: 程序博客网  Time: 2024/05/16 12:10

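I'd like to share a simple web crawler I wrote. The program has two parts: the spider assembly and the spiderserver Windows service. The spider assembly encapsulates the crawler's thread management and the fetching of page HTML. Let's start with the class diagram: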

 

Below I go through these classes one by one:

The HttpServer class

This class has a single method, public string GetResponse(string url), which fetches the HTML of the page at the given url. Implementing it means solving the following problems:

1. How do we fetch the HTML for a given url?

This part is actually simple. In C# it can be done with the HttpWebRequest and HttpWebResponse classes, like this:

 HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

HttpWebResponse response = (HttpWebResponse)request.GetResponse();

Stream reader = response.GetResponseStream();

Then it is just a matter of reading the content from the reader stream.
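As a minimal sketch of that last step, the helper below decodes a whole stream into a string. The names StreamText and ReadText are hypothetical and not part of the original program, and the caller has to supply the right encoding, which is exactly the problem discussed next.

```csharp
using System.IO;
using System.Text;

static class StreamText
{
    // Hypothetical helper (not from the original program):
    // reads the stream to its end and decodes it with the given encoding.
    public static string ReadText(Stream stream, Encoding encoding)
    {
        using (var reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With the response stream from the snippet above, a call would look like StreamText.ReadText(reader, Encoding.UTF8), assuming the page really is UTF-8.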

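2. Encoding. Web pages are usually encoded as utf-8 or gb2312. If the program decodes the stream with the wrong encoding, Chinese characters come out garbled.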

To see how to solve this, let's look at a short piece of HTML from a real page:

<html><head><meta http-equiv=Content-Type content="text/html;charset=gb2312"><title>百度一下,你就知道   

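The <meta> tag declares the page's encoding: charset=gb2312. So the program has to find the charset value first and then decode the content with that encoding. To keep this simple, I always decode the bytes as "gb2312" first and then parse the charset value out of the resulting HTML. If the value is also "gb2312", the HTML is returned as is. If it is some other encoding, the bytes already buffered in memory are decoded again with that encoding.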
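The charset lookup described above can be sketched on its own. CharsetSniffer and DetectCharset are hypothetical names chosen for illustration, and the pattern is one reasonable choice, not necessarily the one the author used. It accepts both the content="text/html;charset=gb2312" form and the shorter charset="utf-8" form.

```csharp
using System.Text.RegularExpressions;

static class CharsetSniffer
{
    // Hypothetical helper (not from the original program):
    // returns the declared charset in lower case, or null if none is declared.
    public static string DetectCharset(string html)
    {
        Match m = Regex.Match(
            html,
            @"charset\s*=\s*[""']?(?<charset>[\w-]+)",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups["charset"].Value.ToLowerInvariant() : null;
    }
}
```

One thing to keep in mind: on .NET Core and later, Encoding.GetEncoding("gb2312") only works after Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) has been called. On the .NET Framework it is available out of the box.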

3. The HTML of some pages can be very large, so we need to limit the size. The program stops reading once about 100 KB has been read.
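A size-capped read could look like the sketch below. CappedReader and ReadAtMost are hypothetical names, not part of the original program. Unlike a loop that only checks the limit before each 1024-byte read, which can overshoot by up to one buffer, this version clamps every read to the remaining budget, so the result never exceeds maxBytes.

```csharp
using System;
using System.IO;

static class CappedReader
{
    // Hypothetical helper (not from the original program):
    // reads at most maxBytes from the stream and returns them.
    public static byte[] ReadAtMost(Stream stream, int maxBytes)
    {
        byte[] buffer = new byte[1024];
        using (var memory = new MemoryStream())
        {
            while (memory.Length < maxBytes)
            {
                // Never request more than the remaining budget
                int want = (int)Math.Min(buffer.Length, maxBytes - memory.Length);
                int read = stream.Read(buffer, 0, want);
                if (read == 0)
                {
                    break; // end of stream
                }
                memory.Write(buffer, 0, read);
            }
            return memory.ToArray();
        }
    }
}
```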

The complete code of the class is as follows:

using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

/// <summary>
/// HTTP service class
/// </summary>
internal class HttpServer
{
    /// <summary>
    /// Gets the HTML text of the specified page
    /// </summary>
    /// <param name="url">Page url</param>
    /// <returns>The page HTML, or an empty string if anything fails</returns>
    public string GetResponse(string url)
    {
        try
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            request.Method = "GET";
            byte[] buffer = new byte[1024];
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            using (Stream reader = response.GetResponseStream())
            using (MemoryStream memory = new MemoryStream())
            {
                int count = 1;
                int sum = 0;

                // Stop reading once about 100 KB has been buffered
                while (count > 0 && sum < 100 * 1024)
                {
                    count = reader.Read(buffer, 0, buffer.Length);
                    if (count > 0)
                    {
                        memory.Write(buffer, 0, count);
                        sum += count;
                    }
                }

                byte[] bytes = memory.ToArray();

                // First pass: decode as gb2312 so the <meta> tag can be inspected
                string html = Encoding.GetEncoding("gb2312").GetString(bytes);
                if (string.IsNullOrEmpty(html))
                {
                    return html;
                }

                // Parse the charset value, e.g. charset=gb2312" or charset="utf-8"
                Regex re = new Regex(@"charset\s*=\s*[""']?(?<charset>[\w-]+)", RegexOptions.IgnoreCase);
                string encoding = re.Match(html).Groups["charset"].Value;

                if (string.IsNullOrEmpty(encoding) ||
                    string.Equals(encoding, "gb2312", StringComparison.OrdinalIgnoreCase))
                {
                    return html;
                }

                // Not gb2312: decode the buffered bytes again with the declared charset
                return Encoding.GetEncoding(encoding).GetString(bytes);
            }
        }
        catch
        {
            // Any failure (network error, unknown charset name) yields an empty string
            return "";
        }
    }
}

Because this class should not be visible outside the assembly, it is declared internal.

To be continued...
