ASP.NET: Dynamically Scraping Website Data (Method 1)

Source: Internet  Published by: 软件售前工程师  Editor: 程序博客网  Time: 2024/04/30 13:45

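Method 1: Scraping a website essentially comes down to regular-expression matching: fetch the page, use regular expressions to pick out the information you want, then write it to the database. Demo:

    using System;
    using System.IO;
    using System.Net;
    using System.Text;
    using System.Text.RegularExpressions;

    public static string GetNewsUrl(string strUrl)
    {
        string str = "";
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(strUrl);
        request.Timeout = 30000; // milliseconds
        request.AllowAutoRedirect = false;
        request.KeepAlive = false;
        request.ProtocolVersion = HttpVersion.Version11;
        request.Headers["Accept-Language"] = "zh-cn";
        request.UserAgent = "mozilla/4.0 (compatible; msie 6.0; windows nt 5.1; sv1; .net clr 1.0.3705; .net clr 2.0.50727; .net clr 1.1.4322)";
        try
        {
            // The using blocks dispose the response, stream, and reader even if reading throws.
            using (WebResponse response = request.GetResponse())
            using (Stream stream = response.GetResponseStream())
            using (StreamReader sr = new StreamReader(stream, Encoding.GetEncoding("UTF-8")))
            //using (StreamReader sr = new StreamReader(stream, Encoding.GetEncoding("gb2312")))
            {
                str = sr.ReadToEnd();
            }
        }
        catch (Exception e)
        {
            // On failure the exception text is returned in place of the page content.
            // The replacement string is garbled in the source; "<br/>" is an assumption.
            str = e.ToString().Replace("\n", "<br/>");
        }
        return str;
    }

The above is a single method. Pass it the page address when calling it, e.g.:

    string strUrl = "the URL of the site you want to scrape";

The next method uses a regular expression to match tables (table, tr, td, and so on). The HTML tag names in its comments and patterns are missing from the source, so those gaps are marked and left as they are. DelSpan and GetValues are helper methods that are not shown here.

    private string GetValue(string str)
    {
        // Remove the [tag names missing] tags. Note: delete from the inside out. First remove
        // the tags closest to the data to be extracted, then the corresponding outer tags.
        string s = DelSpan(str, 1);
        //string parten = @"]*)>[^>]*"; // find the text that starts with [tag missing] and ends with [tag missing]
        // The tag names are missing from this pattern, so it must be restored before it will compile as a regex.
        string parten = @"]*>(?:(?:\s|\S)*?(?=)(?(]*>(?:\s|\S)*?(?:|(?:(?:]*>(?:\s|\S)*?(?:\s|\S)*?)*?))(?:\s|\S)*?|))*";
        Regex reg = new Regex(parten, RegexOptions.IgnoreCase | RegexOptions.Compiled);
        MatchCollection mc = reg.Matches(s); // collect the text that starts with [tag missing] and ends with [tag missing]
        s = "";
        foreach (Match m in mc)
        {
            s += m.Value.Replace(" ", "").Trim() + "|";
        }
        s = DelSpan(s, 2).Trim(); // remove everything between [tags missing]
        // Remove all [tag missing] strings. The first argument is missing in the source;
        // as written, an empty first argument makes String.Replace throw ArgumentException.
        s = s.Replace("", "");
        if (s.IndexOf('|') == 0)
            s = s.TrimStart('|');
        return GetValues(s);
    }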

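Because the original table-matching pattern cannot be recovered from the text above, here is a minimal sketch of the same idea: pull the text of every table cell out of the fetched HTML and join the values with "|", as GetValue does. The class name TableCellSketch, the method ExtractCells, and both patterns are assumptions made for illustration; they are not the author's code, and the lazy match does not handle nested tables.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original article.
public static class TableCellSketch
{
    // Assumed pattern: captures the inner HTML of each <td ...>...</td>.
    // Singleline lets '.' match newlines; the lazy '.*?' stops at the first
    // closing tag, so a table nested inside a cell is not handled.
    private static readonly Regex CellPattern = new Regex(
        @"<td\b[^>]*>(?<cell>.*?)</td\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches any tag left inside a cell, such as <span> or <a href="...">.
    private static readonly Regex TagPattern = new Regex(@"<[^>]+>");

    // Returns the cell texts joined with '|', or "" when no cell is found.
    public static string ExtractCells(string html)
    {
        var cells = new List<string>();
        foreach (Match m in CellPattern.Matches(html ?? ""))
        {
            // Drop inner tags, then decode entities (&nbsp; becomes U+00A0).
            string text = TagPattern.Replace(m.Groups["cell"].Value, "");
            text = WebUtility.HtmlDecode(text).Replace('\u00A0', ' ').Trim();
            cells.Add(text);
        }
        return string.Join("|", cells);
    }
}
```

Assuming the page returned by GetNewsUrl contains a table, a call such as `TableCellSketch.ExtractCells(GetNewsUrl(strUrl)).Split('|')` would give one string per cell, ready to be written to the database. Collecting the cells in a list and joining once also avoids the leading or trailing "|" that the string concatenation in GetValue has to trim afterwards.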