Getting Around a Website's Page Download Restriction

Source: Internet | Published by: 剑灵的优化知乎 | Editor: 程序博客网 | Date: 2024/05/21 10:25

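Yesterday I was looking into how to download the voting page for an event on a certain website (this smells a bit like cheating, so keep it quiet).
In .NET, the data returned by a given URL can normally be downloaded with code like this: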

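// Requires: using System.IO; using System.Net; using System.Text;
// strUrl holds the address of the page to download.
HttpWebRequest loHttp = (HttpWebRequest)WebRequest.Create(strUrl);
loHttp.Method = "GET";

string strResult;
using (HttpWebResponse myHttpWebResponse = (HttpWebResponse)loHttp.GetResponse())
using (StreamReader reader = new StreamReader(
    myHttpWebResponse.GetResponseStream(), Encoding.GetEncoding("gb2312")))
{
    // Read to the end of the stream: a single Stream.Read call may return only
    // part of the body, and decoding a fixed-size buffer appends trailing NULs.
    strResult = reader.ReadToEnd();
}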

But when I used this method to download the page, I hit an obstacle. The content of strResult was

What do you want?

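rather than the page content visible in a browser. Evidently the site had taken precautions. A browser can download and display the page,
but a .NET program cannot, so the server is apparently checking the client type. Searching MSDN turned up the following: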

-------------------------------------------------
HOWTO: Determine Browser Type in Server-Side Script Without the BrowserType Object
SUMMARY
There are two common methods in server-side script to determine information about the browser that is being used by the client:
The BrowserType object
The Request.ServerVariables("HTTP_USER_AGENT") method
This article describes the Request.ServerVariables("HTTP_USER_AGENT") method, which provides more detailed information about the browser than the BrowserType object. For additional information about the BrowserType object (as well as the use of client-side script to obtain browser information), click the article number below to view the article in the Microsoft Knowledge Base:
167820 HOWTO: Determine Browser Version from a Script
...
MORE INFORMATION
The following sample code illustrates the use of Request.ServerVariables("HTTP_USER_AGENT"):
<%
   dim UserAgent

   UserAgent = Request.ServerVariables("HTTP_USER_AGENT")
   Response.Write "<p>" & UserAgent & "</p>"

   if instr(1,UserAgent,"MSIE") > 0 then
      Response.Write "Browser is Internet Explorer"
   else
      if instr(1,UserAgent,"MSPIE") > 0 then
         Response.Write "Browser is Pocket Internet Explorer"
      else
         Response.Write "Browser is not Internet Explorer"
      end if
   end if
%>

-------------------------------------------------
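
In C# terms, a check of this kind might look roughly like the sketch below. The site's real logic is unknown, so the class name, the method names, and the placeholder page body are all hypothetical; only the "MSIE" substring test and the "What do you want?" reply come from the text above. Note that HttpWebRequest sends no User-Agent header unless one is set, which would explain why the plain download was rejected under such a check.

```csharp
using System;

static class UserAgentGate
{
    // Hypothetical server-side test, mirroring instr(1, UserAgent, "MSIE") > 0
    // from the quoted script (a case-sensitive substring search).
    public static bool LooksLikeInternetExplorer(string userAgent)
    {
        return !string.IsNullOrEmpty(userAgent)
            && userAgent.IndexOf("MSIE", StringComparison.Ordinal) >= 0;
    }

    // Hypothetical response logic: real page for "browsers", a brush-off otherwise.
    public static string Respond(string userAgent)
    {
        return LooksLikeInternetExplorer(userAgent)
            ? "<html>(voting page placeholder)</html>"
            : "What do you want?";
    }

    static void Main()
    {
        // No User-Agent at all, as with a default HttpWebRequest.
        Console.WriteLine(Respond(null));
        // A User-Agent string of the kind Internet Explorer 6 sends.
        Console.WriteLine(Respond("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"));
    }
}
```

A check like this inspects only a header that the client itself supplies, which is what makes it easy to satisfy from code.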

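Could this site be using exactly this kind of check? Let's try it.
Conveniently, .NET's HttpWebRequest object has a UserAgent property, so there seems to be a way in.
Add one line to the earlier download code (the line that sets UserAgent):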

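// Requires: using System.IO; using System.Net; using System.Text;
// strUrl holds the address of the page to download.
HttpWebRequest loHttp = (HttpWebRequest)WebRequest.Create(strUrl);

loHttp.UserAgent = "MSIE"; // Added line: pose as the Microsoft IE browser

loHttp.Method = "GET";

string strResult;
using (HttpWebResponse myHttpWebResponse = (HttpWebResponse)loHttp.GetResponse())
using (StreamReader reader = new StreamReader(
    myHttpWebResponse.GetResponseStream(), Encoding.GetEncoding("gb2312")))
{
    // Read to the end of the stream: a single Stream.Read call may return only
    // part of the body, and decoding a fixed-size buffer appends trailing NULs.
    strResult = reader.ReadToEnd();
}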

Compile the code and download the page again. Great! The file came down without a problem.
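
The bare string "MSIE" was enough for this site, but a stricter server might look for a complete browser string. As a hedged sketch, the download can be wrapped in a small helper that sends a full Internet Explorer 6 style User-Agent. The class name, method names, and the choice of default User-Agent are assumptions made for illustration, not part of the original code.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class BrowserLikeDownloader
{
    // Example User-Agent of the kind Internet Explorer 6 sends (an assumption;
    // any value the target server accepts would do).
    public const string DefaultUserAgent =
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)";

    // Builds the request without touching the network.
    public static HttpWebRequest CreateRequest(string url, string userAgent)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.UserAgent = userAgent;
        return request;
    }

    // Sends the request and decodes the whole response body.
    public static string Download(string url, Encoding encoding)
    {
        HttpWebRequest request = CreateRequest(url, DefaultUserAgent);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream(), encoding))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("usage: BrowserLikeDownloader <url>");
            return;
        }
        // UTF-8 is used here as a placeholder. For a gb2312 page, pass
        // Encoding.GetEncoding("gb2312"); on .NET Core and later this first needs
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
        Console.WriteLine(Download(args[0], Encoding.UTF8));
    }
}
```

Keeping request construction separate from the network call makes it possible to inspect the headers that will be sent before anything goes over the wire.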

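For web pages, it seems all but impossible for the server to completely prevent malicious access. Headers such as User-Agent are supplied
by the client, so a program can send whatever a browser would. Unless you ban browser access outright, as long as the physical
communication link exists and the data is reachable, the server cannot arbitrarily restrict what the client does.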