Scraping HTML with .NET

Source: Internet | Published by: 阿里云快照恢复 | Editor: 程序博客网 | Time: 2024/05/17 16:02

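A project I'm working on recently needed a small feature, a really tiny one: scraping the content of a given website. The catch is that this site loads its main content through Ajax, so the usual tools such as WebClient only get the initial HTML and never see the Ajax-loaded part. I could analyze the traffic with HttpWatch or a similar tool and then simulate the POST requests, but that costs too much effort for what is only a minor feature.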

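After asking on 博客园 (cnblogs), I got several suggestions: Node.js, Selenium, and PhantomJS. I started with Node.js and read a few introductions, but ended up dropping it. For reference, 开发指南_中文正版.pdf is its Chinese-language guide.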

Here are the other two approaches:

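Selenium: the package is selenium-dotnet-strongnamed-2.35.0.zip, which contains builds for both the .NET 3.5 and .NET 4.0 frameworks. Selenium drives the page through a browser-specific driver, for example IEDriverServer.exe for Internet Explorer; Firefox, Chrome and others are supported as well. Details and downloads are on the official Selenium site.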

Add the DLLs and the driver that Selenium needs to the project (I put them in Bin). The main code is as follows:

// Directives needed at the top of the code-behind file
// (WebDriverWait lives in WebDriver.Support.dll):
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.IE;
using OpenQA.Selenium.Support.UI;

protected void Button1_Click(object sender, EventArgs e)
{
    // Create a new instance of the Internet Explorer driver.
    // The remainder of the code relies on the IWebDriver interface,
    // not the implementation, so other drivers (FirefoxDriver,
    // ChromeDriver, etc.) can be swapped in. Each driver needs its own
    // configuration; see the wiki pages for the individual drivers at
    // http://code.google.com/p/selenium/wiki for further information.
    IWebDriver driver = new InternetExplorerDriver();

    try
    {
        // Navigation differs slightly from the Java version
        // because 'get' is a keyword in C#.
        driver.Navigate().GoToUrl("http://data.shishicai.cn/cqssc/haoma/");

        // The page content is rendered dynamically with JavaScript.
        // This wait (10 second timeout) is created but no condition is
        // applied yet; call wait.Until(...) with a condition that matches
        // the Ajax-loaded content, in the style of the Selenium sample below.
        WebDriverWait wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
        //wait.Until((d) => { return d.Title.ToLower().StartsWith("cheese"); });

        // Show the HTML of the page as the browser currently has it.
        TextBox1.Text = "Page source is: " + driver.PageSource;
    }
    finally
    {
        // Close the browser and stop the driver, even if an error occurred.
        driver.Quit();
    }
}

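This does retrieve the page content, but while it runs, a DOS (console) window and the browser window opened by the driver pop up. Both close automatically when execution finishes. This is the part I'm not happy with, and I don't know whether there is a way to avoid it.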
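One option that may help with the console window is the HideCommandPromptWindow property on the driver service. The following is only a minimal sketch, not a tested fix: it assumes the Selenium .NET build in use exposes that property (later 2.x releases do; I have not confirmed 2.35.0), that IEDriverServer.exe can be found by CreateDefaultService(), and the class and method names QuietFetch and GetRenderedHtml are made up for illustration. It hides only the driver's console window; the Internet Explorer window itself still appears. The readyState check is a generic stand-in, and an Ajax-heavy page will probably need a page-specific wait condition instead.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.IE;
using OpenQA.Selenium.Support.UI;

// Hypothetical helper, not part of Selenium.
static class QuietFetch
{
    public static string GetRenderedHtml(string url)
    {
        // Start IEDriverServer.exe without showing its console window.
        InternetExplorerDriverService service =
            InternetExplorerDriverService.CreateDefaultService();
        service.HideCommandPromptWindow = true;

        IWebDriver driver = new InternetExplorerDriver(service);
        try
        {
            driver.Navigate().GoToUrl(url);

            // Wait up to 10 seconds for the document to finish loading.
            // Assumption: this is enough for the page in question; content
            // loaded later by Ajax may need a more specific condition.
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
            wait.Until(d => ((IJavaScriptExecutor)d)
                .ExecuteScript("return document.readyState")
                .Equals("complete"));

            return driver.PageSource;
        }
        finally
        {
            // Always close the browser and stop the driver process.
            driver.Quit();
        }
    }
}
```

In the button handler this would be used as TextBox1.Text = QuietFetch.GetRenderedHtml("http://data.shishicai.cn/cqssc/haoma/");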

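PhantomJS: it can be downloaded from the official site or from a mirror. The version I downloaded is phantomjs-1.9.1-windows. After extracting it, the folder contains the following: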

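Among these, the first item is the examples folder, and the .exe is a runtime for executing JavaScript that can be called directly from cmd. I can't remember which article said so, but it claimed that in cmd you must first change to the directory containing the executable before a script will run, otherwise it fails. That turned out to be true for me, most likely because phantomjs.exe is not on the PATH and the script path is resolved relative to the current directory. So: open cmd, cd to the directory containing phantomjs.exe, then run phantomjs hello.js to see the result. The script I ran was rasterize.js, a screenshot script in the examples folder.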

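I passed it http://www.hao123.com and hao123.png as arguments, which means: take a screenshot of the given URL and save it as hao123.png. So, as with Selenium, you can add the required files to the project and then invoke the command from your code. The advantages are that it is fast and that neither a DOS window nor a browser window pops up.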
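Calling PhantomJS from .NET could look roughly like the sketch below. It is an illustration under assumptions, not the code from my project: the names PhantomRunner, Quote, BuildArguments and Run are made up, phantomDir is assumed to be the extracted phantomjs-1.9.1-windows folder containing phantomjs.exe and examples\rasterize.js, and the quoting helper only handles arguments without embedded double quotes. Setting WorkingDirectory plays the role of the cd step in cmd, and CreateNoWindow keeps the console window from appearing.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper for running a PhantomJS script from .NET.
static class PhantomRunner
{
    // Wraps one argument in double quotes (simple case: no embedded quotes).
    public static string Quote(string arg)
    {
        return "\"" + arg + "\"";
    }

    // Builds the command line: <script> <url> <outputFile>.
    public static string BuildArguments(string scriptPath, string url, string outputFile)
    {
        return string.Join(" ", Quote(scriptPath), Quote(url), Quote(outputFile));
    }

    // Runs examples\rasterize.js and returns whatever the script printed.
    public static string Run(string phantomDir, string url, string outputFile)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = Path.Combine(phantomDir, "phantomjs.exe"),
            Arguments = BuildArguments(
                Path.Combine("examples", "rasterize.js"), url, outputFile),
            WorkingDirectory = phantomDir,   // same effect as cd in cmd
            UseShellExecute = false,         // required for redirection
            CreateNoWindow = true,           // no console window
            RedirectStandardOutput = true
        };

        using (Process process = Process.Start(startInfo))
        {
            // Read the output before waiting, so a full pipe cannot block the child.
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return output;
        }
    }
}
```

With the folder path filled in, PhantomRunner.Run(phantomDir, "http://www.hao123.com", "hao123.png") would correspond to the command I ran by hand; a relative output name such as hao123.png ends up in phantomDir because that is the working directory.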