How to Extract the Title of an HTML Document Directly

Source: Internet | Published by: code编程助手app | Editor: 程序博客网 | Time: 2024/06/03 23:28

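In C#, the usual way to work with HTML is to display it in a WebBrowser control. This is simple, and the only thing to watch out for is that after calling Navigate on a URL, the document may be used before it has actually finished loading. Navigate is asynchronous: it returns immediately, without waiting for the page to load.
For example: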

    WebBrowser webBrowser = new WebBrowser();
    Uri uri = new Uri("http://www.google.com.hk/");
    webBrowser.Navigate(uri);
    String title = webBrowser.DocumentTitle;

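At this point, the value returned by webBrowser.DocumentTitle is an empty string.
The simplest way to get the title element of the loaded URL is to handle the DocumentCompleted event of the WebBrowser. The code is as follows.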

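    using System;
    using System.Windows.Forms;

    static class Program
    {
        [STAThread] // WebBrowser wraps an ActiveX control and requires an STA thread
        static void Main(string[] args)
        {
            WebBrowser webBrowser = new WebBrowser();
            webBrowser.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowser_DocumentCompleted);
            Uri uri = new Uri("http://www.google.com.hk/");
            webBrowser.Navigate(uri);
            // Do not read DocumentTitle here: the page has not finished loading yet.
            Application.Run(); // pump messages so that DocumentCompleted can fire
        }

        // Must be static so that the static Main method can reference it.
        static void webBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            WebBrowser browser = (WebBrowser)sender;
            if (browser.ReadyState == WebBrowserReadyState.Complete)
            {
                String title = browser.DocumentTitle;
                Console.WriteLine(title);
                Application.ExitThread(); // stop the message loop once the title is known
            }
        }
    }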

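Now consider a simple requirement: we want to get the content of a known HTML document by reading the HTML file directly, without using WebBrowser.
For this we need a COM component: Microsoft HTML Object Library. Its IHTMLDocument2 interface gives access to a lot of information about the document.
Tip: Right-click the project -> Add Reference -> COM and select this COM component. A reference named MSHTML then appears in the project. The component ultimately uses mshtml.dll under X:/Windows/System32.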

The following code gets the title element of a .html file. To get other information, the function only needs small modifications.

    // Requires: using System; using System.IO; using System.Text;

    /// <summary>
    /// Get the html file's title.
    /// </summary>
    /// <param name="strURL">Specified html file's URL (Path).</param>
    /// <returns>If the html file has title, return it, otherwise, return empty string.</returns>
    public static String GetHTMLTitle(String strURL)
    {
        if (String.IsNullOrEmpty(strURL))
        {
            return String.Empty;
        }
        try
        {
            // With "Embed Interop Types" enabled on the reference, use new mshtml.HTMLDocument() instead.
            mshtml.HTMLDocumentClass docObject = new mshtml.HTMLDocumentClass();
            mshtml.IHTMLDocument2 doc2 = docObject as mshtml.IHTMLDocument2;
            String html;
            // Load the .html file specified by strURL; the using block disposes the reader.
            using (StreamReader sr = new StreamReader(strURL, Encoding.UTF8))
            {
                // Read the .html into a string
                html = sr.ReadToEnd();
            }
            doc2.write(html);
            doc2.close();
            return doc2.title;
        }
        catch (Exception)
        {
            // Unreadable file or COM failure: fall through and return an empty string.
        }
        return String.Empty;
    }
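The remark about adapting the function to other information can be illustrated with a small variation. What follows is only a sketch, assuming the same Microsoft HTML Object Library reference: the names `HtmlInfo` and `GetHTMLLinks` are hypothetical and do not come from the original code, and it reads the `links` collection of `IHTMLDocument2` instead of `title`.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using mshtml; // COM reference: Microsoft HTML Object Library

public static class HtmlInfo // hypothetical helper class, not part of the original article
{
    /// Collect the href of every hyperlink found in a local .html file.
    public static List<string> GetHTMLLinks(string path)
    {
        List<string> hrefs = new List<string>();
        if (String.IsNullOrEmpty(path))
        {
            return hrefs;
        }

        // Read the file as UTF-8, as the title example does.
        string html = File.ReadAllText(path, Encoding.UTF8);

        // new HTMLDocument() also compiles when "Embed Interop Types" is enabled.
        IHTMLDocument2 doc2 = (IHTMLDocument2)new HTMLDocument();
        doc2.write(html);
        doc2.close();

        // links holds the anchor and area elements that carry an href.
        foreach (IHTMLElement element in doc2.links)
        {
            // Flag 2 is assumed here to return the value as set in the source markup.
            string href = element.getAttribute("href", 2) as string;
            if (!String.IsNullOrEmpty(href))
            {
                hrefs.Add(href);
            }
        }
        return hrefs;
    }
}
```

Unlike the title function, this sketch lets exceptions propagate, so the caller decides how to treat an unreadable file.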

The following approach can load not only a local .html document but also a web site, such as http://www.google.com.hk/

    // Requires: using System; using mshtml;
    // plus references to System.Windows.Forms and the MSHTML COM component.

    /// <summary>
    /// Get the html title from the specified URL or local .html file.
    /// </summary>
    /// <param name="strURL">Specified URL or local .html file (Path).</param>
    /// <returns>If the html file has title, return it, otherwise, return empty string.</returns>
    public static String GetHTMLTitle(String strURL)
    {
        if (String.IsNullOrEmpty(strURL))
        {
            return String.Empty;
        }
        try
        {
            HTMLDocumentClass rootDocument = new HTMLDocumentClass();
            IHTMLDocument2 document2 = rootDocument;
            IHTMLDocument4 document4 = rootDocument;
            // Write an empty page into the root document before creating the child document.
            document2.write("<html></html>");
            document2.close();
            HTMLDocument myDocument = document4.createDocumentFromUrl(strURL, null) as HTMLDocument;
            int i = 0;
            while (myDocument.readyState != "complete")
            {
                if (++i > 50)
                {
                    // Give up after about 5 seconds instead of looping forever.
                    Console.WriteLine("Time Out!");
                    return String.Empty;
                }
                System.Threading.Thread.Sleep(100);
                // Process pending messages so that the document can continue loading.
                System.Windows.Forms.Application.DoEvents();
            }
            return myDocument.title;
        }
        catch (Exception)
        {
            // Invalid URL or COM failure: fall through and return an empty string.
        }
        return String.Empty;
    }
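The wait loop above is a poll-with-timeout pattern. As a rough sketch, it could be factored into a small reusable helper. The names `Polling` and `WaitUntil` and all of the parameters are hypothetical, introduced here only for illustration, and the `pump` callback stands in for the call to `Application.DoEvents()`.

```csharp
using System;
using System.Threading;

public static class Polling // hypothetical helper class, not part of the original article
{
    /// Checks condition up to maxAttempts times, sleeping delayMs after each failed check.
    /// Returns true as soon as condition() is true, or false when the attempts run out.
    public static bool WaitUntil(Func<bool> condition, int maxAttempts, int delayMs, Action pump)
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            if (condition())
            {
                return true;
            }
            Thread.Sleep(delayMs);
            if (pump != null)
            {
                pump(); // e.g. Application.DoEvents, so that loading can make progress
            }
        }
        return false;
    }
}
```

With such a helper, the loop might be expressed as `Polling.WaitUntil(() => myDocument.readyState == "complete", 50, 100, System.Windows.Forms.Application.DoEvents)`, with the caller returning an empty string when the result is false.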

Reference: http://capsulecorp.studio-web.net/tora9/cs/mshtml/HTMLDocument.html