Using Regex to Query Attribute Values in HTML/XML Tags

Source: Internet. Published by: ncstudio软件下载. Editor: 程序博客网. Time: 2024/06/17 05:16

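Purpose: get the value of a specified attribute on a given tag in HTML or XML content.
Input parameters:

strHtml (string): the HTML or XML content.
strTagName (string): the tag name.
strAttributeName (string): the attribute name.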

Code for the function (written as static for convenience):

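using System.Collections.Generic;
using System.Text.RegularExpressions;

public static string[] GetAttribute(string strHtml, string strTagName, string strAttributeName)
{
    List<string> lstAttribute = new List<string>();
    // Escape the names so any regex metacharacters in them are matched literally.
    string strTag = Regex.Escape(strTagName);
    string strAttr = Regex.Escape(strAttributeName);
    // Three alternatives share the named group "attr": double-quoted, single-quoted, unquoted.
    // [^>]*? and [^\s>]+ keep the match from running past the end of the tag.
    string strPattern = string.Format("<\\s*{0}\\s+[^>]*?(({1}\\s*=\\s*\"(?<attr>[^\"]+)\")|({1}\\s*=\\s*'(?<attr>[^']+)')|({1}\\s*=\\s*(?<attr>[^\\s>]+)\\s*))[^>]*>"
        , strTag
        , strAttr);
    MatchCollection matches = Regex.Matches(strHtml, strPattern, RegexOptions.IgnoreCase);
    foreach (Match m in matches)
    {
        lstAttribute.Add(m.Groups["attr"].Value);
    }
    return lstAttribute.ToArray();
}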

Usage (fetch the src attribute of every <img> on a web page):

// Requires: using System; using System.IO; using System.Net; using System.Text;
// The page to fetch
string Url = "http://www.gov.tw/";
HttpWebRequest webReq = (HttpWebRequest)WebRequest.Create(Url);
using (HttpWebResponse webResp = (HttpWebResponse)webReq.GetResponse())
{
    // Use UTF-8 if the Content-Type header declares it; otherwise fall back to code page 950 (Big5)
    Encoding encPage = webResp.ContentType.IndexOf("utf-8", StringComparison.OrdinalIgnoreCase) >= 0 ? Encoding.UTF8 : Encoding.GetEncoding(950);
    using (StreamReader reader = new StreamReader(webResp.GetResponseStream(), encPage))
    {
        string strContent = reader.ReadToEnd();
        // List the src attribute value of every <img>
        string[] aryValue = GetAttribute(strContent, "img", "src");
        for (int i = 0; i < aryValue.Length; i++)
        {
            Console.WriteLine(aryValue[i]);
        }
    }
}

Notes:

XML can also be queried with XPath; going through the DOM feels more object-oriented.
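For well-formed XML, the XPath route mentioned above might look like the following minimal sketch. The class and method names (XPathAttributeDemo, GetAttributeByXPath) are hypothetical, and the sketch assumes the document uses no XML namespaces and that the tag and attribute names are plain names that are safe to splice into an XPath expression.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XPathAttributeDemo
{
    // Hypothetical helper: returns the value of attributeName on every tagName element.
    // Assumes well-formed XML without namespaces; XPath matching is case-sensitive.
    public static string[] GetAttributeByXPath(string xml, string tagName, string attributeName)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml); // throws XmlException if the input is not well-formed

        List<string> values = new List<string>();
        // For example, "//img/@src" selects the src attribute node of every <img> element.
        XmlNodeList nodes = doc.SelectNodes("//" + tagName + "/@" + attributeName);
        foreach (XmlNode node in nodes)
        {
            values.Add(node.Value);
        }
        return values.ToArray();
    }
}
```

Unlike the regex, this only works when the input parses as XML, so it is not a drop-in replacement for arbitrary HTML pages.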
This extraction does not skip tags that have been commented out (<!-- xxxxx -->).
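One possible workaround is to strip comments before running the attribute regex. The sketch below uses a hypothetical helper named RemoveComments; it assumes that removing everything between <!-- and --> is acceptable, which may not hold for comment-like text inside scripts or CDATA sections.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentStripper
{
    // Hypothetical helper: removes <!-- ... --> blocks from the markup.
    public static string RemoveComments(string markup)
    {
        // Singleline lets '.' match newlines, so comments spanning several lines are removed too.
        return Regex.Replace(markup, "<!--.*?-->", string.Empty, RegexOptions.Singleline);
    }
}
```

With this in place, the call could become GetAttribute(CommentStripper.RemoveComments(strContent), "img", "src").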
When extracting src or href, it is best to post-process the value and convert relative paths to absolute ones.
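The relative-to-absolute conversion can be sketched with System.Uri. The helper name ToAbsolute is hypothetical, and the sketch assumes the URL of the fetched page is the correct base (it ignores any <base> tag in the page).

```csharp
using System;

public static class UrlResolver
{
    // Hypothetical helper: resolves an attribute value against the URL of the page it came from.
    public static string ToAbsolute(string pageUrl, string attributeValue)
    {
        Uri baseUri = new Uri(pageUrl);
        Uri resolved;
        // Relative values are combined with the base; values that are already absolute are kept.
        if (Uri.TryCreate(baseUri, attributeValue, out resolved))
        {
            return resolved.AbsoluteUri;
        }
        // Fall back to the raw value if it cannot be parsed as a URI.
        return attributeValue;
    }
}
```

For example, ToAbsolute("http://www.gov.tw/news/index.html", "images/logo.png") resolves against the page's directory and yields http://www.gov.tw/news/images/logo.png.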
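The regex as written handles the following cases:
1. Attribute values enclosed in double quotes: src="12345678"
2. Attribute values enclosed in single quotes: src='12345678'
3. Attribute values with no enclosing quotes: src=12345678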
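To check the three cases side by side, here is a self-contained sketch. The class name QuoteStyleDemo is hypothetical. The pattern follows the GetAttribute function from this article, with two assumptions of mine: the gap before the attribute is written as [^>]*? and the unquoted branch as [^\s>]+, so that a match stays inside one tag.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuoteStyleDemo
{
    // Variant of the article's GetAttribute, repeated here so the sketch is self-contained.
    public static string[] GetAttribute(string strHtml, string strTagName, string strAttributeName)
    {
        // Alternatives, in order: double-quoted, single-quoted, unquoted value.
        string strPattern = string.Format(
            "<\\s*{0}\\s+[^>]*?(({1}\\s*=\\s*\"(?<attr>[^\"]+)\")|({1}\\s*=\\s*'(?<attr>[^']+)')|({1}\\s*=\\s*(?<attr>[^\\s>]+)\\s*))[^>]*>",
            Regex.Escape(strTagName),
            Regex.Escape(strAttributeName));

        List<string> lstAttribute = new List<string>();
        foreach (Match m in Regex.Matches(strHtml, strPattern, RegexOptions.IgnoreCase))
        {
            lstAttribute.Add(m.Groups["attr"].Value);
        }
        return lstAttribute.ToArray();
    }

    // One <img> per quoting style.
    public const string Sample =
        "<img src=\"a.png\" alt=\"x\">\n<img src='b.png'>\n<img src=c.png>";
}
```

Calling QuoteStyleDemo.GetAttribute(QuoteStyleDemo.Sample, "img", "src") returns a.png, b.png and c.png, one value per quoting style.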