Automatically Grab Images From a Website With C#


Requirement definition


The application should be able to:

  1. Retrieve the HTML markup as text from a given URL, through an HTTP request;
  2. Extract the URI of the image from the HTML markup;
  3. Save the image from its URI to the local disk.


First of all, let's set up a console application in VS 2012.



Get HTML markup


Googling for a while led me to the post: Get HTML code from a website C#. There are plenty of ways to do that, at different levels of abstraction, but for such a common task there are always convenient shortcuts. One answer there is quite eye-catching, so I went to http://fizzlerex.codeplex.com/.


I downloaded the zip file and extracted the pack into:



Add references to the DLL files:



Add only these two DLLs; if you also add Fizzler.dll, you will get a compilation error later. Following the site's instructions, I arrived at code that reads like this:


    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    // Fizzler
    using HtmlAgilityPack;
    using Fizzler.Systems.HtmlAgilityPack;
    // end

    class Program
    {
        static void Main(string[] args)
        {
            string httpURL = "http://www.torlundvall.com/gallery.asp?start=1&detail=1";
            HtmlWeb web = new HtmlWeb();
            HtmlDocument document = web.Load(httpURL);
            HtmlNode page = document.DocumentNode;
            var items = page.QuerySelectorAll("img");
            //...
            Console.ReadLine();
        }
    }


It doesn't work: when execution reaches page.QuerySelectorAll(), it throws an exception, which is documented very well here: http://stackoverflow.com/questions/29053667/fizzelerex-throws-an-exception-when-trying-to-web-scrape-a-website-in-c-sharp-20. Apparently this is not a mature solution.


However, we can still use the InnerHtml property of the document node to get the HTML markup:


    class Program
    {
        static void Main(string[] args)
        {
            string httpURL = "http://www.torlundvall.com/gallery.asp?start=1&detail=1";
            HtmlWeb web = new HtmlWeb();
            HtmlDocument document = web.Load(httpURL);
            HtmlNode page = document.DocumentNode;
            string markup = page.InnerHtml;
            Console.ReadLine();
        }
    }


Scrape HTML markup


For this step, we can no longer count on Fizzler. We have to resort to the old-fashioned way: regular expressions.

But before that, let's take a close look at the site we are going to scrape:



Notice the URL: the only part that matters is the value of the query string parameter detail, so its pattern can be written as:

http://www.torlundvall.com/gallery.asp?start=1&detail=<number>
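
As a minimal sketch of how this pattern could be turned into a list of page URLs, the helper below appends a detail number to the fixed prefix. The names GalleryUrls, ForDetail and ForRange are hypothetical and only serve the illustration; the range of valid detail numbers is an assumption you would have to check against the site.

```csharp
using System;
using System.Collections.Generic;

static class GalleryUrls
{
    // URL prefix taken from the pattern above; only the detail value varies.
    const string DetailUrlPrefix = "http://www.torlundvall.com/gallery.asp?start=1&detail=";

    // Returns the detail-page URL for one number.
    public static string ForDetail(int detail)
    {
        return DetailUrlPrefix + detail.ToString();
    }

    // Yields the URLs for every detail number from first to last, inclusive.
    public static IEnumerable<string> ForRange(int first, int last)
    {
        for (int i = first; i <= last; i++)
        {
            yield return ForDetail(i);
        }
    }
}
```

Calling GalleryUrls.ForRange(1, 3) would yield the three URLs ending in detail=1, detail=2 and detail=3, which can then be loaded one by one.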


And its HTML is:

<html><head><title>Gallery</title><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"><style>.maintext {font-family:Arial, Verdana, Helvetica, sans-serif; font-size:12px; TEXT-DECORATION: none; color:000000;}.soldtext {font-family:Arial, Verdana, Helvetica, sans-serif; font-size:12px; TEXT-DECORATION: none; color:#CC0000;}A {font-family:Arial, Verdana, Helvetica, sans-serif; font-size:12px; TEXT-DECORATION: Underline; color:000000;}</style><script LANGUAGE="JavaScript">var popup_url = 'photo_popup.asp?photo=';var windowvars;var popupwindow = null;var popupwindow_open = false;function popup(img,imgw,imgh) {if (popupwindow_open) {        closePopupwindow();    }windowvars = 'menubar=0,scrollbars=0,toolbar=0,location=0,resizeable=0,width='+imgw+',height='+imgh+',top=100';    popupwindow = window.open(popup_url+img,"PhotoPopUp",windowvars);popupwindow_open = true;if (window.focus) {popupwindow.focus();}}function closePopupwindow() {    if (popupwindow != null) {        if (popupwindow_open) {            popupwindow_open = false;            popupwindow.close();        }    }}</script></head><body bgcolor="#FFFFFF" leftmargin="0" marginwidth="0" marginheight="0"><table id="Table_01" width="550" height="768" border="0" cellpadding="0" cellspacing="0" align="center"><tr><td colspan="15"><img src="images/gallery_01.jpg" width="550" height="16" alt=""></td></tr><tr><td rowspan="2" background="images/gallery_02_tall.jpg"><img src="images/gallery_02.jpg" width="17" height="572" alt=""></td><td><a href="news.html"><img src="images/gallery_03.jpg" width="56" height="64" title="news" border="0"></a></td><td><img src="images/gallery_04.jpg" width="24" height="64" alt=""></td><td><a href="gallery.asp"><img src="images/gallery_05.jpg" width="56" height="64" title="gallery" border="0"></a></td><td><img src="images/gallery_06.jpg" width="23" height="64" alt=""></td><td><a href="discography.html"><img src="images/gallery_07.jpg" width="54" height="64" 
title="discography" border="0"></a></td><td><img src="images/gallery_08.jpg" width="22" height="64" alt=""></td><td><a href="ps.asp"><img src="images/gallery_09.jpg" width="56" height="64" title="personal statement" border="0"></a></td><td><img src="images/gallery_10.jpg" width="21" height="64" alt=""></td><td><a href="cv.asp"><img src="images/gallery_11.jpg" width="55" height="64" title="curriculum vitae" border="0"></a></td><td><img src="images/gallery_12.jpg" width="18" height="64" alt=""></td><td><a href="photos.html"><img src="images/gallery_13.jpg" width="56" height="64" title="photos" border="0"></a></td><td><img src="images/gallery_14.jpg" width="20" height="64" alt=""></td><td><a href="links.html"><img src="images/gallery_15.jpg" width="56" height="64" title="links" border="0"></a></td><td rowspan="2" background="images/gallery_16_tall.jpg"><img src="images/gallery_16.jpg" width="16" height="572" alt=""></td></tr><tr><td colspan="13" background="images/gallery_bg_stretched.jpg" valign="top"><table cellpadding="0" cellspacing="0" border="0" align="center"><!--show detail--><tr><td height="5" width="30"></td><td height="5" width="457"></td><td height="5" width="30"></td></tr><tr><td colspan="3" align="center" height="470" valign="middle"><a href="gallery.asp?start=1"><img src="images/paintings/tl-waiting_1.jpg" border="0"></a></td></tr><tr><td align="left"><!--<a href="gallery.asp?detail=0"><font face="Arial, Verdana" size="2"><br>back</font></a>--></td><td align="center"><a href="gallery.asp?start=1"><font face="Arial, Verdana" size="2"><br>return to gallery</font></a></td><td align="right"><!--<a href="gallery.asp?detail=2"><font face="Arial, Verdana" size="2"><br>next</font></a>--></td></tr></table>                </td></tr><!--<tr>    <td background="images/gallery_02.jpg"></td><td colspan="13" background="images/gallery_flipped_bg.jpg" align="center">        <a href="javascript:popup('bio_gallery.jpg',533,404);" alt=""><img 
src="images/discography/details_button.jpg" border="0"></a></td>        <td background="images/gallery_17.jpg"></td></tr>-->    <tr>    <td colspan="15">        <img src="images/gallery_18.jpg" width="550" height="180" alt=""></td>    </tr></table></body></html>

The only information we need is the src attribute of the innermost <img> tag:

<img src="images/paintings/tl-waiting_1.jpg" border="0">


The name of the image file can contain lowercase letters, digits, hyphens, and underscores. Following the tutorial at http://www.dotnetperls.com/regex-match, I arrived at this:


    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    // Fizzler
    using HtmlAgilityPack;
    using Fizzler.Systems.HtmlAgilityPack;
    // end
    // Regular Expression
    using System.Text.RegularExpressions;
    // end

    class Program
    {
        static void Main(string[] args)
        {
            string imgWebDir = "http://www.torlundvall.com/images/paintings/";
            string httpURL = "http://www.torlundvall.com/gallery.asp?start=144&detail=";
            // The dot before "jpg" is escaped so that it matches a literal '.' only.
            string imgSrcPattern = @"<img src=""images/paintings/([A-Za-z0-9\-_]+)\.jpg""";
            string imgExtensionName = ".jpg";
            // The original listing does not define start and end; these are placeholder values.
            int start = 1;
            int end = 10;
            HtmlWeb web = new HtmlWeb();
            for (int i = start; i <= end; i++)
            {
                HtmlDocument document = web.Load(httpURL + i.ToString());
                HtmlNode page = document.DocumentNode;
                string markup = page.InnerHtml;
                Match match = Regex.Match(markup, imgSrcPattern, RegexOptions.IgnoreCase);
                if (match.Success)
                {
                    string fileName = match.Groups[1].Value;
                    string fullImgURL = imgWebDir + fileName + imgExtensionName;
                    //...
                }
            }
            Console.ReadLine();
        }
    }
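
To check the pattern without hitting the site, it can be tried against the sample <img> tag shown earlier. The sketch below is only an illustration: the class name ImageSrc and the method ExtractFileName are hypothetical, and the pattern is written with the dot before jpg escaped so that it matches a literal dot rather than any character.

```csharp
using System;
using System.Text.RegularExpressions;

static class ImageSrc
{
    // Pattern for the painting image tag; group 1 captures the file name
    // without its extension. The dot before "jpg" is escaped on purpose.
    const string Pattern = @"<img src=""images/paintings/([A-Za-z0-9\-_]+)\.jpg""";

    // Returns the captured file name, or null when the markup contains
    // no painting image.
    public static string ExtractFileName(string markup)
    {
        Match match = Regex.Match(markup, Pattern, RegexOptions.IgnoreCase);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

Passing the sample tag <img src="images/paintings/tl-waiting_1.jpg" border="0"> to ExtractFileName returns tl-waiting_1, while the navigation images under images/ are ignored because they are not in the paintings folder. With an unescaped dot, a string such as images/paintings/axjpg would also match, which is why the dot is escaped here.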


References:

Regex.Match Method (String)

Escape quotes in a C# Regex pattern


Save Image

Following the post "How to download image from url using c#", I used WebClient.DownloadFile to save each image:


    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    // Fizzler
    using HtmlAgilityPack;
    using Fizzler.Systems.HtmlAgilityPack;
    // end
    // Regular Expression
    using System.Text.RegularExpressions;
    // end
    // WebClient
    using System.Net;
    // end
    using System.IO;

    class Program
    {
        static void Main(string[] args)
        {
            string imgWebDir = "http://www.torlundvall.com/images/paintings/";
            string httpURL = "http://www.torlundvall.com/gallery.asp?start=144&detail=";
            // The dot before "jpg" is escaped so that it matches a literal '.' only.
            string imgSrcPattern = @"<img src=""images/paintings/([A-Za-z0-9\-_]+)\.jpg""";
            string imgExtensionName = ".jpg";
            string imgSaveDir = @"C:\torlundvall\";
            // The original listing does not define start and end; these are placeholder values.
            int start = 1;
            int end = 10;
            // DownloadFile fails if the target directory does not exist.
            Directory.CreateDirectory(imgSaveDir);
            HtmlWeb web = new HtmlWeb();
            for (int i = start; i <= end; i++)
            {
                HtmlDocument document = web.Load(httpURL + i.ToString());
                HtmlNode page = document.DocumentNode;
                string markup = page.InnerHtml;
                Match match = Regex.Match(markup, imgSrcPattern, RegexOptions.IgnoreCase);
                if (match.Success)
                {
                    string fileName = match.Groups[1].Value;
                    string fullImgURL = imgWebDir + fileName + imgExtensionName;
                    using (WebClient webClient = new WebClient())
                    {
                        webClient.DownloadFile(fullImgURL, imgSaveDir + fileName + imgExtensionName);
                    }
                }
            }
            Console.ReadLine();
        }
    }
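
The listing above builds the image URL and the local path by string concatenation, with imgWebDir hard-coded. As an alternative sketch (not what the original code does), the relative src value could be resolved against the page URL with System.Uri, and the local path built with Path.Combine. The class name ImagePaths and its two methods are hypothetical names chosen for this illustration.

```csharp
using System;
using System.IO;

static class ImagePaths
{
    // Resolves a src attribute value (relative or absolute) against the
    // URL of the page it was found on.
    public static Uri ResolveImageUri(string pageUrl, string src)
    {
        return new Uri(new Uri(pageUrl), src);
    }

    // Builds the local file path: the save directory plus the last
    // segment of the image URL's path.
    public static string LocalPathFor(Uri imageUri, string saveDir)
    {
        string fileName = Path.GetFileName(imageUri.AbsolutePath);
        return Path.Combine(saveDir, fileName);
    }
}
```

With the page URL http://www.torlundvall.com/gallery.asp?start=1&detail=1 and the src value images/paintings/tl-waiting_1.jpg, ResolveImageUri gives http://www.torlundvall.com/images/paintings/tl-waiting_1.jpg. The resulting Uri and local path can be passed directly to the WebClient.DownloadFile(Uri, String) overload.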



References:

WebClient.DownloadFile Method (Uri, String)













