Automatically Grab Images From a Website With C#
Requirement definition
The application should be able to:
- Retrieve the HTML markup in text from a given URL, through HTTP request;
- Extract the URI of the image from the HTML markup;
- Save the image from its URI to local disk.
First of all, let's set up a console application in VS 2012:
Get HTML markup
Googling for a while led me to the post: Get HTML code from a website C#. There are plenty of ways to do that, at different levels of abstraction. But for such a common task, there are always convenient shortcuts. This answer is quite eye-catching, so I went to http://fizzlerex.codeplex.com/.
I downloaded the zip file and extracted the pack into:
Add the references of the dll files:
Add just these two DLLs; if you also add Fizzler.dll, you will get a compilation error later. Following this site's instructions, I arrived at code that reads like:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
// Fizzler
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;
// end

class Program
{
    static void Main(string[] args)
    {
        string httpURL = "http://www.torlundvall.com/gallery.asp?start=1&detail=1";
        HtmlWeb web = new HtmlWeb();
        HtmlDocument document = web.Load(httpURL);
        HtmlNode page = document.DocumentNode;
        var items = page.QuerySelectorAll("img");
        //...
        Console.ReadLine();
    }
}
It doesn't work: when execution reaches page.QuerySelectorAll(), it throws an exception, which is documented very well here: http://stackoverflow.com/questions/29053667/fizzelerex-throws-an-exception-when-trying-to-web-scrape-a-website-in-c-sharp-20. Apparently this is not a mature solution.
However, we can at least still use another property, InnerHtml, to get the HTML markup:
class Program
{
    static void Main(string[] args)
    {
        string httpURL = "http://www.torlundvall.com/gallery.asp?start=1&detail=1";
        HtmlWeb web = new HtmlWeb();
        HtmlDocument document = web.Load(httpURL);
        HtmlNode page = document.DocumentNode;
        string markup = page.InnerHtml;
        Console.ReadLine();
    }
}
Scrape HTML markup
For this step, we cannot count on Fizzler any more. We have to resort to the old-fashioned way: regular expressions.
But before that, let's take a close look at the site we are going to scrape:
Notice the URL: the only part that matters is the value of the query string parameter detail, so its pattern can be written as:
http://www.torlundvall.com/gallery.asp?start=1&detail=<number>
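As a minimal sketch of how that pattern could be turned into a list of page URLs, the helper below builds one URL per detail number. The class name GalleryUrls, the method BuildDetailUrls, and the bounds passed to it are illustrative assumptions and are not part of the original program.

```csharp
using System;
using System.Collections.Generic;

public static class GalleryUrls
{
    // Prefix taken from the URL pattern above; only the detail value changes.
    private const string DetailUrlPrefix =
        "http://www.torlundvall.com/gallery.asp?start=1&detail=";

    // Returns one page URL per detail number in [first, last].
    // The bounds are hypothetical: the article does not say how many pages exist.
    public static List<string> BuildDetailUrls(int first, int last)
    {
        var urls = new List<string>();
        for (int detail = first; detail <= last; detail++)
        {
            urls.Add(DetailUrlPrefix + detail);
        }
        return urls;
    }
}
```

For example, GalleryUrls.BuildDetailUrls(1, 3) gives three URLs ending in detail=1, detail=2 and detail=3, which can then be loaded one at a time.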
And its HTML is:
<html><head><title>Gallery</title><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"><style>.maintext {font-family:Arial, Verdana, Helvetica, sans-serif; font-size:12px; TEXT-DECORATION: none; color:000000;}.soldtext {font-family:Arial, Verdana, Helvetica, sans-serif; font-size:12px; TEXT-DECORATION: none; color:#CC0000;}A {font-family:Arial, Verdana, Helvetica, sans-serif; font-size:12px; TEXT-DECORATION: Underline; color:000000;}</style><script LANGUAGE="JavaScript">var popup_url = 'photo_popup.asp?photo=';var windowvars;var popupwindow = null;var popupwindow_open = false;function popup(img,imgw,imgh) {if (popupwindow_open) { closePopupwindow(); }windowvars = 'menubar=0,scrollbars=0,toolbar=0,location=0,resizeable=0,width='+imgw+',height='+imgh+',top=100'; popupwindow = window.open(popup_url+img,"PhotoPopUp",windowvars);popupwindow_open = true;if (window.focus) {popupwindow.focus();}}function closePopupwindow() { if (popupwindow != null) { if (popupwindow_open) { popupwindow_open = false; popupwindow.close(); } }}</script></head><body bgcolor="#FFFFFF" leftmargin="0" marginwidth="0" marginheight="0"><table id="Table_01" width="550" height="768" border="0" cellpadding="0" cellspacing="0" align="center"><tr><td colspan="15"><img src="images/gallery_01.jpg" width="550" height="16" alt=""></td></tr><tr><td rowspan="2" background="images/gallery_02_tall.jpg"><img src="images/gallery_02.jpg" width="17" height="572" alt=""></td><td><a href="news.html"><img src="images/gallery_03.jpg" width="56" height="64" title="news" border="0"></a></td><td><img src="images/gallery_04.jpg" width="24" height="64" alt=""></td><td><a href="gallery.asp"><img src="images/gallery_05.jpg" width="56" height="64" title="gallery" border="0"></a></td><td><img src="images/gallery_06.jpg" width="23" height="64" alt=""></td><td><a href="discography.html"><img src="images/gallery_07.jpg" width="54" height="64" title="discography" border="0"></a></td><td><img 
src="images/gallery_08.jpg" width="22" height="64" alt=""></td><td><a href="ps.asp"><img src="images/gallery_09.jpg" width="56" height="64" title="personal statement" border="0"></a></td><td><img src="images/gallery_10.jpg" width="21" height="64" alt=""></td><td><a href="cv.asp"><img src="images/gallery_11.jpg" width="55" height="64" title="curriculum vitae" border="0"></a></td><td><img src="images/gallery_12.jpg" width="18" height="64" alt=""></td><td><a href="photos.html"><img src="images/gallery_13.jpg" width="56" height="64" title="photos" border="0"></a></td><td><img src="images/gallery_14.jpg" width="20" height="64" alt=""></td><td><a href="links.html"><img src="images/gallery_15.jpg" width="56" height="64" title="links" border="0"></a></td><td rowspan="2" background="images/gallery_16_tall.jpg"><img src="images/gallery_16.jpg" width="16" height="572" alt=""></td></tr><tr><td colspan="13" background="images/gallery_bg_stretched.jpg" valign="top"><table cellpadding="0" cellspacing="0" border="0" align="center"><!--show detail--><tr><td height="5" width="30"></td><td height="5" width="457"></td><td height="5" width="30"></td></tr><tr><td colspan="3" align="center" height="470" valign="middle"><a href="gallery.asp?start=1"><img src="images/paintings/tl-waiting_1.jpg" border="0"></a></td></tr><tr><td align="left"><!--<a href="gallery.asp?detail=0"><font face="Arial, Verdana" size="2"><br>back</font></a>--></td><td align="center"><a href="gallery.asp?start=1"><font face="Arial, Verdana" size="2"><br>return to gallery</font></a></td><td align="right"><!--<a href="gallery.asp?detail=2"><font face="Arial, Verdana" size="2"><br>next</font></a>--></td></tr></table> </td></tr><!--<tr> <td background="images/gallery_02.jpg"></td><td colspan="13" background="images/gallery_flipped_bg.jpg" align="center"> <a href="javascript:popup('bio_gallery.jpg',533,404);" alt=""><img src="images/discography/details_button.jpg" border="0"></a></td> <td 
background="images/gallery_17.jpg"></td></tr>--> <tr> <td colspan="15"> <img src="images/gallery_18.jpg" width="550" height="180" alt=""></td> </tr></table></body></html>
The only information we need is the src attribute of the innermost <img> tag:
<img src="images/paintings/tl-waiting_1.jpg" border="0">
The name of the image file can contain letters, digits, hyphens, and underscores. Following the tutorial at http://www.dotnetperls.com/regex-match, I arrived at this:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
// Fizzler
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;
// end
// Regular Expression
using System.Text.RegularExpressions;
// end

class Program
{
    static void Main(string[] args)
    {
        string imgWebDir = "http://www.torlundvall.com/images/paintings/";
        string httpURL = "http://www.torlundvall.com/gallery.asp?start=144&detail=";
        // The dot before "jpg" is escaped so it matches a literal '.' only.
        string imgSrcPattern = @"<img src=""images/paintings/([A-Za-z0-9\-_]+)\.jpg""";
        string imgExtensionName = ".jpg";
        // Range of detail values to visit. These bounds are placeholders:
        // the original listing used start and end without defining them.
        int start = 1;
        int end = 10;
        HtmlWeb web = new HtmlWeb();
        for (int i = start; i <= end; i++)
        {
            HtmlDocument document = web.Load(httpURL + i.ToString());
            HtmlNode page = document.DocumentNode;
            string markup = page.InnerHtml;
            Match match = Regex.Match(markup, imgSrcPattern, RegexOptions.IgnoreCase);
            if (match.Success)
            {
                // Group 1 is the file name without its extension.
                string fileName = match.Groups[1].Value;
                string fullImgURL = imgWebDir + fileName + imgExtensionName;
                //...
            }
        }
        // Keep the console window open once all pages have been processed.
        Console.ReadLine();
    }
}
References:
Regex.Match Method (String)
Escape quotes in a C# Regex pattern
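The pattern can also be tried on its own, without any network access, by running it against the <img> tag shown earlier. The sketch below does that; the names PaintingImageParser and ExtractFileName are hypothetical, and the pattern follows the same idea as above while escaping the dot before jpg so that it matches only a literal period.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PaintingImageParser
{
    // Captures the file name (letters, digits, hyphen, underscore)
    // between "images/paintings/" and ".jpg".
    private static readonly Regex ImgSrcPattern = new Regex(
        @"<img src=""images/paintings/([A-Za-z0-9\-_]+)\.jpg""",
        RegexOptions.IgnoreCase);

    // Returns the file name without its extension,
    // or null when the markup contains no painting image.
    public static string ExtractFileName(string markup)
    {
        Match match = ImgSrcPattern.Match(markup);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

Called with the tag <img src="images/paintings/tl-waiting_1.jpg" border="0">, ExtractFileName returns tl-waiting_1. The navigation images such as images/gallery_01.jpg are not matched, because they do not live under images/paintings/.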
Save Image
How to download image from url using c#
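using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
// Fizzler
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;
// end
// Regular Expression
using System.Text.RegularExpressions;
// end
// WebClient
using System.Net;
// end
// Directory
using System.IO;
// end

class Program
{
    static void Main(string[] args)
    {
        string imgWebDir = "http://www.torlundvall.com/images/paintings/";
        string httpURL = "http://www.torlundvall.com/gallery.asp?start=144&detail=";
        // The dot before "jpg" is escaped so it matches a literal '.' only.
        string imgSrcPattern = @"<img src=""images/paintings/([A-Za-z0-9\-_]+)\.jpg""";
        string imgExtensionName = ".jpg";
        string imgSaveDir = @"C:\torlundvall\";
        // Range of detail values to visit. These bounds are placeholders:
        // the original listing used start and end without defining them.
        int start = 1;
        int end = 10;
        // Make sure the target folder exists before any file is written to it.
        Directory.CreateDirectory(imgSaveDir);
        HtmlWeb web = new HtmlWeb();
        for (int i = start; i <= end; i++)
        {
            HtmlDocument document = web.Load(httpURL + i.ToString());
            HtmlNode page = document.DocumentNode;
            string markup = page.InnerHtml;
            Match match = Regex.Match(markup, imgSrcPattern, RegexOptions.IgnoreCase);
            if (match.Success)
            {
                // Group 1 is the file name without its extension.
                string fileName = match.Groups[1].Value;
                string fullImgURL = imgWebDir + fileName + imgExtensionName;
                using (WebClient webClient = new WebClient())
                {
                    // Download the image and save it under the same file name.
                    webClient.DownloadFile(fullImgURL, imgSaveDir + fileName + imgExtensionName);
                }
            }
        }
        // Keep the console window open once all pages have been processed.
        Console.ReadLine();
    }
}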
References:
WebClient.DownloadFile Method (Uri, String)
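The src value in the markup is relative to the page, and a file can only be saved into a folder that already exists. As a hedged alternative to concatenating strings, the sketch below resolves the relative src with System.Uri and prepares the local path with System.IO. The class ImagePaths and its two methods are illustrative names, not part of the original program.

```csharp
using System;
using System.IO;

public static class ImagePaths
{
    // Resolves a relative src value (e.g. "images/paintings/x.jpg")
    // against the URL of the page it was found on.
    public static Uri ResolveImageUri(string pageUrl, string src)
    {
        return new Uri(new Uri(pageUrl), src);
    }

    // Creates the target folder if it is missing and returns the full local
    // path, using the last segment of the image URI as the file name.
    public static string PrepareLocalPath(string saveDir, Uri imageUri)
    {
        Directory.CreateDirectory(saveDir);
        string fileName = Path.GetFileName(imageUri.LocalPath);
        return Path.Combine(saveDir, fileName);
    }
}
```

The two results map directly onto the WebClient.DownloadFile(Uri, String) overload listed in the references: the resolved Uri is the address and the prepared path is the file name. Resolving against the page URL also means the image folder does not have to be hard-coded separately.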