Hi all, last month I was given the job of mirroring a site. The site hosts a large number of images, which I
started saving by hand. That quickly got boring, so I thought of writing a downloader
for the site. A site can be taken offline with HTTrack Website Copier [http://www.httrack.com/], but I was curious enough to write my
own utility. Exploring the options, I found several classes as usual [HttpWebRequest,
HttpClient, WebClient]; I leave their differences and usage for the reader
to explore. Let’s look at the code snippets below.
I. Using HtmlAgilityPack [HTML parser written in .NET]:
1. Make a web request to the site’s images folder, and get the content:
string baseURL = "http://xyz.com/images"; // thank god, the images folder is open to the web :)
WebClient client = new WebClient();
string content = client.DownloadString(baseURL);
byte[] array = Encoding.ASCII.GetBytes(content);
2. Store the obtained content in a file:
string path = "xyzImagesFile.html";
// The using block closes the writer even if Write throws.
using (StreamWriter swObj = new StreamWriter(path))
{
    swObj.Write(content);
}
[Image: portion of the HTML file and its HTML view.]
3. Parse each node that contains an <a> element, then download the images it points to:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(path);
// Children of the <ul> element that holds the directory listing.
HtmlNodeCollection nodeCollection = doc.DocumentNode.Element("html").ChildNodes["body"].ChildNodes["ul"].ChildNodes;
// Create the client once and dispose it when all downloads are done.
using (WebClient fileDownloadObj = new WebClient())
{
    for (int i = 0; i < nodeCollection.Count; i++)
    {
        HtmlNode liNode = nodeCollection[i];
        if (liNode.Name == "li")
        {
            // The first child of each <li> is assumed to be the <a> link to the file.
            string fileName = liNode.ChildNodes[0].Attributes["href"].Value;
            string localPath = @"D:\Scrap\Images\" + fileName;
            fileDownloadObj.DownloadFile(baseURL + "/" + fileName, localPath);
        }
    }
}
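The loop above builds the remote URL and the local path by plain string concatenation. As a rough sketch of a slightly safer way to do the same thing, the helper below uses Uri and Path from the framework. The class name DownloadPaths, both method names, and the file name cat.jpg are made up for this example and are not part of the original utility.

```csharp
using System;
using System.IO;

public static class DownloadPaths
{
    // Resolves an href taken from the listing against the folder URL.
    public static Uri ResolveFileUrl(string folderUrl, string href)
    {
        // Without a trailing slash, Uri would treat the last segment as a file and replace it.
        string normalized = folderUrl.EndsWith("/", StringComparison.Ordinal)
            ? folderUrl
            : folderUrl + "/";
        return new Uri(new Uri(normalized), href);
    }

    // Builds the local target path, keeping only the file name part of the URL.
    public static string BuildLocalPath(string localFolder, Uri fileUrl)
    {
        string fileName = Path.GetFileName(fileUrl.LocalPath);
        return Path.Combine(localFolder, fileName);
    }

    public static void Main()
    {
        // "cat.jpg" is a hypothetical entry from the listing.
        Uri fileUrl = ResolveFileUrl("http://xyz.com/images", "cat.jpg");
        Console.WriteLine(fileUrl); // http://xyz.com/images/cat.jpg
        Console.WriteLine(BuildLocalPath(@"D:\Scrap\Images", fileUrl));
    }
}
```

Taking only the file name part of the URL also keeps an odd href (one that contains extra folders, say) from writing outside the target folder.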
II. Using Regular Expression:
The first request, fetching the .html listing, is the same here, but we don’t need to save
and parse the file. Instead we can extract the exact nodes (i.e., <a>) directly through regex
matching, so step 2 of the previous approach is not needed at all.
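// remoteUri is assumed to be a static string field holding the URL of the images folder.
public static string GetDirectoryListingRegexForUrl(string url)
{
    if (url.Equals(remoteUri))
    {
        // Lazy matching keeps each match inside a single <a> element.
        return "<a href=\"[^\"]*\">(?<name>.*?)</a>";
    }
    throw new NotSupportedException();
}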
public static void Main(String[] args)
{
    string url = remoteUri;
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    using (WebClient client = new WebClient())
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (StreamReader reader = new StreamReader(response.GetResponseStream()))
    {
        string html = reader.ReadToEnd();
        Regex regex = new Regex(GetDirectoryListingRegexForUrl(url));
        foreach (Match match in regex.Matches(html))
        {
            // The link text is taken to be the file name.
            string filename = match.Groups["name"].ToString();
            Console.WriteLine(filename);
            // Download the file itself, not the listing page.
            client.DownloadFile(remoteUri + "/" + filename, @"D:\Scrap\Images\" + filename);
        }
    }
}
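To see what the named group actually captures, here is a small sketch that runs the same kind of pattern against an in-memory string instead of a live site. The class name ListingRegexDemo, the method ExtractNames, the extra href group, and the sample markup are all made up for this example; a real directory listing may be laid out differently.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ListingRegexDemo
{
    // [^"]* and .*? stop at the first closing quote and the first </a>,
    // so each match stays inside one <a> element.
    private static readonly Regex LinkRegex =
        new Regex("<a href=\"(?<href>[^\"]*)\">(?<name>.*?)</a>", RegexOptions.IgnoreCase);

    // Returns the link text of every <a> element found in the markup.
    public static List<string> ExtractNames(string html)
    {
        var names = new List<string>();
        foreach (Match match in LinkRegex.Matches(html))
        {
            names.Add(match.Groups["name"].Value);
        }
        return names;
    }

    public static void Main()
    {
        // Hypothetical listing markup with two links on a single line.
        string sample = "<ul><li><a href=\"cat.jpg\">cat.jpg</a></li>"
                      + "<li><a href=\"dog.png\">dog.png</a></li></ul>";
        foreach (string name in ExtractNames(sample))
        {
            Console.WriteLine(name); // prints cat.jpg, then dog.png
        }
    }
}
```

The lazy quantifiers matter when several links share a line: a greedy .* would run from the first link to the last one and report a single match.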
Undoubtedly, the second approach is the handier one. Most of us are not familiar with (advanced) regular expressions; learn them and apply them, and I’m sure this tool will make you a real pro.
Happy Coding :)
wget -r -A jpg,png http://website.com
Do not reinvent the wheel.
But the title is meant to make people aware of the various tools and the power of regex. Anyhow, thanks for sharing, and for posting the first comment on my blog :)