Thursday, 26 September 2013

Image Downloader, HtmlAgilityPack & Power of Regular Expressions

Hi all, last month I was given the job of mirroring a site. The site hosts a large number of images, which I started saving by hand; the work soon got boring, so I thought of writing a downloader for it. A site can of course be taken offline with HTTrack Website Copier [http://www.httrack.com/], but I was curious enough to write my own utility. Exploring the options, I found, as usual, several classes [HttpWebRequest, HttpClient, WebClient]; I leave it to the reader to look into their differences and usage. Let's see the code snippets below.

I. Using HtmlAgilityPack [html parser written in .NET]:

1. Make a web request to the site's images folder and get the content:
string baseURL = "http://xyz.com/images";  // thank god, the images folder is open to the web :)
WebClient client = new WebClient();
string content = client.DownloadString(baseURL);
byte[] array = Encoding.ASCII.GetBytes(content);

2. Store the obtained content in a file:
string path = "xyzImagesFile.html";
// The using block flushes and closes the writer even if Write throws.
using (StreamWriter swObj = new StreamWriter(path))
{
    swObj.Write(content);
}


[Image: portion of the HTML file and its HTML source]

3. Parse each <li> node, read the href of the <a> it contains, and download the image it points to.

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(path);
// Assumes the listing is laid out as <html><body><ul><li><a href="...">.
HtmlNodeCollection nodeCollection = doc.DocumentNode.Element("html").ChildNodes["body"].ChildNodes["ul"].ChildNodes;
// The WebClient must be created before use; the using block disposes it
// (WebClient has no Close() method).
using (WebClient fileDownloadObj = new WebClient())
{
    for (int i = 0; i < nodeCollection.Count; i++)
    {
        HtmlNode liNode = nodeCollection[i];
        if (liNode.Name == "li")
        {
            // The first child of each <li> is expected to be the <a> link.
            string fileName = liNode.ChildNodes[0].Attributes["href"].Value;
            string localPath = @"D:\Scrap\Images\" + fileName;
            fileDownloadObj.DownloadFile(baseURL + "/" + fileName, localPath);
        }
    }
}

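Joining the folder URL and the href with "/" works for plain file names. As a minimal sketch, assuming hrefs may also be relative paths or contain escaped characters, System.Uri can do the resolving. ImageUrlHelper, Resolve and LocalName are hypothetical names used only for illustration.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the original utility.
static class ImageUrlHelper
{
    // Resolves an href taken from the listing against the folder URL.
    public static Uri Resolve(string folderUrl, string href)
    {
        // A trailing slash is needed; without it the last path segment
        // ("images") would be replaced instead of extended.
        if (!folderUrl.EndsWith("/"))
        {
            folderUrl += "/";
        }
        return new Uri(new Uri(folderUrl), href);
    }

    // Derives a local file name from the last segment of the resolved URL.
    public static string LocalName(Uri resolved)
    {
        return Path.GetFileName(resolved.LocalPath);
    }
}
```

For example, ImageUrlHelper.Resolve("http://xyz.com/images", "cat.png") gives http://xyz.com/images/cat.png, and LocalName on that result gives cat.png.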
II. Using Regular Expression:

Here the first request, fetching the .html listing, is the same, but we don't need to save the file and parse it; we can extract the exact nodes (i.e., <a>) through regex matching. Hence the 2nd step of the previous approach is not needed at all.
// Requires: using System; using System.IO; using System.Net; using System.Text.RegularExpressions;
static string remoteUri = "http://xyz.com/images";

// Returns the regex that matches the links in the directory listing of a known URL.
public static string GetDirectoryListingRegexForUrl(string url)
{
    if (url.Equals(remoteUri))
    {
        // Non-greedy quantifiers, so each match covers exactly one <a> element.
        return "<a href=\"[^\"]*\">(?<name>.*?)</a>";
    }
    throw new NotSupportedException();
}

public static void Main(String[] args)
{
    string url = remoteUri;
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    using (WebClient client = new WebClient())
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (StreamReader reader = new StreamReader(response.GetResponseStream()))
    {
        string html = reader.ReadToEnd();
        Regex regex = new Regex(GetDirectoryListingRegexForUrl(url));
        foreach (Match match in regex.Matches(html))
        {
            string filename = match.Groups["name"].Value;
            Console.WriteLine(filename);
            // Download the file itself, not the listing page again.
            client.DownloadFile(remoteUri + "/" + filename, @"D:\Scrap\Images\" + filename);
        }
    }
}
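
One thing to watch with directory-listing patterns is greediness: "." does not cross line breaks, so a greedy ".*" behaves when every link sits on its own line, but it can swallow several links when they share one line. Below is a minimal sketch on an in-memory string, assuming the listing uses plain <a href="...">name</a> links. ListingRegexDemo and ExtractNames are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original utility.
static class ListingRegexDemo
{
    // [^"]* stops at the closing quote; .*? stops at the first </a>.
    const string Pattern = "<a href=\"(?<href>[^\"]*)\">(?<name>.*?)</a>";

    // Returns the link text of every <a> element found in the html string.
    public static List<string> ExtractNames(string html)
    {
        var names = new List<string>();
        foreach (Match m in Regex.Matches(html, Pattern))
        {
            names.Add(m.Groups["name"].Value);
        }
        return names;
    }
}
```

Given the single-line input <li><a href="a.png">a.png</a></li><li><a href="b.png">b.png</a></li>, ExtractNames returns a.png and b.png as two separate entries. A real listing may also contain a link such as a parent-directory entry, which would need to be filtered out before downloading.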

Undoubtedly, the second approach is the handier one. Most of us aren't familiar with (advanced) regular expressions, so learn them and apply them; I'm sure this tool will make you a real pro.


Happy Coding :)