Friday, July 8, 2011

Screen scraping for non-programmers

One of my favorite tools that I made for myself now lives at http://netreputation.co.uk/extractor/

I don't believe in hand-coding a complex regex every time I have to scrape something off a web page. When I got tired of doing that, I noticed a pattern in how I detected patterns, and coded myself a simple tool to make life easier. I also wanted it to be simple enough that I could ask anybody to come up with the patterns for a website.

This is for programmers who are not well versed in Regex or even non-programmers who want to get some rudimentary scraping done. Mind you, it is also a lot easier to read than Regex as most of the HTML is preserved in the pattern.

The learning curve is very small. Here's how to use it:

Step 1: Let's say you want to make an RSS feed out of this site: http://news.ycombinator.com/

Step 2: Click on View Source in your browser

Step 3: If you look at the source, you will see that the titles we want are nestled in this piece of HTML:

<td class="title"><a href="http://blog.pinboard.in/2011/07/two_years_of_pinboard/">Two Years of Pinboard</a><span class="comhead"> (blog.pinboard.in) </span></td>

Step 4: Replace the "content" parts of the HTML with variables like this. (Since we want an RSS feed, we will use the standard {title}, {link} and {description} as variables.)

<td class="title"><a href="{link}">{title}</a><span class="comhead">{description}</span></td>

Step 5: Click on make RSS feed. You can now use the resulting URL in your RSS reader.
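
For the curious, here is roughly how a pattern like the one in Step 4 could be turned into a regex behind the scenes. This is only a minimal sketch in Python, not the actual code behind the extractor. The function names template_to_regex and extract are made up for illustration, and the sketch assumes each variable name appears only once per pattern.

```python
import re

# Matches a {variable} placeholder and captures its name.
PLACEHOLDER = re.compile(r"\{([A-Za-z_]\w*)\}")


def template_to_regex(template):
    """Compile an HTML pattern with {variables} into a regex (hypothetical helper)."""
    # Splitting on a capturing group yields: literal, name, literal, name, ..., literal
    parts = PLACEHOLDER.split(template)
    pieces = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            # Literal HTML from the pattern: match it exactly.
            pieces.append(re.escape(part))
        else:
            # A variable: capture as little text as possible under that name.
            pieces.append("(?P<%s>.*?)" % part)
    return re.compile("".join(pieces), re.DOTALL)


def extract(template, html):
    """Return one dict of variable values per match of the pattern in the HTML."""
    regex = template_to_regex(template)
    return [match.groupdict() for match in regex.finditer(html)]


if __name__ == "__main__":
    html = (
        '<td class="title"><a href="http://blog.pinboard.in/2011/07/'
        'two_years_of_pinboard/">Two Years of Pinboard</a>'
        '<span class="comhead"> (blog.pinboard.in) </span></td>'
    )
    template = (
        '<td class="title"><a href="{link}">{title}</a>'
        '<span class="comhead">{description}</span></td>'
    )
    for row in extract(template, html):
        print(row["title"], row["link"])
```

Because the literal HTML is escaped and matched exactly, the pattern only works when the page source matches it character for character, which is why it matters which HTML you copy from.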

Note: If it does not work, it could be that your browser receives HTML that is different from what my web server receives. The web server simulates Internet Explorer, as most sites are customized for it, so try getting the HTML out of IE instead.
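
If you want to see the HTML a site serves to Internet Explorer without opening IE, you can send an IE-style User-Agent header yourself. The sketch below uses Python's standard urllib. The exact User-Agent string is my assumption (an IE 8 style string), not necessarily the one the extractor sends, and build_request and fetch_html are hypothetical helper names.

```python
from urllib.request import Request, urlopen

# Assumed IE 8 style User-Agent string; the extractor may send a different one.
IE_USER_AGENT = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)"


def build_request(url, user_agent=IE_USER_AGENT):
    """Build a request that identifies itself with the given User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Download the page as text, pretending to be Internet Explorer."""
    with urlopen(build_request(url), timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    # Performs a real network request when run directly.
    print(fetch_html("http://news.ycombinator.com/")[:500])
```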

If you don't want a variable's content in your resulting XML, just use the word dummy in any part of the variable name, like this: {dummy123}. This is especially useful for making RSS feeds, which would not accept anything other than the standard node names.
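
To make the dummy rule concrete, here is a small Python sketch of how extracted values might be filtered and written out as RSS items. Again, this is an illustration under my own assumptions rather than the extractor's real code: drop_dummies and rows_to_rss are made-up names, and the feed-level title and link are passed in by hand.

```python
import xml.etree.ElementTree as ET


def drop_dummies(row):
    """Remove every variable whose name contains the word 'dummy'."""
    return {name: value for name, value in row.items() if "dummy" not in name}


def rows_to_rss(rows, feed_title, feed_link):
    """Build a minimal RSS 2.0 document with one <item> per extracted row."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = feed_title
    for row in rows:
        item = ET.SubElement(channel, "item")
        # Each remaining variable name becomes a node name inside <item>.
        for name, value in drop_dummies(row).items():
            ET.SubElement(item, name).text = value.strip()
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    rows = [
        {
            "link": "http://blog.pinboard.in/2011/07/two_years_of_pinboard/",
            "title": "Two Years of Pinboard",
            "dummy123": " (blog.pinboard.in) ",
        }
    ]
    print(rows_to_rss(rows, "Hacker News", "http://news.ycombinator.com/"))
```

The variable names go straight into the XML as node names, so anything that is not a standard RSS node should be named with dummy to keep it out of the feed.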
