Wednesday, October 04, 2006

User-centric Screen Scraping

Dapper is cool. It provides an interface and service with which you can build your own screen scraper (they don't call it that) for any arbitrary site and then output the scraped content in various formats (HTML, custom XML, RSS, iCal, etc.).

So it's an end-run around sites that don't themselves provide APIs. You use Dapper to scrape their pages and effectively create your own API. The key difference is that you define what gets scraped rather than somebody else.

As an example, I created a 'Dapp' from this blog, using the Dapper interface to specify that I wanted only the post titles and the metadata that follows each one (posted by, number of comments, etc.) scraped. I chose XML as the output format and the result is here. No programming was necessary (although some would be required in order to actually use the 'feed').
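To give an idea of how little programming the 'feed' would need, here is a minimal Python sketch of consuming such XML output. The element names (`posts`, `post`, `title`, `posted_by`, `comments`) and the sample data are my own assumptions for illustration, not Dapper's actual schema, so they would need adjusting to whatever the Dapp really emits.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of what a Dapp's XML output might look like.
# The element names are assumptions, not Dapper's real format.
SAMPLE_FEED = """\
<posts>
  <post>
    <title>User-centric Screen Scraping</title>
    <posted_by>example-author</posted_by>
    <comments>0</comments>
  </post>
  <post>
    <title>Another Post</title>
    <posted_by>example-author</posted_by>
    <comments>3</comments>
  </post>
</posts>
"""


def parse_posts(xml_text):
    """Turn the (assumed) feed XML into a list of plain dicts."""
    root = ET.fromstring(xml_text)
    posts = []
    for post in root.findall("post"):
        posts.append({
            "title": post.findtext("title", default="").strip(),
            "posted_by": post.findtext("posted_by", default="").strip(),
            "comments": int(post.findtext("comments", default="0")),
        })
    return posts


if __name__ == "__main__":
    for entry in parse_posts(SAMPLE_FEED):
        print(entry["title"], "-", entry["comments"], "comments")
```

In practice the XML would be fetched from the Dapp's URL rather than embedded as a string, but the parsing step would look much the same.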

Like all screen scraping, it's fragile - as soon as I change the blog template, the Dapp could break. I also wonder how long the window of opportunity will remain open - if Blogger had an API that gave me (or others) control over which data was relevant, then the need for Dapper would disappear.
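The fragility is easy to demonstrate with a hand-rolled scraper. The sketch below is not how Dapper works internally (I don't know that); it is just a small Python illustration in which the tag and class names (`h3`/`post-title`, `h2`/`entry-title`) and both HTML snippets are made up. The scraper is tied to one template's markup, so a template change makes it silently return nothing.

```python
from html.parser import HTMLParser


class TitleScraper(HTMLParser):
    """Collects the text of every <tag class="css_class"> element."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == self.tag:
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data


def scrape_titles(html, tag="h3", css_class="post-title"):
    """Return post titles, assuming the template marks them up as given."""
    parser = TitleScraper(tag, css_class)
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


# Hypothetical markup before and after a template change.
OLD_TEMPLATE = '<div><h3 class="post-title">User-centric Screen Scraping</h3><p>body</p></div>'
NEW_TEMPLATE = '<article><h2 class="entry-title">User-centric Screen Scraping</h2><p>body</p></article>'

if __name__ == "__main__":
    print(scrape_titles(OLD_TEMPLATE))  # finds the title
    print(scrape_titles(NEW_TEMPLATE))  # finds nothing: the scraper is broken
```

Note that the second call doesn't raise an error; it just comes back empty, which is arguably the worst way for a scraper to fail.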
