Welcome To The Crunch Manifesto!
In this blog, I intend to document my design notes and rationale for a toolset, to be called Crunch, that I intend to create. The Crunch toolset will be an extensible framework for "crunching" raw web pages into Atom feeds.
I have just started using the syndication feed functionality built into Firefox, along with extensions such as Sage, and find it a great way to quickly identify content that might be interesting to read. In particular, I find Firefox's "Live Bookmarks" functionality very convenient.
However, I was somewhat disappointed to discover that some of my favourite websites either don't provide syndication feeds (e.g. CounterPunch and Common Dreams) or don't do it well (e.g. The Sydney Morning Herald).
Well stuff 'em, I thought, I'll create my own Atom feeds from the eminently usable information that already exists encoded in HTML on their existing web pages.
So, a few days ago I created a prototype for CounterPunch which you can read more about and use here.
Then, I thought, heck, if I can do this for CounterPunch, why not create a framework that allows arbitrary HTML pages to become Atom feeds? Once I have a toolset like that, I can even "fix" sites with underdone feeds.
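To make the idea concrete, here is a minimal sketch of what "crunching" might look like: scrape headline links out of raw HTML and emit them as an Atom feed. This is not the Crunch design itself, just an illustration; the sample HTML and the link-scraping heuristic are hypothetical stand-ins for a real site's markup and a real extraction rule.

```python
# A toy "cruncher": pull (href, title) pairs out of an HTML page and
# serialize them as Atom entries. Uses only the Python standard library.
from html.parser import HTMLParser
from xml.etree import ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

class HeadlineParser(HTMLParser):
    """Collect (href, link text) pairs from every <a> tag in the page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def crunch(html, feed_title):
    """Turn the links scraped from `html` into an Atom feed document."""
    parser = HeadlineParser()
    parser.feed(html)
    ET.register_namespace("", ATOM_NS)
    feed = ET.Element("{%s}feed" % ATOM_NS)
    ET.SubElement(feed, "{%s}title" % ATOM_NS).text = feed_title
    for href, title in parser.links:
        entry = ET.SubElement(feed, "{%s}entry" % ATOM_NS)
        ET.SubElement(entry, "{%s}title" % ATOM_NS).text = title
        ET.SubElement(entry, "{%s}link" % ATOM_NS, href=href)
    return ET.tostring(feed, encoding="unicode")

# Hypothetical fragment of a site's index page.
sample = ('<div><a href="/story1.html">First story</a>'
          '<a href="/story2.html">Second story</a></div>')
print(crunch(sample, "Crunched Headlines"))
```

A real framework would of course need per-site extraction rules (real pages bury their headlines among navigation links), plus dates, ids, and the other Atom metadata, but the basic shape, scrape then re-serialize, is as simple as this.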
The Sydney Morning Herald's website is a good example of a site that provides close to useless RSS feeds. The feeds they provide cover only a vanishingly small subset of their content and are updated very infrequently. For example, they have a breaking news page, but they don't have a breaking news feed. Why not? Well, one reason is that they don't want to lose control of the reader's browsing experience, admittedly a danger that extensive use of syndication technology presents. Still, that's their problem, not mine. Other publications, such as the Guardian, provide good quality feeds, so why can't the Sydney Morning Herald? I want a convenient news reading experience and, damn it, I shall have one!
The concept of extracting an Atom feed from an arbitrary web page is somewhat subversive. If I can extract an index from an existing page, why not the content? Once I have extracted the raw content, I can read it, styled to my preference, without all the annoying ads. Content providers are not going to like that. But then, what right do they have to specify what kind of browser I use to read their material? Not a very strong one, I think.
However, the content providers have one very simple way to defeat technology such as this: provide good quality feeds themselves. If they do that, they can guide readers back to their own sites and thus their own advertisers. Content providers could, if they were bothered, provide personalized feeds with well-abstracted summaries. Abstracting a summary is something their journalists can do much better than some automatic extraction tool. They could also offer server-side aggregation and filtering of their own content on a per-reader basis rather than, as the SMH does, providing a large array of generic feeds that the user is then forced to aggregate in some way.
So, one of the things the Crunch Manifesto is about is forcing publishers to pay more than lip service to the syndication meme: "Do it well for us, or we will do it for you. And remember, if we do it for you, you may not like the results."
Straw Man Architecture
This is a work in progress...I'll update it as I get a chance.