Clustered river of news
RSS readers have, over time, become fairly full-featured software in their own right. Most now provide the standard set of features (OPML import/export, categories, river of news and search) whether they run online or offline, and I have grown used to depending on my reader of choice, Google Reader, for reading my feeds.
That said, there is one feature I’d really love to have in my RSS reader: clustering of feed items as an additional way to categorise them, alongside the current methods of categories and tags. Think of it as a cross between your RSS reader and Google News/Techmeme. Would it not be nice to have your own little personal Google News or Techmeme, built from the sources that you have picked, rather than be led by what Gabe or the kind folks at Google News may have seeded their websites with?
There are, though, a couple of problems that could make this impossible, followed by one possible way around them:
Processing: Any algorithm that finds similarities in text is computationally intensive, even when the data set is limited. Scaling such an algorithm is usually feasible when the size of the data set is reasonably fixed, but RSS subscription lists vary widely in size, so it would be a royal pain to find an algorithm that scales effectively and efficiently.
Entropy: Traditional similarity-matching approaches work best when the texts cover a similar domain, so that an apple would mean apple the fruit rather than Apple the company. The entropy in the data set needs to be reasonably low for the algorithm to function well, and learning systems also need to be taught with training data, which may not be possible in this case.
Link Match: What we are then left with is to attack the problem purely by tracking outgoing links. Thankfully, this is far less computationally intensive than pure text analysis. The accuracy and utility of this approach may not be stunning, but it would certainly be good enough for the immediate purpose: a reasonable way of classifying what my subscription list is talking about.
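To make the link-match idea a little more concrete, here is a rough Python sketch of how it might work. It is only one possible reading of the approach, not a tested design. The names (`LinkExtractor`, `normalize`, `cluster_by_links`) and the `min_size` parameter are all made up for illustration, and the URL clean-up rules (lowercase the host, drop the fragment and any trailing slash) are my own assumption about what counts as "the same link".

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Assumed clean-up: lowercase the host, drop the fragment and trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


class LinkExtractor(HTMLParser):
    """Collects the normalized href of every <a> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = set()  # a set, so a link repeated within one post counts once

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                link = normalize(value)
                # Keep only web links; skip mailto:, relative paths and the like.
                if link.startswith(("http://", "https://")):
                    self.links.add(link)


def outgoing_links(html):
    """Return the set of outgoing links found in one post body."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


def cluster_by_links(items, min_size=2):
    """Group posts by the links they share.

    items is an iterable of (title, html) pairs. The result is a list of
    (url, [titles]) tuples, with the most linked-to URL first.
    """
    by_link = defaultdict(list)
    for title, html in items:
        for link in outgoing_links(html):
            by_link[link].append(title)
    clusters = [(url, titles) for url, titles in by_link.items() if len(titles) >= min_size]
    clusters.sort(key=lambda cluster: (-len(cluster[1]), cluster[0]))
    return clusters


if __name__ == "__main__":
    posts = [
        ("Post A", '<p>See <a href="http://Example.com/story/">this</a>.</p>'),
        ("Post B", '<a href="http://example.com/story#comments">same story</a>'),
        ("Post C", '<a href="http://other.org/unrelated">something else</a>'),
    ]
    for url, titles in cluster_by_links(posts):
        print(url, titles)
```

The work here is one pass over each post plus a dictionary lookup per link, which is why this feels so much cheaper than comparing the text of every post against every other. The obvious weakness is that two posts only match if they point at literally the same address after clean-up, so shortened or redirected URLs would slip through unless they were resolved first.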