Archive for the ‘Nutch’ Category
“Blogging Persai” is the title of the blog run by the Persai guys. If you needed an indication of how this post is going to proceed, a major hint would be that I was sorely tempted to give the title “Flogging Persai” to it. For a bunch of guys who have been extremely trigger happy during their Uncov.com days to stamp almost everything with the dreaded “FAIL,” it is rather interesting that their own product is nothing short of a half-baked proof of concept that has been cobbled together for reasons that don’t go beyond, well, the fact that it can be done.
Persai, according to the founders, is an ad-supported content recommendation system. Over time, the guys have crawled a truckload of RSS feeds (there used to be a blog entry which said as much; it is no longer on the blog, but Sam Ruby has the list here), indexed and classified them, and this in turn powers the recommendation system. You can subscribe to “interests” (known as keywords for the rest of humanity) and get sources thrown at you which the system thinks are relevant to you. While you can’t do much else with the sources, since Persai does not have a built-in feed reader, you can reject sources. And that is all there is to see about Persai. Well, at least for now.
Use Case: Recommendation systems have not traditionally fared too well on the internet. Previous players like Greg Linden’s Findory did a lot more than Persai does even today, and still did not do well at all. In fact, Findory, rather sadly, shut shop recently. The only recommendation system that works (and it does so in a stealthy manner) is Google News, and it works because it does not blatantly involve you in the recommendation process.
Once you find content on Persai, there is not much to do with it. Fulfillment is a term that is at best very vague on Persai. You can, as they claim, track the topics, but those links lead out of the website anyway. Individual interests have RSS feeds that you can subscribe to, but you can already do that with Google News Alerts and other products. I doubt anyone is going to use Persai just to have search-term-driven RSS feeds.
Accuracy: The approach that Persai has taken to classification involves the use of training data. This approach works well on data sets similar to the training data, but the moment you deviate from that similarity, the entropy will be of a magnitude that sends the classifier on a wild goose chase. As expected, this has an adverse impact on the accuracy of the results. For instance, one of my interests, “mameo”, throws back results of which the first five have nothing to do with Mameo. I could, of course, reject these sources and help improve Persai, but why would anyone do that when there are other avenues that provide much more accurate results?
Speed: To do classification, Persai is already using Hadoop’s MapReduce. MapReduce does an amazing job of processing huge chunks of data in a distributed manner (freshly crawled data to be indexed and classified, in this case), but it may only help Persai to a certain extent. The reason is simple: if they process interests as unique to each user, it just won’t scale up. There will be numerous threads doing classification for the same interests, since each user's copy is treated as unique.
And if the interests are not tracked as a unique item per user, it can play havoc with the results with different users rejecting different sources for different reasons. Of course, there are workarounds for it by using a mix of both approaches (classify as non-unique, filter on display by excluding user-specific rejection criteria), but in the end it ends up being a hack.
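The mixed approach described above can be sketched roughly as follows. This is only an illustration in Python, not Persai's actual implementation: the names `shared_results`, `rejections`, `reject` and `results_for` are hypothetical, the data is made up, and classification is assumed to have already produced one shared ranked list per interest.

```python
# Hypothetical sketch: classify once per interest, filter per user on display.
from collections import defaultdict

# Shared, non-unique classification output: interest -> ranked list of sources.
# (Made-up data; in practice this would come from the batch classification job.)
shared_results = {
    "nutch": ["feed-a", "feed-b", "feed-c"],
}

# Per-user rejections: (user, interest) -> set of rejected sources.
rejections = defaultdict(set)


def reject(user, interest, source):
    """Record that a user rejected a source for one interest."""
    rejections[(user, interest)].add(source)


def results_for(user, interest):
    """Return the shared ranking minus whatever this user has rejected."""
    rejected = rejections.get((user, interest), set())
    return [s for s in shared_results.get(interest, []) if s not in rejected]


reject("alice", "nutch", "feed-b")
print(results_for("alice", "nutch"))  # alice no longer sees feed-b
print(results_for("bob", "nutch"))    # bob still sees the shared ranking
```

One user's rejection never touches the shared ranking here, which is exactly why it feels like a hack: the rejections do nothing to improve the classifier itself.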
In any case, the approach results in tremendously outdated results. Some of the interests have really old articles on top. This could also be due to the fact that the sources are manually added into the system, which means that the quality and spread of the sources will depend on the bias of the person selecting them. Moreover, it is another issue that sites without RSS feeds will not be able to get into Persai.
Splogs: Possibly the group that will be over the moon about Persai would be the thugs who run splogs. With Persai it becomes ridiculously easy to set up automated blogs based on topics and, honestly, I see more people using Persai for this than anything else.
Considering that Persai is still in beta, I would not give it the “FAIL” rating, but I would certainly give it the “FRAIL” rating. I hope it becomes much better by the time it comes out of private beta.
Updates have been few and far between here due to the same old reasons: life being mostly all work and very little play. There are a couple of pretty interesting developments that have been in the works; I will write more about them if and when they work out. Meanwhile, in the technology sphere, other than the recent and continuing dalliance with Lucene, Nutch and crawling the tubes, there is one bit of technology, RDF and the semantic web, that’s been taking more and more of my thought cycles.
Firstly, I will readily admit to not yet understanding the core concepts (triples and the subject, object and predicate soup) to the required level of finesse, but I am trying hard to implement a subset of it in regular, existing applications, so that data can describe itself and generate multiple views that would otherwise be pretty much impossible to produce. This is also necessary because most of the tools in the RDF space, including application frameworks and data browsers, are far from being scalable or stable enough at this point. Besides, paradigm shifts are best left out when they are based on technology that is still evolving.
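To make the triples idea a little more concrete, here is a minimal sketch in plain Python. It is not RDF proper and uses no RDF library: the identifiers (`doc:1`, `title`, `category`), the sample values and the helper names are all made up for illustration. The point is only that the same list of (subject, predicate, object) statements can be turned into more than one view.

```python
# Hypothetical sketch: data stored as (subject, predicate, object) triples.
# All identifiers and values below are invented sample data.
triples = [
    ("doc:1", "title", "First"),
    ("doc:1", "category", "Nutch"),
    ("doc:2", "title", "Second"),
    ("doc:2", "category", "RDF"),
]


def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (ts, tp, to)
        for (ts, tp, to) in triples
        if (s is None or ts == s)
        and (p is None or tp == p)
        and (o is None or to == o)
    ]


def view_by_subject(subject):
    """One view: everything known about a single subject."""
    return {p: o for (_, p, o) in match(s=subject)}


def view_by_predicate(predicate):
    """Another view over the same data: subject -> value for one predicate."""
    return {s: o for (s, _, o) in match(p=predicate)}


print(view_by_subject("doc:1"))
print(view_by_predicate("category"))
```

Neither view needed a schema change; adding a new predicate is just appending more triples.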
Over the past week I’ve been putting the finishing touches on version one of an internal API and remodeled it to be REST-compliant rather than use XML-RPC as the earlier one did. It took me a while to wrap my head around URIs being used as unique identifiers and other concepts, but you can call me a convert now, after having seen the positives, though I will admit that getting the URIs right is one hell of a bitch.
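For anyone who has not made the jump yet, the difference can be sketched like this. With XML-RPC, every call is posted to a single endpoint and the method name travels inside the request body; in a REST-style design, the URI identifies the resource and the HTTP method says what to do with it. The snippet below is a toy dispatcher, not the internal API itself: the `/documents/<id>` URI layout, the `documents` store and the handler names are assumptions made up for the example.

```python
# Hypothetical sketch: URIs identify resources, HTTP methods act on them.
import re

# In-memory stand-in for a real backend (made-up data).
documents = {"42": {"title": "example"}}


def get_document(doc_id):
    """Return the document stored under doc_id, or None if it is missing."""
    return documents.get(doc_id)


def delete_document(doc_id):
    """Remove the document; return True if something was actually deleted."""
    return documents.pop(doc_id, None) is not None


# The same URI pattern appears twice; only the HTTP method differs.
DOCUMENT_URI = re.compile(r"^/documents/(?P<doc_id>[^/]+)$")
ROUTES = [
    ("GET", DOCUMENT_URI, get_document),
    ("DELETE", DOCUMENT_URI, delete_document),
]


def dispatch(method, uri):
    """Find the handler for (method, uri) and call it with the URI parts."""
    for route_method, pattern, handler in ROUTES:
        found = pattern.match(uri)
        if route_method == method and found:
            return handler(**found.groupdict())
    raise LookupError(f"no route for {method} {uri}")


print(dispatch("GET", "/documents/42"))
```

Most of the pain is in deciding what the patterns should look like, since every client ends up depending on them.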