Archive for the ‘Java’ Category
“Blogging Persai” is the title of the blog run by the Persai guys. If you needed an indication of how this post is going to proceed, a major hint would be that I was sorely tempted to title it “Flogging Persai.” For a bunch of guys who were extremely trigger-happy in their Uncov.com days, stamping almost everything with the dreaded “FAIL,” it is rather interesting that their own product is nothing short of a half-baked proof of concept, cobbled together for reasons that don’t go beyond, well, the fact that it can be done.
Persai, according to the founders, is an ad-supported content recommendation system. Over time, the guys have crawled a truckload of RSS feeds (there used to be a blog entry which said as much, but it is not on the blog anymore; Sam Ruby has the list here), indexed and classified them, and this in turn powers the recommendation system. You can subscribe to “interests” (known as keywords to the rest of humanity) and get sources thrown at you which the system thinks are relevant to you. While you can’t do much else with the sources, since Persai does not have a built-in feed reader, you can reject sources. And that is all there is to see about Persai. Well, at least for now.
Use Case: Recommendation systems have not traditionally fared too well on the internet. Previous players like Greg Linden’s Findory used to do a lot more than what Persai does even today and still did not do well at all. In fact, Findory, rather sadly, shut shop recently. The only recommendation system that does work is Google News, and it works stealthily, precisely because it doesn’t blatantly involve you in the recommendation process.
Once you find content on Persai, there is not much to do with it. Fulfillment is a term that is at best very vague on Persai. You can, as they claim, track the topics, but those links lead out of the website anyway. Individual interests have RSS feeds that you can subscribe to, but you can already do that with Google News Alerts and other products. I doubt that anyone is going to use Persai just to have search-term-driven RSS feeds.
Accuracy: The approach that Persai has taken to classification involves the use of training data. This works well on data sets similar to the training set, but the moment you deviate from that similarity, the entropy is of a magnitude that sends the classifier on a wild goose chase. And as expected, this has an adverse impact on the accuracy of the results. For instance, one of my interests — “mameo” — throws back results at me, the first five of which have nothing to do with Mameo at all. I could, of course, reject these sources and help improve Persai, but why would anyone do that when there are other avenues that provide much more accurate results?
Speed: To do classification, Persai is already using Hadoop’s MapReduce. MapReduce does an amazing job of processing huge chunks of data in a distributed fashion (freshly crawled data to be indexed and classified, in this case), but it may only help Persai to a certain extent. The reason is simple: if they process interests as unique to each user, it just won’t scale. There will be numerous threads doing classification for what is effectively the same interest, simply because each copy is tracked as unique.
And if interests are not tracked as unique per user, different users rejecting different sources for different reasons can play havoc with the results. Of course, there are workarounds using a mix of both approaches (classify as non-unique, then filter at display time by excluding each user’s rejection criteria), but in the end it is a hack.
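The hybrid workaround above is easy to sketch. This is a toy illustration under my own assumptions (the class and method names are made up, not anything Persai actually exposes): classification results are shared per interest, and the only per-user work is a cheap set lookup at display time.

```java
import java.util.*;

// Hypothetical sketch of the hybrid approach: classify each interest once
// (shared across all users), then filter out per-user rejections only when
// displaying results. Names are illustrative, not Persai's actual code.
public class InterestFilter {

    // Shared classification output: interest -> ranked source URLs.
    private final Map<String, List<String>> resultsByInterest = new HashMap<>();

    // Per-user state: user -> set of rejected source URLs.
    private final Map<String, Set<String>> rejectionsByUser = new HashMap<>();

    public void addResults(String interest, List<String> sources) {
        resultsByInterest.put(interest, sources);
    }

    public void reject(String user, String source) {
        rejectionsByUser.computeIfAbsent(user, u -> new HashSet<>()).add(source);
    }

    // Display-time filtering: the expensive classification happened once per
    // interest; only this cheap exclusion step runs per user.
    public List<String> resultsFor(String user, String interest) {
        Set<String> rejected =
                rejectionsByUser.getOrDefault(user, Collections.emptySet());
        List<String> out = new ArrayList<>();
        for (String source :
                resultsByInterest.getOrDefault(interest, Collections.emptyList())) {
            if (!rejected.contains(source)) {
                out.add(source);
            }
        }
        return out;
    }
}
```

The hack-iness is visible even here: one user’s rejection teaches the shared classifier nothing, so the feedback loop the rejections are supposed to power never closes.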
In any case, the approach results in tremendously outdated results. Some of the interests have really old articles on top. This could also be due to the fact that the sources are manually added into the system, which means that the quality and spread of the sources depend on the bias of the person selecting them. Moreover, it is another issue that sites without RSS feeds will not be able to get into Persai at all.
Splogs: Possibly the group that will be over the moon about Persai would be the thugs who run splogs. With Persai it becomes ridiculously easy to set up automated blogs based on topics and, honestly, I see more people using Persai for this than for anything else.

Considering that Persai is still in beta, I would not give it the “FAIL” rating, but I would certainly give it the “FRAIL” rating. I hope it becomes a much better product by the time it comes out of private beta.
Peter Cranstone, while pondering how the Google phone will deliver ads to its users, says that Google will have to do something similar to what Opera does with Opera Mini — transcoding web pages — for the Google phone. He adds that once Google gets around to doing this it will beat the crap out of Opera Mini, which probably won’t find much agreement with Russell Beattie, who argues that someone should buy Opera just for the traffic that is now routed through Opera Mini.
What both gentlemen are probably not aware of is that Google already has a transcoder that converts pages into mobile-formatted versions on the fly. Now, rather strangely, the interface is not available anywhere as a start page, as far as I know. Google does serve you a mobile-specific Google.com page depending on your User-Agent, but the links delivered in the results page do not use the transcoder.
The only place where you can see it is in the mobile version of Google Reader. On the entry screen of Google Reader, there is a link that says “see original,” which can also be accessed by pressing ‘0’ on your mobile phone. To access any normal page in your desktop browser via this transcoder, all you have to do is append the URL you want to browse to the following URL: http://www.google.com/gwt/n?u=. For example, this blog can be accessed this way: http://www.google.com/gwt/n?u=https://fatalerror.wordpress.com.
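If you want to build such links programmatically, a tiny helper does it. The `/gwt/n?u=` endpoint is the one from the post; the helper and its encoding step are my own addition — percent-encoding the target URL is a defensive assumption so that query strings inside it survive the trip, even though a bare appended URL works for simple cases.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Minimal sketch: prepend Google's transcoder endpoint to any page URL.
public class Transcoder {
    private static final String GWT_PREFIX = "http://www.google.com/gwt/n?u=";

    public static String transcodedUrl(String pageUrl) {
        // Encode the target so any '?' or '&' in it isn't eaten by the
        // transcoder's own query string.
        return GWT_PREFIX + URLEncoder.encode(pageUrl, StandardCharsets.UTF_8);
    }
}
```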
Currently, the transcoder supports most standard HTML, including forms, which means that you get to access things like email on the go even on a very low-fi handset, and also that Google gets another bit of your personal information (did I hear the privacy-paranoid let out a collective gasp there?) to index and profile. The good part of the story is that it refuses to transcode secure URLs, which, as I remember, was not the case with Opera Mini.
Now, I also have to admit here that Opera Mini does a stellar job, but it has the drawback that you need J2ME support to be able to use it. Besides, the Google transcoder seems to be considerably faster at transcoding and rendering pages. For all you know, Google may be licensing Opera’s technology to do this (imagine: Opera Mini kills Opera Mini. What a headline!), but from what I remember Opera runs a mightily hacked-up version of the Opera browser as middleware to make Opera Mini possible, while Google’s approach seems more in line with the standard HTML Tidy/HTML Cleaner/HTML Parser/TagSoup approach to de-mucking web pages, albeit a monstrously hacked version of it.
Updates have been few and far between here due to the same old reasons: life being mostly all work and very little play. There are a couple of pretty interesting developments that have been in the works; I will write more about them if and when they work out. Meanwhile, in the technology sphere, other than the recent and continuing dalliance with Lucene, Nutch and crawling the tubes, there is one bit of technology – RDF and the semantic web – that’s been taking up more and more of my thought cycles.
Firstly, I will readily admit to not yet understanding the core concepts — triples and the subject, predicate and object soup — to the required level of finesse, but I am trying hard to implement a subset of it in regular, existing applications, so that data can describe itself and generate multiple views that would otherwise be pretty much impossible. This is also necessary because most of the tools in the RDF space — including application frameworks and data browsers — are far from scalable or stable enough at this point. Besides, paradigm shifts are best left alone when they are based on still-evolving technology.
Over the past week I’ve been putting the finishing touches on version one of an internal API, remodeling it to be REST-compliant rather than use XML-RPC as the earlier version did. It took me a while to wrap my head around URIs being used as unique identifiers and other such concepts, but you can call me a convert now, after having seen the positives, though I will admit that getting the URIs right is one hell of a bitch.
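The shift is easier to see side by side. This is a toy sketch under my own assumptions (the resource name and paths are made up, not the actual internal API): XML-RPC funnels every operation through one endpoint as method calls, while REST gives each resource a stable URI of its own that plain HTTP verbs act on.

```java
// XML-RPC style: POST /rpc with a <methodCall>getArticle(42)</methodCall> body.
// REST style:    GET  /articles/42 — the URI itself identifies the resource.
public class ResourceUris {

    // Build the canonical URI for a hypothetical "article" resource; GET,
    // PUT and DELETE would all act on this same identifier.
    public static String articleUri(String baseUrl, long id) {
        return baseUrl + "/articles/" + id;
    }
}
```

Getting these right is exactly the hard part: once clients hold `/articles/42` as an identifier, renaming the path later breaks everyone.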
The time has finally come for me to get my hands dirty with a piece of technology — Java — that I have so far stayed away from. The week has started bright and early with the lovely NullPointerExceptions, servlet errors and other bits of incomprehensible language that I am getting my head around. No wonder Java web application deployments have the .war extension; it is almost a war getting them to work at times. That said, Nutch and Lucene are impressive bits of technology, as is the fact that Zend Search is binary-compatible with Lucene indexes. Is it not wonderful to have the best of two different worlds on the same platter?
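For the curious, the Lucene index-then-search cycle that Nutch builds on (and that Zend Search can read) fits in a few lines. This is a sketch against the Lucene 2.x API of the era — `RAMDirectory`, `Hits` and the stored/tokenized `Field` flags — not anything from my actual project, and it needs the Lucene jar on the classpath.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;

public class LuceneHello {
    public static int indexAndSearch() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index one document with a single stored, tokenized field.
        IndexWriter writer = new IndexWriter(dir, analyzer, true);
        Document doc = new Document();
        doc.add(new Field("content", "getting my hands dirty with lucene",
                          Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        // Search the same in-memory index with a parsed query.
        IndexSearcher searcher = new IndexSearcher(dir);
        Hits hits = searcher.search(
                new QueryParser("content", analyzer).parse("lucene"));
        int count = hits.length();
        searcher.close();
        return count;
    }
}
```

The same index written this way is what Zend Search Lucene can open from PHP, which is the binary-compatibility point above.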