
This is a condensed version of a few conversations with a few people, containing some of their ideas as well, which I tried to clean up and make coherent. While I don’t see the harm in disseminating it (if it gets implemented without me having to work, so much the better), for the moment I’ll be keeping it password protected, just in case… cause I do want at least some glory out of this. :)

Web 2.0 has put the emphasis on a limited number of large Service Sites, just like the dotcom bubble tried to make every site your single portal to the internet as a whole. Unlike the earlier creations, they’re interoperable. They have exposed APIs you can use to make your own toys, and you do. However, and that’s the important bit, they keep your data halfway across the world, God knows where. They even went so far as to suck your actual applications away. They’ve reduced your computer to a brainless terminal (all in good faith, yes, in good faith and legitimate commercial interest, but that’s just web 2.0). Which leaves you with CPU power to spare.

This is just a vision of what Web 3.0 should be like.

Let’s call it your External Information Manager. It’s a local application, it’s not a service — though there might be external sites which offer you hosting tailor-made for it. It might even have a personality, if you want — in some ways it certainly does, though essentially, this personality is also yours. It sits between your browser and the rest of the internet, and its job is:

  1. To cut the crap.
  2. To disseminate the non-crap to your friends.
  3. To make sure that if there’s anything non-crap that you want, you’re the first to know.

The methods of doing so are as follows:

  • It relies on the big Services of Web 2.0 to do its thing — it sucks in megafeeds from large blog providers and statistically analyzes them to separate the fun and non-fun, spam and non-spam. But it can do it with any exposed feeds you want directly, just the same.
  • It analyzes what you read, constantly getting input from what you surf and what you want, and then goes off searching the web itself to look stuff up. And while it uses the Google API more often than not, once the site is found, it spiders it by itself to catch the things Google didn’t get around to yet.
  • It keeps in touch with your friends’ EIMs[1]: sending them pieces of news it knows you would want to tell them about, getting pieces of news from them, asking them to help search, exposing things you want people to know you’ve read. It does this using a very standard protocol, so even though their bots run on different kinds of systems and are written by different people, they all work together. Your social network directly becomes an adjunct to your net-filtering system.
  • If it sees something (or someone) it knows you definitely don’t want to see — because you explicitly told it — it just cuts it out, so that you never even know it’s been there.
  • Most importantly, it’s yours. It’s not someone else spoonfeeding you the ocean of information — it’s your own spoon, which you control, you are responsible for, you tweak to suit yourself, and you move to your own mouth. Nobody can use it to spy on you, or feed you shit, unless they pry it from your cold, dead fingers, and when you leave home, you can take it with you on your external drive, so that it’s always with you.

And even if large chunks of the net keel over and die, it works just as it did.
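To make the feed-filtering and "just cut it out" ideas a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the sample feed, the field names, and the `parse_items` / `filter_items` helpers are hypothetical, and a real EIM would fetch feeds over the network and combine explicit blocklists with statistical filtering.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real EIM would fetch this over HTTP.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example megafeed</title>
<item><title>Asteroid misses Earth</title><author>alice</author><link>http://example.com/1</link></item>
<item><title>Buy cheap pills now</title><author>spambot</author><link>http://example.com/2</link></item>
<item><title>New XMPP library released</title><author>bob</author><link>http://example.com/3</link></item>
</channel></rss>"""


def parse_items(rss_text):
    """Turn an RSS 2.0 document into a list of plain dicts."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "author": item.findtext("author", default=""),
            "link": item.findtext("link", default=""),
        })
    return items


def filter_items(items, blocked_authors=(), blocked_words=()):
    """Drop items whose author or title words are on an explicit blocklist."""
    blocked_authors = {a.lower() for a in blocked_authors}
    blocked_words = {w.lower() for w in blocked_words}
    kept = []
    for item in items:
        if item["author"].lower() in blocked_authors:
            continue  # someone you told it you never want to see
        if set(item["title"].lower().split()) & blocked_words:
            continue  # something you told it you never want to see
        kept.append(item)
    return kept


kept = filter_items(
    parse_items(SAMPLE_RSS),
    blocked_authors=["spambot"],
    blocked_words=["asteroid"],
)
print([item["title"] for item in kept])  # ['New XMPP library released']
```

The blocked items simply never reach the reader, which is the behaviour described above; whether that silent removal should be logged somewhere is a design question this sketch does not try to answer.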

There are still a few important chunks of this vision missing, however; that’s what the implementation is all about:

  1. While implementing the separation of text into ham and spam is trivial using existing Bayesian filtering software, I don’t think analyzing words as tokens will be sufficient. Eventually, it has to be able to tell a WWN report of a falling asteroid from a real report of a falling asteroid. It has to be able to tell good porn from bad, even. And it has to work with all languages known to man, tell them apart and adjust to their grammar peculiarities.
  2. This robot network will result in useless duplication of effort, so some sensible concept of sharing filter profiles is needed… but you can’t exactly share a multi-megabyte database easily. I don’t see how this could be done yet.
  3. Markup-sensitive filtering to remove unwanted stuff will have to somehow fit between the browser and the net, without wrecking the very delicate things people are so fond of these days, like AJAX. And with the way people still treat web standards, well, it’s going to be extremely tough.
  4. It is in no way clear how to determine what I might find interesting in the next moment, what keywords to search for and what to spider. This, however, is a very important bit.
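For reference, the "trivial" word-token baseline from point 1 can be sketched in a few lines of Python. This is a toy naive Bayes classifier, not any particular existing filtering package, and the class and method names are made up for illustration. The `export_profile` method is only one possible direction for point 2 (share the handful of most discriminative tokens instead of the whole database); it is an assumption, not a worked-out answer.

```python
import math
import re
from collections import Counter


class TokenFilter:
    """Toy naive Bayes ham/spam classifier over word tokens (hypothetical)."""

    def __init__(self):
        self.counts = {"ham": Counter(), "spam": Counter()}
        self.docs = {"ham": 0, "spam": 0}

    @staticmethod
    def tokenize(text):
        return re.findall(r"\w+", text.lower())

    def train(self, text, label):
        self.counts[label].update(self.tokenize(text))
        self.docs[label] += 1

    def _token_prob(self, tok, label, vocab_size):
        # Add-one smoothing; the extra +1 reserves mass for unseen tokens.
        total = sum(self.counts[label].values())
        return (self.counts[label][tok] + 1) / (total + vocab_size + 1)

    def spam_probability(self, text):
        vocab_size = len(set(self.counts["ham"]) | set(self.counts["spam"]))
        total_docs = self.docs["ham"] + self.docs["spam"]
        scores = {}
        for label in ("ham", "spam"):
            score = math.log((self.docs[label] + 1) / (total_docs + 2))
            for tok in self.tokenize(text):
                score += math.log(self._token_prob(tok, label, vocab_size))
            scores[label] = score
        # Turn the two log scores into P(spam); clamp to avoid overflow.
        diff = max(-700.0, min(700.0, scores["ham"] - scores["spam"]))
        return 1.0 / (1.0 + math.exp(diff))

    def export_profile(self, top_n=100):
        """Return the top_n most discriminative tokens with their log-ratios."""
        vocab = set(self.counts["ham"]) | set(self.counts["spam"])
        ratios = {
            tok: math.log(self._token_prob(tok, "spam", len(vocab))
                          / self._token_prob(tok, "ham", len(vocab)))
            for tok in vocab
        }
        top = sorted(ratios, key=lambda t: abs(ratios[t]), reverse=True)[:top_n]
        return {tok: round(ratios[tok], 3) for tok in top}


f = TokenFilter()
f.train("cheap pills buy now", "spam")
f.train("buy cheap watches", "spam")
f.train("asteroid report from observatory", "ham")
f.train("meeting notes from friday", "ham")
print(f.spam_probability("cheap pills") > 0.5)         # True
print(f.spam_probability("observatory report") < 0.5)  # True
print(len(f.export_profile(top_n=3)))                  # 3
```

Note that this is exactly the kind of filter point 1 says will not be sufficient: it sees only words, so a tabloid asteroid story and a real one look nearly identical to it.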

So… thoughts?

update: Some more ideas and a few plans.

  1. I don’t know yet whether it should be a CGI-style application visible through your browser only, or a proxy/server-style standalone application. Both approaches have their merits. It definitely should expose an HTTP interface somewhere, though. It’s possible that both approaches should be used at the same time.
  2. I think that the most important part, which everything will be based on, is the my-world-is-spam text analysis module. It should form a core library, (and a core database) which will be used everywhere else. Most other things listed above involve it in one way or another.
  3. The development, obviously, will be open source, cause this thing will be so much more useful and reality-changing if it’s widespread. This post will go public when we get at least some running, marginally useful code; by then it will no longer matter. :)
  4. Besides donations, which aren’t as easy to come by as one would hope, we might be able to earn money by providing paid hosting for EIMs in the form of a conventional web 2.0 service, with the added value that you can always grab your data and use it locally, which most smarter people will do. Note that while this approach will scale linearly, the slope will be quite steep, so optimizations for hosting multiple EIMs should be thought of early. That’ll also help with enterprise deployment, and I suspect corporations might find EIMs just as useful.
  5. One interesting side-effect of using EIMs is that advertising will get filtered out along with all other spam. There will be some interesting publicity consequences involved, since, while there is software like Adblock and Junkbuster which does this for you, I don’t think there have been any incidents of a web 2.0 service trying to make ad-removal a business.
  6. …besides web content, it could, and probably should, also filter e-mail, and maybe even instant messages. :)
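On point 1, one way to keep both options open is to write the HTTP interface against a gateway-neutral API, so the same code can run CGI-style or as a standalone server. A rough sketch using Python's standard `wsgiref` module follows; the `/status` path, the `STATE` dict and the port number are all made up for illustration, and a real EIM would obviously expose much more than a counter page.

```python
import json
from wsgiref.handlers import CGIHandler
from wsgiref.simple_server import make_server

# Hypothetical in-memory state; a real EIM would read its own database.
STATE = {"filtered": 0, "shared": 0}


def eim_app(environ, start_response):
    """WSGI application exposing a read-only status page as JSON."""
    path = environ.get("PATH_INFO", "/")
    if path == "/status":
        status = "200 OK"
        body = json.dumps(STATE).encode("utf-8")
    else:
        status = "404 Not Found"
        body = json.dumps({"error": "not found"}).encode("utf-8")
    start_response(status, [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def run_as_cgi():
    """CGI-style: the web server starts this script once per request."""
    CGIHandler().run(eim_app)


def run_standalone(port=8080):
    """Standalone: bind to localhost only, since the EIM is a personal tool."""
    with make_server("127.0.0.1", port, eim_app) as server:
        server.serve_forever()
```

Neither `run_as_cgi` nor `run_standalone` is called here; which one gets used would be a deployment decision. The proxy-style variant, where the EIM sits in the request path and rewrites pages, is a different and harder problem that this sketch does not cover.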

  1. Probably using XMPP, even.