Plotly

10 November 2009

The Ellerdale Project - new semantic search/trends

The Ellerdale Project (@ellerdale) has just emerged from stealth mode with a couple of discreet tweets. Art van Hoff (@avh) and collaborators have been working on this for a while now and I've been very curious to see what they come up with.

What they've revealed so far is a semantic search engine that builds on Freebase to search for topics instead of keywords. If you haven't heard of it, Freebase is a semantic general-knowledge database that takes Wikipedia and gives it more structure, which makes automated processing easier (as opposed to human reading). Zing, van Hoff's previous gig, used Freebase, so he's no newcomer to its capabilities. It'll be interesting to see what he does with it.

Ellerdale is indexing the web too, but a lot of their focus seems to be on Twitter. They map hashtags to topics, keep track of trending topics, and show a real-time stream of relevant tweets. There's no hint as to what their business model might be, but one might guess that it'll be advertising-based.

In addition to their web app, Ellerdale has a simple RESTful API which exposes some of the inner machinery for reuse in mashups. For example, here's everything they know about Angelina Jolie. If you look at the JSON results, you can see links back to Freebase, Wikipedia, and the New York Times, as well as a bunch of categories which appear to represent the union of Freebase types/properties and Wikipedia categories. The API covers the basics, but that's about it. For example, there's no way to twiddle any of the knobs and dials that control how it determines which topics are related. No API key is required, and there's no word about quotas.
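To make the "links back to Freebase, Wikipedia, and the New York Times" idea concrete, here's a minimal Python sketch that groups a topic record's links by the site they point to. The response shape is not documented in this post, so the `SAMPLE` JSON, its field names (`name`, `links`, `categories`), the example URLs, and the `links_by_host` helper are all made up for illustration and are not the actual Ellerdale format.

```python
import json
from collections import defaultdict
from urllib.parse import urlparse

# Made-up sample record. The field names and URLs are assumptions for
# illustration only; the real Ellerdale JSON may look quite different.
SAMPLE = """
{
  "name": "Angelina Jolie",
  "links": [
    "http://www.freebase.com/view/en/angelina_jolie",
    "http://en.wikipedia.org/wiki/Angelina_Jolie",
    "http://topics.nytimes.com/top/reference/timestopics/people/j/angelina_jolie/index.html"
  ],
  "categories": ["Person", "Film actor"]
}
"""


def links_by_host(record):
    """Group a record's outbound links by the host they point to."""
    grouped = defaultdict(list)
    for url in record.get("links", []):
        grouped[urlparse(url).netloc].append(url)
    return dict(grouped)


if __name__ == "__main__":
    record = json.loads(SAMPLE)
    for host, urls in links_by_host(record).items():
        print(host, len(urls))
```

Grouping by host is just one convenient way for a mashup to decide which upstream source to follow for a given topic. Nothing here depends on how Ellerdale actually labels its links.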

The Ellerdale IDs look a little like GUIDs, but they're more like serial numbers. They start at #1 with the basics like English Noun, Verb, Adverb, Adjective, then progress on to Person, Male, Female, Date, String, Hashtag, and Category.

Beginning at 0000-1000, they've loaded all the words from the Princeton WordNet corpus, including adverbs, verbs, adjectives, and nouns, although the API doesn't expose the synsets, if they've got them loaded at all. Whatever natural language processing WordNet is being used for isn't exposed directly through the API, just its results.

After that, we've got the full hierarchy of Wikipedia categories from 0010-0000 to 0019-0001. Around 0080-0000 the topics/concepts themselves begin, including some rather esoteric stuff like postal codes from Freebase which don't exist in Wikipedia. After that come the hashtags, like 00fa-0001 #mala
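The ID layout described above can be sketched as a small lookup in Python. This is a guess at the scheme, not Ellerdale's own logic: it assumes each `hhhh-hhhh` ID is a single 32-bit hexadecimal serial number, and the range boundaries are only the values mentioned in this post (the topic and hashtag lower bounds in particular are just observed examples). The `parse_id` and `guess_kind` names are hypothetical.

```python
def parse_id(ellerdale_id):
    """Read 'hhhh-hhhh' as one 32-bit hex number (an assumption about the format)."""
    high, low = ellerdale_id.split("-")
    return (int(high, 16) << 16) | int(low, 16)


# (lower bound, upper bound or None, label). Boundaries come from the ranges
# observed in this post; the real cut-offs are unknown.
RANGES = [
    ("0000-0001", "0000-0fff", "basic type"),
    ("0000-1000", "000f-ffff", "wordnet word"),
    ("0010-0000", "0019-0001", "wikipedia category"),
    ("0080-0000", "00fa-0000", "topic"),
    ("00fa-0001", None, "hashtag"),
]


def guess_kind(ellerdale_id):
    """Return a best-guess label for an ID, or 'unknown' if it falls in a gap."""
    value = parse_id(ellerdale_id)
    for lower, upper, kind in RANGES:
        if value >= parse_id(lower) and (upper is None or value <= parse_id(upper)):
            return kind
    return "unknown"


if __name__ == "__main__":
    for example in ["0000-0001", "0000-1000", "0019-0001", "0080-0000", "00fa-0001"]:
        print(example, guess_kind(example))
```

IDs that land between the listed ranges, such as anything between 0019-0001 and 0080-0000, come back as "unknown" rather than being forced into a neighbouring bucket, since the post doesn't say what lives there.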

Although I'm guessing that they get most of the Wikipedia content by way of Freebase, they appear to have some type of side channel to get fresh topics directly because they have things like the Motorola Droid (#droid), but without any link back to Freebase (or Wikipedia for that matter).

That's it for a first look. I'd love to hear more about the project from anyone who's got info to share.
