10 November 2009
What they've revealed so far is a semantic search engine that builds on Freebase to search for topics instead of keywords. If you haven't heard of it, Freebase is a semantic general-knowledge database that takes Wikipedia and makes it more structured, allowing easier automated processing (as opposed to human reading). Zing, van Hoff's previous gig, used Freebase, so he's no newcomer to its capabilities. It'll be interesting to see what he does with it.
Ellerdale is indexing the web too, but a lot of their focus seems to be on Twitter. They map hashtags to topics, keep track of trending topics, and show a real-time stream of relevant tweets. There's no hint as to what their business model might be, but one might guess that it'll be advertising-based.
In addition to their web app, Ellerdale has a simple RESTful API which exposes some of the inner machinery for reuse in mashups. For example, here's everything they know about Angelina Jolie. If you look at the JSON results, you can see links back to Freebase, Wikipedia, and the New York Times, as well as a bunch of categories which appear to represent the union of Freebase types/properties and Wikipedia categories. The API covers the basics, but that's about it. For example, there's no way to twiddle any of the knobs and dials that control how it decides that topics are related. No API key required. No word about quotas.
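For the curious, a topic response like the one above can be picked apart with a few lines of Python. The field names below (`links`, `categories`, and friends) are my guesses at the shape of the JSON, not Ellerdale's documented schema:

```python
import json

# Hypothetical shape of an Ellerdale topic response; the real field
# names may well differ -- this is just a parsing sketch.
sample = json.loads("""
{
  "name": "Angelina Jolie",
  "links": {
    "freebase": "http://www.freebase.com/view/en/angelina_jolie",
    "wikipedia": "http://en.wikipedia.org/wiki/Angelina_Jolie"
  },
  "categories": ["Person", "Female", "American film actors"]
}
""")

def external_links(topic):
    """Return the topic's links back to other knowledge bases."""
    return topic.get("links", {})

def category_names(topic):
    """Return the merged Freebase/Wikipedia category labels."""
    return topic.get("categories", [])

print(external_links(sample)["freebase"])
print(category_names(sample))
```

Since there's no API key or quota to worry about, experimenting like this against the live endpoint should be painless.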
The Ellerdale IDs look a little like GUIDs, but they're more like serial numbers. They start at #1 with the basics like English Noun, Verb, Adverb, Adjective, then progress on to Person, Male, Female, Date, String, Hashtag, and Category.
Beginning at 0000-1000, they've loaded all the words from the Princeton WordNet corpus (nouns, verbs, adjectives, and adverbs), although the API doesn't expose the synsets, if they've got them loaded at all. Whatever natural language processing WordNet is being used for isn't exposed directly through the API, just its results.
After that, we've got the full hierarchy of Wikipedia categories, from 0010-0000 to 0019-0001. Around 0080-0000 the topics/concepts themselves begin, including some rather esoteric stuff like postal codes from Freebase which don't exist in Wikipedia. After those come the hashtags, like 00fa-0001 for #mala.
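If the IDs really are serial numbers rendered as two 16-bit hex words, the ordering falls out of a one-line conversion. This is my interpretation of the format, not anything Ellerdale documents:

```python
def ellerdale_serial(eid):
    """Treat an ID like '00fa-0001' as two 16-bit hex words and
    return its position in the overall allocation sequence."""
    hi, lo = eid.split("-")
    return int(hi, 16) * 0x10000 + int(lo, 16)

# Under this reading, the WordNet block begins at serial 4096 and
# the Wikipedia category hierarchy at serial 1048576:
print(ellerdale_serial("0000-1000"))  # 4096
print(ellerdale_serial("0010-0000"))  # 1048576
```

That would give them room for about 4.3 billion topics before the scheme runs out, which seems like plenty.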
Although I'm guessing that they get most of the Wikipedia content by way of Freebase, they appear to have some type of side channel to get fresh topics directly because they have things like the Motorola Droid (#droid), but without any link back to Freebase (or Wikipedia for that matter).
That's it for a first look. I'd love to hear more about the project from anyone who's got info to share.
17 August 2009
Freebase should now have a complete copy of all National Register of Historic Places entries which are of International, National, or State significance. For the Local significance listings, I still used the old strategy of only reconciling existing topics.
Below is a summary of the before and after counts. We picked up 4,535 entries which had either been added to Wikipedia, added to the Register, or both. On top of that we created another 20,553 entries, bringing the grand total to over 35,000 listings.
| Starting Count | Existing Topics Reconciled | New Topics Created | Ending Count |
Each topic contains a fair amount of information, so the entire load amounts to about 750,000 "facts" (or "triples" in RDF-speak), bringing the total number of facts I've written to Freebase to over 1.1M. Unfortunately, their "tallybot", which does the nightly updating of totals, has been broken for a while, so I'm only getting credited with a paltry 300K.
The one remaining loose end is to do a better job of reconciling the architects/builders and what the Park Service calls "significant people" associated with each listing. This will require human vetting of a queue of tasks, so some additional infrastructure will need to be in place before I can set people loose on it.
29 June 2009
26 March 2009
One thousand students will be paid $4500 by Google for a summer of working on open source projects and will be mentored by experienced open source developers. To my mind, the experience and mentoring are almost more valuable than the cash (although obviously that varies greatly depending on the economic situation of the student).
If you look at the list of projects, you'll see that there's something for every taste. Projects range from low-level bit banging in C on bare iron to bioinformatics to games to all manner of so-called "social" apps, in a wide variety of programming languages. Students and mentors come from almost one hundred different countries, so there's an enormous amount of diversity on that front as well.
I've been a mentor for three of the four years the program has been in existence (2006, 2007, 2008), and last year I had the satisfaction of seeing one of my original students become a mentor himself. It's a lot of work, but very satisfying. Unfortunately my project won't be participating this year due to a combination of cutbacks at Google (about 10%) and a desire to rotate in new organizations, but the ArgoEclipse team would still love to mentor any new folks, students or otherwise, who are interested in getting their feet wet with open source development.
11 March 2009
What he calls "keys" are called "enumerated properties" in the Freebase documentation, and there's an article on how to set them up. Unfortunately, the schema editor was broken when I was working on the National Register of Historic Places database schema, so I had to resort to reverse engineering things from the Explore view (accessible by pressing F8 on any page and scrolling to the bottom) and then modifying the schema's property type by hand using their MQL query language. You can see the end result in the schema, where item_number is typed as an enumeration.
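The hand edit boils down to a MQL write that connects the property to the namespace holding its allowed values. The ids below are illustrative stand-ins rather than the real schema paths, but the general shape follows MQL's write syntax:

```json
{
  "id": "/base/usnris/nris_listing/item_number",
  "type": "/type/property",
  "enumeration": {
    "id": "/base/usnris/item_numbers",
    "connect": "insert"
  }
}
```

Once the `enumeration` link on the `/type/property` instance points at a namespace, the property behaves as a key into that namespace, which is what the schema editor would have done for me had it been working.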
There's also a good article on how to create a URL template that I used successfully to link to the original application submissions. For the Congressional Bioguide, it can be used to link back to the original biography.
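The template mechanism itself is just key substitution. The exact Freebase schema plumbing varies, but the effect is the same as this sketch; the Bioguide URL pattern shown is my recollection of the one their site used at the time, so treat it as an assumption:

```python
def expand_template(template, key):
    """Substitute a topic's key into a URL template, mimicking
    what Freebase's URL-template mechanism does at render time."""
    return template.replace("{key}", key)

# Assumed pattern for the Congressional Bioguide's biography pages:
bioguide = "http://bioguide.congress.gov/scripts/biodisplay.pl?index={key}"
print(expand_template(bioguide, "K000148"))
```

With the template in place, any topic carrying a Bioguide ID key gets a working link back to the original biography for free.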
Coincidentally and independently from Raymond's project, I was actually working on loading up all the Congressional Bioguide IDs last weekend, because they are used in the XML form of legislation on THOMAS, which is run by the Library of Congress. I decided to take a slight detour to write a little name parser and Freebase name querier in Python, so I haven't actually gotten around to loading the IDs yet. One of the biggest problems in working with Freebase is reliably resolving personal names. Topics typically carry only the main name that was used as the Wikipedia article name. There's really no telling what name form the article's editors will have chosen, and even though the full name and some aliases are often identified in the opening sentence of the article, Freebase doesn't import this information from Wikipedia.
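The name parser doesn't need to be fancy to cover the forms that dominate Wikipedia article titles and the Bioguide. A minimal sketch of the idea (my own simplification, not the actual code) that normalizes "Family, Given" and "Given Family" orderings:

```python
def parse_name(name):
    """Split a personal name into (given, family).

    Handles the two common orderings: "Kennedy, John F." and
    "John F. Kennedy". Suffixes like "Jr.", particles like "van",
    and other edge cases are deliberately left unhandled here.
    """
    if "," in name:
        family, _, given = name.partition(",")
        return given.strip(), family.strip()
    parts = name.split()
    if len(parts) == 1:
        # Mononym: no given name to speak of.
        return "", parts[0]
    return " ".join(parts[:-1]), parts[-1]

print(parse_name("Kennedy, John F."))  # ('John F.', 'Kennedy')
print(parse_name("John F. Kennedy"))   # ('John F.', 'Kennedy')
```

Normalizing both sources to the same (given, family) tuple makes matching Bioguide entries against Freebase name and alias fields a plain string comparison rather than a guessing game.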