10 November 2009

The Ellerdale Project - new semantic search/trends

The Ellerdale Project (@ellerdale) has just emerged from stealth mode with a couple of discreet tweets. Art van Hoff (@avh) and collaborators have been working on this for a while now and I've been very curious to see what they come up with.

What they've revealed so far is a semantic search engine which builds on Freebase to search for topics instead of keywords. If you haven't heard of it, Freebase is a semantic general knowledge database that takes Wikipedia and makes it more structured, allowing easier automated processing (as opposed to human reading). Zing, van Hoff's previous gig, used Freebase, so he's no newcomer to its capabilities. It'll be interesting to see what he does with it.

Ellerdale is indexing the web too, but a lot of their focus seems to be on Twitter. They map hashtags to topics, keep track of trending topics, and show a real-time stream of relevant tweets. No hint as to what their business model might be, but one might guess that it'll be advertising based.

In addition to their web app, Ellerdale has a simple RESTful API which exposes some of the inner machinery for reuse in mashups. For example, here's everything they know about Angelina Jolie. If you look at the JSON results, you can see links back to Freebase, Wikipedia, and the New York Times, as well as a bunch of categories which appear to represent the union of Freebase types/properties and Wikipedia categories. The API covers the basics, but that's about it. For example, there's no way to twiddle any of the knobs and dials that control how it decides which topics are related. No API key required. No word about quotas.

The Ellerdale IDs look a little like GUIDs, but they're more like serial numbers. They start at #1 with the basics like English Noun, Verb, Adverb, Adjective, then progress on to Person, Male, Female, Date, String, Hashtag, and Category.

Beginning at 0000-1000, they've loaded all the words from the Princeton WordNet corpus including adverbs, verbs, adjectives, and nouns, although the API doesn't expose the synsets, if they've got them loaded. Whatever natural language processing that WordNet is being used for is not exposed in any native way through the API -- just its results.

After that, we've got the full hierarchy of Wikipedia categories from 0010-0000 to 0019-0001. Around 0080-0000 the topics/concepts themselves begin, including some rather esoteric stuff like postal codes from Freebase which don't exist in Wikipedia, and after that are the hashtags like 00fa-0001 #mala
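To make the numbering scheme concrete, here's a rough Python sketch that sorts an ID into the blocks described above. It assumes the IDs are two four-digit hexadecimal groups (the "00fa" suggests hex, but that's a guess), and the function names and block table are purely illustrative. Only the starting points were observed, so anything in the gaps between blocks simply gets the label of the block before it.

```python
def parse_ellerdale_id(text):
    """Turn an ID like '00fa-0001' into an integer, assuming two hex groups."""
    high, low = text.split("-")
    return (int(high, 16) << 16) | int(low, 16)


# Lowest ID seen (or mentioned) for each block, in ascending order.
# Where one block really ends and the next begins is an assumption.
BLOCKS = [
    ("0000-0001", "basic types"),
    ("0000-1000", "WordNet words"),
    ("0010-0000", "Wikipedia categories"),
    ("0080-0000", "topics"),
    ("00fa-0001", "hashtags"),
]


def guess_block(text):
    """Return the name of the last block whose start is <= the given ID."""
    value = parse_ellerdale_id(text)
    label = "unknown"
    for start, name in BLOCKS:
        if value >= parse_ellerdale_id(start):
            label = name
    return label


if __name__ == "__main__":
    for sample in ("0000-1000", "0010-0000", "0080-0000", "00fa-0001"):
        print(sample, guess_block(sample))
```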

Although I'm guessing that they get most of the Wikipedia content by way of Freebase, they appear to have some type of side channel to get fresh topics directly because they have things like the Motorola Droid (#droid), but without any link back to Freebase (or Wikipedia for that matter).

That's it for a first look. I'd love to hear more about the project from anyone who's got info to share.

17 August 2009

Breaking the 1 million barrier

I finished loading the updated National Register of Historic Places database into Freebase last week. In addition to containing the latest data released by the National Park Service, combined with the latest Wikipedia articles, this run created new topics where Freebase didn't have existing ones. You may remember that the initial run focused solely on reconciling existing Freebase topics.

Freebase should now have a complete copy of all National Register of Historic Places entries which are of International, National, or State significance. The Local significance listings still used the old strategy of only reconciling existing topics.

Below is a summary of the before and after counts. We picked up 4,535 entries which had either been added to Wikipedia, added to the Register, or both. On top of that we created another 20,553 entries, bringing the grand total to over 35,000 listings.

Significance    Starting Count   Existing Topics Reconciled   New Topics Created   Ending Count
International                0                            1                   10             11
National                  2010                          699                 4386           7095
State                     2423                         1121                16065          19609
Local                     5978                         2627                   92           8690
TOTAL                    10434                         4535                20553          35518

Each topic contains a fair amount of information, so the entire load amounts to about 750,000 "facts" (or "triples" in RDF-speak), bringing the total number of facts that I've written to Freebase to over 1.1M. Unfortunately, their "tallybot" which does the nightly updating of totals has been broken for a while, so I'm only getting credited with a paltry 300K.

The one remaining loose end is to try and do a better job of reconciling the architects/builders and what the Park Service calls "significant people" associated with the listing. This will require human vetting of a queue of tasks, so it'll require some additional infrastructure to be put in place before I can set people loose on working on it.

29 June 2009

Featured Freebase app & base - US National Register of Historic Places

Speaking of Freebase, they've featured some work of mine that I never mentioned, so I suppose I should talk briefly about it.

Back at the end of 2008 I decided that, after a year of casually following Freebase, it was getting interesting enough to invest some time in learning it in a little more depth. Of course the only way to do that is hands-on, so I needed a project. I didn't want to start with an idea that had commercial potential (they're secret!) and I've got an interest in old places through my genealogy hobby, so I decided to load up the U. S. National Park Service's Register of Historic Places database. The source database is in dBase format, so I grabbed a Python module to read it and started playing around with loading it into Freebase. Data reconciliation between two slightly crufty databases is a non-trivial issue, so I played around for quite a while on Freebase's sandbox before I was happy with the results and was ready to load it up on the production database.
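For anyone curious what reading a dBase file involves, here's a minimal standard-library sketch. It is not the module mentioned above (which isn't named here), just an illustration of the classic dBase III layout: a 32-byte header, 32-byte field descriptors ending in a 0x0D byte, then fixed-width records. The function name read_dbf is made up, every field is treated as plain text, and memo files and numeric conversion are ignored.

```python
import struct


def read_dbf(data, encoding="ascii"):
    """Yield one dict per non-deleted record in a dBase III file image."""
    # Header: record count at byte 4, header size at 8, record size at 10.
    num_records, header_len, record_len = struct.unpack_from("<IHH", data, 4)

    # Field descriptors are 32 bytes each, starting at offset 32.
    # The descriptor list is terminated by a single 0x0D byte.
    fields = []
    pos = 32
    while data[pos] != 0x0D:
        name = data[pos:pos + 11].split(b"\x00")[0].decode(encoding)
        length = data[pos + 16]          # field width in bytes
        fields.append((name, length))
        pos += 32

    for i in range(num_records):
        start = header_len + i * record_len
        record = data[start:start + record_len]
        if record[0:1] == b"*":          # '*' marks a deleted record
            continue
        row = {}
        offset = 1                       # skip the deletion flag byte
        for name, length in fields:
            raw = record[offset:offset + length]
            row[name] = raw.decode(encoding).strip()
            offset += length
        yield row
```

With a real file you'd pass in something like open(path, "rb").read() and then loop over the rows, mapping each one onto whatever Freebase properties apply.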

Of course shortly after I got it all loaded the NPS released a new version of the database, so now I need to go back and update everything. That's OK though, because the first time around I'd only used the data to add types and properties to existing topics in Freebase (still over 10,000 topics with 100K+ facts). I hadn't created any new topics from scratch. This will be a good opportunity to load the entire database, at least to some level of significance (perhaps National and State, but not Local).

Another little project I did for Sunshine Week 2009 was to add the Congressional Biography IDs (aka Library of Congress THOMAS IDs) to all the U. S. politicians. This ID is used in the online versions of all the bills that go through Congress, so it is an important unique identifier.

Finally, another project which was just mentioned in the Freebase blog is my very first, very primitive Acre app, Untyped, which can be used to find topics that contain a specific keyword in their name but have no type assigned to them. Freebase is working hard to get as many topics as possible typed, so this tool can be used to help with that. Most of my other Freebase work has been done in Python, but this uses their new hosted templating engine. It's still a little rough around the edges, but has been improving a lot. Because it's hosted, you don't need to worry about running things on the Google App Engine or another hosting service.

More fun Freebase stuff in the pipe... Stay tuned!

Freebase Hack Day - Sat. July 11 San Francisco

Just two weeks until the Freebase Hack Day that Metaweb is running at their San Francisco headquarters. It's free and will feature unconference style discussions/presentations as well as general hacking. Read more about the goings on in this blog post. Although it's free, space is limited, so you'll need to register with Eventbrite (when it comes back up from its upgrade).

This is the second Hack Day they've held and I'll be in attendance for this one, so if you're going, give me a shout.

26 March 2009

Google Summer of Code 2009 (GSoC2009)

If you know any students who are interested in open source software, the Google Summer of Code is a great opportunity. Encourage them to apply. The application period is open now and ends April 3.

One thousand students will be paid $4500 by Google for a summer of working on open source projects and will be mentored by experienced open source developers. To my mind, the experience and mentoring are almost more valuable than the cash (although obviously that varies greatly depending on the economic situation of the student).

If you look at the list of projects, you'll see that there's something for every taste. Projects range from low-level bit banging in C on bare iron to bioinformatics to games to all sorts of so-called "social" apps, written in a wide variety of programming languages. Students and mentors come from almost one hundred different countries, so there's an enormous amount of diversity on that front as well.

I've been a mentor for three of the four years the program has been in existence (2006, 2007, 2008) and last year had the satisfaction of seeing one of my original students become a mentor himself. It's a lot of work, but very satisfying. Unfortunately my project won't be participating this year due to a combination of cutbacks at Google (about 10%) and a desire to rotate in new organizations, but the ArgoEclipse team would still love to mentor any new folks, students or other, who are interested in getting their feet wet with open source development.

11 March 2009

Freebase, open government, and enumerations

I'm preparing a short series of articles about Freebase, but Raymond Yee had a question about something I was working on over the weekend, so here's a quick hint to help him along.

What he calls "keys" are called "enumerated properties" in the Freebase documentation and there's an article on how to set them up. Unfortunately, the schema editor was broken when I was working on the National Register of Historic Places database schema, so I had to resort to reverse engineering things from the Explore view (accessible by pressing F8 on any page and scrolling to the bottom of the page) and then modifying the schema's property type by hand using their MQL query language. You can see the end result in the schema where item_number is typed as an enumeration.

There's also a good article on how to create a URL template that I used successfully to link to the original application submissions. For the Congressional Bioguide, it can be used to link back to the original biography.
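The idea behind a URL template is simple enough to sketch in a few lines of Python. Everything specific here is an assumption made for illustration: the "{key}" placeholder syntax, the Bioguide URL pattern, and the sample ID (which is made up) are not copied from the Freebase schema or documentation.

```python
from urllib.parse import quote

# Hypothetical template: the URL pattern and the "{key}" placeholder
# are assumptions for this example, not taken from the actual schema.
BIOGUIDE_TEMPLATE = (
    "http://bioguide.congress.gov/scripts/biodisplay.pl?index={key}"
)


def expand(template, key):
    """Substitute a URL-escaped key into the template's placeholder."""
    return template.replace("{key}", quote(key, safe=""))


if __name__ == "__main__":
    # "X000000" is a made-up ID used only to show the substitution.
    print(expand(BIOGUIDE_TEMPLATE, "X000000"))
```

The point is that the database only has to store the short key on each topic; the link back to the source record is derived from it on demand.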

Coincidentally and independently from Raymond's project, I was actually working on loading up all the Congressional Bioguide IDs last weekend because they are used in the XML form of legislation on THOMAS, which is run by the Library of Congress. I decided to take a slight detour to write a little name parser and Freebase name queryer in Python, so I haven't actually gotten around to loading the IDs yet. One of the biggest problems in working with Freebase is reliably resolving personal names. They typically only have the main name that was used as the Wikipedia article name. There's really no telling what name form the article's editors will have chosen, and even though the full name and some aliases are often identified in the opening sentence of the article, Freebase doesn't import this information from Wikipedia.
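To give a flavor of the problem, here's a minimal sketch of the kind of name handling involved. It is not the actual parser described above. It assumes the source names arrive in a "LAST, First Middle, Suffix" form (a guess at the input format), and the function names are invented for the example. The idea is to generate several plausible article-title forms and try each one against Freebase in turn.

```python
def parse_name(raw):
    """Split 'LAST, First Middle, Suffix' into its parts."""
    pieces = [p.strip() for p in raw.split(",")]
    # str.title() is a crude fix for all-caps surnames; it mangles
    # names like "McCAIN", which would need special handling.
    last = pieces[0].title()
    given = pieces[1].split() if len(pieces) > 1 else []
    suffix = pieces[2] if len(pieces) > 2 else ""
    return {
        "last": last,
        "first": given[0] if given else "",
        "middle": given[1:],
        "suffix": suffix,
    }


def name_variants(parts):
    """List plausible article-title forms, most specific first."""
    first, last, middle = parts["first"], parts["last"], parts["middle"]
    variants = []
    if middle:
        # Full middle names, then middle initials only.
        variants.append(" ".join([first] + middle + [last]))
        initials = [m[0] + "." for m in middle]
        variants.append(" ".join([first] + initials + [last]))
    variants.append(f"{first} {last}".strip())
    if parts["suffix"]:
        # Repeat every form with the suffix appended.
        variants += [f"{v}, {parts['suffix']}" for v in list(variants)]
    return variants


if __name__ == "__main__":
    for candidate in name_variants(parse_name("KENNEDY, John Fitzgerald")):
        print(candidate)
```

Each candidate would then be looked up by name (and ideally filtered by type, such as politician) until exactly one topic matches; anything with zero or multiple matches goes to a human.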