03 October 2012

Pywikipedia and author identifiers in Wikipedia

I'll admit that sounds like a strange combination of topics, but I'll explain.  I recently saw a mention of VIAF identifiers being added to Wikipedia articles.  That's interesting because VIAF is a union catalog of the world's major libraries' "authority files" (their term for a list of names they control) and using a VIAF id would allow you to bridge to all the constituent catalogs.  It's also one of the identifiers associated with Freebase topics.

When I looked into it, I discovered that Wikipedians had already been adding Library of Congress identifiers, so the VIAF id was just an incremental improvement.  The VIAF additions were supposed to have been done in August, so I wanted to see how many there were compared to the older LC identifiers.

In the past I've written little custom Python programs to query information like this, but I recently came across pywikipedia (aka PyWikipediaBot) which is perfect for tasks like this.  It's got a program which will count template transclusion (ie pages that include a template) as a standard component.  You give it the name of a template, tell it whether you want a list or just a count and it'll query the Wikipedia API to get your results.

$ python templatecount.py -count -namespace:0 Authority_control
Getting references to [[Template:Authority control]] via API...
...
Number of transclusions per template
------------------------------------
Authority_control: 5183

Hmm, that's not as many as I'd hoped.  We selected namespace 0 to restrict our count to the main articles as opposed to talk pages, user pages, etc.  If we replace -count with -list, we can get a list of all the articles. The first time you run any of the pywikipedia tools it'll ask you a few questions to establish defaults for wiki family (wikipedia, wikitravel, etc), language, username, etc, but these can all be overridden on the command line.

The tool allows you to qualify a template name with a parameter name, so we can look at the breakdown between VIAF and LCCN.

$ python templatecount.py -count -namespace:0 \  
    Authority_control/VIAF Authority_control/LCCN
Authority_control/VIAF: 3569
Authority_control/LCCN: 4122

So it looks like there are roughly equal numbers of each and, based on the total count, most templates probably contain both.

One of the things that I noticed when looking at the Template:Authority_control documentation is that Normdaten is an alias for it and looking at the counts shows it's actually used.

$ python templatecount.py -count -namespace:0 Normdaten
Normdaten: 1227

That's interesting.  I wonder what the story behind that is?  Naturally the mind immediately wanders to German Wikipedia.  I wonder if that template is used there and, if so, how frequently.  Fortunately for us, the tool can query a different Wikipedia with the flick of a switch by adding -lang:de.

$ python templatecount.py -count -namespace:0 -lang:de Normdaten

Normdaten: 254890

Wow, a quarter million identifiers! That's more like what I was hoping for.  German Wikipedia is much further ahead in adding strong identifiers to their articles.  They started with a big push in 2010 and have been steadily adding them ever since as you can see from this graph.

Strong idenifiers in German Wikipedia

Next up -- how to actually retrieve template parameter values...

02 October 2012

Citizen Curation of Smithsonian Metadata

Museums and cultural institutions have started releasing their catalog metadata under more liberal licenses recently.  Culturally this is a big change for them because traditionally they are hoarders and "guardians of the truth," so it's hard for them to give up control.

There are enormous benefits to sharing their metadata though and, as they're gradually coming to realize, they'll receive back more than they give.  One of the first museums to do this was the Powerhouse Museum who's collection metadata was used as an example for the Free Your Metadata project.

Two big benefits of having your metadata publicly available are that 1) it can be linked to other sources to find interesting connections in other collections, archives, etc and 2) you can get other people to help you clean it up and correct it (bonus!).

The Cooper-Hewitt National Design Museum, part of the Smithsonian, is one of the latest to make its collection data available online for anyone to reuse under a CC0 license.  A couple of weeks ago they released a concordance of their people matched up with entries in other data sets such as Wikipedia, Freebase, the Library of Congress Name Authority File, etc.

They describe it as a pre-alpha release (we'll see why in a minute) and explicitly ask for help in cleaning it up, so let's take a look at how we might help out as volunteer citizen metadata curators.

The looking at the people, we can see that it include "corporate persons" (e.g. IBM) and has not only artists, but clients, manufacturers and anyone who might have a role in the creation or provenance of a design artefact.  There are 4,869 entries in their database, but the concordance only contains the 1,507 which have been matched against Wikipedia.  There are 725 links to Freebase, 222 to the Museum of Modern Art, 113 to the Virtual International Authority File (VIAF, a union catalog of authors by the world's libraries), etc.

Let's look at an example page on their web site.  The page for Frederic Edwin Church includes links to both Wikipedia and Freebase.  If we follow the Freebase link, we see that it has additional links including not only Wikipedia, but the Library of Congress, New York Times, VIAF, etc.  It also has a large amount of structured data in both human and machine-readable form including his works, exhibitions of his work, books about him, his teacher and a student.  It is this network of data that we're connecting to when we attempt to link Cooper-Hewitt entries to this web of links.

Down to work!  We're going to use Refine (aka Google Refine née Freebase Gridworks).  You can download it for Windows, Mac, or Linux here.  When you start it, it will open a page in your web browser where you can create a new project.  We're working with public data here, but it's a desktop app and all the data stays local, so if we were working on private data, we wouldn't need to worry about it escaping into the cloud.  Create a new project from the CSV file and turn off Parse cell text into numbers, dates, ... so that the identifiers with leading zeros don't have them changed.


Let's do a quick analysis of what columns are populated.  The spreadsheet uses the signal value 0 in many of the columns to indicate something is which doesn't have an ID, but uses blank for this in the Freebase column  Let's get convert all the 0s to blanks to make them the same. Select Edit for the first cell in the viaf:id column and delete the 0, then click Apply to All Identical Cells. All the 0s in the column are gone!  Repeat the process for the other columns.



We'll look up the Wikipedia IDs in Freebase to see what they refer to.  Because Freebase includes all the Wikipedia IDs, we can do the lookup by ID instead of name.  Here's how:
  • Select Reconcile->Start reconciling...
  • Choose Freebase Query-based Reconciliation
    • select the a Freebase key in option
    • select the this namespace option
    • fill in the namespace identifier /wikipedia/en_id which represents the numeric identifier associated with the English Wikipedia article
When everything is filled in, click Start Reconciling.  In less than a minute, you'll have all 1507 IDs linked to their corresponding Freebase topic.  There is another namespace with the ID /wikipedia/en which uses the title from the Wikipedia URL if you ever need to match that way.  Very handy for enhancing someones list of Wikipedia links with additional information!'

There are now two new facets on the left side of the screen.  Looking at the facet labeled wikipedia:id judgement, we can see that not all IDs were matched up.  If we click none to select only those entries which didn't match and check out a few of them on the Cooper-Hewitt site, we see that they are things like disambiguation pages, redirects, etc.  Freebase doesn't include these because they don't represent a single thing.  We don't know for sure, but perhaps the matching process used by the museum only took into account name matches without checking the type of the thing being matched.



Let's delete all these bad matches.  It's possible that the correct link is on that disambiguation page and could be found by a human, but for our purposes it's easier to just put all of these back in the to-b-processed bin and deal with them later.

  • Make sure the none entry is selected in the wikipedia:id judgement facet and nothing is selected in any other facet
  • in the wikipedia:id  menu select Edit cells->Common Transforms->Blank out cells

As a sanity check, let's make sure that the given Freebase IDs all match their corresponding Wikipedia IDs.  Now that our wikipedia:id column is linked to Freebase, we can look up any other value that Freebase knows.  Let's add a column containing the Freebase Machine ID.
  • Select Edit column -> Add columns from Freebase...
  • In the Add property input field type Machine ID and choose /type/object/mid
  • confirm that the preview display shows the MIDs and then click OK
Now we'll configure our facets so that we can see if the values in our newly added column match those that existed in the concordance.
  • In the freebase:id column menu, choose Facet->Customized Facets->Facet by blank
  • Select false in the newly created facet (ie non-blank values)
  • In same column, choose Facet->Custom Text Facet... and use the expression: grel:value==cells['Machine ID'].value


We see that all 782 non-blank values match.  So far, so good.  Click Remove All to clear the facets.

You may have noticed that we gained an extra 263 rows when we added the new column.  In Freebase a topic can have multiple MIDs if it was merged during its lifetime.  This allows existing references to continue to work.  It doesn't matter which one we pick and since Cooper-Hewitt appears to have chosen the first one, we'll delete the extra rows with others.
  • Select (blank) in the wikipedia:id judgement facet.  This column was fully populated before, so this selects all the new rows.
  • In the All column menu choose Edit rows-Remove all matching rows
Now let's add a some value. There are 462 entries which have Wikipedia IDs, but no Freebase ID.  Let's fill those in.  
  • Select Edit cells->Transform... from the column menu for the freebase:id column
  • Use the expression cells['Machine ID'].value
  • After confirming the results we can remove the Machine ID column Edit column->Remove this column
As a sanity check, let's see if the names of the Freebase topics match the names in the ch:name column.  We'll use a Custom Text facet on the wikipedia:id column (which has been reconciled against Freebase) with the expression:  cell.recon.match.name.unescape('html')==cells['ch:name'].value


to compare the values in the two columns.  This takes the name of the matched Freebase topic, unescapes the HTML entities (a workaround for a display quirk) and compares the result to the value in the other column.  We've got 898 exact matches, 346 mismatches, and 263 topics which aren't in Freebase for some reason.

If we take a look at the mismatches, we can see a variety of types of differences.  There's Création Baumann for Création Baumann AG and Honeywell instead of Honeywell, Inc.  because the Freebase standard is to not include suffixes like Inc., Co., AG, SARL, etc.  There are minor capitalization differences like Jean le Pautre for Jean Le Pautre and variant spellings like Wenceslas Hollar instead of Wenceslaus Hollar.

Other differences seem more significant and probably deserve closer review.  For example, Henry Dreyfuss Associates clearly isn't the same thing as Henry Dreyfuss himself (although in the world of Wikipedia they could easily be discussed in the same article).  Ditto Chermayeff & Geismar as compared to Ivan Chermayeff.

One way to tackle this is to attempt to reconcile the 600 or so mismatches and non-matches to see if we can match them up with a Freebase topic.  We'll just use the name as given by Cooper-Hewitt and see what we get for matches.  We can then compare the two sets of IDs to see where the problems lie.
  • In the wikipedia:id judgement  facet, select both false and (error) (click include in the hover menu to select the second entry)
  • In the ch:name column, choose Reconcile->Start reconciling
  • Select the Freebase Reconciliation Service
  • Select no particular type (another possible strategy would be to first reconcile against Visual Artist then Organization, etc, but we'll take our chances and try to do them all at once)
When we're reconciling against Freebase, we may occasionally run across duplicates such as Edward Fella and Ed Fella.  We can flag these to be merged on Freebase.  This will queue them up for voting by the Freebase community and as soon as three people agree, the merge will be processed.  Of course, we have to make sure that we don't confuse something like the book Edward Fella with the person.

Freebase isn't limited to topics which have articles in Wikipedia, so we'll find some additional matches such as Ronan or Erwan Bouroullec who have an entries with their Netflix and IMDB IDs because of their appearance in the movie Objectified.

Freebase also allows us to create entries which might be missing.  For example, there's an entry for Donald Deskey, but none for his firm Deskey-Volmer.  We can not only create it, but also link the new entry to both founders.

Of course the fact that the names match doesn't necessarily mean that we've got the correct entry.  The singer Beck didn't manufacter the lens for this old camera and Elizabeth I wasn't the 12 year old who embroidered this sampler in 1834.  We've done enough for now, so we'll come back in a future installment and try to clean up more of this.

The step for now is to publish our improvements so that others can take advantage of them.  We'll export our improved data as a CSV from Refine and commit it to our fork on Github.  If you're rather start from where I left off, feel free to clone my copy of the data and improve it some more.

In future posts, we'll look at some additional cleanup and enhancement strategies as well as examine some interesting things that we can do with the interconnections that have been created.

You can discuss and provide feedback on Google+

17 July 2010

First thoughts on Google acquisition of Metaweb

Yesterday Google acquired Metaweb, owners of Freebase, for an undisclosed price in a cash transaction which has already closed. The sixty or so employees moved out of their old offices Friday afternoon and will be starting in the Google SF offices on Monday. I'm sure everyone is relieved to be staying in San Fran rather than having to trek down to the Googleplex.

This follows Google's acquisition of ITA for $700 million at the beginning of July which will not only bolster their capabilities in the travel vertical, but also includes the Needle database and Thread query language technology as well as some back end web scraping technology to harvest data to feed it. (I should do a separate post on Needle based on my notes from their presentation at the Cambridge Semantic Web meetup.) It'll certainly be interesting to see how these two new acquisitions fit together with existing efforts like Google Squared (which already uses Freebase). See for example these views of Kurt Vonnegut's books on Squared and Freebase.

Google's director of product management for search, Jack Menzel, wrote in the Metaweb announcement that they are interested in enhancing search through a "deeper understanding" (i.e. "semantics") of queries and web pages. Of course the Semantic Web folks immediately claimed the news as validation of their decade of work, but I don't think it's that simple. It'll be some time before it's clear what Google was after with this acquisition and how they'll use it.

What are some of the things that Google might have been interested in?

People - Metaweb has some bright engineers working in a variety of areas include their proprietary graph store ('graphd'), data mining, machine learning, semantic web, alternative UIs, etc. They already hired one of the graphd engineers a few months ago and may have decided to get the rest of the engineers in one go instead of piecemeal.

Technology - There are a number of interesting technology components, some visible and some not:
  • graphd - their home-grown graph database
  • Metaweb Query Language (MQL) - a JSON-based query-by-example style query language
  • Acre - a server-side Javascript application development environment and hosting service
  • Wikipedia import pipeline - extracts data from infoboxes and text from articles
  • entity reconciliation - backroom Hadoop based technology used to reconcile data sets and do graph merges
Patents - Metaweb has a number of patents and patent applications which could be of interest to Google. This post contains a list of some of them. They range from early Hillis patents covering the concept of a "meta" or "knowledge" web to more recent ones on graphd technology.

Bing chaos - Microsoft bought Powerset a couple of years ago and uses the technology in Bing. At the time Powerset used Freebase data. Perhaps messing with prime competitor in search held some attraction for Google.

Freebase - Freebase is Metaweb's collaboratively maintained data wiki which was bootstrapped with Wikipedia data, but now also includes information from MusicBrainz, Open Library, and a number of other public domain data sources as well as cross-links to less liberally licensed databases like IMDB, NNDB, NY Times, etc. Although much of the data is available in their data dumps, not all of it is and many interesting analyses can only be done on the full data set.

There are some interesting views in the comments posted on Techcrunch article. Read/Write Web and GigaOM also has a pieces. I agree with the view that this was likely a relatively cheap deal that went at a low multiple of the $57 million that Metaweb had raised. It's a good deal for both parties because Google got good people and good technology at a cheap price and the VCs got an exit for a company that had yet to figure out a business model without having to pump more money in to sustain them until they did.

Time will tell what impact this will have on Freebase and, more generally, open data and the semantic web communities. Google said that it plans "to maintain Freebase as a free and open database for the world" as well as "contribute to and further develop Freebase," but this could be done at broad range of investment levels with a corresponding range of outcomes.

From a personal point of view, I'd like to see Freebase survive not only because I've contributed 1.4 million facts to it, but because I think its model of collaborative schema development and strict reconciliation has some advantages over the distributed "anyone can say anything" model which is more popular in the academic/W3C Semantic Web space. I also think the combination of machine-based and human reconciliation has huge potential that Metaweb had only barely begun to scratch the surface of. If Freebase withers, it'd be tempting to recreate it. The barrier to entry is much lower with today's technology than it was when Metaweb was first starting.

I've got a lot of ideas for synergy among Google, Metaweb, and ITA as well as some thoughts on the implications for current Freebase app developers, but this is long enough, so I'll save those for separate posts.

29 March 2010

Thoughts on Metaweb business strategy

Metaweb hasn't announced its new strategy yet, but supposedly will soon, so I'm writing down my suggestions in advance, so we can compare and contrast when it appears. Just to be clear, this is not based on any insider knowledge of any kind and does not represent the views of Metaweb Technologies Inc.

The Metaweb (or Freebase) business strategy has always been a bit of an enigma. They said they were building "The World's Database" and would charge for something later, although it hasn't be clear what.

So what would I do? Here are some thoughts (on how to develop the strategy, rather than the strategy itself):
  • Hire (or promote) a Director of Product Management - Not because that's what I do, but because, while they've had good product management in individual areas like their custom app dev environment, they've been hugely stovepiped and don't appear to have an overall product strategy. The product strategy is clearly going to be driven by the executive team and board in a startup, but someone has to be in charge of focusing the discussion in a way that will produce a concrete and implementable strategy, implementing that strategy, and then revising it based on real world customer feedback.
  • Focus - They've done everything from their own database engine and query language (arguably a competitive differentiator), to their own bulletin board system (definitely not!) to a complete development environment with its own version control. A startup can't afford the same expansive vertical integration strategy that an IBM or HP pursues.
    Focus is key. They need to focus only on those things which are absolutely critical to success and survival. The generous initial funding ($57M to date with a $42M tranche two years ago), may have actually been a curse in this regard.
  • Holistic view - Metaweb appears to consider their various software components, their data integration efforts, the resulting data, their volunteer community, and their (potential) commercial customers as independent things which can be optimized separately when they're all inextricably linked, to one degree or another, to each other. It doesn't matter how pretty widgets are if, when I link to Boston from my family-oriented site, the default page shows it as the filming location for the porno flick Slave Workshop Boston.
  • Customer Engagement - The only place to tell whether you're winning, losing, or standing still is in the marketplace. More customer involvement is critical. Both to refine product & service requirements as well as to generate design wins that can be used for marketing.
  • Developer Ecosystem - A vibrant developer community is critical to success. Building this means not only providing the right libraries and tools, but recruiting the developers, training them, and making them successful. This doesn't mean huge corporate machinery is required, but it needs to be a dedicated, ongoing goal for someone. If you look at successful developer programs, non-code assets and processes are at least as critical as the raw developer tools. The business side can't be ignored either.
  • Evangelism - Most or all of the marketing staff was apparently let go in late 2008/early 2009 and marketing seems to have been an occasional, part time effort of people with other jobs since then. That doesn't work. Metaweb is, at its core, an engineering company and most engineers have a severe allergy to marketing, but, having done a lot of both marketing and engineering, I know each is critical. They have a technical product set with new concepts in an emerging market, so it's going to be a very technical sell, but it's still marketing. Someone needs to have it as their real job (and get measured on it).
    • Standards strategy - Metaweb has never said anything about what their standards strategy is or how they see their technologies relating to thos of the W3C. There's certainly a lot to dislike about some of the W3C choices, but an ugly standard is still a standard. Metaweb did implement RDF publishing support last year, but they need to say more about their long term strategy.
    • W3C/Semantic web community - Perhaps the W3C is just naturally opposed to any type of commercialism, but establishing a better relationship would be useful to both parties. Having someone of Tim Berners-Lee's visibility diss you at a venue as prominent as TED 2009, where he completely glossed over Freebase's role as one of the largest publishers of linked data, isn't good.
    • Open Source - The company has a number of open source projects, but doesn't talk much about its open source strategy. At the very least, it should claim credit for the things it does and have an easily accessible list of open source projects it contributes to.
  • Brand - They've finally realized just how misguided the choice of Freebase was (it's the only Google Alert where I need to add -c*caine to the search terms) and appear to be backing away from that brand name, as well as its associated garish orange livery and flag waving rhino logo. While there's a good case for using a single brand for both a startup and its products, I'm not sure Metaweb is the right brand since it has generic meanings and usages as well. I'd investigate establishing a new brand for the product family.
  • Human/machine synergy - I put this last, because it's not a short-term thing, but it represents huge potential for the future, in my opinion. It's an area that Metaweb is uniquely positioned to exploit, which makes it all the frustrating that they haven't made more progress on this front. The synergy between machine-based data reconciliation processes and crowd-sourced processes could create a virtuous feedback loop where machines do the drudge work and humans decide the edge cases, in the process providing training data to refine the classifiers and info extraction algorithms. They've only taken the smallest baby steps so far, but I believe this area has huge potential for those who learn to exploit this synergy effectively.

28 March 2010

Freebase Gridworks data curation and cleanup tool

I've been alpha testing the Freebase Gridworks tool from Metaweb, but haven't been able to talk about it until now. Since they just announced it, I guess it's no longer a secret.

Research scientist David Huynh has been interested in collective data operations since his days at the MIT CSAIL Simile project. You can see collective editing in this 2007 Potluck screencast. Jon Udell called this "stunning." After David moved to Metaweb, his 2008 Parallax demo showed the power of collective operations for browsing Freebase data (and UCG's DERI group forked a SPARQL version called SParallax).

The Gridworks tool is another riff on that same collective operations theme, but this time focused on data cleanup and reconciliation rather than mashups or browsing. There's a lot more to it than what you see in the screencasts (and, naturally, some limitations which are glossed over as well), but while it's still in testing I'll reserve any detailed discussion of features. Suffice it to say though, that the anticipatory buzz in the Twitter-sphere is justified. What remains to be seen is how well they'll follow through on completing the tool, as well as integrating it with the various types of data sources & sinks which are of interest to users.

From a selfish point of view, I'd like to see people use tools like this to contribute to the availability of cleaned up public data sets rather than just using it to clean their private data silos. Of course, convincing people to do that is a much bigger problem -- one which the whole Linked Data / Semantic Web community has yet to come up with a compelling answer for.