17 July 2010

First thoughts on Google acquisition of Metaweb

Yesterday Google acquired Metaweb, owners of Freebase, for an undisclosed price in a cash transaction which has already closed. The sixty or so employees moved out of their old offices Friday afternoon and will be starting in the Google SF offices on Monday. I'm sure everyone is relieved to be staying in San Fran rather than having to trek down to the Googleplex.

This follows Google's acquisition of ITA for $700 million at the beginning of July, which will not only bolster their capabilities in the travel vertical, but also bring them the Needle database and Thread query language technology as well as some back end web scraping technology to harvest data to feed it. (I should do a separate post on Needle based on my notes from their presentation at the Cambridge Semantic Web meetup.) It'll certainly be interesting to see how these two new acquisitions fit together with existing efforts like Google Squared (which already uses Freebase). See for example these views of Kurt Vonnegut's books on Squared and Freebase.

Google's director of product management for search, Jack Menzel, wrote in the Metaweb announcement that they are interested in enhancing search through a "deeper understanding" (i.e. "semantics") of queries and web pages. Of course the Semantic Web folks immediately claimed the news as validation of their decade of work, but I don't think it's that simple. It'll be some time before it's clear what Google was after with this acquisition and how they'll use it.

What are some of the things that Google might have been interested in?

People - Metaweb has some bright engineers working in a variety of areas, including their proprietary graph store ('graphd'), data mining, machine learning, semantic web, alternative UIs, etc. Google already hired one of the graphd engineers a few months ago and may have decided to get the rest of the engineers in one go instead of piecemeal.

Technology - There are a number of interesting technology components, some visible and some not:
  • graphd - their home-grown graph database
  • Metaweb Query Language (MQL) - a JSON-based query-by-example style query language
  • Acre - a server-side Javascript application development environment and hosting service
  • Wikipedia import pipeline - extracts data from infoboxes and text from articles
  • entity reconciliation - backroom Hadoop based technology used to reconcile data sets and do graph merges
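
For readers who haven't seen the query-by-example style, here is a rough sketch of the idea in Python. You write a JSON object shaped like the answer you want, fill in the fields you know, and leave the rest blank for the system to fill in. This is only a toy matcher over a made-up in-memory list, not how graphd or the real MQL service works; the `RECORDS` data and the `query_by_example` helper are hypothetical, and the `/book/author` and `works_written` names are used here as illustrative assumptions about the schema.

```python
import json

# Toy in-memory "graph": hypothetical records for illustration only,
# not taken from a real Freebase data dump.
RECORDS = [
    {"type": "/book/author", "name": "Kurt Vonnegut",
     "works_written": ["Slaughterhouse-Five", "Cat's Cradle"]},
    {"type": "/book/author", "name": "Joseph Heller",
     "works_written": ["Catch-22"]},
]


def query_by_example(query, records):
    """Return one result per record matching the filled-in fields of `query`.

    Fields set to None or [] are blanks: they place no constraint on the
    match and are filled in from each matching record.
    """
    # Only fields with a concrete value act as constraints.
    constraints = {k: v for k, v in query.items() if v not in (None, [])}
    results = []
    for record in records:
        if all(record.get(k) == v for k, v in constraints.items()):
            # Shape the answer like the query: same keys, blanks filled in.
            results.append({k: record.get(k) for k in query})
    return results


# "Give me the works written by the author named Kurt Vonnegut."
query = {"type": "/book/author", "name": "Kurt Vonnegut", "works_written": []}
print(json.dumps(query))
print(query_by_example(query, RECORDS))
```

The appeal of this style is that the query and the result have the same shape, so an app developer can read the answer straight out of the structure they asked for.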

Patents - Metaweb has a number of patents and patent applications which could be of interest to Google. This post contains a list of some of them. They range from early Hillis patents covering the concept of a "meta" or "knowledge" web to more recent ones on graphd technology.

Bing chaos - Microsoft bought Powerset a couple of years ago and uses the technology in Bing. At the time Powerset used Freebase data. Perhaps messing with its prime competitor in search held some attraction for Google.

Freebase - Freebase is Metaweb's collaboratively maintained data wiki which was bootstrapped with Wikipedia data, but now also includes information from MusicBrainz, Open Library, and a number of other public domain data sources as well as cross-links to less liberally licensed databases like IMDB, NNDB, NY Times, etc. Although much of the data is available in their data dumps, not all of it is and many interesting analyses can only be done on the full data set.

There are some interesting views in the comments posted on the TechCrunch article. Read/Write Web and GigaOM also have pieces. I agree with the view that this was likely a relatively cheap deal that went at a low multiple of the $57 million that Metaweb had raised. It's a good deal for both parties because Google got good people and good technology at a cheap price and the VCs got an exit for a company that had yet to figure out a business model without having to pump more money in to sustain them until they did.

Time will tell what impact this will have on Freebase and, more generally, open data and the semantic web communities. Google said that it plans "to maintain Freebase as a free and open database for the world" as well as "contribute to and further develop Freebase," but this could be done at a broad range of investment levels with a corresponding range of outcomes.

From a personal point of view, I'd like to see Freebase survive not only because I've contributed 1.4 million facts to it, but because I think its model of collaborative schema development and strict reconciliation has some advantages over the distributed "anyone can say anything" model which is more popular in the academic/W3C Semantic Web space. I also think the combination of machine-based and human reconciliation has huge potential that Metaweb has only begun to tap. If Freebase withers, it'd be tempting to recreate it. The barrier to entry is much lower with today's technology than it was when Metaweb was first starting.

I've got a lot of ideas for synergy among Google, Metaweb, and ITA as well as some thoughts on the implications for current Freebase app developers, but this is long enough, so I'll save those for separate posts.
