17 July 2010

First thoughts on Google acquisition of Metaweb

Yesterday Google acquired Metaweb, owners of Freebase, for an undisclosed price in a cash transaction which has already closed. The sixty or so employees moved out of their old offices Friday afternoon and will be starting in the Google SF offices on Monday. I'm sure everyone is relieved to be staying in San Fran rather than having to trek down to the Googleplex.

This follows Google's acquisition of ITA for $700 million at the beginning of July, which not only bolsters their capabilities in the travel vertical but also brings the Needle database and Thread query language technology, as well as some back end web scraping technology to harvest data to feed it. (I should do a separate post on Needle based on my notes from their presentation at the Cambridge Semantic Web meetup.) It'll certainly be interesting to see how these two new acquisitions fit together with existing efforts like Google Squared (which already uses Freebase). See for example these views of Kurt Vonnegut's books on Squared and Freebase.

Google's director of product management for search, Jack Menzel, wrote in the Metaweb announcement that they are interested in enhancing search through a "deeper understanding" (i.e. "semantics") of queries and web pages. Of course the Semantic Web folks immediately claimed the news as validation of their decade of work, but I don't think it's that simple. It'll be some time before it's clear what Google was after with this acquisition and how they'll use it.

What are some of the things that Google might have been interested in?

People - Metaweb has some bright engineers working in a variety of areas including their proprietary graph store ('graphd'), data mining, machine learning, semantic web, alternative UIs, etc. Google already hired one of the graphd engineers a few months ago and may have decided to get the rest of the engineers in one go instead of piecemeal.

Technology - There are a number of interesting technology components, some visible and some not:
  • graphd - their home-grown graph database
  • Metaweb Query Language (MQL) - a JSON-based query-by-example style query language
  • Acre - a server-side Javascript application development environment and hosting service
  • Wikipedia import pipeline - extracts data from infoboxes and text from articles
  • entity reconciliation - backroom Hadoop based technology used to reconcile data sets and do graph merges
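To give a flavor of what "query-by-example" means for MQL, here is a minimal Python sketch of the idea: you write a JSON-like template shaped like the answer you want, fill in the values you know, and leave blanks for the values you want back. This is a toy matcher over an in-memory list, not Metaweb's implementation or API; the records, the property names, and the `query_by_example` and `is_blank` helpers are all assumptions made up for illustration.

```python
# Toy query-by-example matcher. An illustration of the idea only,
# not Metaweb's MQL implementation or API.

# Hypothetical in-memory records; the property names are illustrative.
RECORDS = [
    {"type": "/book/author", "name": "Kurt Vonnegut",
     "works_written": ["Slaughterhouse-Five", "Cat's Cradle"]},
    {"type": "/book/author", "name": "Joseph Heller",
     "works_written": ["Catch-22"]},
]


def is_blank(value):
    """A blank slot (None or an empty list) means 'fill this in for me'."""
    return value is None or value == []


def query_by_example(template, records):
    """Return the records matching the template's filled-in values,
    projected down to just the keys the template mentions."""
    results = []
    for record in records:
        matches = all(
            key in record and (is_blank(wanted) or record[key] == wanted)
            for key, wanted in template.items()
        )
        if matches:
            results.append({key: record[key] for key in template})
    return results


# The query looks like the answer, with the unknown part left empty.
query = {"type": "/book/author", "name": "Kurt Vonnegut", "works_written": []}
for row in query_by_example(query, RECORDS):
    print(row["name"], "->", row["works_written"])
```

The appeal of the style is that the query and the result have the same shape, so an app developer can read the response with the same code that built the request.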

Patents - Metaweb has a number of patents and patent applications which could be of interest to Google. This post contains a list of some of them. They range from early Hillis patents covering the concept of a "meta" or "knowledge" web to more recent ones on graphd technology.

Bing chaos - Microsoft bought Powerset a couple of years ago and uses the technology in Bing. At the time Powerset used Freebase data. Perhaps messing with its prime competitor in search held some attraction for Google.

Freebase - Freebase is Metaweb's collaboratively maintained data wiki which was bootstrapped with Wikipedia data, but now also includes information from MusicBrainz, Open Library, and a number of other public domain data sources as well as cross-links to less liberally licensed databases like IMDB, NNDB, NY Times, etc. Although much of the data is available in their data dumps, not all of it is and many interesting analyses can only be done on the full data set.

There are some interesting views in the comments posted on the TechCrunch article. Read/Write Web and GigaOM also have pieces. I agree with the view that this was likely a relatively cheap deal that went at a low multiple of the $57 million that Metaweb had raised. It's a good deal for both parties because Google got good people and good technology at a cheap price and the VCs got an exit for a company that had yet to figure out a business model without having to pump more money in to sustain it until it did.

Time will tell what impact this will have on Freebase and, more generally, the open data and semantic web communities. Google said that it plans "to maintain Freebase as a free and open database for the world" as well as "contribute to and further develop Freebase," but this could be done at a broad range of investment levels with a corresponding range of outcomes.

From a personal point of view, I'd like to see Freebase survive not only because I've contributed 1.4 million facts to it, but because I think its model of collaborative schema development and strict reconciliation has some advantages over the distributed "anyone can say anything" model which is more popular in the academic/W3C Semantic Web space. I also think the combination of machine-based and human reconciliation has huge potential that Metaweb had only barely begun to scratch the surface of. If Freebase withers, it'd be tempting to recreate it. The barrier to entry is much lower with today's technology than it was when Metaweb was first starting.

I've got a lot of ideas for synergy among Google, Metaweb, and ITA as well as some thoughts on the implications for current Freebase app developers, but this is long enough, so I'll save those for separate posts.

29 March 2010

Thoughts on Metaweb business strategy

Metaweb hasn't announced its new strategy yet, but supposedly will soon, so I'm writing down my suggestions in advance, so we can compare and contrast when it appears. Just to be clear, this is not based on any insider knowledge of any kind and does not represent the views of Metaweb Technologies Inc.

The Metaweb (or Freebase) business strategy has always been a bit of an enigma. They said they were building "The World's Database" and would charge for something later, although it hasn't been clear what.

So what would I do? Here are some thoughts (on how to develop the strategy, rather than the strategy itself):
  • Hire (or promote) a Director of Product Management - Not because that's what I do, but because, while they've had good product management in individual areas like their custom app dev environment, they've been hugely stovepiped and don't appear to have an overall product strategy. The product strategy is clearly going to be driven by the executive team and board in a startup, but someone has to be in charge of focusing the discussion in a way that will produce a concrete and implementable strategy, implementing that strategy, and then revising it based on real world customer feedback.
  • Focus - They've done everything from their own database engine and query language (arguably a competitive differentiator), to their own bulletin board system (definitely not!) to a complete development environment with its own version control. A startup can't afford the same expansive vertical integration strategy that an IBM or HP pursues.
    Focus is key. They need to focus only on those things which are absolutely critical to success and survival. The generous initial funding ($57M to date with a $42M tranche two years ago) may have actually been a curse in this regard.
  • Holistic view - Metaweb appears to consider their various software components, their data integration efforts, the resulting data, their volunteer community, and their (potential) commercial customers as independent things which can be optimized separately when they're all inextricably linked, to one degree or another, to each other. It doesn't matter how pretty widgets are if, when I link to Boston from my family-oriented site, the default page shows it as the filming location for the porno flick Slave Workshop Boston.
  • Customer Engagement - The only place to tell whether you're winning, losing, or standing still is in the marketplace. More customer involvement is critical, both to refine product & service requirements and to generate design wins that can be used for marketing.
  • Developer Ecosystem - A vibrant developer community is critical to success. Building this means not only providing the right libraries and tools, but recruiting the developers, training them, and making them successful. This doesn't mean huge corporate machinery is required, but it needs to be a dedicated, ongoing goal for someone. If you look at successful developer programs, non-code assets and processes are at least as critical as the raw developer tools. The business side can't be ignored either.
  • Evangelism - Most or all of the marketing staff was apparently let go in late 2008/early 2009 and marketing seems to have been an occasional, part time effort of people with other jobs since then. That doesn't work. Metaweb is, at its core, an engineering company and most engineers have a severe allergy to marketing, but, having done a lot of both marketing and engineering, I know each is critical. They have a technical product set with new concepts in an emerging market, so it's going to be a very technical sell, but it's still marketing. Someone needs to have it as their real job (and get measured on it).
    • Standards strategy - Metaweb has never said anything about what their standards strategy is or how they see their technologies relating to those of the W3C. There's certainly a lot to dislike about some of the W3C choices, but an ugly standard is still a standard. Metaweb did implement RDF publishing support last year, but they need to say more about their long term strategy.
    • W3C/Semantic web community - Perhaps the W3C is just naturally opposed to any type of commercialism, but establishing a better relationship would be useful to both parties. Having someone of Tim Berners-Lee's visibility diss you at a venue as prominent as TED 2009, where he completely glossed over Freebase's role as one of the largest publishers of linked data, isn't good.
    • Open Source - The company has a number of open source projects, but doesn't talk much about its open source strategy. At the very least, it should claim credit for the things it does and have an easily accessible list of open source projects it contributes to.
  • Brand - They've finally realized just how misguided the choice of Freebase was (it's the only Google Alert where I need to add -c*caine to the search terms) and appear to be backing away from that brand name, as well as its associated garish orange livery and flag waving rhino logo. While there's a good case for using a single brand for both a startup and its products, I'm not sure Metaweb is the right brand since it has generic meanings and usages as well. I'd investigate establishing a new brand for the product family.
  • Human/machine synergy - I put this last, because it's not a short-term thing, but it represents huge potential for the future, in my opinion. It's an area that Metaweb is uniquely positioned to exploit, which makes it all the more frustrating that they haven't made more progress on this front. The synergy between machine-based data reconciliation processes and crowd-sourced processes could create a virtuous feedback loop where machines do the drudge work and humans decide the edge cases, in the process providing training data to refine the classifiers and info extraction algorithms. They've only taken the smallest baby steps so far, but I believe this area has huge potential for those who learn to exploit this synergy effectively.
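The human/machine feedback loop in that last bullet can be sketched in a few lines of Python. This is only a rough illustration of the idea, not anything Metaweb has built: the string-similarity scorer, the two thresholds, and the `triage` and `ask_human` names are all assumptions chosen for the example. The machine merges the obvious matches, sets aside the obvious non-matches as new topics, and hands the uncertain middle to a person, whose decisions are kept as labeled training data.

```python
from difflib import SequenceMatcher

# Assumed thresholds, picked for illustration; a real system would tune them.
AUTO_ACCEPT = 0.95  # at or above this, the machine merges without asking
AUTO_REJECT = 0.50  # below this, the machine treats the name as a new topic


def similarity(a, b):
    """Crude stand-in for a real reconciliation scorer."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def triage(incoming, known, ask_human):
    """Split incoming names into machine-decided and human-decided cases.

    ask_human(name, candidate) returns True if they are the same thing.
    Every human decision is also recorded as a labeled training example.
    """
    merged, new_topics, training = [], [], []
    for name in incoming:
        best = max(known, key=lambda k: similarity(name, k), default=None)
        score = similarity(name, best) if best is not None else 0.0
        if score >= AUTO_ACCEPT:
            merged.append((name, best))          # drudge work: machine decides
        elif score < AUTO_REJECT:
            new_topics.append(name)              # clearly something new
        else:
            same = ask_human(name, best)         # edge case: human decides
            training.append((name, best, score, same))
            if same:
                merged.append((name, best))
            else:
                new_topics.append(name)
    return {"merged": merged, "new": new_topics, "training": training}


if __name__ == "__main__":
    known = ["Kurt Vonnegut", "Joseph Heller"]
    incoming = ["Kurt Vonnegut", "Kurt Vonnegut Jr.", "Boston"]
    # A stand-in reviewer who always confirms; a real one would be a person.
    result = triage(incoming, known, ask_human=lambda name, cand: True)
    for bucket, items in result.items():
        print(bucket, items)
```

The part that matters is the `training` list: each human answer to an edge case is exactly the labeled example a classifier needs, so over time the thresholds can move and fewer cases reach a person.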

28 March 2010

Freebase Gridworks data curation and cleanup tool

I've been alpha testing the Freebase Gridworks tool from Metaweb, but haven't been able to talk about it until now. Since they just announced it, I guess it's no longer a secret.

Research scientist David Huynh has been interested in collective data operations since his days at the MIT CSAIL Simile project. You can see collective editing in this 2007 Potluck screencast. Jon Udell called this "stunning." After David moved to Metaweb, his 2008 Parallax demo showed the power of collective operations for browsing Freebase data (and UCG's DERI group forked a SPARQL version called SParallax).

The Gridworks tool is another riff on that same collective operations theme, but this time focused on data cleanup and reconciliation rather than mashups or browsing. There's a lot more to it than what you see in the screencasts (and, naturally, some limitations which are glossed over as well), but while it's still in testing I'll reserve any detailed discussion of features. Suffice it to say, though, that the anticipatory buzz in the Twitter-sphere is justified. What remains to be seen is how well they'll follow through on completing the tool, as well as integrating it with the various types of data sources & sinks which are of interest to users.

From a selfish point of view, I'd like to see people use tools like this to contribute to the availability of cleaned up public data sets rather than just using it to clean their private data silos. Of course, convincing people to do that is a much bigger problem -- one which the whole Linked Data / Semantic Web community has yet to come up with a compelling answer for.