Plotly

03 October 2012

Pywikipedia and author identifiers in Wikipedia

I'll admit that sounds like a strange combination of topics, but I'll explain.  I recently saw a mention of VIAF identifiers being added to Wikipedia articles.  That's interesting because VIAF (the Virtual International Authority File) is a union catalog of the world's major libraries' "authority files" (their term for a list of names they control), so a VIAF id lets you bridge to all the constituent catalogs.  It's also one of the identifiers associated with Freebase topics.

When I looked into it, I discovered that Wikipedians had already been adding Library of Congress identifiers, so the VIAF id was just an incremental improvement.  The VIAF additions were supposed to have been done in August, so I wanted to see how many there were compared to the older LC identifiers.

In the past I've written little custom Python programs to query information like this, but I recently came across pywikipedia (aka PyWikipediaBot), which is perfect for this kind of job.  One of its standard components is a program that counts template transclusions (i.e. pages that include a template).  You give it the name of a template, tell it whether you want a list or just a count, and it'll query the Wikipedia API to get your results.

$ python templatecount.py -count -namespace:0 Authority_control
Getting references to [[Template:Authority control]] via API...
...
Number of transclusions per template
------------------------------------
Authority_control: 5183

Hmm, that's not as many as I'd hoped.  We selected namespace 0 to restrict our count to the main articles as opposed to talk pages, user pages, etc.  If we replace -count with -list, we can get a list of all the articles. The first time you run any of the pywikipedia tools it'll ask you a few questions to establish defaults for wiki family (wikipedia, wikitravel, etc), language, username, etc, but these can all be overridden on the command line.
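For the curious, here is a minimal sketch of how a count like this could be made directly against the MediaWiki API, using its list=embeddedin query.  This is not necessarily how templatecount.py does it internally; the function names, the User-Agent string and the injectable fetch argument are all made up for illustration, and the endpoint is assumed to be English Wikipedia.

```python
import json
import urllib.parse
import urllib.request

# Assumption: English Wikipedia's API endpoint.
API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params, api_url=API_URL):
    """Send one GET request to the MediaWiki API and return the parsed JSON."""
    url = api_url + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(
        url, headers={"User-Agent": "transclusion-count-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def list_transclusions(template, namespace=0, fetch=fetch_json):
    """Yield the titles of pages in `namespace` that transclude the template."""
    params = {
        "action": "query",
        "list": "embeddedin",
        "eititle": "Template:" + template,
        "einamespace": namespace,
        "eilimit": "max",
        "format": "json",
    }
    while True:
        data = fetch(params)
        for page in data["query"]["embeddedin"]:
            yield page["title"]
        # The API returns a "continue" block while more results remain;
        # its values are merged into the next request.
        if "continue" not in data:
            break
        params.update(data["continue"])


def count_transclusions(template, namespace=0, fetch=fetch_json):
    """Count transclusions by walking every result page."""
    return sum(1 for _ in list_transclusions(template, namespace, fetch))
```

Calling count_transclusions("Authority control") would then play the role of -count, and iterating over list_transclusions("Authority control") the role of -list.  The fetch parameter is there only so the paging logic can be exercised without touching the network.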

The tool allows you to qualify a template name with a parameter name, so we can look at the breakdown between VIAF and LCCN.

$ python templatecount.py -count -namespace:0 \
    Authority_control/VIAF Authority_control/LCCN
Authority_control/VIAF: 3569
Authority_control/LCCN: 4122

So it looks like there are roughly equal numbers of each and, based on the total count, most templates probably contain both.
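To put a number on "both", the three counts above can be combined with a little inclusion-exclusion.  This sketch assumes each article is counted at most once per figure and that the counts were taken at roughly the same time; the helper name min_overlap is made up for illustration.  It only gives a lower bound, since some templates may carry neither parameter.

```python
def min_overlap(total, count_a, count_b):
    """Smallest possible number of pages carrying both parameters.

    If at most `total` pages carry either parameter, then at least
    count_a + count_b - total of them must carry both.
    """
    return max(0, count_a + count_b - total)


total, viaf, lccn = 5183, 3569, 4122  # counts reported above

both_at_least = min_overlap(total, viaf, lccn)
both_at_most = min(viaf, lccn)

print(both_at_least)                    # 2508
print(both_at_most)                     # 3569
print(f"{both_at_least / total:.0%}")   # 48%
```

So somewhere between 2508 and 3569 of the 5183 articles carry both identifiers, which is at least 48% of them.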

One thing I noticed in the Template:Authority_control documentation is that Normdaten is an alias for it, and the counts show that the alias is actually used.

$ python templatecount.py -count -namespace:0 Normdaten
Normdaten: 1227

That's interesting.  I wonder what the story behind that is?  Normdaten is German for "authority data", so naturally the mind immediately wanders to German Wikipedia.  I wonder if that template is used there and, if so, how frequently.  Fortunately for us, the tool can query a different Wikipedia with the flick of a switch: just add -lang:de.

$ python templatecount.py -count -namespace:0 -lang:de Normdaten

Normdaten: 254890
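Switching languages is cheap because each Wikipedia language edition has its own API endpoint under its own hostname.  As a rough sketch of what a -lang switch amounts to at the HTTP level (the function names here are hypothetical, and the hostname pattern is only assumed to hold for Wikipedia language editions, not other wiki families):

```python
import urllib.parse


def api_endpoint(lang="en"):
    """Return the MediaWiki API endpoint for one Wikipedia language edition."""
    return "https://{}.wikipedia.org/w/api.php".format(lang)


def transclusion_query_url(lang, template, namespace=0):
    """Build the URL of an embeddedin query for a template on that edition."""
    params = {
        "action": "query",
        "list": "embeddedin",
        # The canonical English "Template:" prefix is accepted on every wiki.
        "eititle": "Template:" + template,
        "einamespace": namespace,
        "eilimit": "max",
        "format": "json",
    }
    return api_endpoint(lang) + "?" + urllib.parse.urlencode(params)


print(transclusion_query_url("de", "Normdaten"))
```

Only the hostname changes between the English and German queries; the rest of the request is identical.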

Wow, a quarter million articles with identifiers! That's more like what I was hoping for.  German Wikipedia is much further ahead in adding strong identifiers to its articles.  They started with a big push in 2010 and have been steadily adding them ever since, as you can see from this graph.

Strong identifiers in German Wikipedia

Next up -- how to actually retrieve template parameter values...
