Getting Wikipedia Summary from the Page ID


While working on my forthcoming checkin.to project, I needed to use the MediaWiki API to get the summary paragraph of Wikipedia articles about places. Checkin.to relies on Yahoo's Where On Earth Identifiers (WOEIDs). Yahoo also conveniently offers a concordance API, so from the WOEID I get the GeoNames ID and the Wikipedia page ID, among other things. As far as I can tell, the MediaWiki API doesn’t allow you to request page content using the page ID, so the first step is to resolve the page ID into a unique page title. This can be done using the query action like so:

http://en.wikipedia.org/w/api.php?action=query&pageids=49728&format=json

It gives a response resembling:

{"query":{"pages":{"49728":{"pageid":49728,"ns":0,"title":"San Francisco"}}}}
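Note that the response is keyed by the page ID as a string. Here is a minimal sketch of pulling the title out of that structure; the helper name `titleFromQueryResponse` is my own invention for illustration, and returning null for an entry without a title is an assumption about how you would want to handle an unknown ID:

```javascript
// Hypothetical helper: extract the page title from an action=query response.
// Returns null if the page entry is absent or has no title.
function titleFromQueryResponse(data, pageId) {
  var pages = (data && data.query && data.query.pages) || {};
  // The keys of "pages" are page IDs as strings.
  var page = pages[String(pageId)];
  if (!page || typeof page.title !== 'string') {
    return null;
  }
  return page.title;
}

var sample = {"query":{"pages":{"49728":{"pageid":49728,"ns":0,"title":"San Francisco"}}}};
console.log(titleFromQueryResponse(sample, 49728)); // San Francisco
```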

Step 2 is to get the actual page content. There are a variety of formats available including the raw wiki markup, but for my purpose the formatted HTML is much more useful. We also need to convert the spaces in the page title to underscores. The request looks like this:

http://en.wikipedia.org/w/api.php?action=parse&prop=text&page=San_Francisco&format=json
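If you are building that URL by hand, something like the following sketch works. The function name `buildParseUrl` is hypothetical. Two details are worth noting: `String.prototype.replace` with a plain string pattern only replaces the first match, so a global regular expression is needed for titles with more than one space, and `encodeURIComponent` is added here as a precaution for titles containing characters such as `&`.

```javascript
// Hypothetical helper: build the action=parse URL for a page title.
function buildParseUrl(title) {
  // Replace every space, not just the first one.
  var page = title.replace(/ /g, '_');
  return 'http://en.wikipedia.org/w/api.php' +
    '?action=parse&prop=text' +
    '&page=' + encodeURIComponent(page) +
    '&format=json';
}

console.log(buildParseUrl('San Francisco'));
// http://en.wikipedia.org/w/api.php?action=parse&prop=text&page=San_Francisco&format=json
```

When the parameters are passed through jQuery's `data` option instead, as in the full function later on, jQuery takes care of the encoding.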

And a response resembling:

{"parse":{"text":{"*":"<div class=\"dablink\">This article is about the place in California. [...] "}}}
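The HTML lives under the slightly odd `*` key. A small defensive sketch for reading it is below; the helper name `htmlFromParseResponse` is hypothetical, the sample HTML is made up, and the check for a top-level `error` member reflects my understanding of how the API reports failures such as a missing title:

```javascript
// Hypothetical helper: return the article HTML from an action=parse
// response, or null when the response does not contain any.
function htmlFromParseResponse(data) {
  // Treat a top-level "error" member as a failed request (assumption).
  if (!data || data.error || !data.parse || !data.parse.text) {
    return null;
  }
  var html = data.parse.text['*'];
  return typeof html === 'string' ? html : null;
}

// Made-up sample in the same shape as the response above.
var sample = {"parse":{"text":{"*":"<p>Example paragraph.</p>"}}};
console.log(htmlFromParseResponse(sample)); // <p>Example paragraph.</p>
```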

Step 3 is to parse the resulting article HTML and extract just the first body paragraph, which typically summarizes the whole article. The problem here is that a bunch of other stuff, including all the sidebar content, comes before the first body paragraph, and that other stuff can itself include p tags. jQuery is a big help here, as usual. First, let's wrap the entire resulting wiki page in a div element to give everything a root. Then we can search just the children of that wrapper element to find the first root-level p tag.

var wikipage = $("<div>"+data.parse.text['*']+"</div>").children('p:first');

Below I have the entire resulting function that goes from page id to summary paragraph and appends it to a <div> somewhere in my DOM called #wiki_container. I also perform some optional cleanup including removing citations, updating the relative hrefs to absolute hrefs pointing to http://en.wikipedia.org, and adding a read more link.

function getAreaMetaInfo_Wikipedia(page_id) {
  $.ajax({
    url: 'http://en.wikipedia.org/w/api.php',
    data: {
      action:'query',
      pageids:page_id,
      format:'json'
    },
    dataType:'jsonp',
    success: function(data) {
      // Use a global regex so every space is replaced, not just the first
      var title = data.query.pages[page_id].title.replace(/ /g,'_');
      $.ajax({
        url: 'http://en.wikipedia.org/w/api.php',
        data: {
          action:'parse',
          prop:'text',
          page:title,
          format:'json'
        },
        dataType:'jsonp',
        success: function(data) {
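          // Wrap the HTML in a div and take the first top-level paragraph
          var wikipage = $("<div>"+data.parse.text['*']+"</div>").children('p:first');
          // Remove citation markers
          wikipage.find('sup').remove();
          wikipage.find('a').each(function() {
            var href = $(this).attr('href');
            // Only prefix site-relative links such as /wiki/California
            if (href && href.charAt(0) === '/' && href.charAt(1) !== '/') {
              $(this).attr('href', 'http://en.wikipedia.org'+href);
            }
            $(this).attr('target','wikipedia');
          });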
          $("#wiki_container").append(wikipage);
          $("#wiki_container").append("<a href='http://en.wikipedia.org/wiki/"+title+"' target='wikipedia'>Read more on Wikipedia</a>");
        }
      });
    }
  });
}
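
The href cleanup can also be pulled out into a small pure function, which makes it easy to check without a DOM. This is just a sketch: the names `WIKI_ORIGIN` and `absolutizeWikiHref` are my own, and the rule it applies (only prefix links that start with a single slash, leave fragment, protocol-relative, and absolute links alone) is an assumption about which hrefs show up in the parsed HTML.

```javascript
var WIKI_ORIGIN = 'http://en.wikipedia.org';

// Hypothetical helper: turn a site-relative href into an absolute one.
// Anything that is not site-relative is returned unchanged.
function absolutizeWikiHref(href) {
  if (typeof href !== 'string') {
    return href;
  }
  // Site-relative links start with exactly one slash.
  if (href.charAt(0) === '/' && href.charAt(1) !== '/') {
    return WIKI_ORIGIN + href;
  }
  return href;
}

console.log(absolutizeWikiHref('/wiki/California')); // http://en.wikipedia.org/wiki/California
console.log(absolutizeWikiHref('#cite_note-1'));     // #cite_note-1
```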
