How can I get the principal image from the MediaWiki API?

5

Hello, I'm using cURL to get information from Wikipedia, and I want to receive only information about the principal image; I don't want to receive all the images of an article. For example, if I want to get info about all the images of the English language article (http://en.wikipedia.org/wiki/English_language), I go to this URL: http://en.wikipedia.org/w/api.php?action=query&titles=English_Language&prop=images but what I receive, in XML, includes the flags of countries where people speak English:

<?xml version="1.0"?>
<api>
  <query>
    <normalized>
      <n from="English_language" to="English language" />
    </normalized>
    <pages>
      <page pageid="8569916" ns="0" title="English language">
        <images>
          <im ns="6" title="File:Anglospeak(800px)Countries.png" />
          <im ns="6" title="File:Anglospeak.svg" />
          <im ns="6" title="File:Circle frame.svg" />
          <im ns="6" title="File:Commons-logo.svg" />
          <im ns="6" title="File:Flag of Argentina.svg" />
          <im ns="6" title="File:Flag of Aruba.svg" />
          <im ns="6" title="File:Flag of Australia.svg" />
          <im ns="6" title="File:Flag of Bolivia.svg" />
          <im ns="6" title="File:Flag of Brazil.svg" />
          <im ns="6" title="File:Flag of Canada.svg" />

I only want the information about the principal image.

Nalley answered 27/8, 2012 at 18:57 Comment(1)

What images do you expect to get? Aren't these the images that appear on the wiki page about the English language? Wiki data isn't structured in a way that identifies an image about the "English language", but you can check out projects like dbpedia.org, which might help. – Bareilly 27/8, 2012 at 19:10

0

As others have noted, Wikipedia articles don't really have any such thing as a "principal image", so your first problem will be deciding how to choose between the different images used on a given page. Some possible selection criteria might be:

Biggest image in the article.
First image exceeding some specific minimum dimensions, e.g. 60 × 60 pixels.
First image referenced directly in the article's source text, rather than through a template.

For the first two options, you'll want to fetch the rendered HTML code of the page via action=parse and use an HTML parser to find the img tags in the code, like this:

http://en.wikipedia.org/w/api.php?action=parse&page=English_language&prop=text|images

(The reason you can't just get the sizes of the images, as used on the page, directly from the API is that that information isn't actually stored anywhere in the MediaWiki database.)
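The first two criteria could be sketched roughly like this in JavaScript. It assumes you already have the rendered HTML (the text field returned by action=parse) in a string, and it uses a deliberately simple regular expression to read the img tags in place of a real HTML parser, which is what you'd want for anything serious. The function names and the sample markup are made up for illustration.

```javascript
// Collect src, width and height from every <img> tag in the HTML.
// NOTE: a regex stands in for a real HTML parser here; this is only a sketch.
function extractImages(html) {
  const images = [];
  for (const [tag] of html.matchAll(/<img\b[^>]*>/gi)) {
    // Require whitespace before the name so "data-file-width" is not
    // mistaken for "width".
    const attr = (name) => {
      const m = tag.match(new RegExp("\\s" + name + '="([^"]*)"', "i"));
      return m ? m[1] : null;
    };
    const src = attr("src");
    const width = parseInt(attr("width"), 10);
    const height = parseInt(attr("height"), 10);
    if (src && !Number.isNaN(width) && !Number.isNaN(height)) {
      images.push({ src, width, height });
    }
  }
  return images;
}

// Criterion 1: the image with the largest rendered area (null if none).
function biggestImage(images) {
  return images.reduce(
    (best, img) =>
      best === null || img.width * img.height > best.width * best.height
        ? img
        : best,
    null
  );
}

// Criterion 2: the first image at least minSize pixels in both dimensions.
function firstImageAtLeast(images, minSize) {
  return (
    images.find((img) => img.width >= minSize && img.height >= minSize) || null
  );
}

// Hypothetical markup standing in for the API's rendered page text.
const sampleHtml =
  '<p><img src="//example.org/icon.png" width="20" height="20">' +
  '<img src="//example.org/map.png" width="300" height="150">' +
  '<img src="//example.org/chart.png" width="400" height="200"></p>';

const images = extractImages(sampleHtml);
console.log(biggestImage(images).src);
console.log(firstImageAtLeast(images, 60).src);
```

The width and height read here are the sizes used on the page, which is exactly the information the API can't give you directly.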

For the last option, what you want is the source wikitext of the article, available via prop=revisions with rvprop=content:

http://en.wikipedia.org/w/api.php?action=query&titles=English_language&prop=revisions|images&rvprop=content

Note that many images in infoboxes and such are specified as parameters to a template, so just parsing for [[Image:...]] syntax will miss some of them. A better solution is probably to just get the list of all images used on the page via prop=images (which you can do in the same query, as I showed above) and look for their names (with or without Image: / File: prefix) in the wikitext.

Keep in mind the various ways in which MediaWiki automatically normalizes page (and image) names: most notably, underscores are mapped to spaces, consecutive whitespace is collapsed to a single space and the first letter of the name is capitalized. If you decide to go this way, here's some sample PHP code that will convert a list of file names into a regexp that should match any of them in wikitext:

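foreach ($names as &$name) {
    // Normalize: collapse runs of underscores/whitespace into single spaces
    $name = trim( preg_replace( '/[_\s]+/u', ' ', $name ) );
    // Escape regex metacharacters, including the "/" delimiter
    $name = preg_quote( $name, '/' );
    // Match the first character (possibly escaped) case-insensitively
    $name = preg_replace( '/^(\\\\?.)/us', '(?i:$1)', $name );
    // Allow any run of spaces or underscores where the name has a space
    $name = preg_replace( '/\\\\? /u', '[_\s]+', $name );
}
unset( $name ); // break the reference left behind by foreach
$regexp = '/' . implode( '|', $names ) . '/u';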

For example, when given the list:

Anglospeak(800px)Countries.png
Anglospeak.svg
Circle frame.svg
Commons-logo.svg
Flag of Argentina.svg
Flag of Aruba.svg

the generated regexp will be:

/(?i:A)nglospeak\(800px\)Countries\.png|(?i:A)nglospeak\.svg|(?i:C)ircle[_\s]+frame\.svg|(?i:C)ommons\-logo\.svg|(?i:F)lag[_\s]+of[_\s]+Argentina\.svg|(?i:F)lag[_\s]+of[_\s]+Aruba\.svg/u
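To actually pick the first listed file that the wikitext mentions, a pattern like this is then run against the article source and the leftmost match is taken. Here is a rough JavaScript adaptation of the same idea, not a literal port of the PHP: it uses a two-letter character class for the first letter instead of (?i:...), since inline modifiers aren't available in every JavaScript engine. The function names and the sample wikitext are invented for the example.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build one regex matching any of the file names as they may appear in
// wikitext: first letter in either case, spaces or underscores between words.
// Returns null when there are no usable names.
function buildFileNameRegex(names) {
  const alternatives = names
    .map((name) => name.replace(/[_\s]+/g, " ").trim())
    .filter((name) => name.length > 0)
    .map((name) => {
      const [first, ...rest] = Array.from(name);
      const upper = first.toUpperCase();
      const lower = first.toLowerCase();
      // Only letters with a simple one-to-one case mapping get a class.
      const hasSimpleCase =
        upper !== lower &&
        upper.length === first.length &&
        lower.length === first.length;
      const head = hasSimpleCase ? "[" + upper + lower + "]" : escapeRegex(first);
      const tail = escapeRegex(rest.join("")).replace(/ /g, "[_\\s]+");
      return head + tail;
    });
  return alternatives.length > 0
    ? new RegExp(alternatives.join("|"), "u")
    : null;
}

// Return the text of the leftmost file-name match in the wikitext, or null.
function firstReferencedFile(wikitext, names) {
  const regex = buildFileNameRegex(names);
  if (regex === null) return null;
  const match = regex.exec(wikitext);
  return match ? match[0] : null;
}

// Hypothetical inputs: names as returned by prop=images (prefix stripped)
// and a fragment of article source.
const fileNames = ["Flag of Aruba.svg", "Anglospeak.svg", "Circle frame.svg"];
const sampleWikitext =
  "{{Infobox language | map = anglospeak.svg }} text " +
  "[[File:Circle_frame.svg|thumb]]";

console.log(firstReferencedFile(sampleWikitext, fileNames));
```

Because the match is on the bare file name, this also catches images passed as template parameters, which a search for [[Image:...]] syntax alone would miss.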

Heiskell answered 28/8, 2012 at 22:26 Comment(1)

There is a way to get the "principal image" from Wikipedia using the MediaWiki API. Please see https://mcmap.net/q/264542/-accessing-main-picture-of-wikipedia-page-by-api for the solution. – Morez 27/3, 2017 at 7:24

7

There's news! (from 2014)
A new extension, PageImages, is available and is already installed on the Wikimedia wikis.

Instead of prop=images, use prop=pageimages, and you'll get a pageimage attribute and a <thumbnail> child node for each <page> element.

Admittedly, it's not guaranteed to give the best results, but in your example (English Language) it works well and only yields the map of the geographic distribution, not all the flags.
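With format=json the same data comes back as a pageimage string and a thumbnail object on each page entry. A minimal JavaScript sketch of building such a query and reading the result might look like this; the function names are my own, and the sample response is hand-written with placeholder values rather than copied from a real API reply.

```javascript
// Build a prop=pageimages query URL for a single title.
// The default thumbnail width of 200 is an arbitrary choice for this example.
function pageImagesUrl(title, thumbSize = 200) {
  const params = new URLSearchParams({
    action: "query",
    format: "json",
    prop: "pageimages",
    pithumbsize: String(thumbSize),
    titles: title,
  });
  return "https://en.wikipedia.org/w/api.php?" + params.toString();
}

// Pull the page image out of a parsed JSON response. In the default JSON
// format, query.pages is an object keyed by page ID.
function extractPageImage(response) {
  const pages = (response.query && response.query.pages) || {};
  for (const page of Object.values(pages)) {
    if (page.pageimage || page.thumbnail) {
      return {
        title: page.title,
        fileName: page.pageimage || null,
        thumbnailUrl: page.thumbnail ? page.thumbnail.source : null,
      };
    }
  }
  return null; // no page in the response has a page image
}

// Hand-written stand-in for a response; file name and URL are placeholders.
const sampleResponse = {
  query: {
    pages: {
      "8569916": {
        pageid: 8569916,
        ns: 0,
        title: "English language",
        thumbnail: {
          source: "https://upload.example/thumb.png",
          width: 200,
          height: 101,
        },
        pageimage: "Example_map.svg",
      },
    },
  },
};

console.log(pageImagesUrl("English language"));
console.log(extractPageImage(sampleResponse));
```

Returning null when no page carries an image keeps the "not guaranteed" case explicit for the caller.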

Also, the OpenSearch API does return an <image> in its XML representation, but this API is not usable with lists and cannot be combined with the Query API.

Marabout answered 16/7, 2014 at 7:32 Comment(0)

3

This is how I got it working...

$.getJSON("https://en.wikipedia.org/w/api.php?action=query&format=json&callback=?", {
    titles: "India",
    prop: "pageimages",
    pithumbsize: 150
  },
  function(data) {
    // data.query.pages is keyed by page ID, so scan it for a thumbnail URL
    var imageUrl = GetAttributeValue(data.query.pages);
    if (imageUrl === "") {
      $("#wiki").append("<div>No image found</div>");
    } else {
      // Set src via attr() instead of concatenating the URL into HTML
      $("#wiki").append($("<img>").attr("src", imageUrl));
    }
  }
);

// Returns the first thumbnail URL found among the pages, or "" if there is none
function GetAttributeValue(data) {
  var urli = "";
  for (var key in data) {
    if (data[key].thumbnail !== undefined) {
      if (data[key].thumbnail.source !== undefined) {
        urli = data[key].thumbnail.source;
        break;
      }
    }
  }
  return urli;
}



<html>

<head>
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
</head>

<body>
  <div id="wiki"></div>
</body>

</html>

Leblanc answered 12/1, 2015 at 11:20 Comment(0)

1

Important addendum

Bergi's answer, above, seemed super great, but I was bashing my head out because I couldn't get it to work.

I needed to include pilicense=any in my query, because otherwise any copyrighted imagery was ignored.

Here's the query I ultimately got working:

https://en.wikipedia.org/w/api.php?action=query&pilicense=any&format=jsonfm&prop=pageimages&generator=search&gsrsearch=My+incategory:English-language_films+prefix:My&gsrlimit=3
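If you build that URL in code rather than by hand, something along these lines should work. It's a sketch that assumes a JavaScript runtime with URLSearchParams (browsers and Node have it); the function name is hypothetical, and it asks for format=json instead of the jsonfm debugging view used in the URL above.

```javascript
// Build a search-driven pageimages query that also allows non-free images.
// Without pilicense=any, copyrighted page images were left out of my results.
function searchPageImagesUrl(searchQuery, limit) {
  const params = new URLSearchParams({
    action: "query",
    format: "json",
    prop: "pageimages",
    pilicense: "any",
    generator: "search",
    gsrsearch: searchQuery,
    gsrlimit: String(limit),
  });
  return "https://en.wikipedia.org/w/api.php?" + params.toString();
}

const searchUrl = searchPageImagesUrl(
  "My incategory:English-language_films prefix:My",
  3
);
console.log(searchUrl);
```

URLSearchParams takes care of encoding the spaces and colons in the search string, which is easy to get wrong when concatenating by hand.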

I know it's been a while, but this is one of the first pages I landed on when I started my days-long search for how to do this, so I wanted to share this specifically on this page, for others like me who might come here.

Got answered 8/5, 2018 at 15:57 Comment(0)

0

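You can limit your query to a single image with the imlimit parameter. Note that prop=images lists files alphabetically by title (as the output in the question shows), so this returns the first file by name, not necessarily the first image shown in the article: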

http://en.wikipedia.org/w/api.php?action=query&titles=English_Language&redirects&prop=images&imlimit=1

Marabout answered 27/8, 2012 at 19:5 Comment(2)

Thanks, but how could I get only the principal image? The first image is not always the main image on Wikipedia. – Nalley 28/8, 2012 at 9:28

There is no "only principal" image for an article; such information does not exist and cannot be obtained through the API. Check out dbpedia.org, but afaik they use the first one, too. You might exclude things like flags or disambiguation icons from your results manually. – Marabout 28/8, 2012 at 10:36
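Excluding such files by hand could look something like the sketch below. The patterns are only guesses at what counts as decoration, based on the file names in the question; they are not an official list, and you'd need to tune them for the pages you care about. The function name is made up.

```javascript
// Guessed patterns for decorative files (flags, logos, icons). Assumption:
// these names are typical, but nothing in the API marks a file as decorative.
const DECORATIVE_PATTERNS = [
  /^File:Flag of /i,
  /^File:Commons-logo/i,
  /^File:Circle frame/i,
  /icon/i,
  /disambig/i,
];

// Return the first title that matches none of the patterns, or null.
function firstNonDecorative(titles) {
  const keep = titles.find(
    (title) => !DECORATIVE_PATTERNS.some((pattern) => pattern.test(title))
  );
  return keep || null;
}

// Titles in the shape returned by prop=images.
const sampleTitles = [
  "File:Circle frame.svg",
  "File:Commons-logo.svg",
  "File:Flag of Argentina.svg",
  "File:Anglospeak.svg",
];

console.log(firstNonDecorative(sampleTitles));
```

This is still a heuristic: it only removes obvious noise and says nothing about which of the remaining images is the "main" one.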

0