Finding and downloading images within the Wikipedia Dump

I'm trying to find a comprehensive list of all images on wikipedia, which I can then filter down to the public domain ones. I've downloaded the SQL dumps from here:

http://dumps.wikimedia.org/enwiki/latest/

And studied the DB schema:

http://upload.wikimedia.org/wikipedia/commons/thumb/4/42/MediaWiki_1.20_%2844edaa2%29_database_schema.svg/2193px-MediaWiki_1.20_%2844edaa2%29_database_schema.svg.png

I think I understand it but when I pick a sample image from a wikipedia page I can't find it anywhere in the dumps. For example:

http://en.wikipedia.org/wiki/File:Carrizo_2a.JPG

I've grepped the 'image', 'imagelinks', and 'page' dumps for 'Carrizo_2a.JPG', and it doesn't appear in any of them.

Are these dumps not complete? Am I misunderstanding the structure? Is there a better way to do this?

Also, to jump ahead one step: once I've filtered my list down, I want to download a bulk set of images (thousands). I've seen some mentions that this should be done from a mirror of the site to avoid overloading wikipedia/wikimedia. If anyone has guidance on this too, that would be helpful.

Pallet asked 5/4, 2013 at 21:50 Comment(1)
Here is an example of a second image that exhibits the same symptoms. I've tried a bunch and haven't found a single one yet which is in the dumps: en.wikipedia.org/wiki/File:Aerial-SanAndreas-CarrizoPlain.jpg – Pallet

MediaWiki stores file data in two or three places, depending on how you count:

  • The actual metadata for current file versions is stored in the image table. This is probably what you primarily want; you'll find the latest en.wikipedia dump of it here. (See the query sketch after this list.)

  • Data for old superseded file revisions is moved to the oldimage table, which has basically the same structure as the image table. This table is also dumped, the latest one is here.

  • Finally, each file also (normally) corresponds to a pretty much ordinary wiki page in namespace 6 (File:). You'll find the text of these in the XML dumps, same as for any other pages.
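
For reference, here's a minimal sketch of how you might look up a single file's metadata once the image dump has been imported into a local MySQL database. The connection details and the database name (enwiki) are placeholders you'd adapt to your own setup:

```python
# Minimal sketch: look up a file's metadata in an imported `image` table.
# Assumes the enwiki image dump has been loaded into a local MySQL
# database called "enwiki"; the connection details are placeholders.
import pymysql

conn = pymysql.connect(host="localhost", user="wiki", password="secret",
                       database="enwiki", charset="utf8mb4")
try:
    with conn.cursor() as cur:
        # img_name uses the database key form: spaces become underscores.
        cur.execute(
            "SELECT img_name, img_size, img_width, img_height, img_timestamp "
            "FROM image WHERE img_name = %s",
            ("Carrizo_2a.JPG",),
        )
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```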

Oh, and the reason you're not finding those files you linked to in the English Wikipedia dumps is that they're from the shared repository at Wikimedia Commons. You'll find them in the Commons data dumps instead.

As for downloading the actual files, here's the (apparently) official documentation. As far as I can tell, all they mean by "Bulk download is currently (as of September 2012) available from mirrors but not offered directly from Wikimedia servers." is that if you want all the images in a tarball, you'll have to use a mirror. If you're only pulling a relatively small subset of the millions of images on Wikipedia and/or Commons, it should be fine to use the Wikimedia servers directly.

Just remember to exercise basic courtesy: send a user-agent string identifying yourself and don't hit the servers too hard. In particular, I'd recommend running the downloads sequentially, so that you only start downloading the next file after you've finished the previous one. Not only is that easier to implement than parallel downloading anyway, but it ensures that you don't hog more than your share of the bandwidth and allows the download speed to more or less automatically adapt to server load.
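
For example, a sequential download loop along those lines might look like the sketch below. The User-Agent string and URL list are placeholders, and the one-second pause is just a conservative default:

```python
# Minimal sketch: download a list of file URLs one at a time, with an
# identifying User-Agent header and a polite pause between requests.
import os
import time

import requests

HEADERS = {"User-Agent": "my-image-survey/0.1 (contact: you@example.com)"}  # placeholder

def download_all(urls, dest_dir="images", pause=1.0):
    os.makedirs(dest_dir, exist_ok=True)
    for url in urls:
        filename = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
        resp = requests.get(url, headers=HEADERS, timeout=60)
        resp.raise_for_status()
        with open(filename, "wb") as f:
            f.write(resp.content)
        time.sleep(pause)  # sequential + delay: don't hammer the servers
```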

P.S. Whether you download the files from a mirror or directly from the Wikimedia servers, you're going to need to figure out which directory they're in. Typical Wikipedia file URLs look like this:

http://upload.wikimedia.org/wikipedia/en/a/ab/File_name.jpg

where the "wikipedia/en" part identifies the Wikimedia project and language (for historical reasons, Commons is listed as "wikipedia/commons") and the "a/ab" part is given by the first two hex digits of the MD5 hash of the filename in UTF-8 (as they're encoded in the database dumps).

Distributive answered 5/4, 2013 at 22:12 Comment(4)
Thanks so much! I haven't gotten to the bulk download part yet, but I didn't realize there were two separate repositories of data. I'm importing both of them right now, and a quick 'grep' confirmed that my missing files were in the Commons one. Wikipedia/Wikimedia sure doesn't make understanding this stuff easy. :) – Pallet
Everything is going smoothly, except I'm trying to figure out how to filter the images I'm selecting down to what's in the public domain. I can't find this info in the 'image' table or 'page' table; I think it's probably only in the contents of the page itself, e.g. see the "Licensing" section of this page: en.wikipedia.org/wiki/File:Carrizo_2a.JPG So I'm downloading this file: dumps.wikimedia.org/enwiki/latest/… But I'm hoping to find a SQL version of this for easier manipulation. Any suggestions? And am I on the right track here? – Pallet
I should add, on this page it says "SQL files for all pages and links are also available." That's what gave me the clue that they probably exist somewhere. – Pallet
Yeah, MediaWiki's system for storing license metadata (or, rather, the lack of any such system) sucks. At least for Commons, you might be able to extract the license data from the categorylinks table dump, since all Commons license templates add the file pages they're used on to hidden categories under commons.wikimedia.org/wiki/Category:Copyright_statuses. I believe the English Wikipedia has a similar system, with the root category at en.wikipedia.org/wiki/… – Distributive
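
To make the categorylinks idea from the last comment concrete, here's a rough sketch, assuming the page and categorylinks dumps have been imported into the same local MySQL database as before. The category name is only an example; in practice you'd have to walk the subcategory tree under Copyright_statuses rather than query a single category:

```python
# Rough sketch: list file pages (namespace 6) that belong to a given
# license category, via the page and categorylinks tables.
import pymysql

LICENSE_CATEGORY = "Public_domain"  # example only; real license categories vary

conn = pymysql.connect(host="localhost", user="wiki", password="secret",
                       database="enwiki", charset="utf8mb4")
try:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT page_title FROM page "
            "JOIN categorylinks ON cl_from = page_id "
            "WHERE page_namespace = 6 AND cl_to = %s",
            (LICENSE_CATEGORY,),
        )
        for (title,) in cur.fetchall():
            print(title)
finally:
    conn.close()
```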
