How to obtain a list of titles of all Wikipedia articles

I'd like to obtain a list of the titles of all Wikipedia articles. I know there are two possible ways to get content from a Wikimedia-powered wiki: one would be the API, and the other a database dump.

I'd prefer not to download the wiki dump. First, it's huge, and second, I'm not really experienced with querying databases. The problem with the API, on the other hand, is that I couldn't figure out a way to retrieve only a list of the article titles, and even if I could, it would take more than 4 million requests, which would probably get me blocked anyway.

So my questions are:

  1. Is there a way to obtain only the titles of Wikipedia articles via the API?
  2. Is there a way to combine multiple request/queries into one? Or do I actually have to download a Wikipedia dump?
Mailable answered 29/6, 2014 at 8:10 Comment(1)
You could try the API Sandbox or an actual query. – Connolly

The allpages API module allows you to do just that. Its limit (when you set aplimit=max) is 500, so to query all 4.5M articles, you would need about 9000 requests.
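
For illustration, here is a minimal Python sketch of paging through allpages with the requests library (the library choice and the User-Agent value are my own assumptions, not part of the answer):

# Minimal sketch: iterate over all article titles via the allpages module.
import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "list": "allpages",
    "apnamespace": 0,
    "aplimit": "max",  # 500 titles per request for normal accounts
    # optionally add "apfilterredir": "nonredirects" to skip redirect titles
}
session = requests.Session()
session.headers["User-Agent"] = "titles-list-example/0.1 (replace with your contact info)"

while True:
    data = session.get(API, params=params, timeout=30).json()
    for page in data["query"]["allpages"]:
        print(page["title"])
    if "continue" not in data:
        break
    params.update(data["continue"])  # carries apcontinue into the next request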

But a dump is a better choice, because there are many different dumps, including all-titles-in-ns0 which, as its name suggests, contains exactly what you want (59 MB of gzipped text).
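
If you go the dump route, the titles file can also be streamed directly from Python; a rough sketch (the URL below is my assumption based on the "latest" dump listing, and the file may start with a page_title header line):

# Rough sketch: stream the gzipped all-titles-in-ns0 dump, one title per line.
import gzip
import urllib.request

URL = ("https://dumps.wikimedia.org/enwiki/latest/"
       "enwiki-latest-all-titles-in-ns0.gz")

with urllib.request.urlopen(URL) as response:
    with gzip.GzipFile(fileobj=response) as dump:
        for raw in dump:
            title = raw.decode("utf-8").rstrip("\n")
            if title == "page_title":  # skip the header line if present
                continue
            print(title)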

Sperrylite answered 29/6, 2014 at 8:39 Comment(8)
Awesome, thanks a lot! I was looking for exactly such a dump but wasn't able to find one. I guess one click further in searching for a dump would have brought me to exactly this download :) thanks! – Mailable
This helped us. Can you give a link to the page which lists all dumps? – Tragedy
@VivekSancheti Here is the page listing all English Wikipedia dumps from last month. – Sperrylite
What is the difference between the in-ns0 and the non-in-ns0 .gz file? They differ in size as well. – Romney
@Romney The difference is that "in-ns0" only contains information about pages in namespace 0, that is, articles. – Sperrylite
Thanks @svick! Is it possible to get the same data dump for other languages? And also a way to link the titles of the same article in different languages together (via article ID, for example)? – Coverlet
all-titles-in-ns0 contains titles that are just redirects to the real title. – Calisaya
@Romney the namespaces are documented at: en.wikipedia.org/wiki/Wikipedia:Namespace 0 is "main articles" (and redirects), and the others are "magic stuff" like Talk:, File:, User: and Category:. – Miscarry

Downloading only non-redirect titles from enwiki-latest-page.sql

The file enwiki-latest-all-titles-in-ns0 mentioned by svick contains both "actual pages" and redirects: for a given article, both the redirecting title and the canonical title can be present, with the first redirecting to the second.

The file enwiki-latest-page.sql, however, which can be found in the dump listing at https://dumps.wikimedia.org/enwiki/latest/, appears to contain enough information to extract the titles without redirects.

On Ubuntu 23.04, we can extract the titles to a titles.txt file with something like:

# Download and decompress the page table dump
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.sql.gz
gunzip --keep enwiki-latest-page.sql.gz
# Drop all secondary indices (and the trailing comma on the PRIMARY KEY line)
# so the import is not painfully slow
sed -E '/^  (UNIQUE )?KEY/d;/^  PRIMARY KEY/{s/,$//}' enwiki-latest-page.sql \
  > enwiki-latest-page-noindex.sql
# Install MariaDB and import the table
sudo apt update
sudo apt install mariadb-server
sudo mariadb -e 'create database enwiki'
sudo mariadb enwiki <enwiki-latest-page-noindex.sql
# Extract non-redirect article titles (namespace 0) into titles.txt
sudo mariadb enwiki -B -N -e 'select page_title from page
  where page_namespace = 0 and
  page_is_redirect = 0
  order by page_title' > titles.txt
rm enwiki-latest-page.sql.gz enwiki-latest-page.sql enwiki-latest-page-noindex.sql

The SQL execution was quite slow and took about 17 minutes.

As of October 2023, according to https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia Wikipedia has ~6.7M articles, and titles.txt contains a number of lines of that order, so the results seem plausible.
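
That line count is easy to verify, for example with wc -l or a throwaway Python snippet:

# Count the extracted titles to compare against the ~6.7M article figure.
with open("titles.txt", "rb") as f:
    print(sum(1 for _ in f))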

We can see that enwiki-latest-page.sql creates the table:

CREATE TABLE `page` (
  `page_id` int(8) unsigned NOT NULL AUTO_INCREMENT,
  `page_namespace` int(11) NOT NULL DEFAULT 0,
  `page_title` varbinary(255) NOT NULL DEFAULT '',
  `page_is_redirect` tinyint(1) unsigned NOT NULL DEFAULT 0,
  `page_is_new` tinyint(1) unsigned NOT NULL DEFAULT 0,
  `page_random` double unsigned NOT NULL DEFAULT 0,
  `page_touched` binary(14) NOT NULL,
  `page_links_updated` varbinary(14) DEFAULT NULL,
  `page_latest` int(8) unsigned NOT NULL DEFAULT 0,
  `page_len` int(8) unsigned NOT NULL DEFAULT 0,
  `page_content_model` varbinary(32) DEFAULT NULL,
  `page_lang` varbinary(35) DEFAULT NULL,
  PRIMARY KEY (`page_id`),
  UNIQUE KEY `page_name_title` (`page_namespace`,`page_title`),
  KEY `page_random` (`page_random`),
  KEY `page_len` (`page_len`),
  KEY `page_redirect_namespace_len` (`page_is_redirect`,`page_namespace`,`page_len`)
) ENGINE=InnoDB AUTO_INCREMENT=74951970 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED;

followed by a bunch of huge INSERT statements.

As we can see, this contains both a title and an is_redirect field, as we'd like. That table is an actual table from the MediaWiki software, and it is documented at: https://www.mediawiki.org/wiki/Manual:Page_table

One additional annoyance is that the presence of several indices makes the initial import very slow, so we speed it up greatly by editing the dump with sed to remove all indices except the primary key. The import step took 17 minutes on my Lenovo ThinkPad P51 after those were removed; with the index statements left in, it hadn't finished after 5 hours and I gave up. The following statement:

/*!40000 ALTER TABLE `page` DISABLE KEYS */;

does get executed on my system, as per What does /*!xxxxxx statement */ generated by mysqldump mean?, and is presumably added by mysqldump to speed up insertions, as it is faster to update indices only once after all the inserts. So it seems that the index generation itself is what is super slow at the end?

The specific case of titles without redirects has been asked at: titles of all wikipedia articles without redirect

A related Wikipedia import question: Importing wikipedia dump to MySql

Just one more timing experiment: if I hack the table to not have any indices at all, not even the primary key:

echo page
cat <<'EOF' | time mariadb enwiki
DROP TABLE IF EXISTS `page`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `page` (
  `page_id` int(8) unsigned NOT NULL,
  `page_namespace` int(11) NOT NULL DEFAULT 0,
  `page_title` varbinary(255) NOT NULL DEFAULT '',
  `page_is_redirect` tinyint(1) unsigned NOT NULL DEFAULT 0,
  `page_is_new` tinyint(1) unsigned NOT NULL DEFAULT 0,
  `page_random` double unsigned NOT NULL DEFAULT 0,
  `page_touched` binary(14) NOT NULL,
  `page_links_updated` varbinary(14) DEFAULT NULL,
  `page_latest` int(8) unsigned NOT NULL DEFAULT 0,
  `page_len` int(8) unsigned NOT NULL DEFAULT 0,
  `page_content_model` varbinary(32) DEFAULT NULL,
  `page_lang` varbinary(35) DEFAULT NULL
) ENGINE=InnoDB AUTO_INCREMENT=74951970 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED;
/*!40101 SET character_set_client = @saved_cs_client */;
EOF
echo 'insert'
grep -a -e '^INSERT ' enwiki-latest-page.sql | time mariadb enwiki
echo 'index'
time mariadb enwiki -e 'ALTER TABLE `page` ADD PRIMARY KEY `page_pk` (`page_id`)'
time mariadb enwiki -e 'ALTER TABLE `page` ADD UNIQUE KEY `page_name_title` (`page_namespace`,`page_title`)'
#time mariadb enwiki -e 'ALTER TABLE `page` ADD KEY `page_random` (`page_random`)'
#time mariadb enwiki -e 'ALTER TABLE `page` ADD KEY `page_len` (`page_len`)'
#time mariadb enwiki -e 'ALTER TABLE `page` ADD KEY `page_redirect_namespace_len` (`page_is_redirect`,`page_namespace`,`page_len`)'
echo

gives:

  • 28 minutes for insert
  • 20 minutes for add primary key
  • 7 minutes for page_name_title

I'm losing patience with MySQL speeds.

I also tried to create an SQLite table manually and run the INSERT statements unmodified, but it doesn't work because MySQL uses the non-standard \' extension to escape single quotes instead of the standard ''. So much for standards...
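
To make the escaping mismatch concrete, here is a tiny illustration (a naive string replacement over a made-up row fragment; it is not a robust converter and would still mishandle backslashes inside titles):

# MySQL dumps escape quotes as \' while SQLite only accepts '' inside strings.
mysql_values = r"(12,0,'O\'Brien',0)"  # made-up example row fragment
sqlite_values = mysql_values.replace(r"\'", "''")
print(sqlite_values)  # (12,0,'O''Brien',0)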

sed the enwiki-latest-page.sql file instead to go faster

If you really want to get rid of the MySQL slowness and live the fast life of sed, here's an attempt:

cat enwiki-latest-page.sql |
  grep -E '^INSERT' |
  sed 's/),(/\n/g' |
  grep -E '^[^,]+,0,' |
  perl -lane "s/^[^']+'//;s/(,[^,]+){8}$//; if (substr(\$_, -1) == '0') { print substr(\$_, 0, -3) }" |
  sed 's/\\\\/\\/g;s/\\'"'"'/'"'"'/g;s/\\"/"/g' \
> titles-sed.txt
sort -o titles-sed.txt titles-sed.txt

That took about 5 minutes on my Lenovo ThinkPad P51 and produced the exact same text file, except for the handling of the very few titles that contain a backslash \, e.g. https://en.wikipedia.org/wiki/TBWA%5CChiat%5CDay. For some reason the MySQL output contains a double backslash for those. I wasn't sure if that is hardcoded in the database itself or an I/O problem, but looking at enwiki-latest-all-titles-in-ns0 I see that it contains TBWA\Chiat\Day with single backslashes, so it must have been a simple I/O issue.

I later learned that this could be much improved with awk FPAT, along the lines of: What's the most robust way to efficiently parse CSV using awk?

Here's an upload of the sed one: https://archive.org/details/enwiki-latest-all-titles-in-ns0-no-redirects

mysqldump-to-csv the enwiki-latest-page.sql file instead to go faster

https://mcmap.net/q/209527/-dump-all-tables-in-csv-format-using-39-mysqldump-39 pointed me to https://github.com/jamesmishra/mysqldump-to-csv, which semi-hackily converts the dumps to CSV and allows one to bypass the MySQL slowness:

git clone https://github.com/jamesmishra/mysqldump-to-csv
cd mysqldump-to-csv
git checkout 24301dfa739c13025844ed3ff5a8abe093ced6cc
patch <<'EOF'
diff --git a/mysqldump_to_csv.py b/mysqldump_to_csv.py
index b49cfe7..8d5bb2a 100644
--- a/mysqldump_to_csv.py
+++ b/mysqldump_to_csv.py
@@ -101,7 +101,8 @@ def main():
     # listed in sys.argv[1:]
     # or stdin if no args given.
     try:
-        for line in fileinput.input():
+        sys.stdin.reconfigure(errors='ignore')
+        for line in fileinput.input(encoding="utf-8", errors="ignore"):
             # Look for an INSERT statement and parse it.
             if is_insert(line):
                 values = get_values(line)
EOF

Then I add csvtool, as per How to extract one column of a csv file, plus awk to filter the columns:

time zcat enwiki-latest-page.sql.gz |
  python mysqldump-to-csv/mysqldump_to_csv.py |
  csvtool format '%(2) %(4) %(3)\n' - |
  awk '$1==0 && $2==0 {print $3}' > titles-csv.txt
sort -o titles-csv.txt titles-csv.txt

This took 5.5 minutes on my Lenovo ThinkPad P51, and the outcome is identical to that of the sed approach, which is reassuring. So this is just as fast but way less hacky, and it will also be able to deal with raw newlines in other tables that contain them.
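
If csvtool is not available, the same column selection and filtering can be done with Python's csv module. A hedged sketch, assuming the CSV columns come out in page-table order (page_id, page_namespace, page_title, page_is_redirect, ...) and saving it under an arbitrary name such as filter_titles.py:

# Keep titles where page_namespace = 0 and page_is_redirect = 0.
import csv
import sys

for row in csv.reader(sys.stdin):
    if len(row) >= 4 and row[1] == "0" and row[3] == "0":
        print(row[2])

It would slot into the pipeline above in place of the csvtool and awk stages.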

Get pages within the category hierarchy

I did some exploration on this related topic at: Wikipedia Category Hierarchy from dumps

Miscarry answered 7/10, 2023 at 8:11 Comment(0)

Right now, as per the current statistics, the number of articles is around 5.8M. To get the list of pages I used the AllPages API. However, the number of pages I get is around 14.5M, which is about three times what I was expecting. I restricted myself to namespace 0 to get the list. The following is the sample code that I am using:

# get the list of all wikipedia pages (articles) -- English
import sys
from simplemediawiki import MediaWiki

listOfPagesFile = open("wikiListOfArticles_nonredirects.txt", "w")


wiki = MediaWiki('https://en.wikipedia.org/w/api.php')

continueParam = ''
requestObj = {}
requestObj['action'] = 'query'
requestObj['list'] = 'allpages'
requestObj['aplimit'] = 'max'
requestObj['apnamespace'] = '0'

pagelist = wiki.call(requestObj)
pagesInQuery = pagelist['query']['allpages']

for eachPage in pagesInQuery:
    pageId = eachPage['pageid']
    title = eachPage['title'].encode('utf-8')
    writestr = str(pageId) + "; " + title + "\n"
    listOfPagesFile.write(writestr)

numQueries = 1

while len(pagelist['query']['allpages']) > 0:

    requestObj['apcontinue'] = pagelist["continue"]["apcontinue"]
    pagelist = wiki.call(requestObj)


    pagesInQuery = pagelist['query']['allpages']

    for eachPage in pagesInQuery:
        pageId = eachPage['pageid']
        title = eachPage['title'].encode('utf-8')
        writestr = str(pageId) + "; " + title + "\n"
        listOfPagesFile.write(writestr)
        # print writestr


    numQueries += 1

    if numQueries % 100 == 0:
        print "Done with queries -- ", numQueries
        print numQueries

listOfPagesFile.close()

The number of queries fired is around 28,900, which results in approximately 14.5M page names.

I also tried the all-titles link mentioned in the above answer. In that case as well I am getting around 14.5M pages.

I thought that this overestimate relative to the actual number of pages was because of the redirects, so I added the 'nonredirects' option to the request object:

requestObj['apfilterredir'] = 'nonredirects'

After doing that I get only 112,340 pages, which is far too small compared to 5.8M.

With the above code I was expecting roughly 5.8M pages, but that doesn't seem to be the case.

Is there any other option that I should be trying to get the actual (~5.8M) set of page names?

Havens answered 12/3, 2019 at 8:54 Comment(2)
Is simplemediawiki Python 3 or is it Python 2? – Unqualified
If you are getting an error about the print statement, you can avoid that by installing directly from GitHub rather than PyPI: pip install git+https://github.com/iliana/python-simplemediawiki.git – Shot

Here is an asynchronous program that will generate MediaWiki page titles:

async def wikimedia_titles(http, wiki="https://en.wikipedia.org/"):
    log.debug('Started generating asynchronously wiki titles at {}', wiki)
    # XXX: https://www.mediawiki.org/wiki/API:Allpages#Python
    url = "{}/w/api.php".format(wiki)
    params = {
        "action": "query",
        "format": "json",
        "list": "allpages",
        "apfilterredir": "nonredirects",
        "apfrom": "",
    }

    while True:
        content = await get(http, url, params=params)
        if content is None:
            continue
        content = json.loads(content)

        for page in content["query"]["allpages"]:
            yield page["title"]
        try:
            apcontinue = content['continue']['apcontinue']
        except KeyError:
            return
        else:
            params["apfrom"] = apcontinue
Cornute answered 14/8, 2021 at 15:36 Comment(0)
