Downloading only non-redirect titles from enwiki-latest-page.sql
The file enwiki-latest-all-titles-in-ns0 mentioned by svick contains both "actual pages" and redirects: for example, both a redirect title and its target title are present, and the first merely redirects to the second.
The file enwiki-latest-page.sql however, which can be found from the dump listing at https://dumps.wikimedia.org/enwiki/latest/, appears to contain enough information to get the titles without redirects.
On Ubuntu 23.04, we can extract the titles to a titles.txt file with something like:
# download and decompress the page table dump
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.sql.gz
gunzip --keep enwiki-latest-page.sql.gz
# delete the secondary index definitions to speed up the import, keeping only
# the primary key and removing its now-trailing comma
sed -E '/^ +(UNIQUE )?KEY/d;/^ +PRIMARY KEY/{s/,$//}' enwiki-latest-page.sql \
  > enwiki-latest-page-noindex.sql
# import into a fresh MariaDB database
sudo apt update
sudo apt install mariadb-server
sudo mariadb -e 'create database enwiki'
sudo mariadb enwiki <enwiki-latest-page-noindex.sql
# extract the titles of non-redirect pages in the main namespace
sudo mariadb enwiki -B -N -e 'select page_title from page
  where page_namespace = 0 and
        page_is_redirect = 0
  order by page_title' > titles.txt
# clean up
rm enwiki-latest-page.sql.gz enwiki-latest-page.sql enwiki-latest-page-noindex.sql
The SQL execution was quite slow and took about 17 minutes.
As of October 2023, according to https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia Wikipedia has ~6.7 M articles, and titles.txt also contains a number of lines of that order, so the results seem plausible.
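For a quick sanity check along those lines (not part of the original recipe), counting and eyeballing the output is enough; the exact count depends on the dump date:
wc -l titles.txt
head -5 titles.txt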
We can see that enwiki-latest-page.sql creates the table:
CREATE TABLE `page` (
  `page_id` int(8) unsigned NOT NULL AUTO_INCREMENT,
  `page_namespace` int(11) NOT NULL DEFAULT 0,
  `page_title` varbinary(255) NOT NULL DEFAULT '',
  `page_is_redirect` tinyint(1) unsigned NOT NULL DEFAULT 0,
  `page_is_new` tinyint(1) unsigned NOT NULL DEFAULT 0,
  `page_random` double unsigned NOT NULL DEFAULT 0,
  `page_touched` binary(14) NOT NULL,
  `page_links_updated` varbinary(14) DEFAULT NULL,
  `page_latest` int(8) unsigned NOT NULL DEFAULT 0,
  `page_len` int(8) unsigned NOT NULL DEFAULT 0,
  `page_content_model` varbinary(32) DEFAULT NULL,
  `page_lang` varbinary(35) DEFAULT NULL,
  PRIMARY KEY (`page_id`),
  UNIQUE KEY `page_name_title` (`page_namespace`,`page_title`),
  KEY `page_random` (`page_random`),
  KEY `page_len` (`page_len`),
  KEY `page_redirect_namespace_len` (`page_is_redirect`,`page_namespace`,`page_len`)
) ENGINE=InnoDB AUTO_INCREMENT=74951970 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED;
followed by a bunch of huge INSERT statements.
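To eyeball what those statements look like without opening the multi-GB file in an editor, something like this prints the start of the first one (grep -a prevents the binary-charset dump from being treated as a binary file):
grep -a -m 1 '^INSERT' enwiki-latest-page.sql | head -c 300; echo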
As we can see, the table contains both a page_title and a page_is_redirect column, as we'd like. That table is an actual table of the MediaWiki software, and it is documented at: https://www.mediawiki.org/wiki/Manual:Page_table
One additional annoyance is that the presence of several indices makes the initial import very, very slow, so we speed it up greatly by editing the dump with sed to remove all indices except the primary key. The import step took 17 minutes on my Lenovo ThinkPad P51 after these were removed; with the index definitions left in, it hadn't finished after 5 hours and I gave up. The following statement:
/*!40000 ALTER TABLE `page` DISABLE KEYS */;
does get executed on my system, as per What does /*!xxxxxx statement */ generated by mysqldump mean?, and is presumably added by mysqldump to speed up insertions, as it is faster to update the indices only once after all the inserts. So it seems that it is the index generation itself at the end that is super slow?
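To double check which indices actually ended up in the imported table, and hence what that end-of-import work consists of, MariaDB can list them with SHOW INDEX:
sudo mariadb enwiki -e 'SHOW INDEX FROM page'
With the sed-trimmed dump this should show only the PRIMARY index.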
The specific case of titles without redirects has been asked at: titles of all wikipedia articles without redirect
A related Wikipedia import question: Importing wikipedia dump to MySql
Just one more timing experiment: if I hack the dump to not have any indices at all, not even the primary key:
echo page
cat <<'EOF' | time mariadb enwiki
DROP TABLE IF EXISTS `page`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `page` (
  `page_id` int(8) unsigned NOT NULL,
  `page_namespace` int(11) NOT NULL DEFAULT 0,
  `page_title` varbinary(255) NOT NULL DEFAULT '',
  `page_is_redirect` tinyint(1) unsigned NOT NULL DEFAULT 0,
  `page_is_new` tinyint(1) unsigned NOT NULL DEFAULT 0,
  `page_random` double unsigned NOT NULL DEFAULT 0,
  `page_touched` binary(14) NOT NULL,
  `page_links_updated` varbinary(14) DEFAULT NULL,
  `page_latest` int(8) unsigned NOT NULL DEFAULT 0,
  `page_len` int(8) unsigned NOT NULL DEFAULT 0,
  `page_content_model` varbinary(32) DEFAULT NULL,
  `page_lang` varbinary(35) DEFAULT NULL
) ENGINE=InnoDB AUTO_INCREMENT=74951970 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED;
/*!40101 SET character_set_client = @saved_cs_client */;
EOF
echo 'insert'
grep -a -e '^INSERT ' enwiki-latest-page.sql | time mariadb enwiki
echo 'index'
time mariadb enwiki -e 'ALTER TABLE `page` ADD PRIMARY KEY `page_pk` (`page_id`)'
time mariadb enwiki -e 'ALTER TABLE `page` ADD UNIQUE KEY `page_name_title` (`page_namespace`,`page_title`)'
#time mariadb enwiki -e 'ALTER TABLE `page` ADD KEY `page_random` (`page_random`)'
#time mariadb enwiki -e 'ALTER TABLE `page` ADD KEY `page_len` (`page_len`)'
#time mariadb enwiki -e 'ALTER TABLE `page` ADD KEY `page_redirect_namespace_len` (`page_is_redirect`,`page_namespace`,`page_len`)'
echo
gives:
- 28 minutes for insert
- 20 minutes for add primary key
- 7 minutes for page_name_title
I'm losing patience with MySQL speeds.
I also tried to create an SQLite table manually and run the INSERT statements unmodified, but it doesn't work, because MySQL uses the \' extension to escape single quotes instead of the standard '' doubling.
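For completeness, here is a rough sketch of how that could be worked around, assuming the page table has already been created by hand in a hypothetical enwiki.sqlite3 database: translate the MySQL-style \' escapes into standard '' before piping the statements in. The naive substitution below would mangle the rare title that ends in a backslash, and it leaves MySQL's \\ and \" escapes untouched:
grep -a '^INSERT' enwiki-latest-page.sql |
# turn MySQL's \' quote escapes into standard SQL ''
sed "s/\\\\'/''/g" |
sqlite3 enwiki.sqlite3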
sed the enwiki-latest-page.sql file instead to go faster
If you really want to get rid of MySQL slowness and live the fast life of sed, here's an attempt:
cat enwiki-latest-page.sql |
# keep only the INSERT statements
grep -E '^INSERT' |
# split the huge multi-row statements into one row per line
sed 's/),(/\n/g' |
# keep only rows whose second field (page_namespace) is 0
grep -E '^[^,]+,0,' |
# strip everything up to the opening quote of page_title, drop the 8 trailing
# fields (none of them can contain commas), then print the bare title if the
# last remaining field (page_is_redirect) is 0
perl -lane "s/^[^']+'//;s/(,[^,]+){8}$//; if (substr(\$_, -1) == '0') { print substr(\$_, 0, -3) }" |
# undo the MySQL \\, \' and \" escapes
sed 's/\\\\/\\/g;s/\\'"'"'/'"'"'/g;s/\\"/"/g' \
  > titles-sed.txt
sort -o titles-sed.txt titles-sed.txt
That took about 5 minutes on my Lenovo ThinkPad P51, and it produced the exact same text file except for the handling of the very few titles that contain a backslash \, e.g.: https://en.wikipedia.org/wiki/TBWA%5CChiat%5CDay For some reason the MySQL output contains double backslashes for those. Not sure if that is hardcoded in the database itself, or an I/O problem. Looking at enwiki-latest-all-titles-in-ns0 I see that it contains TBWA\Chiat\Day with single backslashes, so it must have been a simple I/O issue.
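To see how few titles are affected, and to confirm the doubling, a quick check on the two output files (not part of the original steps):
# titles containing a backslash in the sed output
grep -c '\\' titles-sed.txt
# titles containing a doubled backslash in the MySQL output
grep -c '\\\\' titles.txt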
I later learned that this could be much improved with awk FPAT, along the lines of: What's the most robust way to efficiently parse CSV using awk?
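Here is a minimal sketch of that idea (untested against a full dump; the output name titles-fpat.txt is just illustrative). FPAT tells gawk that a field is either a run of non-comma characters or a single-quoted string with backslash escapes, so titles containing commas stay in one field; escape sequences inside titles are not undone here:
grep -aE '^INSERT' enwiki-latest-page.sql |
# one row per line, as in the sed approach
sed 's/),(/\n/g' |
# $2 = page_namespace, $3 = 'page_title' (still quoted), $4 = page_is_redirect
gawk 'BEGIN { FPAT = "([^,]+)|(\047([^\047\\\\]|\\\\.)*\047)" }
      $2 == 0 && $4 == 0 { print substr($3, 2, length($3) - 2) }' > titles-fpat.txt
As with the other variants, a final sort -o titles-fpat.txt titles-fpat.txt would be needed before comparing outputs.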
Here's an upload of the sed one: https://archive.org/details/enwiki-latest-all-titles-in-ns0-no-redirects
mysqldump-to-csv the enwiki-latest-page.sql file instead to go faster
https://mcmap.net/q/209527/-dump-all-tables-in-csv-format-using-39-mysqldump-39 pointed me to https://github.com/jamesmishra/mysqldump-to-csv which semi-hackily converts the dumps to CSV, allowing one to bypass the MySQL slowness:
git clone https://github.com/jamesmishra/mysqldump-to-csv
cd mysqldump-to-csv
git checkout 24301dfa739c13025844ed3ff5a8abe093ced6cc
patch <<'EOF'
diff --git a/mysqldump_to_csv.py b/mysqldump_to_csv.py
index b49cfe7..8d5bb2a 100644
--- a/mysqldump_to_csv.py
+++ b/mysqldump_to_csv.py
@@ -101,7 +101,8 @@ def main():
# listed in sys.argv[1:]
# or stdin if no args given.
try:
- for line in fileinput.input():
+ sys.stdin.reconfigure(errors='ignore')
+ for line in fileinput.input(encoding="utf-8", errors="ignore"):
# Look for an INSERT statement and parse it.
if is_insert(line):
values = get_values(line)
EOF
Then I add csvtool as per How to extract one column of a csv file + awk to filter the columns:
time zcat enwiki-latest-page.sql.gz |
python mysqldump-to-csv/mysqldump_to_csv.py |
# reorder the CSV to: page_namespace page_is_redirect page_title
csvtool format '%(2) %(4) %(3)\n' - |
# keep main-namespace non-redirects and print just the title
awk '$1==0 && $2==0 {print $3}' > titles-csv.txt
sort -o titles-csv.txt titles-csv.txt
This took 5.5 minutes on my Lenovo ThinkPad P51, and the outcome is identical to that of the sed approach, which is reassuring. So this is just as fast, but way less hacky, and it will also be able to deal with raw newlines in other tables that have them.
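The identity claim can be checked with a plain diff over the two sorted outputs; no output means no differences:
diff titles-sed.txt titles-csv.txt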
Get pages within the category hierarchy
I did some exploration on this related topic at: Wikipedia Category Hierarchy from dumps