How the data is encoded
You need two tables: categorylinks (https://www.mediawiki.org/wiki/Manual:Categorylinks_table) and page (https://www.mediawiki.org/wiki/Manual:Page_table).
The way things are linked up is:
categorylinks.cl_from
: (integer) points to the page.page_id (unique integer) of the child page or category
categorylinks.cl_to
: (text) points to the page.page_title (text, unique together with namespace) of the parent category
Here's a concrete example:
mysql enwiki -e "select page_id, page_namespace, page_title, page_is_redirect from page where page_namespace in (0, 14) and page_title in ('Albert_Einstein')"
gives:
+---------+----------------+-----------------+------------------+
| page_id | page_namespace | page_title      | page_is_redirect |
+---------+----------------+-----------------+------------------+
|     736 |              0 | Albert_Einstein |                0 |
| 2504643 |             14 | Albert_Einstein |                0 |
+---------+----------------+-----------------+------------------+
so we see that there is both an article (namespace 0) and a category (namespace 14) titled Albert_Einstein, each with its own page_id.
The namespace IDs are listed at: https://en.wikipedia.org/wiki/Wikipedia:Namespace
Next:
mysql enwiki -e "select cl_from, cl_to from categorylinks where cl_from in (736, 2504643)"
gives:
+---------+------------------------------------------------------+
| cl_from | cl_to                                                |
+---------+------------------------------------------------------+
|     736 | 1879_births                                          |
|     736 | 1955_deaths                                          |
|     736 | 19th-century_German_Jews                             |
|     736 | 20th-century_American_engineers                      |
...
|     736 | Winners_of_the_Max_Planck_Medal                      |
|     736 | Württemberger_emigrants_to_the_United_States         |
| 2504643 | Commons_category_link_is_on_Wikidata                 |
| 2504643 | Wikipedia_categories_named_after_American_scientists |
| 2504643 | Wikipedia_categories_named_after_German_scientists   |
| 2504643 | Wikipedia_categories_named_after_Swiss_people        |
| 2504643 | Wikipedia_categories_named_after_physicists          |
+---------+------------------------------------------------------+
therefore we learn, e.g., that the article Albert_Einstein is a member of Category:1879_births, and that Category:Albert_Einstein is itself a member of Category:Wikipedia_categories_named_after_physicists.
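Conversely, the direct children of a category can be listed by joining categorylinks back into page, e.g. for the members of Category:Albert_Einstein (this is the same join the walker script further down uses):
mysql enwiki -e "select page_namespace, page_title from categorylinks inner join page on cl_from = page_id where cl_to = 'Albert_Einstein'"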
Actually getting the tree
There are many possible approaches to getting the data out; this was my path of least resistance.
MySQL was feeling a bit slower than it should, so I just reverted to my beloved SQLite: How to obtain a list of titles of all Wikipedia articles
PostgreSQL and MySQL have better recursive query support (e.g. PostgreSQL's arrays allow for nicer cycle handling within a single query),
but it was feeling like a Python loop would provide greater flexibility anyways, so I just went for that.
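For reference, SQLite does support plain WITH RECURSIVE too (just not the array tricks), so once the database below is built, a depth-bounded sketch of the same walk could look like:
sqlite3 enwiki.sqlite "
with recursive tree(depth, namespace, title) as (
  select 0, 14, 'Mathematics'
  union all
  select tree.depth + 1, page.page_namespace, page.page_title
  from tree
  join categorylinks on cl_to = tree.title
  join page on page_id = cl_from
  where tree.namespace = 14 and tree.depth < 2
)
select depth, namespace, title from tree;
"
The hardcoded depth bound stands in for proper cycle detection, and union all lists pages once per path reaching them, which is part of why the Python loop with a visited set won out.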
I imported to SQLite with the help of https://github.com/jamesmishra/mysqldump-to-csv:
git clone https://github.com/jamesmishra/mysqldump-to-csv
cd mysqldump-to-csv
git checkout 24301dfa739c13025844ed3ff5a8abe093ced6cc
patch <<'EOF'
diff --git a/mysqldump_to_csv.py b/mysqldump_to_csv.py
index b49cfe7..8d5bb2a 100644
--- a/mysqldump_to_csv.py
+++ b/mysqldump_to_csv.py
@@ -101,7 +101,8 @@ def main():
# listed in sys.argv[1:]
# or stdin if no args given.
try:
- for line in fileinput.input():
+ sys.stdin.reconfigure(errors='ignore')
+ for line in fileinput.input(encoding="utf-8", errors="ignore"):
# Look for an INSERT statement and parse it.
if is_insert(line):
values = get_values(line)
EOF
The patch is to overcome binary data madness in categorylinks: https://github.com/jamesmishra/mysqldump-to-csv/issues/17
Then I get the files:
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-categorylinks.sql.gz
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.sql.gz
and import into SQLite with:
import-sqlite.sh
db=enwiki.sqlite
rm -f "$db"
# page
sqlite3 "$db" 'create table "page"("page_id" integer, "page_namespace" integer, "page_title" text, "page_is_redirect" integer, "page_len" integer)'
time zcat enwiki-latest-page.sql.gz | python mysqldump-to-csv/mysqldump_to_csv.py | csvtool col 1,2,3,4,10 - | sqlite3 "$db" ".import --csv '|cat -' page"
du -h "$db"
time sqlite3 "$db" 'create unique index "page_id" on "page"("page_id")'
du -h "$db"
time sqlite3 "$db" 'create index "page_namespace_title" on "page"("page_namespace", "page_title")'
du -h "$db"
# categorylinks
sqlite3 "$db" 'create table categorylinks("cl_from" integer, "cl_to" text)'
time zcat enwiki-latest-categorylinks.sql.gz | python mysqldump-to-csv/mysqldump_to_csv.py | csvtool col 1,2 - | sqlite3 "$db" ".import --csv '|cat -' categorylinks"
du -h "$db"
time sqlite3 "$db" 'create index "categorylinks_to" on categorylinks("cl_to")'
du -h "$db"
On my Lenovo ThinkPad P51:
+ sqlite3 enwiki.sqlite '.import --csv '\''|cat -'\'' page'
real 6m25.156s
user 9m57.253s
sys 0m25.416s
+ du -h enwiki.sqlite
1.6G enwiki.sqlite
+ sqlite3 enwiki.sqlite 'create unique index "page_id" on "page"("page_id")'
real 1m6.278s
user 0m28.639s
sys 0m35.041s
+ du -h enwiki.sqlite
2.0G enwiki.sqlite
+ sqlite3 enwiki.sqlite 'create index "page_namespace_title" on "page"("page_namespace", "page_title")'
real 2m5.788s
user 1m18.303s
sys 0m25.823s
+ du -h enwiki.sqlite
3.0G enwiki.sqlite
+ sqlite3 enwiki.sqlite 'create table categorylinks("cl_from" integer, "cl_to" text)'
+ zcat enwiki-latest-categorylinks.sql.gz
+ python mysqldump-to-csv/mysqldump_to_csv.py
+ csvtool col 1,2 -
+ sqlite3 enwiki.sqlite '.import --csv '\''|cat -'\'' categorylinks'
real 20m30.865s
user 29m13.953s
sys 1m6.885s
+ du -h enwiki.sqlite
6.8G enwiki.sqlite
+ sqlite3 enwiki.sqlite 'create index "categorylinks_to" on categorylinks("cl_to")'
real 6m36.867s
user 3m12.710s
sys 2m13.543s
+ du -h enwiki.sqlite
8.1G enwiki.sqlite
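As a sanity check, the earlier MySQL query can be replayed against the SQLite copy:
sqlite3 enwiki.sqlite "select page_id, page_namespace, page_title, page_is_redirect from page where page_namespace in (0, 14) and page_title = 'Albert_Einstein'"
which should print the same two rows as before.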
Then I walk the category tree preorder depth-first with this script and output a few graph representations:
sqlite_preorder.py
#!/usr/bin/env python
import argparse
import html
import sys
import sqlite3
import re
import os.path
from pathlib import Path
def escape_dot(s):
return s.replace('"', '\\"')
NAMESPACE_TO_TEXT = {
0: '',
14: 'Category:',
}
def ns_to_txt(ns):
if ns in NAMESPACE_TO_TEXT:
return NAMESPACE_TO_TEXT[ns]
else:
return str(ns)
parser = argparse.ArgumentParser()
parser.add_argument('-d', '--depth', type=int)
parser.add_argument('db')
parser.add_argument('titles', nargs='+')
args = parser.parse_args()
out_dot = True
out_txt = True
out_html = True
outdir = 'out'
con = sqlite3.connect(args.db)
cur = con.cursor()
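# LIFO stack of (namespace, title, depth) tuples driving the preorder DFS;
# the seed titles are taken to be categories (namespace 14).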
todo = [(14, title, 0) for title in args.titles]
main_title = args.titles[0]
main_title_human = main_title.replace('_', ' ') + ' - Wikipedia CatTree'
basename = os.path.join(outdir, main_title)
visited = set()
Path(outdir).mkdir(parents=True, exist_ok=True)
if out_txt:
out_txt_f = open(f'{basename}.txt', 'w')
if out_dot:
out_dot_f = open(f'{basename}.dot', 'w')
out_dot_f.write('digraph {\n')
if out_html:
out_html_f = open(f'{basename}.html', 'w')
out_html_f.write(f'''<!doctype html>
<html lang=en>
<head>
<meta charset=utf-8>
<title>{main_title_human}</title>
<style>
a {{ text-decoration: none; }}
details {{ margin-left: 1em; }}
summary {{ margin-bottom: 0.4em; }}
</style>
</head>
<body>
<h1>{main_title_human}</h1>
''')
last_depth = 0
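# Comparing each node's depth with last_depth tells us how many still-open
# <details> elements need closing when the DFS moves back up the tree.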
while len(todo):
namespace, title, depth = todo.pop()
depth_delta = depth - last_depth
if depth_delta <= 0:
repeat_close = -depth_delta + 1
else:
repeat_close = 0
last_depth = depth
path_last = ns_to_txt(namespace) + title
if out_txt:
out_txt_f.write('{}{} {}\n'.format(' ' * depth, depth, path_last))
if out_html:
out_html_f.write('</details>\n' * repeat_close)
out_html_f.write(f'<details open="true"><summary><a href="https://en.wikipedia.org/wiki/{html.escape(path_last)}">{html.escape(path_last.replace("_", " "))}</a></summary>\n')
visited.add((namespace, title))
print(len(visited), file=sys.stderr)
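    # Only category pages (namespace 14) have children in categorylinks.
    # Children come back in reverse visiting order, so the LIFO pop visits
    # subcategories first, then pages, each in ascending title order.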
if namespace == 14:
for child_namespace, child_title in cur.execute('''
select page_namespace, page_title from categorylinks
inner join page on cl_from = page_id and cl_to = ?
order by page_namespace asc, page_title desc
''', (title,)).fetchall():
            if (child_namespace, child_title) not in visited and (args.depth is None or depth != args.depth):
if out_dot:
out_dot_f.write('"{}{}"->"{}{}";\n'.format(ns_to_txt(namespace), escape_dot(title), ns_to_txt(child_namespace), escape_dot(child_title)))
todo.append((child_namespace, child_title, depth + 1))
if out_txt:
out_txt_f.close()
if out_dot:
out_dot_f.write('}\n')
out_dot_f.close()
if out_html:
out_html_f.write('</details>\n' * last_depth)
out_html_f.write('''</body>
</html>
''')
    out_html_f.close()
Sample usage to walk down from https://en.wikipedia.org/wiki/Category:Mathematics up to 6 levels down:
time ./sqlite_preorder.py -d6 enwiki.sqlite Mathematics
This walked 82460 pages in 2m30.895s.
It produces the following output files illustrating some basic visualization ideas:
out/Mathematics.html
: collapsible table of contents with details and summary HTML elements and clickable links to Wikipedia. I'm quite proud of this one! Live demo: https://cirosantilli.com/wikipedia-cattree/Mathematics
out/Mathematics.txt
: simple plaintext tree, e.g.:
0 Category:Mathematics
 1 Category:Fields_of_mathematics
  2 Category:Algebra
   3 Category:Abstract_algebra
    4 Category:Abstract_algebra_stubs
     5 Category:Algebraic_geometry_stubs
      6 10Algebraic-geometry-stub
      6 3-fold
      6 Abelian_surface
      6 Abhyankar–Moh_theorem
      6 Abundance_conjecture
out/Mathematics.dot
: Graphviz format, which is understood by a few different applications
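Assuming Graphviz is installed, smaller outputs can be rendered with e.g.:
dot -Tsvg out/Mathematics.dot -o out/Mathematics.svg
though with the tens of thousands of nodes that -d6 produces the layout can take impractically long, so a shallower depth is a safer first try.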
Tested on Ubuntu 23.04.