Wikipedia Category Hierarchy from dumps

Using Wikipedia's dumps I want to build a hierarchy for its categories. I have downloaded the main dump (enwiki-latest-pages-articles) and the category SQL dump (enwiki-latest-category). But I can't find the hierarchy information.

For example, the category SQL dump has an entry for each category, but I can't find anything about how the categories relate to each other.

The other dump (latest-pages-articles) lists the parent categories of each page, but in an unordered way: it just states all the parents.

I have seen wikiprep's category hierarchy (http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/)... How is that one constructed? Wikiprep lists the category ID, not its name. Is there a way to get the name for each ID?

Spector answered 2/7, 2013 at 17:42 Comment(1)
The above link is not working.Baltazar

The category hierarchy information in MediaWiki is stored in the categorylinks table, so you're going to need the categorylinks dump.

You're also going to need the page (not pages-articles) dump for page id to title mapping.
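
For a concrete picture of how the two tables fit together, here is a minimal sketch, assuming both dumps have been imported into a local SQLite database named enwiki.sqlite (a name chosen here for illustration) with the standard MediaWiki column names:

import sqlite3

con = sqlite3.connect('enwiki.sqlite')

def children(cat_title):
    # Members of a category: cl_to holds the parent category's title
    # (without the 'Category:' prefix), cl_from holds the member's page_id.
    return con.execute(
        '''select page_namespace, page_title
           from categorylinks
           join page on page_id = cl_from
           where cl_to = ?''', (cat_title,)).fetchall()

# Subcategories are the rows with page_namespace = 14; plain articles have 0.
for ns, title in children('Mathematics'):
    print(ns, title)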

Jointer answered 2/7, 2013 at 19:6 Comment(16)
Thanks! Been looking for that all night! When you said "page" you mean this one enwiki-latest-page.sql.gz? (dumps.wikimedia.org/enwiki/latest)Spector
@Spector Yeah, that's the one.Jointer
Sorry for bothering again with this topic; I am working on it, but not getting what I expected as a result. Is this correct: from categorylinks I get the pageId and its categories. Some pages will also be categories, so connecting all the links should result in a hierarchy of categories?Spector
I'm trying to do the same thing, but maybe the categorylinks schema has changed. It no longer has a pageId column. categorylinks now has [cl_from,cl_to,cl_sortkey,cl_timestamp,cl_sortkey_prefix,cl_collation,cl_type]. How can I build the hierarchy from this?Injury
@Injury If you look at the documentation for categorylinks, you'll see that the page_id is stored in the cl_from column (and always was).Jointer
@Jointer thank you. So categorylinks.cl_from = page.page_id. Where do I go from there? Let's take, for example, page_id=12, which is "Anarchism". There are 19 categorylinks where cl_from=12. The cl_to is a text field [Anarchism, Anti-capitalism,Anti-fascism,...]. How do I find the parent or child categories/pages?Injury
@Injury Find the page_id for the page with page_title = 'Category:Anarchism' and then look that up in categorylinks etc.Jointer
@Jointer Am I supposed to look up the cl_to values in category.cat_title? The issue is that not all of the cl_to values are in category.cat_title. For example, there is a category.cat_title=Anarchism but none for Anti-capitalism. And it just seems odd they didn't list the category.cat_id instead.Injury
@Jointer In my 34M page table, there is no page_title='Category:Anarchism'. There are 9 entries where page_title='Anarchism', however. Do I have the wrong page table, maybe?Injury
@Jointer In fact, I only have 12 entries in page where page_title LIKE 'Category:%'Injury
@Injury 34MB? page.sql.gz for the English Wikipedia is 1 GB. Maybe you downloaded it for a different wiki? And you don't need the category table at all (notice that I never mentioned it), and especially not the cat_id (it's not related to page_id).Jointer
Let us continue this discussion in chat.Jointer
@fersarr: Could you kindly let me know how you created the hierarchy?Ginaginder
Is there a way to obtain all the subcategories and pages from a Category using those dumps, or do I need more sql tables? @JointerIndividually
Since the OP has closed the chat, here's how you can determine which categories are hidden: "The status of hidden categories is stored in the page_props table as the property "hiddencat" in pp_propname" (see the sketch after these comments). P.S. Wikimedia has an excellent description of all its tables: mediawiki.org/wiki/Category:MediaWiki_database_tablesPaper
I needed the page and categorylinks tables joined, so I used Quarry. Here's the result: quarry.wmcloud.org/query/73496. It has three columns: from, to, and relation. However, it only includes category-to-category links (no page members), so all the relations are subcat.Roxie
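
Regarding the hidden-category comment above: a minimal sketch, assuming the page_props dump (enwiki-latest-page_props.sql.gz) has also been imported into the same SQLite database as page, with its standard columns pp_page, pp_propname, pp_value:

import sqlite3

con = sqlite3.connect('enwiki.sqlite')

# Hidden categories are category pages (namespace 14) whose page_props row
# has pp_propname = 'hiddencat'.
hidden = {title for (title,) in con.execute(
    '''select page_title
       from page
       join page_props on pp_page = page_id
       where page_namespace = 14 and pp_propname = 'hiddencat' ''')}
print(len(hidden), 'hidden categories')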

Loading the dumps of categorylinks etc. to build a Wikipedia hierarchy takes very long (even if it is interesting).

I found a fast path that gives good results: I rely on Wikipedia's vital articles hierarchy. See, for instance, sensimark for an example use.
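
As a minimal sketch of that shortcut, the vital articles listing can be fetched straight from the live wiki through the MediaWiki API (action=parse with prop=wikitext is standard; the crude link extraction below is only an illustration):

import re
import requests

# Fetch the wikitext of the top-level "Wikipedia:Vital articles" page.
resp = requests.get('https://en.wikipedia.org/w/api.php', params={
    'action': 'parse',
    'page': 'Wikipedia:Vital_articles',
    'prop': 'wikitext',
    'format': 'json',
    'formatversion': 2,
})
wikitext = resp.json()['parse']['wikitext']

# Wiki links look like [[Title]] or [[Title|label]]; grab the titles.
titles = re.findall(r'\[\[([^\]|#]+)', wikitext)
print(len(titles), 'linked titles, e.g.', titles[:5])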

Buchholz answered 14/9, 2018 at 9:41 Comment(0)

How the data is encoded

You need two tables:

  • page (from enwiki-latest-page.sql.gz)
  • categorylinks (from enwiki-latest-categorylinks.sql.gz)

The way things are linked up is:

  • categorylinks.cl_from (integer) points to the page.page_id (unique integer) of the child page or category
  • categorylinks.cl_to (text) points to the page.page_title (text, unique together with namespace) of the parent category

Here's a concrete example:

mysql enwiki -e "select page_id, page_namespace, page_title, page_is_redirect from page where page_namespace in (0, 14) and page_title in ('Albert_Einstein')"

gives:

+---------+----------------+-----------------+------------------+
| page_id | page_namespace | page_title      | page_is_redirect |
+---------+----------------+-----------------+------------------+
|     736 |              0 | Albert_Einstein |                0 |
| 2504643 |             14 | Albert_Einstein |                0 |
+---------+----------------+-----------------+------------------+

so we see that there is both:

  • an article (page_namespace 0) titled Albert_Einstein, with page_id 736
  • a category (page_namespace 14) titled Albert_Einstein, with page_id 2504643

The namespace IDs are listed at: https://en.wikipedia.org/wiki/Wikipedia:Namespace

Next:

mysql enwiki -e "select cl_from, cl_to from categorylinks where cl_from in (736, 2504643)"

gives:

+---------+--------------------------------------------------------------------------------------------------------------+                                                                    
| cl_from | cl_to                                                                                                        |                                                                    
+---------+--------------------------------------------------------------------------------------------------------------+                                                                    
|     736 | 1879_births                                                                                                  |                                                                    
|     736 | 1955_deaths                                                                                                  |                                                                    
|     736 | 19th-century_German_Jews                                                                                     |                                                                    
|     736 | 20th-century_American_engineers                                                                              |
...
|     736 | Winners_of_the_Max_Planck_Medal                                                                              |
|     736 | Württemberger_emigrants_to_the_United_States                                                                 |
| 2504643 | Commons_category_link_is_on_Wikidata                                                                         |
| 2504643 | Wikipedia_categories_named_after_American_scientists                                                         |
| 2504643 | Wikipedia_categories_named_after_German_scientists                                                           |
| 2504643 | Wikipedia_categories_named_after_Swiss_people                                                                |
| 2504643 | Wikipedia_categories_named_after_physicists                                                                  |
+---------+--------------------------------------------------------------------------------------------------------------+

therefore, e.g.:

  • the article Albert_Einstein (page_id 736) belongs to the category 1879_births
  • the category Category:Albert_Einstein (page_id 2504643) belongs to the category Wikipedia_categories_named_after_physicists

Actually getting the tree

There are many possible approaches to getting the data out; this was my path of lowest resistance.

MySQL was feeling a bit slower than it should, so I just reverted to my beloved SQLite (see also: How to obtain a list of titles of all Wikipedia articles).

PostgreSQL and MySQL have better recursive query support (and PostgreSQL can track the visited path in an array), which would allow for a nicer pure-SQL solution, but it was feeling like a Python loop would provide greater flexibility anyway, so I just went for that.
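
For reference, a pure-SQL walk is also possible with a recursive common table expression, which SQLite supports as well; here is a minimal sketch against the enwiki.sqlite database built below (the starting category Mathematics and the depth limit of 3 are just illustrative):

import sqlite3

con = sqlite3.connect('enwiki.sqlite')

# Collect subcategories of Category:Mathematics up to 3 levels deep.
# The depth column bounds the recursion, and "select distinct" removes the
# duplicates that arise because the category graph is not a tree.
rows = con.execute('''
with recursive subcat(title, depth) as (
  values('Mathematics', 0)
  union
  select page_title, depth + 1
  from categorylinks
  join page on page_id = cl_from
  join subcat on cl_to = subcat.title
  where page_namespace = 14 and subcat.depth < 3
)
select distinct title from subcat
''').fetchall()
print(len(rows), 'distinct subcategory titles (including the root)')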

I imported to SQLite with the help of https://github.com/jamesmishra/mysqldump-to-csv:

git clone https://github.com/jamesmishra/mysqldump-to-csv
cd mysqldump-to-csv
git checkout 24301dfa739c13025844ed3ff5a8abe093ced6cc
patch <<'EOF'
diff --git a/mysqldump_to_csv.py b/mysqldump_to_csv.py
index b49cfe7..8d5bb2a 100644
--- a/mysqldump_to_csv.py
+++ b/mysqldump_to_csv.py
@@ -101,7 +101,8 @@ def main():
     # listed in sys.argv[1:]
     # or stdin if no args given.
     try:
-        for line in fileinput.input():
+        sys.stdin.reconfigure(errors='ignore')
+        for line in fileinput.input(encoding="utf-8", errors="ignore"):
             # Look for an INSERT statement and parse it.
             if is_insert(line):
                 values = get_values(line)
EOF

The patch is to overcome binary data madness in categorylinks: https://github.com/jamesmishra/mysqldump-to-csv/issues/17

Then I get the files:

wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-categorylinks.sql.gz
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.sql.gz

and import into SQLite with:

import-sqlite.sh

db=enwiki.sqlite
rm -f "$db"

# page
sqlite3 "$db" 'create table "page"("page_id" integer, "page_namespace" integer, "page_title" text, "page_is_redirect" integer, "page_len" integer)'
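# Keep dump columns 1,2,3,4,10, which (in the current page table layout) are
# page_id, page_namespace, page_title, page_is_redirect, page_len.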
time zcat enwiki-latest-page.sql.gz | python mysqldump-to-csv/mysqldump_to_csv.py | csvtool col 1,2,3,4,10 - | sqlite3 "$db" ".import --csv '|cat -' page"
du -h "$db"
time sqlite3 "$db" 'create unique index "page_id" on "page"("page_id")'
du -h "$db"
time sqlite3 "$db" 'create index "page_namespace_title" on "page"("page_namespace", "page_title")'
du -h "$db"

# categorylinks
sqlite3 "$db" 'create table categorylinks("cl_from" integer, "cl_to" text)'
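# categorylinks columns 1,2 are cl_from, cl_to.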
time zcat enwiki-latest-categorylinks.sql.gz | python mysqldump-to-csv/mysqldump_to_csv.py | csvtool col 1,2 - | sqlite3 "$db" ".import --csv '|cat -' categorylinks"
du -h "$db"
time sqlite3 "$db" 'create index "categorylinks_to" on categorylinks("cl_to")'
du -h "$db"

On my Lenovo ThinkPad P51:

+ sqlite3 enwiki.sqlite '.import --csv '\''|cat -'\'' page'

real    6m25.156s
user    9m57.253s
sys     0m25.416s
+ du -h enwiki.sqlite
1.6G    enwiki.sqlite
+ sqlite3 enwiki.sqlite 'create unique index "page_id" on "page"("page_id")'

real    1m6.278s
user    0m28.639s
sys     0m35.041s
+ du -h enwiki.sqlite
2.0G    enwiki.sqlite
+ sqlite3 enwiki.sqlite 'create index "page_namespace_title" on "page"("page_namespace", "page_title")'

real    2m5.788s
user    1m18.303s
sys     0m25.823s
+ du -h enwiki.sqlite
3.0G    enwiki.sqlite
+ sqlite3 enwiki.sqlite 'create table categorylinks("cl_from" integer, "cl_to" text)'
+ zcat enwiki-latest-categorylinks.sql.gz
+ python mysqldump-to-csv/mysqldump_to_csv.py
+ csvtool col 1,2 -
+ sqlite3 enwiki.sqlite '.import --csv '\''|cat -'\'' categorylinks'

real    20m30.865s
user    29m13.953s
sys     1m6.885s
+ du -h enwiki.sqlite
6.8G    enwiki.sqlite
+ sqlite3 enwiki.sqlite 'create index "categorylinks_to" on categorylinks("cl_to")'

real    6m36.867s
user    3m12.710s
sys     2m13.543s
+ du -h enwiki.sqlite
8.1G    enwiki.sqlite

Then I walk the category tree preorder depth first with this script and output a few graph representations:

sqlite_preorder.py

#!/usr/bin/env python

import argparse
import html
import sys
import sqlite3
import re
import os.path
from pathlib import Path

def escape_dot(s):
    return s.replace('"', '\\"')

NAMESPACE_TO_TEXT = {
    0: '',
    14: 'Category:',
}

def ns_to_txt(ns):
    if ns in NAMESPACE_TO_TEXT:
        return NAMESPACE_TO_TEXT[ns]
    else:
        return str(ns)

parser = argparse.ArgumentParser()
parser.add_argument('-d', '--depth', type=int)
parser.add_argument('db')
parser.add_argument('titles', nargs='+')
args = parser.parse_args()
out_dot = True
out_txt = True
out_html = True
outdir = 'out'

con = sqlite3.connect(args.db)
cur = con.cursor()
todo = list(map(lambda title: (14, title, 0), args.titles.copy()))
main_title = args.titles[0]
main_title_human = main_title.replace('_', ' ') + ' - Wikipedia CatTree'
basename = os.path.join(outdir, main_title)
visited = set()
Path(outdir).mkdir(parents=True, exist_ok=True)
if out_txt:
    out_txt_f = open(f'{basename}.txt', 'w')
if out_dot:
    out_dot_f = open(f'{basename}.dot', 'w')
    out_dot_f.write('digraph {\n')
if out_html:
    out_html_f = open(f'{basename}.html', 'w')
    out_html_f.write(f'''<!doctype html>
<html lang=en>
<head>
<meta charset=utf-8>
<title>{main_title_human}</title>
<style>
a {{ text-decoration: none; }}
details {{ margin-left: 1em; }}
summary {{ margin-bottom: 0.4em; }}
</style>
</head>
<body>
<h1>{main_title_human}</h1>
''')
last_depth = 0
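# Iterative preorder depth-first walk over the category graph; todo is an
# explicit stack of (namespace, title, depth) tuples.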
while len(todo):
    namespace, title, depth = todo.pop()
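    # When the walk moves back up or stays at the same depth, the HTML output
    # must close the corresponding number of <details> elements.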
    depth_delta = depth - last_depth
    if depth_delta <= 0:
        repeat_close = -depth_delta + 1
    else:
        repeat_close = 0
    last_depth = depth
    path_last = ns_to_txt(namespace) + title
    if out_txt:
        out_txt_f.write('{}{} {}\n'.format(' ' * depth, depth, path_last))
    if out_html:
        out_html_f.write('</details>\n' * repeat_close)
        out_html_f.write(f'<details open="true"><summary><a href="https://en.wikipedia.org/wiki/{html.escape(path_last)}">{html.escape(path_last.replace("_", " "))}</a></summary>\n')
    visited.add((namespace, title))
    print(len(visited), file=sys.stderr)
    if namespace == 14:
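        # Only categories (namespace 14) have children: fetch every page whose
        # categorylinks row points at this category's title.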
        for child_namespace, child_title in cur.execute('''
select page_namespace, page_title from categorylinks
inner join page on cl_from = page_id and cl_to = ?
order by page_namespace asc, page_title desc
''', (title,)).fetchall():
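            # Skip nodes that were already visited (the category graph can
            # contain cycles) and stop at the optional depth limit.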
            if not (child_namespace, child_title) in visited and not (args.depth is not None and depth == args.depth):
                if out_dot:
                    out_dot_f.write('"{}{}"->"{}{}";\n'.format(ns_to_txt(namespace), escape_dot(title), ns_to_txt(child_namespace), escape_dot(child_title)))
                todo.append((child_namespace, child_title, depth + 1))
if out_txt:
    out_txt_f.close()
if out_dot:
    out_dot_f.write('}\n')
    out_dot_f.close()
if out_html:
    out_html_f.write('</details>\n' * last_depth)
    out_html_f.write('''</body>
</html>
''')
    out_html_f.close()

Sample usage to walk down from https://commons.wikimedia.org/wiki/Category:Mathematics, up to 6 levels deep:

time ./sqlite_preorder.py -d6 enwiki.sqlite Mathematics

This walked 82460 pages in 2m30.895s.

It produces the following output files illustrating some basic visualization ideas:

  • out/Mathematics.html: collapsible table of contents with details and summary HTML elements and clickable links to Wikipedia. I'm quite proud of this one! Live demo: https://cirosantilli.com/wikipedia-cattree/Mathematics

  • out/Mathematics.txt: simple plaintext tree, e.g.:

    0 Category:Mathematics
    1 Category:Fields_of_mathematics
      2 Category:Algebra
       3 Category:Abstract_algebra
        4 Category:Abstract_algebra_stubs
         5 Category:Algebraic_geometry_stubs
          6 10Algebraic-geometry-stub
          6 3-fold
          6 Abelian_surface
          6 Abhyankar–Moh_theorem
          6 Abundance_conjecture
    
  • out/Mathematics.dot: Graphviz format, which is understood by a few different applications

Tested on Ubuntu 23.04.

Atlas answered 18/10, 2023 at 4:28 Comment(0)
