Titles of all Wikipedia articles without redirects

I am trying to get a list of all Wikipedia article titles, excluding redirects.

Wikipedia says it has about 6,410k articles. I tried to get a list through https://dumps.wikimedia.org/enwiki/latest/ using the file enwiki-latest-all-titles-in-ns0.gz, but that file has more than 16 million entries, so it must include redirect titles.

As suggested by an answer, I tried using Quarry. I ran this simple query against the enwiki_p database:

select page_title from page where page_is_redirect = 0;

The problem is that every title with more than one word appears to be treated as a redirect, because this database seems to remove all spaces from page titles (see the sample from the database):

[Image: sample of the database]

How do I know whether a page is an actual redirect or is only treated as one because the spaces were removed?

Willhite answered 19/1, 2022 at 20:16 Comment(0)

There are other ways to do this. As a contributor to Wikimedia projects, you can run SQL queries on a site called Quarry. I'm not sure whether it can output a large result set, but with SQL access you can reliably filter out redirects.

UPDATE: As for AccessibleComputing, it really is a redirect. You can confirm this by visiting it through a link that prevents the redirect from being followed.
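
One way to check a single title by hand is MediaWiki's `redirect=no` URL parameter, which shows the redirect page itself instead of following it. A minimal sketch that only builds such a URL; the helper name `no_redirect_url` is made up for illustration, and the underscore handling assumes the usual convention that stored titles use underscores where the displayed title has spaces:

```python
from urllib.parse import urlencode


def no_redirect_url(title: str, host: str = "en.wikipedia.org") -> str:
    """Build a URL that opens a page without following its redirect."""
    # Stored titles use underscores where the displayed title has spaces.
    query = urlencode({"title": title.replace(" ", "_"), "redirect": "no"})
    return f"https://{host}/w/index.php?{query}"


print(no_redirect_url("AccessibleComputing"))
```

Opening the printed URL in a browser should show the redirect page rather than its target, which is enough to tell a real redirect from a title that merely looks odd.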

Basilisk answered 20/1, 2022 at 0:5 Comment(2)
This should be a comment, not an answer (you're just providing a reference link). - Bounded
Hey Alexander! Thanks for the suggestion! I tried using Quarry but ran into a challenge. I have updated my question with details; could you please have a look? - Willhite

Quarry was taking too long for me. I'm not sure whether it would ever have finished, or whether the results would have been downloadable.

I managed to do it instead by downloading the dump enwiki-latest-page.sql, which can be found in the dump listing at https://dumps.wikimedia.org/enwiki/latest/, importing it into MariaDB locally, and then running a simple query on it:

sudo mariadb enwiki -B -N -e 'select page_title from page where page_namespace = 0 and page_is_redirect = 0 order by page_title' > titles.txt

It took about half an hour, but it got the job done.
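
To see what the filter does without importing the full dump, here is a small sketch that runs the same query against a toy `page` table. It uses Python's built-in `sqlite3` as a stand-in for MariaDB, the four rows are made up for illustration, and the final step assumes that stored titles use underscores in place of spaces:

```python
import sqlite3

# In-memory stand-in for the imported `page` table; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page ("
    "page_namespace INTEGER, page_title TEXT, page_is_redirect INTEGER)"
)
conn.executemany(
    "INSERT INTO page VALUES (?, ?, ?)",
    [
        (0, "Computer_accessibility", 0),  # multi-word article title
        (0, "AccessibleComputing", 1),     # redirect
        (0, "Anarchism", 0),               # single-word article title
        (1, "Anarchism", 0),               # talk page, outside namespace 0
    ],
)

# Same filter as the mariadb command above.
rows = conn.execute(
    "SELECT page_title FROM page "
    "WHERE page_namespace = 0 AND page_is_redirect = 0 "
    "ORDER BY page_title"
).fetchall()

# Convert stored titles back to their display form.
titles = [title.replace("_", " ") for (title,) in rows]
print(titles)
```

The multi-word title survives the filter because `page_is_redirect` is a per-page flag and does not depend on how the title is spelled.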

I have provided further details at: https://mcmap.net/q/376391/-how-to-obtain-a-list-of-titles-of-all-wikipedia-articles

Lux answered 7/10, 2023 at 8:35 Comment(0)
