Wikipedia Mediawiki API get Pageid from URL

I have a set of full urls like

http://en.wikipedia.org/wiki/Episkopi_Bay
http://en.wikipedia.org/wiki/Monte_Lauro
http://en.wikipedia.org/wiki/Lampedusa
http://en.wikipedia.org/wiki/Himera
http://en.wikipedia.org/wiki/Lago_Cecita
http://en.wikipedia.org/wiki/Aspromonte

I want to find the Wikipedia pageids for these URLs. I have used the MediaWiki API before, but I can't figure out how to do this.

I have tried extracting the page title from each URL by taking the substring between the last "/" and the end of the string, and then querying the API to get the pageid.

http://en.wikipedia.org/wiki/Episkopi_Bay --> Episkopi_Bay
http://en.wikipedia.org/wiki/Monte_Lauro --> Monte_Lauro
http://en.wikipedia.org/wiki/Lampedusa --> Lampedusa
http://en.wikipedia.org/wiki/Himera --> Himera
http://en.wikipedia.org/wiki/Lago_Cecita --> Lago_Cecita
http://en.wikipedia.org/wiki/Aspromonte --> Aspromonte

But the problem is that some of my links might be redirects, and hence the substring might not always be the title of the page.

TL;DR: How can I find the pageid of a Wikipedia page from a URL?

Florist answered 28/7, 2015 at 17:43 Comment(0)

I’m not sure whether what you call "page id" is the identification number of the page (e.g. 15580374 for the English Wikipedia’s Main Page, found under "Page information" in the toolbox in the left column) or the normalised title of a page with redirects resolved. The answer below covers both.

You can use the API with action=query, e.g. https://en.wikipedia.org/w/api.php?action=query&titles=Main%20Page, where you will find minimal information, including the page id (number).

You may also want to handle more complex cases: title normalisation and/or redirects. Title normalisation (initial capital, underscores changed to spaces, various Unicode normalisations if I recall correctly, etc.) is included out of the box. For redirects, you have to ask specifically by adding "&redirects" to the URL (note that double redirects (a redirect to a redirect) won’t be resolved, but those should not exist anyway). Example: https://en.wikipedia.org/w/api.php?action=query&titles=main_page&redirects

If you need more information, you can look at https://en.wikipedia.org/w/api.php?action=help&modules=query%2Binfo.
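As a sketch of the approach above: the JSON below is an abridged, assumed shape of what the API returns for a titles=main_page&redirects&format=json query (a live call needs network access; the pageids_from_query helper is hypothetical):

```python
def pageids_from_query(reply):
    """Map each returned title to its numeric page id."""
    return {p["title"]: p["pageid"]
            for p in reply["query"]["pages"].values()
            if "pageid" in p}  # missing titles carry no "pageid" key

# Abridged, assumed reply shape: normalisation and redirect steps are
# reported alongside the resolved page.
sample = {
    "query": {
        "normalized": [{"from": "main_page", "to": "Main page"}],
        "redirects": [{"from": "Main page", "to": "Main Page"}],
        "pages": {
            "15580374": {"pageid": 15580374, "ns": 0, "title": "Main Page"}
        },
    }
}

print(pageids_from_query(sample))  # {'Main Page': 15580374}
```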

Comradery answered 28/7, 2015 at 19:8 Comment(2)
Thanks for the answer. I know about both of these methods, but neither helps my cause: both of your answers need a page_title to work with. I don't have a page_title, I only have the URL. That is where the problem lies. URLs can't be translated to page_titles by taking a substring. Also, page_titles can contain non-English UTF-8 encoded text, which won't necessarily be there in the URL and shows up as a bunch of transliterated text.Florist
OK. So you have to first extract the substring as you said, then call the API to normalise the title and resolve the redirects (this works even with %-encoded titles like ar.wikipedia.org/w/…), and in the case of non-Latin characters you have to decode the returned string as UTF-8 (e.g. for the French word "Café" the API returns "title": "Caf\u00e9", where "é" is Unicode U+00E9).Comradery
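The extraction step described in this comment can be sketched with a hypothetical helper; it assumes the /wiki/ article path used by Wikipedia, and unquote handles the %-encoded case (the resulting title would then be sent to the API for normalisation and redirect resolution):

```python
from urllib.parse import unquote, urlsplit

def title_from_url(url, prefix="/wiki/"):
    """Take the path after the article prefix and %-decode it."""
    path = urlsplit(url).path   # e.g. '/wiki/Caf%C3%A9'
    raw = path[len(prefix):]    # 'Caf%C3%A9'
    return unquote(raw)         # 'Café'

print(title_from_url("http://en.wikipedia.org/wiki/Episkopi_Bay"))  # Episkopi_Bay
print(title_from_url("https://fr.wikipedia.org/wiki/Caf%C3%A9"))    # Café
```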

You can add &indexpageids to your query.

For example

https://en.wikipedia.org/w/api.php?action=query&format=json&titles=Main%20Pages&indexpageids

or, if you are looking for a summary at the same time, here's a more comprehensive example:

https://en.wikipedia.org/w/api.php?action=query&format=json&titles=barberton%20daisy&prop=extracts&exintro&explaintext&redirects=1&indexpageids

Then, if you parse the JSON, you will see a property named pageids under query.
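For illustration, here is an abridged, assumed shape of the reply, and how the pageids index removes the need to guess the keys of the pages dictionary:

```python
# Abridged, assumed shape of the reply to ...&titles=Main%20Page&indexpageids:
sample = {
    "batchcomplete": "",
    "query": {
        "pageids": ["15580374"],  # the extra index added by &indexpageids
        "pages": {
            "15580374": {"pageid": 15580374, "ns": 0, "title": "Main Page"}
        },
    },
}

# The index lets you iterate without knowing the page-keyed dict keys:
for pid in sample["query"]["pageids"]:
    print(pid, sample["query"]["pages"][pid]["title"])
```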

Rutilant answered 28/5, 2020 at 19:55 Comment(0)

If you have only the URL and don't know anything about the wiki, you cannot assume that the part after the last / is the page title, since MediaWiki page names may contain /. Instead, you will have to start by querying the siteinfo API:

https://www.mediawiki.org/wiki/API:Siteinfo

In the reply, query.general.server and query.general.articlepath combined give you the URL structure, and query.general.script gives you the script path. Depending on where your URLs come from, you will need both, to account for the default form //mywiki/scriptpath/index.php?title=Namespace:Foo/Bar and the short-URL form //mywiki/articlepath/Namespace:Foo/Bar, for an article named Foo/Bar.

To make matters worse, the slash in the “article name” can be either part of the name, or a delimiter for a subpage, depending on the settings of that namespace!

If you know the URL syntax of the wikis at hand, @Seb35 already answered all your questions.
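A sketch of using the siteinfo fields described above (the sample reply is abridged and assumed; the Namespace:Foo/Bar URL is hypothetical):

```python
# Abridged, assumed shape of action=query&meta=siteinfo&format=json:
sample = {
    "query": {
        "general": {
            "server": "https://en.wikipedia.org",
            "articlepath": "/wiki/$1",
            "script": "/w/index.php",
        }
    }
}

general = sample["query"]["general"]
# Prefix for short-form article URLs on this wiki:
article_prefix = general["server"] + general["articlepath"].split("$1")[0]
print(article_prefix)  # https://en.wikipedia.org/wiki/

url = "https://en.wikipedia.org/wiki/Namespace:Foo/Bar"
title = url[len(article_prefix):]
print(title)  # Namespace:Foo/Bar (the slash stays in the name)
```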

Evelynneven answered 5/8, 2015 at 11:50 Comment(0)

I'll just paste some working code here for others coming across this page via Google. I couldn't find a way to do this through the API, so this snippet fetches the actual page and extracts the page_id from there, using BeautifulSoup and a regex.

import re
import time

import requests
from bs4 import BeautifulSoup

# Here list_of_urls is the list of URLs
#     ['http://en.wikipedia.org/wiki/Episkopi_Bay', etc.]

list_page_ids = []

for url in list_of_urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    # The first <script> in <head> sets mw.config, which includes wgArticleId
    script_content = soup.select_one("head > script:nth-of-type(1)").decode_contents()
    page_id = re.search(r'"wgArticleId":([0-9]+)', script_content).group(1)
    list_page_ids.append(page_id)
    time.sleep(3)  # be polite: pause between requests

print(list_page_ids)
Cattier answered 10/11, 2021 at 14:18 Comment(1)
As far as I can tell, this is the only answer that actually addresses the question that was asked.Weslee

An API call with action=query gives you the pageid of an article:

https://xx.wikipedia.org/w/api.php?action=query&format=json&titles=searched_title

Gives a JSON like :

{
    "batchcomplete": "",
    "query": {
        "pages": {
            "xxxx": {
                "pageid": xxxx,
                "ns": 0,
                "title": "searched_title"
            }
        }
    }
}
Cyrie answered 29/8, 2018 at 14:15 Comment(2)
What part of this answer addresses the fact that only the url is available?Histology
I never use this API again. This answer is probably outdated.Cyrie

© 2022 - 2024 — McMap. All rights reserved.