How can I get Wikipedia content using Wikipedia's API?

12

81

I want to get the first paragraph of a Wikipedia article.

What is the API query to do so?

Nicotinism answered 25/8, 2011 at 4:50 Comment(0)
80

See this section of the MediaWiki API documentation, specifically the part about getting the contents of a page.

Use the sandbox to test the API call.

These are the key parameters.

prop=revisions&rvprop=content&rvsection=0

rvsection=0 specifies that only the lead (zeroth) section should be returned.

See this example.

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=pizza

To get the HTML, you can similarly use action=parse.

Note that you'll have to strip out any templates or infoboxes.

Edit: If you want to extract the plain text (without wikilinks, etc.), you can use the TextExtracts API. Use the parameters it provides to adjust your output.

https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exlimit=1&titles=pizza&explaintext=1&exsectionformat=plain
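For example, a minimal Python sketch of the same TextExtracts call (assuming the requests library is installed; the parameters mirror the URL above) looks like this:

import requests

# Ask the TextExtracts API for a plain-text extract of the "pizza" article.
params = {
    "action": "query",
    "prop": "extracts",
    "exlimit": 1,
    "explaintext": 1,
    "exsectionformat": "plain",
    "titles": "pizza",
    "format": "json",
    # "exintro": 1,  # uncomment to restrict the extract to the lead section
}
response = requests.get("https://en.wikipedia.org/w/api.php", params=params)

# The result is keyed by page ID, so iterate over the page objects.
for page in response.json()["query"]["pages"].values():
    print(page["extract"])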
Freesia answered 25/8, 2011 at 5:0 Comment(3)
do I have to send action=parse query after getting the value of that?Nicotinism
I want to get a clean text of it, should I write parser by myself? or there are some API query to do so? Thank YouNicotinism
You can use the text extract api. edited my answer above. mediawiki.org/wiki/…Freesia
42

See "Is there a Wikipedia API just for retrieve the content summary?" for other proposed solutions. Here is one that I suggested:

There is actually a very nice prop called extracts that can be used with queries designed specifically for this purpose. Extracts allows you to get article extracts (truncated article text). There is a parameter called exintro that can be used to retrieve the text in the zeroth section (without additional assets such as images or infoboxes). You can also retrieve extracts with finer granularity, such as by a certain number of characters (exchars) or by a certain number of sentences (exsentences).

Here is a sample query: http://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow

And here is the API sandbox, where you can experiment more with this query: http://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow

Please note that if you want the first paragraph specifically, you still need to extract the first <p> tag yourself. However, in this API call there are no additional assets such as images to parse. If you are satisfied with this introduction summary, you can retrieve the text by running a function like PHP's strip_tags, which removes the HTML tags.
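As a rough illustration, here is a hedged Python sketch (using the requests library and a simple regex in place of PHP's strip_tags) that fetches the exintro extract and pulls out the first paragraph:

import re
import requests

params = {
    "action": "query",
    "prop": "extracts",
    "exintro": 1,  # only the content before the first section heading
    "titles": "Stack Overflow",
    "format": "json",
}
data = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()

for page in data["query"]["pages"].values():
    html = page["extract"]
    # Take the first <p>...</p> block, then strip the remaining tags
    # (a crude stand-in for strip_tags; a real HTML parser is more robust).
    match = re.search(r"<p[^>]*>.*?</p>", html, re.DOTALL)
    first_paragraph = match.group(0) if match else html
    print(re.sub(r"<[^>]+>", "", first_paragraph).strip())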

Pilau answered 29/8, 2013 at 8:2 Comment(1)
Thank you for this answer, the extract thing was very useful.Northern
25

I do it this way:

https://en.wikipedia.org/w/api.php?action=opensearch&search=bee&limit=1&format=json

The response you get is an array with the data, easy to parse:

[
  "bee",
  [
    "Bee"
  ],
  [
    "Bees are flying insects closely related to wasps and ants, known for their role in pollination and, in the case of the best-known bee species, the European honey bee, for producing honey and beeswax."
  ],
  [
    "https://en.wikipedia.org/wiki/Bee"
  ]
]

To get just the first paragraph, limit=1 is what you need.
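For reference, a small Python sketch (assuming the requests library) that reads the four-part opensearch response could look like this:

import requests

params = {
    "action": "opensearch",
    "search": "bee",
    "limit": 1,
    "format": "json",
}
data = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()

# The response is [search term, [titles], [descriptions], [urls]].
term, titles, descriptions, urls = data
print(descriptions[0] if descriptions else "")  # may be empty; see the comments below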

Elderberry answered 27/5, 2016 at 16:19 Comment(3)
It's strange that this method doesn't work anymore. It doesn't give me any descriptionFerrate
same here, this method is returning no description, any reason why?Hornwort
Also seeing this endpoint no longer providing any description...Coeducation
9

To GET the first paragraph of an article:

https://en.wikipedia.org/w/api.php?action=query&titles=Belgrade&prop=extracts&format=json&exintro=1

I have created a short Wikipedia API doc for my own needs. It has working examples of how to get articles, images, and similar.

Capybara answered 27/1, 2020 at 9:41 Comment(0)
5

If you need to do this for a large number of articles, then instead of querying the website directly, consider downloading a Wikipedia database dump and then accessing it through an API such as JWPL.

Allo answered 4/8, 2012 at 15:9 Comment(0)
5

You can get the introduction of an article on Wikipedia by querying a URL such as https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles=java. You just need to parse the JSON response; the result is plain text that has already been cleaned, including the removal of links and references.

Ruction answered 17/12, 2015 at 3:31 Comment(0)
4
<script>
    // Fetch the plain-text introduction of an article via the TextExtracts
    // API and write it into the element with id="Label11".
    function dowiki(place) {
        var URL = 'https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=';

        URL += "&titles=" + encodeURIComponent(place);
        URL += "&callback=?"; // JSONP callback so the cross-origin request works
        $.getJSON(URL, function (data) {
            // The result is keyed by page ID, so take the first page object.
            var obj = data.query.pages;
            var ob = Object.keys(obj)[0];
            console.log(obj[ob]["extract"]);
            try {
                document.getElementById('Label11').textContent = obj[ob]["extract"];
            }
            catch (err) {
                document.getElementById('Label11').textContent = err.message;
            }
        });
    }
</script>
Triviality answered 15/11, 2016 at 21:54 Comment(2)
consider adding a bit of textual description to your answer :) (i.e. what does it bring compared to others)Faradism
An explanation would be in order. Please respond by editing (changing) your answer, not here in comments (without "Edit:", "Update:", or similar - the answer should appear as if it was written today).Chief
2

You can download the Wikipedia database directly and parse all pages to XML with Wiki Parser, which is a standalone application. The first paragraph is a separate node in the resulting XML.

Alternatively, you can extract the first paragraph from its plain-text output.

Tacy answered 29/1, 2015 at 16:33 Comment(0)
2

You can use jQuery to do that. First, create the URL with the appropriate parameters. Check this link to understand what the parameters mean. Then use the $.ajax() method to retrieve the articles. Note that Wikipedia does not allow cross-origin requests; that's why we use dataType: 'jsonp' in the request.

var wikiURL = "https://en.wikipedia.org/w/api.php";
wikiURL += '?' + $.param({
    'action' : 'opensearch',        // the search endpoint
    'search' : 'your_search_term',  // term to look up
    'format' : 'json',
    'limit'  : 10                   // number of results to return
});

$.ajax({
    url: wikiURL,
    dataType: 'jsonp',
    success: function(data) {
        console.log(data);
    }
});
Mafia answered 26/5, 2017 at 0:14 Comment(0)
2

You can use the extract_html field of the summary REST endpoint for this: e.g. https://en.wikipedia.org/api/rest_v1/page/summary/Cat.

Note: this aims to simplify the content a bit by removing most of the pronunciations, which mainly appear in parentheses in some cases.
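For what it's worth, a minimal Python sketch (assuming the requests library) that reads this endpoint might be:

import requests

# Wikimedia asks API clients to send a descriptive User-Agent header.
url = "https://en.wikipedia.org/api/rest_v1/page/summary/Cat"
summary = requests.get(url, headers={"User-Agent": "example-script/0.1"}).json()

print(summary["extract"])       # plain-text summary
print(summary["extract_html"])  # the same summary wrapped in HTML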

Swot answered 19/4, 2018 at 16:48 Comment(1)
This should def be the top answer. Super simpleDissemble
1

Suppose keyword = "Batman" (the term you want to search for). Use:

https://en.wikipedia.org/w/api.php?action=parse&page={{keyword}}&format=json&prop=text&section=0

to get the summary/first paragraph from Wikipedia in JSON format.
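As an illustration, a short Python sketch (assuming the requests library) that reads the parsed HTML of section 0 could be:

import requests

keyword = "Batman"  # term you want to search
params = {
    "action": "parse",
    "page": keyword,
    "prop": "text",
    "section": 0,
    "format": "json",
}
data = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()

# The lead section comes back as HTML under parse -> text -> "*".
lead_html = data["parse"]["text"]["*"]
print(lead_html[:500])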

Gastrula answered 25/2, 2019 at 11:19 Comment(0)
1

Here is a program that will dump the French and English Wiktionary and Wikipedia:

import sys
import asyncio
import urllib.parse
from uuid import uuid4

import httpx
import found
from found import nstore
from found import bstore
from loguru import logger as log

try:
    import ujson as json
except ImportError:
    import json


# XXX: https://github.com/Delgan/loguru
log.debug("That's it, beautiful and simple logging!")


async def get(http, url, params=None):
    response = await http.get(url, params=params)
    if response.status_code == 200:
        return response.content

    log.error("http get failed with url and response: {} {}", url, response)
    return None



def make_timestamper():
    import time
    start_monotonic = time.monotonic()
    start = time.time()
    loop = asyncio.get_event_loop()

    def timestamp():
        # Wanna be faster than datetime.now().timestamp()
        # approximation of current epoch time.
        out = start + loop.time() - start_monotonic
        out = int(out)
        return out

    return timestamp


async def wikimedia_titles(http, wiki="https://en.wikipedia.org/"):
    log.debug('Started generating asynchronously wiki titles at {}', wiki)
    # XXX: https://www.mediawiki.org/wiki/API:Allpages#Python
    url = "{}/w/api.php".format(wiki)
    params = {
        "action": "query",
        "format": "json",
        "list": "allpages",
        "apfilterredir": "nonredirects",
        "apfrom": "",
    }

    while True:
        content = await get(http, url, params=params)
        if content is None:
            continue
        content = json.loads(content)

        for page in content["query"]["allpages"]:
            yield page["title"]
        try:
            apcontinue = content['continue']['apcontinue']
        except KeyError:
            return
        else:
            params["apfrom"] = apcontinue


async def wikimedia_html(http, wiki="https://en.wikipedia.org/", title="Apple"):
    # e.g. https://en.wikipedia.org/api/rest_v1/page/html/Apple
    url = "{}/api/rest_v1/page/html/{}".format(wiki, urllib.parse.quote(title))
    out = await get(http, url)
    return wiki, title, out


async def save(tx, data, blob, doc):
    uid = uuid4()
    doc['html'] = await bstore.get_or_create(tx, blob, doc['html'])

    for key, value in doc.items():
        nstore.add(tx, data, uid, key, value)

    return uid


WIKIS = (
    "https://en.wikipedia.org/",
    "https://fr.wikipedia.org/",
    "https://en.wiktionary.org/",
    "https://fr.wiktionary.org/",
)

async def chunks(iterable, size):
    # chunk async generator https://stackoverflow.com/a/22045226
    while True:
        out = list()
        for _ in range(size):
            try:
                item = await iterable.__anext__()
            except StopAsyncIteration:
                yield out
                return
            else:
                out.append(item)
        yield out


async def main():
    # logging
    log.remove()
    log.add(sys.stderr, enqueue=True)

    # singleton
    timestamper = make_timestamper()
    database = await found.open()
    data = nstore.make('data', ('sourcery-data',), 3)
    blob = bstore.make('blob', ('sourcery-blob',))

    async with httpx.AsyncClient() as http:
        for wiki in WIKIS:
            log.info('Getting started with wiki at {}', wiki)
            # Polite limit @ https://en.wikipedia.org/api/rest_v1/
            async for chunk in chunks(wikimedia_titles(http, wiki), 200):
                log.info('iterate')
                coroutines = (wikimedia_html(http, wiki, title) for title in chunk)
                items = await asyncio.gather(*coroutines, return_exceptions=True)
                for item in items:
                    if isinstance(item, Exception):
                        msg = "Failed to fetch html on `{}` with `{}`"
                        log.error(msg, wiki, item)
                        continue
                    wiki, title, html = item
                    if html is None:
                        continue
                    log.debug(
                        "Fetch `{}` at `{}` with length {}",
                        title,
                        wiki,
                        len(html)
                    )

                    doc = dict(
                        wiki=wiki,
                        title=title,
                        html=html,
                        timestamp=timestamper(),
                    )

                    await found.transactional(database, save, data, blob, doc)


if __name__ == "__main__":
    asyncio.run(main())

Another approach to acquiring Wikimedia data is to rely on Kiwix ZIM dumps.

Vergievergil answered 14/8, 2021 at 15:33 Comment(0)
