Getting the first paragraph (text only) of a Wikipedia article does not return the desired result

I'm trying to retrieve the first paragraph of text of a Wikipedia article (Unix in this example), but the API returns output I don't want.

From what I've read in the Wikipedia API documentation and here on Stack Overflow, this is the request URL to call:

http://en.wikipedia.org/w/api.php?format=php&action=query&titles=unix&redirects=1&prop=revisions&rvprop=content&rvsection=0&rvlimit=1

My expected output is:

Unix (officially trademarked as UNIX, sometimes also written as Unix in small caps) is a multitasking, multi-user computer operating system originally developed in 1969 by a group of AT&T employees at Bell Labs, including Ken Thompson, Dennis Ritchie, Brian Kernighan, Douglas McIlroy, Michael Lesk and Joe Ossanna.[1] The Unix operating system was first developed in assembly language, but by 1973 had been almost entirely recoded in C, greatly facilitating its further development and porting to other hardware. Today's Unix system evolution is split into various branches, developed over time by AT&T as well as various commercial vendors, universities (such as University of California, Berkeley's BSD), and non-profit organizations.

My current result:

{{Use dmy dates|date=August 2012}}
{{Infobox OS
|name               = Unix
|logo               = 
|screenshot         = [[File:Unix history-simple.svg|250px]]
|caption            = Evolution of Unix and Unix-like systems
|website            = [http://www.unix.org unix.org]
|developer          = [[Ken Thompson (computer programmer)|Ken Thompson]], [[Dennis Ritchie]], [[Brian Kernighan]], [[Douglas McIlroy]], and [[Joe Ossanna]] at [[Bell Labs]]
|source_model       = Historically [[Closed source software|closed source]], now some Unix projects ([[Berkeley Software Distribution|BSD]] family and [[Illumos]]) are [[open source]]d.
|frequently_updated = yes <!-- Release version update? Don't edit this page, just click on the version number! -->
|programmed_in      = [[C (programming language)|C]] 
|kernel_type        = [[Monolithic Kernel|Monolithic]]
|ui                 = [[Command-line interface]] & [[Graphical user interface|Graphical]] ([[X Window System]])
|language           = English 
|family             = Unix
|released           = {{start date and age|df=yes|1969}}
|license            = [[Proprietary software|Proprietary]]
|working_state      = Current 
}}

'''Unix''' (officially trademarked as '''UNIX''', sometimes also written as '''<span style="font-variant: small-caps;">Unix</span>''' in small caps) is a [[Computer multitasking|multitasking]], [[multi-user]] computer [[operating system]] originally developed in 1969 by a group of [[American Telephone & Telegraph|AT&T]] employees at [[Bell Labs]], including [[Ken Thompson]], [[Dennis Ritchie]], [[Brian Kernighan]], [[Douglas McIlroy]], [[Michael Lesk]] and [[Joe Ossanna]].<ref name=" Ritchie">{{cite journal
  | last = Ritchie
  | first = D.M.
  | authorlink = 
  | coauthors = Thompson, K.
  | title = The UNIX Time-Sharing System
  | journal = Bell System Tech. J.
  | volume = 57
  | issue = 6
  | pages = 1905-1929
  | publisher = American Tel. & Tel.
  | location = USA
  | date = July 1978
  | url = http://www.alcatel-lucent.com/bstj/vol57-1978/articles/bstj57-6-1905.pdf
  | issn = 
  | doi = 
  | id = 
  | accessdate = December 9, 2012}}</ref>  The Unix operating system was first developed in [[assembly language]], but by 1973 had been almost entirely recoded in [[C (programming language)|C]], greatly facilitating its further development and [[Software portability|porting]] to other hardware. Today's Unix system evolution is split into various branches, developed over time by AT&T as well as various commercial vendors, universities (such as [[University of California, Berkeley]]'s [[BSD]]), and [[non-profit]] organizations.

[[The Open Group]], an industry standards consortium, owns the UNIX trademark. Only systems fully compliant with and certified according to the [[Single UNIX Specification]] are qualified to use the trademark; others might be called ''Unix system-like'' or ''[[Unix-like]]'', although the Open Group disapproves<ref>[http://www.unix.org/questions_answers/faq.html#7a  What is a "Unix-like" operating system?] Unix.org FAQ</ref> of this term.  However, the term ''Unix'' is often used informally to denote any operating system that closely resembles the trademarked system.

During the late 1970s and early 1980s, the influence of Unix in academic circles led to large-scale adoption of Unix (particularly of the [[Berkeley Software Distribution|BSD]] variant, originating from the [[University of California, Berkeley]]) by commercial startups, the most notable of which are [[Solaris (operating system)|Solaris]], [[HP-UX]], [[Sequent Computer Systems|Sequent]], and [[AIX operating system|AIX]], as well as [[Darwin (operating system)|Darwin]], which forms the core set of components upon which [[Apple Inc.|Apple]]'s [[OS X]], [[Apple TV]], and [[IOS (Apple)|iOS]] are based.<ref>{{cite web|url=http://marketshare.hitslink.com/operating-system-market-share.aspx?qprid=8&qpcustomd=0 |title=Operating system market share |publisher=Marketshare.hitslink.com |date= |accessdate=2012-08-22}}</ref><ref>{{cite web|url=http://developer.apple.com/library/mac/#documentation/MacOSX/Conceptual/OSX_Technology_Overview/SystemTechnology/SystemTechnology.html#//apple_ref/doc/uid/TP40001067-CH207-BCICAIFJ |title=Loading |publisher=Developer.apple.com |date= |accessdate=2012-08-22}}</ref> Today, in addition to certified Unix systems such as those already mentioned, [[Unix-like]] operating systems such as [[MINIX]], [[Linux]], and [[BSD]] descendants ([[FreeBSD]], [[NetBSD]], [[OpenBSD]], and [[DragonFly BSD]]) are commonly encountered. The term ''traditional Unix'' may be used to describe an operating system that has the characteristics of either [[Version 7 Unix]] or [[UNIX System V]]."

What is the correct way of retrieving an article?

Thanks in advance!

Leatherworker answered 10/12, 2012 at 18:42 Comment(5)
There is no “only text” in Wikipedia. You can either get wikitext (which is what you got) or HTML. - Poulson
Is there no way to get rid of the infobox? I have HTML now, but it contains a table. I can't find anything in the api.php documentation. - Leatherworker
Important question! - Mima
Hi, I was looking for exactly the same solution. Did you figure it out? Please share. - Indecisive
Infoboxes are at a much higher level than the API. The API is generic at the software level, and the exact same software runs many other kinds of sites, including Wiktionary, Wikivoyage, and many language editions, which may or may not use infoboxes. Infoboxes are made from MediaWiki templates, which are basically a macro system that allows a small amount of wikitext to generate a larger amount of content. You could learn those templates and use that knowledge to parse and edit out the bits of wikitext you don't want. Much the same applies to the HTML as you convert it to plain text. - Destitution
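
To illustrate the last comment, here is a minimal sketch of editing templates out of wikitext by counting brace depth. The function names strip_templates and first_paragraph are made up for this example, and the approach is an assumption rather than anything the API provides: it only removes {{...}} calls (including nested ones) and leaves links, bold markup and <ref> tags in place.

```python
def strip_templates(wikitext):
    """Remove {{...}} template calls, including nested ones (hypothetical helper)."""
    out = []
    depth = 0  # how many templates we are currently inside
    i = 0
    while i < len(wikitext):
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            i += 2
        else:
            # Keep a character only when it is outside every template.
            if depth == 0:
                out.append(wikitext[i])
            i += 1
    return "".join(out)


def first_paragraph(wikitext):
    """Return the first non-empty block of text once templates are removed."""
    for block in strip_templates(wikitext).split("\n\n"):
        if block.strip():
            return block.strip()
    return ""
```

On the wikitext from the question this would drop the "Use dmy dates" and "Infobox OS" templates, but the result would still contain [[links]], '''bold''' markers and the emptied <ref> tag, so further cleanup is needed for truly plain text.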

If you want plain text only, use TextExtracts: http://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&explaintext=1&titles=Unix

This would produce: Unix is a multitasking, multi-user computer operating system that exists in many variants. The original Unix was developed at AT&T's Bell Labs research center by Ken Thompson, Dennis Ritchie, and others. From the power user's or programmer's perspective, Unix systems are characterized by a modular design that is sometimes called the "Unix philosophy," meaning the OS provides a set of simple tools that each perform a limited, well-defined function, with a unified filesystem as the main means of communication and a shell scripting and command language to combine the tools to perform complex workflows.
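
A minimal Python sketch of the same request, using only the standard library. It adds exintro=1 so that only the lead section comes back. The helper names and the User-Agent string are placeholders invented for this example, and the response layout (query, then pages, then extract) is what the default JSON format is assumed to return.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_intro_params(title):
    """Query parameters for a plain-text extract of the lead section."""
    return {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,  # plain text instead of HTML
        "exintro": 1,      # only the content before the first heading
        "redirects": 1,    # follow redirects to the target article
        "titles": title,
    }


def extract_from_response(response):
    """Return the extract of the first page in a decoded response, or None."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        return page.get("extract")
    return None


def first_paragraph(text):
    """Return the first non-empty line of a plain-text extract."""
    for line in text.split("\n"):
        if line.strip():
            return line.strip()
    return ""


def fetch_intro(title):
    """Fetch the plain-text lead section of an article (performs a network call)."""
    url = API_URL + "?" + urllib.parse.urlencode(build_intro_params(title))
    # Placeholder User-Agent; replace it with something that identifies your script.
    request = urllib.request.Request(
        url, headers={"User-Agent": "intro-example/0.1 (placeholder)"}
    )
    with urllib.request.urlopen(request, timeout=10) as reply:
        return extract_from_response(json.load(reply))
```

Calling first_paragraph(fetch_intro("Unix")) should then give just the opening paragraph, assuming paragraphs in the extract are separated by newlines. A missing page has no extract, so fetch_intro returns None in that case and should be checked before use.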

Alisun answered 22/4, 2014 at 6:32 Comment(1)
I know this answer is a bit dated, but still… A very useful argument for the OP could be exintro=1: en.wikipedia.org/w/… - Trochlear

I've been handling this in Python. The first task is to get the desired text; after that, you need to parse the HTML and remove all the extraneous information.

The following function will get you the nth text section (n=0 returns the abstract):

import requests

def getWikiSection(topic, n):
    # Request the rendered HTML of section n of the page via action=parse.
    params = {'action': 'parse', 'page': topic, 'format': 'json', 'prop': 'text', 'section': n}
    json_response = requests.get('http://en.wikipedia.org/w/api.php', params=params).json()
    # An invalid topic or section yields an 'error' object instead of 'parse'.
    if 'error' in json_response:
        print(json_response['error']['info'])
        return None
    return stripTags(json_response['parse']['text']['*'])

A quick walkthrough: first, we build the request for the given topic; then, we grab the JSON response; if we've queried for an invalid section or topic (i.e., no page exists for that topic or we've gone beyond the length of the page), we print the error; otherwise, we clean the response.

Cleaning the response is handled by the 'stripTags' function on the last line, which removes HTML tags. Here it is:

from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Python 3's HTMLParser must be initialised through its own constructor.
        super().__init__()
        self.fed = []

    def handle_data(self, d):
        # Called for each run of text between tags.
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

def stripTags(html):
    s = MLStripper()
    s.feed(html)
    s.close()
    return s.get_data()

Of course, this can be extended to parse the text however you like. For example, I removed the citations as follows:

import re

def removeReferences(s):
    return re.sub(r'\[[0-9]+\]', '', s)
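
Stripping every tag keeps the infobox text as well, which is the table the asker wanted to get rid of. As one possible extension, here is a sketch that collects only text inside <p> elements and ignores anything nested in a <table>. The class and function names are invented for this example, and it assumes the infobox is rendered as a table, which is not guaranteed for every page.

```python
import re
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect text found inside <p> elements, skipping tables such as infoboxes."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._current = None   # text pieces of the <p> being read, or None
        self._table_depth = 0  # greater than 0 while inside a <table>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table_depth += 1
        elif tag == "p" and self._table_depth == 0:
            self._current = []

    def handle_endtag(self, tag):
        if tag == "table" and self._table_depth > 0:
            self._table_depth -= 1
        elif tag == "p" and self._current is not None:
            text = "".join(self._current).strip()
            if text:  # ignore empty placeholder paragraphs
                self.paragraphs.append(text)
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)


def first_paragraph_from_html(html):
    """Return the first non-empty paragraph without [n] citation markers, or None."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    if not parser.paragraphs:
        return None
    return re.sub(r"\[[0-9]+\]", "", parser.paragraphs[0])
```

It could be applied to the raw HTML from the parse API (the value under 'text' and '*') instead of the tag-stripping step, so that the table rows never reach the output.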

Hope this helps.

Berthaberthe answered 31/5, 2013 at 11:49 Comment(0)

I would check out the MobileFrontend extension; it will at least give you a few <p> elements to work with. Using http://en.wikipedia.org/w/api.php?format=json&action=mobileview&page=Unix&sections=0&prop=text|sections gives

{"mobileview":{"sections":[{"id":0,"text":"\nUnix\n\n
\nEvolution of Unix and Unix-like systems\n\n\n\nCompany / developer\nKen Thompson, Dennis Ritchie, Brian Kernighan, Douglas McIlroy, and Joe Ossanna at Bell Labs\n\n\nProgrammed in\nC and Assembly language\n\n\nOS family\nUnix\n\n\nWorking state\nCurrent\n\n\nSource model\nHistorically closed source, now some Unix projects (BSD family and Illumos) are open sourced.\n\n\nInitial release\nApril 20, 1969; 44 years ago (April 20, 1969)\n\n\nAvailable language(s)\nEnglish\n\n\nKernel type\nMonolithic\n\n\nDefault user interface\nCommand-line interface & Graphical (X Window System)\n\n\nLicense\nProprietary\n\n\nOfficial website\nhttp://www.unix.org\">unix.org\n\n\n

Unix (officially trademarked as UNIX, sometimes also written as Unix in small caps) is a multitasking, multi-user computer operating system originally developed in 1969 by a group of

(snip)

You'd have to take that and parse it some other way (perl, bash, whatever) but at that point you might as well ditch the API and go for some curl or wget action, which would make it easy.

Unfetter answered 23/7, 2013 at 22:19 Comment(0)

Try this URL.

(Note: pagename is the page that you are requesting)

But this returns a lot of extraneous HTML. You could filter the HTML by doing this:

// `entry` (the page title) and `link` (an element appended after the text) must be defined elsewhere.
$.getJSON("http://en.wikipedia.org/w/api.php?"+"action=parse&format=json&prop=text&section=all&page=" + encodeURIComponent(entry) + "&redirects&callback=?", function(data)
    {
        if (!data.error)
        {
            var markup = data.parse.text["*"];
            if (typeof markup !== "undefined")
            {
                $("#entry").text(entry).show();
                var blurb = $('<div id="articleText"></div>').html(markup);

                // replace links with their contents, as they will not work
                blurb.find('a').each(function() { $(this).replaceWith($(this).html()); });

                // remove any references
                blurb.find('sup').remove();

                // remove cite errors
                blurb.find('.mw-ext-cite-error').remove();

                // keep only the paragraphs
                $('#article').html(blurb.find('p'));

                $("#article").append(link);
            }
        }
    });

You can read more here

Milomilon answered 29/9, 2014 at 11:26 Comment(0)
