Summarizing a Wikipedia Article

I find myself having to learn new things all the time, and I've been trying to think of ways to expedite the process of learning new subjects. I thought it might be neat if I could write a program to parse a Wikipedia article and remove everything but the most valuable information.

I started by taking the Wikipedia article on PDFs and extracting the first 100 sentences. I gave each sentence a score based on how valuable I thought it was. I ended up creating a file following this format:

<sentence>
<value>
<sentence>
<value>
etc.
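
For reference, parsing a file in that format might look something like this minimal sketch (the filename is made up):

    # Minimal sketch of parsing the scored-sentence file described above.
    # Assumes strictly alternating sentence / numeric-value lines;
    # the filename is just a placeholder.
    def load_scored_sentences(path="scored_sentences.txt"):
        with open(path) as f:
            lines = [line.strip() for line in f if line.strip()]
        # Pair each sentence line with the value line that follows it.
        return [(s, float(v)) for s, v in zip(lines[0::2], lines[1::2])]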

I then parsed this file and attempted to find various functions that would correlate each sentence with the value I had given it. I've just begun learning about machine learning and statistics and whatnot, so I'm doing a lot of fumbling around here. This is my latest attempt: https://github.com/JesseAldridge/Wikipedia-Summarizer/blob/master/plot_sentences.py.

I tried a bunch of stuff that didn't seem to produce much of any correlation at all -- average word length, position in the article, etc. Pretty much the only thing that produced any sort of useful relationship was the length of the string (more specifically, counting the number of lowercase letter 'e's seemed to work best). But that seems kind of lame, because it seems obvious that longer sentences would be more likely to contain useful information.
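
To give a concrete idea of what I mean by "correlation", checking how well one candidate feature tracks the scores can be done with a simple Pearson correlation; this is a rough sketch, assuming numpy and the (sentence, value) pairs from the parsing sketch above:

    import numpy as np

    # Rough sketch: Pearson correlation between one candidate feature and
    # my hand-assigned values. `pairs` is the (sentence, value) list from
    # the parsing sketch above; the feature here is the count of 'e's.
    def feature_correlation(pairs, feature=lambda s: s.count("e")):
        xs = np.array([feature(sentence) for sentence, _ in pairs], dtype=float)
        ys = np.array([value for _, value in pairs], dtype=float)
        return np.corrcoef(xs, ys)[0, 1]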

At one point I thought I had found some interesting functions, but when I tried removing outliers (by only counting the inner quartiles), they turned out to produce worse results than simply returning 0 for every sentence. This got me wondering how many other things I might be doing wrong... I'm also wondering whether this is even a good way to approach this problem.
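
By "only counting the inner quartiles" I mean something like this sketch, which drops points whose feature value falls outside the 25th–75th percentile range:

    import numpy as np

    # Sketch of the outlier removal I tried: keep only the points whose
    # feature value lies between the 25th and 75th percentiles.
    # `xs` (feature values) and `ys` (scores) are numpy arrays.
    def inner_quartiles(xs, ys):
        q1, q3 = np.percentile(xs, [25, 75])
        keep = (xs >= q1) & (xs <= q3)
        return xs[keep], ys[keep]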

Do you think I'm on the right track? Or is this just a fool's errand? Are there any glaring deficiencies in the linked code? Does anyone know of a better way to approach the problem of summarizing a Wikipedia article? I'd rather have a quick and dirty solution than something perfect that takes a long time to put together. Any general advice would also be welcome.

Dissemblance answered 1/1, 2012 at 2:21 Comment(4)
Next, you'll want us to use newspeak to make the scanned article even shorter ;)Stateless
You are clearly too old. Leave this sort of thing to 16 year olds wired.com/gadgetlab/2011/12/summly-app-summarizationGadolinite
:) Summly looks cool. I can't run it on my ipod, but I can read the reviews. They were pretty mixed. I got the impression it doesn't work that well.Dissemblance
Am I wrong, or is the 16 year old using a neural network with a genetic algorithm mixed in? Simple and effective.Pompei

Considering that your question relates more to a research activity than to a programming problem, you should probably look at the scientific literature. There you will find published details of a number of algorithms that do exactly what you want. A Google search for "keyword summarization" finds the following:

Single document Summarization based on Clustering Coefficient and Transitivity Analysis

Multi-document Summarization for Query Answering E-learning System

Intelligent Email: Aiding Users with AI

If you read the above, then follow the references they contain, you will find a whole wealth of information. Certainly enough to build a functional application.
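
As a very rough illustration of the clustering-coefficient idea in the first paper (this is not the paper's actual algorithm, just a toy version), you could build a sentence-similarity graph and rank sentences by their local clustering coefficient, e.g. with the networkx package:

    import re
    import networkx as nx

    # Toy illustration only -- not the algorithm from the paper.
    # Sentences sharing enough words are linked; sentences sitting in
    # tightly connected neighbourhoods get a high clustering coefficient.
    def rank_by_clustering(sentences, min_shared_words=3):
        words = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]
        graph = nx.Graph()
        graph.add_nodes_from(range(len(sentences)))
        for i in range(len(sentences)):
            for j in range(i + 1, len(sentences)):
                if len(words[i] & words[j]) >= min_shared_words:
                    graph.add_edge(i, j)
        scores = nx.clustering(graph)
        return sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)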

Gadolinite answered 1/1, 2012 at 7:32 Comment(2)
Ok, so I've just gotta whip up a dependency graph based on syntactic dependency relation analysis and use a clustering coefficient to measure node connections. Then it's a simple matter of pulling out node triangles and using them to extract key sentences. ffs... so much for quick and dirty. Seriously though, thanks for the papers. That's probably the best info I'm gonna be able to get.Dissemblance
Hey ... you've just successfully summarized the keywords for that paper. Perhaps this is a job for a mechanical turk!Gadolinite

Just my two cents...

Whenever I'm browsing a new subject on Wikipedia, I typically perform a "breadth-first" search; I refuse to move on to another topic until I've scanned every link the page connects to that introduces a topic I'm not already familiar with. I read the first sentence of each paragraph, and if I see something in that article that appears to relate to the original topic, I repeat the process.

If I were to design the interface for a Wikipedia "summarizer", I would

  1. Always print the entire introductory paragraph.

  2. For the rest of the article, print any sentence that has a link in it.

    2a. Print any comma-separated lists of links as a bullet-pointed list.

  3. If the link to the article is "expanded", print the first paragraph for that article.

  4. If that introductory paragraph is expanded, repeat the listing of sentences with links.

This process could repeat indefinitely.
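
A rough sketch of rules 1 and 2, assuming the third-party requests and beautifulsoup4 packages and a deliberately naive sentence splitter (the remaining rules would build on the same idea):

    import re
    import requests
    from bs4 import BeautifulSoup

    # Rough sketch of rules 1 and 2 only; sentence splitting and link
    # matching are deliberately naive.
    def rough_summary(title):
        html = requests.get("https://en.wikipedia.org/wiki/" + title).text
        soup = BeautifulSoup(html, "html.parser")
        paragraphs = [p for p in soup.find_all("p") if p.get_text(strip=True)]
        if not paragraphs:
            return ""
        lines = [paragraphs[0].get_text()]      # rule 1: keep the intro paragraph
        for p in paragraphs[1:]:
            link_texts = [a.get_text() for a in p.find_all("a") if a.get_text(strip=True)]
            for sentence in re.split(r"(?<=[.!?])\s+", p.get_text()):
                if any(link in sentence for link in link_texts):
                    lines.append(sentence)      # rule 2: keep sentences containing a link
        return "\n".join(lines)

    print(rough_summary("PDF"))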

What I'm saying is that summarizing Wikipedia articles isn't the same as summarizing an article from a magazine or a post on a blog. The act of crawling is an important part of learning introductory concepts quickly via Wikipedia, and I feel it's for the best. Typically, the bottom half of an article is where the "citation needed" tags start popping up, but the first half of any given article is generally taken as given knowledge by the community.

Kinnikinnick answered 1/1, 2012 at 20:7 Comment(0)
