How to strip headers/footers from Project Gutenberg texts?

I've tried various methods to strip the license from Project Gutenberg texts, for use as a corpus for a language-learning project, but I can't come up with an unsupervised, reliable approach. The best heuristic I've found so far is stripping the first twenty-eight lines and the last 398, which works for a large number of the texts. Any suggestions for ways to strip the text automatically (the boilerplate is very similar across texts, but with slight differences in each case, and a few different templates as well), and for how to verify that a text has been stripped accurately, would be very useful.

Medardas answered 12/8, 2009 at 22:48 Comment(12)
I don't think you ought to be stripping that information out. At the very least, if you do remove it, please give readers of the text a way to see the licensing information in a linked document of some sort. Still, please reconsider removing this information.Aurie
There are two reasons to remove it: 1) It skews the data for its intended purpose, which is not immediate human consumption; for example, "project" or "the" would end up in the vocabulary for, say, Swedish. 2) By the terms of the Project Gutenberg license, you have to pay 20% royalties for any commercial usage, which is ridiculous for public domain texts. I don't mind donating to support a project I've taken advantage of, but a startup can't handle 20% royalties on its main source of data.Medardas
It makes no sense to maintain that information in a language learning corpus. It damages the stochastics to include it, and provides no benefit to humans who will never see the corpus itself.Backman
Oh, also: if I can come up with an accurate enough way to do this, I would be glad to make the code/texts available in a machine-readable form so others can do the same. This could preserve all the license text for humans but still mark the boundaries for natural-language-processing code. Personally, I think the license is more than slightly ridiculous. Why make public domain texts restricted?Medardas
Project Gutenberg says you may freely use the text, provided you remove all reference to Project Gutenberg, so there's no ethical problem.Irreformable
Maybe you could post a couple of examples of the first 20 lines, so that we could see what kind of variation you're talking about.Irreformable
Yep; It's just a matter of how to do that for thousands of texts without having to do it all manually. Merely deleting all instances of "Project Gutenberg" still leaves lots of license text remaining.Medardas
Good idea, Beta. The majority of the variation is (a sometimes shortened version of) the title being included in various slots in the beginning license text. Then, in many texts, there are lines for author, title, etc. Following that there is generally a line crediting who proofread it, and then it's entirely non-uniform. Here are some examples: gutenberg.org/files/29568/29568-8.txt gutenberg.org/files/17835/17835-0.txt gutenberg.org/dirs/etext03/cnmmm11.txt gutenberg.org/files/1658/1658.txt gutenberg.org/files/17489/17489-8.txtMedardas
You'll notice that, looking at two or three examples, you can start to find a pattern, but there are many examples of every individual pattern being broken, requiring a more sophisticated approach than a regexp, or even a few regexps.Medardas
I looked at three examples from the "Top 100" list at Gutenberg, and they all have the header end with a line like *** START OF THIS PROJECT GUTENBERG EBOOK THUS SPAKE ZARATHUSTRA *** I assume that pattern does not hold or you wouldn't be asking this question...Mcbroom
Nope, one of the examples I listed above doesn't follow that convention. It's one of the most reliable traits, but there are counter examples.Medardas
Interestingly, as @Irreformable mentions re. the public domain-ness after stripping the license, some people are actually charging $4 for Project Gutenberg e-books on Amazon. Quite unethical. daemonsbooks.com/2010/12/01/…Sorghum

You weren't kidding. It's almost as if they were trying to make the job AI-complete. I can think of only two approaches, neither of them perfect.

1) Set up a script in, say, Perl, to tackle the most common patterns (e.g., look for the phrase "produced by", keep going down to the next blank line and cut there), but put in lots of assertions about what's expected (e.g. the next text should be the title or author). That way, when a pattern fails, you'll know it. The first time a pattern fails, do it by hand. The second time, modify the script. (A rough sketch follows below.)

2) Try Amazon's Mechanical Turk.
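
A minimal sketch of approach 1, for illustration only: the phrases matched, the 200-line sanity limit, and the script name are assumptions you would tune per corpus, not a complete solution.

#!/usr/bin/perl
# strip_header_sketch.pl < in.txt > out.txt
# Rough illustration of approach 1: cut everything up to the end of the
# "Produced by" / "Transcribed from" credit, and die loudly (the
# "assertions") whenever a file breaks the expected pattern.
use strict;
use warnings;

my @lines = <>;            # slurp the whole etext from stdin
my $start;

# find the credit line, then advance to the blank line that ends it
for my $i (0 .. $#lines) {
    if ($lines[$i] =~ /^(Produced by|Transcribed from)/i) {
        $start = $i;
        $start++ while $start <= $#lines && $lines[$start] !~ /^\s*$/;
        last;
    }
}

# assertion: no pattern matched -- flag the file for manual handling
die "no 'Produced by' line found -- do this one by hand\n"
    unless defined $start;

# assertion: the header should not swallow a suspicious amount of the file
die "header looks too long (" . ($start + 1) . " lines) -- check manually\n"
    if $start > 200;

print @lines[$start + 1 .. $#lines];

Each time it dies on a new file, handle that file by hand the first time and teach the script the new pattern the second time, as described above.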

Irreformable answered 13/8, 2009 at 17:50 Comment(1)
I wish it didn't come down to methods like this, but I think you're probably right. I'll update this question if I find a better way.Medardas

I've also wanted a tool to strip Project Gutenberg headers and footers for years, for playing with natural language processing without contaminating the analysis with boilerplate mixed in with the etext. After reading this question I finally pulled my finger out and wrote a Perl filter that you can pipe text through on its way into any other tool.

It's made as a state machine using per-line regexes. It's written to be easy to understand, since speed is not an issue with the typical size of etexts. So far it works on the couple dozen etexts I have here, but in the wild there are sure to be many more variations that will need to be added. Hopefully the code is clear enough that anybody can add to it:


#!/usr/bin/perl

# stripgutenberg.pl < in.txt > out.txt
#
# designed for piping
# Written by Andrew Dunbar (hippietrail), released into the public domain, Dec 2010

use strict;
use warnings;

my $debug = 0;

my $state = 'beginning';
my $print = 0;
my $printed = 0;

while (<>) {

    # strip UTF-8 BOM
    if ($. == 1 && index($_, "\xef\xbb\xbf") == 0) {
        $_ = substr($_, 3);
    }

    if ($state eq 'beginning') {
        if (/^(The Project Gutenberg [Ee]Book( of|,)|Project Gutenberg's )/) {
            $state = 'normal pg header';
            $debug && print "state: beginning -> normal pg header\n";
            $print = 0;
        } elsif (/^$/) {
            $state = 'beginning blanks';
            $debug && print "state: beginning -> beginning blanks\n";
        } else {
            die "unrecognized beginning: $_";
        }
    } elsif ($state eq 'normal pg header') {
        if (/^\*\*\*\ ?START OF TH(IS|E) PROJECT GUTENBERG EBOOK,? /) {
            $state = 'end of normal header';
            $debug && print "state: normal pg header -> end of normal pg header\n";
        } else {
            # body of normal pg header
        }
    } elsif ($state eq 'end of normal header') {
        if (/^(Produced by|Transcribed from)/) {
            $state = 'post header';
            $debug && print "state: end of normal pg header -> post header\n";
        } elsif (/^$/) {
            # blank lines
        } else {
            $state = 'etext body';
            $debug && print "state: end of normal header -> etext body\n";
            $print = 1;
        }
    } elsif ($state eq 'post header') {
        if (/^$/) {
            $state = 'blanks after post header';
            $debug && print "state: post header -> blanks after post header\n";
        } else {
            # multiline Produced / Transcribed
        }
    } elsif ($state eq 'blanks after post header') {
        if (/^$/) {
            # more blank lines
        } else {
            $state = 'etext body';
            $debug && print "state: blanks after post header -> etext body\n";
            $print = 1;
        }
    } elsif ($state eq 'beginning blanks') {
        if (/<!-- #INCLUDE virtual=\"\/include\/ga-books-texth\.html\" -->/) {
            $state = 'header include';
            $debug && print "state: beginning blanks -> header include\n";
        } elsif (/^Title: /) {
            $state = 'aus header';
            $debug && print "state: beginning blanks -> aus header\n";
        } elsif (/^$/) {
            # more blanks
        } else {
            die "unexpected stuff after beginning blanks: $_";
        }
    } elsif ($state eq 'header include') {
        if (/^$/) {
            # blanks after header include
        } else {
            $state = 'aus header';
            $debug && print "state: header include -> aus header\n";
        }
    } elsif ($state eq 'aus header') {
        if (/^To contact Project Gutenberg of Australia go to http:\/\/gutenberg\.net\.au$/) {
            $state = 'end of aus header';
            $debug && print "state: aus header -> end of aus header\n";
        } elsif (/^A Project Gutenberg of Australia eBook$/) {
            $state = 'end of aus header';
            $debug && print "state: aus header -> end of aus header\n";
        }
    } elsif ($state eq 'end of aus header') {
        if (/^((Title|Author): .*)?$/) {
            # title, author, or blank line
        } else {
            $state = 'etext body';
            $debug && print "state: end of aus header -> etext body\n";
            $print = 1;
        }
    } elsif ($state eq 'etext body') {
        # here's the stuff
        if (/^<!-- #INCLUDE virtual="\/include\/ga-books-textf\.html" -->$/) {
            $state = 'footer';
            $debug && print "state: etext body -> footer\n";
            $print = 0;
        } elsif (/^(\*\*\* ?)?end of (the )?project/i) {
            $state = 'footer';
            $debug && print "state: etext body -> footer\n";
            $print = 0;
        }
    } elsif ($state eq 'footer') {
        # nothing more of interest
    } else {
        die "unknown state '$state'";
    }

    if ($print) {
        print;
        ++$printed;
    } else {
        $debug && print "## $_";
    }
}
Exurbia answered 3/12, 2010 at 4:6 Comment(1)
I've put this code up as a gist on github: gist.github.com/751921 - please feel free to watch it for updates or fork it with your own improvements.Exurbia

The gutenbergr package in R seems to do an ok job of removing headers, including junk after the 'official' end of the header.

First you'll need to install R/RStudio, then

install.packages('gutenbergr')
library(gutenbergr)
t <- gutenberg_download('25519')  # give it the id number of the text

The strip_headers argument defaults to TRUE. You will also probably want to remove illustrations:

library(data.table)
t <- as.data.table(t)  # I hate tibbles -- datatables are easier to work with
head(t)  # get the column names

# filter out lines that are illustrations
# \\[ matches a literal [ character (the backslashes 'escape' the special [ )
# !like() keeps rows where the text column does not match the search string
no_il <- t[!like(text, '\\[Illustration'), 'text']
# collapse the remaining lines into a single character string, joined by spaces
t_cln <- do.call(paste, c(no_il, collapse = ' '))

There's also the gutenberg package in Python, which is now archived and hard to install, as well as the gutenberg_cleaner Python package, which doesn't seem to work that well.

Stalker answered 1/11, 2017 at 18:48 Comment(0)

I am also trying to figure out a way to clean Project Gutenberg text files for text-analysis purposes, but I use Julia, and I am probably just trying to reinvent the wheel. So I wonder if it is possible to summarize the ideas/rules for cleaning Project Gutenberg files so that anyone can implement them in any language, because I have found various nice programs but no general solution on the internet.

So far, I have found that the end of every text file seems to be well marked by a standard line similar to "*** END OF THE PROJECT GUTENBERG EBOOK ...". However, the situation is different for finding the start of the actual text, for which there seems to be no standard mark (in some cases there is no "*** ..." marker line at all). However, metadata like title, author, etc. are written in a standard way, for example "Title: ...", so I am trying to exploit that information. One possibility is to find the last line where the title appears (within the first few dozen lines); the "real text" starts after that occurrence of the title. I will try to keep this answer updated.
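
Here is a rough sketch of that heuristic in Perl (matching the other code in this thread; a Julia version would follow the same steps). The end-marker regex, the 300-line search window, and the reliance on a "Title:" metadata line are assumptions, and some files will still defeat them.

#!/usr/bin/perl
# title_heuristic_sketch.pl < in.txt > out.txt
# Sketch of the heuristic described above: cut the footer at the
# "*** END OF THE/THIS PROJECT GUTENBERG ..." marker, and treat the last
# occurrence of the title near the top of the file as the end of the header.
use strict;
use warnings;

my @lines = <>;

# footer: stop at the first "*** END OF THE/THIS PROJECT GUTENBERG" line
my $end = $#lines;
for my $i (0 .. $#lines) {
    if ($lines[$i] =~ /^\s*\*{3}\s*END OF TH(E|IS) PROJECT GUTENBERG/i) {
        $end = $i - 1;
        last;
    }
}

# header: read the title from the "Title:" metadata line, then find the last
# line within the first 300 lines that repeats it; the body starts after that
my $window = $#lines < 300 ? $#lines : 300;
my ($title) = map { /^Title:\s*(.+?)\s*$/ ? $1 : () } @lines[0 .. $window];
die "no 'Title:' metadata line found -- handle this file by hand\n"
    unless defined $title;

my $start = 0;
for my $i (0 .. $window) {
    $start = $i + 1 if index(lc $lines[$i], lc $title) >= 0;
}

print @lines[$start .. $end];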

Weightless answered 9/8, 2023 at 21:14 Comment(0)