I've tried various methods to strip the license from Project Gutenberg texts, for use as a corpus for a language learning project, but I can't seem to come up with an unsupervised, reliable approach. The best heuristic I've found so far is stripping the first 28 lines and the last 398, which worked for a large number of the texts. Any suggestions for ways to strip the boilerplate automatically (it is very similar across many of the texts, but with slight differences in each case, and there are a few different templates as well), and for how to verify that it has been stripped accurately, would be very useful.
You weren't kidding. It's almost as if they were trying to make the job AI-complete. I can think of only two approaches, neither of them perfect.
1) Set up a script in, say, Perl, to tackle the most common patterns (e.g., look for the phrase "produced by", keep going down to the next blank line, and cut there), but put in lots of assertions about what's expected (e.g., the next text should be the title or author). That way, when a pattern fails, you'll know it. The first time a pattern fails, do it by hand. The second time, modify the script. (See the sketch after this list.)
2) Try Amazon's Mechanical Turk.
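A rough sketch of approach 1 in Perl might look like the following. The cut pattern and the sanity check are illustrative assumptions for one common template, not a tested general solution:

#!/usr/bin/perl
# Sketch of approach 1: cut everything from the top of the file
# through the blank line after the "Produced by" credit, and die
# loudly when the pattern or the sanity check fails so that file
# can be handled by hand.
use strict;
use warnings;

my $text = do { local $/; <> };    # slurp the whole etext

# cut from the start of the file through the blank line that
# follows a line beginning with "Produced by"
$text =~ s/\A.*?^Produced by.*?\n\s*\n//ms
    or die "no 'Produced by' credit found -- handle this file by hand\n";

# assertion: what remains should not still look like boilerplate
die "text after the cut still looks like Project Gutenberg boilerplate\n"
    if $text =~ /\A\s*(The )?Project Gutenberg/i;

print $text;

Each die tells you which file broke which assumption, which is exactly the signal you need for the "first time by hand, second time modify the script" loop.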
I've also wanted a tool to strip Project Gutenberg headers and footers for years, for playing with natural language processing without contaminating the analysis with boilerplate mixed in with the etext. After reading this question I finally pulled my finger out and wrote a Perl filter which you can pipe text through on its way into any other tool.
It's made as a state machine using per-line regexes. It's written to be easy to understand, since speed is not an issue with the typical size of etexts. So far it works on the couple dozen etexts I have here, but in the wild there are sure to be many more variations which will need to be added. Hopefully the code is clear enough that anybody can add to it:
#!/usr/bin/perl
# stripgutenberg.pl < in.txt > out.txt
#
# designed for piping
# Written by Andrew Dunbar (hippietrail), released into the public domain, Dec 2010

use strict;

my $debug = 0;
my $state = 'beginning';
my $print = 0;
my $printed = 0;

while (1) {
    $_ = <>;
    last unless $_;

    # strip UTF-8 BOM
    if ($. == 1 && index($_, "\xef\xbb\xbf") == 0) {
        $_ = substr($_, 3);
    }

    if ($state eq 'beginning') {
        if (/^(The Project Gutenberg [Ee]Book( of|,)|Project Gutenberg's )/) {
            $state = 'normal pg header';
            $debug && print "state: beginning -> normal pg header\n";
            $print = 0;
        } elsif (/^$/) {
            $state = 'beginning blanks';
            $debug && print "state: beginning -> beginning blanks\n";
        } else {
            die "unrecognized beginning: $_";
        }
    } elsif ($state eq 'normal pg header') {
        if (/^\*\*\*\ ?START OF TH(IS|E) PROJECT GUTENBERG EBOOK,? /) {
            $state = 'end of normal header';
            $debug && print "state: normal pg header -> end of normal pg header\n";
        } else {
            # body of normal pg header
        }
    } elsif ($state eq 'end of normal header') {
        if (/^(Produced by|Transcribed from)/) {
            $state = 'post header';
            $debug && print "state: end of normal pg header -> post header\n";
        } elsif (/^$/) {
            # blank lines
        } else {
            $state = 'etext body';
            $debug && print "state: end of normal header -> etext body\n";
            $print = 1;
        }
    } elsif ($state eq 'post header') {
        if (/^$/) {
            $state = 'blanks after post header';
            $debug && print "state: post header -> blanks after post header\n";
        } else {
            # multiline Produced / Transcribed
        }
    } elsif ($state eq 'blanks after post header') {
        if (/^$/) {
            # more blank lines
        } else {
            $state = 'etext body';
            $debug && print "state: blanks after post header -> etext body\n";
            $print = 1;
        }
    } elsif ($state eq 'beginning blanks') {
        if (/<!-- #INCLUDE virtual=\"\/include\/ga-books-texth\.html\" -->/) {
            $state = 'header include';
            $debug && print "state: beginning blanks -> header include\n";
        } elsif (/^Title: /) {
            $state = 'aus header';
            $debug && print "state: beginning blanks -> aus header\n";
        } elsif (/^$/) {
            # more blanks
        } else {
            die "unexpected stuff after beginning blanks: $_";
        }
    } elsif ($state eq 'header include') {
        if (/^$/) {
            # blanks after header include
        } else {
            $state = 'aus header';
            $debug && print "state: header include -> aus header\n";
        }
    } elsif ($state eq 'aus header') {
        if (/^To contact Project Gutenberg of Australia go to http:\/\/gutenberg\.net\.au$/) {
            $state = 'end of aus header';
            $debug && print "state: aus header -> end of aus header\n";
        } elsif (/^A Project Gutenberg of Australia eBook$/) {
            $state = 'end of aus header';
            $debug && print "state: aus header -> end of aus header\n";
        }
    } elsif ($state eq 'end of aus header') {
        if (/^((Title|Author): .*)?$/) {
            # title, author, or blank line
        } else {
            $state = 'etext body';
            $debug && print "state: end of aus header -> etext body\n";
            $print = 1;
        }
    } elsif ($state eq 'etext body') {
        # here's the stuff
        if (/^<!-- #INCLUDE virtual="\/include\/ga-books-textf\.html" -->$/) {
            $state = 'footer';
            $debug && print "state: etext body -> footer\n";
            $print = 0;
        } elsif (/^(\*\*\* ?)?end of (the )?project/i) {
            $state = 'footer';
            $debug && print "state: etext body -> footer\n";
            $print = 0;
        }
    } elsif ($state eq 'footer') {
        # nothing more of interest
    } else {
        die "unknown state '$state'";
    }

    if ($print) {
        print;
        ++$printed;
    } else {
        $debug && print "## $_";
    }
}
The gutenbergr package in R seems to do an ok job of removing headers, including junk after the 'official' end of the header.
First you'll need to install R/RStudio, then:
install.packages('gutenbergr')
library(gutenbergr)
t <- gutenberg_download('25519') # give it the id number of the text
The strip argument of gutenberg_download() is TRUE by default. You will also probably want to remove illustrations:
library(data.table)
t <- as.data.table(t) # I hate tibbles -- datatables are easier to work with
head(t) # get the column names
# filter out lines that are illustrations (the remaining lines are joined with spaces below)
# the \\[ matches a literal [ character; the \\ is used to 'escape' the special [ character
# !like() keeps rows where the text column does not match the search string
no_il <- t[!like(text, '\\[Illustration'), 'text']
# collapse the text into a single character string
t_cln <- do.call(paste, c(no_il, collapse = ' '))
There's also the gutenberg package in Python, which is now archived and hard to install, as well as the gutenberg_cleaner Python package, which doesn't seem to work that well.
I am also trying to figure out a way to clean Project Gutenberg text files for text analysis purposes, but I use Julia, and I am probably just trying to reinvent the wheel. So I wonder if it is possible to summarize the ideas/rules for cleaning Project Gutenberg files so that anyone can implement them in any language, because I have found various nice programs but no general solution on the internet.

So far, I have found that the end of every text file seems to be well marked by a standard line similar to "*** END OF THE PROJECT GUTENBERG EBOOK ...". However, the situation is different for finding the start of the actual text, for which there seems to be no standard mark (in some cases there is no "*** ..." marker line at all). However, metadata like title, author, etc. are written in a standard way, for example "Title: ...", so I am trying to exploit that information. One possibility is to find the last line where the title appears (within the first few dozen lines); after that title comes the "real text". A sketch of these rules is below. I will try to keep this answer updated.
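Here is a minimal sketch of those two rules in Perl (matching the filter above, rather than Julia). The marker regexes and the 100-line window for the title fallback are assumptions of mine, not a spec:

#!/usr/bin/perl
# Rules sketched here:
# 1. the footer starts at a line like "*** END OF THE PROJECT GUTENBERG EBOOK ..."
# 2. the body starts after the "*** START OF ..." marker when there is one,
#    otherwise after the last early line that repeats the "Title:" value.
use strict;
use warnings;

my @lines = <>;
my ($start, $end) = (0, $#lines);

# find the metadata title, e.g. "Title: Moby Dick", in the first 100 lines
my ($title) = map { defined($_) && /^Title:\s*(.+?)\s*$/ ? $1 : () } @lines[0 .. 99];

for my $i (0 .. $#lines) {
    if ($lines[$i] =~ /^\*{3,}\s*START OF/i) {
        $start = $i + 1;                           # rule 2, marker form
    } elsif ($lines[$i] =~ /^\*{3,}\s*END OF/i) {
        $end = $i - 1;                             # rule 1
        last;
    } elsif (defined $title && $i < 100 && index($lines[$i], $title) >= 0) {
        $start = $i + 1 if $i + 1 > $start;        # rule 2, title fallback
    }
}
print @lines[$start .. $end];

The title fallback deliberately takes the last early line containing the title, since the title usually appears once in the metadata block and once again as the heading of the actual text.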