Using boilerpipe to extract non-english articles
C

6

6

I am trying to use the boilerpipe Java library to extract news articles from a set of websites. It works great for English text, but in text with special characters, for example words with accent marks (história), these characters are not extracted correctly. I think it is an encoding problem.

The boilerpipe FAQ says "If you extract non-English text you might need to change some parameters" and then refers to a paper. I found no solution in that paper.

My question is: are there any parameters in boilerpipe where I can specify the encoding? Is there any workaround to get the text extracted correctly?

How I'm using the library (first attempt, based on the URL):

URL url = new URL(link);
String article = ArticleExtractor.INSTANCE.getText(url);

(second attempt, based on the HTML source code):

String article = ArticleExtractor.INSTANCE.getText(html_page_as_string);
Crescen answered 13/2, 2012 at 11:51 Comment(0)
C
1

OK, I found a solution. As Andrei said, I had to change the class HTMLFetcher, which is in the package de.l3s.boilerpipe.sax. What I did was convert all the fetched text to UTF-8. At the end of the fetch function, I had to add two lines and change the last one:

final byte[] data = bos.toByteArray(); // stays the same
byte[] utf8 = new String(data, cs.name()).getBytes("UTF-8"); // new line: decode with the detected charset, re-encode as UTF-8 (name() is the canonical charset name; displayName() is a localized label)
cs = Charset.forName("UTF-8"); // new line: set the charset to UTF-8
return new HTMLDocument(utf8, cs); // edited line
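
For anyone who wants to try the re-encoding step on its own, here is a minimal sketch outside of boilerpipe. The class name Utf8Transcoder and the method toUtf8 are made up for illustration; the sketch assumes you already know the charset the bytes were fetched in.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf8Transcoder {

    /** Re-encodes bytes from the given source charset to UTF-8 (hypothetical helper). */
    public static byte[] toUtf8(byte[] data, Charset source) {
        // Decode with the charset the bytes were actually written in...
        String decoded = new String(data, source);
        // ...then encode the resulting text as UTF-8.
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "história" as ISO-8859-1 bytes; the escape avoids depending on the source file encoding.
        byte[] latin1 = "hist\u00f3ria".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(latin1.length + " bytes in, " + utf8.length + " bytes out");
    }
}
```

This only helps if the source charset is right: decoding with the wrong charset and then re-encoding just bakes the garbled characters into the UTF-8 output.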
Crescen answered 6/3, 2012 at 15:31 Comment(0)
T
2

You don't have to modify inner Boilerpipe classes.

Just pass InputSource object to the ArticleExtractor.INSTANCE.getText() method and force encoding on that object. For example:

URL url = new URL("http://some-page-with-utf8-encodeing.tld");

InputSource is = new InputSource();
is.setEncoding("UTF-8");
is.setByteStream(url.openStream());

String text = ArticleExtractor.INSTANCE.getText(is);
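
If forcing the encoding on the InputSource does not work for you, another option is to decode the bytes yourself and hand boilerpipe a String, using the getText(String) overload shown in the question. A rough sketch of the decoding part follows; the class ExplicitDecode and the method readAll are illustrative names, and the byte array stands in for url.openStream().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitDecode {

    /** Reads the whole stream and decodes it with the given charset (hypothetical helper). */
    public static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            bos.write(buf, 0, n);
        }
        return new String(bos.toByteArray(), charset);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for url.openStream(): a tiny UTF-8 encoded page.
        byte[] page = "<p>hist\u00f3ria</p>".getBytes(StandardCharsets.UTF_8);
        String html = readAll(new ByteArrayInputStream(page), StandardCharsets.UTF_8);
        System.out.println(html.contains("hist\u00f3ria"));
        // With boilerpipe on the classpath, the decoded string could then be
        // passed to ArticleExtractor.INSTANCE.getText(html).
    }
}
```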

Regards!

Tremendous answered 5/6, 2012 at 12:31 Comment(1)
First, sorry for taking so long to comment on your answer, and thank you for giving it. Unfortunately it is not working for me. I just tried it, and all the letters with accent marks become '?' when I print the extracted article. I will stay with the previous solution for now. Crescen
T
1

Well, from what I see, when you use it like that, the library chooses the encoding automatically. From the HTMLFetcher source:

public static HTMLDocument fetch(final URL url) throws IOException {
    final URLConnection conn = url.openConnection();
    final String ct = conn.getContentType();

    Charset cs = Charset.forName("Cp1252");
    if (ct != null) {
        Matcher m = PAT_CHARSET.matcher(ct);
        if(m.find()) {
            final String charset = m.group(1);
            try {
                cs = Charset.forName(charset);
            } catch (UnsupportedCharsetException e) {
                // keep default
            }
        }
    }

Try debugging their code a bit, starting with ArticleExtractor.getText(URL), and see if you can override the encoding.
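
To see which charset a given Content-Type header would lead to, the same idea can be sketched in isolation. This is not boilerpipe's code: the pattern CHARSET_PARAM and the method charsetOf are my own illustrative stand-ins for PAT_CHARSET and the logic above, and the fallback charset is passed in rather than hard-coded.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentTypeCharset {

    // Assumed pattern for the charset parameter; boilerpipe's own PAT_CHARSET may differ.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*\"?([^;\"\\s]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in the Content-Type header, or the fallback if none is usable. */
    public static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback; // keep default, as in the excerpt above
        }
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("text/html", StandardCharsets.ISO_8859_1));
    }
}
```

If the server sends no charset parameter, the fallback wins, which is one way a UTF-8 page can end up decoded with a single-byte charset.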

Toast answered 13/2, 2012 at 12:7 Comment(2)
Thank you for your answer. I'm sorry for only giving attention to it now, but I have been stuck in another project. I tried printing the encoding that was set on the variable cs after this chunk of code, and the result was always ISO-8859-1. I also tried to force the encoding to be UTF-8, but got no better results. The problem must be in one of the conversions (to HTMLDocument, to TextDocument, etc.), but I'm having some trouble printing their text content. Any ideas? Thanks again. Crescen
Andrei, you were right. I was overcomplicating things, but in the end it was a very simple solution. Thanks again, and I'm sorry I can't upvote you yet. Crescen
D
1

Boilerpipe's ArticleExtractor uses some algorithms that have been specifically tailored to English, such as measuring the number of words in average phrases. In any language that is more or less verbose than English (i.e., every other language), these algorithms will be less accurate.

Additionally, the library uses some English phrases to try to find the end of the article ("comments", "post a comment", "have your say", etc.), which will clearly not work in other languages.

This is not to say that the library will outright fail - just be aware that some modification is likely needed for good results in non-English languages.

Deviltry answered 7/2, 2014 at 14:37 Comment(0)
A
1

Java:

import java.net.URL;

import org.xml.sax.InputSource;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class Boilerpipe {

    public static void main(String[] args) {
        try{
            URL url = new URL("http://www.azeri.ru/az/traditions/kuraj_pehlevanov/");

            InputSource is = new InputSource();
            is.setEncoding("UTF-8");
            is.setByteStream(url.openStream());

            String text = ArticleExtractor.INSTANCE.getText(is);
            System.out.println(text);
        }catch(Exception e){
            e.printStackTrace();
        }
    }

}

Eclipse: Run > Run Configurations > Common Tab. Set Encoding to Other(UTF-8), then click Run.
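
The console setting matters because the text can be extracted correctly and still be printed as '?' if the output encoding cannot represent the characters. Here is a small sketch of that effect, with made-up names (ConsoleEncoding, roundTrip) and US-ASCII standing in for a console encoding that lacks accented letters:

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class ConsoleEncoding {

    /** Encodes the text in the named charset and decodes it again (hypothetical helper). */
    public static String roundTrip(String text, String charsetName) throws UnsupportedEncodingException {
        byte[] bytes = text.getBytes(charsetName);
        return new String(bytes, charsetName);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String word = "hist\u00f3ria";
        // US-ASCII cannot represent the accented letter, so it is replaced by '?'.
        System.out.println(roundTrip(word, "US-ASCII"));
        // Writing through a PrintStream with an explicit encoding avoids relying
        // on the platform default; the console itself must still display UTF-8.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(word);
    }
}
```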


Aubergine answered 27/7, 2014 at 19:25 Comment(0)
L
0

I had the same problem; cnr's solution works great. Just change the UTF-8 encoding to ISO-8859-1. Thanks.

URL url = new URL("http://some-page-with-utf8-encodeing.tld");
InputSource is = new InputSource();
is.setEncoding("ISO-8859-1");
is.setByteStream(url.openStream());

String text = ArticleExtractor.INSTANCE.getText(is);
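
Since some pages need UTF-8 and others ISO-8859-1, one possible approach is to try a strict UTF-8 decode first and fall back only when the bytes are not valid UTF-8. This is just a sketch with illustrative names (CharsetGuess, decodeUtf8OrFallback), not something boilerpipe does for you, and it assumes you have the raw page bytes in hand.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class CharsetGuess {

    /** Decodes as UTF-8 if the bytes are valid UTF-8, otherwise with the fallback charset. */
    public static String decodeUtf8OrFallback(byte[] data, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: assume the single-byte fallback charset.
            return new String(data, fallback);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = "hist\u00f3ria".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = "hist\u00f3ria".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeUtf8OrFallback(latin1, StandardCharsets.ISO_8859_1));
        System.out.println(decodeUtf8OrFallback(utf8, StandardCharsets.ISO_8859_1));
    }
}
```

The decoded string could then be passed to ArticleExtractor.INSTANCE.getText(String). Pure ASCII input decodes the same either way, so the guess only matters for pages that actually contain accented characters.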
Lucilius answered 2/6, 2013 at 18:9 Comment(0)
