I need to convert HTML to plain text. My only requirement of formatting is to retain new lines in the plain text. New lines should be displayed not only in the case of <br>
but other tags, e.g. <tr/>
, </p>
leads to a new line too.
Sample HTML pages for testing are:
- http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html
- http://www.javadb.com/write-to-file-using-bufferedwriter
Note that these are only random URLs.
I have tried out various libraries (JSoup, Javax.swing, Apache utils) mentioned in the answers to this StackOverflow question to convert HTML to plain text.
Example using JSoup:
public class JSoupTest {
@Test
public void SimpleParse() {
try {
Document doc = Jsoup.connect("http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html").get();
System.out.print(doc.text());
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
Example with HTMLEditorKit:
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
public class Html2Text extends HTMLEditorKit.ParserCallback {
StringBuffer s;
public Html2Text() {}
public void parse(Reader in) throws IOException {
s = new StringBuffer();
ParserDelegator delegator = new ParserDelegator();
// the third parameter is TRUE to ignore charset directive
delegator.parse(in, this, Boolean.TRUE);
}
public void handleText(char[] text, int pos) {
s.append(text);
}
public String getText() {
return s.toString();
}
public static void main (String[] args) {
try {
// the HTML to convert
URL url = new URL("http://www.javadb.com/write-to-file-using-bufferedwriter");
URLConnection conn = url.openConnection();
BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
String inputLine;
String finalContents = "";
while ((inputLine = reader.readLine()) != null) {
finalContents += "\n" + inputLine.replace("<br", "\n<br");
}
BufferedWriter writer = new BufferedWriter(new FileWriter("samples/testHtml.html"));
writer.write(finalContents);
writer.close();
FileReader in = new FileReader("samples/testHtml.html");
Html2Text parser = new Html2Text();
parser.parse(in);
in.close();
System.out.println(parser.getText());
}
catch (Exception e) {
e.printStackTrace();
}
}
}