I download tika-core and tika-parser libraries, but I could not find the example codes to parse HTML documents to string. I have to get rid of all html tags of source of a web page. What can I do? How do I code that using Apache Tika?
How can I use the HTML parser with Apache Tika in Java to extract all HTML tags?
Asked Answered
take a look at the example it may help you blog.jeroenreijn.com/2010/04/… –
Nucleolated
Do you want a plain text version of a html file? If so, all you need is something like:
InputStream input = new FileInputStream("myfile.html");
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
new HtmlParser().parse(input, handler, metadata, new ParseContext());
String plainText = handler.toString();
The BodyContentHandler, when created with no constructor arguments or with a character limit, will capture the text (only) of the body of the html and return it to you.
You can also you Tika AutoDetectParser to parse any type of files such as HTML. Here is a simple example of that:
try {
InputStream input = new FileInputStream(new File(path));
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
parser.parse(input, textHandler, metadata, context);
System.out.println("Title: " + metadata.get(metadata.TITLE));
System.out.println("Body: " + textHandler.toString());
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} catch (SAXException e) {
e.printStackTrace();
} catch (TikaException e) {
e.printStackTrace();
}
© 2022 - 2024 — McMap. All rights reserved.