Extract the text from URLs using Tika
Is it possible to extract text from URLs with Tika? Any links would be appreciated. Or is Tika usable only for PDF, Word, and other document formats?
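This is from lucid:
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
PDFParser parser = new PDFParser();
// resourceLocation is the path of a local PDF; try-with-resources closes the stream even if parsing fails
try (InputStream input = new FileInputStream(new File(resourceLocation))) {
    parser.parse(input, textHandler, metadata, new ParseContext());
}
System.out.println("Title: " + metadata.get("title"));
System.out.println("Author: " + metadata.get("Author"));
System.out.println("content: " + textHandler.toString());
Instead of creating a PDFParser, you can use Tika's AutoDetectParser to automatically process different types of files:
Parser parser = new AutoDetectParser();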
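The AutoDetectParser suggestion above also covers the original question, because the parser only needs an InputStream and does not care whether it comes from a file or a URL. The following is a minimal sketch of that idea, not code from the answer: the class name UrlBodyText is hypothetical, and it assumes the Tika core and parser jars are on the classpath.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper class; the name is not from the original answers.
final class UrlBodyText {

    /** Streams the resource at the given URL through Tika and returns its body text. */
    static String extract(URL url) throws IOException, SAXException, TikaException {
        // -1 disables BodyContentHandler's default write limit of 100,000 characters.
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        Parser parser = new AutoDetectParser();
        // url.openStream() replaces the FileInputStream used for local files.
        try (InputStream input = url.openStream()) {
            parser.parse(input, handler, metadata, new ParseContext());
        }
        return handler.toString();
    }
}
```

Calling UrlBodyText.extract(new URL("http://url.here")) would return the page text as a String. The same method accepts file: URLs, which is convenient for trying it out without network access.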
Check the documentation: yes, you can.
Example:
java -jar tika-app-0.9.jar -t https://mcmap.net/q/1418367/-extract-the-text-from-urls-using-tika
This prints the text of this page (-t is short for --text).
And what if I need to do this from Java code and save the text from the URL to a text file? Is that also possible? I am not using Maven. –
Loidaloin
The description of how to use Tika with Ant is just below the description of how to use it with Maven, and just above the instructions for the command-line tool. If you need some inspiration on how to embed it, I'm certain there's info on the website, and there's always the source of the command-line tool as well. –
Quar
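For the embedding asked about in the comments, one option is Tika's facade class, org.apache.tika.Tika, whose parseToString(URL) method fetches and parses in a single call. The sketch below is an illustration rather than anything posted in the thread: the class name SaveUrlText and its save method are hypothetical. Without Maven, it should be enough to put the tika-app jar on the compile and run classpath, since that jar bundles the parsers.

```java
import java.io.IOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical helper class; the name is not from the original answers.
final class SaveUrlText {

    /** Extracts the text behind the URL and writes it to the target file as UTF-8. */
    static void save(URL url, Path target) throws IOException, TikaException {
        Tika tika = new Tika();
        // The facade truncates output at 100,000 characters by default; -1 lifts the limit.
        tika.setMaxStringLength(-1);
        String text = tika.parseToString(url);
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }
}
```

A call such as SaveUrlText.save(new URL("http://url.here"), Paths.get("output.txt")) would then leave the extracted text in output.txt.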
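Yes, you can do that. Here is the code; it uses Apache HttpClient to fetch the URL.
import java.io.FileWriter;
import java.io.InputStream;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
HttpGet httpget = new HttpGet("http://url.here");
// DefaultHttpClient is deprecated since HttpClient 4.3; it is kept here to match the original answer
HttpClient client = new DefaultHttpClient();
HttpResponse response = client.execute(httpget);
HttpEntity entity = response.getEntity();
if (entity != null) {
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    // Close the response stream even if parsing fails
    try (InputStream instream = entity.getContent()) {
        parser.parse(instream, handler, metadata, new ParseContext());
    }
    String plainText = handler.toString();
    try (FileWriter writer = new FileWriter("/scratch/cache/output.txt")) {
        writer.write(plainText);
    }
    System.out.println("done");
}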
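To extract content from a URL rather than from a local file, use this code:
// content is an already-fetched response object (not shown in this answer) whose getContent() returns the raw bytes
byte[] raw = content.getContent();
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
Parser parser = new AutoDetectParser();
parser.parse(new ByteArrayInputStream(raw), handler, metadata, new ParseContext());
// LOG is the surrounding class's logger, also not shown
LOG.info("content: " + handler.toString());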
You can also use TikaInputStream.get(byte[]) to build the InputStream –
Precis
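The TikaInputStream.get(byte[]) suggestion in the comment above could look roughly like the following. This is a sketch under assumptions: the class name BytesBodyText is hypothetical, the byte array stands in for whatever the fetch step returned, and the Tika core and parser jars are expected on the classpath.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper class; the name is not from the original answers.
final class BytesBodyText {

    /** Parses already-downloaded bytes and returns the extracted body text. */
    static String extract(byte[] raw) throws IOException, SAXException, TikaException {
        // -1 disables BodyContentHandler's default write limit of 100,000 characters.
        BodyContentHandler handler = new BodyContentHandler(-1);
        // TikaInputStream.get(byte[]) wraps the array in place of a plain ByteArrayInputStream.
        try (InputStream input = TikaInputStream.get(raw)) {
            new AutoDetectParser().parse(input, handler, new Metadata(), new ParseContext());
        }
        return handler.toString();
    }
}
```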