Extract the text from URLs using TIKA

Is it possible to extract text from URLs with Tika? Any links would be appreciated. Or is Tika only usable for PDF, Word, and other media documents?

Loidaloin answered 11/7, 2011 at 21:30 Comment(0)

This is from Lucid:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

// resourceLocation is the path of the file you want to parse
InputStream input = new FileInputStream(new File(resourceLocation));
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
PDFParser parser = new PDFParser();
parser.parse(input, textHandler, metadata);
input.close();
System.out.println("Title: " + metadata.get("title"));
System.out.println("Author: " + metadata.get("Author"));
System.out.println("content: " + textHandler.toString());

Instead of creating a PDFParser, you can use Tika's AutoDetectParser to automatically process different types of files:

Parser parser = new AutoDetectParser();
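
On newer Tika releases the parse call also takes a ParseContext. A minimal sketch of the same flow with AutoDetectParser, using a placeholder file name:

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

// AutoDetectParser figures out the document type, so the same code
// works for PDF, Word, HTML and other formats.
InputStream input = new FileInputStream("some-document.pdf");
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
Parser parser = new AutoDetectParser();
parser.parse(input, textHandler, metadata, new ParseContext());
input.close();
System.out.println("content: " + textHandler.toString());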
Supat answered 23/8, 2011 at 19:54 Comment(0)

Check the documentation: yes, you can.

Example

java -jar tika-app-0.9.jar -t https://mcmap.net/q/1418367/-extract-the-text-from-urls-using-tika

will show you the text on this page.

Quar answered 11/7, 2011 at 21:40 Comment(2)
And if I need to use this from Java code and save the text from the URL to a text file, is that also possible? I am not using Maven; I want to do this from Java code. - Loidaloin
The description of how to use Tika with Ant is just below the description of how to use it with Maven, and just above the instructions for the command-line tool. If you need some inspiration on how to embed it, I'm certain there's info on the website, and there's always the source of the command-line tool as well. - Quar
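
Regarding the comment above about doing this from Java code and saving the text to a file: Tika's Tika facade class can fetch and parse a URL directly. A minimal sketch, with a placeholder URL and output path:

import java.io.FileWriter;
import java.net.URL;
import org.apache.tika.Tika;

public class UrlToTextFile {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // Fetches the URL, auto-detects the content type and returns plain text.
        String text = tika.parseToString(new URL("http://example.com/"));
        FileWriter writer = new FileWriter("output.txt");
        writer.write(text);
        writer.close();
    }
}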

Yes, you can do that. Here is the code; it uses Apache HttpClient:

import java.io.FileWriter;
import java.io.InputStream;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

// Fetch the URL with HttpClient, then feed the response stream to Tika.
HttpGet httpget = new HttpGet("http://url.here");
HttpClient client = new DefaultHttpClient();
HttpResponse response = client.execute(httpget);
HttpEntity entity = response.getEntity();
if (entity != null) {
    InputStream instream = entity.getContent();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    parser.parse(instream, handler, metadata, new ParseContext());
    instream.close();
    String plainText = handler.toString();
    FileWriter writer = new FileWriter("/scratch/cache/output.txt");
    writer.write(plainText);
    writer.close();
    System.out.println("done");
}
Martineau answered 25/3, 2012 at 20:40 Comment(0)

To extract content from a URL rather than from a local file, use this code:

// 'content' is assumed to wrap the raw bytes already fetched from the URL;
// 'LOG' is an application logger.
byte[] raw = content.getContent();
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
Parser parser = new AutoDetectParser();
parser.parse(new ByteArrayInputStream(raw), handler, metadata, new ParseContext());
LOG.info("content: " + handler.toString());
Pastorale answered 14/2, 2012 at 6:52 Comment(1)
You can also use TikaInputStream.get(byte[]) to build the InputStream. - Precis
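
A minimal sketch of that TikaInputStream variant, assuming the bytes have already been downloaded from the URL (the class and method names here are just for illustration):

import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class BytesToText {
    // 'raw' is assumed to hold the bytes already downloaded from the URL.
    static String extractText(byte[] raw) throws Exception {
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser();
        // TikaInputStream wraps the byte array and can spool it to a temporary
        // file if a parser needs random access.
        TikaInputStream stream = TikaInputStream.get(raw);
        try {
            parser.parse(stream, handler, metadata, new ParseContext());
        } finally {
            stream.close();
        }
        return handler.toString();
    }
}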
