How to index a pdf file in Elasticsearch 5.0.0 with ingest-attachment plugin?

I'm new to Elasticsearch and I read here https://www.elastic.co/guide/en/elasticsearch/plugins/master/mapper-attachments.html that the mapper-attachments plugin is deprecated in Elasticsearch 5.0.0.

I am now trying to index a PDF file with the new ingest-attachment plugin and upload the attachment.

What I've tried so far is

curl -H 'Content-Type: application/pdf' -XPOST localhost:9200/test/1 -d @/cygdrive/c/test/test.pdf

but I get the following error:

{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"failed to parse"}],"type":"mapper_parsing_exception","reason":"failed to parse","caused_by":{"type":"not_x_content_exception","reason":"Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes"}},"status":400}

I would expect the PDF file to be indexed and uploaded. What am I doing wrong?

I also tested Elasticsearch 2.3.3, but the mapper-attachments plugin is not valid for this version, and I don't want to use any older version of Elasticsearch.

Talbot answered 16/6, 2016 at 13:52 Comment(0)

You need to make sure you have created your ingest pipeline with:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
      }
    }
  ]
}
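Since the question uses curl rather than the Kibana console, here is a rough curl equivalent of the pipeline request above. It is only a sketch: the helper names `pipeline_body` and `create_pipeline` are made up for illustration, and the host `localhost:9200` is taken from the question.

```shell
#!/bin/sh
# Print the pipeline definition shown above as a JSON document.
pipeline_body() {
  cat <<'EOF'
{
  "description": "Extract attachment information",
  "processors": [
    { "attachment": { "field": "data", "indexed_chars": -1 } }
  ]
}
EOF
}

# Hypothetical helper: send the definition to a local node.
# Assumes Elasticsearch is listening on localhost:9200.
create_pipeline() {
  pipeline_body |
    curl -s -H 'Content-Type: application/json' \
      -XPUT 'localhost:9200/_ingest/pipeline/attachment' \
      --data-binary @-
}
```

Calling `create_pipeline` once is enough; the pipeline is stored in the cluster and can then be referenced by name.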

Then send a PUT (rather than a POST) to your index, referencing the pipeline you created:

PUT my_index/my_type/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}

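In your example, it should be something like the following. The request body has to be a JSON document whose data field holds the base64-encoded content of the PDF; sending the raw PDF file as the body is what triggers the not_x_content_exception. Here my_type is just a placeholder type name:

curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/test/my_type/1?pipeline=attachment' -d '{"data": "<base64-encoded content of test.pdf>"}'

Remember that the PDF content must be base64 encoded.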
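A minimal sketch of the base64 step, assuming a POSIX shell with `base64`, `tr` and `curl` available. The helper names `make_attachment_doc` and `index_pdf`, as well as the index, type and id passed to them, are illustrative assumptions and not part of the plugin.

```shell
#!/bin/sh
# Build the JSON body the "attachment" pipeline expects:
#   {"data": "<base64 of the file>"}
make_attachment_doc() {
  # base64 may wrap its output across lines; strip the newlines so the
  # value is a single valid JSON string. The base64 alphabet needs no
  # further JSON escaping.
  printf '{"data": "%s"}' "$(base64 < "$1" | tr -d '\n')"
}

# Hypothetical helper: index one file through the pipeline.
# usage: index_pdf /path/to/file.pdf <index> <type> <id>
# Assumes Elasticsearch on localhost:9200 and an existing pipeline
# named "attachment".
index_pdf() {
  make_attachment_doc "$1" |
    curl -s -H 'Content-Type: application/json' \
      -XPUT "localhost:9200/$2/$3/$4?pipeline=attachment" \
      --data-binary @-
}
```

For the file from the question this would be called as `index_pdf /cygdrive/c/test/test.pdf test my_type 1`. The body is piped to curl on stdin, so the encoded PDF never has to fit on the command line.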

Hope it will help you.

Edit 1 Please make sure to read these; they helped me a lot:

Elastic Ingest

Ingest Plugin

Ingest Presentation

Edit 2

Also, you must have the ingest-attachment plugin installed.

./bin/elasticsearch-plugin install ingest-attachment

Edit 3

Before you create your ingest pipeline (with the attachment processor), create your index and map the fields you will use. Make sure your documents have the data field (the same name as the "field" in your attachment processor), so that ingest can process it and extract your PDF content.

I set the indexed_chars option to -1 in the attachment processor so that large PDF files are indexed in full; by default only the first 100000 characters are extracted.

Edit 4

The mapping should be something like this:

PUT my_index
{ 
    "mappings" : { 
        "my_type" : { 
            "properties" : { 
                "attachment.content" : { 
                    "type": "text", 
                    "analyzer" : "brazilian" 
                } 
            } 
        } 
    } 
}

In this case, I use the brazilian analyzer, but you can remove it or use your own.
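Once a document has gone through the pipeline, the attachment processor stores the extracted text under attachment.content by default, so a full-text search can target that field. The following is only a sketch: the helper names `search_body` and `search_pdf_text` are made up, the index name `my_index` comes from the mapping example, and the search term is assumed to contain no characters that need JSON escaping.

```shell
#!/bin/sh
# Build a simple match query against the extracted text.
# Assumption: "$1" contains no quotes or backslashes (no JSON escaping is done).
search_body() {
  printf '{"query": {"match": {"attachment.content": "%s"}}}' "$1"
}

# Hypothetical helper: run the query against a local node.
# Assumes Elasticsearch on localhost:9200 and an index named my_index.
search_pdf_text() {
  search_body "$1" |
    curl -s -H 'Content-Type: application/json' \
      -XPOST 'localhost:9200/my_index/_search' \
      --data-binary @-
}
```

For example, `search_pdf_text lorem` would look for documents whose extracted text matches "lorem".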

Coronal answered 30/10, 2016 at 21:54 Comment(18)
Why do you need a mapping for the data field? Doesn't the pipeline pick up the data field and process it without it having to be explicitly mapped? What would this mapping look like? - Dauntless
@Dauntless you do not actually need to map the field... the processor will create the field itself, inside its own output. But sometimes you need an analyzer or filter, as in the updated answer. Hope it helps. - Coronal
I've fought a lot with the Ingest Attachment plugin. It can't be used in production. I use Ambar (ambar.rdseventeen.com) as a solid solution for storing and searching through documents. - Milner
@SochiX sure we can use it in production, as it is in production in several cases. I myself have a project running in production, and it is running pretty well. Not a big deal, but there are 1K files, over 2 GB of data, and search results come back in less than 1 second. - Coronal
@Evert thank you for your comment. But in my case I have 3,000,000 files and the total size of the index is 268 GB. Ingest attachment just eats all the RAM when it tries to process a file larger than 40 MB. That's why I switched to Ambar. - Milner
I wrote a blog post about Ingest Attachment plugin problems. Check it out: Ingest Attachment Plugin for ElasticSearch: Should You Use It? - Milner
@SochiX you are the developer of Ambar, nice, I understand your enthusiasm. I will give it a try, but I confess... I am really happy with ES + Ingest. Cheers! - Coronal
Hey, I get how to create the pipeline and am able to push a document to my index. However, what if my PDF were stored in a field of my Django object? How do I index other fields along with this PDF? - Courtmartial
@RishabhRanawat as in my Edit 4, you just enter the properties (fields) you need, as in the official documentation elastic.co/guide/en/elasticsearch/reference/current/…, and when indexing, just fill your request with the data as needed. Hope it helps. - Coronal
When using XPUT to post a PDF I get {"error":"Content-Type header [application/pdf] is not supported","status":406}. Any suggestions? - Changeover
@Changeover are you using curl? Which version of ES are you using? I suggest posting a new question with your code so we can help you better. - Coronal
How do you query this document? - Balbinder
Hi @IlyaP, did you find a solution to your problem with huge data? - Carpetbagger
@Carpetbagger yep, try Ambar! - Milner
@IlyaP, Ambar looks great! How does it compare with github.com/opensemanticsearch? - Dissonancy
Hi @Dissonancy, it's more lightweight, I guess, and easy to set up and host. - Milner
@IlyaP I created a question on the Ambar GitHub to avoid hijacking this thread. Could you comment? Thanks! github.com/RD17/ambar/issues/287 - Dissonancy
Good tips you have provided! - Rafa
