How to index a pdf file in Elasticsearch 5.0.0 with ingest-attachment plugin?


Asked 16/6, 2016 at 13:52 Answered 30/10, 2016 at 21:54

Solved pdf elasticsearch plugins attachment elasticsearch-plugin


I'm new to Elasticsearch, and I read at https://www.elastic.co/guide/en/elasticsearch/plugins/master/mapper-attachments.html that the mapper-attachments plugin is deprecated in Elasticsearch 5.0.0.

I am now trying to index a PDF file with the new ingest-attachment plugin and upload the attachment.

What I've tried so far is:

curl -H 'Content-Type: application/pdf' -XPOST localhost:9200/test/1 -d @/cygdrive/c/test/test.pdf

but I get the following error:

{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"failed to parse"}],"type":"mapper_parsing_exception","reason":"failed to parse","caused_by":{"type":"not_x_content_exception","reason":"Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes"}},"status":400}

I would expect the PDF file to be indexed and uploaded. What am I doing wrong?

I also tested Elasticsearch 2.3.3, but the mapper-attachments plugin is not valid for this version, and I don't want to use any older version of Elasticsearch.

Talbot answered 16/6, 2016 at 13:52 Comment(0)


You need to make sure you have created your ingest pipeline with:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
      }
    }
  ]
}

Then you can index a document with a PUT (not a POST) request that names the pipeline you've created:

PUT my_index/my_type/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}

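In your example, it should be something like the following (my_type is a placeholder type name):

(printf '{"data":"'; base64 -w 0 /cygdrive/c/test/test.pdf; printf '"}') | curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/test/my_type/1?pipeline=attachment' --data-binary @-

Note that the PDF content must be base64-encoded and sent in the data field of a JSON body; Elasticsearch will not accept the raw PDF with Content-Type: application/pdf.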
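The base64 step can also be done from a small script. Below is a minimal sketch in Python, using only the standard library, of how the request body could be built and sent. The function names, the my_type type name and the localhost:9200 address are assumptions made for illustration; they are not part of the plugin.

```python
import base64
import json
import urllib.request


def build_attachment_body(pdf_path):
    """Return the JSON request body: the file's bytes, base64-encoded, in "data"."""
    with open(pdf_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    # "data" must match the "field" configured in the attachment processor.
    return json.dumps({"data": encoded})


def index_pdf(pdf_path, url):
    """PUT the body to a document URL that names the ingest pipeline."""
    request = urllib.request.Request(
        url,
        data=build_attachment_body(pdf_path).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical usage (index, type and id are placeholders):
# index_pdf("/cygdrive/c/test/test.pdf",
#           "http://localhost:9200/test/my_type/1?pipeline=attachment")
```

Reading the whole file into memory keeps the sketch short; for very large files a streaming approach would probably be preferable.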

Hope it will help you.

Edit 1

Please make sure to read this; it helped me a lot:

Ingest Presentation

Edit 2

Also, you must have the ingest-attachment plugin installed:

./bin/elasticsearch-plugin install ingest-attachment

Edit 3

Before you create your ingest pipeline with the attachment processor, create your index and its mapping with the fields you will use. Make sure your mapping has the data field (the same name as "field" in your attachment processor), so that the ingest processor can read your PDF content from it and write the extracted text under the attachment field.

I set the indexed_chars option to -1 in the attachment processor so that large PDF files are indexed in full; by default, only the first 100000 extracted characters are indexed.

Edit 4

The mapping should be something like this:

PUT my_index
{ 
    "mappings" : { 
        "my_type" : { 
            "properties" : { 
                "attachment.content" : { 
                    "type": "text", 
                    "analyzer" : "brazilian" 
                } 
            } 
        } 
    } 
}

In this case, I use the brazilian analyzer, but you can remove it or use your own.
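Once a document has gone through the pipeline, the extracted text can be searched like any other text field. Here is a minimal sketch in Python of a match query body, assuming the extracted text lives in attachment.content (the attachment processor's default output location). The function name is hypothetical, and the body would be sent to something like my_index/_search.

```python
import json


def build_content_query(text, field="attachment.content"):
    """Build a match query against the field holding the extracted text."""
    return {"query": {"match": {field: text}}}


# Body for a request such as: GET my_index/_search
print(json.dumps(build_content_query("lorem ipsum"), indent=2))
```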

Coronal answered 30/10, 2016 at 21:54 Comment(18)

Why do you need a mapping for the data field? Doesn't the pipeline pick up the data field and process it without it having to be explicitly mapped? What would this mapping look like? – Dauntless 11/11, 2016 at 19:4

@Dauntless Actually, you do not need to map the field; the processor will create the field itself, inside its own output object. But sometimes you need something like the analyzer in the updated answer. Hope it helps. – Coronal 11/11, 2016 at 19:10

I've fought a lot with the Ingest Attachment plugin. It can't be used in production. I use Ambar (ambar.rdseventeen.com) as a solid solution for storing and searching through documents. – Milner 31/1, 2017 at 11:43

@SochiX Sure, we can use it in production, as it already is in several cases. I myself have a project running in production, and it runs pretty well. No big deal, but there are 1K files, over 2 GB of data, and search results come back in less than 1 second. – Coronal 31/1, 2017 at 12:33

@Evert Thank you for your comment. But in my case I have 3,000,000 files, and the total size of the index is 268 GB. Ingest attachment just eats all the RAM when it tries to process a file larger than 40 MB. That's why I switched to Ambar. – Milner 1/2, 2017 at 13:5

I wrote a blog post about Ingest Attachment plugin problems. Check it out: Ingest Attachment Plugin for ElasticSearch: Should You Use It? – Milner 4/4, 2017 at 13:54

@SochiX you are the developer of Ambar, nice, I understand your enthusiasm. I will give it a try, but confess... I am really happy with ES + Ingest. Cheers! – Coronal 4/4, 2017 at 14:38

Hey, I get how to create the pipeline and am able to push a document to my index. However, what if my PDF is stored in a field of my Django object? How do I index the other fields along with this PDF? – Courtmartial 22/4, 2017 at 2:19

@RishabhRanawat As in my Edit 4, you just enter the properties (fields) you need, as per the official documentation elastic.co/guide/en/elasticsearch/reference/current/…, and when indexing, just fill your request with the data as needed. Hope that helps. – Coronal 25/4, 2017 at 12:12

When using XPUT to post the PDF, I get {"error":"Content-Type header [application/pdf] is not supported","status":406}. Any suggestions? – Changeover 2/5, 2017 at 13:36

@Changeover Are you using curl? Which version of ES are you using? I suggest posting a new question with your code so we can help you better. – Coronal 2/5, 2017 at 13:54

How do you query this document ? – Balbinder 14/7, 2018 at 17:47

Hi @IlyaP, Did you find the solution to your problem for huge data? – Carpetbagger 27/7, 2020 at 19:31

@Carpetbagger yep, try Ambar! – Milner 28/7, 2020 at 9:47

@IlyaP, Ambar looks great! How does it compare with github.com/opensemanticsearch ? – Dissonancy 28/7, 2020 at 17:9

Hi, @Dissonancy it's more lightweight I guess and easy to setup/host – Milner 29/7, 2020 at 10:1

@IlyaP I created a question on Ambar github to avoid hijacking this thread. Could you comment? Thanks! github.com/RD17/ambar/issues/287 – Dissonancy 3/8, 2020 at 16:20

Good tips you have provided! – Rafa 23/12, 2021 at 14:49
