Elasticsearch Parse Exception error when attempting to index PDF

2

15

I'm just getting started with Elasticsearch. We need to index thousands of PDF files, and I'm having a hard time getting even ONE of them to index successfully.

Installed the Attachment Type plugin and got response: Installed mapper-attachments.

Followed the Attachment Type in Action tutorial but the process hangs and I don't know how to interpret the error message. Also tried the gist which hangs in the same place.

$ curl -X POST "localhost:9200/test/attachment/" -d json.file 
{"error":"ElasticSearchParseException[Failed to derive xcontent from (offset=0, length=9): [106, 115, 111, 110, 46, 102, 105, 108, 101]]","status":400}

More details:

The json.file contains an embedded Base64 PDF file (as per instructions). The first line of the file appears correct (to me anyway): {"file":"JVBERi0xLjQNJeLjz9MNCjE1OCAwIG9iaiA8...

I'm not sure if maybe the json.file is invalid or if maybe elasticsearch just isn't set up to parse PDFs properly?!?

Encoding - Here's how we're encoding the PDF into json.file (as per tutorial):

coded=`cat fn6742.pdf | perl -MMIME::Base64 -ne 'print encode_base64($_)'`
json="{\"file\":\"${coded}\"}"
echo "$json" > json.file

also tried:

coded=`openssl base64 -in fn6742.pdf`

log:

[2012-06-07 12:32:16,742][DEBUG][action.index             ] [Bailey, Paul] [test][0], node[AHLHFKBWSsuPnTIRVhNcuw], [P], s[STARTED]: Failed to execute [index {[test][attachment][DauMB-vtTIaYGyKD4P8Y_w], source[json.file]}]
org.elasticsearch.ElasticSearchParseException: Failed to derive xcontent from (offset=0, length=9): [106, 115, 111, 110, 46, 102, 105, 108, 101]
    at org.elasticsearch.common.xcontent.XContentFactory.xContent(XContentFactory.java:147)
    at org.elasticsearch.common.xcontent.XContentHelper.createParser(XContentHelper.java:50)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:451)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:437)
    at org.elasticsearch.index.shard.service.InternalIndexShard.prepareCreate(InternalIndexShard.java:290)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:210)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:532)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:430)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:680)

Hoping someone can help me see what I'm missing or did wrong?

Zahara answered 13/6, 2012 at 14:50 Comment(0)
20

The following error points to the source of the problem.

Failed to derive xcontent from (offset=0, length=9): [106, 115, 111, 110, 46, 102, 105, 108, 101]

The byte values [106, 115, 111, ...] are the UTF-8 codes for the string "json.file", which shows that you are trying to index the literal string "json.file" instead of the content of the file.
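
As a quick check, the byte values from the error message can be decoded in the shell. This is only a minimal sketch; the variable names `bytes` and `decoded` are made up for illustration.

```shell
# Byte values copied from the error message.
bytes="106 115 111 110 46 102 105 108 101"

decoded=""
for b in $bytes; do
  # Convert each decimal value to a 3-digit octal escape and let printf emit the character.
  decoded="$decoded$(printf "\\$(printf '%03o' "$b")")"
done

echo "$decoded"   # prints: json.file
```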

To index the content of the file, put an "@" in front of the file name so that curl reads the request body from that file.

curl -X POST "localhost:9200/test/attachment/" -d @json.file
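
Putting the pieces together, here is a rough sketch of building the payload and posting it. It assumes a `base64` command-line tool is available, and `build_payload` is a hypothetical helper name, not something from the tutorial. It encodes the whole file in one pass and strips line breaks so the Base64 text sits on a single line inside the JSON string.

```shell
# build_payload INPUT OUTPUT
# Writes {"file":"<base64 of INPUT>"} to OUTPUT.
build_payload() {
  pdf="$1"
  out="$2"
  # Encode the whole file and remove any line wrapping added by base64.
  coded=$(base64 < "$pdf" | tr -d '\n')
  printf '{"file":"%s"}\n' "$coded" > "$out"
}

# Usage (file name taken from the question):
#   build_payload fn6742.pdf json.file
#   curl -X POST "localhost:9200/test/attachment/" -d @json.file
```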

Wendolyn answered 13/6, 2012 at 19:46 Comment(3)
Ah, you are correct! Thanks for your help! But, now I've tried adding the @ in front of filename and it just hangs with no output to the log?!? I need to ctrl-C to get my shell back. Any ideas? Maybe a way to make log more helpful? - Zahara
Could you run jstack and see where it hangs? - Wendolyn
The needle in the haystack! - Alcoholicity
3

Turns out it's necessary to export ES_JAVA_OPTS=-Djava.awt.headless=true before running a Java app on a 'headless' server... who would'a thought!?!

Zahara answered 21/6, 2012 at 22:9 Comment(1)
It's worth noting that this only silences the error. @imotov's answer is probably the correct answer for this problem. Another reason a "Failed to derive xcontent" error will pop up is when an empty payload is passed to Elasticsearch. - Lance
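
A cheap guard against the empty-payload case is to check the file before posting it. This is just a sketch; `payload_ok` is a hypothetical helper name.

```shell
# payload_ok FILE
# Succeeds only if FILE exists and is not empty.
payload_ok() {
  [ -s "$1" ]
}

# Usage:
#   payload_ok json.file && curl -X POST "localhost:9200/test/attachment/" -d @json.file
```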
