Elasticsearch: Highlighting hits from within attachment

I'm having trouble getting highlighting to work with Elasticsearch (and Tire) in a Rails app. I can successfully index PDF attachments and query them, but the search results never include any highlights.

I'm not that familiar with ES, so I'm not sure where to look to troubleshoot. I'll start with the model, the mappings and a curl query, but feel free to ask for more info.

class Article < ActiveRecord::Base
  include Tire::Model::Search
  include Tire::Model::Callbacks

  attr_accessible :title, :content, :published_on, :filename 

  mapping do
    indexes :id, :type =>'integer'
    indexes :title
    indexes :content
    indexes :published_on, :type => 'date'
    indexes :attachment, :type => 'attachment',
                            :fields => {
                            :name       => { :store => 'yes' },
                            :content    => { :store => 'yes' },
                            :title      => { :store => 'yes' },
                            :file       => { :term_vector => 'with_positions_offsets', :store => 'yes' },
                            :date       => { :store => 'yes' }
                          }
  end

  def to_indexed_json
    to_json(:methods => [:attachment])
  end

  def attachment
    if filename.present?
      path_to_pdf = "/Volumes/Calvin/sample_pdfs/#{filename}.pdf"
      Base64.encode64(open(path_to_pdf) { |pdf| pdf.read })
    else
      Base64.encode64("missing")
    end
  end
end

Mappings (via Curl):

$ curl -XGET 'http://localhost:9200/_mapping?pretty=true'
{
  "articles" : {
    "article" : {
      "properties" : {
        "attachment" : {
          "type" : "attachment",
          "path" : "full",
          "fields" : {
            "attachment" : {
              "type" : "string"
            },
            "title" : {
              "type" : "string",
              "store" : "yes"
            },
            "name" : {
              "type" : "string",
              "store" : "yes"
            },
            "date" : {
              "type" : "date",
              "ignore_malformed" : false,
              "store" : "yes",
              "format" : "dateOptionalTime"
            },
            "content_type" : {
              "type" : "string"
            }
          }
        },
        "content" : {
          "type" : "string"
        },
        "created_at" : {
          "type" : "date",
          "ignore_malformed" : false,
          "format" : "dateOptionalTime"
        },
        "filename" : {
          "type" : "string"
        },
        "id" : {
          "type" : "integer",
          "ignore_malformed" : false
        },
        "published_on" : {
          "type" : "date",
          "ignore_malformed" : false,
          "format" : "dateOptionalTime"
        },
        "title" : {
          "type" : "string"
        },
        "updated_at" : {
          "type" : "date",
          "ignore_malformed" : false,
          "format" : "dateOptionalTime"
        }
      }
    }
  }
}

A query with a 'hit' inside a 125-page indexed PDF:

$ curl "localhost:9200/_search?pretty=true" -d '{
  "fields" : ["title"],
  "query" : {
    "query_string" : {
      "query" : "xerox"
    }
  },
  "highlight" : {
    "fields" : {
      "attachment" : {}
    }
  }
}'

{
  "took" : 1077,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.036417194,
    "hits" : [ {
      "_index" : "articles",
      "_type" : "article",
      "_id" : "13",
      "_score" : 0.036417194,
      "fields" : {
        "title" : "F-E12"
      }
    } ]
  }
}

I was expecting a section like:

"highlight" : {
        "attachment" : [ "\nLast Year <em>Xerox</em> moved their facilities" ]
  }

Thanks for any help!

Edit2: adjusted query (changed attachment to attachment.file) to no avail:

$ curl "localhost:9200/_search?pretty=true" -d '{
  "fields" : ["title","attachment"],
  "query" : {"query_string" : {"query" : "xerox"}},
  "highlight" : {"fields" : {"attachment.file" : {}}}
}'

{
  "took" : 221,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.036417194,
    "hits" : [ {
      "_index" : "articles",
      "_type" : "article",
      "_id" : "13",
      "_score" : 0.036417194,
      "fields" : {
        "title" : "F-E12",
        "attachment" : "JVBERi0xLjYNJeLjz9MNCjk4NSAwIG9iag08PC9MaW5lYXJpemVkIDEvTCA...\n"
      }
    } ]
  }
}

Edit3 (remove "fields"):

$ curl "localhost:9200/_search?pretty=true" -d '{
  "query" : {"query_string" : {"query" : "xerox"}},
  "highlight" : {"fields" : {"attachment" : {}}}
}'

{
  "took" : 1078,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.036417194,
    "hits" : [ {
      "_index" : "articles",
      "_type" : "article",
      "_id" : "13",
      "_score" : 0.036417194, "_source" : {"content":"Real report","created_at":"2012-08-28T22:44:08Z","filename":"F-E12","id":13,"published_on":"2007-12-28","title":"F-E12","updated_at":"2012-08-28T22:44:08Z","attachment":"JVBERi0xLjYNJeLjz9MNCjk4NSAwIG9iag08PC9MaW5lYXJpemVkID...\n"
      }
    } ]
  }
}

Edit4 (mapping from Attachment Type in Action tutorial):

$ curl -XGET 'http://localhost:9200/test/_mapping?pretty=true'
{
  "test" : {
    "attachment" : {
      "properties" : {
        "file" : {
          "type" : "attachment",
          "path" : "full",
          "fields" : {
            "file" : {                #<== This appears to be missing 
              "type" : "string",      #<== from my Articles mapping
              "store" : "yes",        #<==
              "term_vector" : "with_positions_offsets"  #<==
            },
            "author" : {
              "type" : "string"
            },
            "title" : {
              "type" : "string",
              "store" : "yes"
            },
            "name" : {
              "type" : "string"
            },
            "date" : {
              "type" : "date",
              "ignore_malformed" : false,
              "format" : "dateOptionalTime"
            },
            "keywords" : {
              "type" : "string"
            },
            "content_type" : {
              "type" : "string"
            }
          }
        }
      }
    }
  }
}
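
The comparison in Edit4 can also be done programmatically. What follows is only a rough sketch: `attachment_text_subfield` is a hypothetical helper, the two hashes are trimmed copies of the mappings shown above, and it assumes the sub-field that carries the extracted text is the one sharing the attachment property's own name, as in the tutorial mapping.

```ruby
require 'json'

# Returns the options of the sub-field that shares the attachment property's
# own name, or nil when the property is missing or is not an attachment.
# (Hypothetical helper, for illustration only.)
def attachment_text_subfield(mapping, index, type, property)
  prop = mapping.fetch(index, {}).fetch(type, {}).fetch('properties', {})[property]
  return nil unless prop && prop['type'] == 'attachment'
  prop.fetch('fields', {})[property]
end

# Trimmed copy of the Articles mapping from the question.
articles = JSON.parse(<<-JSON)
{ "articles": { "article": { "properties": { "attachment": {
    "type": "attachment",
    "fields": { "attachment": { "type": "string" },
                "title": { "type": "string", "store": "yes" } } } } } } }
JSON

# Trimmed copy of the tutorial mapping from Edit4.
tutorial = JSON.parse(<<-JSON)
{ "test": { "attachment": { "properties": { "file": {
    "type": "attachment",
    "fields": { "file": { "type": "string", "store": "yes",
                          "term_vector": "with_positions_offsets" },
                "title": { "type": "string", "store": "yes" } } } } } } }
JSON

[[articles, 'articles', 'article', 'attachment'],
 [tutorial, 'test', 'attachment', 'file']].each do |mapping, index, type, property|
  sub = attachment_text_subfield(mapping, index, type, property) || {}
  puts "#{index}/#{type}.#{property}: store=#{sub['store'].inspect} " \
       "term_vector=#{sub['term_vector'].inspect}"
end
```

Run against the full `_mapping` output, a check like this makes it easier to see that the Articles mapping never picked up `store` or `term_vector` on the text sub-field, while the tutorial mapping did.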
Anatropous answered 29/8, 2012 at 3:30 Comment(14)
I'm afraid the mapping you posted is related to a different type: it's attachment, not article. Are you sure that the article mapping is correct? Could you also add the field attachment itself to the output? - Herbart
I'm not sure anything is correct so please make suggestions. I've added the attachment method you asked for. Appreciate your help! - Anatropous
Sorry, I didn't notice you were using the attachment field type! Mapping looks good! I think you should try to highlight the attachment.file field rather than attachment itself. Let me know how it went! - Herbart
I changed (see "Edit2" above) attachment to attachment.file as per your recommendation. The changes had no effect on the output. Any other ideas?!? - Anatropous
Can you run the search query adding the attachment to the output using "fields" : ["title","attachment"]? - Herbart
Yes, I can (see Edit2) and when I do I get the whole PDF encoded. What does this mean? - Anatropous
Thanks, can you now add to the output the attachment.file field? - Herbart
Appreciate all your help! Not sure what you mean, exactly, by "add to the output the attachment.file field". Sorry, I'm still trying to figure out Elasticsearch. Aside: maybe you should move this to an answer? This comment thread is getting pretty long… - Anatropous
I don't have an answer yet; the comments are useful to understand more about your issue. I meant to use "fields" : ["title","attachment.file"] and paste the output. Probably even better if you remove the fields part so that elasticsearch returns the whole _source and you can post it. - Herbart
See Edit3. Again, appreciate your time/help with this! - Anatropous
Sorry if I ask, but did you install the elasticsearch-mapper-attachments plugin? - Herbart
Of course. And I even have plugin.mandatory: mapper-attachments in my elasticsearch.yml config file so it won't even start up w/out it. Any way to debug this query and see why the highlighting is just being ignored? I tried rebuilding the index (rake environment tire:import CLASS='Article' FORCE=true), to no avail. It's frustrating not being able to get simple functionality working with ES so I can start tweaking it. - Anatropous
Honestly I don't understand what's going on. Looks like you only have the base64 within that field, nothing else. That's why I was asking if you installed the plugin. - Herbart
Did some more sleuthing and ran through the Attachment Type in Action tutorial again. I've put the output for its mapping above (Edit4). Comparing it to my mapping, we can see that for some reason the file field is not getting picked up in my mapping?!? So I must have a syntax error in the mapping in my model?!? But there are so few examples of using attachments with Tire that I'm having a hard time finding it... Thanks for your help! - Anatropous

I figured it out! Finally...

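The problem was with the mapping syntax in my Article class: the sub-field that carries the term vector has to be named after the attachment field itself, so I needed to rename ":file" to ":attachment".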

  tire.mapping do
    indexes :id, :type =>'integer'
    indexes :title
    indexes :content
    indexes :published_on, :type => 'date'
    indexes :attachment, :type => 'attachment', #:null_value => 'missing_file',
                            :fields => {
                            :name       => { :store => 'yes' },  # exists?!?
                            :content    => { :store => 'yes' },
                            :title      => { :store => 'yes' },
  # WRONG! see next line => :file       => { :term_vector => 'with_positions_offsets', :store => 'yes' },
                            :attachment => { :term_vector => 'with_positions_offsets', :store => 'yes' },
                            :date       => { :store => 'yes' }
                          }
  end
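
For completeness, here is a minimal sketch of the request body that goes with this mapping, highlighting `attachment` rather than `attachment.file`. It is an illustration only: `highlight_search_body` is a hypothetical helper name, and it assumes the index has been rebuilt with the corrected mapping.

```ruby
require 'json'

# Builds a search body shaped like the curl queries in the question.
# `highlight_search_body` is a hypothetical helper, not part of Tire.
def highlight_search_body(terms, highlight_field = 'attachment')
  {
    'fields'    => ['title'],
    'query'     => { 'query_string' => { 'query' => terms } },
    # Highlight the sub-field named after the attachment field itself.
    'highlight' => { 'fields' => { highlight_field => {} } }
  }
end

body = highlight_search_body('xerox')
puts JSON.pretty_generate(body)
```

The printed JSON can be passed to `curl "localhost:9200/_search?pretty=true" -d '...'` exactly as in the question.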
Anatropous answered 31/8, 2012 at 23:43 Comment(2)
I imagine this changed (back) to 'file' as of ES version 0.90.0. - Gallicanism
Any idea on how to do this via curl or python? - Ossiferous
