ElasticSearch & attachment type (NEST C#)
I'm trying to index a PDF document with Elasticsearch/NEST.

The file is indexed, but the search returns 0 hits.

I need the search result to return only the document Id and the highlight result (without the base64 content).

I'd appreciate any help here.

Thanks,

Here is the code:

class Program
{
    static void Main(string[] args)
    {
        // create es client
        string index = "myindex";

        var settings = new ConnectionSettings("localhost", 9200)
            .SetDefaultIndex(index);
        var es = new ElasticClient(settings);

        // delete index if any
        es.DeleteIndex(index);

        // index document
        string path = "test.pdf";
        var doc = new Document()
        {
            Id = 1,
            Title = "test",
            Content = Convert.ToBase64String(File.ReadAllBytes(path))
        };

        var parameters = new IndexParameters() { Refresh = true };
        if (es.Index<Document>(doc, parameters).OK)
        {
            // search in document
            string query = "semantic"; // test.pdf contains the string "semantic"

            var result = es.Search<Document>(s => s
                .Query(q =>
                    q.QueryString(qs => qs
                        .Query(query)
                    )
                )
                .Highlight(h => h
                    .PreTags("<b>")
                    .PostTags("</b>")
                    .OnFields(
                      f => f
                        .OnField(e => e.Content)
                        .PreTags("<em>")
                        .PostTags("</em>")
                    )
                )
            );

            if (result.Hits.Total == 0)
            {
            }
        }
    }
}

[ElasticType(
    Name = "document",
    SearchAnalyzer = "standard",
    IndexAnalyzer = "standard"
)]
public class Document
{
    public int Id { get; set; }

    [ElasticProperty(Store = true)]
    public string Title { get; set; }

    [ElasticProperty(Type = FieldType.attachment,
        TermVector = TermVectorOption.with_positions_offsets)]
    public string Content { get; set; }
}
Eurystheus answered 8/2, 2013 at 21:55 Comment(3)
Also, I verified that the mapper-attachments plugin is installed and loaded (using es.yml: plugin.mandatory: mapper-attachments). Still, there are no hits for words contained in my PDF. I've googled for answers on this subject (Stack Overflow and others) and only came up with curl examples, no usage examples using C#/NEST. (Just a note: when searching the document.title ('test.pdf') I do get the document back, but there are no hits when searching 'test'.) Eurystheus
Just to let you know, I plan to create integration tests for this tomorrow and answer the question. I'm not able to answer sooner. Yarvis
Any updates on this question? Elute

Install the attachment plugin and restart Elasticsearch:

bin/plugin -install elasticsearch/elasticsearch-mapper-attachments/2.3.2

Create an Attachment class whose properties map to the fields described in the attachment plugin documentation:

  public class Attachment
  {
      [ElasticProperty(Name = "_content")]
      public string Content { get; set; }

      [ElasticProperty(Name = "_content_type")]
      public string ContentType { get; set; }

      [ElasticProperty(Name = "_name")]
      public string Name { get; set; }
  }

Add a property named "File" to the Document class you are indexing, and give it the correct mapping:

  [ElasticProperty(Type = FieldType.Attachment, TermVector = TermVectorOption.WithPositionsOffsets, Store = true)]
  public Attachment File { get; set; }

Create your index explicitly before you index any instances of your class. If you don't do this, it will use dynamic mapping and ignore your attribute mapping. If you change your mapping in the future, always recreate the index.

  client.CreateIndex("index-name", c => c
     .AddMapping<Document>(m => m.MapFromAttributes())
  );

Index your item

  string path = "test.pdf";

  var attachment = new Attachment();
  attachment.Content = Convert.ToBase64String(File.ReadAllBytes(path));
  attachment.ContentType = "application/pdf";
  attachment.Name = "test.pdf";

  var doc = new Document()
  {
      Id = 1,
      Title = "test",
      File = attachment
  };
  client.Index<Document>(doc);
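
To make the target shape concrete, here is a minimal sketch, independent of NEST, of the JSON body that the Attachment class above is meant to produce. The names AttachmentPayload and Build are hypothetical, System.Text.Json is an assumption (it is not used anywhere in this thread and needs a recent .NET), and the lowercase "id", "title" and "file" field names assume NEST's default camel-casing of property names. Only "_content", "_content_type" and "_name" come from the answer itself.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper, not part of NEST: builds the document body by hand
// so the attachment sub-fields are visible.
public static class AttachmentPayload
{
    public static string Build(int id, string title, byte[] fileBytes,
                               string contentType, string fileName)
    {
        // The attachment field is an object holding the base64 content
        // plus optional metadata, matching the Attachment class above.
        var file = new Dictionary<string, object>
        {
            ["_content"] = Convert.ToBase64String(fileBytes),
            ["_content_type"] = contentType,
            ["_name"] = fileName
        };

        // Assumed field names: camel-cased versions of Id, Title and File.
        var doc = new Dictionary<string, object>
        {
            ["id"] = id,
            ["title"] = title,
            ["file"] = file
        };

        return JsonSerializer.Serialize(doc);
    }
}
```

Comparing this output with what your client actually sends (for example by logging the request body) is one way to check that the base64 content ends up under "file" as an object rather than as a plain string.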

Search on the File property

  var query = Query<Document>.Term("file", "searchTerm");

  var searchResults = client.Search<Document>(s => s
          .From(start)
          .Size(count)
          .Query(query)
  );
Gazzo answered 25/9, 2014 at 16:30 Comment(0)

// I am using FSRiver plugin - https://github.com/dadoonet/fsriver/

void Main()
{
    // search in document
    string query = "directly"; // test.pdf contains the string "directly"
    var es = new ElasticClient(new ConnectionSettings(new Uri("http://*.*.*.*:9200"))
        .SetDefaultIndex("mydocs")
        .MapDefaultTypeNames(s => s.Add(typeof(Doc), "doc")));
    var result = es.Search<Doc>(s => s
        .Fields(f => f.Title, f => f.Name)
        .From(0)
        .Size(10000)
        .Query(q => q.QueryString(qs => qs.Query(query)))
        .Highlight(h => h
            .PreTags("<b>")
            .PostTags("</b>")
            .OnFields(
                f => f
                    .OnField(e => e.File)
                    .PreTags("<em>")
                    .PostTags("</em>")
            )
        )
    );
}

[ElasticType(Name = "doc",  SearchAnalyzer = "standard", IndexAnalyzer = "standard")]
public class Doc
{
    public int Id { get; set; }

    [ElasticProperty(Store = true)]
    public string Title { get; set; }

    [ElasticProperty(Type = FieldType.attachment, TermVector = TermVectorOption.with_positions_offsets)]
    public string File { get; set; }
    public string Name { get; set; }
}
Gymnasiast answered 21/9, 2013 at 11:58 Comment(0)

I am working on the same problem, and I am currently following this tutorial: http://www.elasticsearch.cn/tutorials/2011/07/18/attachment-type-in-action.html

The article explains the issue.

Note that you need to set up the correct mapping:

 "title" : { "store" : "yes" },
 "file" : { "term_vector":"with_positions_offsets", "store":"yes" }

I will try to figure out how to do that with the NEST API and update this post.

Contraption answered 6/12, 2013 at 9:38 Comment(0)

You need to add the mapping like below before you index items.

client.CreateIndex("yourindex", c => c.NumberOfReplicas(0).NumberOfShards(12).AddMapping<AssetSearchEntryModels>(m => m.MapFromAttributes()));
Horsy answered 1/5, 2014 at 4:6 Comment(0)
