elasticsearch disable term frequency scoring

I want to change the scoring in Elasticsearch so that multiple appearances of a term are not counted. For example, I want:

"texas texas texas"

and

"texas"

to come out with the same score. I found this mapping which, according to Elasticsearch, disables term frequency counting, but my searches still do not come out with the same score:

{
  "mappings": {
    "business": {
      "properties": {
        "name": {
          "type": "string",
          "index_options": "docs",
          "norms": { "enabled": false }
        }
      }
    }
  }
}

Any help will be appreciated, I have not been able to find a lot of information on this.

I am adding my search code and what gets returned when I use explain.

My search code:

    // Connect to the local cluster over the transport protocol
    Settings settings = ImmutableSettings.settingsBuilder()
            .put("cluster.name", "escluster")
            .build();
    Client client = new TransportClient(settings)
            .addTransportAddress(new InetSocketTransportAddress("127.0.0.1", 9300));

    // Match "Texas" against the name field
    SearchRequest request = Requests.searchRequest("businesses")
            .source(SearchSourceBuilder.searchSource()
                    .query(QueryBuilders.boolQuery()
                            .should(QueryBuilders.matchQuery("name", "Texas")
                                    .minimumShouldMatch("1"))))
            .searchType(SearchType.DFS_QUERY_THEN_FETCH);
    
    ExplainRequest request2 = client.prepareIndex("businesses", "business")

and when I search with explain I get:

  "took" : 14,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_shard" : 1,
      "_node" : "BTqBPVDET5Kr83r-CYPqfA",
      "_index" : "businesses",
      "_type" : "business",
      "_id" : "AU9U5KBks4zEorv9YI4n",
      "_score" : 1.0,
      "_source":{
"name" : "texas"
}
,
      "_explanation" : {
        "value" : 1.0,
        "description" : "weight(_all:texas in 0) [PerFieldSimilarity], result of:",
        "details" : [ {
          "value" : 1.0,
          "description" : "fieldWeight in 0, product of:",
          "details" : [ {
            "value" : 1.0,
            "description" : "tf(freq=1.0), with freq of:",
            "details" : [ {
              "value" : 1.0,
              "description" : "termFreq=1.0"
            } ]
          }, {
            "value" : 1.0,
            "description" : "idf(docFreq=2, maxDocs=3)"
          }, {
            "value" : 1.0,
            "description" : "fieldNorm(doc=0)"
          } ]
        } ]
      }
    }, {
      "_shard" : 1,
      "_node" : "BTqBPVDET5Kr83r-CYPqfA",
      "_index" : "businesses",
      "_type" : "business",
      "_id" : "AU9U5K6Ks4zEorv9YI4o",
      "_score" : 0.8660254,
      "_source":{
"name" : "texas texas texas"
}
,
      "_explanation" : {
        "value" : 0.8660254,
        "description" : "weight(_all:texas in 0) [PerFieldSimilarity], result of:",
        "details" : [ {
          "value" : 0.8660254,
          "description" : "fieldWeight in 0, product of:",
          "details" : [ {
            "value" : 1.7320508,
            "description" : "tf(freq=3.0), with freq of:",
            "details" : [ {
              "value" : 3.0,
              "description" : "termFreq=3.0"
            } ]
          }, {
            "value" : 1.0,
            "description" : "idf(docFreq=2, maxDocs=3)"
          }, {
            "value" : 0.5,
            "description" : "fieldNorm(doc=0)"
          } ]
        } ]
      }
    } ]
  }
    

It looks like it is still considering term frequency and document frequency. Any ideas? Sorry for the bad formatting; I don't know why it looks so messy.
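For reference, the numbers in that explain output are consistent with Lucene's classic TF-IDF formula, which is assumed here to be the default similarity in use: tf = sqrt(termFreq), idf = 1 + ln(maxDocs / (docFreq + 1)), and the score is tf * idf * fieldNorm. The sketch below only redoes that arithmetic; the class and method names are made up for illustration, and fieldNorm is passed in as reported by explain because Lucene stores it in a lossy encoding. Two things stand out: the weight is computed on _all rather than name, and a fieldNorm of 0.5 suggests norms are still enabled for the field being scored.

```java
// Sketch of the classic TF-IDF fieldWeight arithmetic seen in the explain output.
// Assumption: tf = sqrt(termFreq) and idf = 1 + ln(maxDocs / (docFreq + 1)).
class ClassicScoreSketch {

    // tf(freq=...) as printed by explain
    static double tf(double termFreq) {
        return Math.sqrt(termFreq);
    }

    // idf(docFreq=..., maxDocs=...) as printed by explain
    static double idf(long docFreq, long maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // fieldNorm is taken from the explain output rather than recomputed,
    // because the stored value is rounded by the index.
    static double fieldWeight(double termFreq, long docFreq, long maxDocs, double fieldNorm) {
        return tf(termFreq) * idf(docFreq, maxDocs) * fieldNorm;
    }

    public static void main(String[] args) {
        // "texas": termFreq=1, docFreq=2, maxDocs=3, fieldNorm=1.0
        System.out.println(fieldWeight(1, 2, 3, 1.0));
        // "texas texas texas": termFreq=3, fieldNorm=0.5, roughly 0.866
        System.out.println(fieldWeight(3, 2, 3, 0.5));
    }
}
```

If term frequencies and norms were really disabled for the scored field, both documents would be expected to end up with the same tf and fieldNorm, and therefore the same score.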

The result of my browser search http://localhost:9200/businesses/business/_search?pretty=true&qname=texas is:

    {
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "businesses",
      "_type" : "business",
      "_id" : "AU9YcCKjKvtg8NgyozGK",
      "_score" : 1.0,
      "_source":{"business" : {
"name" : "texas texas texas texas" }
}
    }, {
      "_index" : "businesses",
      "_type" : "business",
      "_id" : "AU9YateBKvtg8Ngyoy-p",
      "_score" : 1.0,
      "_source":{
"name" : "texas" }

    }, {
      "_index" : "businesses",
      "_type" : "business",
      "_id" : "AU9YavVnKvtg8Ngyoy-4",
      "_score" : 1.0,
      "_source":{
"name" : "texas texas texas" }

    }, {
      "_index" : "businesses",
      "_type" : "business",
      "_id" : "AU9Yb7NgKvtg8NgyozFf",
      "_score" : 1.0,
      "_source":{"business" : {
"name" : "texas texas texas" }
}
    } ]
  }
}

It finds all 4 documents I have in there and gives them all the same score. When I run my Java API search with explain I get:

    {
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.287682,
    "hits" : [ {
      "_shard" : 1,
      "_node" : "BTqBPVDET5Kr83r-CYPqfA",
      "_index" : "businesses",
      "_type" : "business",
      "_id" : "AU9YateBKvtg8Ngyoy-p",
      "_score" : 1.287682,
      "_source":{
"name" : "texas" }
,
      "_explanation" : {
        "value" : 1.287682,
        "description" : "weight(name:texas in 0) [PerFieldSimilarity], result of:",
        "details" : [ {
          "value" : 1.287682,
          "description" : "fieldWeight in 0, product of:",
          "details" : [ {
            "value" : 1.0,
            "description" : "tf(freq=1.0), with freq of:",
            "details" : [ {
              "value" : 1.0,
              "description" : "termFreq=1.0"
            } ]
          }, {
            "value" : 1.287682,
            "description" : "idf(docFreq=2, maxDocs=4)"
          }, {
            "value" : 1.0,
            "description" : "fieldNorm(doc=0)"
          } ]
        } ]
      }
    }, {
      "_shard" : 1,
      "_node" : "BTqBPVDET5Kr83r-CYPqfA",
      "_index" : "businesses",
      "_type" : "business",
      "_id" : "AU9YavVnKvtg8Ngyoy-4",
      "_score" : 1.1151654,
      "_source":{
"name" : "texas texas texas" }
,
      "_explanation" : {
        "value" : 1.1151654,
        "description" : "weight(name:texas in 0) [PerFieldSimilarity], result of:",
        "details" : [ {
          "value" : 1.1151654,
          "description" : "fieldWeight in 0, product of:",
          "details" : [ {
            "value" : 1.7320508,
            "description" : "tf(freq=3.0), with freq of:",
            "details" : [ {
              "value" : 3.0,
              "description" : "termFreq=3.0"
            } ]
          }, {
            "value" : 1.287682,
            "description" : "idf(docFreq=2, maxDocs=4)"
          }, {
            "value" : 0.5,
            "description" : "fieldNorm(doc=0)"
          } ]
        } ]
      }
    } ]
  }
}
Terrorstricken answered 24/8, 2015 at 22:38 Comment(6)
The mismatch probably has more to do with document frequency than term frequency. Are you using search_type=dfs_query_then_fetch? If that doesn't help, try setting explain=true in the query to see the scoring breakdown. - Chock
I switched it to dfs_query_then_fetch, but that didn't work. I will post my code and explain results in a second. - Terrorstricken
Could you post the query too? - Chock
I'm sorry, what do you mean? I just execute the SearchRequest from above with: ActionFuture af = client.search(request); - Terrorstricken
And thank you for the formatting edit! - Terrorstricken
Oh, my bad, I did not realise the query is in the code snippet. Could you print the actual query DSL the code generates? The explain output seems to suggest the query is running against the _all field. - Chock

It looks like one cannot override the index_options for a field after the field has initially been set in the mapping.

Example:

PUT test

PUT test/business/_mapping
{
   "properties": {
      "name": {
         "type": "string",
         "index_options": "freqs",
         "norms": {
            "enabled": false
         }
      }
   }
}

PUT test/business/_mapping
{
   "properties": {
      "name": {
         "type": "string",
         "index_options": "docs",
         "norms": {
            "enabled": false
         }
      }
   }
}

GET test/business/_mapping

{
   "test": {
      "mappings": {
         "business": {
            "properties": {
               "name": {
                  "type": "string",
                  "norms": {
                     "enabled": false
                  },
                  "index_options": "freqs"
               }
            }
         }
      }
   }
}

You would have to recreate the index for the new mapping to take effect.
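A minimal sketch of that recreate step, assuming a node at http://localhost:9200 (as in the URLs in this thread), the typed mapping from the question, and Java 11+ for java.net.http. The class, method, and constant names are hypothetical. It builds three requests: delete the index, create it with the mapping in the creation body, and read the mapping back to check that index_options is docs.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

class RecreateIndexSketch {
    // Assumption: a local test node; adjust for your cluster.
    static final String BASE = "http://localhost:9200";

    // Mapping from the question, supplied when the index is created.
    static final String MAPPING_BODY =
        "{ \"mappings\": { \"business\": { \"properties\": { \"name\": {"
        + " \"type\": \"string\", \"index_options\": \"docs\","
        + " \"norms\": { \"enabled\": false } } } } } }";

    // WARNING: deleting the index removes every document in it.
    static HttpRequest deleteIndex(String index) {
        return HttpRequest.newBuilder(URI.create(BASE + "/" + index)).DELETE().build();
    }

    static HttpRequest createIndex(String index, String body) {
        return HttpRequest.newBuilder(URI.create(BASE + "/" + index))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }

    static HttpRequest getMapping(String index) {
        return HttpRequest.newBuilder(URI.create(BASE + "/" + index + "/_mapping")).GET().build();
    }

    public static void main(String[] args) throws InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        List<HttpRequest> steps =
            List.of(deleteIndex("test"), createIndex("test", MAPPING_BODY), getMapping("test"));
        for (HttpRequest step : steps) {
            try {
                HttpResponse<String> resp = client.send(step, HttpResponse.BodyHandlers.ofString());
                System.out.println(step.method() + " " + step.uri() + " -> " + resp.statusCode());
                System.out.println(resp.body());
            } catch (IOException e) {
                // No node reachable at BASE, or the connection dropped.
                System.out.println(step.method() + " " + step.uri() + " failed: " + e);
            }
        }
    }
}
```

Documents have to be indexed again after the index is recreated, since index_options only affects what is written at index time.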

Chock answered 25/8, 2015 at 14:20 Comment(13)
Well, this is embarrassing; that was my own stupidity. I was testing using just my browser with the command: localhost:9200/businesses/…, and after I changed it to "qname=texas" it works: the scores are the same. So why doesn't it work with my Java API search, where it seems like I am searching the name field? - Terrorstricken
Could you paste the whole snippet or, better, the response with explain set in the Java client? - Chock
I'm sorry, I am not sure how to set it in the Java API; it doesn't seem to be an option with SearchRequest. I will update my OP with the code. - Terrorstricken
I changed to SearchResponse to be able to use explain; I am updating the OP again and overwriting the previous edit. It looks like when I'm using the Java API it's not hitting the settings that should ignore the frequencies. - Terrorstricken
Strange. Could you try http://localhost:9200/businesses/business/_search?pretty=true&q=name:texas&search_type=dfs_query_then_fetch&explain=true in the browser and see if you still get the same score? I have a feeling the mapping wasn't applied, or was applied after indexing the documents. - Chock
That new search gives me the same results as my Java API. And regarding the mappings, why would it be working for one search but not the other when it is on the same documents? I set the mapping before indexing anything. - Terrorstricken
The previous http://localhost:9200/businesses/business/_search?pretty=true&qname=texas has the wrong syntax, and Elasticsearch unfortunately ignores the wrong URL params instead of throwing an error. It defaults to match all, which is why all documents have the same score. You can try http://localhost:9200/businesses/business/_search?pretty=true&qname=thiscannotbeinthedocument and you should get the same result as before. It looks very likely that the mapping wasn't applied correctly; try http://localhost:9200/businesses/business/_mapping - Chock
Wow, it looks like you're right on all counts: same results, and the current mapping is not what I put in; it looks like the default mapping that Elasticsearch assigns. When I submit the mapping it gives me an all-good response. I don't remember exactly what it is, but it's something like acknowledged: true. Maybe I am putting it in the wrong place? - Terrorstricken
You are on to something; I updated the answer. It actually looks like once the index has been created and the field specified in the mapping, you cannot override it with a mapping call. I don't think it is mentioned in the documentation, though, so you could probably raise an issue with Elasticsearch, since it should at least raise an error rather than silently fail. - Chock
I am just using it on a test Elasticsearch right now, so I am deleting the index, adding the mapping to "businesses" and then adding little test objects. Is there something different I can do when adding the mapping initially? - Terrorstricken
You were right, I was mapping it the wrong way. I'll update my post above with my working mapping. Thank you so much!! - Terrorstricken
Is there a way to add "index_options" : "freqs" to all fields, not just the "name" field? I'm looking for something like "*" instead of "name". - Madonna
You should be able to achieve it using dynamic templates. - Chock

Your field type must be text.

You must reindex your data in Elasticsearch by creating a new index:

"mappings": {
  "properties": {
    "text": {
      "type": "text",
      "index_options": "docs"
    }
  }
}

https://www.elastic.co/guide/en/elasticsearch/reference/current/index-options.html

Endometriosis answered 30/5, 2021 at 13:1 Comment(0)
