Removing duplicate values from a Solr multivalued field

My Solr index contains a multivalued field with duplicate values. How can I remove the duplicates?

Is it possible to drop duplicate values from the multivalued field at index time?

Thanks

Salman answered 9/11, 2012 at 10:26 Comment(0)
S
6

Really late to the party, but the top answer did not work for me in Solr 6.0 when a duplicate entry was added to a multivalued field. It was missing a processor right before UniqFieldsUpdateProcessorFactory. Adding something like this to my solrconfig.xml worked:

<updateRequestProcessorChain name="uniq-fields">
  <processor class="org.apache.solr.update.processor.DistributedUpdateProcessorFactory"/>
  <processor class="org.apache.solr.update.processor.UniqFieldsUpdateProcessorFactory">
    <str name="fieldName">YourFieldA</str>
    <str name="fieldName">YourFieldB</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Here YourFieldA and YourFieldB are fields defined in your schema.xml. Note that you must also reference the chain from the appropriate requestHandler, e.g.:

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">uniq-fields</str>
  </lst>
</requestHandler>

This not only prevents duplicates from being added, but also removes existing duplicates from the specified fields whenever a document is updated.

Sublease answered 22/10, 2016 at 0:29 Comment(0)
R
5

I was struggling to accomplish the same thing. This worked for me. Add the processor chain below to your solrconfig.xml:

<updateRequestProcessorChain name="deduplicateMultiValued" default="true">
    <processor class="org.apache.solr.update.processor.UniqFieldsUpdateProcessorFactory">
        <lst name="fields">
            <str>multivaluedFieldXYZ</str>
        </lst>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
Rubinrubina answered 25/9, 2013 at 17:38 Comment(2)
With the current version of Solr, the inner lst/str lines collapse into a single line like this: <str name="fieldName">multivaluedFieldXYZ</str> - Ramage
Just to note: this does not work for copyFields, only for the values present when the document gets indexed. It removes all duplicate inserts at that point. - Daemon
J
1

You would need to handle it on the client side and remove the duplicate values before sending the document.
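A minimal sketch of that client-side cleanup in plain Java, assuming the field values are already available as a list of strings before the document is built. The class and method names are hypothetical and not part of Solr or SolrJ. It uses a LinkedHashSet so the first occurrence of each value keeps its position:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper, not part of Solr or SolrJ.
class MultiValueDedup {
    // Returns the values with exact duplicates removed, keeping first-seen order.
    static List<String> distinctValues(List<String> values) {
        return new ArrayList<>(new LinkedHashSet<>(values));
    }

    public static void main(String[] args) {
        List<String> tags = List.of("solr", "lucene", "solr", "search", "lucene");
        System.out.println(distinctValues(tags)); // prints [solr, lucene, search]
    }
}
```

The comparison is exact, so "Solr" and "solr" count as different values. Normalize the strings first if that is not what you want.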

Alternatively, you can write your own extension modeled on RemoveDuplicatesTokenFilterFactory (which only removes tokens with the same text at the same position) to filter out the duplicate tokens.

Also, if you use the multivalued field only for faceting, each value in the faceted field is counted just once per document. So even if you add the same value several times, it shows up as a single value in the facet count entry. I have tested this; you can confirm it too.

However, the duplicate values change the lengthNorm and can therefore affect scoring.

Jit answered 9/11, 2012 at 10:57 Comment(0)
M
1

I am using SolrJ to bind documents, and to avoid duplicate values I declared my multivalued field as a HashSet.

@Field("description")
public Collection<String> description = new HashSet<>();
Machicolation answered 28/6, 2018 at 10:8 Comment(0)
O
1

In recent versions of Solr you can use add-distinct when doing atomic updates to multivalued fields.

add-distinct: Adds the specified values to a multiValued field, only if not already present. May be specified as a single value, or as a list.

(ref: https://lucene.apache.org/solr/guide/8_8/updating-parts-of-documents.html)
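As a rough illustration of what such an atomic update looks like, the sketch below builds the document structure in plain Java. It assumes a unique key field named id and a multivalued field named tags, both hypothetical, and the helper class is not part of Solr. It only constructs the nested map and does not send anything to Solr:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper that only builds the data structure; it does not talk to Solr.
class AddDistinctUpdate {
    // Builds {"id": <id>, <field>: {"add-distinct": [values...]}}.
    static Map<String, Object> build(String id, String field, List<String> values) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", id);
        // The modifier name is the key, the values to add are the payload.
        doc.put(field, Map.of("add-distinct", values));
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(build("doc1", "tags", List.of("solr", "search")));
        // prints {id=doc1, tags={add-distinct=[solr, search]}}
    }
}
```

Serialized as JSON, this corresponds to {"id":"doc1","tags":{"add-distinct":["solr","search"]}}, which is the shape the linked reference guide page describes for atomic updates. With SolrJ, the inner map would typically be passed as the field value of a SolrInputDocument.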

Overdo answered 17/2, 2021 at 15:24 Comment(0)
H
0

Or you could handle it in Solr, but in an UpdateRequestProcessor, so that it happens before indexing and you don't need to learn about the analysis chain.

You can write it in Java, or in one of several scripting languages with the ScriptUpdateProcessor.

Heathendom answered 9/11, 2012 at 11:11 Comment(0)
B
0

This configuration works for Solr 5.3.1:

<updateRequestProcessorChain name="distinct-values" default="true">
    <processor class="solr.DistributedUpdateProcessorFactory"/>
    <processor class="solr.UniqFieldsUpdateProcessorFactory">
        <str name="fieldName">field1</str>
        <str name="fieldName">field2</str>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>  
Billion answered 11/12, 2015 at 11:11 Comment(0)
