How do you configure Lucene in Sitecore to only index the latest version of an item on the master db?

6

6

I recognise this is a moot point on the web database, so this question applies to the master db...

I have a custom index set up in Sitecore 6.4.1 as follows:

<index id="search_content_US" type="Sitecore.Search.Index, Sitecore.Kernel">
    <param desc="name">$(id)</param>
    <param desc="folder">_search_content_US</param>
    <Analyzer ref="search/analyzer" />
    <locations hint="list:AddCrawler">
        <search_content_home type="Sitecore.Search.Crawlers.DatabaseCrawler, Sitecore.Kernel">
            <Database>master</Database>
            <Root>/sitecore/content/usa home</Root>
            <Tags>home content</Tags>
        </search_content_home>
    </locations>
</index>

I query the index like this (I am using techphoria414's SortableIndexSearchContext from this answer: How to sort/filter using the new Sitecore.Search API):

private SearchHits GetSearchResults(SortableIndexSearchContext searchContext, string searchTerm)
{
    CombinedQuery query = new CombinedQuery();
    query.Add(new FullTextQuery(searchTerm), QueryOccurance.Must);
    return searchContext.Search(query, Sort.RELEVANCE);
}

...

SearchHits hits = GetSearchResults(searchContext, searchTerm);

hits is a collection of search hits from my index. When I iterate through hits I can see that there are many duplicates of the same items in Sitecore, 1 per version of the item.

I then do the following to get a SearchResultCollection:

SearchResultCollection results = hits.FetchResults(0, hits.Length);

This combines all of the duplicates into a single SearchResult object. This object represents 1 version of a particular item, and has a property called SubResults which is a collection of SearchResults that represent all of the other item versions.

Here's my problem:

The version of the item represented by the SearchResult is NOT the current published version of the item! It appears to be a randomly selected version (whichever the search method hit first in the index). The latest version is included in the SubResults collection, however.

E.g.:

SearchResult
 |
 |- Version 8 // main result
 ...
 |- SubResults
      |
      |- Version 9 // latest version
      |- Version 3
      |- Version 5
      ... // all versions in random order

How do I prevent this from happening on the master db? Either by preventing Lucene from indexing old versions of items, or by doing some manipulation of the result set to get the latest version from the SubResults?

As an aside, why does Lucene bother to index old versions of items anyway? Surely this is pointless for searching content on your website as the old versions are not visible?

Schuss answered 4/12, 2012 at 11:28 Comment(0)
10

You can implement a custom crawler that overrides the following:

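using Sitecore.Data.Items;
using Sitecore.Search;
using Sitecore.Search.Crawlers;

public class IndexCrawler : DatabaseCrawler
{
    protected override void IndexVersion(Item item, Item latestVersion, IndexUpdateContext context)
    {
        // Skip every version except the latest one so that older versions never reach the index.
        if (item.Versions.Count > 0 && item.Version.Number != latestVersion.Version.Number)
            return;

        base.IndexVersion(item, latestVersion, context);
    }
}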

This ensures that only the latest version of an item gets into your index, and is therefore the only version that can be pulled out of it.

You would of course need to update your configuration file so that the crawler in your index definition uses this type instead of the default DatabaseCrawler.
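
As a rough sketch, the crawler entry from the question might then look like the fragment below. The namespace and assembly names (MyProject.Search, MyProject) are placeholders and not from the original post; substitute wherever your IndexCrawler class actually lives.

```xml
<!-- Hypothetical type and assembly names; only the type attribute changes -->
<search_content_home type="MyProject.Search.IndexCrawler, MyProject">
    <Database>master</Database>
    <Root>/sitecore/content/usa home</Root>
    <Tags>home content</Tags>
</search_content_home>
```

The index would most likely need to be rebuilt afterwards so that previously indexed old versions are removed.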

Lambrecht answered 6/2, 2013 at 17:11 Comment(2)
Thanks @Andrew - this looks like just the thing. I haven't had chance to implement it yet, but when I do I'll mark this as the answer if it works out. - Schuss
Fancy seeing you here, Mr. Burgess. This answer just helped me out :) - Austronesia
8

In Sitecore 7 a field named _latestversion was added to the index. It contains '1' for the latest version of an item; for all other versions the value is empty.

Bolshevik answered 19/8, 2013 at 13:6 Comment(1)
This is great news! Shame we are not on Sitecore 7 yet, but it's very useful to know. Thank you. - Schuss
7

If you let Lucene search in your Web database instead of the Master, it should only index the last published version.

<Database>web</Database>
Drubbing answered 4/12, 2012 at 11:51 Comment(3)
Thanks, @Martijn, that's good news for our live site I guess, but do you know of a way to get this to work on our master db? We need the functionality locally for testing purposes and for our content editors. We have a CM and a CD server. - Schuss
Hi @theyetiman, right now I don't have an out-of-the-box solution, I should dig into it. I will see if I have time for it... - Drubbing
Just wanted to add that the web database only ever has the latest version of an item (as long as you aren't directly working in that database). Publishing won't version items, just apply the latest publishable version. - Lambrecht
2

The solution provided by theyetiman, which uses an adjusted sort, is an interesting approach, but it breaks down when the Lucene scores of the versions differ, because the _id field only acts as a tie-breaker between hits with equal scores. E.g. given v1 with score 0.7 and v2 with score 0.5, his solution will still return the first version of the item. (At least in my tests.)

After some more digging, the most obvious solution apparently lies in implementing your own Sitecore.Pipelines.Search.SearchSystemIndex and using that one instead of the default. If you decompile that code using ILSpy or similar, you will notice the following at the bottom of the Process method:

foreach (SearchResult current in searchHits.FetchResults(0, searchHits.Length))
{
    // ...
}

Each such SearchResult is actually a group, in which the first result returned from Lucene (thus the one with the highest score) is the main result. Hits on other versions (and other languages) of the same item are accessible through the Subresults property of each instance, which is null when there are none.

Depending on your requirements, you can adjust this part of the class to fit your needs.
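
One possible adjustment is to regroup the hits yourself and keep the highest version number per item and language, whatever its score. The sketch below only illustrates that selection logic: VersionHit and LatestVersionFilter are hypothetical stand-ins, not Sitecore types, and mapping real SearchResult objects onto them is left out.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one indexed item version; not a Sitecore type.
public sealed class VersionHit
{
    public string ItemId { get; set; }
    public string Language { get; set; }
    public int Version { get; set; }
    public float Score { get; set; }
}

public static class LatestVersionFilter
{
    // Keeps one hit per item and language: the one with the highest
    // version number, regardless of which version scored highest.
    public static List<VersionHit> KeepLatest(IEnumerable<VersionHit> hits)
    {
        return hits
            .GroupBy(h => new { h.ItemId, h.Language })
            .Select(g => g.OrderByDescending(h => h.Version).First())
            .ToList();
    }
}
```

With the scores from the example above (v1 at 0.7, v2 at 0.5), KeepLatest returns v2 even though v1 ranks higher in Lucene.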

Semeiology answered 4/1, 2013 at 7:46 Comment(1)
Thanks Rian. This looks pretty interesting, but I think @Andrew Burgess' answer is probably easier to implement. - Schuss
0

Whilst I haven't figured out the exact answer (to stop Lucene indexing old versions on the master db) I have come up with an acceptable work-around...

When Lucene returns its results from the index, each hit has a field called "_id" which is formatted something like this (3 versions of the same item, where the last number is the version):

"CCB75380-4E9A-4921-99EC-65E532E330FF%en%1"
"CCB75380-4E9A-4921-99EC-65E532E330FF%en%2"
"CCB75380-4E9A-4921-99EC-65E532E330FF%en%3"
...

I'm currently sorting by Sort.RELEVANCE, which is the default. This would be fine if we had only one version of each item in the index, but with several almost identical versions, they all have the same relevance score and Lucene just churns them out in any order. Sitecore then takes the first instance of the item version (even if it's old).

The solution is to specify a secondary sort field. In the searchContext.Search() method, you can pass a custom Sort object.

searchContext.Search(query, new Sort(...));

By sorting by Lucene's built-in Sort.RELEVANCE first, and then by the _id field (descending) in the index, I can ensure that, among hits with equal scores, the first hit that Sitecore sees will be the latest version and not just a random one:

searchContext.Search(query, new Sort(
    new SortField[2]
    {
        SortField.FIELD_SCORE, // equivalent to Sort.RELEVANCE
        new SortField("_id", SortField.STRING, true) // sort by _id, descending
    }
));

The SortField parameters are as follows:

SortField(string fieldName, int type, bool reverse)

This approach has fixed my problem, but if anyone can actually find out how to only index the latest version, please answer!
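
One caveat with sorting _id as a string: the comparison is lexical, so once an item has ten or more versions, an id ending in %9 sorts above one ending in %10 in descending order. The sketch below shows the difference in plain C#, assuming the GUID%language%version format from the examples above. VersionIdDemo and its methods are hypothetical helper names, not part of Sitecore or Lucene.

```csharp
using System;
using System.Linq;

public static class VersionIdDemo
{
    // Extracts the trailing version number from an "_id" value such as
    // "CCB75380-4E9A-4921-99EC-65E532E330FF%en%3" (format assumed from the examples above).
    public static int ParseVersion(string id)
    {
        int lastSeparator = id.LastIndexOf('%');
        return int.Parse(id.Substring(lastSeparator + 1));
    }

    // Returns the id whose version number is numerically highest.
    public static string LatestByNumber(string[] ids)
    {
        return ids.OrderByDescending(id => ParseVersion(id)).First();
    }

    // Returns the id that a descending ordinal string sort puts first.
    public static string LatestByString(string[] ids)
    {
        return ids.OrderByDescending(id => id, StringComparer.Ordinal).First();
    }
}
```

For ids ending in %9, %10 and %2, LatestByString picks the %9 entry while LatestByNumber picks %10, so a numeric comparison over the hits may be safer for heavily versioned items.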

Schuss answered 5/12, 2012 at 15:3 Comment(0)
0

I ended up working out an alternative solution based on the above answers.

Architecturally speaking, I think the ideal solution to this problem is to filter out the older versions from the results using custom code at a higher level, rather than removing them from the master database index altogether. You don't want to change the way Sitecore is designed to work just to solve the problem at hand.

Use the predicate below to filter out the older versions and retrieve only the latest version. Note that And returns a new predicate, so the result must be assigned back:

predicate = predicate.And(item => item[Sitecore.ContentSearch.BuiltinFields.LatestVersion].Equals("1"));

Hope this helps someone!

Brigittebriley answered 23/8, 2019 at 18:58 Comment(0)