Solr query - Is there a way to limit the size of a text field in the response
T

5

5

Is there a way to limit the amount of text in a text field from a query? Here's a quick scenario....

I have 2 fields:

  • docId - int
  • text - string.

I will query the docId field and want to get a "preview" text from the text field of 200 chars. On average, the text field has anything from 600-2000 chars but I only need a preview.

eg. [mySolrCore]/select?q=docId:123&fl=text

Is there any way to do it since I don't see the point of bringing back the entire text field if I only need a small preview?

I'm not looking at hit highlighting, since I'm not searching for specific text within the text field, but if there is something similar to the hl.fragsize parameter, that would be great!

Hope someone can point me in the right direction!

Cheers!

Tarrant answered 25/1, 2011 at 11:16 Comment(1)
#3453165Travancore
B
6

You would have to test the performance of this work-around versus just returning the entire field, but it might work for your situation. Basically, turn on highlighting on a field that won't match, and then use the alternate field to return the limited number of characters you want.

http://solr:8080/solr/select/?q=*:*&rows=10&fl=author,title&hl=true&hl.snippets=0&hl.fl=sku&hl.fragsize=0&hl.alternateField=description&hl.maxAlternateFieldLength=50

Notes:

  • Make sure your alternate field does not exist in the field list (fl) parameter
  • Make sure your highlighting field (hl.fl) does not actually contain the text you want to search

I find that the CPU cost of running the highlighter is sometimes higher than the CPU and bandwidth cost of just returning the whole field. You'll have to experiment.
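
As a rough illustration of the work-around above, here is a minimal Python sketch that assembles the same kind of request URL. The helper name `build_preview_url` and its arguments are hypothetical; only the Solr parameters themselves come from the example URL in this answer. It also enforces the first note by refusing a field list that contains the alternate field.

```python
from urllib.parse import urlencode


def build_preview_url(base_url, query, fields, dummy_field, preview_field, max_chars):
    """Build a select URL that asks the highlighter for a truncated alternate field.

    Hypothetical helper: the parameter names mirror the example URL above.
    """
    if preview_field in fields:
        # Per the notes above, the alternate field must not be listed in fl.
        raise ValueError("preview_field must not appear in the field list (fl)")
    params = {
        "q": query,
        "fl": ",".join(fields),
        "hl": "true",
        "hl.snippets": 0,
        "hl.fl": dummy_field,  # a field that will not match the query
        "hl.fragsize": 0,
        "hl.alternateField": preview_field,
        "hl.maxAlternateFieldLength": max_chars,
    }
    return base_url + "?" + urlencode(params)


url = build_preview_url(
    "http://solr:8080/solr/select/", "docId:123", ["docId"], "sku", "text", 200
)
print(url)
```

With this approach the preview should come back in the highlighting section of the response rather than in the document itself, so the client has to read it from there.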

Blakeley answered 28/1, 2011 at 19:41 Comment(2)
How the hell did you come up with this! Tested it out and it works! :) I'll be testing this out more, but since we will be getting a beast of a server to run this on (>24 GB RAM, 8 cores minimum), I'm not going to be too worried about the extra CPU cost just yet. As I mentioned in my previous comments, a simpler way to specify the returned field length, without going through the highlighter, would be handy. Thanks to all for the suggestions, much appreciated!Tarrant
If someone wants to allow to query the text field (not the OP's question, but could be useful to others), among other fields, but still wants to return a few characters of the text field's content when there is no match in this field's highlight, you can use a per-field override on alternateField, e.g. f.myTextFieldName.hl.alternateField=myTextFieldName&f.myTextFieldName.hl.maxAlternateFieldLength=200Tass
N
3

I decided to turn my comment into an answer.

I would suggest that you don't store your text data in Solr/Lucene. Only index the data for searching and store a unique ID or URL to identify the document. The contents of the document should be fetched from a separate storage system.

Solr/Lucene are optimized for searches. They aren't your data warehouse or database, and they shouldn't be used that way. When you store more data in Solr than necessary, you negatively impact your entire search system. You bloat the size of indices, increase replication time between masters and slaves, replicate data that you only need a single copy of, and waste cache memory on document caches that should be leveraged to make search faster.

So, I would suggest 2 things.

First, optimally, remove the text storage entirely from your search index. Fetch the preview text and whole text from a secondary system that is optimized for holding documents, like a file server.

Second, sub-optimal, only store the preview text in your search index. Store the entire document elsewhere, like a file server.

Neediness answered 25/1, 2011 at 19:36 Comment(5)
I don't entirely agree with this answer - it depends! How much data are you storing/indexing with SOLR? For modest document collections you can easily get away with storing text for both searching and retrieval. We are serving 28 million records (20+ fields for each rec) from a single tomcat running in 8GB of memory with no problems. Just keep an eye on memory usage and decide when you need to cut down your SOLR fields or shard, etc.Polar
@Polar - Everything "depends" :) - However, the OP turned down the idea of storing a 200-character text preview snippet in his index because he's already stored the full text AND he's got an index that's "a couple of terabytes" in size. In that case, I believe that storing the entire text in the search index is impeding his flexibility. I'd be willing to bet it's also impacting him in other areas. Using Solr in a master/slave configuration with any sort of redundancy or failover, a couple of terabytes rapidly turns into tens of terabytes, all redundant copies.Neediness
Yep, agreed then - didn't see the "a couple of terabytes" comment as it was not in the original question. BTW I tried (unsuccessfully) to find an article I read (ages ago) that talked about a large site that exclusively used SOLR for data access and searching - just to play devil's advocate :-)Polar
@Polar @Neediness thanks guys for the comments! The problem I am having is just one scenario in the greater scheme of things. The text field is searched on in other scenarios, just not in this case. This is why I am reluctant to add another "preview field", since the data is already there. Opening each file and returning the first 300 chars is what our old system does, but it can be so much quicker with Solr alone, especially since the data is already there and indexed. If only Solr had an option to limit the length of a returned string field, similar to the hl functionality!Tarrant
@Tarrant - Just had a different idea, though I will say it's very ugly (as in, don't try this at home). What if you stored the document as 2 fields: the preview and the remainder? Your storage size wouldn't go up, you'd be indexing the same data, and you could always fetch the preview.Neediness
N
0

You can add an additional field, such as excerpt/summary, that contains the first 200 chars of the text, and return that field instead.
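
A minimal sketch of how such a field could be filled on the client side before the document is sent for indexing. The function names are hypothetical and the `excerpt` field would still need to be declared in the schema; cutting at a word boundary is an extra nicety, not something the answer requires.

```python
def make_preview(text, max_chars=200):
    """Return at most max_chars characters, cut at a word boundary when possible."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # Back up to the last space so the preview does not end mid-word.
    head, _, _ = cut.rpartition(" ")
    # Fall back to a hard cut if there is no usable space in the first max_chars.
    return head if head else cut


def with_excerpt(doc, source_field="text", excerpt_field="excerpt", max_chars=200):
    """Return a copy of doc with an extra preview field derived from source_field."""
    enriched = dict(doc)
    enriched[excerpt_field] = make_preview(doc.get(source_field, ""), max_chars)
    return enriched


doc = with_excerpt({"docId": 123, "text": "word " * 100})
print(len(doc["excerpt"]))
```

The query then only needs fl=excerpt, at the price of storing the extra field.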

Nap answered 25/1, 2011 at 11:22 Comment(3)
Thanks... I did think of that, but our index size is already a couple of terabytes, so this will only add to the size, which isn't an option I'm afraid...Tarrant
If you are already in terabytes, adding another few gigabytes does not hurt.Nap
More on that. In my experience a bloated index is often that way because fields are stored in the index that aren't necessary. Lucene is your search index, not your data warehouse. If you don't store stuff in Lucene that isn't absolutely necessary, you will reduce the size of your index considerably. You should index the fields that need to be searched, then store an ID or URL for fetching the original documents from another storage medium. Otherwise, you are replicating your data multiple times and you end up with responses like "We can't do that because we already have too much data"Neediness
G
0

My wish, which I suspect is shared by many sites, is to offer a snippet of text with each query response. That upgrades what the user sees from mere titles or equivalent. This is a normal (see Google as an example) and productive technique. At present we cannot easily cope with sending the entire content body from Solr/Lucene into a web presentation program and creating the snippet there, together with many others in a set of responses, as that is a significant network, CPU, and memory hog (think of dealing with many multi-MB files).

The sensible thing is for Solr/Lucene to have a control for sending only the first N bytes of content upon request, thereby saving a lot of trouble in the field. Kludges with highlights and so forth are just that, and interfere with proper usage. Keep in mind that the mechanisms feeding material into Solr/Lucene may not be parsing the files, so those feeders can't create the snippets.

Grummet answered 28/1, 2017 at 15:38 Comment(0)
B
-2

Linkedin real time search http://snaprojects.jira.com/browse/ZOIE

For storing big data http://project-voldemort.com/

Brad answered 25/1, 2011 at 19:18 Comment(0)
