Handling a large number of IDs in Solr

I need to perform an online-user search in Solr, i.e. a user needs to find the list of users who are currently online and match particular criteria.

How I am handling this: we store the IDs of online users in a table, and I send all the online user IDs in the Solr request, like

&fq=-id:(id1 id2 id3 ............id5000)

The problem with this approach is that as the number of IDs grows, Solr takes too long to resolve the query, and we have to transfer a very large request over the network.
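To make the cost concrete, here is a rough sketch (the ID values are invented) of how large such a filter clause becomes:

```python
# Build the kind of exclusion filter described above, with made-up IDs.
ids = [f"id{i}" for i in range(1, 5001)]  # hypothetical online-user IDs
fq = "-id:(" + " ".join(ids) + ")"

# The filter alone is tens of kilobytes before URL encoding, which is
# what makes the request slow to transfer and expensive to parse.
print(len(fq))
```

Every request pays this transfer and parsing cost again, which is why the answers below try to move the online-user data into Solr instead.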

One solution could be to use a join in Solr, but the online data changes regularly and I can't re-index every time (say every 5-10 minutes; it should be at least an hour between indexing runs).

The other solution I can think of is firing this query internally from Solr, based on some parameter in the URL. I don't know much about Solr internals, so I don't know how to proceed.

Donaghue answered 1/5, 2013 at 9:1 Comment(1)
This is a problem for the majority of Solr users, and I guess they have done nothing about it in Solr 4.0. Here you need an expert in Java or Solr internals.Coumarone

With Solr 4's soft commits, committing has become cheap enough that it might be feasible to store the "online" flag directly in the user record and simply add &fq=online:true to your query. That removes the overhead of sending 5000 IDs over the wire and parsing them, and lets Solr optimize the query a bit. Whenever someone logs in or out, set their status and set commitWithin on the update. It's worth a shot, anyway.
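As a sketch of this suggestion, the per-login update could be an atomic update combined with commitWithin. The core name (`users`), the field name (`online`) and the helper below are assumptions for illustration, not from the answer:

```python
import json

# Hypothetical helper (names are assumptions): build the atomic-update
# payload that flips a single user's online flag.
def online_status_update(user_id, online):
    # {"set": ...} is Solr's atomic-update syntax for replacing one field
    return json.dumps([{"id": user_id, "online": {"set": online}}])

# POSTed to e.g. /solr/users/update?commitWithin=1000, this touches only
# one document and asks Solr to soft-commit within a second.
payload = online_status_update("user42", True)
```

Note that atomic updates require the other fields of the document to be stored, as one of the comments below points out.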

Dissolution answered 5/5, 2013 at 17:49 Comment(4)
I would also try this first, as it is easier than implementing a PostFilter and keeping some sort of memcache up to date with the users currently online. More details about near-real-time search can be found in Solr's wiki: wiki.apache.org/solr/NearRealtimeSearch. But if this does not work out, I would go the way lexk and Asaf have described.Heraldic
I don't think this will be an ideal solution; as I already mentioned, indexing is not possible and generally takes 15-30 minutes.Donaghue
You should not re-create the whole index with this approach; you can also update single entities. In your case, if a user logs in, only his user record (his single record) gets updated. To do so, you can send e.g. JSON or XML update requests to your Solr server. For reference, have a look at yonik.com/solr/atomic-updates, wiki.apache.org/solr/UpdateJSON or solr.pl/en/2012/07/09/solr-4-0-partial-documents-updateHeraldic
See also wiki.apache.org/solr/Atomic_Updates ... if all your fields are set to "stored" you can update fields of documents individually in Solr4.Dissolution

We worked around this issue by implementing Sharding of the data.

Basically, without going heavily into code detail:

  • Write your own indexing code
    • use consistent hashing to decide which ID goes to which Solr server
    • index each user's data to the relevant shard (this can span several machines)
    • make sure you have redundancy
  • Query Solr shards
    • Do sharded queries in Solr using the shards parameter
    • Start an EmbeddedSolr and use it to do a sharded query
    • Solr will query all the shards and merge the results; it also provides timeouts if you need to limit the query time for each shard
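The consistent-hashing step above can be sketched as follows; the shard names, replica count and `HashRing` helper are invented for illustration:

```python
import hashlib
from bisect import bisect

class HashRing:
    """Minimal consistent-hash ring: each shard gets many virtual points
    on the ring, and a user ID maps to the next point clockwise."""

    def __init__(self, shards, replicas=100):
        self.ring = sorted(
            (int(hashlib.md5(f"{s}-{r}".encode()).hexdigest(), 16), s)
            for s in shards for r in range(replicas)
        )
        self.keys = [k for k, _ in self.ring]

    def shard_for(self, user_id):
        h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
        i = bisect(self.keys, h) % len(self.ring)
        return self.ring[i][1]

# Hypothetical shard names; in practice these map to Solr servers
ring = HashRing(["shard1", "shard2", "shard3"])
```

Consistent hashing means that adding or removing one shard only remaps a fraction of the IDs, which matters when you manage the shard layout yourself as described above.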

Even with all of what I said above, I do not believe Solr is a good fit for this. Solr is not well suited to searches on indexes that are constantly changing, and if you mainly search by IDs, then a search engine is not needed.

For our project we basically implemented all the index building, load balancing and query engine ourselves, and use Solr mostly as storage. But we started using Solr back when its sharding support was flaky and not performant; I am not sure what the state of it is today.

One last note: if I were building this system today from scratch, without all the work we did over the past 4 years, I would advise using a cache (say memcached or Redis) to store all the users that are currently online, and at request time I would simply iterate over all of them and filter according to the criteria. The filtering by criteria can be cached independently and updated incrementally, and iterating over 5000 records is not necessarily very time consuming if the matching logic is simple.
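A minimal sketch of this cache-based approach, using a plain dict to stand in for memcached/Redis; the criteria fields are invented:

```python
# Stand-in for a cache of currently-online users; in production this
# would live in memcached or Redis and be updated on login/logout.
online_users = {
    "u1": {"country": "DE", "age": 30},
    "u2": {"country": "FR", "age": 25},
    "u3": {"country": "DE", "age": 41},
}

def find_online(criteria):
    # Iterating a few thousand records with a simple predicate is cheap
    return [uid for uid, attrs in online_users.items()
            if all(attrs.get(k) == v for k, v in criteria.items())]
```

This trades Solr's query language for a simple scan, which is a reasonable trade when the online set is small and the matching logic is plain equality checks.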

Serow answered 6/5, 2013 at 9:14 Comment(0)

Any robust solution will involve bringing your data close to Solr (in batches) and using it internally, NOT sending a very large request at search time, which must remain low-latency. You should develop your own filter; the filter would cache the online-users data once in a while (say, every minute). If the data changes VERY frequently, consider implementing a PostFilter.

You can find a good example of filter implementation here: http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/
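The filter itself would be Java code running inside Solr (see the linked example); the sketch below only illustrates the "cache the online users once in a while" idea, with an invented `OnlineUserCache` helper and a stand-in fetch function:

```python
import time

class OnlineUserCache:
    """Refresh the set of online user IDs at most once per TTL window.
    `fetch` stands in for the database query the filter would run."""

    def __init__(self, fetch, ttl_seconds=60):
        self.fetch = fetch          # callable returning an iterable of IDs
        self.ttl = ttl_seconds
        self.loaded_at = None
        self.ids = set()

    def get(self):
        now = time.monotonic()
        if self.loaded_at is None or now - self.loaded_at > self.ttl:
            self.ids = set(self.fetch())   # refresh at most once per TTL
            self.loaded_at = now
        return self.ids
```

Inside a real Solr filter the cached set would be consulted per matching document, so keeping it as a set (O(1) membership checks) is the important design choice.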

Postfree answered 6/5, 2013 at 5:56 Comment(3)
How do I create my own filter? That is the solution I am looking for, but I don't know how to create one.Donaghue
Also, how can I connect to MySQL from that filter? I am a PHP developer and have no idea how to do it using PHP.Donaghue
I added a link to a filter implementation example.Postfree

"one solution can be use of join in solr but online data change regularly and i cant index data everytime(say 5-10 min, it should be at-least an hr)"

I think you could very well use Solr joins, with a little bit of improvisation.

The solution I propose is as follows:

You can have two indexes (Solr cores):

 1. Primary index (the one you have now)
 2. Secondary index with only two fields, "ID" and "IS_ONLINE"

You could now update the secondary index frequently (on the order of seconds) and keep it in sync with the table you use for storing online users.

NOTE: This secondary index, even if updated frequently, would not degrade performance, provided we make the necessary tweaks such as using appropriate queries during delta-import.

You could now perform a Solr join on the ID field across these two indexes to achieve what you want. Here is the link on how to perform Solr joins between indexes/Solr cores.
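A sketch of what such a cross-core join query could look like; the core name `online_status`, the field names and the sample criteria are assumptions, not from the answer:

```python
from urllib.parse import urlencode

# Build the query parameters for a cross-core join. The filter keeps
# only primary-index docs whose id matches the ID of a secondary-index
# doc with IS_ONLINE:true. Names here are illustrative assumptions.
params = {
    "q": "country:DE",  # the user's actual search criteria
    "fq": "{!join from=ID to=id fromIndex=online_status}IS_ONLINE:true",
}
query_string = urlencode(params)
```

With this shape, only the small two-field core needs frequent updates, while the large primary index keeps its slower indexing schedule.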

Yesman answered 7/5, 2013 at 5:57 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.