Copy Lucene indexes between Jackrabbit repositories

I have two Jackrabbit instances containing the same content. Rebuilding the Lucene index is slow (30+ hours) and the downtime needed in the cluster is risky. Is it possible to instead re-index just one Jackrabbit instance and then copy the Lucene index from that instance to the other?

Naively copying the Lucene index files beneath the workspace directory doesn't work. The issue appears to be that the content is indexed by document number which maps to a UUID which maps to the JCR path for the indexed node, but these UUIDs are not stable for a given path between Jackrabbit instances. (Both are actually Day CQ publisher instances populated by replication from a CQ author instance.)

I've managed to find the UUID-to-path mapping in the repository under /jcr:system/jcr:versionStorage/ but I can't see an easy way to copy this between repositories along with the Lucene index. And then I can't find the UUID->document ID mapping anywhere in the files - is this part of the Lucene index too?
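
For anyone wanting to poke at the index directly, here is a rough sketch of dumping the per-document UUIDs with plain Lucene. The path is a placeholder, and note that Jackrabbit's SearchIndex is really a MultiIndex made up of sub-directories (_0, _1, ... listed in the "indexes" file), so you would repeat this per sub-index; the "_:UUID" field name comes from Jackrabbit's FieldNames and a Lucene 3.x-era API is assumed, so verify both against the version you actually run:

    import java.io.File;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.FSDirectory;

    // Dumps the Lucene document number -> node UUID mapping from one
    // persistent sub-index of a Jackrabbit workspace index.
    // Run this against a copy of the index while the instance is stopped.
    public class DumpIndexUuids {
        public static void main(String[] args) throws Exception {
            // Placeholder path: one sub-index of the default workspace index.
            File indexDir = new File("crx-quickstart/repository/workspaces/crx.default/index/_0");
            IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
            try {
                for (int i = 0; i < reader.maxDoc(); i++) {
                    if (reader.isDeleted(i)) {
                        continue; // skip deleted documents
                    }
                    Document doc = reader.document(i);
                    // "_:UUID" is Jackrabbit's FieldNames.UUID - the node
                    // identity that this document number maps to.
                    System.out.println(i + "\t" + doc.get("_:UUID"));
                }
            } finally {
                reader.close();
            }
        }
    }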

Thanks for any help. I'm leaning towards just re-indexing the second instance separately and accepting the downtime, but any ideas to reduce the risk or the elapsed time of reindexing the cluster would be appreciated!


In the end we're going the re-index-them-both route: we've managed to repurpose a test instance as an extra live instance that we can drop into the farm temporarily whilst we take the other two out in turn to re-index. However I'd still be interested in hearing better ways to do this!

Oxpecker answered 3/7, 2012 at 10:37 Comment(2)
Please take a look at this post - though maybe you've already seen it. #670682 – Practitioner
Thanks. No, I don't think any of those are relevant for me: it's the embedded search engine so I can't switch to Solr, and the other answers discuss copying the index files, which isn't enough for me. I need to either combine the node path data with the index, copy both, and rebuild the path -> UUID -> document number mapping on the target, or somehow transform the copied index on the source system so it uses the target system's document numbers. – Oxpecker

That seems like a scary idea, honestly. I'm not sure there is any way to guarantee that you've got the same underlying data, even with identical content and hardware configuration.

If your performance numbers look like ours, the time to copy the entire repository is less than the time it takes to reindex. Have you considered just reindexing one repository, doing a backup/copy, and then configuring the backup/copy to be your second instance?

Lumber answered 9/8, 2012 at 3:11 Comment(2)
Thanks - no, that hadn't occurred to me, that's a good idea. Yes, rsyncing two repositories is quicker than a re-index, but when we rsync live to a test machine we always end up with a few glitches. Our repository is too big and we don't have enough storage to try CQ's various hot-backup-and-restore options, so I think we'd have to take down the copy source server as well as the copy destination server to try this, and then we're back to only one machine in the live cluster whilst the copy is taking place. However, I'll run this past the team! – Oxpecker
If you look into how the CQ online backup works, it basically does a series of rsyncs. Each iteration has less to copy, and then it locks on the last one. I've had pretty good luck using repeated rsyncs in the same way to copy a running server. Obviously that works best if the server being copied isn't seeing a lot of writes. – Lumber
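
For what it's worth, a rough sketch of that repeated-rsync loop, driven from Java purely for illustration (the host name, repository paths and rsync flags are placeholder assumptions, not what CQ's backup actually runs):

    import java.io.IOException;

    // Illustrative only: rsync a running source repository to the target a few
    // times (each pass has less to transfer), then do one final pass after the
    // source has been stopped or quiesced so the copy - including the Lucene
    // index - ends up consistent.
    public class RepeatedRsyncCopy {

        // Placeholder locations; adjust to the real repository layout.
        private static final String SOURCE = "cq@source-host:/opt/cq/crx-quickstart/repository/";
        private static final String TARGET = "/opt/cq/crx-quickstart/repository/";

        private static void rsyncOnce() throws IOException, InterruptedException {
            // -a preserves ownership/timestamps, --delete removes files that are
            // gone at the source (e.g. merged-away Lucene segments).
            Process p = new ProcessBuilder("rsync", "-a", "--delete", SOURCE, TARGET)
                    .inheritIO()
                    .start();
            if (p.waitFor() != 0) {
                throw new IOException("rsync exited with a non-zero status");
            }
        }

        public static void main(String[] args) throws Exception {
            // Warm-up passes while the source instance is still serving traffic.
            for (int i = 0; i < 3; i++) {
                rsyncOnce();
            }
            // Stop (or block writes on) the source instance here, then take the
            // final pass before bringing the copy up as the second cluster node.
            rsyncOnce();
        }
    }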
