SOLR - Best approach to import 20 million documents from csv file

My current task at hand is to figure out the best approach to load millions of documents into Solr. The data file is an export from the DB in CSV format.

Currently, I am thinking about splitting the file into smaller files and having a script that posts these smaller ones using curl.

I have noticed that if you post a large amount of data, most of the time the request times out.

I am looking into the Data Import Handler, and it seems like a good option.

Any other ideas would be highly appreciated.

Thanks

Interfluent answered 25/2, 2012 at 23:47 Comment(0)

Unless a database is already part of your solution, I wouldn't add that additional complexity. Quoting the Solr FAQ: it's your servlet container that is issuing the session timeout.

As I see it, you have a couple of options (in my order of preference):

Increase container timeout

Increase the container timeout. ("maxIdleTime" parameter, if you're using the embedded Jetty instance).

I'm assuming you only occasionally index such large files? Increasing the timeout temporarily might just be the simplest option.

Split the file

Here's a simple Unix script that will do the job (splitting the file into 500,000-line chunks):

# Split the export into 500,000-line chunks named split_files.00, split_files.01, ...
split -d -l 500000 data.csv split_files.

# POST each chunk to the CSV update handler and commit after each one
for file in split_files.*
do
    curl 'http://localhost:8983/solr/update/csv?fieldnames=id,name,category&commit=true' \
         -H 'Content-type:text/plain; charset=utf-8' --data-binary @"$file"
done

Parse the file and load in chunks

The following Groovy script uses opencsv and SolrJ to parse the CSV file and commit changes to Solr every 500,000 lines.

import au.com.bytecode.opencsv.CSVReader

import org.apache.solr.client.solrj.SolrServer
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer
import org.apache.solr.common.SolrInputDocument

@Grapes([
    @Grab(group='net.sf.opencsv', module='opencsv', version='2.3'),
    @Grab(group='org.apache.solr', module='solr-solrj', version='3.5.0'),
    @Grab(group='ch.qos.logback', module='logback-classic', version='1.0.0'),
])

SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/");

new File("data.csv").withReader { reader ->
    CSVReader csv = new CSVReader(reader)
    String[] result
    Integer count = 1
    Integer chunkSize = 500000

    // readNext() returns one parsed CSV row, or null at end of file
    while ((result = csv.readNext()) != null) {
        SolrInputDocument doc = new SolrInputDocument();

        doc.addField("id",         result[0])
        doc.addField("name_s",     result[1])
        doc.addField("category_s", result[2])

        server.add(doc)

        // Commit every chunkSize documents to keep the amount of uncommitted work bounded
        if (count.mod(chunkSize) == 0) {
            server.commit()
        }
        count++
    }
    // Final commit for any remaining documents
    server.commit()
}
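
A quick usage note, assuming the script above is saved as load_csv.groovy (a hypothetical filename): the first run will have Grape download the @Grab dependencies, so it needs network access.

groovy load_csv.groovy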
Media answered 26/2, 2012 at 11:18 Comment(2)
Mark, thanks for the detailed answer. This is in line with what I was trying to figure out. I will have a huge load for the first-time import and update deltas every hour or so. I think I know the options for the first load. However, I am not sure what would be the best manageable process for updates. I am looking for a configurable, robust update process. Any info around that would be highly appreciated. Thanks.Interfluent
I'd suggest an "input" directory for new files, moving the files into a "processed" directory after indexing. Simple and easy to maintain. However, you said the CSV files are coming from a database? In that case, perhaps you would be better off using the DIH (Data Import Handler), if you have access to the DB.Extravasation
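
A minimal sketch of that directory-based workflow, reusing the CSV update endpoint from this answer; the input/ and processed/ directory names are hypothetical:

# Index each new export from input/, then archive it to processed/
for file in input/*.csv
do
    curl 'http://localhost:8983/solr/update/csv?fieldnames=id,name,category&commit=true' \
         -H 'Content-type:text/plain; charset=utf-8' --data-binary @"$file" \
      && mv "$file" processed/
done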

In Solr 4.0 (currently in beta), CSVs from a local directory can be imported directly using the UpdateHandler. Modifying the example from the Solr wiki:

curl 'http://localhost:8983/solr/update?stream.file=exampledocs/books.csv&stream.contentType=text/csv;charset=utf-8'

This streams the file from the local location, so there is no need to chunk it up and POST it via HTTP.
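
Note that the request above does not trigger a commit, so the streamed documents will not be searchable until one happens (via autoCommit or explicitly). A hedged variant of the same call with an explicit commit appended:

curl 'http://localhost:8983/solr/update?stream.file=exampledocs/books.csv&stream.contentType=text/csv;charset=utf-8&commit=true'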

Halfwitted answered 11/10, 2012 at 14:26 Comment(3)
How is the performance of UpdateHandler vs DataImportHandler? Input is CSV in both cases.Ticket
I didn't run any benchmarks, but the performance of DataImportHandler is pretty good. We were doing about 20 million records, resulting in an index of approximately 7 GB (after optimization), in about an hour. We also had to run some transformations on the data before adding it to the index, so we couldn't really do it using the UpdateHandler.Halfwitted
Thanks! I am doing this too and it works like a charm: about 50M records (size 8 GB) in ~40 mins.Ticket

The answers above have explained single-machine ingestion strategies really well.

Here are a few more options if you have big data infrastructure in place and want to implement a distributed data ingestion pipeline.

  1. Use Sqoop to bring the data into Hadoop, or place your CSV file in Hadoop manually (see the sketch after this list).
  2. Use one of these connectors to ingest the data: the hive-solr connector or the spark-solr connector.
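
For step 1, a hedged sketch of a Sqoop import; the JDBC URL, credentials, table name, and HDFS path are all hypothetical. Sqoop writes the table out as delimited text files in HDFS, which the connectors can then pick up:

sqoop import \
  --connect jdbc:postgresql://dbhost:5432/products_db \
  --username loader -P \
  --table products \
  --target-dir /data/solr_staging/products \
  --as-textfile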

PS:

  • Make sure no firewall blocks connectivity between the client nodes and the Solr/SolrCloud nodes.
  • Choose the right directory factory for data ingestion; if near-real-time search is not required, use StandardDirectoryFactory.
  • If you get the exception below in the client logs during ingestion, tune the autoCommit and autoSoftCommit configuration in solrconfig.xml:

SolrServerException: No live SolrServers available to handle this request
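
This answer points at solrconfig.xml; as an alternative sketch, on Solr 5+ the same settings can also be changed through the Config API without editing the file by hand (the collection name and values here are purely illustrative):

curl http://localhost:8983/solr/mycollection/config -H 'Content-type:application/json' -d '{
  "set-property": {
    "updateHandler.autoCommit.maxTime": 60000,
    "updateHandler.autoSoftCommit.maxTime": -1
  }
}'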

Period answered 13/9, 2016 at 17:4 Comment(0)

Definitely just load these into a normal database first. There are all sorts of tools for dealing with CSVs (for example, Postgres' COPY), so it should be easy. Using the Data Import Handler is also pretty simple, so this seems like the most friction-free way to load your data. This method will also be faster, since you won't have unnecessary network/HTTP overhead.
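
For the "load into a database first" route, a minimal sketch of bulk-loading the CSV with psql's \copy; the database, table, and column names are hypothetical, mirroring the id/name/category fields used elsewhere in this thread. The Data Import Handler can then read straight from that table with a simple SQL query:

psql -d staging -c "\copy products(id, name, category) FROM 'data.csv' WITH (FORMAT csv)"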

Photina answered 25/2, 2012 at 23:50 Comment(6)
The files are being pulled out of a DB. Not sure what the point would be of loading them back?Interfluent
Ok, then just use Data Import Handler to read them out directly. Is there something wrong with that option?Photina
+1 I did some quick performance tests and DIH outperforms bulk commits by an order of magnitude. But that's just a first impression.Pharmaceutical
DIH did not handle pulling huge amounts of data when the DB was in a different network location from where Solr was installed. Socket connection timeouts ensued. I had to pull the data from the DB into CSV files in smaller chunks using opencsv, then used FileListEntityProcessor and LineEntityProcessor to index the data into Solr.Sew
@Sew That sounds like more of an issue with your network infrastructure/DB config than with DIH or Solr.Photina
@Photina Have you pulled 20 million records from a table with DIH? I keep reading there are tons of other issues with DIH too.Sew

The reference guide says ConcurrentUpdateSolrServer could/should be used for bulk updates.

The Javadocs are somewhat incorrect (v3.6.2, v4.7.0):

ConcurrentUpdateSolrServer buffers all added documents and writes them into open HTTP connections.

It doesn't buffer indefinitely, but up to int queueSize, which is a constructor parameter.

Spermiogenesis answered 31/3, 2014 at 18:58 Comment(0)
