TL;DR: LOAD DATA INFILE is about one order of magnitude faster than multiple INSERT statements, which are themselves several times faster (about 6x in the benchmark below) than single INSERT statements.
Below I benchmark the three main strategies for importing data from R into MySQL:
single INSERT statements, as in the question:
INSERT INTO test (col1,col2,col3) VALUES (1,2,3)
multiple INSERT statements, i.e. several rows per statement, formatted like so:
INSERT INTO test (col1,col2,col3) VALUES (1,2,3),(4,5,6),(7,8,9)
a LOAD DATA INFILE statement, i.e. loading a previously written CSV file into MySQL:
LOAD DATA INFILE 'the_dump.csv' INTO TABLE test
I use RMySQL here, but any other MySQL driver should lead to similar results. The SQL table was created with:
CREATE TABLE `test` (
`col1` double, `col2` double, `col3` double, `col4` double, `col5` double
) ENGINE=MyISAM;
The connection and test data were created in R with:
library(RMySQL)
con = dbConnect(MySQL(),
                user = 'the_user',
                password = 'the_password',
                host = '127.0.0.1',
                dbname = 'test')
n_rows = 1000000 # number of tuples
n_cols = 5 # number of fields
dump = matrix(runif(n_rows*n_cols), ncol=n_cols, nrow=n_rows)
colnames(dump) = paste0('col',1:n_cols)
Benchmarking single INSERT statements:
before = Sys.time()
for (i in 1:nrow(dump)) {
  query = paste0('INSERT INTO test (',paste0(colnames(dump),collapse = ','),') VALUES (',paste0(dump[i,],collapse = ','),');')
  dbExecute(con, query)
}
time_naive = Sys.time() - before
=> this takes about 4 minutes on my computer
Benchmarking multiple INSERT statements:
before = Sys.time()
chunksize = 10000 # arbitrary chunk size
for (i in 1:ceiling(nrow(dump)/chunksize)) {
  query = paste0('INSERT INTO test (',paste0(colnames(dump),collapse = ','),') VALUES ')
  vals = NULL
  for (j in 1:chunksize) {
    k = (i-1)*chunksize+j
    if (k <= nrow(dump)) {
      vals[j] = paste0('(', paste0(dump[k,],collapse = ','), ')')
    }
  }
  query = paste0(query, paste0(vals,collapse=','))
  dbExecute(con, query)
}
time_chunked = Sys.time() - before
=> this takes about 40 seconds on my computer
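The chunked statements above can also be built without the inner loop. The following is a minimal sketch, not the code that was benchmarked: build_insert and chunk_indices are hypothetical helper names, and the small demo matrix stands in for dump. I have not timed this variant.

```r
# Hypothetical helper (not part of the original benchmark): build one
# multi-row INSERT statement from a numeric matrix with named columns.
build_insert <- function(table, m) {
  # One "(v1,v2,...)" string per row of the matrix.
  rows <- apply(m, 1, function(r) paste0('(', paste0(r, collapse = ','), ')'))
  paste0('INSERT INTO ', table, ' (', paste0(colnames(m), collapse = ','),
         ') VALUES ', paste0(rows, collapse = ','))
}

# Hypothetical helper: split row indices 1..n into consecutive chunks
# of at most `chunksize` rows.
chunk_indices <- function(n, chunksize) {
  split(seq_len(n), ceiling(seq_len(n) / chunksize))
}

# Small demonstration on a 3 x 2 matrix, two rows per statement.
demo <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2,
               dimnames = list(NULL, c('col1', 'col2')))
queries <- vapply(chunk_indices(nrow(demo), 2),
                  function(idx) build_insert('test', demo[idx, , drop = FALSE]),
                  character(1))
print(unname(queries))
```

With the connection and data from the benchmark, the same helpers would be applied to dump with chunksize rows per statement, sending each resulting query with dbExecute(con, q).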
Benchmarking the LOAD DATA INFILE statement:
before = Sys.time()
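write.table(dump, 'the_dump.csv',
            row.names = FALSE, col.names = FALSE, sep = '\t')
# Without LOCAL, the MySQL server itself reads the file, so the path must be
# visible to the server; adjust it if R's working directory is not.
query = "LOAD DATA INFILE 'the_dump.csv' INTO TABLE test"
dbExecute(con, query) # runs the statement and frees the result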
time_infile = Sys.time() - before
=> this takes about 4 seconds on my computer
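To put the three timings side by side, here is a small calculation using the approximate figures reported above (about 4 minutes, 40 seconds and 4 seconds for 1,000,000 rows). The numbers are the rounded timings from this answer, not new measurements.

```r
# Approximate timings reported above, in seconds, for 1,000,000 rows.
n_rows <- 1e6
timings <- c(single_insert = 4 * 60, multiple_insert = 40, load_data_infile = 4)

# Throughput of each strategy and its speedup relative to single INSERTs.
rows_per_second <- n_rows / timings
speedup_vs_single <- timings[['single_insert']] / timings

print(round(rows_per_second))
print(speedup_vs_single)
```

This works out to roughly 4,200, 25,000 and 250,000 rows per second, i.e. speedups of about 6x and 60x over single INSERT statements.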
Crafting your SQL query to handle many insert values is the simplest way to improve performance. Transitioning to LOAD DATA INFILE gives the best results of the three strategies. Good performance tips can be found on this page of the MySQL documentation.
mysqlimport or the LOAD DATA INFILE syntax. Another strategy to gain speed would be to lock the table before importing the data... – Hothead
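The locking idea from the comment above could look like the sketch below. It uses the standard MySQL statements LOCK TABLES ... WRITE and UNLOCK TABLES; with_table_lock is a hypothetical helper name, and the effect of locking was not measured in the benchmark, so treat it as something to try rather than a guaranteed gain.

```r
# Hypothetical helper: wrap a set of import statements between
# LOCK TABLES ... WRITE and UNLOCK TABLES.
with_table_lock <- function(table, statements) {
  c(paste0('LOCK TABLES ', table, ' WRITE'), statements, 'UNLOCK TABLES')
}

stmts <- with_table_lock('test', c(
  'INSERT INTO test (col1,col2,col3) VALUES (1,2,3),(4,5,6)',
  'INSERT INTO test (col1,col2,col3) VALUES (7,8,9)'
))
print(stmts)

# With the connection from the benchmark, each statement would be sent in order:
# for (s in stmts) dbExecute(con, s)
```

All statements have to go through the same connection, since the lock belongs to the session that took it.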