sqldf: query data by range of dates
Asked Answered
S

1

7

I am reading from a huge text file that has '%d/%m/%Y' date format. I want to use read.csv.sql of sqldf to read and filter the data by date at the same time. This is to save memory usage and run time by skipping many dates that I am not interested in. I know how to do this with the help of dplyr and lubridate, but I just want to try with sqldf for the aforementioned reason. Even though I am quite familiar with SQL syntax, it still gets me most of the time, no exception with sqldf.

Running command like following returned a data.frame with 0 row:

first_date <- "2001-11-1"
second_date <- "2003-11-1"
query <- "select * from file WHERE strftime('%d/%m/%Y', Date, 'unixepoch', 'localtime') between
                    '$first_date' AND '$second_date'"
df <- read.csv.sql(data_file,
                       sql= query,
                       stringsAsFactors=FALSE,
                       sep = ";", header = TRUE)

So for simulation, I tried with sqldf function like the following:

first_date <- "2001-11-1"
second_date <- "2003-11-1"
df2 <- data.frame( Date = paste(rep(1:3, each = 4), 11:12, 2001:2012, sep = "/"))
sqldf("SELECT * FROM df2 WHERE strftime('%d/%m/%Y', Date, 'unixepoch') BETWEEN '$first-date' AND '$second_date' ") 

# Expect:
# Date
# 1  1-11-2001
# 2  1-12-2002
# 3  1-11-2003
Skillless answered 7/9, 2014 at 19:49 Comment(4)
Are you sure the format in the file is not dd-mm-yyyy? Typically if day or month is 1, say, then its expressed as 01 and not 1.Puttee
Data in the text file really has dates like this 1/1/2011, not 01/01/2011. And either case, I think strftime function of SQLite driver would be able to convert to the later format. My difficulty here is that I don't know how to test the strftime function in this case.Skillless
SQLite doesn't have a date type. You might find useful this: #4429295Ramakrishna
@Ramakrishna That's true. That's why I use strftime function to convert time strings to dates on-the-fly and compare on this resulting dates. I am just not familiar with the the function to make correct call.Skillless
P
7

strftime strftime with percent codes is used to convert an object already regarded by sqlite as a datetime to something else but you want the reverse so the approach in the question is not going to work. For example, here we convert the current time into a dd-mm-yyyy string:

library(sqldf)
sqldf("select strftime('%d-%m-%Y', 'now') now")
##          now
## 1 07-09-2014

Discussion Since SQlite lacks date types its a bit onerous to handle this, particularly with the 1-or-2-digit non-standard date formats, but if you really want to use SQLite we can do it by tediously parsing out the date strings. Using fn$ from the gsubfn package for string interpolation eases this a little.

Code Below zero2d outputs SQL code to prepend a zero character to its input if its one digit. rmSlash outputs SQL code to remove any slashes in its argument. Year, Month and Day each output SQL code to take a character string representing a date in the format under discussion and extract the indicated component reformatting it as a 2 digit zero filled character string in the case of Month and Day. fmtDate takes a character string of the form shown in the question for first_string and second_string and outputs a yyyy-mm-dd character string.

library(sqldf)
library(gsubfn)

zero2d <- function(x) sprintf("substr('0' || %s, -2)", x)

rmSlash <- function(x) sprintf("replace(%s, '/', '')", x)

Year <- function(x) sprintf("substr(%s, -4)", x)

Month <- function(x) {
   y <- sprintf("substr(%s, instr(%s, '/') + 1, 2)", x, x)
   zero2d(rmSlash(y))
}

Day <- function(x) {
   y <- sprintf("substr(%s, 1, 2)", x)
   zero2d(rmSlash(y))
}

fmtDate <- function(x) format(as.Date(x))

sql <- "select * from df2 where 
  `Year('Date')` || '-' || 
  `Month('Date')` || '-' || 
  `Day('Date')`
  between '`fmtDate(first_date)`' and '`fmtDate(second_date)`'"
fn$sqldf(sql)

giving:

       Date
1 1/11/2001
2 1/12/2002
3 1/11/2003

Notes

1) SQLite functions used instr, replace and substr are core sqlite functions

2) SQL The actual SQL statement that is executed after fn$ performs the substitutions is as follows (slightly reformatted to fit):

> cat( fn$identity(sql), "\n")
select * from df2 where 
  substr(Date, -4) 
  || '-' || 
  substr('0' || replace(substr(Date, instr(Date, '/') + 1, 2), '/', ''), -2) 
  || '-' || 
  substr('0' || replace(substr(Date, 1, 2), '/', ''), -2)
  between '2001-11-01' and '2003-11-01' 

3) source of complications the main complication is the non-standard 1-or-2 digit day and month. Had they been consistently 2 digits it would have reduced to this:

first_date <- "2001-11-01"
second_date <- ""2003-11-01"

fn$sqldf("select Date from df2 
   where substr(Date, -4) || '-' || 
         substr(Date, 4, 2) || '-' ||
         substr(Date, 1, 2)
   between '`first_date`' and '`second_date`' ")

4) H2 Here is an H2 solution. H2 does have a datetime type simplifying the solution substantially over SQLite. We assume that the data is in a file called mydata.dat. Note that read.csv.sql does not support H2 as H2 already has the internal csvread SQL function to do that:

library(RH2)
library(sqldf)

first_date <- "2001-11-01"
second_date <- "2003-11-01"

fn$sqldf(c("CREATE TABLE t(DATE TIMESTAMP) AS
  SELECT parsedatetime(DATE, 'd/M/y') as DATE
  FROM CSVREAD('mydata.dat')", 
  "SELECT DATE FROM t WHERE DATE between '`first_date`' and '`second_date`'"))

Note that the first RH2 query will be slow in a session since it loads java. After that you can try it out to see if the performance is adequate.

Puttee answered 7/9, 2014 at 22:13 Comment(7)
This is quite a laborious solution. Thanks a lot for work it out for me.Skillless
Have added a note at the end pointing out that the complication is mostly from the 1-or-2 digit day and month and had they been consistently 2 digits the solution reduces substantially.Puttee
You are absolutely right. I was thinking about using Rcpp or something using external program to like sed to fix the dates before using R to work on the data. But since I am rather new to R, I want to learn how it works in R. On a side matter, I have just found out that you are one of the authors of RH2 package. My respect, and a pleasant surprise :-). Do you mind demonstrating how the work can be done with RH2 instead of SQLite? Thanks again.Skillless
From what is written here: h2database.com/html/datatypes.html#date_type . I think that we can't avoid the steps of extracting year, month, and day and reformatting the dates. So the code would not look shorter. Is that true?Skillless
Have added an H2 solution at the end.Puttee
That looks definitely cleaner. I tried with the real data and got error like: execute JDBC update query failed in dbSendUpdate (Out of memory. I ran the code in RStudio, and wanted to select all the columns. That probably hits the limit of RStudio. So, I will stay at SQLite-based solution for now.Skillless
Try this in R: options(java.parameters="-Xmx1024m") before loading any packages or use some other memory size.Puttee

© 2022 - 2024 — McMap. All rights reserved.