Find dates that fail to parse in R Lubridate
Asked Answered
M

8

17

As a R novice I'm pulling my hair out trying to debug cryptic R errors. I have csv that containing 150k lines that I load into a data frame named 'date'. I then use lubridate to convert this character column to datetimes in hopes of finding min/max date.

  dates <- csv[c('datetime')]
  dates$datetime <- ymd_hms(dates$datetime)

Running this code I receive the following error message:

Warning message:
3 failed to parse. 

I accept this as the CSV could have some janky dates in there and next run:

min(dates$datetime) 
max(dates$datetime)

Both of these return NA, which I assume is from the few broken dates still stored in the data frame. I've searched around for a quick fix, and have even tried to build a foreach loop to identify the problem dates, but no luck. What would be a simple way to identify the 3 broken dates?

example date format: 2015-06-17 17:10:16 +0000
Mordecai answered 12/2, 2016 at 18:40 Comment(6)
You can check whether the format is consistent. Also check ?guess_formatsAdze
If you sort on the datetime vector, would the NA's be either at the head or the tail?Lastex
@Adze I attempted to do this in excel by import csv and checking the length of each cell and filtering for anything out of place. This didn't result in anything, is there a way to do this in R? I've attempted guess_formats & parse_date_time and both have failed.Mordecai
@Lastex That was it! I sorted the unparsed raw csv, then tail(dates) displayed the three incorrect fields. Thanks for the help!Mordecai
You could also find the row numbers of the missing dates with which(is.na(dates$datetime)).Carrington
@Carrington this is exactly what I was looking for, quick and easy.Mordecai
M
14

Credit to LawyeR and Stibu from above comments:

  1. I first sorted the raw csv column and did a head() & tail() to find which 3 dates were causing trouble
  2. Alternatively which(is.na(dates$datetime)) was a simple one liner to also find the answer.
Mordecai answered 12/2, 2016 at 19:16 Comment(3)
This is great, but doesn't really answer the general question. What if the problem is that the character 'purpleElephant' is in your data? It's not an NA yet is still unparseable. We still need some way to view the warnings that are given by Lubridate.Illailladvised
The question was about identifying the three broken dates and this accomplishes that perfectly.Suppository
But it only accomplishes that because the 3 dates happened to be NAs. I have a vector of 93 dates/datetimes which contains ~17 NAs and am getting "2 failed to parse". So this solution doesn't solve the generic problem, just the problem in OP's case.Saintjust
G
4

Lubridate will throw that error when attempting to parse dates that do not exist because of daylight savings time.

For example:

library(lubridate)
mydate <- strptime('2020-03-08 02:30:00', format = "%Y-%m-%d %H:%M:%S")
ymd_hms(mydate, tz = "America/Denver")

[1] NA
Warning message:
 1 failed to parse. 

My data comes from an unintelligent sensor which does not know about DST, so impossible (but correctly formatted) dates appear in my timeseries.

Greene answered 3/7, 2020 at 5:38 Comment(0)
V
4

Here is a simple function that solves the generic problem:

parse_ymd = function(x){
  d=lubridate::ymd(x, quiet=TRUE)
  errors = x[!is.na(x) & is.na(d)]
  if(length(errors)>0){
    cli::cli_warn("Failed to parse some dates: {.val {errors}}")
  }
  d
}

x = c("2014/20/21", "2014/01/01", NA, "2014/01/02", "foobar")
my_date = lubridate::ymd(x)
#> Warning: 2 failed to parse.
my_date = parse_ymd(x)
#> Warning: Failed to parse some dates: "2014/20/21" and "foobar"

Created on 2022-09-29 with reprex v2.0.2

Of course, replace ymd() with whatever you want.

Vltava answered 29/9, 2022 at 13:13 Comment(2)
It eliminates the failure warning but it does not show where the failure happens if there is a lot of data (I changed options(max.print = .Machine$integer.max) to avoid printing limitation however still the message does not appear ). It solved my problem.Nuclide
Using this function I was able to determine that I had an impossible date of 2000-11-38 which I fixed and no longer had the parse failure.Radial
I
1

If the indices of where lubridate fails are useful to know, you can use a for loop with stopifnot() and print each successful parse.

Make some dates, throw an error in there at a random location.

library(lubridate)
set.seed(1)
my_dates<-as.character(sample(seq(as.Date('1900/01/01'), 
as.Date('2000/01/01'), by="day"), 1000))
my_dates[sample(1:length(my_dates), 1)]<-"purpleElephant"

Now use a for loop and print each successful parse with stopifnot().

for(i in 1:length(my_dates)){
   print(i)
   stopifnot(!is.na(ymd(my_dates[i])))
}

Irairacund answered 18/9, 2020 at 17:3 Comment(0)
U
1

To provide a more generic answer, first filter out the NAs, then try and parse, then filter only the NAs. This will show you the failures. Something like:

dates2 <- dates[!is.na(dates2$datetime)]
dates2$datetime <- ymd_hms(dates2$datetime)

Warning message:
 3 failed to parse.

dates2[is.na(dates2$datetime)]
Unilobed answered 16/6, 2021 at 3:43 Comment(0)
C
0

Use the truncate argument. The most common type of irregularity in date-time data is the truncation due to rounding or unavailability of the time stamp.

Therefore, try truncated = 1, then potentially go up to truncated = 3:

  dates <- csv[c('datetime')]
  dates$datetime <- ymd_hms(dates$datetime, truncated = 1)
Cocaine answered 3/11, 2020 at 20:9 Comment(0)
M
0

ludridate::ymd_hms() expects a certain number (usually 19) of characters when it attempts to parse a timestamp string. Sometimes, timestamps that fall at midnight are rendered as just the date portion, meaning the string has only 10 characters. There are other reasons such as simple data errors that may result in a non-standard string length but NA-based operations won't find these.

To see if you have some anomalies in string lengths, you can use

table(nchar(dates$datetime))

If you find entries with problem strings, you can identify the rows using

which(nchar(dates$datetime) != 19)

Marceline answered 12/10, 2023 at 14:17 Comment(0)
M
0

In case the error is more subtle in the sense that there are no NA but different date formats that still (wrongly) parse one can try something like:

dates <- lubridate::parse_date_time(dates, orders = c("ymd", "mdy", "dmy"))
Minnaminnaminnie answered 3/8, 2024 at 11:26 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.