In R, I have some rather unusually formatted strings in a data.frame column that I would like to split by commas (i.e., each element of the column is a string of individual values separated by commas). The strings come from a PostgreSQL query, but the question is more general, and for various reasons I cannot fix this in the PostgreSQL query itself. The data frame x containing the strings looks as follows:
> x
v
1 {'a',b}
2 {"abc, def","the \\"big\\" one",fgh}
(note: the doubled backslash \\ in the quoted content above is just escaping a single backslash \ in the actual string)
Double quotes are put around an element of the comma-separated list if the value itself contains a space, double quote, or comma; if a value contains none of these special characters, no double quotes are used. A literal double quote within a value is escaped as \". These characteristics make splitting the values by commas very difficult.
I would like to split these values so that {'a',b} becomes c("'a'","b") and {"abc, def","the \\"big\\" one",fgh} becomes c("abc, def","the \"big\" one","fgh"). I have found a sort of solution to this problem (see below for an improved solution), but it is Ugly with a capital U, and I am wondering if there is a better way. My solution is:
# Code to create the data.frame via a PostgreSQL query
con <- DBI::dbConnect(RPostgres::Postgres())
q <- "WITH a(v) AS (VALUES (ARRAY['''a''','b']::varchar[]), (ARRAY['abc, def','the \"big\" one','fgh']::varchar[])) SELECT * FROM a"
x <- DBI::dbGetQuery(con, q)
# Split quoted strings avoiding literal quotes within values
xx <- gsub("[\\][\"]", "%myspecialquote%", x$v)  # Protect escaped (literal) quotes
y  <- stringi::stri_extract_all_regex(xx, '(?<=[{"]).*?(?=["}])')
yy <- lapply(y, \(.) gsub("%myspecialquote%", "\"", .))  # Put the quotes back
z  <- lapply(yy, \(.) .[. != "," & . != ""])  # Remove empty strings and lone commas
# Split what wasn't split already
w <- lapply(seq_along(z), \(.) {
  t <- ifelse(grepl("\"", x$v[.]), z[.], strsplit(z[[.]], ","))[[1]]
  t <- lapply(t, \(.) ifelse(grepl("^,", .) | grepl(",$", .),
                             (gsub("^[,]|[,]$", "", .) |> strsplit(","))[[1]],
                             .))
  return(t)
})
# Put humpty-dumpty back together again
v <- sapply(w, \(.) {
  if (is.list(.))
    . <- do.call(c, .)
  return(.)
})
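As a quick sanity check, the result can be compared against the desired output stated above (a small sketch; expected is simply the target output from the question):

expected <- list(c("'a'", "b"),
                 c("abc, def", "the \"big\" one", "fgh"))
identical(v, expected)  # should be TRUE if the pipeline behaves as described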
UPDATED SOLUTION
Based on the suggestions below, I propose the following as an answer to my question, which should deal with most of the bizarre edge cases (provided the text %myspecialquote% does not appear in any of the values):
x <- data.frame(v = c("{'a',b}", '{"abc, def","the \\"big\\" one",fgh}'))
out <- transform(x, v = lapply(v, function(s) s |>
  gsub("^[{]|[}]$", "", x = _) |>                  # Strip braces at the ends only
  gsub("[\\][\"]", "%myspecialquote%", x = _) |>   # Protect escaped quotes
  scan(text = _, quote = '"', what = "", sep = ",", quiet = TRUE) |>  # Split on commas, honoring quoting
  gsub("%myspecialquote%", "\"", x = _)            # Restore the protected quotes
))
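Since v in out is now a list column, the split values can be pulled out directly; for example (output shown as a comment):

out$v[[2]]
# [1] "abc, def"        "the \"big\" one" "fgh"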
For those who are tidyverse inclined, you can replace transform with mutate, as in the sketch below.
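For example, a minimal sketch of the mutate version (the same pipeline, just inside dplyr::mutate):

library(dplyr)
out <- x |>
  mutate(v = lapply(v, \(s) s |>
    gsub("^[{]|[}]$", "", x = _) |>
    gsub("[\\][\"]", "%myspecialquote%", x = _) |>
    scan(text = _, quote = '"', what = "", sep = ",", quiet = TRUE) |>
    gsub("%myspecialquote%", "\"", x = _)
  ))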
Thanks for all the help!
(From a follow-up comment on one of the suggested answers:) First, values could contain {} inside the values themselves, so we only want to strip these at the start and the end. I believe this can be accomplished by replacing gsub("[{}]", "", x = _) with gsub("^[{]|[}]$", "", x = _). Second, the final chartr is going to replace all backslashes with double quotes, but again these could potentially be inside values. In this case, I am less certain how to fix it, but it would seem to me that replacing the backslash with something highly unlikely beforehand is workable. – Ectropion
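For what it's worth, the same placeholder trick used above could in principle protect backslashes before such a chartr() pass (a rough sketch of the commenter's idea, not the actual code from that answer; the placeholder name is arbitrary):

# Protect literal backslashes with an unlikely placeholder, then restore them
protect_bs <- function(s) gsub("\\", "%myspecialbackslash%", s, fixed = TRUE)
restore_bs <- function(s) gsub("%myspecialbackslash%", "\\", s, fixed = TRUE)
s <- "a\\b"                 # a value containing a literal backslash
restore_bs(protect_bs(s))   # round-trips back to "a\\b" unchanged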