Split partially quoted, comma-separated string
In R, I have some rather unusually formatted strings in a data.frame column that I would like to split by commas (i.e., each element of the column is a string of individual values separated by commas; the strings come from a PostgreSQL query, but the question is more general and for various reasons I cannot fix this in the PostgreSQL query itself). The data frame x containing the strings looks as follows:

> x
                                     v
1                              {'a',b}
2 {"abc, def","the \\"big\\" one",fgh}

(note: the double backslash \\ in the quoted content above is just escaping a single \ in the actual string)

Double quotes are put around an element of the comma-separated list if the value itself contains a space, double quote, or comma; values without these special characters are left unquoted. A literal double quote within a value is escaped as \". These characteristics make splitting the values on commas very difficult.
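To make these rules concrete, the inverse operation (quoting a single value for such a list) can be sketched as follows; quote_elem is my own illustrative helper, not part of the question's code:

```r
# Sketch of the quoting rule described above: a value is wrapped in
# double quotes only if it contains a space, comma, or double quote,
# and any embedded double quote is escaped with a backslash.
quote_elem <- function(v) {
  if (grepl('[ ,"]', v)) {
    paste0('"', gsub('"', '\\\\"', v), '"')  # escape " as \" then wrap
  } else {
    v  # plain values stay unquoted
  }
}

cat(quote_elem("b"), "\n")              # b
cat(quote_elem('the "big" one'), "\n")  # "the \"big\" one"
```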

I would like to split these values so that {'a',b} becomes c("'a'","b") and {"abc, def","the \\"big\\" one",fgh} becomes c("abc, def","the \"big\" one","fgh"). I have found a sort of solution to this problem (see below for an improved solution), but it is Ugly with a capital U and I am wondering if there is a better way. My solution is:

# Code to create data.frame using postgresql query
con <- DBI::dbConnect(RPostgres::Postgres())
q = "WITH a(v) AS (VALUES(ARRAY['''a''','b']::varchar[]),(ARRAY['abc, def','the \"big\" one','fgh']::varchar[])) SELECT * FROM a"
x <- DBI::dbGetQuery(con,q)

# Split quoted strings avoiding literal quotes within values
xx = gsub("[\\][\"]","%myspecialquote%",x$v) # Protect quoted quotes
y = stringi::stri_extract_all_regex(xx, '(?<=[{"]).*?(?=["}])')
yy = lapply(y,\(.) gsub("%myspecialquote%","\"",.)) # Put back quotes

z = lapply(yy,\(.) .[.!="," & .!=""]) # Remove empty strings and lone commas

# Split what wasn't split already
w = lapply(1:length(z),\(.) { 
  t = ifelse(grepl("\"",x$v[.]),z[.],strsplit(z[[.]],","))[[1]]
  t = lapply(t,\(.) ifelse(grepl("^,",.) | grepl(",$",.),(gsub("^[,]|[,]$","",.) |> strsplit(","))[[1]],.))
  return(t)
})

# Put humpty-dumpty back together again
v=sapply(w,\(.) { 
  if(is.list(.)) 
    . = do.call(c,.)
  return(.)
})

UPDATED SOLUTION

Based on the suggestions below, I propose the following as an answer to my question that should deal with most of the bizarre edge cases (provided the text %myspecialquote% does not appear in any of the values):

x <- data.frame(v = c("{'a',b}", '{"abc, def","the \\"big\\" one",fgh}'))

out <- transform(x, v = lapply(v, function(x) x |>
  gsub("^[{]|[}]$", "", x = _) |>
  gsub("[\\][\"]","%myspecialquote%",x = _) |>
  scan(text = _, quote = '"', what = "", sep = ",", quiet = TRUE) |>
  gsub("%myspecialquote%","\"",x = _)
))

For those who are tidyverse-inclined, you can replace transform with mutate.
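For example, a dplyr version of the same pipeline might look like this (a sketch, assuming dplyr is installed and x is defined as above):

```r
# dplyr version of the updated solution; produces a list column,
# one character vector per row.
library(dplyr)

x <- data.frame(v = c("{'a',b}", '{"abc, def","the \\"big\\" one",fgh}'))

out <- x |>
  mutate(v = lapply(v, \(s) s |>
    gsub("^[{]|[}]$", "", x = _) |>                   # strip outer braces only
    gsub("[\\][\"]", "%myspecialquote%", x = _) |>    # protect escaped quotes
    scan(text = _, quote = '"', what = "", sep = ",", quiet = TRUE) |>
    gsub("%myspecialquote%", "\"", x = _)             # restore the quotes
  ))
```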

Thanks for all the help!

Ectropion answered 23/9, 2024 at 13:29

Assuming that v is the character vector input in reproducible form: for each of its elements, remove { and } using gsub, scan the remaining text, and replace backslashes with double quotes using chartr. No packages are used.

x <- data.frame(v = c("{'a',b}", '{"abc, def","the \\"big\\" one",fgh}'))

out <- transform(x, v = lapply(v, function(x) x |>
  gsub("[{}]", "", x = _) |>
  scan(text = _, quote = '"', what = "", sep = ",", quiet = TRUE) |>
  chartr("\\", '"', x = _)
))

str(out)
## 'data.frame':   2 obs. of  1 variable:
##  $ v:List of 2
##   ..$ : chr  "'a'" "b"
##   ..$ : chr  "abc, def" "the \"big\" one" "fgh"
Savor answered 23/9, 2024 at 15:27
Thanks for this suggestion. It is close to what I want, but needs a couple of minor fixes. First, there could be { or } inside the values themselves, so we only want to strip these at the start and the end; I believe this can be accomplished by replacing gsub("[{}]", "", x = _) with gsub("^[{]|[}]$", "", x = _). Second, the final chartr is going to replace all backslashes with double quotes, but again these could potentially be inside values. In that case I am less certain how to fix it, but it would seem that replacing the backslash with something highly unlikely beforehand is workable. – Ectropion
Sure. Those seem likely to work. – Savor

I would use the following strategy:

  1. Replace all , by ; if they are not surrounded by " (assuming there is no semicolon in the original string; if there is, choose any character which is not part of the original string).
  2. Remove all { and }.
  3. Replace \\ by \.
  4. Split along ;.
  5. Remove the surrounding ".
library(stringr)
library(magrittr)
library(purrr)

(x <- data.frame(v = c("{'a',b}", 
                       "{\"abc, def\",\"the \\\"big\\\" one\",fgh}")))
#                                      v
# 1                              {'a',b}
# 2 {"abc, def","the \\"big\\" one",fgh}

r <- x$v %>%
  str_replace_all(',(?=(?:[^"]*"[^"]*")*[^"]*$)', ';') %>%
  str_remove_all("[{}]") %>%
  str_replace_all("\\\\", "\\") %>%
  str_split(fixed(";")) %>%
  map(~ str_remove_all(.x, '^"|"$'))

dput(r[[1]])
# c("'a'", "b")
dput(r[[2]])
# c("abc, def", "the \"big\" one", "fgh")

The most difficult part is the regex:

  1. ,: matches a comma.
  2. (?=...): a positive lookahead that asserts the following condition holds after the comma.
  3. (?:[^"]*"[^"]*")*: a non-capturing group (?:...) that matches pairs of double quotes (skipping over content within quotes).
  4. [^"]*$: ensures that after the comma there are no unmatched double quotes (i.e., the comma is outside any quoted section).
Bille answered 23/9, 2024 at 14:17
Thanks. The problem is that one cannot assume that special characters, such as braces, commas, quotes, and backslashes, are not in the individual values themselves. – Ectropion
But this is easy to check: start with any character (like #), search for it, and if it is present replicate it to ##; continue until you find a delimiter which is not yet in the string. – Bille
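The delimiter search suggested in the comment above can be sketched like this (find_delim is my own illustrative name):

```r
# Sketch of the idea above: starting from a candidate delimiter,
# keep doubling it until it no longer occurs anywhere in the string.
find_delim <- function(s, start = "#") {
  d <- start
  while (grepl(d, s, fixed = TRUE)) {
    d <- paste0(d, start)  # "#" -> "##" -> "###" ...
  }
  d
}

find_delim("a,b,c")   # "#"  (no # present, first candidate works)
find_delim("a,#b,c")  # "##" (a single # occurs, so double it once)
```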
