Split keep repeated delimiter

I'm trying to use the stringi package to split on a delimiter (which may be repeated) yet keep the delimiter. This is similar to a question I asked moons ago, "R split on delimiter (split) keep the delimiter (split)", except that here the delimiter can be repeated. I don't think base strsplit can handle this type of regex. The stringi package can, but I can't figure out how to write the regex so that it splits on the delimiter when there are repeats and does not leave an empty string at the end of the string.

Base R solutions, stringr, stringi etc. solutions all welcomed.

The latter problem occurs because I use a greedy * on the \\s, but the space isn't guaranteed, so I could only think to leave it in:

MWE

text.var <- c("I want to split here.But also||Why?",
   "See! Split at end but no empty.",
   "a third string.  It has two sentences"
)

library(stringi)   
stri_split_regex(text.var, "(?<=([?.!|]{1,10}))\\s*")

# Outcome

## [[1]]
## [1] "I want to split here." "But also|"     "|"          "Why?"                 
## [5] ""                     
## 
## [[2]]
## [1] "See!"       "Split at end but no empty." ""                          
## 
## [[3]]
## [1] "a third string."      "It has two sentences"

# Desired Outcome

## [[1]]
## [1] "I want to split here." "But also||"                     "Why?"                                  
## 
## [[2]]
## [1] "See!"         "Split at end but no empty."                         
## 
## [[3]]
## [1] "a third string."      "It has two sentences"
Starrstarred answered 22/10, 2014 at 14:19

Using strsplit

 strsplit(text.var, "(?<=[.!|])( +|\\b)", perl=TRUE)
 #[[1]]
 #[1] "I want to split here." "But also||"            "Why?"                 

 #[[2]]
 #[1] "See!"                       "Split at end but no empty."

 #[[3]]
 #[1] "a third string."      "It has two sentences"

Or

 library(stringi)
 stri_split_regex(text.var, "(?<=[.!|])( +|\\b)")
 #[[1]]
 #[1] "I want to split here." "But also||"            "Why?"                 

 #[[2]]
 #[1] "See!"                       "Split at end but no empty."

 #[[3]]
 #[1] "a third string."      "It has two sentences"
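
One caveat, offered as a hedged aside: the character class `[.!|]` above does not include `?`, which the question lists among its delimiters. It makes no difference for `text.var`, where the only `?` ends a string, but a `?` in the middle of a string would not trigger a split. A minimal sketch, using a made-up input (`x` is not from the question) and assuming `?` should behave like the other delimiters:

```r
# Hypothetical input, not from the question: a "?" in mid-string
x <- "Really?Yes."

# Pattern as posted: "?" is not in the class, so nothing is split
as_posted <- strsplit(x, "(?<=[.!|])( +|\\b)", perl = TRUE)[[1]]

# Same pattern with "?" added to the look-behind class
with_qmark <- strsplit(x, "(?<=[?.!|])( +|\\b)", perl = TRUE)[[1]]

print(as_posted)   # the whole string comes back as one piece
print(with_qmark)  # "Really?" and "Yes."
```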
Rooke answered 22/10, 2014 at 15:39 Comment(4)
Would you mind explaining what *SKIP and *F are, and what roles they play in the regex? - Workwoman
@Josh O'Brien Thanks for the comments. Actually, *SKIP and *F are not needed. I used them earlier while I was working on the code and didn't check afterwards. - Rooke
@Tyle Rinker Thanks. Also, the *SKIP *F part was not working with stringi. - Rooke
Both approaches work well, but this one is a little more concise. Thank you +1 - Starrstarred

Just use a pattern that finds inter-character locations that: (1) are preceded by one of ?.!|; and (2) are not followed by one of ?.!|. Tack on \\s* to match and eat up any number of consecutive space characters, and you're good to go.

##                  (look-behind)(look-ahead)(spaces)
strsplit(text.var, "(?<=([?.!|]))(?!([?.!|]))\\s*", perl=TRUE)
# [[1]]
# [1] "I want to split here." "But also||"            "Why?"                 
# 
# [[2]]
# [1] "See!"                       "Split at end but no empty."
# 
# [[3]]
# [1] "a third string."      "It has two sentences"
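
To make the role of the look-ahead concrete, here is a small sketch (base R only; the variable names are mine, not from the thread) that runs the pattern with and without `(?!...)` on the string from the question that contains the repeated delimiter. Without the look-ahead, the position between the two `|` characters also qualifies as a split point.

```r
x <- "I want to split here.But also||Why?"

# Look-behind only: every position after a delimiter is a split point,
# including the one between the two "|" characters
no_lookahead <- strsplit(x, "(?<=[?.!|])\\s*", perl = TRUE)[[1]]

# Look-behind plus negative look-ahead: a position that is followed by
# another delimiter is skipped, so the run "||" stays together
with_lookahead <- strsplit(x, "(?<=[?.!|])(?![?.!|])\\s*", perl = TRUE)[[1]]

print(no_lookahead)
print(with_lookahead)
```

The look-ahead version should reproduce the first element of the output shown above. Base `strsplit` also discards a match at the very end of the input, which is why no trailing empty string appears after "Why?".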
Sinner answered 22/10, 2014 at 17:10 Comment(1)
You showed me where my regex thinking was wrong, which is a huge help to learning. akrun's approach is a bit more concise. +1 - Starrstarred
