A solution with the stringi package:
library(stringi)
stri_extract_all_regex(line, "(?<=:).+?(?=,)")[[1]]
## [1] "city1-street1-long1-lat1" "city2-street2-long2-lat2" "city3-street3-long3-lat3" "city3-street3-long3-lat3"
And with the stringr package:
library(stringr)
str_extract_all(line, perl("(?<=:).+?(?=,)"))[[1]]
## [1] "city1-street1-long1-lat1" "city2-street2-long2-lat2" "city3-street3-long3-lat3" "city3-street3-long3-lat3"
In both cases we are using regular expressions.
Here we match all the characters (non-greedily, i.e. with .+?) that occur between a colon (:) and a comma (,). (?<=:) is a positive look-behind: a : must immediately precede the match, but it is not included in the result. Likewise, (?=,) is a positive look-ahead: a , must immediately follow the match, but it does not appear in the output.
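To see what the look-arounds and the non-greedy quantifier each contribute, here is a small sketch in base R (perl = TRUE is needed so that look-behind is supported). Note that sample_line is a made-up input modelled on the output shown above, not the original line variable, and extract_all is a hypothetical helper name introduced only for this illustration.

```r
# Hypothetical sample input; the real `line` is not shown in this answer.
sample_line <- "name1:city1-street1-long1-lat1,name2:city2-street2-long2-lat2,"

# Hypothetical helper: return every match of a PCRE pattern in one string.
extract_all <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

# Look-behind and look-ahead: the delimiters are required but not captured.
with_lookarounds <- extract_all("(?<=:).+?(?=,)", sample_line)

# Same idea without look-arounds: the ":" and "," become part of each match.
without_lookarounds <- extract_all(":.+?,", sample_line)

# Greedy variant: .+ runs on to the last comma instead of the nearest one.
greedy <- extract_all("(?<=:).+(?=,)", sample_line)

print(with_lookarounds)
print(without_lookarounds)
print(greedy)
```

On this input the first pattern should return the two values on their own, the second should return them with the surrounding : and , still attached, and the greedy variant should collapse everything between the first colon and the last comma into a single match.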
Some benchmarks:
lines <- stri_dup(line, 250) # duplicate line 250 times
library(microbenchmark)
# Each expression is named so that the labels match the output below
microbenchmark(
  "gagolews-stri_extract_all_regex" =
    stri_extract_all_regex(lines, "(?<=:).+?(?=,)")[[1]],
  "gagolews-str_extract_all" =
    str_extract_all(lines, perl("(?<=:).+?(?=,)"))[[1]],
  "thelatemail-regmatches" =
    regmatches(lines, gregexpr("city\\d+-street\\d+-long\\d+-lat\\d+", lines)),
  "agstudy-lapply" =
    lapply(unlist(strsplit(lines, ',')),
           function(x) unlist(strsplit(x, ':'))[2]),
  "agstudy-lapply2" =
    lapply(strsplit(lines, '//'),
           function(x) sub('.*:(.*),.*', '\\1', x))
)
## Unit: milliseconds
## expr min lq median uq max neval
## gagolews-stri_extract_all_regex 4.722515 4.811009 4.835948 4.883854 6.080912 100
## gagolews-str_extract_all 103.514964 103.824223 104.387175 106.246773 117.279208 100
## thelatemail-regmatches 36.049106 36.172549 36.342945 36.967325 47.399339 100
## agstudy-lapply 21.152761 21.500726 21.792979 22.809145 37.273120 100
## agstudy-lapply2 8.763783 8.854666 8.930955 9.128782 10.302468 100
As you can see, the stringi-based solution is the fastest, with a median of about 4.8 ms against 8.9 ms for the next-fastest approach.
The strapply function from the gsubfn package is another option. – Perdomo