How do I extract multiple character strings from one line using R?

Asked 14/5, 2014 at 21:5 Answered 14/5, 2014 at 22:18

I would like to extract multiple character strings from one line.

Suppose I have the following line of text (read from a website with the 'readLines' function):

line <- "abc:city1-street1-long1-lat1,ldjad;skj//abc:city2-street2-long2-lat2,ldjad;skj//abc:city3-street3-long3-lat3,ldjad;skj//abc:city3-street3-long3-lat3,ldjad;skj//"

I would like to extract the following to separate lines:

[1] city1-street1-long1-lat1
[2] city2-street2-long2-lat2
[3] city3-street3-long3-lat3
[4] city3-street3-long3-lat3

I hope someone can give me a hint how to perform this task.

Elecampane answered 14/5, 2014 at 21:5 Comment(0)

regmatches to the rescue:

regmatches(line,gregexpr("city\\d+-street\\d+-long\\d+-lat\\d+",line))
#[[1]]
#[1] "city1-street1-long1-lat1"
#[2] "city2-street2-long2-lat2"
#[3] "city3-street3-long3-lat3"
#[4] "city3-street3-long3-lat3"

Pate answered 14/5, 2014 at 21:9 Comment(0)

A solution with the stringi package:

library(stringi)
stri_extract_all_regex(line, "(?<=:).+?(?=,)")[[1]]
## [1] "city1-street1-long1-lat1" "city2-street2-long2-lat2" "city3-street3-long3-lat3" "city3-street3-long3-lat3"

And with the stringr package:

library(stringr)
str_extract_all(line, perl("(?<=:).+?(?=,)"))[[1]]  # stringr >= 1.0.0: pass the pattern directly, without perl()
## [1] "city1-street1-long1-lat1" "city2-street2-long2-lat2" "city3-street3-long3-lat3" "city3-street3-long3-lat3"

Both calls use the same regular expression. It matches the characters between : and , non-greedily (.+?), so each match stops at the first comma it reaches. (?<=:) is a positive look-behind: a : must immediately precede the match, but it is not included in the result. Likewise, (?=,) is a positive look-ahead: a , must immediately follow the match, but it does not appear in the output.
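To make the effect of the lookarounds concrete, here is a minimal base R sketch that runs the same pattern with and without them. The input is a shortened, two-record stand-in for the question's line, and the variable names with_delims and without_delims are only illustrative.

```r
# Shortened stand-in for the question's input (two records only).
line <- "abc:city1-street1-long1-lat1,ldjad;skj//abc:city2-street2-long2-lat2,ldjad;skj//"

# Without lookarounds, the delimiters ':' and ',' are consumed,
# so they are part of every match.
with_delims <- regmatches(line, gregexpr(":.+?,", line, perl = TRUE))[[1]]

# With lookarounds, the delimiters are only asserted,
# so they are left out of every match.
without_delims <- regmatches(line, gregexpr("(?<=:).+?(?=,)", line, perl = TRUE))[[1]]

print(with_delims)
print(without_delims)
```

The first vector keeps the leading : and trailing , on each element; the second holds only the text in between.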

Some benchmarks:

lines <- stri_dup(line, 250) # duplicate line 250 times
library(microbenchmark)
microbenchmark(
   stri_extract_all_regex(lines, "(?<=:).+?(?=,)")[[1]],
   str_extract_all(lines, perl("(?<=:).+?(?=,)"))[[1]],
   regmatches(lines, gregexpr("city\\d+-street\\d+-long\\d+-lat\\d+", lines)),
   lapply(unlist(strsplit(lines,',')),
       function(x)unlist(strsplit(x,':'))[2]),
   lapply(strsplit(lines,'//'),
        function(x)
          sub('.*:(.*),.*','\\1',x))
)
## Unit: milliseconds
##                            expr        min         lq     median         uq        max neval
## gagolews-stri_extract_all_regex   4.722515   4.811009   4.835948   4.883854   6.080912   100
##        gagolews-str_extract_all 103.514964 103.824223 104.387175 106.246773 117.279208   100
##          thelatemail-regmatches  36.049106  36.172549  36.342945  36.967325  47.399339   100
##                  agstudy-lapply  21.152761  21.500726  21.792979  22.809145  37.273120   100
##                 agstudy-lapply2   8.763783   8.854666   8.930955   9.128782  10.302468   100

As you can see, the stringi-based solution is the fastest (median of about 4.8 ms), followed by the strsplit/sub approach (about 8.9 ms).

Cinemascope answered 14/5, 2014 at 21:9 Comment(4)

You might add the strapply function from the gsubfn package as another option. – Perdomo 14/5, 2014 at 21:33

@Cinemascope Are you the author of stringi package? good auto-promotion :) – Ukase 14/5, 2014 at 21:33

I didn't know it would be so fast ;) – Cinemascope 14/5, 2014 at 21:35

I never understood lookahead/behind regex until your explanation. Simple, but I was clearly overthinking it. – Pate 14/5, 2014 at 22:49

Another option, without writing a regular expression for the whole record:

unlist(lapply(unlist(strsplit(line,',')),function(x)unlist(strsplit(x,':'))[2]))

"city1-street1-long1-lat1" 
"city2-street2-long2-lat2" 
"city3-street3-long3-lat3"
"city3-street3-long3-lat3"
 NA

EDIT better solution

Using a combination of strsplit and sub. There is no need to spell out the exact, complicated structure of each record; a capture group is enough:

lapply(strsplit(line,'//'),function(x) sub('.*:(.*),.*','\\1',x))
[[1]]
[1] "city1-street1-long1-lat1" 
    "city2-street2-long2-lat2" 
    "city3-street3-long3-lat3" 
    "city3-street3-long3-lat3"
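The two steps can also be run separately to see what each one contributes. This is a small sketch on a shortened, two-record stand-in for the question's input; the names records and fields are illustrative, and fixed = TRUE is my addition to make the separator a literal string.

```r
# Shortened stand-in for the question's input (two records only).
line <- "abc:city1-street1-long1-lat1,ldjad;skj//abc:city2-street2-long2-lat2,ldjad;skj//"

# Step 1: split on the record separator '//'. strsplit() does not return
# an empty piece for the separator at the very end of the string.
records <- strsplit(line, "//", fixed = TRUE)[[1]]

# Step 2: the pattern matches the whole record, and '\\1' replaces that
# match with the capture group, i.e. the text between ':' and ','.
fields <- sub(".*:(.*),.*", "\\1", records)

print(records)
print(fields)
```

One thing to keep in mind: sub() returns an element unchanged when the pattern does not match it, so a record without both a : and a , would pass through as-is.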

Ukase answered 14/5, 2014 at 21:14 Comment(0)

For something simple like this, base R handles this just fine.

matches <- regmatches(line, gregexpr('(?<=:).*?(?=,)', line, perl = TRUE))

Arst answered 14/5, 2014 at 22:18 Comment(6)

Thanks for your help. With my simplified example this works very well. However in the actual situation I need to extract all strings that are between '{\"name\":\' and '\"}'. I hope you (or someone else) can advise how to perform this task. – Elecampane 15/5, 2014 at 20:3

@Elecampane Can you provide an example input line? – Arst 15/5, 2014 at 20:18

It is a very long line (that is why I simplified the example). To get the data you can run the following code: www_SB <- readLines('schoonenberg.nl/winkels/'); line <- www_SB[300] – Elecampane 15/5, 2014 at 20:33

Ok so for example,

{"name":"Schoonenberg Hoorcomfort","url":"\/winkels\/schoonenberg-hoorcomfort-aalsmeer-zijdstraat-17-a","longitude":4.7467513,"latitude":52.2691185,"features":[],"id":"392858746"}

what part do you want out of that? – Arst 15/5, 2014 at 21:5

Thanks for your reply and your time. The part I want to extract is: Schoonenberg Hoorcomfort","url":"\/winkels\/schoonenberg-hoorcomfort-aalsmeer-zijdstraat-17-a","longitude":4.7467513,"latitude":52.2691185 – Elecampane 16/5, 2014 at 18:38

(?<="name":).*?(?=,"features") – Arst 16/5, 2014 at 22:30
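As a rough sketch of how that pattern could be applied, the code below runs it in base R against the single sample record quoted earlier in the comments. The second pattern is my own variant, not something proposed in the thread: it moves the opening quote into the look-behind, on the assumption that the leading " is not wanted in the result.

```r
# Sample record quoted in the comments; backslashes are doubled because
# this is an R string literal.
record <- '{"name":"Schoonenberg Hoorcomfort","url":"\\/winkels\\/schoonenberg-hoorcomfort-aalsmeer-zijdstraat-17-a","longitude":4.7467513,"latitude":52.2691185,"features":[],"id":"392858746"}'

# Pattern from the comment: everything after "name": up to ,"features".
pattern <- '(?<="name":).*?(?=,"features")'
hit <- regmatches(record, gregexpr(pattern, record, perl = TRUE))[[1]]

# Variant (an assumption about the intent): also assert the opening quote,
# so the match starts at the first letter of the name.
pattern2 <- '(?<="name":").*?(?=,"features")'
hit2 <- regmatches(record, gregexpr(pattern2, record, perl = TRUE))[[1]]

cat(hit, sep = "\n")
cat(hit2, sep = "\n")
```

Because gregexpr finds every match, the same call should return one element per record when the line holds many such objects, as long as each one has a "name" field followed later by "features".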
