In R, split a character vector by a specific character; save 3rd piece in new vector
I have a vector of data in the form ‘aaa_9999_1’, where the first part is an alphabetic location code, the second is the four-digit year, and the last is a unique point identifier. E.g., there are multiple sil_2007_X points, each with a different last digit. I need to split this field on the “_” character and save only the unique ID number into a new vector. I tried:

oss$point <- unlist(strsplit(oss$id, split='_', fixed=TRUE))[3]

based on a response here: R remove part of string. I get a single response of “1”. If I just run

strsplit(oss$id, split='_', fixed=TRUE)

I can generate the split list:

> head(oss$point)
[[1]]
[1] "sil"  "2007" "1"   

[[2]]
[1] "sil"  "2007" "2"   

[[3]]
[1] "sil"  "2007" "3"   

[[4]]
[1] "sil"  "2007" "4"   

[[5]]
[1] "sil"  "2007" "5"   

[[6]]
[1] "sil"  "2007" "6"  

Adding the [3] at the end just gives me the [[3]] result: “sil” “2007” “3”. What I want is a vector of the 3rd part (the unique number) of all records. I feel like I’m close to understanding this, but it is taking too much time (like most of a day) on a deadline project. Thanks for any feedback.

Supervisor answered 16/10, 2013 at 17:44 Comment(0)
P
18

strsplit creates a list, so I would try the following:

lapply(strsplit(oss$id, split='_', fixed=TRUE), `[`, 3) ## Output a list
sapply(strsplit(oss$id, split='_', fixed=TRUE), `[`, 3) ## Output a vector (even though a list is also a vector)

Here `[` is the extraction function itself, and the 3 is passed to it as the index, so the third element of each list item is extracted. If you prefer a vector, substitute lapply with sapply.

Here's an example:

mystring <- c("A_B_C", "D_E_F")

lapply(strsplit(mystring, "_"), `[`, 3)
# [[1]]
# [1] "C"
# 
# [[2]]
# [1] "F"
sapply(strsplit(mystring, "_"), `[`, 3)
# [1] "C" "F"
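If the result has to be a character vector, vapply from base R might be worth a look as well, since it states the expected type up front instead of leaving the simplification to sapply. A minimal sketch, where ids is a made-up stand-in for oss$id:

```r
# Hypothetical sample standing in for oss$id
ids <- c("sil_2007_1", "sil_2007_2", "sil_2007_3")

parts <- strsplit(ids, split = "_", fixed = TRUE)

# character(1) declares that each list element must yield exactly one string;
# the trailing 3 is passed on to `[` as the index
point <- vapply(parts, `[`, character(1), 3)
point
```

An id with fewer than three pieces would come back as NA rather than raise an error, so a quick check with anyNA(point) is probably a good idea on real data.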

If there is an easily definable pattern, gsub might be a good option too, and avoids splitting. See the comments for improved (more robust) versions along the same lines from DWin and Josh O'Brien.

gsub(".*_.*_(.*)", "\\1", mystring)
# [1] "C" "F"
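Note that the pattern above is greedy, so on input with more than three pieces it returns the last piece rather than the third. The following sketch compares it with an anchored pattern built from [^_]*, which cannot cross an underscore. The input vector x is made up for illustration:

```r
# Made-up inputs: the second one has more than three pieces
x <- c("sil_2007_1", "A_B_C_D_E_F")

# Greedy version: captures whatever follows the last underscore
greedy <- gsub(".*_.*_(.*)", "\\1", x)

# Anchored version: each [^_]* stops at the next underscore,
# so the capture group is always the third piece
third <- sub("^[^_]*_[^_]*_([^_]*).*$", "\\1", x)

greedy
third
```

For strings with exactly three pieces, both patterns give the same answer.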

And, finally, just for fun, you can expand on the unlist approach to make it work by recycling a vector of TRUEs and FALSEs to extract every third item (since we know in advance that all the splits will result in an identical structure).

unlist(strsplit(mystring, "_"), use.names = FALSE)[c(FALSE, FALSE, TRUE)]
# [1] "C" "F"
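The recycling trick only holds if every string really splits into exactly three pieces, so it may be worth checking that first. A rough sketch of the check, plus the same idea expressed as a matrix with one column per piece (lengths() needs R 3.2.0 or later):

```r
mystring <- c("A_B_C", "D_E_F")
parts <- strsplit(mystring, "_", fixed = TRUE)

# Stop early if any element did not split into exactly 3 pieces
stopifnot(all(lengths(parts) == 3))

# One row per input string, one column per piece
m <- matrix(unlist(parts), ncol = 3, byrow = TRUE)
third <- m[, 3]
third
```

The matrix form also makes it easy to pull the location code or the year from the other columns.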

If you're extracting not by numeric position, but just looking to extract the last value after a delimiter, you have a few different alternatives.

Use a greedy regex:

gsub(".*_(.*)", "\\1", mystring)
# [1] "C" "F"

Use a convenience function like stri_extract* from the "stringi" package:

library(stringi)
stri_extract_last_regex(mystring, "[A-Z]+")
# [1] "C" "F"
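A regex-free way to get the last piece in base R would be to split and index by each element's own length. A small sketch, assuming no input string is empty (the vector x is made up):

```r
# Made-up inputs with differing numbers of pieces
x <- c("sil_2007_1", "A_B_C_D")

parts <- strsplit(x, "_", fixed = TRUE)

# p[length(p)] picks the final piece regardless of how many there are
last_piece <- vapply(parts, function(p) p[length(p)], character(1))
last_piece
```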
Paralyze answered 16/10, 2013 at 17:47 Comment(5)
I like gsub() here, and might just do gsub(".*_.*_", "", mystring) or even (because regex matching is by default greedy) gsub(".*_", "", mystring) - Margemargeaux
I would imagine that adding "^" to the beginning of that pattern would ensure that you get the third item rather than the last of many items. Regex pattern interpretations are greedy. mystring <- c("A_B_C_D_E_F"); gsub(".*_.*_(.*)", "\\1", mystring) ... returns [1] "F" - Desperation
This is the pattern I found ensured the third (non-"_") item: "^[^_]+_[^_]+_([^_]+)_.*" - Desperation
@DWin, I definitely agree with you that a safer approach should be taken or that the OP should make sure they understand what they are doing if they are going the gsub route, but judging by their description and their sample output from strsplit, the pattern is pretty predictable (in which case, I like Josh's comment-answer better than mine). Thanks for the alternatives :) - Paralyze
@DWin -- Neat. Alternatively, we could use ? to make pieces of the regex non-greedy, like this: gsub(".*?_.*?_(.*?)_.*", "\\1", mystring), and something like this will work to get the 5th element: gsub("(.*?_){4}(.*?)_.*", "\\2", mystring). - Margemargeaux
P
0

Is this what you need?

x = c('aaa_9999_12', 'bbb_9999_20')
ids = sapply(x, function(v){strsplit(v, '_')[[1]][3]}, USE.NAMES = FALSE)

# optional
# ids = as.numeric(ids)

This is VERY inefficient, because strsplit is called once per element; there's probably a better way.
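One likely improvement: strsplit is already vectorized, so it can be called once on the whole column. Here is a sketch applied to a data frame, where oss and its contents are invented to mimic the question. The as.character() call is only a guard in case the id column is a factor, which strsplit would reject:

```r
# Hypothetical data frame mimicking the question's oss
oss <- data.frame(id = c("sil_2007_1", "sil_2007_2", "abc_2008_10"),
                  stringsAsFactors = FALSE)

# One vectorized strsplit call over the whole column
parts <- strsplit(as.character(oss$id), "_", fixed = TRUE)

# Take the third piece of each id and store it as an integer column
oss$point <- as.integer(vapply(parts, `[`, character(1), 3))
oss
```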

Pyrimidine answered 16/10, 2013 at 17:48 Comment(0)
N
0

Since stringr 1.5.0, str_split_i is available. This function lets you access the i-th element of a string split.

library(stringr)  # str_split_i requires stringr >= 1.5.0
x <- c('aaa_9999_12', 'bbb_9999_20')
str_split_i(x, '_', 3)
#[1] "12" "20"
Nahuatlan answered 3/3, 2023 at 16:19 Comment(0)
