How to split a string after the nth character in R

I am working with the following data:

District <- c("AR01", "AZ03", "AZ05", "AZ08", "CA01", "CA05", "CA11", "CA16", "CA18", "CA21")

I want to split each string after the second character and put the two parts into separate columns.

So that the data looks like this:

state  district
AR        01
AZ        03
AZ        05
AZ        08
CA        01
CA        05
CA        11
CA        16
CA        18
CA        21

Is there a simple way to do this? Thanks so much for your help.

Nativity answered 5/2, 2020 at 21:0 Comment(2)
Have you looked at substr? - Photophobia
I have not. I'm more familiar with strsplit(). But since there is nothing to split on, it's not applicable in this case. - Nativity

You can use substr if you always want to split after the second character.

District <- c("AR01", "AZ03", "AZ05", "AZ08", "CA01", "CA05", "CA11", "CA16", "CA18", "CA21")
# state: characters 1 through 2
state <- substr(District, 1, 2)
# district: characters 3 through 4
district <- substr(District, 3, 4)
# combine into a data frame if needed
st_dt <- data.frame(state = state, district = district, stringsAsFactors = FALSE)
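
If the split position is not always 2, the same substr idea could be wrapped in a small helper. This is only a minimal sketch: the function name split_at and the column names left and right are hypothetical, not from any package.

```r
# Hypothetical helper: split every string in x after its nth character.
split_at <- function(x, n) {
  data.frame(
    left  = substr(x, 1, n),             # characters 1 through n
    right = substr(x, n + 1, nchar(x)),  # everything after character n
    stringsAsFactors = FALSE
  )
}

District <- c("AR01", "AZ03", "CA211")
st_dt <- split_at(District, 2)
```

Because the second substr call runs to nchar(x), the right-hand part keeps its full length even when the strings are not all four characters long.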
Photophobia answered 5/2, 2020 at 21:3 Comment(1)
This is very helpful! Thank you for the code and for the annotation. - Nativity

You could use strcapture from base R:

strcapture("(\\w{2})(\\w{2})", District,
           data.frame(state = character(), District = character()))
   state District
1     AR       01
2     AZ       03
3     AZ       05
4     AZ       08
5     CA       01
6     CA       05
7     CA       11
8     CA       16
9     CA       18
10    CA       21

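Here \\w{2} matches exactly two word characters (letters, digits, or underscores), so the two groups capture the first two and the next two characters.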
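
Since \\w also accepts digits and underscores, a stricter pattern can make the intent explicit. The following is a sketch under an assumption that is not stated in the question: every code is exactly two capital letters followed by one or more digits.

```r
District <- c("AR01", "AZ03", "CA11")

# Assumption: two capital letters, then digits, and nothing else.
# The prototype data frame sets the names and types of the output columns.
proto <- data.frame(state = character(), district = character(),
                    stringsAsFactors = FALSE)
st_dt <- strcapture("^([A-Z]{2})([0-9]+)$", District, proto)
```

The anchors ^ and $ make the pattern cover the whole string, so the digit part may be longer than two characters.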

Damron answered 5/2, 2020 at 21:16

The OP has written

I'm more familiar with strsplit(). But since there is nothing to split on, it's not applicable in this case

Au contraire! There is something to split on and it's called lookbehind:

strsplit(District, "(?<=[A-Z]{2})", perl = TRUE) 

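The lookbehind matches the zero-width position immediately after two capital letters. It works like "inserting an invisible break" there, so strsplit splits the string without consuming any characters.

The result is a list of character vectors: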

[[1]]
[1] "AR" "01"

[[2]]
[1] "AZ" "03"

[[3]]
[1] "AZ" "05"

[[4]]
[1] "AZ" "08"

[[5]]
[1] "CA" "01"

[[6]]
[1] "CA" "05"

[[7]]
[1] "CA" "11"

[[8]]
[1] "CA" "16"

[[9]]
[1] "CA" "18"

[[10]]
[1] "CA" "21"

which can be turned into a matrix, e.g., by

do.call(rbind, strsplit(District, "(?<=[A-Z]{2})", perl = TRUE))
      [,1] [,2]
 [1,] "AR" "01"
 [2,] "AZ" "03"
 [3,] "AZ" "05"
 [4,] "AZ" "08"
 [5,] "CA" "01"
 [6,] "CA" "05"
 [7,] "CA" "11"
 [8,] "CA" "16"
 [9,] "CA" "18"
[10,] "CA" "21"
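
To get the two named columns the question asks for, the matrix could then be turned into a data frame. A minimal sketch; the variable names parts and st_dt are only illustrative.

```r
District <- c("AR01", "AZ03", "CA11")

# Split after the two leading capital letters, then bind the pieces row-wise.
parts <- do.call(rbind, strsplit(District, "(?<=[A-Z]{2})", perl = TRUE))

# Name the matrix columns while building the data frame.
st_dt <- data.frame(state = parts[, 1], district = parts[, 2],
                    stringsAsFactors = FALSE)
```

This relies on every string producing exactly two pieces, which holds for codes shaped like the ones in the question.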
Straightlaced answered 6/2, 2020 at 22:17 Comment(1)
Thank you for providing this. This facilitates efficiently splitting a large table of strings using tstrsplit in data.table. - Abort

We can use str_match to capture the first two characters and the rest of the string in separate columns.

stringr::str_match(District, "(..)(.*)")[, -1]

#      [,1] [,2]
# [1,] "AR" "01"
# [2,] "AZ" "03"
# [3,] "AZ" "05"
# [4,] "AZ" "08"
# [5,] "CA" "01"
# [6,] "CA" "05"
# [7,] "CA" "11"
# [8,] "CA" "16"
# [9,] "CA" "18"
#[10,] "CA" "21"
Equanimous answered 6/2, 2020 at 2:56

With the tidyverse this is very easy using the function separate from tidyr:

library(tidyverse)
District %>% 
  as_tibble() %>% 
  separate(value, c("state", "district"), sep = "(?<=[A-Z]{2})")

# A tibble: 10 × 2
   state district
   <chr> <chr>   
 1 AR    01      
 2 AZ    03      
 3 AZ    05      
 4 AZ    08      
 5 CA    01      
 6 CA    05      
 7 CA    11      
 8 CA    16      
 9 CA    18      
10 CA    21      
Bubbly answered 14/12, 2021 at 21:1

Treat it as a fixed-width file and import it:

# read fixed width file
read.fwf(textConnection(District), widths = c(2, 2), colClasses = "character")
#    V1 V2
# 1  AR 01
# 2  AZ 03
# 3  AZ 05
# 4  AZ 08
# 5  CA 01
# 6  CA 05
# 7  CA 11
# 8  CA 16
# 9  CA 18
# 10 CA 21
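
The columns can also be named during the import rather than left as V1 and V2. A sketch using the col.names argument of read.fwf; the explicit close() is an addition here, made because a text connection is already open when it is created.

```r
District <- c("AR01", "AZ03", "CA11")

# Read the vector as fixed-width text: two characters, then two more.
con <- textConnection(District)
st_dt <- read.fwf(con, widths = c(2, 2), colClasses = "character",
                  col.names = c("state", "district"))
close(con)  # release the text connection explicitly
```

Setting colClasses = "character" keeps the leading zeros in the district column.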
Epicure answered 26/5, 2020 at 20:53
