Convert latitude and longitude string vector into data frame
Asked Answered
N

5

5

I am struggling to parse the location strings I have in my data.

The location is inconveniently set up as a string with both the latitude and longitude info bundled together and I want to extract that info into a separate variable for each (and for each observation).

The data I'm trying to parse looks like this:

ID <- c(1, 2, 3)
location_1 <- c("lat:10.1234567,lng:-70.1234567", "lat:20.1234567891234,lng:-80.1234567891234", "lat:30.1234567,lng:-90.1234567")

df <- data.frame(ID, location_1)

ID   location_1
1     lat:10.1234567,lng:-70.1234567                                                
2     lat:20.1234567891234,lng:-80.1234567891234
3     lat:30.1234567,lng:-90.1234567

I'm trying to get them to look like this:

ID  latitude            longitude
1   10.1234567          -70.1234567
2   20.1234567891234    -80.1234567891234
3   30.12345            -90.12345

I've tried a few different solutions but I can't quite figure out the right phrasing to extract the coordinates.

One I tried was

f <- data.frame(Latitude = str_extract_all(dl$location_1, "\\d+\\.\\d+")[[1]], 
                 Longitude = str_extract_all(dl$location_1, "(?<=,\\s)\\d+\\.\\d+(?=\\))")[[1]])

another was

strcapture("\\(([-0-9.]+)\\s+([-0-9.]+)", location_1, proto = list(lon = 1,lat = 1))

but neither quite fit my original data so I keep getting NAs.

Newlywed answered 29/5 at 19:24 Comment(1)
Where do these strings come from? Why are they not WKT?Mortar
K
3

I use tidyr::separate_wider_delim to separate your single column into two columns, breaking at the comma. Then, with dplyr::across we can apply readr::parse_number to parse the number out of the string for both columns:

library(tidyr)
library(dplyr)
library(readr)
df |>
  separate_wider_delim(location_1, delim = ",", names = c("lat", "lon")) |>
  mutate(across(c(lat, lon), parse_number))
# # A tibble: 3 × 3
#      ID   lat   lon
#   <dbl> <dbl> <dbl>
# 1     1  10.1 -70.1
# 2     2  20.1 -80.1
# 3     3  30.1 -90.1  
Kazim answered 29/5 at 19:52 Comment(1)
This worked perfectly! It also helps that it left my other columns intactNewlywed
U
5

in base R, convert to a valid dcf and use read.dcf

a <- paste0("ID:", df$ID,"\n", gsub(",", "\n", df$location_1), collapse = "\n\n")
type.convert(read.dcf(textConnection(a), all = TRUE), as.is = TRUE)

  ID      lat       lng
1  1 10.12346 -70.12346
2  2 20.12346 -80.12346
3  3 30.12346 -90.12346
Unusual answered 29/5 at 20:18 Comment(0)
K
3

I use tidyr::separate_wider_delim to separate your single column into two columns, breaking at the comma. Then, with dplyr::across we can apply readr::parse_number to parse the number out of the string for both columns:

library(tidyr)
library(dplyr)
library(readr)
df |>
  separate_wider_delim(location_1, delim = ",", names = c("lat", "lon")) |>
  mutate(across(c(lat, lon), parse_number))
# # A tibble: 3 × 3
#      ID   lat   lon
#   <dbl> <dbl> <dbl>
# 1     1  10.1 -70.1
# 2     2  20.1 -80.1
# 3     3  30.1 -90.1  
Kazim answered 29/5 at 19:52 Comment(1)
This worked perfectly! It also helps that it left my other columns intactNewlywed
O
3

Your strcapture attempt has to outline the complete string, not just a part of it:

cbind(
    df,
    strcapture("lat:([-0-9.]+),lng:([-0-9.]+)", 
               df$location_1, proto=list(latitude=1,longitude=1))
)
#  ID                                 location_1 latitude longitude
#1  1             lat:10.1234567,lng:-70.1234567 10.12346 -70.12346
#2  2 lat:20.1234567891234,lng:-80.1234567891234 20.12346 -80.12346
#3  3             lat:30.1234567,lng:-90.1234567 30.12346 -90.1234
Outroar answered 29/5 at 21:25 Comment(0)
H
2

You could continue down the stringr/regex path with:

pattern <- "lat:(-?\\d+\\.\\d+),lng:(-?\\d+\\.\\d+)"

Where

  • the () define capture groups (which can be used in str_extract())
  • ? is a quantifier for 0 or 1 of the previous
library(dplyr)
library(stringr)

ID <- c(1, 2, 3)
location_1 <- c(
  "lat:10.1234567,lng:-70.1234567", 
  "lat:20.1234567891234,lng:-80.1234567891234", 
  "lat:30.1234567,lng:-90.1234567"
)

pattern <- "lat:(\\d+\\.\\d+),lng:(-?\\d+\\.\\d+)"

data.frame(ID, location_1) |>
  mutate(
    Latitude = str_extract(location_1, pattern, 1), 
    Longitude = str_extract(location_1, pattern, 2)
  ) |>
  select(-location_1)
#>   ID         Latitude         Longitude
#> 1  1       10.1234567       -70.1234567
#> 2  2 20.1234567891234 -80.1234567891234
#> 3  3       30.1234567       -90.1234567

Created on 2024-05-29 with reprex v2.1.0

Hyozo answered 29/5 at 20:0 Comment(0)
T
1

To me the obvious is gsub() + strsplit(), thus I add it as a slightly error-prone answer. I would go with @thelatemail's approach.

fn = \(char, p = "[a-zA-Z:]") {
  x = sapply(strsplit(gsub(p, "", char), ","), as.numeric)
  data.frame(latitude = x[1L, ], longitude = x[2L, ]) }
> cbind(df, fn(df$location_1))
  ID                                 location_1 latitude longitude
1  1             lat:10.1234567,lng:-70.1234567 10.12346 -70.12346
2  2 lat:20.1234567891234,lng:-80.1234567891234 20.12346 -80.12346
3  3             lat:30.1234567,lng:-90.1234567 30.12346 -90.12346

Note. data.table::tstrsplit has type.convert as an argument.

Teleran answered 29/5 at 22:9 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.