Purrr-Fection: In Search of An Elegant Solution to Conditional Data Frame Operations Leveraging Purrr
Asked Answered
H

3

8

The Background

I have an issue for which a number of solution pathways are possible, but I am convinced there is an as-yet-undiscovered elegant solution leveraging purrr.

The Example Code

I have a large data frame as follows, for which I have included an example below:

library(tibble)
library(ggmap)
library(purrr)
library(dplyr)

# Define Example Data
df <- frame_data(
  ~Street,                ~City,        ~State,     ~Zip,  ~lon,      ~lat,
  "226 W 46th St",        "New York",   "New York", 10036, -73.9867,  40.75902,
  "5th Ave",              "New York",   "New York", 10022, NA,        NA,
  "75 Broadway",          "New York",   "New York", 10006, -74.01205, 40.70814,
  "350 5th Ave",          "New York",   "New York", 10118, -73.98566, 40.74871,
  "20 Sagamore Hill Rd",  "Oyster Bay", "New York", 11771, NA,        NA,
  "45 Rockefeller Plaza", "New York",   "New York", 10111, -73.97771, 40.75915
)

The Challenge

I would like to geotag all locations for which the lon and lat columns are currently NA. There are many ways I could go about this, one of which is shown below:

# Safe Code is Great Code
safe_geocode <- safely(geocode)

# Identify Data to be Geotagged by Absence of lon and lat
data_to_be_geotagged <- df %>% filter(is.na(lon) | is.na(lat))

# GeoTag Addresses of Missing Data Points
fullAddress <- paste(data_to_be_geotagged$Street,
                     data_to_be_geotagged$City,
                     data_to_be_geotagged$State,
                     data_to_be_geotagged$Zip,
                     sep = ", ")

fullAddress %>% 
  map(safe_geocode) %>% 
  map("result") %>%
  plyr::ldply()

The Question

While I can get the above to work, and even wrangle the newly identified lon and lat coordinates back into the original data frame, the whole scheme feels dirty. I am convinced there is an elegant way to leverage piping and purrr to go through the data-frame and conditionally geotag the locations based on the absence of lon and lat.

I have been down a number of rabbit holes including purrr::pmap in an attempt to walk through multiple columns in parallel when constructing the full address (As well as rowwise() and by_row()). Nevertheless, I fall short in constructing anything that would qualify as an elegant solution.

Any insight provided would be most appreciated.

Hurleigh answered 26/8, 2016 at 18:17 Comment(3)
While somewhat amusing, your title doesn't do much in terms of describing the actual problem.Appalling
I thought map_if within a mutate would be helpful here, but it quickly became clear that I don't quite understand how it works. So far the only thing I've come up with uses unnest after a map-within-mutate but doesn't use purrr at all other than that.Talithatalk
@Talithatalk - I appreciate you weighing in. From every angle I've tried to tackle this, my initial tactic seems to quickly break down - And yet, from a high-level, it seems we should be able to perform operations row-by-row on one set of columns, based on the values of another. I can't shake the impression that I am missing something simple here.Hurleigh
C
7

Really, you want to avoid calling geocode any more than necessary because it's slow and if you're using Google, you only have 2500 queries per day. Thus, it's best to make both columns from the same call, which can be done with a list column, making a new version of the data.frame with do, or a self-join.


1. With a list column

With a list column, you make a new version of lon and lat with ifelse, geocoding if there are NAs, else just copying the existing values. Afterwards, get rid of the old versions of the columns and unnest the new ones:

library(dplyr)
library(ggmap)
library(tidyr)    # For `unnest`

       # Evaluate each row separately
df %>% rowwise() %>% 
    # Add a list column. If lon or lat are NA,
    mutate(data = ifelse(any(is.na(c(lon, lat))), 
                         # return a data.frame of the geocoded results,
                         list(geocode(paste(Street, City, State, Zip))), 
                         # else return a data.frame of existing columns.
                         list(data_frame(lon = lon, lat = lat)))) %>% 
    # Remove old columns
    select(-lon, -lat) %>% 
    # Unnest newly created ones from list column
    unnest(data)

## # A tibble: 6 × 6
##                 Street       City    State   Zip       lon      lat
##                  <chr>      <chr>    <chr> <dbl>     <dbl>    <dbl>
## 1        226 W 46th St   New York New York 10036 -73.98670 40.75902
## 2              5th Ave   New York New York 10022 -73.97491 40.76167
## 3          75 Broadway   New York New York 10006 -74.01205 40.70814
## 4          350 5th Ave   New York New York 10118 -73.98566 40.74871
## 5  20 Sagamore Hill Rd Oyster Bay New York 11771 -73.50538 40.88259
## 6 45 Rockefeller Plaza   New York New York 10111 -73.97771 40.75915

2. With do

do, on the other hand, creates a wholly new data.frame from pieces of the old one. It requires slightly clunky $ notation, with . to represent the grouped data.frame piped in. Using if and else instead of ifelse lets you avoid nesting results in lists (which they had to be above, anyway).

       # Evaluate each row separately
df %>% rowwise() %>% 
    # Make a new data.frame from the first four columns and the geocode results or existing lon/lat
    do(bind_cols(.[1:4], if(any(is.na(c(.$lon, .$lat)))){
        geocode(paste(.[1:4], collapse = ' '))
    } else {
        .[5:6]
    }))

which returns exactly the same thing as the first version.


3. On a subset, recombining with a self-join

If the ifelse is overly confusing, you can just geocode a subset and then recombine by binding the rows to the anti_join, i.e. all the rows that are in df but not the subset .:

df %>% filter(is.na(lon) | is.na(lat)) %>% 
    select(1:4) %>% 
    bind_cols(geocode(paste(.$Street, .$City, .$State, .$Zip))) %>% 
    bind_rows(anti_join(df, ., by = c('Street', 'Zip')))

which returns the same thing, but with the newly geocoded rows at the top. The same approach works with a list column or do, but since there's no need to combine two sets of columns, just bind_cols will do the trick.


4. On a subset with mutate_geocode

ggmap actually includes a mutate_geocode function that will add lon and lat columns when passed a data.frame and a column of addresses. It has an issue: it can't accept more than a column name for the address, and thus requires a single column with the entire address. Thus, while this version could be quite nice, it requires creating and deleting an extra column with the whole address, making it inconcise:

df %>% filter(is.na(lon) | is.na(lat)) %>% 
    select(1:4) %>% 
    mutate(address = paste(Street, City, State, Zip)) %>%    # make an address column
    mutate_geocode(address) %>% 
    select(-address) %>%    # get rid of address column
    bind_rows(anti_join(df, ., by = c('Street', 'Zip')))

##                 Street       City    State   Zip       lon      lat
## 1              5th Ave   New York New York 10022 -73.97491 40.76167
## 2  20 Sagamore Hill Rd Oyster Bay New York 11771 -73.50538 40.88259
## 3 45 Rockefeller Plaza   New York New York 10111 -73.97771 40.75915
## 4          350 5th Ave   New York New York 10118 -73.98566 40.74871
## 5          75 Broadway   New York New York 10006 -74.01205 40.70814
## 6        226 W 46th St   New York New York 10036 -73.98670 40.75902

5. Base R

Base R can assign to a subset directly, which makes the idiom here much simpler, even if it requires a lot of subsetting:

df[is.na(df$lon) | is.na(df$lat), c('lon', 'lat')] <- geocode(paste(df$Street, df$City, df$State, df$Zip)[is.na(df$lon) | is.na(df$lat)])

Results are the same as the first version.


All versions only call geocode twice.

Note that while you could use purrr for the job, it's not particularly better suited than regular dplyr. purrr excels at dealing with lists, and while a list column is one option, it doesn't really have to be manipulated.

Crosscrosslet answered 27/8, 2016 at 2:8 Comment(6)
Cyrus and Alistaire - You guys both nailed it - Thank you for the detailed explanations, it helps tremendously. Accepting this second answer as it prevents duplicative calls to geocode.Hurleigh
Perhaps I spoke too soon. Running your Solution #1, which appears to be the ideal solution, results in an error: Error: is.data.frame(data) is not TRUEHurleigh
Hmm, do you have something else called data in your environment? You might just try renaming the column. Alternately, just take out the name in unnest; it operates on all list columns by default anyway.Crosscrosslet
Alistaire - You are a machine. As I was posting this update that the mistake was on my end, your inferred root cause was spot on. All is resolved (data was in my environment...). Thanks again!Hurleigh
@Crosscrosslet mutate_geocode now plays well with tibbles - Ref: 7772b3aAlcantara
@StevenBeaupré Thanks for the update! Edited for the new version.Crosscrosslet
M
3

I'm not sure abut purrr but here's the following using the pipe:

df <- frame_data(
  ~Street,                ~City,        ~State,     ~Zip,  ~lon,      ~lat,
  "226 W 46th St",        "New York",   "New York", 10036, -73.9867,  40.75902,
  "5th Ave",              "New York",   "New York", 10022, NA,        NA,
  "75 Broadway",          "New York",   "New York", 10006, -74.01205, 40.70814,
  "350 5th Ave",          "New York",   "New York", 10118, -73.98566, 40.74871,
  "20 Sagamore Hill Rd",  "Oyster Bay", "New York", 11771, NA,        NA,
  "45 Rockefeller Plaza", "New York",   "New York", 10111, -73.97771, 40.75915
)

df2<-df %>%
  filter(is.na(lon) | is.na(lat)) %>%
  group_by(Street, City, State) %>% #not really necessary but it suppresses a warning
  mutate(lon=ifelse(is.na(lon) | is.na(lat), 
    geocode(paste(Street, City,State, sep=" ")), 0)) %>%
  mutate(lat=ifelse(is.na(lon) | is.na(lat), 
    rev(geocode(paste(Street, City,State, sep=" "))), 0))

If you want the partial output like in your example code above:

as.data.frame(df2)[,5:6]
       lon       lat
1 40.77505 -73.96515
2 40.88259 -73.50538

Or include all columns:

as.data.frame(df2)
              Street       City    State   Zip      lon       lat
1             5th Ave   New York New York 10022 40.77505 -73.96515
2 20 Sagamore Hill Rd Oyster Bay New York 11771 40.88259 -73.50538

And if you want to combine your original data with the new data you can do the following:

as.data.frame(rbind(filter(df, !is.na(lon) | !is.na(lat)),df2 ))
                Street       City    State   Zip       lon       lat
1        226 W 46th St   New York New York 10036 -73.98670  40.75902
2          75 Broadway   New York New York 10006 -74.01205  40.70814
3          350 5th Ave   New York New York 10118 -73.98566  40.74871
4 45 Rockefeller Plaza   New York New York 10111 -73.97771  40.75915
5              5th Ave   New York New York 10022  40.77505 -73.96515
6  20 Sagamore Hill Rd Oyster Bay New York 11771 -73.96515  40.77505

...Or you can streamline it all in one like in the following (keeps original order):

df2<-df %>%
  #group_by(Street, City, State) %>% # unescape if you want to suppress warning
  mutate(lon=ifelse(is.na(lon) | is.na(lat), 
    geocode(paste(Street, City,State, sep=" ")), lon)) %>%
  mutate(lat=ifelse(is.na(lon) | is.na(lat), 
    rev(geocode(paste(Street, City,State, sep=" "))), lat))

as.data.frame(df2)
                Street       City    State   Zip       lon       lat
1        226 W 46th St   New York New York 10036 -73.98670  40.75902
2              5th Ave   New York New York 10022 -73.98670  40.75902
3          75 Broadway   New York New York 10006 -74.01205  40.70814
4          350 5th Ave   New York New York 10118 -73.98566  40.74871
5  20 Sagamore Hill Rd Oyster Bay New York 11771  40.75902 -73.98670
6 45 Rockefeller Plaza   New York New York 10111 -73.97771  40.75915
Macario answered 26/8, 2016 at 20:45 Comment(0)
P
0

Using dplyr:

df %>% mutate( lon = case_when( is.na(lon) ~ geocode(paste(Street, City, State, Zip))[,1],
                                TRUE       ~ lon),

               lat = case_when( is.na(lat) ~ geocode(paste(Street, City, State, Zip))[,2],
                                TRUE       ~ lat )
               )
Publias answered 17/10, 2017 at 22:33 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.