Is there a way to deal with nested data with sparklyr?
Asked Answered
D

4

10

In the following example I've loaded a parquet file that contains a nested record of map objects in the meta field. sparklyr seems to do a nice job of dealing with these. However tidyr::unnest does not translate to SQL (or HQL - understandably - like LATERAL VIEW explode()) and is thus not usable. Is there a way to unnest data some other way?

tfl <- head(tf)
tfl
Source:   query [?? x 10]
Database: spark connection master=yarn-client app=sparklyr local=FALSE

                            trkKey             meta     sources startTime
                             <chr>           <list>      <list>    <list>
1 3juPe-k0yiMcANNMa_YiAJfJyU7WCQ3Q <S3: spark_jobj> <list [24]> <dbl [1]>
2 3juPe-k0yiAJX3ocJj1fVqru-e0syjvQ <S3: spark_jobj>  <list [1]> <dbl [1]>
3 3juPe-k0yisY7UY_ufUPUo5mE1xGfmNw <S3: spark_jobj>  <list [7]> <dbl [1]>
4 3juPe-k0yikXT5FhqNj87IwBw1Oy-6cw <S3: spark_jobj> <list [24]> <dbl [1]>
5 3juPe-k0yi4MMU63FEWYTNKxvDpYwsRw <S3: spark_jobj>  <list [7]> <dbl [1]>
6 3juPe-k0yiFBz2uPbOQqKibCFwn7Fmlw <S3: spark_jobj> <list [19]> <dbl [1]>
# ... with 6 more variables: endTime <list>, durationInMinutes <dbl>,
#   numPoints <int>, maxSpeed <dbl>, maxAltitude <dbl>, primaryKey <chr>

There is also an issue when the data is collected. E.g.,

tfl <- head(tf) %>% collect()
tfl
# A tibble: 6 × 10
                            trkKey             meta     sources startTime
                             <chr>           <list>      <list>    <list>
1 3juPe-k0yiMcANNMa_YiAJfJyU7WCQ3Q <S3: spark_jobj> <list [24]> <dbl [1]>
2 3juPe-k0yiAJX3ocJj1fVqru-e0syjvQ <S3: spark_jobj>  <list [1]> <dbl [1]>
3 3juPe-k0yisY7UY_ufUPUo5mE1xGfmNw <S3: spark_jobj>  <list [7]> <dbl [1]>
4 3juPe-k0yikXT5FhqNj87IwBw1Oy-6cw <S3: spark_jobj> <list [24]> <dbl [1]>
5 3juPe-k0yi4MMU63FEWYTNKxvDpYwsRw <S3: spark_jobj>  <list [7]> <dbl [1]>
6 3juPe-k0yiFBz2uPbOQqKibCFwn7Fmlw <S3: spark_jobj> <list [19]> <dbl [1]>
# ... with 6 more variables: endTime <list>, durationInMinutes <dbl>,
#   numPoints <int>, maxSpeed <dbl>, maxAltitude <dbl>, primaryKey <chr>

tfl %>% unnest(meta)
Error: Each column must either be a list of vectors or a list of data frames [meta]

In the above, the meta file still contains spark_jobj elements instead of lists, data.frames, or even JSON strings (which is how Hive would return such data). This creates a situation where tidyr doesn't even work on the collected data.

Is there a way to get sparklyr to work more nicely with tidyr that I am missing? If not, is this planned for future sparklyr development?

Dumbfound answered 1/9, 2016 at 16:52 Comment(0)
D
2

I finally have my answer to this. See https://mitre.github.io/sparklyr.nested/ (source: https://github.com/mitre/sparklyr.nested)

tf %>% 
  sdf_unnest(meta)

This will behave for Spark data frames similarly to how tidyr::unnest behaves for local data frames. Nested select and explode operations are also implemented.

Update:

as @cem-bilge notes explode can be used inside mutate. This is effective in situations where the array is simple (character or numeric) but less great in other situations.

iris2 <- copy_to(sc, iris, name="iris")
iris_nst <- iris2 %>%
  sdf_nest(Sepal_Length, Sepal_Width, Petal.Length, Petal.Width, .key="data") %>%
  group_by(Species) %>%
  summarize(data=collect_list(data))

Then

iris_nst %>% mutate(data = explode(data)) %>% sdf_schema_viewer()

produces

enter image description here

with the fields still nested (though exploded) whereas sdf_unnest yields

iris_nst %>% sdf_unnest(data) %>% sdf_schema_viewer()

enter image description here

Dumbfound answered 29/8, 2017 at 0:45 Comment(2)
I find it surprising, that this important functionality is not part of the sparklyr-package. I wonder why.Biff
Would sdf_separate_column be n alternative solution to the problem of nested tables? github.com/rstudio/sparklyr/issues/690Biff
D
2

This isn't exactly a proper solution, however one work around is to use Hive to generate a table or view (e.g., create view db_name.table_name as select ...). that handles the explode operation. This serves up flat data for sparklyr to work on. Where sc is a spark connection via sparklyr one can use DBI::dbGetQuery(sc, "USE db_name") assuming Hive is configured and the view will then show up when listing tables with src_tbls(sc). Once you execute dat <- tbl(sc, "table_name") then it should be smoother sailing from there.

Since this is not a sparklyr solution (but more of a Hive solution) I'll not accept this answer.

Dumbfound answered 3/9, 2016 at 9:56 Comment(0)
D
2

I finally have my answer to this. See https://mitre.github.io/sparklyr.nested/ (source: https://github.com/mitre/sparklyr.nested)

tf %>% 
  sdf_unnest(meta)

This will behave for Spark data frames similarly to how tidyr::unnest behaves for local data frames. Nested select and explode operations are also implemented.

Update:

as @cem-bilge notes explode can be used inside mutate. This is effective in situations where the array is simple (character or numeric) but less great in other situations.

iris2 <- copy_to(sc, iris, name="iris")
iris_nst <- iris2 %>%
  sdf_nest(Sepal_Length, Sepal_Width, Petal.Length, Petal.Width, .key="data") %>%
  group_by(Species) %>%
  summarize(data=collect_list(data))

Then

iris_nst %>% mutate(data = explode(data)) %>% sdf_schema_viewer()

produces

enter image description here

with the fields still nested (though exploded) whereas sdf_unnest yields

iris_nst %>% sdf_unnest(data) %>% sdf_schema_viewer()

enter image description here

Dumbfound answered 29/8, 2017 at 0:45 Comment(2)
I find it surprising, that this important functionality is not part of the sparklyr-package. I wonder why.Biff
Would sdf_separate_column be n alternative solution to the problem of nested tables? github.com/rstudio/sparklyr/issues/690Biff
D
1

Here is another option that doesn't depend on Hive (at least as directly, LATERAL VIEW explode() is a hive thing).

tf %>% 
  sdf_mutate(ft_sql_transformer(
    b, paste0("SELECT trkKey, a.fld1 as fld1, a.fld2 as fld2",
              "FROM __THIS__ LATERAL VIEW explode(__THIS__.meta) x AS a")))

I'm not going to accept this answer either since I would still like to see something like:

tf %>%
  sdf_mutate(a=ft_explode(meta))

but that would require support for nested select statements. Perhaps something like the tidyr::unnest syntax would do the trick:

tf %>%
  sdf_mutate(a=ft_explode(meta)) %>%
  unnest(a)
Dumbfound answered 27/10, 2016 at 21:10 Comment(0)
W
0

You can also use explode() directly in a mutate() to expand arrays in sparklyr.

df %>%
  mutate(my_values = explode(my_array))

Note: sparklyr.nested is not required for this.

Weeper answered 10/10, 2018 at 7:48 Comment(1)
This works great if my_array is a simple character/numeric array. Trouble if it is an array of structs though.Dumbfound

© 2022 - 2024 — McMap. All rights reserved.