How to find the path of an element in a nested list

How can I find the path of an element in a nested list without manually digging through a list in a View?

Here is an example that I can already deal with:

l1 <- list(x = list(a = "no_match", b = "test_noname", c ="test_noname"),
           y = list(a = "test_name"))

After looking for an off-the-shelf solution in other packages, I found this approach (strongly inspired by rlist::list.search):

list_search <- function(l, f) {
  ulist <- unlist(l, recursive = TRUE, use.names = TRUE)
  match <- f(ulist)
  ulist[match]
}
list_search(l1, f = \(x) x == "test_noname")
          x.b           x.c 
"test_noname" "test_noname" 

This works pretty well as it’s easy to understand that the name “x.b” here can be translated for access like this:

l1[["x"]][["b"]]
[1] "test_noname"
# Or
purrr::pluck(l1, "x", "b")
[1] "test_noname"

And I can get all elements on the same level, by leaving out the last level index:

l1[["x"]]
$a
[1] "no_match"

$b
[1] "test_noname"

$c
[1] "test_noname"

This is usually my goal, as I know the values/name of one of the elements I want to get to and other similar elements are placed on the same sub-level (or sub-sub-sub-sub-sub-sub-sub-level).

However, many JSON files on the internet are not quite meant for easy consumption and parse into much more complicated lists, that look more like this:

l2 <- list(x = list("no_match", list("test_noname1", "test_noname2")), y = list(a = "test_name"))
str(l2)
List of 2
 $ x:List of 2
  ..$ : chr "no_match"
  ..$ :List of 2
  .. ..$ : chr "test_noname1"
  .. ..$ : chr "test_noname2"
 $ y:List of 1
  ..$ a: chr "test_name"
list_search(l2, f = \(x) x == "test_noname1")
            x2 
"test_noname1" 

From the resulting names, I would guess that the element “x2” can be accessed like this:

l2[["x2"]]
NULL
# or maybe
l2[["x"]][[2]]
[[1]]
[1] "test_noname1"

[[2]]
[1] "test_noname2"

But to not also rake in “test_noname2” here, I actually need something like this:

l2[["x"]][[2]][[1]]
[1] "test_noname1"
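To make the mismatch concrete, here is a small sketch using only base R. It suggests that the suffix in “x2” is generated while flattening, so it cannot be read as a list position or used as a name.

```r
l2 <- list(x = list("no_match", list("test_noname1", "test_noname2")),
           y = list(a = "test_name"))

flat <- unlist(l2, recursive = TRUE, use.names = TRUE)

# The name attached to the matching leaf after flattening
flat_name <- names(flat)[flat == "test_noname1"]
print(flat_name)

# That name does not exist anywhere in the nested list ...
print(is.null(l2[[flat_name]]))

# ... whereas the real path mixes a name with two positions
print(l2[["x"]][[2]][[1]])
```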

Background

I often need to find the path of a known value when getting data through webscraping. Then I might have a username or URL that I know is somewhere in the data, but it's tedious to actually find it. Once one value is identified, it becomes easy to generalise to its siblings, which are unknown so far. In the toy example, this would look like this:

l2[["x"]][[2]]
[[1]]
[1] "test_noname1"

[[2]]
[1] "test_noname2"

Only in reality, the lists I'm working with are nested much deeper.

So the issue is essentially the unnamed elements in the list: unlist (and rapply, for that matter) does not assign them names that are easy to translate back into a path. Ideally there would be an automated way to translate these into a pluck call.

Defant answered 17/3 at 10:19 Comment(5)
I think you can refer to a similar question https://mcmap.net/q/1770842/-how-to-run-function-on-the-deepest-level-only-in-a-nested-list/12158757 and the answer by @G. Grothendieck over there again https://mcmap.net/q/1770842/-how-to-run-function-on-the-deepest-level-only-in-a-nested-list - Apodal
What is your actual question? What is best? Or how to parse JSON and return specific JSON details from dynamic information? - Energumen
Knowing that the value "test_noname1" exists, nested somewhere within a list (derived from JSON or not, but JSON is the main culprit for deeply nested lists), how do I find its path, i.e., l2[["x"]][[2]][[1]] or l2/"x"/2/1? - Defant
OK, thanks. I'm not really sure why you have the last section of code in your question. It seems like you don't want to get "test_noname2", just "test_noname1", but then the last section ("and then generalise it") just re-adds the content you were previously trying to filter out? - Energumen
I found it a little tough to come up with a good example. I'm just saying that I need the full path so that I can make the cut at another arbitrary point to get, e.g., the siblings of the value I already know is there. The last part is not really part of the question, but more background so that someone does not come up with an answer that simply lets me search the value I already know is there without giving me the path. - Defant

If the question is how to get the path given the contents of a cell, then use rrapply from the package of the same name:

library(rrapply)

ix <- rrapply(l2, 
  condition = \(x) x == "test_noname1",
  f = \(x, .xpos) .xpos,
  how = "flatten")

unlist(ix)
## 11 12 13 
##  1  2  1 

l2[[unlist(ix)]]
## [1] "test_noname1"

library(purrr)
pluck(l2, !!!unlist(ix))
## [1] "test_noname1"
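Since the goal in the question is to reach the siblings, a short follow-up sketch: dropping the last entry of the position vector gives the parent. The vector is hard-coded here (as `pos`, a name chosen for illustration) with the values shown by `unlist(ix)` above, so the snippet runs without rrapply.

```r
l2 <- list(x = list("no_match", list("test_noname1", "test_noname2")),
           y = list(a = "test_name"))

# Values taken from unlist(ix) above; names dropped for clarity
pos <- c(1L, 2L, 1L)

# `[[` with an integer vector indexes recursively: l2[[1]][[2]][[1]]
print(l2[[pos]])

# Dropping the last position yields the parent, i.e. all siblings
siblings <- l2[[head(pos, -1)]]
str(siblings)
```

The same shortened vector should also work with `pluck(l2, !!!head(pos, -1))`.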

Note

Input from question

l2 <- list(x = list("no_match", list("test_noname1", "test_noname2")),
           y = list(a = "test_name"))
Brashear answered 17/3 at 11:51 Comment(1)
rrapply is perfect! The function also has two other special arguments, .xparents and .xname, that are also super interesting. Thanks! - Defant

Here is a way with the jsonStrings package:

library(jsonStrings)
library(jsonlite)

l2 <- list(x = list("no_match", list("test_noname1", "test_noname2")),
           y = list(a = "test_name"))
## make a jsonString
jstring <- jsonString$new(toJSON(l2, auto_unbox = TRUE))
## get all paths
paths <- jstring$flatten()$keys()
# "/x/0"   "/x/1/0" "/x/1/1" "/y/a"  
## test each path
vapply(paths, function(path) {
  jspatch <- list(list(op = "test", path = path, value = "test_noname2")) |> 
    toJSON(auto_unbox = TRUE)
  !inherits(try(jstring$patch(jspatch), silent = TRUE), "try-error")
}, logical(1L)) |> which() |> names()
# "/x/1/1"
Umbles answered 18/3 at 10:0 Comment(0)

Update

@JBGruber points out in a comment that I didn't really talk about discovering key or value paths, which was the point of the original question.

I updated the GitHub version of rjsoncons to include new functionality j_find_values(), j_find_values_grep(), j_find_keys(), j_find_keys_grep() to directly enable this.

Until available on CRAN, install with

remotes::install_github("mtmorgan/rjsoncons")

Use as

json = '{"x":["no_match",["test_noname1","test_noname2"]],"y":{"a":"test_name"}}'

j_find_values(json, "test_noname2", as = "data.frame")
##     path        value
## 1 /x/1/1 test_noname2

The key is a JSONpointer path, which j_query() supports (in addition to JMESpath and JSONpath). So the sibs are...

j_query(json, '/x/1', as = "R")
## [1] "test_noname1" "test_noname2"

The functions work directly with R objects, but the path returned is not so useful...

j_find_values(l2, "test_noname2", auto_unbox = TRUE, as = "data.frame")
##     path        value
## 1 /x/1/1 test_noname2
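One likely reason the path is less useful for an R object is that JSONpointer indices are 0-based while R lists are 1-based. Below is a minimal base-R sketch of a translation. `pointer_to_r_path()` and `follow_path()` are hypothetical helpers, not part of rjsoncons, and the sketch assumes that purely numeric tokens are array indices and that no `~0` / `~1` escapes occur.

```r
l2 <- list(x = list("no_match", list("test_noname1", "test_noname2")),
           y = list(a = "test_name"))

# Hypothetical helper: turn "/x/1/1" into list("x", 2L, 2L).
# Assumption: all-digit tokens are 0-based array indices, everything else is a key.
pointer_to_r_path <- function(pointer) {
  tokens <- strsplit(sub("^/", "", pointer), "/", fixed = TRUE)[[1]]
  lapply(tokens, function(tok) {
    if (grepl("^[0-9]+$", tok)) as.integer(tok) + 1L else tok
  })
}

# Hypothetical helper: descend into the list one key or position at a time
follow_path <- function(x, path) {
  Reduce(function(acc, key) acc[[key]], path, x)
}

path <- pointer_to_r_path("/x/1/1")
print(follow_path(l2, path))                # the matched value
str(follow_path(l2, head(path, -1)))        # its siblings
```

An object key that consists only of digits would be misread as an index here, so the result is worth checking against the original document.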

For more details see the help page and vignette section.

Original answer

I'll mention the CRAN package rjsoncons and JMESpath / JSONpath / JSONpointer as a way to query JSON documents directly to R objects.

I converted your R object back to json

> json = jsonlite::toJSON(l2, auto_unbox = TRUE)
> json
{"x":["no_match",["test_noname1","test_noname2"]],"y":{"a":"test_name"}}

And then explored it interactively using rjsoncons::j_query() and JMESpath

> j_query(json, "x[0]")  # JSON arrays are 0-based
[1] "no_match"
> j_query(json, "x[1]", as = "R")
[1] "test_noname1" "test_noname2"
> j_query(json, "x[1][0]", as = "R")
[1] "test_noname1"

Nested objects are queried using ., so

> j_query(json, "y.a")
[1] "test_name"

In practice I explore novel JSON documents using listviewer::jsonedit(json)

Performance and other considerations

In response to the comment by @CarlWitthoft, for problems of this size performance is obviously a very secondary consideration -- any successful approach will complete in a fraction of a second.

rjsoncons works best on the original JSON string, file or connection, rather than an R object coerced to JSON. In these cases the data is processed mostly in C, and only the result of interest returned to R. For small or medium sized JSON objects performance does not really matter, it is the flexibility of JMESpath / JSONpath, and JSONpointer that might make rjsoncons appealing, although the trade-off is learning an arcane query syntax versus an arcane set of R list manipulation commands.

One place where rjsoncons is particularly useful is when processing 'newline-delimited' JSON, where each line is a complete JSON object. Often these records have identical or very similar structure. There are several StackOverflow questions that are NDJSON-like, e.g., with each row in a data.frame a JSON object; see the rjsoncons Examples vignette or, e.g., Efficient conversion of json data in R on StackOverflow.

An rjsoncons NDJSON vignette provides some real-world examples and a performance comparison with competitors (DuckDB's JSON parser turns out to be really fast and scalable for the SQL pros...). There isn't a viable R-only competitor at the moment: the ndjson CRAN package turns out to be quite slow for reasons that are not particularly clear; yyjsonr seems like it will have NDJSON parsing at some point (it was recently re-introduced on GitHub), which is likely to be 5-10x faster than rjsoncons but without the flexibility of queries; RcppSimdJson is also fast and supports JSONpointer, but JSONpointer is sometimes not flexible enough. See further discussion in the rjsoncons NDJSON vignette.

As an aside, an R object can be translated to JSON and then queried with something like j_query(l2, "x[1]", auto_unbox = TRUE, as = "R"); this calls jsonlite::toJSON() 'under the hood'.

Priapitis answered 17/3 at 17:12 Comment(4)
Is this any faster than the OP's initial code hack? - Schism
@CarlWitthoft I added a section on performance and other considerations - Priapitis
I wasn't aware of rjsoncons. Very cool! But from your code it's not quite clear to me how to use it for the actual problem: finding the path of a value or key in a JSON object or nested list representation of a JSON string. I already know how to find specific values. My point was that I might want to find the siblings of a key-value pair later, and so need the path to that already known value. - Defant
yeah @Defant I agree I kind of lost sight of your question. I updated rjsoncons to directly support finding keys and values, and updated my response with the new information. - Priapitis
