How can I subset a list, based on list element details, using sapply?

I want to output a subset of a list of pathnames, conditional on characters in the list elements (prefix in the filename). I can get a for loop to work, but I want to do it using sapply because I'm guessing it's a better approach.

The for loop works; the result is a list of pathnames with filenames starting with 500 or higher. The sapply version does not work; i does not increment as I expect, and there may be other problems with appending elements to the new list.

# -----------------------------------------
# add pathnames of files that have a prefix of 500 or higher
# to a new list
# -----------------------------------------
library(stringr) # provides str_sub()

# make list
# myList <- as.list(list.files(path = "D:/test", pattern = "\\.txt$", full.names = TRUE))
myList <- list("D:/test/472_a.txt", "D:/test/303_b.txt", "D:/test/500_a.txt", "D:/test/505_b.txt", "D:/test/700_a.txt")
# preallocate subsetted list
myListSubset <- vector("list", length = length(myList))


# -----------------------------------------
# for loop - this works
# -----------------------------------------
for (i in 1:length(myList)) {
  print(paste("i is", i))
  print(paste(i, "element of myList is", myList[i]))
  swath <- str_sub(basename(paste(myList[i], collapse = "")), 1, 3)
  # only add swaths greater than or equal to 500 to the subsetted list
  if (as.numeric(swath) >= 500) {
    print(paste("swath #", swath))
    myListSubset[[i]] <- paste(myList[i], collapse = "")
  }
} 
# remove Null elements
print(myListSubset)
myListSubset[sapply(myListSubset, is.null)] <- NULL
print(myListSubset)
# -----------------------------------------



# -----------------------------------------
# sapply - this does not work
# -----------------------------------------
i <- 1
sapply(myList, function(s){
  swath <- str_sub(basename(s), 1, 3) # swath is the 1st 3 digits in file name
  print(paste("swath #", swath))
  if (swath >= 500) {
    print(paste("list element is", s, "and the class is", class(s)))
    print(paste("i is", i, "and the class is", class(i)))
    myListSubset[[i]] <- s
    i <- (i + 1)
    print(paste("i is", i))
  }
}
)
# remove Null elements
print(myListSubset)
myListSubset[sapply(myListSubset, is.null)] <- NULL
print(myListSubset)
# -----------------------------------------

Towne answered 27/9, 2024 at 15:18 Comment(1)

I should have mentioned that the basename will always be 3 digits followed by an underscore. E.g., xxx_*.txt. But other numbers can be in the path or basename (after the "_"). – Towne 27/9, 2024 at 18:0

Using sapply on the list and then gsub on the path to get the number.

myList[sapply(myList, \(x) 
  as.numeric(gsub(".*/(\\d+)_.*", "\\1", x)) >= 500)]

output

[[1]]
[1] "D:/test/500_a.txt"

[[2]]
[1] "D:/test/505_b.txt"

[[3]]
[1] "D:/test/700_a.txt"
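
As a side note on why the sapply attempt in the question leaves myListSubset empty: assignments made with <- inside the anonymous function only change local copies of myListSubset and i, so nothing is stored outside the call and i starts at 1 again for every element. A minimal sketch of the usual alternative, where the function only returns TRUE or FALSE and the subsetting happens outside (vapply and the name is_high are my choices here, not part of the question):

```r
myList <- list("D:/test/472_a.txt", "D:/test/303_b.txt", "D:/test/500_a.txt",
               "D:/test/505_b.txt", "D:/test/700_a.txt")

# Return one TRUE/FALSE per element instead of assigning inside the function.
is_high <- vapply(myList, function(s) {
  swath <- substr(basename(s), 1, 3)  # first 3 characters of the file name
  as.numeric(swath) >= 500            # compare as a number, not as a string
}, logical(1))

# Subset the original list with the logical vector.
myListSubset <- myList[is_high]
print(myListSubset)
```

No preallocation or removal of NULL elements is needed this way, because the list is never modified from inside the function.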

Leeway answered 27/9, 2024 at 16:6 Comment(2)

thank you, this solution works. I tried changing some of the pathnames - there are differences but the basename will always be 3 digits followed by an underscore. One question if you can follow up - can we isolate the list element to just the basename? I'm not up to speed on regular expressions so I'm not sure if variations in the pathname (so not the text file name) would mess it up. – Towne 27/9, 2024 at 17:57

.*/ is greedy and will afaik always give you the basename (an exception would be if the path is just a dot without forward slash, but that would not be a valid path). If you're sure the name will always be XXX_ you can change \\d+ to \\d{3}, but you do not have to be that specific. – Leeway 27/9, 2024 at 18:24
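
To see how the greedy .*/ behaves, here is a small sketch with made-up paths (the paths vector is purely illustrative) that contain digits in the directory names and after the underscore. It uses \\d{3} as suggested in the comment above:

```r
# Hypothetical paths with extra digits in the directory and after the "_"
paths <- c("D:/test/500_a.txt",
           "D:/run2024/test99/505_b12.txt",
           "C:/data_01/303_c7.txt")

# ".*/" consumes everything up to the last "/", so "(\\d{3})_" is matched
# against the start of the file name only.
prefix <- as.numeric(sub(".*/(\\d{3})_.*", "\\1", paths))
print(prefix)

# Keep the paths whose prefix is 500 or higher.
paths[prefix >= 500]
```

Calling basename() first and matching on the file name alone would be an alternative that does not rely on the forward slash at all.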

Like this? No loops.

myList <- list("D:/test/472_a.txt", "D:/test/303_b.txt", "D:/test/500_a.txt", "D:/test/505_b.txt", "D:/test/700_a.txt")

i <- myList |>
  unlist() |>
  basename() |>
  grep("^[5-9]\\d\\d", x = _)

myList[i]
#> [[1]]
#> [1] "D:/test/500_a.txt"
#> 
#> [[2]]
#> [1] "D:/test/505_b.txt"
#> 
#> [[3]]
#> [1] "D:/test/700_a.txt"

Created on 2024-09-27 with reprex v2.1.0

Rutger answered 27/9, 2024 at 15:19 Comment(2)

thank you, this solution works, even when I added extra numbers to the basenames after the "_" (what I have in reality). I will go with this solution as it seems robust and straightforward. However I will mark the other solution using sapply as the answer to be consistent with the question. – Towne 27/9, 2024 at 17:52

could you please tell me what "x = _" means in the grep function? I'm still learning regular expressions, thank you. – Towne 1/10, 2024 at 16:12
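
On the x = _ question: the underscore is not part of the regular expression. It is the placeholder of R's native pipe |> (available from R 4.2.0), and it marks the named argument that should receive the left-hand side of the pipe. A short sketch with made-up file names, showing that the piped call and the plain call are equivalent:

```r
files <- c("472_a.txt", "500_a.txt", "700_a.txt")  # illustrative names

# Without a placeholder the left-hand side would become the first argument
# of grep(), which is `pattern`. `x = _` sends it to `x` instead.
piped <- files |> grep("^[5-9]\\d\\d", x = _)

# The same call written without the pipe.
plain <- grep("^[5-9]\\d\\d", x = files)

print(piped)
identical(piped, plain)
```

The regular expression itself is only the first argument: ^ anchors the match at the start of the name, [5-9] requires a first digit of 5 or more, and \\d\\d requires two further digits.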

The base R function Filter() extracts the elements of a vector for which a predicate (logical) function returns TRUE. We can use sub() with the regex "\\D+" to remove the first run of non-digit characters from the basename (for these names, everything after the leading number) and then compare the result to 500.

Filter(
    \(x) as.integer(sub("\\D+", "", basename(x))) >= 500,
    myList
)

# [[1]]
# [1] "D:/test/500_a.txt"

# [[2]]
# [1] "D:/test/505_b.txt"

# [[3]]
# [1] "D:/test/700_a.txt"
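
If the file names can contain further digits after the underscore (as the asker mentions in a comment on the question), stripping non-digits may no longer leave a clean number. A hedged variant, assuming the prefix is always the leading digits before the first underscore, cuts the name at the underscore instead (myList2 and prefix_of are invented for this sketch):

```r
# Invented paths with digits after the "_"
myList2 <- list("D:/test/472_a1.txt", "D:/test/500_a22.txt", "D:/test/700_b3.txt")

# Drop everything from the first "_" onward, then convert what is left.
prefix_of <- function(x) as.integer(sub("_.*", "", basename(x)))

res <- Filter(\(x) prefix_of(x) >= 500, myList2)
print(res)
```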

Reneerenegade answered 27/9, 2024 at 15:29 Comment(2)

strtoi is 4 characters less to write. :) Wouldn't sub be enough? (+1) – Thermae 27/9, 2024 at 15:45

@Thermae but as.integer() is more explicit :). You're right though sub() is better here, edited, thanks. – Reneerenegade 27/9, 2024 at 15:56

You could use keep() + parse_number() from the tidyverse:

library(tidyverse)
keep(myList, ~parse_number(.) >= 500)

[[1]]
[1] "D:/test/500_a.txt"

[[2]]
[1] "D:/test/505_b.txt"

[[3]]
[1] "D:/test/700_a.txt"

Dubbin answered 27/9, 2024 at 16:16 Comment(0)
