Alternative to pivot_wider for a large data set

I have a large data set (35.8 GB, over 1 billion rows) that I need to tidy.

Here is some reproducible code to show its format.

library(data.table)
library(dplyr)
library(tidyr)

id <- c(1, 1, 2, 2)
item <- c("a1", "a2", "a1", "a2")
itemformat <- c("AA", "AB", "BB", "AB")
full <- cbind(id, item, itemformat)
full <- as.data.table(full)
full

This code works for this example:

full2 <- full %>% 
  pivot_wider(names_from = item, values_from = itemformat)
full2

However, when I use this method on the full dataset, R either reports that it cannot allocate a vector of size 3.8 GB (even after restarting R and calling gc()) or simply crashes (after running for over 30 minutes).

I have noticed that fread() and fwrite() from the data.table package are much faster than anything else I've used (e.g., read.csv() and read_csv()).

So, is there a way (per the data.table package, perhaps?) to rearrange this data into having roughly 800,000 columns?

Also, is 800,000 columns above some sort of limit?

After tidying this, the file size should go down (hopefully).

Side note: I have 64 GB of RAM.

Thank you!

Taboret answered 20/10, 2022 at 0:35 Comment(10)
Not sure if it will work, but dcast is the data.table equivalent, as shown here at one of the canonical questions for reshaping: https://mcmap.net/q/86974/-how-to-reshape-data-from-long-to-wide-format , with the full documentation here: cran.r-project.org/web/packages/data.table/vignettes/…Bisutun
Rent cloud based hardware for the few minutes it's required. You have exceeded desktop PC sensibilities. Don't fight the future.Fritz
Curious: Why do you need to reshape from long to wide? Usually long makes data processing/tidying much easier; and 800,000 columns sounds quite unwieldy. That said, reading a 35.8 GB monolithic file is challenging in itself (nevermind the format). Is there an option to read from a database instead?Colvert
@Bisutun Thank you! I'll look into that. I tried looking up the functions, but my quick search turned up dry. Thank you for the links!Taboret
@Fritz Thank you for the response! I just might look into that lol.Taboret
@MauritsEvers Thank you for the response! The data is genetic data, and I am trying to tidy it so that one row is for one person and each column is a single nucleotide polymorphism (SNP). I want to be able to treat each SNP as a variable so I can do some logistic regression-type stuff. I hope that helps explain my thought process.Taboret
Do you actually need all 800,000 SNPs? If you’re only interested in a subset, you could filter before pivoting. Otherwise I agree you may need a cloud-based solution.Greenhorn
@ScottHebert So you are planning to do a regression model with 800,000 SNPs as predictors? That sounds like a challenge from a modelling point of view (sparsity, collinearities, etc.). I agree with zephryl, is there a way to reduce the number of SNPs? I did some work on this many years ago and remember a lot of discussions on ways to identify functional & relevant SNPs. Either way, I don't think there is a simple plug-n-play solution here.Colvert
@Greenhorn I plan to do some dimension reduction to only use the most relevant SNPs (based on whatever method of dimension reduction I use), but I'd like to get the data into this format first. If I could do this with participants each taking up multiple rows, though, I would do that, though I don't know of a way to make that work. Thank you for your response, by the way!Taboret
@MauritsEvers I plan to do dimension reduction (LASSO or something similar) to only use the best-predicting SNPs, though I'd like to tidy the data first. If I can do dimension reduction with participants each covering hundreds of thousands of rows, I would do that, though I don't know of a way to make that work.Taboret

I generally agree with the comments that such a large dataset will be hard to work with locally. That said, you could try pivoting in chunks, saving each chunk to disk and then re-loading and binding all the pivoted chunks.

set.seed(13)
library(tidyr)
library(dplyr)
library(purrr)

# example data: 9 cases with 3 SNPs each
(data_long <- expand_grid(id = 1:9, item = paste0("a", 1:3)) |>
  mutate(itemformat = sample(c("AA", "AB", "BB"), 27, replace = TRUE)))
#> # A tibble: 27 × 3
#>       id item  itemformat
#>    <int> <chr> <chr>     
#>  1     1 a1    BB        
#>  2     1 a2    AA        
#>  3     1 a3    AB        
#>  4     2 a1    AA        
#>  5     2 a2    AB        
#>  6     2 a3    AB        
#>  7     3 a1    AA        
#>  8     3 a2    BB        
#>  9     3 a3    AA        
#> 10     4 a1    AB        
#> # … with 17 more rows

# define chunks to process 3 cases at a time
ids <- unique(data_long$id)
cases_per_chunk <- 3
chunks <- tibble(
    start = seq(1, length(ids), by = cases_per_chunk),
    stop = start + cases_per_chunk - 1
  ) |>
  mutate(filename = paste0("SNPs_wide", row_number(), ".rds"))

# pivot and save in chunks of 3 cases
pwalk(chunks, \(start, stop, filename) {
  data_long |>
    filter(id %in% ids[start:stop]) |>
    pivot_wider(names_from = item, values_from = itemformat) |>
    saveRDS(filename)
})

# load pivoted chunks and row-bind
(data_wide <- map_dfr(chunks$filename, readRDS))
#> # A tibble: 9 × 4
#>      id a1    a2    a3   
#>   <int> <chr> <chr> <chr>
#> 1     1 BB    AA    AB   
#> 2     2 AA    AB    AB   
#> 3     3 AA    BB    AA   
#> 4     4 AB    BB    BB   
#> 5     5 BB    AA    AA   
#> 6     6 AB    AA    BB   
#> 7     7 AA    AA    BB   
#> 8     8 BB    AB    AA   
#> 9     9 AA    BB    BB

Created on 2022-10-20 with reprex v2.0.2

A few other thoughts:

  • You could also tinker with reading your initial dataset in chunks (say, 800K rows at a time), then pivoting and saving each chunk as above; see the sketch after this list.
  • Tools optimized for large datasets (e.g., arrow) might be worth a look.
  • Where did these data come from? Is it possible, and perhaps easier, to do something upstream to get the data into a wide format?
  • I don't know much about Bioconductor, but it's a project focused on tools for bioinformatics in R, so I would poke around there if you haven't already.
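
To expand on the first bullet, here is a rough sketch of what reading the raw file in fixed-size chunks with data.table::fread() (which the question notes is fast) could look like. The file name full.csv, the chunk size, and the total row count are assumptions, and it assumes the rows for each id are stored contiguously with the chunk size a multiple of the number of SNPs per person (with 800K SNPs, one person per chunk), so no id is split across two chunks.

library(data.table)

rows_per_chunk <- 800000L   # assumed chunk size: one person per chunk
total_rows     <- 1e9       # assumed known total number of data rows (e.g., from wc -l)
n_chunks       <- ceiling(total_rows / rows_per_chunk)

for (chunk in seq_len(n_chunks) - 1L) {
  # read only this chunk: skip the header plus all earlier chunks
  dt <- fread(
    "full.csv",
    skip      = 1L + chunk * rows_per_chunk,
    nrows     = rows_per_chunk,
    header    = FALSE,
    col.names = c("id", "item", "itemformat")
  )
  # pivot this chunk to wide and save it to disk
  wide <- dcast(dt, id ~ item, value.var = "itemformat")
  saveRDS(wide, sprintf("SNPs_wide_chunk%04d.rds", chunk))
}

# stack the pivoted chunks back together
data_wide <- rbindlist(
  lapply(sprintf("SNPs_wide_chunk%04d.rds", seq_len(n_chunks) - 1L), readRDS),
  fill = TRUE
)
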
Greenhorn answered 20/10, 2022 at 13:55 Comment(1)
Thank you for your thorough comment! I didn't get notified of this, so I've already been able to tidy the data using dcast(), but I imagine that your answer will help others out in the future. It also may help me if I face a problem similar to this again. Also, since the data is wider and thus tidy, it has now gone to under 3 GB, which is much more manageable. Anyway, thanks again!Taboret

A billion rows is pushing the limits for R on a personal computer, in my experience; if you're open to alternatives, you could process the data out of memory with AWK:

cat full.csv
"id","item","itemformat"
"1","a1","AA"
"1","a2","AB"
"2","a1","BB"
"2","a2","AB"

awk 'BEGIN{ FS=OFS="," }
# first pass: collect the set of item names (the future column names), skipping the header
NR==FNR {
    if (NR > 1) {
        items[$2]
    }
    next
}
# second pass, header line: print "id" followed by one column per item
FNR==1 {
    printf "%s", $1
    for (item in items) {
        printf "%s%s", OFS, item
    }
    print ""
    next
}
# new id in column 1: print the completed row for the previous id, then remember the new id
$1 != prev[1] {
    if ( FNR > 2 ) {
        prt()
    }
    split($0,prev)
}
# store the value for this item, keyed by item name
{
  values[$2] = $3
}
END { prt() }

# print one wide row: the id followed by the value for each item, then reset
function prt(item, value) {
    printf "%s", prev[1]
    for (item in items) {
        value = values[item]
        printf "%s%s", OFS, value
    }
    print ""
    delete values
}' full.csv full.csv > full2.csv

cat full2.csv
"id","a1","a2"
"1","AA","AB"
"2","BB","AB"

Edit

I think @zephryl's answer is the best solution to your problem (+1), but it might be worth doing some benchmarking to make sure it doesn't require more resources than you have available, or an unreasonable amount of time. For my AWK answer, with an example dataset of a billion rows (~14 GB), this method uses at most ~500 MB of RAM and takes ~45 minutes to do the pivot on my laptop (MacBook Pro 2017, 2.3 GHz dual-core i5, 8 GB RAM):

# Create a large example dataset (~1 billion rows)
awk 'BEGIN{print "id,item,itemformat"; for (i=800000; i<1000000000; i++) {printf "%d,%s%d,%s\n", i/800000, "a", i%800000+1, (rand()>0.5?"AA":"AB")}}' > full.csv

head full.csv
id,item,itemformat
1,a1,AA
1,a2,AA
1,a3,AB
1,a4,AA
1,a5,AA
1,a6,AA
1,a7,AB
1,a8,AB
1,a9,AA

## "Items" printed in any order
awk 'BEGIN{ FS=OFS="," }
NR==FNR {
    if (NR > 1 ) {
        items[$2]
    }
    next
}
FNR==1 {
    printf "%s", $1
    for (item in items) {
        printf "%s%s", OFS, item
    }
    print ""
    next
}
$1 != prev[1] {
    if ( FNR > 2 ) {
        prt()
    }
    split($0,prev)
}
{
  values[$2] = $3
}
END { prt() }

function prt(item, value) {
    printf "%s", prev[1]
    for (item in items) {
        value = values[item]
        printf "%s%s", OFS, value
    }
    print ""
    delete values
}' full.csv full.csv > full2.csv


###############################

# "Items" printed in the 'original' order
awk '
BEGIN{ FS=OFS="," }
NR==FNR {
    PROCINFO["sorted_in"] = "@val_num_asc"
    if (NR > 1 ) {
        items[$2]=NR
    }
    next
}
FNR==1 {
    printf "%s", $1
    for (item in items) {
        printf "%s%s", OFS, item
    }
    print ""
    next
}
$1 != prev[1] {
    if ( FNR > 2 ) {
        prt()
    }
    split($0,prev)
}
{
  values[$2] = $3
}
END { prt() }

function prt(item, value) {
    printf "%s", prev[1]
    for (item in items) {
        value = values[item]
        printf "%s%s", OFS, value
    }
    print ""
    delete values
}' test.csv test.csv > output.csv
Receive answered 20/10, 2022 at 2:44 Comment(2)
Thank you for your comment! I wasn't notified of these comments, so I apologize for my delayed response. AWK is great -- I've used it a couple times with this data! Fortunately, though, I was able to tidy the data with dcast(). However, your answer is great and may prove to be useful to others (and perhaps me) in the future! Thanks again!Taboret
Glad you solved your problem @ScottHebert! Please take the time to 'answer your own question' (stackoverflow.com/help/self-answer) detailing the dcast() approach, as I suspect others will stumble across this question and want to know how you handled it.Receive

Update: I was able to do this with data.table::dcast(), using roughly this form:

full2 <- full %>% 
  dcast(id ~ item, value.var = "itemformat")
full2

I deleted my actual code (dumb idea), but I believe this is essentially what I ran. It worked on the sample data set above, was quick, and handled two different large data sets.
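
For anyone who lands here later, a minimal end-to-end sketch of that approach, assuming the raw data sit in a CSV named full.csv with columns id, item, and itemformat (the file names here are placeholders, not from the original post):

library(data.table)

# fast, multi-threaded read of the long-format file
full <- fread("full.csv")

# one row per id, one column per item; value.var names the column holding the genotypes
full_wide <- dcast(full, id ~ item, value.var = "itemformat")

# write the wide result back out
fwrite(full_wide, "full_wide.csv")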

Taboret answered 31/10, 2022 at 17:6 Comment(0)
