How to use readLines in R to read all lines between a certain range?

I am trying to split a large JSONL (.gz) file into a number of .csv files. With the code below I have been able to create a working .csv file for the first 25,000 entries. I now want to read and parse lines 25,001 to 50,000, and have been unable to do so. I feel like it should be easy, but my search has been fruitless so far.

Is there a way to use the 'n' argument of the readLines function to select a specific range of lines?

(p.s. I'm learning;))

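setwd("filename")

# List the matching input files and inspect the first one
a <- list.files(pattern = "(.*?).0.jsonl.gz")
a[1]

# Read the first 25,000 lines and parse them as a single JSON array
raw.data <- readLines(gzfile(a[1]), warn = TRUE, n = 25000)
rd <- fromJSON(paste("[", paste(raw.data, collapse = ','), ']'))
rd2 <- do.call("cbind", rd)

# Write the result to a compressed .csv named after the first input file
file <- paste0(a[1], ".csv.gz")
write.csv.gz(rd2, file, na = "", row.names = FALSE)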
Silesia answered 3/9, 2018 at 12:50

The read_lines() function in the readr package is faster than base::readLines(), and its skip and n_max arguments let you choose how many lines to skip before reading and how many lines to read. For example:

library(readr)
myFile <- "./data/veryLargeFile.txt"

first25K <- read_lines(myFile,skip=0,n_max = 25000)

second25K <- read_lines(myFile,skip=25000,n_max=25000) 
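The skip value for any later range follows the same pattern: chunk k of a given size starts after (k - 1) * size lines. A minimal sketch of that arithmetic, where read_chunk and its arguments are hypothetical names introduced here for illustration and the readr package is assumed to be installed:

```r
# Hypothetical helper (not part of readr): read the k-th chunk (1-based)
# of `size` lines from a text file using readr::read_lines().
read_chunk <- function(path, k, size = 25000) {
  stopifnot(k >= 1, size >= 1)
  # Skip all lines belonging to earlier chunks, then read one chunk
  readr::read_lines(path, skip = (k - 1) * size, n_max = size)
}

# Lines 25,001 to 50,000 are chunk 2 with the default chunk size:
# second25K <- read_chunk("./data/veryLargeFile.txt", k = 2)
```

Each call starts again from the top of the file and skips forward, so later chunks take longer to reach than earlier ones.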

Here is a complete, working example using the NOAA StormData data set. The file describes the location, event type, and damage information for over 900,000 extreme weather events in the United States between 1950 and 2011. We will use readr::read_lines() to read the first 50,000 lines in groups of 25,000 after downloading and unzipping the file.

Warning: the compressed file is about 50 MB.

library(R.utils) 
library(readr)
dlMethod <- "curl"
if(substr(Sys.getenv("OS"),1,7) == "Windows") dlMethod <- "wininet"
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url,destfile='StormData.csv.bz2',method=dlMethod,mode="wb")
bunzip2("StormData.csv.bz2","StormData.csv")

first25K <- read_lines("StormData.csv",skip=0,n_max = 25000)

second25K <- read_lines("StormData.csv",skip=25000,n_max=25000)

...and the objects as viewed in the RStudio Environment Viewer:

[Screenshot: first25K and second25K in the RStudio Environment Viewer]

Here are the performance timings comparing base::readLines() with readr::read_lines() on an HP Spectre x-360 laptop with an Intel i7-6500U processor.

> # check performance of readLines()
> system.time(first25K <- readLines("StormData.csv",n=25000))
   user  system elapsed 
   0.05    0.00    0.04 
> # check performance of readr::read_lines()
> system.time(first25K <- read_lines("StormData.csv",skip=0,n_max = 25000))
   user  system elapsed 
   0.00    0.00    0.01 
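base::readLines() has no skip argument, but if readr is not an option, the same ranges can probably be reached with base R alone by opening the connection once and calling readLines() repeatedly: an open connection keeps its position, so each call continues where the previous one stopped. A rough sketch under that assumption, where read_in_chunks and process_chunk are hypothetical names introduced here and are not part of any package:

```r
# Sketch: read a (possibly gzipped) text file in fixed-size chunks with base R.
# process_chunk is a hypothetical callback that receives the lines of one
# chunk and the 1-based index of that chunk.
read_in_chunks <- function(path, chunk_size = 25000, process_chunk) {
  con <- gzfile(path, open = "r")  # gzfile() can also read uncompressed files
  on.exit(close(con))              # always release the connection
  chunk <- 0
  repeat {
    # Continues from the current position of the open connection
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0) break  # nothing left to read
    chunk <- chunk + 1
    process_chunk(lines, chunk)
  }
  invisible(chunk)                 # number of chunks processed
}

# Small demonstration: a 10-line temporary file read in chunks of 4 lines
tmp <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:10), tmp)
sizes <- integer(0)
n_chunks <- read_in_chunks(
  tmp, chunk_size = 4,
  process_chunk = function(lines, i) sizes[i] <<- length(lines)
)
```

In the JSONL case from the question, the callback would be the place to parse each chunk and write it to its own .csv file. Because the file is read once from start to finish, no lines are re-read or skipped.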
Lightproof answered 3/9, 2018 at 22:6