Counting the number of rows of a series of csv files

Asked 16/1, 2013 at 12:36 Answered 29/5, 2021 at 17:10

I'm working through an R tutorial and suspect that I have to use one of these functions but I'm not sure which (Yes I researched them but until I become more fluent in R terminology they are quite confusing).

In my working directory there is a folder "specdata". Specdata contains hundreds of CSV files named 001.csv - 300.csv.

The function I am working on must count the total number of rows across a given set of CSV files. So if the argument to the function is 1:10 and each of those files has ten rows, it should return 100.

Here's what I have so far:

complete <- function(directory,id = 1:332) {
    setpath <- paste("/Users/gcameron/Desktop",directory,sep="/")
    setwd(setpath)
    csvfile <- sprintf("%03d.csv", id)
    file <- read.csv(csvfile)
    nrow(file)
}

This works when the ID argument is one number, say 17. But, if I input say 10:50 as an argument, I receive an error:

Error in file(file, "rt") : invalid 'description' argument

What should I do to count the total number of rows across all the files selected by the id argument?

Bornie answered 16/1, 2013 at 12:36 Comment(0)

read.csv expects to read just one file, so you need to loop over the files. An idiomatic R way of doing so is to use sapply:

nrows <- sapply( csvfile, function(f) nrow(read.csv(f)) )
sum(nrows)

For example, here is a rewrite of your complete function:

complete <- function(directory,id = 1:332) {
    csvfiles <- sprintf("/Users/gcameron/Desktop/%s/%03d.csv", directory, id)
    nrows <- sapply( csvfiles, function(f) nrow(read.csv(f)) )
    sum(nrows)
}
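A minimal sketch of the same idea without the hard-coded Desktop path, assuming `directory` is the path to the folder itself and the files follow the `%03d.csv` naming from the question. The name `count_rows` and the temporary demo files are hypothetical and only there to make the example runnable. It uses vapply instead of sapply so that each file must yield exactly one integer.

```r
# Hypothetical helper: total number of data rows across the selected files.
count_rows <- function(directory, id = 1:332) {
    # file.path() is vectorised, so this builds one path per id
    csvfiles <- file.path(directory, sprintf("%03d.csv", id))
    # vapply() checks that every file yields exactly one integer
    nrows <- vapply(csvfiles, function(f) nrow(read.csv(f)), integer(1))
    sum(nrows)
}

# Demo on throwaway files: three CSVs with ten rows each.
demo_dir <- file.path(tempdir(), "specdata_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:3) {
    write.csv(data.frame(x = 1:10, y = rnorm(10)),
              file.path(demo_dir, sprintf("%03d.csv", i)),
              row.names = FALSE)
}

total <- count_rows(demo_dir, 1:3)
print(total)
```

Passing the directory as a full or relative path keeps the function independent of any one machine's folder layout.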

Skirling answered 16/1, 2013 at 12:44 Comment(5)

Thanks. A few follow-up questions if you have a sec. 1) Where do I put this line? Within the function "complete" or after it? 2) If after it, do I not have to declare the object csvfile again for scope? 3) Your parameter "function(f)" - is that just the name of the function I made in its place? i.e. nrows <- sapply( csvfile, complete(f# what goes here?) nrow(read.csv(f) As you can no doubt tell, I'm struggling a bit with this. – Bornie 16/1, 2013 at 12:50

I've edited my answer. Also note that your original function never resets the working directory when it is done, which leaves the session in a different directory after every call. – Skirling 16/1, 2013 at 12:54
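If changing the working directory really is wanted, one common pattern is to restore it with on.exit. The following is only a sketch under that assumption; the helper name `rows_in_dir` and the temporary demo file are hypothetical.

```r
# Hypothetical helper: reads one file relative to `directory` and
# restores the caller's working directory when the function exits.
rows_in_dir <- function(directory, filename) {
    old_wd <- setwd(directory)          # setwd() returns the previous directory
    on.exit(setwd(old_wd), add = TRUE)  # runs even if read.csv() fails
    nrow(read.csv(filename))
}

# Demo on a throwaway file with five rows.
demo_dir <- file.path(tempdir(), "wd_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(x = 1:5), file.path(demo_dir, "001.csv"), row.names = FALSE)

before <- getwd()
n <- rows_in_dir(demo_dir, "001.csv")
after <- getwd()
```

Building full paths, as the edited answer does, avoids the need for setwd altogether.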

Thanks a ton I'm really grateful for this. That has worked. Having now seen it I can make sense of it. – Bornie 16/1, 2013 at 12:59

length(count.fields(f)) is probably a lot quicker than nrow(read.csv(f)). (You can test this hypothesis with system.time.) – Fluent 16/1, 2013 at 14:14
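A small sketch of how the two approaches compare on a throwaway file, assuming a comma-separated file with a header line. Note that count.fields returns one entry per line, so the header is included in its count, while read.csv treats the header as column names.

```r
# Throwaway CSV: one header line plus ten data rows.
f <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:10, y = letters[1:10]), f, row.names = FALSE)

# Rows as read.csv() sees them (header excluded).
n_read <- nrow(read.csv(f))

# Lines as count.fields() sees them (header included).
n_lines <- length(count.fields(f, sep = ","))

print(c(read.csv = n_read, count.fields = n_lines))
```

For files with a header, subtracting one from the count.fields result should bring the two numbers into agreement. Whether it is actually faster can be checked by wrapping each call in system.time.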

I get a "Error in cc$id : $ operator is invalid for atomic vectors" error from trying this method. – Monogamous 20/1, 2014 at 2:57

Homework problems usually get tagged as such, though I don't know if that is required, but this clearly is homework.

Your function as written expects id to be a single value (despite the default being a vector of integers), because read.csv can only open one file at a time.

Change it to either use one of the *apply functions (more concise and common), or even an explicit loop. For each element in the id vector, you must call a function that opens that file and counts the observations.

This stackoverflow post has a good explanation of the differences between the *apply functions.
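The explicit-loop option mentioned above could look roughly like this. It is a sketch, assuming `directory` is the path to the folder and the files follow the `%03d.csv` naming from the question; the name `count_rows_loop` and the demo files are hypothetical.

```r
# Hypothetical helper: add up the row counts one file at a time.
count_rows_loop <- function(directory, id) {
    total <- 0L
    for (i in id) {
        f <- file.path(directory, sprintf("%03d.csv", i))
        total <- total + nrow(read.csv(f))
    }
    total
}

# Demo on throwaway files: three CSVs with four rows each.
demo_dir <- file.path(tempdir(), "loop_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:3) {
    write.csv(data.frame(x = 1:4),
              file.path(demo_dir, sprintf("%03d.csv", i)),
              row.names = FALSE)
}

loop_total <- count_rows_loop(demo_dir, 1:3)
print(loop_total)
```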

Sadfaced answered 16/1, 2013 at 12:47 Comment(3)

The homework tag is deprecated. – Humphreys 16/1, 2013 at 12:49

ok, thanks. I looked to see if that was covered in the faq, but didn't see it. I still think it's useful to know when it is homework, as I'm willing to provide a complete answer for someone trying to finish something at work, but would rather give hints and direction for homework. – Sadfaced 16/1, 2013 at 12:54

This is indeed a good point, and I thought too that they would have added that to the faq since most users don't read the blog or metaSE. – Humphreys 16/1, 2013 at 13:3

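id <- 1:332
filenames <- list.files(path = "source_path", full.names = TRUE)

# Start empty and append one row per file; assigning df inside the
# loop without rbind would keep only the last file's result.
df <- data.frame()

for (a in id) {
    dataset <- read.csv(filenames[a])

    # na.exclude drops rows containing NA, so this counts complete rows only
    res <- nrow(na.exclude(dataset))

    df <- rbind(df, data.frame(id = a, nobs = res, stringsAsFactors = FALSE))
}

df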

Eskimoaleut answered 9/11, 2020 at 13:52 Comment(0)

complete <- function(directory, id = 1:332) {
  # "\\.csv$" matches a literal ".csv" suffix; full.names = TRUE returns
  # paths that work whether or not directory ends in a slash
  mylist <- list.files(path = directory, pattern = "\\.csv$", full.names = TRUE)
  result <- data.frame()
  for (i in id) {
    my_data <- read.csv(mylist[i])
    res <- nrow(na.exclude(my_data))  # rows with no missing values
    df <- data.frame(id = i, nobs = res, stringsAsFactors = FALSE)
    result <- rbind(result, df)
  }
  result
}
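Note that na.exclude drops rows containing missing values, so the function above reports complete rows rather than all rows, which differs from what the question asks for. A small sketch of the difference on a made-up data frame:

```r
# Made-up data: four rows, two of which contain an NA.
d <- data.frame(a = c(1, NA, 3, 4),
                b = c("x", "y", NA, "z"))

total_rows    <- nrow(d)                 # every row
complete_rows <- nrow(na.exclude(d))     # rows without any NA
same_count    <- sum(complete.cases(d))  # equivalent, without copying the data

print(c(total = total_rows, complete = complete_rows))
```

To count every row, as the question asks, nrow(my_data) on its own is enough.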

Waterworks answered 29/5, 2021 at 17:10 Comment(0)
