I am analysing a dataset with 200 rows and 1200 columns, stored in a .csv file. To process it, I read the file using R's read.csv() function.
R takes ≈ 600 seconds to read this dataset. Later I had an idea: I transposed the data inside the .csv file and read it again using read.csv(). I was amazed to see that it took only ≈ 20 seconds, i.e. ≈ 30 times faster.
I verified this over the following iterations:
Reading 200 rows and 1200 columns (Not transposed)
> system.time(dat <- read.csv(file = "data.csv", sep = ",", header = F))
user system elapsed
610.98 6.54 618.42 # 1st iteration
568.27 5.83 574.47 # 2nd iteration
521.13 4.73 525.97 # 3rd iteration
618.31 3.11 621.98 # 4th iteration
603.85 3.29 607.50 # 5th iteration
Reading 1200 rows and 200 columns (Transposed)
> system.time(dat <- read.csv(file = "data_transposed.csv",
sep = ",", header = F))
user system elapsed
17.23 0.73 17.97 # 1st iteration
17.11 0.69 17.79 # 2nd iteration
20.70 0.89 21.61 # 3rd iteration
18.28 0.82 19.11 # 4th iteration
18.37 1.61 20.01 # 5th iteration
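For anyone who wants to try this, here is a minimal sketch that reproduces the setup with synthetic data (assuming purely numeric values; the file names simply mirror the ones above):
set.seed(1)
m <- matrix(rnorm(200 * 1200), nrow = 200)   # 200 rows, 1200 columns
write.table(m, "data.csv", sep = ",",
            row.names = FALSE, col.names = FALSE)
write.table(t(m), "data_transposed.csv", sep = ",",
            row.names = FALSE, col.names = FALSE)
system.time(read.csv("data.csv", header = FALSE))             # wide: 200 x 1200
system.time(read.csv("data_transposed.csv", header = FALSE))  # tall: 1200 x 200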
In any dataset we put observations in rows, and the columns contain the variables to be observed. Transposing changes this structure. Is it good practice to transpose the data for processing, even though it makes the data look weird?
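If the transposed layout is used only for I/O, the original orientation can be restored in memory after reading. A minimal sketch, assuming purely numeric data (t() coerces a data frame to a matrix):
dat <- t(as.matrix(read.csv("data_transposed.csv", header = FALSE)))
dim(dat)   # back to 200 x 1200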
I am wondering what makes R read the dataset faster when I transpose it. I am sure it is because the dimensions changed from 200 * 1200 to 1200 * 200 after the transpose operation.
Why does R read the data faster when I transpose it?
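One factor worth ruling out is per-column overhead: read.csv guesses a class for every column unless colClasses is supplied, so a file with 1200 columns incurs more of that work than one with 200. A sketch of this test, assuming all columns are numeric (colClasses is recycled across columns):
system.time(dat <- read.csv(file = "data.csv", header = FALSE,
                            colClasses = "numeric",   # skips per-column type guessing
                            nrows = 200))             # pre-sizes the result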
Update: Research & experiments
I initially asked this question because RStudio was taking a long time to read and compute on a high-dimensional dataset (many columns compared to rows: 200 rows, 1200 columns), using the built-in R function read.csv(). I read the comments below and, as per their suggestions, later experimented with read.csv2() and fread(). Both work well, but they too are slow on my original dataset [200 rows * 1200 columns], and they read the transposed dataset faster.
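To see where fread spends its time in each orientation, its verbose output lists the internal steps it performs (a sketch; assumes the data.table package is installed):
library(data.table)
system.time(dt1 <- fread("data.csv", header = FALSE, verbose = TRUE))
system.time(dt2 <- fread("data_transposed.csv", header = FALSE, verbose = TRUE))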
I observed that the same holds for MS Excel and LibreOffice Calc. I even tried opening the files in the Sublime Text editor, and even this text editor read the transposed data easily (fast). I still cannot figure out why all these applications behave this way. All of these apps struggle when your data has many more columns than rows.
So, to wrap up the whole story, I have only 3 questions:
- What kind of issue is this? Is it related to the operating system, or is it an application-level problem?
- Is it good practice to transpose the data for processing?
- Why do R and/or other apps read my data faster when I transpose it?
My experiments perhaps helped me rediscover some 'already known' wisdom, but I could not find anything relevant on the internet. Kindly share such good programming/data-analysis practices.
Comments:
- […] csv format; will it be the same if I use some other data formats? – Bluff
- .CSV is the most popular data format. You are probably right; I will experiment with read_csv() for sure, thanks. Meanwhile I tried to read these two datasets in the Sublime Text editor, and as I expected the results were the same: Sublime takes much more time to read the non-transposed dataset, while it loads the transposed data in relatively less time. – Bluff
- […] unless colClasses= are supplied (see their docs). Besides guessing classes, fread is parallelized, but only over rows, not columns (as far as I know), which could also explain its performance difference. You could also read the verbose output with fread(..., verbose=TRUE) to see the operations it is taking. – Manchukuo