So here's the situation. I've got an 85 Million row table with 18 columns. Three of these columns have values in Metric Prefix / SI notation (See Metric Prefix on Wikipedia).
This means I have number like :
- .1M instead of 100000 or 1e+5, or
- 1K instead of 1000 or 1e+3
Sample data.table is
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18
1: 2014-03-25 12:15:12 58300 3010 44.0 4.5 0.0 0 0 0.8 50 0.8 10K 303 21K 0 a 56
2: 2014-03-25 12:15:12 56328 3010 28.0 12.0 0.0 0 0 0.3 60 0.0 59 62 .1M 0 a 66
3: 2014-03-25 12:15:12 21082 3010 10.0 1.7 0.0 0 0 14.0 72 0.3 4K 208 8K 1 a 80
4: 2014-03-25 12:15:12 59423 3010 12.0 0.0 0.2 0 0 88.0 0 0.0 20 16 71 0 a 26
5: 2014-03-25 12:15:12 59423 3010 9.6 1.4 0.0 0 0 60.0 29 0.2 2K 251 6K 0 a 56
6: 2014-03-25 12:15:12 24193 3010 8.3 1.9 0.0 0 0 9.9 80 0.3 3K 264 8K 1 a 71
7: 2014-03-25 12:15:12 21082 3010 7.1 1.7 0.4 0 0 6.3 83 0.3 3K 197 7K 0 a 71
8: 2014-03-25 12:15:12 59423 3010 4.6 1.2 0.0 0 0 57.0 37 0.1 998 81 7K 0 a 118
I modified a function written by Hans-Jörg Bibiko who used it to modify ggplot2 scales. See website here if you are iterested. The function I ended up using is :
sitor <- function(x)
{
conv <- paste("E", c(seq(-24 ,-3, by=3), -2, -1, 0, seq(3, 24, by=3)), sep="")
names(conv) <- c("y","z","a","f","p","n","µ","m","c","d","","K","M","G","T","P","E","Z","Y")
x <- as.character(x)
num <- function(x) as.numeric(
paste(
strsplit(x,"[A-z|µ]")[[1]][3],
ifelse(substr(paste(strsplit(x,"[0-9|\\.]")[[1]], sep="", collapse=""), 1, 1) == "",
"",
conv[substr(paste(strsplit(x,"[0-9|\\.]")[[1]], sep="", collapse=""), 1, 1)]
),
sep=""
)
)
return(lapply(x,num))
}
I apply it to by data table to update 3 columns like
temp[ ,`:=`(V13=sitor(V13),V14=sitor(V14),V15=sitor(V15)) ]
I have applied a data.table key vector to the temp table with
setkeyv(temp,c("V1","V2","V3","V18"))
Any 61 minutes later I am still here waiting for a result... Some tips on how to speed up this conversion would be really handy given that my data size is about to grow 4 to 5 times.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4878 neurozen 20 0 18.7g 18g 11m R 100.1 62.8 63:38.95 rsession
– Diogenessetkeyv(temp,c("V1","V2","V3","V18"))
ortemp[ ,
:=(V13=sitor(V13),V14=sitor(V14),V15=sitor(V15)) ]
? Why are you trying to sort temp? – Cranmersitor
returns a list... are your column in dt of type list? – Brycebrynsitor
function totemp
with temp[ ,:=(V13=sitor(V13),V14=sitor(V14),V15=sitor(V15)) ] is taking 61 minutes (it is actually still going). I'm not intending to sort temp, it just will be after the key is applied.setkeyv
ontemp
takes ~90 seconds. – Diogenessitor
arecharacter
class. Are you suggesting that I shouldn't be doing thelapply
in teh return value for the function? Tell me more... – Diogenessapply
or leave it and wrap withunlist
(the latter is faster) – Brycebrynsystem.time
and gotuser = 186.559 system = 0.067 elapsed = 186.804
So at this rate (assuming linear) it should take ~ 4hours 24 mins. yikes! – Diogenesstrsplit
s andsubstr
s outside of yournum
function, since the first "strsplits" a vector iteratively (returning a list), and the second "substrs" a list iteratively. Then, you'd only have tolapply
a modifiednum
function which -basically- contains apaste
. Depending on your output, you couldunlist
and useifelse
outside ofnum
, too. – Fiddlesticksi2f
across 3 columns of the data table for 83.7 million rows took 268.888 seconds! – Diogenes