R data.table speed up SI / Metric Conversion
Asked Answered
D

3

2

So here's the situation. I've got an 85 Million row table with 18 columns. Three of these columns have values in Metric Prefix / SI notation (See Metric Prefix on Wikipedia).

This means I have number like :

  • .1M instead of 100000 or 1e+5, or
  • 1K instead of 1000 or 1e+3

Sample data.table is

          V1     V2   V3  V4  V5  V6 V7 V8 V9  V10 V11 V12 V13 V14 V15 V16 V17 V18
 1: 2014-03-25 12:15:12 58300 3010 44.0  4.5 0.0   0   0  0.8  50 0.8 10K 303 21K   0     a   56
 2: 2014-03-25 12:15:12 56328 3010 28.0 12.0 0.0   0   0  0.3  60 0.0  59  62 .1M   0     a   66
 3: 2014-03-25 12:15:12 21082 3010 10.0  1.7 0.0   0   0 14.0  72 0.3  4K 208  8K   1     a   80
 4: 2014-03-25 12:15:12 59423 3010 12.0  0.0 0.2   0   0 88.0   0 0.0  20  16  71   0     a   26
 5: 2014-03-25 12:15:12 59423 3010  9.6  1.4 0.0   0   0 60.0  29 0.2  2K 251  6K   0     a   56
 6: 2014-03-25 12:15:12 24193 3010  8.3  1.9 0.0   0   0  9.9  80 0.3  3K 264  8K   1     a   71
 7: 2014-03-25 12:15:12 21082 3010  7.1  1.7 0.4   0   0  6.3  83 0.3  3K 197  7K   0     a   71
 8: 2014-03-25 12:15:12 59423 3010  4.6  1.2 0.0   0   0 57.0  37 0.1 998  81  7K   0     a  118

I modified a function written by Hans-Jörg Bibiko who used it to modify ggplot2 scales. See website here if you are iterested. The function I ended up using is :

sitor <- function(x)
{
  conv <- paste("E", c(seq(-24 ,-3, by=3), -2, -1, 0, seq(3, 24, by=3)), sep="")
  names(conv) <- c("y","z","a","f","p","n","µ","m","c","d","","K","M","G","T","P","E","Z","Y")
  x <- as.character(x)
  num <- function(x) as.numeric(
      paste(
        strsplit(x,"[A-z|µ]")[[1]][3],
        ifelse(substr(paste(strsplit(x,"[0-9|\\.]")[[1]], sep="", collapse=""), 1, 1) == "",
               "",
               conv[substr(paste(strsplit(x,"[0-9|\\.]")[[1]], sep="", collapse=""), 1, 1)]
        ),
        sep=""
      )
    )
  return(lapply(x,num))
}

I apply it to by data table to update 3 columns like

temp[ ,`:=`(V13=sitor(V13),V14=sitor(V14),V15=sitor(V15)) ]

I have applied a data.table key vector to the temp table with

setkeyv(temp,c("V1","V2","V3","V18"))

Any 61 minutes later I am still here waiting for a result... Some tips on how to speed up this conversion would be really handy given that my data size is about to grow 4 to 5 times.

Diogenes answered 30/3, 2014 at 11:41 Comment(10)
Output from top : PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 4878 neurozen 20 0 18.7g 18g 11m R 100.1 62.8 63:38.95 rsessionDiogenes
What is running 61 minutes? setkeyv(temp,c("V1","V2","V3","V18")) or temp[ ,:=(V13=sitor(V13),V14=sitor(V14),V15=sitor(V15)) ]? Why are you trying to sort temp?Cranmer
sitor returns a list... are your column in dt of type list?Brycebryn
@Dave Applying the sitor function to temp with temp[ ,:=(V13=sitor(V13),V14=sitor(V14),V15=sitor(V15)) ] is taking 61 minutes (it is actually still going). I'm not intending to sort temp, it just will be after the key is applied. setkeyv ontemp takes ~90 seconds.Diogenes
@Brycebryn Good point. The class of the columns I am applying sitor are character class. Are you suggesting that I shouldn't be doing the lapply in teh return value for the function? Tell me more...Diogenes
you can do either change it to sapply or leave it and wrap with unlist (the latter is faster)Brycebryn
@Brycebryn Excellent point. I'll give that a go and see how it runs.Diogenes
I tried a subset of 1 million rows and timed it with system.time and got user = 186.559 system = 0.067 elapsed = 186.804 So at this rate (assuming linear) it should take ~ 4hours 24 mins. yikes!Diogenes
As a note, you could move strsplits and substrs outside of your num function, since the first "strsplits" a vector iteratively (returning a list), and the second "substrs" a list iteratively. Then, you'd only have to lapply a modified num function which -basically- contains a paste. Depending on your output, you could unlist and use ifelse outside of num, too.Fiddlestick
@Brycebryn Applying your si2f across 3 columns of the data table for 83.7 million rows took 268.888 seconds!Diogenes
B
2

Why don't you try sitools library?

library(data.table)
dt<-data.table(var = sample(x=1:1e5, size=1e6, replace=T))
library(sitools)
> system.time(dt[, var2 := f2si(var)])
   user  system elapsed 
  10.08    0.09   10.89

EDIT: this is a data.table based function that reverse f2si from sitools package:

si2f<-function(x){
  if(is.numeric(x)) return(x)
  require(data.table)
  dt<-data.table(lab=c("y","z","a","f","p","n","µ","m","c","d","", "da", "h", "k","M","G","T","P","E","Z","Y"),
                 mul=c(1e-24, 1e-21, 1e-18, 1e-15, 1e-12, 1e-9, 1e-6, 1e-3, 1e-2, 1e-1, 1L, 10L, 1e2, 1e3, 1e6, 1e9, 1e12, 1e15, 1e18, 1e21, 1e24),
                 key="lab")
  res<-as.numeric(gsub("[^0-9|\\.]","", x))
  x<-gsub("[0-9]|\\s+|\\.","", x)
  .subset2(dt[.(x)], "mul")*res
}

> system.time(dt[, var3 := si2f(var2)])
   user  system elapsed 
  13.18    0.03   13.31 

> dt[, all.equal(var,var3)]
[1] TRUE
Brycebryn answered 30/3, 2014 at 13:20 Comment(10)
Wow that's awesome. I'm going to give it a go!Diogenes
I couldn't install the scitools package for my R version :` ‘scitools’ is not available (for R version 3.0.3)`Diogenes
@Diogenes it's sitoolsBrycebryn
@Michelle Thanks heaps. I've been up too long.Diogenes
The f2sci function is great but I actually want to do the reverse. As my sample data above, it has character units like K and M for kilo (1e3) and M for (1e6) in it and I want it as a number (with exponent or otherwise)Diogenes
@Diogenes oh I see. I saw that you applied your function to V13, V14 and V15 which seem to be just numbers so I thought you wanted this featureBrycebryn
@Michelle I like it! I am going to give it a go now and I will let you know the result.Diogenes
@Diogenes please mind I didn't use the capitol K but the smallBrycebryn
Applying your si2f across 3 columns of the data table for 83.7 million rows took 268.888 seconds!Diogenes
Michele I'll have to give you the tick for the answer as you come up with the key lookup concept first. @Data Munger I wish I could give give you a tick as well!Diogenes
V
1

Here is an approach that takes about 10 seconds on my computer to covert a vector with 10M values. You can extent it to cover more than "K", "M" & "G"

> f_conv <- function(val){
+     # create matrix indexed by name for exponent
+     key <- c(Zero = ""
+          , K = "E3"
+          , M = "E6"
+          , G = "E9"
+          )
+     # extract where the original exponent is 
+     indx <- regexpr("[KMG]", val)
+     # extract the exponent
+     exp <- substring(val, indx)
+     # if there was none, the use "Zero"
+     exp[indx == -1L] <- "Zero"
+     # put fake length
+     indx[indx == -1L] <- 20L
+     # do the conversion
+     as.numeric(paste0(substring(val, 1L, indx - 1L)
+                  , key[exp]
+                  )
+              )
+ }
> 
> # test data
> n <- 10000000
> result <- paste0(sample(1:999, n, TRUE)
+             , sample(c("K", "M", "G", ""), n, TRUE)
+             )
> 
> system.time(x <- f_conv(result))
   user  system elapsed 
   8.48    0.13    8.63 
> cbind(result[1:50], x[1:50])
      [,1]   [,2]          
 [1,] "562K" "562000"      
 [2,] "946"  "946"         
 [3,] "313G" "313000000000"
 [4,] "538M" "538000000"   
 [5,] "697K" "697000"      
 [6,] "486G" "486000000000"
 [7,] "814G" "814000000000"
 [8,] "842"  "842"         
 [9,] "993M" "993000000"   
[10,] "440K" "440000"      
[11,] "435G" "435000000000"
[12,] "407M" "407000000"   
[13,] "919K" "919000"      
[14,] "840"  "840"         
[15,] "766G" "766000000000"
[16,] "977"  "977"         
[17,] "139"  "139"         
[18,] "195G" "195000000000"
[19,] "609M" "609000000"   
[20,] "69"   "69"          
[21,] "147M" "147000000"   
[22,] "104M" "104000000"   
[23,] "509K" "509000"      
[24,] "951M" "951000000"   
[25,] "278"  "278"         
[26,] "797G" "797000000000"
[27,] "106K" "106000"      
[28,] "667K" "667000"      
[29,] "521K" "521000"      
[30,] "9"    "9"           
[31,] "17K"  "17000"       
[32,] "673M" "673000000"   
Vries answered 30/3, 2014 at 19:47 Comment(2)
I will give it a go. @Michelle @Data Munger FYI I also had a thought to use grep to select the rows which had metric units in them only temp[grep("[KMG]$",V1), V1:=sitor(V1)]). The catch was you have to return character then afterwards run as.numeric across the column again.Diogenes
I gave it a go and it works just fine. It took 237.6 seconds to run against 3 columns of a data table with 83.7 million rows. Thank you!Diogenes
B
0

All other answers didn't work nicely with NA for me (or produce warnings, which is also not nice).

Here is my solution, which reuses some bits from other solutions.

library(stringr)

si2num <- function(x)
{
  conv <- paste0("e", c(seq(-24 ,-3, by=3), -2, -1, seq(3, 24, by=3),3))
  names(conv) <- c("y","z","a","f","p","n","µ","m","c","d","K","M","G","T","P","E","Z","Y","k")
  
  xout <- str_replace_all(x, conv)
  xout <- as.numeric(xout)
  
  return(xout)
}

x <- c(NA,"10", "10.01K",NA,"10.1M", "20K", "21k",NA)

si2num(x)
[1]       NA       10    10010       NA 10100000    20000    21000       NA
Brierroot answered 4/10, 2023 at 21:5 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.