Converting unit abbreviations to numbers
Asked Answered
W

5

5

I have a dataset that abbreviates numerical values in a column. For example, 12M mean 12 million, 1.2k means 1,200. M and k are the only abbreviations. How can I write code that allows R to sort these values from lowest to highest?

I've though about using gsub to convert M to 000,000 etc but that does not take into account the decimals (1.5M would then be 1.5000000).

Wisner answered 16/5, 2019 at 0:2 Comment(4)
You could call numfmt from system() if it is installed on your system and in the PATH. Something like system(paste("numfmt --from=auto --to=none", "12M"), intern = TRUE).Flavia
Related previous discussion - #36806715 including an answer in the comments which addresses this.Extragalactic
Also, seems we can safely assume normalized mantissas, in your case (12.00k or 0.012k are not normalized, for example)Cash
Oh and do you care about handling NA, NaN, Inf without blowing up?Cash
C
6
  • So you want to translate SI unit abbreviations ('K','M',...) into exponents, and thus numerical powers-of-ten. Given that all units are single-letter, and the exponents are uniformly-spaced powers of 10**3, here's working code that handles 'Kilo'...'Yotta', and any future exponents:
    > 10 ** (3*as.integer(regexpr('T', 'KMGTPEY')))
    [1] 1e+12

Then just multiply that power-of-ten by the decimal value you have.

  • Also, you probably want to detect and handle the 'no-match' case for unknown letter prefixes, otherwise you'd get a nonsensical -1*3
    > unit_to_power <- function(u) {
        exp_ <- 10**(as.integer(regexpr(u, 'KMGTPEY')) *3)
        return (if(exp_>=0) exp_ else 1)
    }
  • Now if you want to case-insensitive-match both 'k' and 'K' to Kilo (as computer people often write, even though it's technically an abuse of SI), then you'll need to special-case e.g with if-else ladder/expression (SI units are case-sensitive in general, 'M' means 'Mega' but 'm' strictly means 'milli' even if disk-drive users say otherwise; upper-case is conventionally for positive exponents). So for a few prefixes, @DanielV's case-specific code is better.

  • If you want negative SI prefixes too, use as.integer(regexpr(u, 'zafpnum@KMGTPEY')-8) where @ is just some throwaway character to keep uniform spacing, it shouldn't actually get matched. Again if you need to handle non-power-of-10**3 units like 'deci', 'centi', will require special-casing, or the general dict-based approach WeNYoBen uses.

  • base::regexpr is not vectorized also its performance is bad on big inputs, so if you want to vectorize and get higher-performance use stringr::str_locate.

Cash answered 16/5, 2019 at 1:23 Comment(0)
H
3

Give this a shot:

Text_Num <- function(x){
    if (grepl("M", x, ignore.case = TRUE)) {
        as.numeric(gsub("M", "", x, ignore.case = TRUE)) * 1e6
    } else if (grepl("k", x, ignore.case = TRUE)) {
        as.numeric(gsub("k", "", x, ignore.case = TRUE)) * 1e3
    } else {
        as.numeric(x)
    }
}
Hanoi answered 16/5, 2019 at 0:6 Comment(1)
This is exactly the kind of code I was looking for, I didn't think to multiply after changing to numeric values. Only problem is it returns "NA" when it gets to the "k" values. I'm trying to figure out a work aroundSmolensk
C
1

In your case you can using gsubfn

a=c('12M','1.2k')
dict<-list("k" = "e3", "M" = "e6")
as.numeric(gsubfn::gsubfn(paste(names(dict),collapse="|"),dict,a))
[1] 1.2e+07 1.2e+03
Correspondence answered 16/5, 2019 at 0:40 Comment(0)
P
0

I am glad to meet you.

I wrote another answer

Define function

res = function (x) {
  result = as.numeric(x)
  if(is.na(result)){
  text = gsub("k", "*1e3", x, ignore.case = T)
  text = gsub("m", "*1e6", text, ignore.case = T)
  result = eval(parse(text = text))
  } 
  return(result)
}

Result

> res("5M")
[1] 5e+06
> res("4K")
[1] 4000
> res("100")
[1] 100
> res("4k")
[1] 4000
> res("1e3")
[1] 1000
Propriety answered 16/5, 2019 at 3:22 Comment(0)
G
0

All other answers didn't work nicely with NA for me (or produce warnings, which is also not nice).

Here is my solution, which reuses some bits from other solutions. (also posted at R data.table speed up SI / Metric Conversion)

library(stringr)

si2num <- function(x)
{
  conv <- paste0("e", c(seq(-24 ,-3, by=3), -2, -1, seq(3, 24, by=3),3))
  names(conv) <- c("y","z","a","f","p","n","µ","m","c","d","K","M","G","T","P","E","Z","Y","k")
  
  xout <- str_replace_all(x, conv)
  xout <- as.numeric(xout)
  
  return(xout)
}

x <- c(NA,"10", "10.01K",NA,"10.1M", "20K", "21k",NA)

si2num(x)
[1]       NA       10    10010       NA 10100000    20000    21000       NA
Gutbucket answered 4/10, 2023 at 21:6 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.