Equal frequency discretization in R
R

8

5

I'm having trouble finding a function in R that performs equal-frequency discretization. I stumbled on the 'infotheo' package, but after some testing I found that the algorithm is broken. 'dprep' seems to no longer be supported on CRAN.

EDIT :

For clarity, I do not need to separate the values between the bins. I really want equal frequency; it doesn't matter if one value ends up in two bins. E.g.:

c(1,3,2,1,2,2) 

should give a bin c(1,1,2) and one c(2,2,3)

Ruelle answered 20/4, 2011 at 13:29 Comment(1)
As there seems to be quite a bit of confusion about your real goal, I added an example. – Milton
7

EDIT : given your real goal, why don't you just do (corrected) :

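EqualFreq2 <- function(x, n) {
    nx <- length(x)
    nrepl <- floor(nx / n)                   # base number of values per bin
    nplus <- sample.int(n, nx - nrepl * n)   # randomly chosen bins that get one extra value
    nrep <- rep(nrepl, n)
    nrep[nplus] <- nrepl + 1
    # assign bin ids 1..n in sorted order, written back to the original positions
    x[order(x)] <- rep(seq_len(n), nrep)
    x
}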

This returns a vector indicating which bin each value belongs to. Because the same value can end up in two adjacent bins, you can't define the bin limits. But you can do:

x <- rpois(50,5)
y <- EqualFreq2(x,15)
table(y)
split(x,y)
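
A deterministic variant of the same positional idea can be sketched with rank(). This is only an illustration, not part of the answer's code: `positional_bins` is a hypothetical name, and when the length is not divisible by n the extra values go to fixed bins rather than randomly chosen ones.

```r
# Hypothetical helper: deterministic positional binning.
positional_bins <- function(x, n) {
    # rank with ties broken by position, so equal values may land in different bins
    r <- rank(x, ties.method = "first")
    # map ranks 1..length(x) onto bin ids 1..n
    as.integer(ceiling(r * n / length(x)))
}

x <- c(1, 3, 2, 1, 2, 2)        # the example from the question
bins <- positional_bins(x, 2)   # 1 2 1 1 2 2
lapply(split(x, bins), sort)    # bin 1: 1 1 2, bin 2: 2 2 3
```

As with EqualFreq2, the result is a vector of bin indicators in the original order of x, so split(x, bins) recovers the bins.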

Original answer:

You can easily just use cut() for this :

EqualFreq <- function(x, n, include.lowest = TRUE, ...) {
    nx <- length(x)
    # positions of the break points in the sorted data
    id <- round(c(1, (1:(n-1)) * (nx/n), nx))

    breaks <- sort(x)[id]
    # cut() requires unique breaks
    if (anyDuplicated(breaks) > 0) stop("n is too large.")

    cut(x, breaks, include.lowest = include.lowest, ...)
}

Which gives :

set.seed(12345)
x <- rnorm(50)
table(EqualFreq(x,5))

 [-2.38,-0.886] (-0.886,-0.116]  (-0.116,0.586]   (0.586,0.937]     (0.937,2.2] 
             10              10              10              10              10 

x <- rpois(50,5)
table(EqualFreq(x,5))

 [1,3]  (3,5]  (5,6]  (6,7] (7,11] 
    10     13     11      6     10 

As you can see, for discrete data a perfectly equal binning is impossible in most cases, because tied values always fall into the same bin, but this method gives you the best possible binning available.
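
The breaks can also be taken from quantile() instead of being computed by position. The following is a minimal sketch assuming base R only; `equal_freq_quantile` is a hypothetical name, and the uniqueness check mirrors the one in EqualFreq.

```r
# Hypothetical helper: equal-frequency bins from quantile() breaks.
equal_freq_quantile <- function(x, n, ...) {
    # n + 1 break points at evenly spaced probabilities
    breaks <- quantile(x, probs = seq(0, 1, length.out = n + 1), names = FALSE)
    # cut() needs unique breaks; heavily tied data cannot be binned this way
    if (anyDuplicated(breaks) > 0) stop("n is too large: breaks are not unique.")
    cut(x, breaks, include.lowest = TRUE, ...)
}

set.seed(12345)
x <- rnorm(50)
counts <- table(equal_freq_quantile(x, 5))
counts
```

For continuous data this yields equally filled bins; for data with many ties, such as c(1,1,1,1,1,1,1,1,2,2) with n = 2, it stops with an error, just as any approach based on value breaks must.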

Milton answered 20/4, 2011 at 13:48 Comment(14)
You could also use quantile to set up your breaks: table(cut(x,quantile(x),include.lowest=T)) – Ale
@Joris, @Ale - this function breaks on the following: test = c(1,1,1,1,1,1,1,1,2,2); EqualFreq(test,2). The ideal solution would not be tied to numeric breaks but to positional breaks. I'll try to work up a solution, but please do let me know if something comes to mind. – Ruelle
@Ale : That's basically the same (I calculate the quantiles by hand, to allow for any kind of binning). – Milton
@SFun28 : and how exactly do you plan to bin that? You have only 2 values, so there is no binning possible. – Milton
@Joris - sure there is - put the first five values in bin1 and the second five in bin2 after sorting (it's already sorted here). Some 1's will wind up in bin 1 while others will wind up in bin 2, and the bins will have an equal number of observations (or close enough if the length of x is not evenly divisible by the number of bins). – Ruelle
@Joris - note that we're talking about a reduced example here... in reality maybe the breaks would always be unique, but the goal is to drive towards a solution that 1. doesn't break 2. puts an equal number of values in each bin (or gets them as close as possible). – Ruelle
@SFun28 but that definition of a "bin" is going to break the interpretation of the histogram, is it not? – Bateau
@SFun28 : if you only have 2 values, how are you going to cut? bin 1 is everything that is 1, and bin 2 is everything that is 2. So binning is exactly the value. Otherwise you can't define your bins. If you just want to cut it up, then say so. – Milton
@Joris, @Gavin - I think you're implying that my method is somehow not accurate. In fact I don't believe there is a single definition for equal frequency discretization. I just want to cut my dataset up into equal chunks after sorting (which is A method for equal-freq discretization but not THE method for it =). Perhaps there's a different solution for that? – Ruelle
By the way, I've edited the title of the post to remove "histogram" - perhaps it was a mistake to use that term. – Ruelle
@SFun28 : gave you an extra function that comes closest to what you are trying to do. But there is no possible way you can define the limits of each bin. – Milton
@SFun28 : updated the code, there was a logical error in it. Now it's OK. – Milton
@Joris - you've given me a number of tools to work with. Thanks! The second solution doesn't seem to work with test = c(1,2,1,2,1,2,1,2,1,2) and EqualFreq2(test,2), but I think I can get it to work with what you've published. I'll post my final solution. – Ruelle
@SFun28 : works perfectly for me. After the update, of course; I had put the order() command on the wrong side of the assignment. But split(test,EqualFreq2(test,2)) nicely gives two bins containing the separate numbers. – Milton
5

This sort of thing is also quite easily solved by using (abusing?) the conditioning plot infrastructure from lattice, in particular function co.intervals():

cutEqual <- function(x, n, include.lowest = TRUE, ...) {
    stopifnot(require(lattice))
    cut(x, co.intervals(x, n, 0)[c(1, (n+1):(n*2))], 
        include.lowest = include.lowest, ...)
}

Which reproduces @Joris' excellent answer:

> set.seed(12345)
> x <- rnorm(50)
> table(cutEqual(x, 5))

 [-2.38,-0.885] (-0.885,-0.115]  (-0.115,0.587]   (0.587,0.938]     (0.938,2.2] 
             10              10              10              10              10
> y <- rpois(50, 5)
> table(cutEqual(y, 5))

 [0.5,3.5]  (3.5,5.5]  (5.5,6.5]  (6.5,7.5] (7.5,11.5] 
        10         13         11          6         10

In the latter, discrete, case the breaks are different although they have the same effect; the same observations are in the same bins.
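
That equivalence can be checked directly. The sketch below uses made-up integer data (not the rpois sample above) together with the two sets of breaks shown in the outputs, and assumes base R only.

```r
# Made-up integer data covering the same range as the example output
y <- c(1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 11)

# Breaks at observed values versus breaks halfway between integers
a <- cut(y, c(1, 3, 5, 6, 7, 11), include.lowest = TRUE)
b <- cut(y, c(0.5, 3.5, 5.5, 6.5, 7.5, 11.5), include.lowest = TRUE)

# The labels differ, but every observation gets the same bin number
same_bins <- identical(as.integer(a), as.integer(b))
same_bins
```

Only the factor labels differ; for integer-valued data, a break at 3 with right-closed intervals and a break at 3.5 separate the same observations.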

Bateau answered 20/4, 2011 at 14:25 Comment(3)
Your last answers really urge me to get a bit more into how to hack with lattice. It seems to have quite some nice functions to abuse. Thx for the tip +1 – Milton
Thanks so much for this solution. Continuing to learn new things about R! (Also thanks for your responses on other posts =) – Ruelle
The solution seems great, but do you know how you would modify the function if you want to split a data frame based on the intervals of a single variable? – Disrespect
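
One possible way to approach the last comment's question is to store the bin as a column and pass it to split(). This is a sketch under assumptions: the data frame and its column names are made up, and quantile() breaks are used so the example needs base R only (the cut() call could be replaced by cutEqual()).

```r
set.seed(42)
# Made-up data frame; 'value' is the variable whose intervals define the split
df <- data.frame(id = 1:20, value = rnorm(20))

n <- 4
breaks <- quantile(df$value, probs = seq(0, 1, length.out = n + 1), names = FALSE)
df$bin <- cut(df$value, breaks, include.lowest = TRUE)

# A list of n data frames, one per interval, each keeping all columns
parts <- split(df, df$bin)
sapply(parts, nrow)
```

Each element of parts is a complete data frame restricted to the rows whose value falls in that interval.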
5

How about?

a <- rnorm(50)
table(Hmisc::cut2(a, m = 10))

[-2.2020,-0.7710) [-0.7710,-0.2352) [-0.2352, 0.0997) [ 0.0997, 0.9775) 
               10                10                10                10 
[ 0.9775, 2.5677] 
               10 
Floriated answered 21/4, 2011 at 8:28 Comment(1)
Thanks for the heads-up about cut2 (although the OP's problem turned out to be of a different nature). – Milton
1

The classInt package was created "for choosing univariate class intervals for mapping or other graphics purposes". You can just do:

dataset <- c(1,3,2,1,2,2) 

library(classInt)
classIntervals(dataset, 2, style = 'quantile')

where 2 is the number of bins you want and the quantile style provides quantile breaks. There are several styles available for this function: "fixed", "sd", "equal", "pretty", "quantile", "kmeans", "hclust", "bclust", "fisher", or "jenks". Check docs for more info.

Blastula answered 26/11, 2017 at 0:1 Comment(0)
0

Here is a function that handles the error "'breaks' are not unique" and automatically selects the closest n_bins value to the one you set.

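equal_freq <- function(var, n_bins)
{
  n_bins_orig <- n_bins

  # cut_number() signals an error when the quantile breaks are not unique
  try_cut <- function(n) tryCatch(ggplot2::cut_number(var, n = n), error = function(e) e)

  res <- try_cut(n_bins)
  # Check the class of the result rather than the message text,
  # which may differ between ggplot2 versions
  while (inherits(res, "error") && n_bins > 1)
  {
    n_bins <- n_bins - 1
    res <- try_cut(n_bins)
  }
  # Still failing with a single bin: re-raise the last error
  if (inherits(res, "error")) stop(res)
  if (n_bins_orig != n_bins)
    warning(sprintf("It's not possible to calculate with n_bins=%s, setting n_bins in: %s.", n_bins_orig, n_bins))

  res
}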

Example:

equal_freq(mtcars$carb, 10)

Which returns the binned variable and the following warning:

It's not possible to calculate with n_bins=10, setting n_bins in: 5.
Cellulose answered 7/12, 2015 at 19:23 Comment(0)
0

Here is a one-liner solution inspired by @Joris' answer:

x <- rpois(50,5)
binSize <- 5   # number of bins, despite the name
# assumes length(x) is an exact multiple of binSize
desiredFrequency <- floor(length(x)/binSize)
split(sort(x), rep(1:binSize, rep(desiredFrequency, binSize)))
Desireedesiri answered 9/2, 2017 at 16:3 Comment(0)
0

Here's another solution using mltools.

set.seed(1)
x <- round(rnorm(20), 2)
x.binned <- mltools::bin_data(x, bins = 5, binType = "quantile")
table(x.binned)

x.binned
[-2.21, -0.622)   [-0.622, 0.1)    [0.1, 0.526)  [0.526, 0.844)    [0.844, 1.6] 
              4               4               4               4               4 
Nibelungenlied answered 29/11, 2017 at 20:39 Comment(1)
Is there a way to convert this extraordinarily beautifully typeset table into a vector understood by hist()'s breaks? Many thanks – Victor
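
Regarding the comment above, one option is to compute the break points as a plain numeric vector and pass that to hist(). The sketch below assumes that base R's quantile() with its default settings is an acceptable source of breaks. It does not extract anything from the mltools result.

```r
set.seed(1)
x <- round(rnorm(20), 2)

# Plain numeric vector of 6 break points delimiting 5 quantile bins
brks <- quantile(x, probs = seq(0, 1, length.out = 6), names = FALSE)

# plot = FALSE returns the histogram object without drawing it
h <- hist(x, breaks = brks, plot = FALSE)
h$counts
```

Dropping plot = FALSE draws the histogram. Because the bins have unequal widths, hist() then plots densities rather than counts by default.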
0

We can use the cutr package with what = "rough"; the look of the labels can be customized to taste:

# devtools::install_github("moodymudskipper/cutr")
library(cutr)
smart_cut(c(1, 3, 2, 1, 2, 2), 2, "rough", brackets = NULL, sep="-")
# [1] 1-2 2-3 1-2 1-2 2-3 2-3
# Levels: 1-2 < 2-3
Orcinol answered 16/10, 2018 at 22:47 Comment(0)
