split a vector by percentile
C

5

9

I need to split a sorted vector of unknown length in R into "top 10%, ..., bottom 10%". For example, if I have vector <- order(c(1:98928)), I want to split it into 10 vectors, each holding approximately 10% of the elements.

I've tried split <- split(vector, 1:10), but since I don't know the length of the vector in advance, I get this warning whenever the length is not a multiple of 10:

data length is not a multiple of split variable

Even when the length is a multiple of 10 and the call works, split() does not keep the order of my original vector. This is what split() gives:

split(c(1:10) , 1:2)
$`1`
[1] 1 3 5 7 9

$`2`
[1]  2  4  6  8 10

And this is what I want:

$`1`
[1] 1 2 3 4 5

$`2`
[1]  6  7  8  9 10

I'm new to R and have tried lots of things without success. Does anyone know how to do this?

Catena answered 24/7, 2016 at 1:45 Comment(0)
R
8

Problem statement

Break a sorted vector x every 10% into 10 chunks.

Note that there are two interpretations of this:

  1. Cutting by vector index:

    split(x, floor(10 * seq.int(0, length(x) - 1) / length(x)))
    
  2. Cutting by vector values (say, quantiles):

    split(x, cut(x, quantile(x, probs = 0:10 / 10, names = FALSE), include.lowest = TRUE))
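
For the index interpretation, a minimal sketch that wraps the expression above in a function may be convenient. The name `split_by_index` and its `n` argument are made up for this illustration and are not part of base R.

```r
# Hypothetical helper: split x into n consecutive chunks of near-equal size.
split_by_index <- function(x, n = 10) {
  # 0-based position scaled to [0, n), then floored to a group id 0..n-1
  group <- floor(n * (seq_along(x) - 1) / length(x))
  split(x, group)
}

x <- 1:23
chunks <- split_by_index(x)
lengths(chunks)                    # chunk sizes; they differ by at most one
unlist(chunks, use.names = FALSE)  # the original order is preserved
```

Because the group ids are computed from positions only, this works for any vector length, including lengths that are not a multiple of `n`.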
    

In the following, I demonstrate both approaches using this data:

set.seed(0); x <- sort(round(rnorm(23),1))

In particular, the example data are normally distributed rather than uniformly distributed, so cutting by index and cutting by value give substantially different results.

Result

cutting by index

#$`0`
#[1] -1.5 -1.2 -1.1
#
#$`1`
#[1] -0.9 -0.9
#
#$`2`
#[1] -0.8 -0.4
#
#$`3`
#[1] -0.3 -0.3 -0.3
#
#$`4`
#[1] -0.3 -0.2
#
#$`5`
#[1] 0.0 0.1
#
#$`6`
#[1] 0.3 0.4 0.4
#
#$`7`
#[1] 0.4 0.8
#
#$`8`
#[1] 1.3 1.3
#
#$`9`
#[1] 1.3 2.4

cutting by quantile

#$`[-1.5,-1.06]`
#[1] -1.5 -1.2 -1.1
#
#$`(-1.06,-0.86]`
#[1] -0.9 -0.9
#
#$`(-0.86,-0.34]`
#[1] -0.8 -0.4
#
#$`(-0.34,-0.3]`
#[1] -0.3 -0.3 -0.3 -0.3
#
#$`(-0.3,-0.2]`
#[1] -0.2
#
#$`(-0.2,0.14]`
#[1] 0.0 0.1
#
#$`(0.14,0.4]`
#[1] 0.3 0.4 0.4 0.4
#
#$`(0.4,0.64]`
#numeric(0)
#
#$`(0.64,1.3]`
#[1] 0.8 1.3 1.3 1.3
#
#$`(1.3,2.4]`
#[1] 2.4
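
One caveat with cutting by value: when the data contain many ties, several quantiles can coincide and cut() stops with a "'breaks' are not unique" error. A possible workaround, sketched here under the assumption that merging the tied bins is acceptable, is to deduplicate the breaks first. The helper name `split_by_quantile` is hypothetical.

```r
# Hypothetical helper: quantile-based split that tolerates tied quantiles.
# Assumes x has at least two distinct values.
split_by_quantile <- function(x, n = 10) {
  probs <- seq(0, 1, length.out = n + 1)
  # unique() drops repeated breaks, so tied bins are merged
  breaks <- unique(quantile(x, probs = probs, names = FALSE))
  split(x, cut(x, breaks, include.lowest = TRUE))
}

x <- c(1, 1, 1, 1, 1, 1, 2, 3, 4, 5)
groups <- split_by_quantile(x, n = 4)
lengths(groups)  # fewer than 4 groups, because three of the breaks equal 1
```

The trade-off is that the result may contain fewer than `n` groups, so downstream code should not assume exactly `n` of them.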
Ruprecht answered 24/7, 2016 at 2:9 Comment(0)
K
5

If you have your vector as a column (named vec) in a data frame, you can simply do something like this:

df$new_vec <- cut(df$vec, breaks = quantile(df$vec, probs = seq(0, 1, by = 0.1)),
                labels = 1:10, include.lowest = TRUE)
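
As a usage sketch, the toy data frame below is invented for illustration (the column name `decile` is also an assumption). It labels each row with its decile and then splits the column on that label.

```r
# Toy data frame; the values are deliberately not in ascending order
df <- data.frame(vec = 100:1)

# Label each row with its decile (1 = lowest 10%, 10 = highest 10%)
df$decile <- cut(df$vec,
                 breaks = quantile(df$vec, probs = seq(0, 1, by = 0.1)),
                 labels = 1:10, include.lowest = TRUE)

table(df$decile)                       # number of rows per decile
by_decile <- split(df$vec, df$decile)  # list of 10 vectors, one per decile
```

Note that this only works when the decile breaks are distinct; heavily tied data would need the breaks deduplicated first.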
Kariekaril answered 1/2, 2018 at 0:25 Comment(1)
I know I shouldn't comment just to say thank you (hence the upvote), but I spent literally hours looking for this solution, and it worked so well. Thanks. – Seal
T
4
x <- 1:98
y <- split(x, ((seq(length(x))-1)*10)%/%length(x)+1)

Explanation:

seq(length(x)) = 1..98

seq(length(x))-1 = 0..97

(seq(length(x))-1)*10 = (0, 10, ..., 970)

# each value covers about 10% of the elements, 98 in total
((seq(length(x))-1)*10)%/%length(x) = (0, ..., 0, 1, ..., 1, ..., 9, ..., 9) 

# each value covers about 10% of the elements, 98 in total
((seq(length(x))-1)*10)%/%length(x)+1 = (1, ..., 1, 2, ..., 2, ..., 10, ..., 10)

# splits first ~10% of numbers to 1, next ~10% of numbers to 2 etc.
split(x, ((seq(length(x))-1)*10)%/%length(x)+1) 
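
A quick way to convince yourself that this grouping does what the explanation says is to check the chunk count, the chunk sizes, and the order. This is only a sanity-check sketch on the same `x <- 1:98` example.

```r
x <- 1:98
y <- split(x, ((seq(length(x)) - 1) * 10) %/% length(x) + 1)

length(y)                     # number of chunks
range(lengths(y))             # smallest and largest chunk size
unlist(y, use.names = FALSE)  # concatenating the chunks gives back x
```

Since `%/%` is integer division, the group ids are computed without any floating-point rounding.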
Tetragram answered 24/7, 2016 at 2:12 Comment(0)
S
2

If the vector is sorted, you can create a grouping variable of the same length as the vector and split on it. A real case takes a little more effort, since the length of the vector may not be a multiple of the number of groups, but for your toy example you can do:

x = 1:10
n = 2
split(x, rep(1:n, each = length(x)/n))
# $`1`
# [1] 1 2 3 4 5

# $`2`
# [1]  6  7  8  9 10

A real case example, where the vector's length is not a multiple of the number of groups:

vec = 1:13
n = 3
split(vec, sort(seq_along(vec)%%n))
# $`0`
# [1] 1 2 3 4

# $`1`
# [1] 5 6 7 8 9

# $`2`
# [1] 10 11 12 13
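
To see why this works, it may help to look at the intermediate grouping vector. The sketch below just unpacks the one-liner above; the variable names `remainders` and `groups` are introduced here for illustration.

```r
vec <- 1:13
n <- 3

# The remainders cycle through 1, 2, 0, 1, 2, 0, ... along the positions
remainders <- seq_along(vec) %% n

# Sorting turns the cycle into consecutive runs of 0s, then 1s, then 2s
groups <- sort(remainders)

table(groups)  # run lengths, which become the chunk sizes
```

Each remainder occurs either floor(13 / 3) or ceiling(13 / 3) times, so the chunk sizes differ by at most one.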
Sarazen answered 24/7, 2016 at 2:6 Comment(0)
V
0

You can use the sum() function to find the positions at which to cut the vector. Compare the data against the percentile value with a logical operator such as <= and sum the result: sum() counts TRUE as 1 and FALSE as 0, so the total is the number of elements at or below that percentile. The elements of the vector must be sorted first.

# A vector with numbers from 1 to 100
data <- seq(1,100)

# 25th percentile value and 75th percentile value
ps1 <- quantile(data,probs=c(0.25))
ps2 <- quantile(data,probs=c(0.75))

# Positions to split
position1 <- sum(data<=ps1)
position2 <- sum(data<=ps2)

# Split with positions in a sorted data
sort(data)[position1:position2]

The result is

25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75

In the same way, you can divide an ordered vector into 10 equal parts by specifying the percentiles:

# A vector with numbers from 1 to 100
data <- seq(1,100)

# sub vectors based on percentiles
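# Include the 0th percentile so that every chunk has a lower bound
subvectors <- quantile(data,probs=seq(0,1,by=0.1))

# seq_len() avoids the precedence trap: 1:length(subvectors)-1 is (1:length(subvectors))-1
for (i in seq_len(length(subvectors)-1)){
  
  # Percentiles values
  ps1 <- subvectors[i]
  ps2 <- subvectors[i+1]
  
  # Positions to split (nothing lies below the first chunk)
  position1 <- if (i == 1) 0 else sum(data<=ps1)
  position2 <- sum(data<=ps2)
  
  # Start one past the previous boundary so that chunks do not overlap
  print(sort(data)[(position1+1):position2])
}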
Vomitory answered 24/10, 2022 at 20:51 Comment(0)
