- I have a NumPy matrix with shape (4601, 58).
- I want to split the matrix randomly by rows into 60%, 20%, and 20% parts.
- I need this for a machine learning task.
- Is there a numpy function that randomly selects rows?
You can use numpy.random.shuffle, which shuffles the rows of the array in place:
import numpy as np
N = 4601
data = np.arange(N*58).reshape(-1, 58)
np.random.shuffle(data)
a = data[:int(N*0.6)]
b = data[int(N*0.6):int(N*0.8)]
c = data[int(N*0.8):]
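If you would rather keep the original array untouched, the same split can be sketched with a shuffled copy and np.split. This is only an illustration: the seed value and the names rng, shuffled and cuts are assumptions made for the example, not part of the answer above.

```python
import numpy as np

N = 4601
data = np.arange(N * 58).reshape(-1, 58)

# Seed 0 is an arbitrary choice, used here only for reproducibility.
rng = np.random.default_rng(0)

# permutation() returns a shuffled copy (rows permuted along axis 0);
# `data` itself is left unchanged.
shuffled = rng.permutation(data)

# Cut points at 60% and 80% of the rows, as in the slices above.
cuts = [int(N * 0.6), int(N * 0.8)]
a, b, c = np.split(shuffled, cuts)

print(a.shape, b.shape, c.shape)  # (2760, 58) (920, 58) (921, 58)
```

Because 4601 is not divisible by 5, the last part ends up one row larger than the middle one.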
A complement to HYRY's answer if you want to shuffle several arrays x, y, z consistently, all with the same first dimension: x.shape[0] == y.shape[0] == z.shape[0] == n_samples.
You can do:
rng = np.random.RandomState(42) # reproducible results with a fixed seed
indices = np.arange(n_samples)
rng.shuffle(indices)
x_shuffled = x[indices]
y_shuffled = y[indices]
z_shuffled = z[indices]
And then proceed with the split of each shuffled array as in HYRY's answer.
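As a rough sketch of that last step, you can also split the shuffled index array once and reuse the pieces for every array, which keeps rows aligned across arrays. The toy arrays x and y, the 60/20/20 cut points, and the names train_idx, val_idx and test_idx are assumptions made up for this example.

```python
import numpy as np

n_samples = 10
x = np.arange(n_samples * 2).reshape(n_samples, 2)  # hypothetical features
y = np.arange(n_samples)                            # hypothetical labels

rng = np.random.RandomState(42)  # reproducible results with a fixed seed
indices = rng.permutation(n_samples)

# Integer arithmetic for the 60% and 80% cut points.
train_end = n_samples * 60 // 100
val_end = n_samples * 80 // 100
train_idx, val_idx, test_idx = np.split(indices, [train_end, val_end])

# The same index arrays are applied to every array, so row i of x_train
# still corresponds to element i of y_train.
x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]
x_test, y_test = x[test_idx], y[test_idx]

print(x_train.shape, x_val.shape, x_test.shape)
```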
If you want to randomly select rows, you could just use random.sample from the standard Python library:
import random
population = range(4601) # Your number of rows
choice = random.sample(population, k) # k being the number of samples you require
random.sample samples without replacement, so you don't need to worry about repeated rows ending up in choice. Given a numpy array called matrix, you can select the rows by indexing with that list, like this: matrix[choice].
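Since the question asks for a NumPy function, the same idea can probably be expressed with Generator.choice and replace=False. This is a sketch under assumptions: the value of k, the seed and the stand-in matrix are made up for the example.

```python
import numpy as np

matrix = np.arange(4601 * 58).reshape(4601, 58)  # stand-in for the real data
k = 5                                            # hypothetical sample size

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

# Draw k distinct row indices (replace=False means no repeats).
choice = rng.choice(matrix.shape[0], size=k, replace=False)
rows = matrix[choice]

print(rows.shape)  # (5, 58)
```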
Of course, k can be equal to the total number of elements in the population, and then choice would contain a random ordering of the indices for your rows. Then you can partition choice as you please, if that's all you need.
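To make the partitioning concrete, here is a small sketch with a toy 20-row matrix. The matrix, the seed, and the 60/20/20 cut points are assumptions chosen for the illustration.

```python
import random
import numpy as np

matrix = np.arange(20 * 3).reshape(20, 3)  # hypothetical small matrix
n_rows = matrix.shape[0]

random.seed(0)  # arbitrary seed, only for reproducibility
# With k == n_rows, the sample is a random ordering of all row indices.
choice = random.sample(range(n_rows), n_rows)

first_cut = n_rows * 60 // 100   # 12
second_cut = n_rows * 80 // 100  # 16

# Indexing with a list of integers picks those rows in that order.
train = matrix[choice[:first_cut]]
validation = matrix[choice[first_cut:second_cut]]
test = matrix[choice[second_cut:]]

print(train.shape, validation.shape, test.shape)  # (12, 3) (4, 3) (4, 3)
```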
Since you need it for machine learning, here is a method I wrote:
import numpy as np

def split_random(matrix, percent_train=70, percent_test=15):
    """
    Splits matrix data into randomly ordered sets
    grouped by provided percentages.

    Usage:
    rows = 100
    columns = 2
    matrix = np.random.rand(rows, columns)
    training, testing, validation = \
    split_random(matrix, percent_train=80, percent_test=10)

    percent_validation 10
    training (80, 2)
    testing (10, 2)
    validation (10, 2)

    Returns:
    - training_data: percent_train e.g. 70%
    - testing_data: percent_test e.g. 15%
    - validation_data: remainder from 100% e.g. 15%

    Created by Uki D. Lucas on Feb. 4, 2017
    """
    percent_validation = 100 - percent_train - percent_test

    if percent_validation < 0:
        print("Make sure that the provided sum of " + \
              "training and testing percentages is equal to, " + \
              "or less than 100%.")
        percent_validation = 0
    else:
        print("percent_validation", percent_validation)

    #print(matrix)
    rows = matrix.shape[0]
    np.random.shuffle(matrix)  # note: shuffles the rows of matrix in place

    # Row indices where the training and testing sets end
    end_training = int(rows*percent_train/100)
    end_testing = end_training + int((rows * percent_test/100))

    training = matrix[:end_training]
    testing = matrix[end_training:end_testing]
    validation = matrix[end_testing:]
    return training, testing, validation
# TEST:
rows = 100
columns = 2
matrix = np.random.rand(rows, columns)
training, testing, validation = split_random(matrix, percent_train=80, percent_test=10)
print("training",training.shape)
print("testing",testing.shape)
print("validation",validation.shape)
print(split_random.__doc__)
- training (80, 2)
- testing (10, 2)
- validation (10, 2)
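One thing to keep in mind with the function above is that np.random.shuffle reorders the caller's matrix in place. If that is not what you want, a variant along the following lines might help. It is only a sketch: the name split_random_copy, the seed parameter, and raising ValueError instead of printing a message are assumptions of this example, not part of the original function.

```python
import numpy as np

def split_random_copy(matrix, percent_train=70, percent_test=15, seed=None):
    """Hypothetical variant of split_random that leaves `matrix` unchanged."""
    if percent_train + percent_test > 100:
        # Assumption: fail loudly rather than print and continue.
        raise ValueError("percent_train + percent_test must not exceed 100")

    rows = matrix.shape[0]
    rng = np.random.default_rng(seed)
    order = rng.permutation(rows)  # shuffled row indices, input untouched

    end_training = int(rows * percent_train / 100)
    end_testing = end_training + int(rows * percent_test / 100)

    # Indexing with an integer array returns new arrays, not views.
    training = matrix[order[:end_training]]
    testing = matrix[order[end_training:end_testing]]
    validation = matrix[order[end_testing:]]
    return training, testing, validation

matrix = np.arange(200).reshape(100, 2)
original = matrix.copy()
training, testing, validation = split_random_copy(
    matrix, percent_train=80, percent_test=10, seed=0)

print(training.shape, testing.shape, validation.shape)  # (80, 2) (10, 2) (10, 2)
print(np.array_equal(matrix, original))                 # True
```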