I have two columns of data representing the same quantity; one column is from my training data, the other is from my validation data.
I know how to calculate the percentile rankings of the training data efficiently using:
    pandas.DataFrame(training_data).rank(pct=True).values
My question is, how can I efficiently get a similar set of percentile rankings of the validation data column relative to the training data column? That is, for each value in the validation data column, how can I find what its percentile ranking would be relative to all the values in the training data column?
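To make the goal concrete, here is a toy example of the kind of output I'm after. The numbers are made up for illustration, and counting training values less than or equal to each validation value is just one possible convention (ties could reasonably be handled differently):

```python
import numpy as np

# Toy data, made up purely for illustration.
training_data = np.array([10.0, 20.0, 30.0, 40.0])
validation_data = np.array([25.0, 5.0, 40.0])

# Brute-force reference: for each validation value, the fraction of
# training values that are <= it. This builds an (n_val, n_train) boolean
# matrix, so it is only meant to pin down the definition, not to be fast.
expected = (training_data[None, :] <= validation_data[:, None]).mean(axis=1)
print(expected)  # values: 0.5, 0.0, 1.0
```

So 25 sits above half of the training values, 5 is below all of them, and 40 is at or above all of them.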
I've tried doing this:
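    import numpy as np
    import scipy.stats

    def percentrank(input_data, comparison_data):
        # Percentile rank (0 to 1) of each input value relative to comparison_data.
        rescaled_data = np.zeros(input_data.size)
        for idx, datum in enumerate(input_data):
            rescaled_data[idx] = scipy.stats.percentileofscore(comparison_data, datum)
        return rescaled_data / 100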
But I'm not sure this is even correct, and on top of that it's incredibly slow: every call to percentileofscore goes through the whole comparison array again, so the for loop repeats a lot of redundant work for each value.
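One direction I've been considering, as a rough sketch rather than something I've validated: sort the training column once, then use np.searchsorted to count, for every validation value at the same time, how many training values are strictly below it and how many are at or below it. The function name percent_rank_sorted is my own, the tie handling is my attempt to mirror what I understand the default kind='rank' of percentileofscore to do, and I'm assuming 1-D numeric data with no NaNs:

```python
import numpy as np

def percent_rank_sorted(input_data, comparison_data):
    """Percentile rank (0 to 1) of each input value relative to comparison_data.

    Assumes 1-D numeric data without NaNs (hypothetical helper, not a library API).
    """
    # Sort the reference column once: O(m log m).
    sorted_ref = np.sort(np.asarray(comparison_data, dtype=float).ravel())
    values = np.asarray(input_data, dtype=float).ravel()

    # Binary search per value: O(n log m) in total.
    left = np.searchsorted(sorted_ref, values, side="left")    # count of ref < value
    right = np.searchsorted(sorted_ref, values, side="right")  # count of ref <= value

    # Average rank among ties; a value absent from the reference gets left / m.
    return (left + right + (right > left)) / (2 * sorted_ref.size)


validation_ranks = percent_rank_sorted([2.5, 0.0, 5.0, 2.0], [4.0, 1.0, 3.0, 2.0])
print(validation_ranks)  # values: 0.5, 0.0, 1.0, 0.5
```

If this is right, the cost drops from one full pass over the training data per validation value to a single sort plus a binary search per value, but I'd still like confirmation that the definition matches what rank(pct=True) gives on the training column.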
Any help would be greatly appreciated!