Finding n lowest values for each row in a dataframe

I have a large dataframe with 1739 rows and 1455 columns. I want to find the 150 lowest values for each row (not just the 150th value, but all 150 values).

I iterate over rows with a basic for loop.

I tried df.min(axis=1), but it only gives one minimum per row. I also tried the rolling_min function, without success.

Is there an existing function where I can specify how many minimum values I want to find?

My ultimate goal is to take the 150 lowest values, build a curve (slope) from them, and calculate the area under that curve. Doing this for each row and adding the areas gives a volume.

Example of the dataframe; I have a df that looks like this:

     -218.7     -218.4    ...          217.2      217.5
0     56.632706  13.638315    ...      76.543000  76.543000
1     56.633455  13.576762    ...      76.543000  76.543000
2    -18.432203 -18.384091    ...      76.543000  76.543000
3    -18.476594 -18.439804    ...      76.543000  76.543000

The headers ('-218.7', ...) are the x-axis coordinates of a scan, and the data values are the scan heights on the y axis. What I need is the 150 lowest values for each row, together with their associated column headers, because I want to build a curve for each row and then calculate the area under it.

So for each row I need something like this:

         -218.7     -218.4    ... for 150 columns
4    -18.532035 -18.497517    ... for 150 values

I don't think I need to store the header info for every line at once; a for loop would go through one row at a time.

Electoral asked 25/2, 2019 at 19:25 Comment(6)
Can you sort the array? If so, try sorting it, saving it in a temporary array and getting a sublist of 150 elements. – Cooe
Just to clarify for my own head, you want the smallest 150 values in each row, for all 1739 rows? – Mailbag
@Cooe Yeah, that could work if I can get the array sorted. But then I would have to sort it 1739 times (once for each row). – Electoral
No, you just sort along axis 1: np.sort(df.values, 1)[:, 0:150] – Plaice
@Mailbag Yes, that is exactly what I need. – Electoral
If you care about efficiency, you should use a linear scan instead of sorting: iterate through each row and keep only the 150 smallest values seen so far. Since 150 is a constant, this takes O(rows * columns), whereas sorting every row would take O(rows * columns * log(columns)). – Jaymejaymee
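The bounded-selection idea from the last comment maps directly onto Python's heapq module. A minimal sketch, assuming the data is already a NumPy array with one scan per row (the names here are illustrative, not from the question):

import heapq
import numpy as np

def k_smallest_in_row(row, k=150):
    # Keep the k smallest values seen so far in a bounded max-heap.
    # Values are negated so the largest kept value sits at the heap root
    # and can be evicted in O(log k).
    heap = []
    for value in row:
        if len(heap) < k:
            heapq.heappush(heap, -value)
        elif value < -heap[0]:          # smaller than the largest kept value
            heapq.heappushpop(heap, -value)
    return sorted(-v for v in heap)     # return in ascending order

row = np.random.rand(1455)
print(k_smallest_in_row(row, k=5))

In practice heapq.nsmallest(k, row) does the same thing in a single call, as one of the answers below shows.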

Use .argsort to get the indices that would sort the underlying array. Slice the values and the column Index to get all of the information you need. We'll create a MultiIndex so we can store both the column headers and the values in the same DataFrame; the first level indicates the nth-lowest position.

Example:

import pandas as pd
import numpy as np

np.random.seed(1)
df = pd.DataFrame(np.random.randint(1,100000, (1739, 26)))
df.columns = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')

N = 7  # 150 in your case
idx = np.argsort(df.values, 1)[:, 0:N]  # column positions of the N smallest values per row

pd.concat([pd.DataFrame(np.take_along_axis(df.to_numpy(), idx, axis=1), index=df.index),
           pd.DataFrame(df.columns.to_numpy()[idx], index=df.index)],
          keys=['Value', 'Columns'], axis=1)

Output:

      Value                                           Columns                  
          0      1      2      3      4      5      6       0  1  2  3  4  5  6
0      5193   7752   8445  19947  20610  21441  21759       C  K  U  V  I  G  P
1       432   3607  16278  17138  19434  26104  33879       R  J  W  C  B  D  G
2        16   1047   1845   9553  12314  13784  19432       K  S  E  F  M  O  U
3       244   5272  10836  13682  29237  33230  34448       K  Q  A  S  X  W  G
4      9765  11275  13160  22808  30870  33484  42760       K  T  L  U  C  D  M
5      2034   2179   4980   7184  14826  15238  22807       Z  H  F  Q  L  R  X
...
Plaice answered 25/2, 2019 at 19:49 Comment(3)
@Plaice Thanks, this works great! I just need to build each curve and compute the area under it and I am all good to go! – Electoral
Do you have an idea how to create a curve from Columns 0 and Value 0? I tried a basic plt.scatter but it does not work. – Electoral
@Electoral I think that is different enough to warrant a new question. It will likely involve some groupby or slicing of the MultiIndex, but it depends whether you want all plots or only need to look at a few at a time. – Plaice
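A note on the plotting question in the comments: slicing one row out of the MultiIndex result could look like the sketch below. The name res for the concatenated DataFrame is an assumption (the answer never assigns it), and the column headers are assumed to be numeric x coordinates as in the question, not the letters of the toy example above:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in built the same way as the answer above, but from a df
# whose column headers are numeric x coordinates.
df = pd.DataFrame(np.random.rand(4, 10),
                  columns=np.round(np.linspace(-218.7, -216.0, 10), 1))
N = 5
idx = np.argsort(df.values, 1)[:, 0:N]
res = pd.concat([pd.DataFrame(np.take_along_axis(df.to_numpy(), idx, axis=1), index=df.index),
                 pd.DataFrame(df.columns.to_numpy()[idx], index=df.index)],
                keys=['Value', 'Columns'], axis=1)

i = 0                                                 # scan line (row) to plot
x = res.loc[i, 'Columns'].astype(float).to_numpy()    # x coordinates of the N lowest points
y = res.loc[i, 'Value'].to_numpy()                    # their heights

order = x.argsort()                                   # plot left to right, not lowest to highest
plt.plot(x[order], y[order], marker='o')
plt.xlabel('x coordinate')
plt.ylabel('height')
plt.show()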

If I understand correctly, the question boils down to getting the k smallest numbers in a list of M (> k) numbers. This is then applied to each row individually.

If NumPy is available and order does not matter, you could try argpartition: given a parameter k, it rearranges an array so that the element that ends up at position k is in its sorted position, with all smaller numbers before it and all larger numbers after it (each side in unspecified order):

import numpy as np
row = np.array([1, 6, 2, 12, 7, 8, 9, 11, 15, 26])
k = 5
idx = np.argpartition(row, k)[:k]

print(idx)
print(row[idx])

-->
[1 0 2 4 5]
[6 1 2 7 8]

Edit: This also works row-wise for full arrays:

import numpy as np
data = np.array([
    [1, 6, 2, 12, 7, 8, 9, 11, 15, 26],
    [1, 65, 2, 12, 7, 8, 9, 11, 15, 26],
    [16, 6, 2, 12, 7, 8, 9, 11, 15, 26]])
k = 5
idx = np.argpartition(data, k)[:,:k]

print(idx)

-->
[[1 0 2 4 5]
 [2 0 4 5 6]
 [4 2 1 5 6]]
Evasion answered 25/2, 2019 at 19:41 Comment(2)
This would work if there were a way to keep them in order; I need that to build the curve afterwards ... – Electoral
@Electoral What do you mean by 'make them in order'? Sorting the 150 entries per row afterwards would not be a problem, I think. If you need the original order, you would have to sort idx before using it. – Evasion
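Building on that comment, here is a hedged sketch of the full workflow the question describes: pick the 150 lowest heights per row with argpartition, sort the indices to restore the original left-to-right x order, then integrate each curve with np.trapz and sum the areas. The dataframe df below is only a stand-in for the scan data, with numeric column headers:

import numpy as np
import pandas as pd

# Stand-in for the scan data: numeric x coordinates as column headers.
df = pd.DataFrame(np.random.rand(6, 300), columns=np.arange(300) * 0.3 - 218.7)
k = 150

x_all = df.columns.to_numpy(dtype=float)
heights = df.to_numpy()

idx = np.argpartition(heights, k, axis=1)[:, :k]   # indices of the k lowest values per row
idx.sort(axis=1)                                   # back to left-to-right x order

x_sel = x_all[idx]                                 # (rows, k) x coordinates
y_sel = np.take_along_axis(heights, idx, axis=1)   # (rows, k) heights

areas = np.trapz(y_sel, x_sel, axis=1)   # area under each row's curve (np.trapezoid in newer NumPy)
volume = areas.sum()                     # add the row areas to get the volume
print(areas[:3], volume)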

You can use heapq.nsmallest to find the n smallest numbers in a list. This can be quickly applied to each row of a dataframe using .apply:

import pandas as pd
import numpy as np
import heapq

df = pd.DataFrame(np.random.randn(1000, 1000))

# Find the 150 smallest values in each row
smallest = df.apply(lambda x: heapq.nsmallest(150, x), axis=1)

Each row of smallest is now a list of the 150 smallest values in the corresponding row in df.

This can be converted to a dataframe using:

smallest_df = pd.DataFrame(smallest.values.tolist())

This is now a dataframe where each row corresponds to each row in the original dataframe. There are 150 columns, with the 150 smallest values in each row of the original.

smallest_df.head()

Foxhole answered 25/2, 2019 at 19:47 Comment(3)
This would work perfectly if I could keep the original column header linked to each value, as the header is the position on the x axis and the 150 values are the heights on the y axis. @Foxhole – Electoral
Just so I understand: you want the columns where the lowest values occur for each row, as well as the values themselves? Could you make a small example of the end dataframe? There's probably a way I can modify this answer. – Foxhole
I added the info to the question, please see above, and thanks for the help. – Electoral
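One possible modification of this answer that keeps the headers, sketched here rather than taken from the original: apply heapq.nsmallest to the (header, value) pairs of each row, comparing on the value:

import heapq
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 1000))   # same toy data as the answer above

# Each element of 'pairs' is a list of 150 (column_header, value) tuples,
# ordered from the smallest value upward.
pairs = df.apply(lambda row: heapq.nsmallest(150, row.items(), key=lambda kv: kv[1]),
                 axis=1)

print(pairs.iloc[0][:3])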

I don't know a way to do it without looping through the rows:

df = df.transpose()                            # rows become columns: one scan line per column
for col in df.columns:
    min_values = df[col].sort_values()[0:150]  # the 150 lowest heights for this scan line
    # now calc slope/area
Untwist answered 25/2, 2019 at 19:38 Comment(2)
Why do you transpose df? Is there a way to loop through the rows instead? – Electoral
You can iterate through rows using for i, row in df.iterrows() if you want. You would have to transform row into a np.array, I think. – Untwist
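As a sketch of the iterrows route from that comment: pandas Series.nsmallest keeps the column headers as the index of the result, so no transpose or manual NumPy conversion is needed. The small dataframe below is just a stand-in for the scan data:

import numpy as np
import pandas as pd

# Stand-in scan data with numeric x coordinates as column headers.
df = pd.DataFrame(np.random.rand(4, 10),
                  columns=np.round(np.linspace(-218.7, -216.0, 10), 1))

for i, row in df.iterrows():
    lowest = row.nsmallest(3)                 # 150 in the real data
    x = lowest.index.to_numpy(dtype=float)    # x coordinates (column headers)
    y = lowest.to_numpy()                     # heights
    # ... compute the slope / area under the curve for this scan line here
    print(i, list(zip(x, y)))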

If you are iterating through the df with a for loop, as you mention in the question, you can simply do this:

for index, row in df.iterrows(): # your loop
    new_row = sorted(row.values)[:150]
    # new_row should be a list with length 150.

Test:

import numpy
import pandas
import random

# generate dummy data
l = list(range(1600))
random.shuffle(l)
a = numpy.array(l)
a = a.reshape(40, 40) # 40 rows x 40 columns
dummy_df = pandas.DataFrame(a)

# dummy_df.shape = (40, 40)

smallest = []
for idx, row in dummy_df.iterrows():
    smallest.append(sorted(row.values)[:10])

new_df = pandas.DataFrame(numpy.array(smallest))
# new_df.shape = (40, 10)
Discretion answered 25/2, 2019 at 19:46 Comment(0)
