Pandas - get first n-rows based on percentage

I have a DataFrame and want to take a certain number of records from it, but I want to specify that number as a percentage rather than a count.

For example,

df.head(n=10)

This returns the first 10 records of the data set. Instead of 10 records, I want the first 5% of the records in my data set. How can I do this in pandas?

I'm looking for something like this:

df.head(frac=0.05)

Is there any simple way to get this?

Annamarieannamese answered 4/5, 2018 at 10:54 Comment(2)
Are you looking for df.sample(frac=*)? - Ergener
@shivsn No, I don't need a random sample. I want the first n% of rows, with df.head taking a fraction the way df.sample does. - Annamarieannamese
D
32

I want to pop first 5% of record

There is no built-in method, but you can multiply the total number of rows by your percentage and pass the result to the head method:

n = 5
df.head(int(len(df)*(n/100)))

So if your DataFrame contains 1000 rows and n = 5 (that is, 5%), you will get the first 50 rows.
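
The same calculation can be wrapped in a small helper that takes a fraction, similar to the frac argument of df.sample. This is only a sketch: head_frac is a hypothetical name, not part of pandas, and the sample DataFrame is made up for illustration.

```python
import pandas as pd


def head_frac(df: pd.DataFrame, frac: float) -> pd.DataFrame:
    """Return the first `frac` share of rows (hypothetical helper, not a pandas API)."""
    if not 0 <= frac <= 1:
        raise ValueError("frac must be between 0 and 1")
    # int() truncates, so very small frames can yield zero rows.
    n_rows = int(len(df) * frac)
    return df.head(n_rows)


# Made-up data: 1000 rows numbered 0..999.
df = pd.DataFrame({"value": range(1000)})
first_5_percent = head_frac(df, 0.05)
print(len(first_5_percent))
```

Because int() rounds down, a frame with fewer than 20 rows returns nothing at 5%; use math.ceil instead if you always want at least one row.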

Derose answered 4/5, 2018 at 10:56 Comment(0)
L
3

I've extended Mihai's answer for my own use, and it may be useful to others. The goal is to automate selecting the first n records when splitting a time series, so that older records go to training and more recent records go to testing.

# Assumes:
# import pandas as pd
# df = pd.DataFrame...

def sample_first_prows(data, perc=0.7):
    # Return the first `perc` fraction of rows, by position.
    return data.head(int(len(data) * perc))

train = sample_first_prows(df)
# Slice by position from where train ends, so no row lands in both sets.
test = df.iloc[len(train):]
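
As a rough illustration of the same idea on a frame with a date index, the sketch below splits purely by position, so it does not depend on the index labels. The function name split_first_rows and the sample data are assumptions made for this example.

```python
import pandas as pd


def split_first_rows(data: pd.DataFrame, perc: float = 0.7):
    """Split by position: first `perc` of rows for training, the rest for testing."""
    cut = int(len(data) * perc)
    # iloc slices by position, so this works for any index (dates, strings, ...).
    return data.iloc[:cut], data.iloc[cut:]


# Hypothetical time series: 8 daily observations, oldest first.
ts = pd.DataFrame(
    {"y": range(8)},
    index=pd.date_range("2020-01-01", periods=8, freq="D"),
)
train, test = split_first_rows(ts, perc=0.75)
print(len(train), len(test))
```

This assumes the rows are already sorted from oldest to newest; if not, sort by the index first.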
Luting answered 22/5, 2020 at 11:4 Comment(0)
S
1

I had the same problem, and @mihai's solution was useful. For my case I rewrote it as:

    percentage_to_take = 5/100
    rows = int(df.shape[0]*percentage_to_take)
    df.head(rows)

For the last n% of rows, df.tail(rows) works. Note that df.head(-rows) is not equivalent: it returns every row except the last `rows`.
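
A short check of how tail and a negative head argument behave, using a made-up 20-row frame:

```python
import pandas as pd

# Made-up data: 20 rows numbered 0..19.
df = pd.DataFrame({"value": range(20)})
rows = int(df.shape[0] * 0.25)  # 25% of 20 rows

last_rows = df.tail(rows)      # the last `rows` rows
all_but_last = df.head(-rows)  # everything except the last `rows` rows

print(last_rows["value"].tolist())
print(len(all_but_last))
```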

Sjoberg answered 9/1, 2023 at 13:53 Comment(0)
M
0

Maybe this will help. It takes the first 5% of rows within each 'id' group:

tt  = tmp.groupby('id').apply(lambda x: x.head(int(len(x)*0.05))).reset_index(drop=True)
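
For comparison, here is a sketch of the same per-group selection without apply, using cumcount and the group sizes. The column names and data are invented for the example, and the fraction is set to 0.5 so the small groups return something.

```python
import pandas as pd

# Hypothetical data: group "a" has 4 rows, group "b" has 2 rows.
tmp = pd.DataFrame({"id": list("aaaabb"), "value": [1, 2, 3, 4, 5, 6]})
frac = 0.5

groups = tmp.groupby("id")
# Position of each row inside its group: 0, 1, 2, ...
position = groups.cumcount()
# Number of rows to keep per group, truncated like int().
quota = (groups["id"].transform("size") * frac).astype(int)

tt = tmp[position < quota].reset_index(drop=True)
print(tt)
```

This keeps the rows in their original order and avoids calling a Python lambda once per group.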
Midbrain answered 31/12, 2020 at 14:53 Comment(0)
W
0

With pandas 2.1 you can use the quantile method. The example below keeps the rows whose 'Column' value lies below the 10th percentile or above the 90th percentile, so it selects by value rather than by row position:

df.loc[
    (df['Column'] < df['Column'].quantile(.10)) |
    (df['Column'] > df['Column'].quantile(.90))
]
Westernmost answered 5/10, 2023 at 17:59 Comment(1)
This, however, assumes that at least one column has a numerical dtype, which the OP has not specified. - Wittenberg
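
To see how a quantile filter differs from taking the first rows, consider this small sketch with made-up data stored in descending order:

```python
import pandas as pd

# Hypothetical data whose values are stored in descending order: 9, 8, ..., 0.
df = pd.DataFrame({"Column": range(9, -1, -1)})

# First 10% of rows by position.
by_position = df.head(int(len(df) * 0.1))
# Rows whose value is below the 10th percentile.
by_value = df.loc[df["Column"] < df["Column"].quantile(0.10)]

print(by_position.index.tolist(), by_value.index.tolist())
```

The two selections pick different rows here, so the quantile approach only matches "first n%" when the frame is already sorted by that column.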
G
-3
import numpy as np
import pandas as pd

df=pd.DataFrame(np.random.randn(10,2))
print(df)
          0         1
0  0.375727 -1.297127
1 -0.676528  0.301175
2 -2.236334  0.154765
3 -0.127439  0.415495
4  1.399427 -1.244539
5 -0.884309 -0.108502
6 -0.884931  2.089305
7  0.075599  0.404521
8  1.836577 -0.762597
9  0.294883  0.540444

# Random 70% of the DataFrame

part_70=df.sample(frac=0.7,random_state=10)
print(part_70)
          0         1
8  1.836577 -0.762597
2 -2.236334  0.154765
5 -0.884309 -0.108502
6 -0.884931  2.089305
3 -0.127439  0.415495
1 -0.676528  0.301175
0  0.375727 -1.297127
Gish answered 12/8, 2020 at 13:15 Comment(1)
Thanks for the response, but my requirement was to take the top n% of records. sample returns rows in random order. - Annamarieannamese
