Pandas copy value from one column to another based on a value third column that is not constant
Asked Answered
D

1

3

I have a large dataset that is one huge table that actually should be many tables. The headers are buried in the rows for the subsets.

My goal is to pull those headers out into a new column such that I can filter by that column to get the data I want (one header at a time). I've created an empty header column for this. There is always a series of 3 NaN values in the SCORE column where the first value in the NAME column in that series is the HEADER I want. So I'm thinking something about that relationship could be leveraged.

Current Pandas data frame has this structure:

s = '''HEADER,NAME,SCORE
NaN,Header 1,NaN
NaN,Random Junk,NaN
NaN,Random Junk,NaN
NaN,Ed,98
NaN,Gary,78
NaN,Floyd,89
NaN,Header 2,NaN
NaN,Random Junk,NaN
NaN,Random Junk,NaN
NaN,Mary,96
NaN,Steve,78'''

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(s))
HEADER   NAME              SCORE
NaN      Header 1          NaN
NaN      Random Junk       NaN
NaN      Random Junk       NaN
NaN      Ed                98
NaN      Gary              78
NaN      Floyd             89
...      ...               ...
NaN      Header 2          NaN
NaN      Random Junk       NaN
NaN      Random Junk       NaN
NaN      Mary              96
NaN      Steve             78

and I want this:

HEADER        NAME              SCORE
Header 1      Header 1          NaN
Header 1      Random Junk       NaN
Header 1      Random Junk       NaN
Header 1      Ed                98
Header 1      Gary              78
Header 1      Floyd             89
...           ...               ...
Header 2      Header 2          NaN
Header 2      Random Junk       NaN
Header 2      Random Junk       NaN
Header 2      Mary              96
Header 2      Steve             78

so I can then remove the NaN rows and get what I'm truly after which is this:

HEADER        NAME              SCORE
Header 1      Ed                98
Header 1      Gary              78
Header 1      Floyd             89
...           ...               ...
Header 2      Mary              96
Header 2      Steve             78

After much searching, I'm not able to figure out how to do this conditional editing like this. Would appreciate any help you can give.

Dubitation answered 4/12, 2021 at 4:30 Comment(4)
does your actual data contain NaN? like for ex: 98 nan 89?Purposeful
Yes, it appears as I wrote it above in the data frame. In the source material there was nothing there (empty value). It was scraped from a website using read_html and this is the resulting data frame from that.Dubitation
can the value of Ed Garry Floyd be nan?Purposeful
No, I've removed all NaN values from that column.Dubitation
E
4

The header lines occur when SCORE has 3 NaN and 1 non-NaN in sequence, so:

  1. Check for this condition using shift, isna, and notna.
  2. mask the HEADER column as NAME when this condition is met.
  3. ffill (forward fill) the new HEADER.
  4. dropna based on the SCORE.
is_header = df.SCORE.isna() & df.SCORE.shift(-1).isna() & df.SCORE.shift(-2).isna() & df.SCORE.shift(-3).notna()
df.HEADER = df.HEADER.mask(is_header, df.NAME).ffill()
df = df.dropna(subset=['SCORE'])

#       HEADER   NAME  SCORE
# 3   Header 1     Ed   98.0
# 4   Header 1   Gary   78.0
# 5   Header 1  Floyd   89.0
# 9   Header 2   Mary   96.0
# 10  Header 2  Steve   78.0
Engraft answered 4/12, 2021 at 4:49 Comment(2)
I was also thinking on the similar line but I am skeptic what will happen if original data has nan. Like Ed Score is nan (+1).Purposeful
yep good point, it's worth notingEngraft

© 2022 - 2024 — McMap. All rights reserved.