Pandas Dataframe: Multiplying Two Columns
Asked Answered
I

3

7

I am trying to multiply two columns (ActualSalary * FTE) within the dataframe (OPR) to create a new column (FTESalary), but somehow it has stopped at row 21357, I don't understand what went wrong or how to fix it. The two columns came from importing a csv file using the line: OPR = pd.read_csv('OPR.csv', encoding='latin1')

[In] OPR
[out]
ActualSalary    FTE
44600           1
58,000.00       1
70,000.00       1
17550           1
34693           1
15674           0.4

[In] OPR["FTESalary"] = OPR["ActualSalary"].str.replace(",", "").astype("float")*OPR["FTE"]
[In] OPR
[out]
ActualSalary    FTE FTESalary
44600           1   44600
58,000.00       1   58000
70,000.00       1   70000
17550           1   NaN
34693           1   NaN
15674           0.4 NaN

I am not expecting any NULL values as an output at all, I am really struggling with this. I would really appreciate the help. Many thanks in advance! (I am new to both coding and here, please let me know via message if I have made mistakes or can improve the way I post questions here)

Sharing the data @oppresiveslayer

[In] OPR[0:6].to_dict()
[out]
{'ActualSalary': {0: '44600',
1: '58,000.00',
2: '70,000.00',
3: '39,780.00',
4: '0.00',
5: '78,850.00'},
 'FTE': {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0}}

For more information on the two columns @charlesreid1

[in] OPR['ActualSalary'].astype
[out]
Name: ActualSalary, Length: 21567, dtype: object>

[in] OPR['FTE'].astype
[out]
Name: FTE, Length: 21567, dtype: float64>

The version I am using: python: 3.7.3, pandas: 0.25.1 on jupyter Notebook 6.0.0

Imitable answered 12/12, 2019 at 23:39 Comment(3)
Possibly a duplicate of #14059594Buffum
It is a good read for me, a lot to learn, but unfortunately not quite the same issue I am experiencing here. Thanks though @BuffumImitable
By construct the dataframes, are you referring to how I have those data in the first place? I have loaded it from a csv file. The ActualSalary is dtype: object, whereas FTE is dtype: float64. I will amend my question to include this nowImitable
V
5

I believe that your ActualSalary column is a mix of strings and integers. That is the only way I've been able to recreate your error:

df = pd.DataFrame(
    {'ActualSalary': ['44600', '58,000.00', '70,000.00', 17550, 34693, 15674],
     'FTE': [1, 1, 1, 1, 1, 0.4]})

>>> df['ActualSalary'].str.replace(',', '').astype(float) * df['FTE']
0    44600.0
1    58000.0
2    70000.0
3        NaN
4        NaN
5        NaN
dtype: float64

The issue arises when you try to remove the commas:

>>> df['ActualSalary'].str.replace(',', '')
0       44600
1    58000.00
2    70000.00
3         NaN
4         NaN
5         NaN
Name: ActualSalary, dtype: object

First convert them to strings, before converting back to floats.

fte_salary = (
    df['ActualSalary'].astype(str).str.replace(',', '')  # Remove commas in string, e.g. '55,000.00' -> '55000.00'
    .astype(float)  # Convert string column to floats.
    .mul(df['FTE'])  # Multiply by new salary column by Full-Time-Equivalent (FTE) column.
)
>>> df.assign(FTESalary=fte_salary)  # Assign new column to dataframe.
      ActualSalary  FTE  FTESalary
    0        44600  1.0    44600.0
    1    58,000.00  1.0    58000.0
    2    70,000.00  1.0    70000.0
    3        17550  1.0    17550.0
    4        34693  1.0    34693.0
    5        15674  0.4     6269.6
Vertumnus answered 13/12, 2019 at 1:12 Comment(6)
I have previously used the .mul method as well but yielded the same problem I am facing, unfortunately.Imitable
@Imitable I cannot replicate your error. What version of python and pandas are you using?Vertumnus
python: 3.7.3, pandas: 0.25.1 I am using the jupyter Notebook 6.0.0 @VertumnusImitable
I think the issue is that the ActualSalary column is initially a mix of integers and strings. Try first to cast it to strings and then continue as above, ie. df['ActualSalary'].astype(str).str.replace(',', '').astype(float).mul(df['FTE'])Vertumnus
Thank you so much @Alexander, that fixed the issue!! The whole having to cast it to strings and then continue with my original code! I am so happy, however I don't understand why I have to cast it to string first if it's a mixture? Thank you once again!Imitable
Because calling str.replace() on an integer gives you a null so effectively you are converting None to float and multiplying None and FTE hence the NaN. You convert them all to string so even the integers are strings so calling str.replace() gives the string with desired character replacements.Natasha
F
0

This should work:

OTR['FTESalary'] = OTR.apply(lambda x: pd.to_numeric(x['ActualSalary'].replace(",", ""), errors='coerce') * x['FTE'], axis=1)

output

  ActualSalary  FTE  FTESalary
0        44600  1.0    44600.0
1    58,000.00  1.0    58000.0
2    70,000.00  1.0    70000.0
3        17550  1.0    17550.0
4        34693  1.0    34693.0
5        15674  0.4     6269.6

ok, i think you need to do this:

OTR['FTESalary'] = OTR.reset_index().apply(lambda x: pd.to_numeric(x['ActualSalary'].replace(",", ""), errors='coerce') * x['FTE'], axis=1).to_numpy().tolist() 
Filet answered 12/12, 2019 at 23:52 Comment(7)
I have tried applying your code to mine but have this error below (I am still early stage of learning the meaning of these messages...) AttributeError: ("'int' object has no attribute 'replace'", 'occurred at index 20480')Imitable
@sylvia, what is the output of pd.__version__ i think i need to install you version to see the error message. I don't mind doing so, so i can get a working versionFilet
it is '0.25.1' (thanks for teaching me how to check the version) @FiletImitable
@Imitable Ok, i added an update, can you try that. I got the same error as you, so i think i fixed it. IT's actually not a bug, but because we need to reset_index since you have an index set already.Filet
I still have the same error: AttributeError: ("'int' object has no attribute 'replace'", 'occurred at index 20480') Is this because I don't know how to 'clean' the raw csv file? @FiletImitable
I get that error like that if i remove reset_index: KeyError: ('ActualSalary', 'occurred at index 44600'). Would you be able to give us your data, df[0:6].to_dict() ?Filet
( ActualSalary FTE 0 44600 1.0 1 58,000.00 1.0 2 70,000.00 1.0 3 39,780.00 1.0 4 0.00 1.0 ... ... ... 21562 70000 1.0 21563 55000 1.0 21564 59872 1.0 21565 93000 1.0 21566 52000 1.0 [21567 rows x 2 columns], {'ActualSalary': {0: '44600', 1: '58,000.00', 2: '70,000.00', 3: '39,780.00', 4: '0.00', 5: '78,850.00'}, 'FTE': {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0}}) @Filet I've removed other columns of original dataframeImitable
B
0

I was able to do it in a couple steps, but with list comprehension which might be less readable for a beginner. It makes an intermediate column, which does the float conversion, since your ActualSalary column is full of strings at the start.

OPR["X"] = [float(x.replace(",","")) for x in OPR["ActualSalary"]]
OPR["FTESalary"] = OPR["X"]*OPR["FTE"]
Barnebas answered 12/12, 2019 at 23:56 Comment(1)
I have used your code above and unfortunately I am still having the same issue.Imitable

© 2022 - 2024 — McMap. All rights reserved.