Pandas Dataframe: Multiplying Two Columns

Asked 12/12, 2019 at 23:39 Answered 13/12, 2019 at 1:12

Solved python pandas matrix-multiplication

I am trying to multiply two columns (ActualSalary * FTE) within the dataframe (OPR) to create a new column (FTESalary), but somehow it has stopped at row 21357, I don't understand what went wrong or how to fix it. The two columns came from importing a csv file using the line: OPR = pd.read_csv('OPR.csv', encoding='latin1')

[In] OPR
[out]
ActualSalary    FTE
44600           1
58,000.00       1
70,000.00       1
17550           1
34693           1
15674           0.4

[In] OPR["FTESalary"] = OPR["ActualSalary"].str.replace(",", "").astype("float")*OPR["FTE"]
[In] OPR
[out]
ActualSalary    FTE FTESalary
44600           1   44600
58,000.00       1   58000
70,000.00       1   70000
17550           1   NaN
34693           1   NaN
15674           0.4 NaN

I am not expecting any NULL values as an output at all, I am really struggling with this. I would really appreciate the help. Many thanks in advance! (I am new to both coding and here, please let me know via message if I have made mistakes or can improve the way I post questions here)

Sharing the data @oppresiveslayer

[In] OPR[0:6].to_dict()
[out]
{'ActualSalary': {0: '44600',
1: '58,000.00',
2: '70,000.00',
3: '39,780.00',
4: '0.00',
5: '78,850.00'},
 'FTE': {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0}}

For more information on the two columns @charlesreid1

[in] OPR['ActualSalary'].astype
[out]
Name: ActualSalary, Length: 21567, dtype: object>

[in] OPR['FTE'].astype
[out]
Name: FTE, Length: 21567, dtype: float64>

The version I am using: python: 3.7.3, pandas: 0.25.1 on jupyter Notebook 6.0.0

Imitable answered 12/12, 2019 at 23:39 Comment(3)

Possibly a duplicate of #14059594 – Buffum 12/12, 2019 at 23:46

It is a good read for me, a lot to learn, but unfortunately not quite the same issue I am experiencing here. Thanks though @Buffum – Imitable 13/12, 2019 at 0:45

By construct the dataframes, are you referring to how I have those data in the first place? I have loaded it from a csv file. The ActualSalary is dtype: object, whereas FTE is dtype: float64. I will amend my question to include this now – Imitable 13/12, 2019 at 1:24

I believe that your ActualSalary column is a mix of strings and integers. That is the only way I've been able to recreate your error:

df = pd.DataFrame(
    {'ActualSalary': ['44600', '58,000.00', '70,000.00', 17550, 34693, 15674],
     'FTE': [1, 1, 1, 1, 1, 0.4]})

>>> df['ActualSalary'].str.replace(',', '').astype(float) * df['FTE']
0    44600.0
1    58000.0
2    70000.0
3        NaN
4        NaN
5        NaN
dtype: float64

The issue arises when you try to remove the commas:

>>> df['ActualSalary'].str.replace(',', '')
0       44600
1    58000.00
2    70000.00
3         NaN
4         NaN
5         NaN
Name: ActualSalary, dtype: object

First convert them to strings, before converting back to floats.

fte_salary = (
    df['ActualSalary'].astype(str).str.replace(',', '')  # Remove commas in string, e.g. '55,000.00' -> '55000.00'
    .astype(float)  # Convert string column to floats.
    .mul(df['FTE'])  # Multiply by new salary column by Full-Time-Equivalent (FTE) column.
)
>>> df.assign(FTESalary=fte_salary)  # Assign new column to dataframe.
      ActualSalary  FTE  FTESalary
    0        44600  1.0    44600.0
    1    58,000.00  1.0    58000.0
    2    70,000.00  1.0    70000.0
    3        17550  1.0    17550.0
    4        34693  1.0    34693.0
    5        15674  0.4     6269.6

Vertumnus answered 13/12, 2019 at 1:12 Comment(6)

I have previously used the .mul method as well but yielded the same problem I am facing, unfortunately. – Imitable 13/12, 2019 at 1:27

@Imitable I cannot replicate your error. What version of python and pandas are you using? – Vertumnus 13/12, 2019 at 1:37

python: 3.7.3, pandas: 0.25.1 I am using the jupyter Notebook 6.0.0 @Vertumnus – Imitable 13/12, 2019 at 1:42

I think the issue is that the ActualSalary column is initially a mix of integers and strings. Try first to cast it to strings and then continue as above, ie. df['ActualSalary'].astype(str).str.replace(',', '').astype(float).mul(df['FTE']) – Vertumnus 13/12, 2019 at 1:46

Thank you so much @Alexander, that fixed the issue!! The whole having to cast it to strings and then continue with my original code! I am so happy, however I don't understand why I have to cast it to string first if it's a mixture? Thank you once again! – Imitable 13/12, 2019 at 2:0

Because calling str.replace() on an integer gives you a null so effectively you are converting None to float and multiplying None and FTE hence the NaN. You convert them all to string so even the integers are strings so calling str.replace() gives the string with desired character replacements. – Natasha 28/12, 2019 at 19:33

This should work:

OTR['FTESalary'] = OTR.apply(lambda x: pd.to_numeric(x['ActualSalary'].replace(",", ""), errors='coerce') * x['FTE'], axis=1)

output

  ActualSalary  FTE  FTESalary
0        44600  1.0    44600.0
1    58,000.00  1.0    58000.0
2    70,000.00  1.0    70000.0
3        17550  1.0    17550.0
4        34693  1.0    34693.0
5        15674  0.4     6269.6

ok, i think you need to do this:

OTR['FTESalary'] = OTR.reset_index().apply(lambda x: pd.to_numeric(x['ActualSalary'].replace(",", ""), errors='coerce') * x['FTE'], axis=1).to_numpy().tolist()

Filet answered 12/12, 2019 at 23:52 Comment(7)

I have tried applying your code to mine but have this error below (I am still early stage of learning the meaning of these messages...) AttributeError: ("'int' object has no attribute 'replace'", 'occurred at index 20480') – Imitable 13/12, 2019 at 0:31

@sylvia, what is the output of pd.__version__ i think i need to install you version to see the error message. I don't mind doing so, so i can get a working version – Filet 13/12, 2019 at 0:37

it is '0.25.1' (thanks for teaching me how to check the version) @Filet – Imitable 13/12, 2019 at 0:39

@Imitable Ok, i added an update, can you try that. I got the same error as you, so i think i fixed it. IT's actually not a bug, but because we need to reset_index since you have an index set already. – Filet 13/12, 2019 at 0:47

I still have the same error: AttributeError: ("'int' object has no attribute 'replace'", 'occurred at index 20480') Is this because I don't know how to 'clean' the raw csv file? @Filet – Imitable 13/12, 2019 at 0:54

I get that error like that if i remove reset_index: KeyError: ('ActualSalary', 'occurred at index 44600'). Would you be able to give us your data, df[0:6].to_dict() ? – Filet 13/12, 2019 at 0:58

(      ActualSalary  FTE  0            44600  1.0  1        58,000.00  1.0  2        70,000.00  1.0  3        39,780.00  1.0  4             0.00  1.0  ...            ...  ...  21562        70000  1.0  21563        55000  1.0  21564        59872  1.0  21565        93000  1.0  21566        52000  1.0    [21567 rows x 2 columns],  {'ActualSalary': {0: '44600',    1: '58,000.00',    2: '70,000.00',    3: '39,780.00',    4: '0.00',    5: '78,850.00'},   'FTE': {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0}})

@Filet I've removed other columns of original dataframe – Imitable 13/12, 2019 at 1:8

I was able to do it in a couple steps, but with list comprehension which might be less readable for a beginner. It makes an intermediate column, which does the float conversion, since your ActualSalary column is full of strings at the start.

OPR["X"] = [float(x.replace(",","")) for x in OPR["ActualSalary"]]
OPR["FTESalary"] = OPR["X"]*OPR["FTE"]

Barnebas answered 12/12, 2019 at 23:56 Comment(1)

I have used your code above and unfortunately I am still having the same issue. – Imitable 13/12, 2019 at 0:34

Recommended topics

Hot tags