How do I pd.merge without creating a copy of the data?
H

3

8

I am trying to join two dataframes together as follows:

df3 = pd.merge(df1,df2, how='inner', on='key')

where df1 and df2 are large datasets with millions of rows. How do I join them without having to create a third dataframe, df3?

I just want to join one onto the other, changing the original. My server doesn't have enough memory to hold an extra copy, so I need something more efficient.

Horsetail answered 11/12, 2018 at 9:37 Comment(6)
There's a merge function on the DataFrame object as well, so maybe try df1.merge(df2) with the same arguments?Donley
There is no inplace operation for pandas merging or joining or concatenate functions.Tengdin
I need to know two things: 1. are the key values unique; and 2. what are the column names. Tell me this much, and I might be able to help you.Dichotomy
Short answer: use more RAM. Longer answer: Use something like dask#Mansard
I second dask for this.Muco
@SpaceImpact as Andrew's answer and this link (pandas.pydata.org/pandas-docs/stable/reference/api/…) there is a merge method on the DataFrame object. What is the implementation? Does it still create a 3rd dataframe?Citizenry
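One way to check the question raised in the last comment is to compare object identity after the call. This is a minimal sketch on made-up two-row frames; the column names col1 and col2 are assumptions for illustration, not the asker's real schema.

```python
import pandas as pd

# Toy frames; `col1` and `col2` are hypothetical column names.
df1 = pd.DataFrame({"key": [3, 4], "col1": [1, 2]})
df2 = pd.DataFrame({"key": [3, 4], "col2": [5, 6]})

result = df1.merge(df2, how="inner", on="key")

# The method form returns a new DataFrame, just as pd.merge does.
print(result is df1)      # False
print(list(df1.columns))  # ['key', 'col1'] -- df1 itself is unchanged
```

So the method form is a convenience spelling of the same operation; it does not merge in place.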
C
0

Change the how argument to 'left' (df1 is the left frame in this example):

df1 = df1.merge(df2, on='key', how='left')
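Note that switching to how='left' changes which rows are kept, and the call still returns a new frame that is then bound to the name df1. A small sketch of the difference, using made-up data (col1, col2 and the values are assumptions):

```python
import pandas as pd

# Keys 2 and 3 appear in both frames; key 1 only in df1, key 4 only in df2.
df1 = pd.DataFrame({"key": [1, 2, 3], "col1": ["a", "b", "c"]})
df2 = pd.DataFrame({"key": [2, 3, 4], "col2": [20, 30, 40]})

inner = df1.merge(df2, on="key", how="inner")
left = df1.merge(df2, on="key", how="left")

print(len(inner))                 # 2: only keys present in both frames
print(len(left))                  # 3: every row of df1 is kept
print(left["col2"].isna().sum())  # 1: key 1 has no match, so col2 is NaN
```

If the original inner-join result is what you need, the unmatched rows would have to be dropped afterwards.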

Catina answered 17/9, 2023 at 1:39 Comment(0)
B
0

I think many duplicate rows were being created during the merge. To fix this, you can iterate over the dataframe in chunks, merge each chunk, and append each result to a list. Afterwards you can use pd.concat() to combine the inner-merged chunks.

merged_df = []
split_len = 1000
# Step through df2 in blocks of split_len rows
for row_index in range(0, len(df2), split_len):
    chunk_df = df2[row_index:row_index + split_len]
    merg_df = pd.merge(df1, chunk_df, how='inner', on='key')
    merg_df = merg_df.drop_duplicates()  # if you don't want to remove duplicates, comment out this line
    merged_df.append(merg_df)
final_df = pd.concat(merged_df)

This keeps each individual merge small, which should ease the memory pressure, although final_df still holds the full result. You can adjust split_len to suit your data.
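If even the concatenated result does not fit in memory, one variation on the chunked approach is to append each merged chunk to a file instead of collecting the chunks in a list. The sketch below is only an illustration: merge_in_chunks_to_csv is a hypothetical helper, and the toy frames and the CSV output format are assumptions.

```python
import os
import tempfile

import pandas as pd


def merge_in_chunks_to_csv(left, right, key, path, chunk_rows=1000):
    """Inner-merge `left` with `right` chunk by chunk, appending to a CSV.

    Hypothetical helper: only one merged chunk is held in memory at a time.
    Returns the total number of rows written.
    """
    rows_written = 0
    for start in range(0, len(right), chunk_rows):
        chunk = right.iloc[start:start + chunk_rows]
        merged = left.merge(chunk, how="inner", on=key)
        first = start == 0
        # Write the header once, then append the remaining chunks.
        merged.to_csv(path, mode="w" if first else "a", header=first, index=False)
        rows_written += len(merged)
    return rows_written


# Toy data: keys 5..9 are present in both frames.
df1 = pd.DataFrame({"key": range(10), "col1": range(10, 20)})
df2 = pd.DataFrame({"key": range(5, 15), "col2": range(100, 110)})

out_path = os.path.join(tempfile.mkdtemp(), "merged.csv")
n_rows = merge_in_chunks_to_csv(df1, df2, "key", out_path, chunk_rows=3)
print(n_rows)  # 5
```

The trade-off is that the result lives on disk rather than in a dataframe, so later steps would need to read it back in pieces as well.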

Blackness answered 20/11, 2023 at 9:58 Comment(0)
D
-3

You can try this. I'm not sure what your data looks like, so I'm just guessing.

import pandas as pd

def merge_dataset(df1, df2):
    df1 = df1.merge(df2, how='inner', on='key')
    print(df1)
    return df1

if __name__ == '__main__':
    d1 = {'col1': [1, 2], 'key': [3, 4]}
    d2 = {'col2': [5, 6], 'key': [3, 4]}
    df1 = pd.DataFrame(data=d1)
    df2 = pd.DataFrame(data=d2)
    # Debug
    print(df1)
    print(df2)
    merge_dataset(df1, df2)
Dominquedominquez answered 11/12, 2018 at 10:1 Comment(5)
This does not answer the question as posted. Please give it another read.Dichotomy
@coldspeed hi, I think she wants to merge df1 with df2 by changing df1. I would like to hear your understanding as wellDominquedominquez
"Basically how do I join them without having to create a third dataframe df3." df1.merge(df2, how='inner', on='key') creates "df3" and assigns it to df1... does that make sense?Dichotomy
@coldspeed "creates "df3" and assigns it to df1" answers my question. Thanks for that. So it seems we cannot get away from creating a copy?Dominquedominquez
Unfortunately, not in this case :-9Dichotomy
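Since merge always builds a new frame, one possible workaround is to add the columns to df1 directly. This is a different technique from merge (a lookup with Series.map), and it only works under the assumption that key is unique in df2 and that the column being added has no missing values of its own. The data below is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"key": [3, 4, 5], "col1": [1, 2, 3]})
df2 = pd.DataFrame({"key": [3, 4], "col2": [5, 6]})

# Assumption: `key` is unique in df2, so it can serve as a lookup table.
lookup = df2.set_index("key")["col2"]

# Adds a column to the existing df1 object instead of building a merged frame.
df1["col2"] = df1["key"].map(lookup)

# Inner-join semantics: drop the rows of df1 whose key was not found in df2.
df1.dropna(subset=["col2"], inplace=True)
print(df1)
```

This avoids materialising a full third dataframe, but it is not free of temporary allocations: the mapped column is a new Series, and pandas may still copy data internally when dropping rows.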
