pandas three-way joining multiple dataframes on columns
Asked Answered
G

12

315

I have 3 CSV files. Each has the first column as the (string) names of people, while all the other columns in each dataframe are attributes of that person.

How can I "join" together all three CSV documents to create a single CSV with each row having all the attributes for each unique value of the person's string name?

The join() function in pandas specifies that I need a multiindex, but I'm confused about what a hierarchical indexing scheme has to do with making a join based on a single index.

Giralda answered 15/5, 2014 at 2:51 Comment(4)
You don't need a multiindex. It states in the join docs that of you don't have a multiindex when passing multiple columns to join on then it will handle that.Exterminate
In my trials, df1.join([df2, df3], on=[df2_col1, df3_col1]) didn't work.Giralda
You need to chain them together like in the answer given. Merge df1 and df2 then merge the result with df3Exterminate
Creating an empty data fram and appending columns to it is the simplest solution I found after getting errors from solutions provided below: onecompiler.com/python/3zhtqjrut Source: geeksforgeeks.org/…Cestoid
D
645

Zero's answer is basically a reduce operation. If I had more than a handful of dataframes, I'd put them in a list like this (generated via list comprehensions or loops or whatnot):

dfs = [df0, df1, df2, ..., dfN]

Assuming they have a common column, like name in your example, I'd do the following:

import functools as ft
df_final = ft.reduce(lambda left, right: pd.merge(left, right, on='name'), dfs)

That way, your code should work with whatever number of dataframes you want to merge.

Dora answered 28/5, 2015 at 17:8 Comment(8)
I just tried using this and it failed because reduce was replaced with functools.reduce So import functools functools.reduce(.......)Managing
How will this solution work if I the names of the fields to join are different? For example, in three data frames I could have name1, name2 and name3 respectively.Melpomene
Doesn't this mean that we have n-1 calls to the merge function? I guess in this case where the number of dataframes is small it doesn't matter, but I wonder if there's a more scalable solution.Camfort
This didn't quite work for my dfs with column multi indexes (it was injecting the 'on' as a column which worked for the first merge, but subsequent merges failed), instead I got it to work with: df = reduce(lambda left, right: left.join(right, how='outer', on='Date'), dfs)Crawford
+1 to ps0604. what if the join columns are different, does this work? should we go with pd.merge incase the join columns are different? thanksSuccussion
Thanks! What if df0,df1 have same columns to merge on and df0,df2 have same columns to merge on?Dorena
Thanks! This answer preserves the columns, other than the one mentioned in on, which is absolutely correct. concat wouldn't do it.Justin
What if they have no same column, but they only have same default column at the beginning of the dataframe? How to deal with this?Mosley
A
176

You could try this if you have 3 dataframes

# Merge multiple dataframes
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])

pd.merge(pd.merge(df1,df2,on='name'),df3,on='name')

alternatively, as mentioned by cwharland

df1.merge(df2,on='name').merge(df3,on='name')
Amadavat answered 15/5, 2014 at 7:4 Comment(4)
For cleaner looks you can chain them df1.merge(df2,on='name').merge(df3,on='name')Exterminate
How will this solution work if I the names of the fields to join are different? For example, in three data frames I could have name1, name2 and name3 respectivelyMelpomene
@Melpomene df1.merge(df2,left_on='name1', right_on='name2').merge(df3,left_on='name1', right_on='name3').drop(columns=['name2', 'name3']).rename(columns={'name1':'name'})Rostock
and further, how to do this using the index. Doesn't seem to work if 'name' is the index and not a column name.Linnell
H
162

This is an ideal situation for the join method

The join method is built exactly for these types of situations. You can join any number of DataFrames together with it. The calling DataFrame joins with the index of the collection of passed DataFrames. To work with multiple DataFrames, you must put the joining columns in the index.

The code would look something like this:

filenames = ['fn1', 'fn2', 'fn3', 'fn4',....]
dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames)]
dfs[0].join(dfs[1:])

With @zero's data, you could do this:

df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])

dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
dfs[0].join(dfs[1:])

     attr11 attr12 attr21 attr22 attr31 attr32
name                                          
a         5      9      5     19     15     49
b         4     61     14     16      4     36
c        24      9      4      9     14      9
Heptateuch answered 6/11, 2017 at 22:4 Comment(4)
Joining all of the dfs to an empty dataframe also works: pd.DataFrame().join(dfs, how="outer"). This can be cleaner in some situations.Xenocryst
This is decent advice and has now been incorporated into pandas merging 101 (see the section on merging multiple dataframes). It's worth noting that if your join keys are unique, using pd.concat will result in simpler syntax: pd.concat([df.set_index('name') for df in dfs], axis=1, join='inner').reset_index(). concat is also more versatile when dealing with duplicate column names across multiple dfs (join isn't as good at this) although you can only perform inner or outer joins with it.Turgor
dfs[0].join(dfs[1:]) should be edited to dfs[0].join(dfs[1:], sort=False) because otherwise a FutureWarning will pop up. Thanks for the nice example.Wornout
I get an error on trying that: ValueError: Indexes have overlapping values, although, by inspection of the individual dataframes in the list, they don't seem to have overlapping values.Santiago
M
41

In python 3.6.3 with pandas 0.22.0 you can also use concat as long as you set as index the columns you want to use for the joining:

pd.concat(
    objs=(iDF.set_index('name') for iDF in (df1, df2, df3)),
    axis=1, 
    join='inner'
).reset_index()

where df1, df2, and df3 are defined as in John Galt's answer:

import pandas as pd
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12']
)
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22']
)
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32']
)
Mauer answered 9/8, 2018 at 15:42 Comment(3)
This should be the accepted answer. It's the fastest.Coefficient
What if dataframe shapes are different?Andee
@AbhilashRamteke If you mean that they have different number or rows (so the name column is not the same in all data frames) then join='outer' should preserve them all, but you will have missing values. No issues with respect to different column sets, as long as they all share the name column, which is used for indexMauer
C
21

This can also be done as follows for a list of dataframes df_list:

df = df_list[0]
for df_ in df_list[1:]:
    df = df.merge(df_, on='join_col_name')

or if the dataframes are in a generator object (e.g. to reduce memory consumption):

df = next(df_list)
for df_ in df_list:
    df = df.merge(df_, on='join_col_name')
Colcothar answered 25/10, 2016 at 22:1 Comment(0)
O
12

Simple Solution:

If the column names are similar:

 df1.merge(df2,on='col_name').merge(df3,on='col_name')

If the column names are different:

df1.merge(df2,left_on='col_name1', right_on='col_name2').merge(df3,left_on='col_name1', right_on='col_name3').drop(columns=['col_name2', 'col_name3']).rename(columns={'col_name1':'col_name'})
Optician answered 14/5, 2019 at 9:30 Comment(0)
G
8

One does not need a multiindex to perform join operations. One just need to set correctly the index column on which to perform the join operations (which command df.set_index('Name') for example)

The join operation is by default performed on index. In your case, you just have to specify that the Name column corresponds to your index. Below is an example

A tutorial may be useful.

# Simple example where dataframes index are the name on which to perform
# the join operations
import pandas as pd
import numpy as np
name = ['Sophia' ,'Emma' ,'Isabella' ,'Olivia' ,'Ava' ,'Emily' ,'Abigail' ,'Mia']
df1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=name)
df2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'],         index=name)
df3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'],     index=name)
df = df1.join(df2)
df = df.join(df3)

# If you have a 'Name' column that is not the index of your dataframe,
# one can set this column to be the index
# 1) Create a column 'Name' based on the previous index
df1['Name'] = df1.index
# 1) Select the index from column 'Name'
df1 = df1.set_index('Name')

# If indexes are different, one may have to play with parameter how
gf1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=range(8))
gf2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=range(2,10))
gf3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=range(4,12))

gf = gf1.join(gf2, how='outer')
gf = gf.join(gf3, how='outer')
Gimp answered 15/5, 2014 at 7:26 Comment(0)
A
8

Here is a method to merge a dictionary of data frames while keeping the column names in sync with the dictionary. Also it fills in missing values if needed:

This is the function to merge a dict of data frames

def MergeDfDict(dfDict, onCols, how='outer', naFill=None):
  keys = dfDict.keys()
  for i in range(len(keys)):
    key = keys[i]
    df0 = dfDict[key]
    cols = list(df0.columns)
    valueCols = list(filter(lambda x: x not in (onCols), cols))
    df0 = df0[onCols + valueCols]
    df0.columns = onCols + [(s + '_' + key) for s in valueCols] 

    if (i == 0):
      outDf = df0
    else:
      outDf = pd.merge(outDf, df0, how=how, on=onCols)   

  if (naFill != None):
    outDf = outDf.fillna(naFill)

  return(outDf)

OK, lets generates data and test this:

def GenDf(size):
  df = pd.DataFrame({'categ1':np.random.choice(a=['a', 'b', 'c', 'd', 'e'], size=size, replace=True),
                      'categ2':np.random.choice(a=['A', 'B'], size=size, replace=True), 
                      'col1':np.random.uniform(low=0.0, high=100.0, size=size), 
                      'col2':np.random.uniform(low=0.0, high=100.0, size=size)
                      })
  df = df.sort_values(['categ2', 'categ1', 'col1', 'col2'])
  return(df)


size = 5
dfDict = {'US':GenDf(size), 'IN':GenDf(size), 'GER':GenDf(size)}   
MergeDfDict(dfDict=dfDict, onCols=['categ1', 'categ2'], how='outer', naFill=0)
Assorted answered 18/4, 2017 at 22:7 Comment(1)
Nice method. See correction below in MergeDfDict: keys = dfDict.keys(); i = 0; for key in keys:Laflamme
T
5

There is another solution from the pandas documentation (that I don't see here),

using the .append

>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
   A  B
0  1  2
1  3  4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
   A  B
0  5  6
1  7  8
>>> df.append(df2, ignore_index=True)
   A  B
0  1  2
1  3  4
2  5  6
3  7  8

The ignore_index=True is used to ignore the index of the appended dataframe, replacing it with the next index available in the source one.

If there are different column names, Nan will be introduced.

Torietorii answered 5/4, 2018 at 15:15 Comment(1)
it's semantic, for someone using the word "join" to say putting together the two dataframe. (not necessarely as the SQL join operation)Torietorii
P
2

I tweaked the accepted answer to perform the operation for multiple dataframes on different suffix parameters using reduce and i guess it can be extended to different on parameters as well.

from functools import reduce 

dfs_with_suffixes = [(df2,suffix2), (df3,suffix3), 
                     (df4,suffix4)]

merge_one = lambda x,y,sfx:pd.merge(x,y,on=['col1','col2'..], suffixes=sfx)

merged = reduce(lambda left,right:merge_one(left,*right), dfs_with_suffixes, df1)
Pistol answered 20/11, 2020 at 10:52 Comment(1)
Tweaked approach is great; however, a small fix must be added to avoid ValueError: too many values to unpack (expected 2), a left suffices as empty string "". The final merge function could be as follow: merge_one = lambda x,y,sfx:pd.merge(x,y,on=['col1','col2'..], suffixes=('', sfx)) # Left gets no suffix, right gets something identifiableMacbeth
P
2
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12']
)
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['d', 14, 16]]

),
    columns=['name', 'attr21', 'attr22']
)
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['c', 4, 36],
    ['d', 14, 9]]),
    columns=['name', 'attr31', 'attr32']
)
df4 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['c', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr41', 'attr42']
)

Three ways to join list dataframe

  1. pandas.concat
dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
# cant not run if index not unique 
dfs = pd.concat(dfs, join='outer', axis = 1) 
  1. functools.reduce
dfs = [df1, df2, df3, df4]
# still run with index not unique 
import functools as ft
df_final = ft.reduce(lambda left, right: pd.merge(left, right, on='name', how = 'outer'), dfs)
  1. join
# cant not run if index not unique 
dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
dfs[0].join(dfs[1:], how = 'outer')
Parterre answered 3/8, 2022 at 8:21 Comment(0)
C
1

Joining together all three can be done using .join() function.

You have three DataFrames lets say df1, df2, df3. To join these into one DataFrame you can:

df = df1.join(df2).join(df3)

This is the simplest way I found to do this task.

Citronella answered 8/12, 2022 at 11:21 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.