How to concatenate multiple pandas.DataFrames without running into MemoryError
G

11

33

I have three DataFrames that I'm trying to concatenate.

concat_df = pd.concat([df1, df2, df3])

This results in a MemoryError. How can I resolve this?

Note that most of the existing similar questions are about MemoryErrors occurring when reading large files. That's not my problem: I have already read my files into DataFrames. I just can't concatenate that data.

Garrygarson answered 23/6, 2017 at 7:20 Comment(8)
are those time series? do you want to concat them on dates?Antilles
I want to concat on the index. It's not a time series.Garrygarson
Have you added a bounty because you do not want to write files?Glovsky
@Glovsky I just wanted to draw more attention to the question, and to see if writing to a csv was the only option, or if there was a more elegant solution.Garrygarson
Well, my only other idea was to do like JohnE suggests in his answer...Glovsky
What are the column types? Conversion may be useful in this case.Joaquin
@bluprince: 'I want to concat on the index' seems to be in conflict with pd.concat([df1, df2, df3]), which concatenates on columns. Do your dfs have the same number of rows or the same number of columns?Jaquith
@B.M. same number of columns.Garrygarson
G
4

I'm grateful to the community for their answers. However, in my case, I found out that the problem was simply that I was using 32-bit Python.

Windows defines memory limits for 32-bit and 64-bit processes. A 32-bit process gets only 2 GB of address space, so even if your machine has more than 2 GB of RAM and runs a 64-bit OS, any 32-bit process is still limited to roughly 2 GB - in my case, that process was Python.

I upgraded to 64-bit Python and haven't had a memory error since!
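
If you want to confirm which build you are running before upgrading, here is a quick check using only the standard library:

import struct
import sys

# prints 32 on a 32-bit Python build, 64 on a 64-bit build
print(struct.calcsize('P') * 8)

# equivalently, sys.maxsize is far larger than 2**32 on 64-bit builds
print(sys.maxsize > 2**32)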

Other relevant questions are: Python 32-bit memory limits on 64bit windows, Should I use Python 32bit or Python 64bit, Why is this numpy array too big to load?

Garrygarson answered 22/10, 2017 at 19:56 Comment(0)
J
42

The problem, as noted in the other answers, is one of memory. One solution is to store the data on disk piece by piece and then build a single DataFrame from it.

With such huge data, performance is an issue.

CSV solutions are very slow, since everything is converted to and from text. HDF5 solutions are shorter, more elegant and faster, since they work in binary mode. I propose a third binary approach, with pickle, which seems to be even faster but is more technical and needs a bit more headroom. And a fourth, done by hand.

Here is the code:

import numpy as np
import pandas as pd
import os
import pickle

# a DataFrame factory:
dfs = []
for i in range(10):
    dfs.append(pd.DataFrame(np.empty((10**5, 4)), columns=range(4)))

# a csv solution
def bycsv(dfs):
    md, hd = 'w', True
    for df in dfs:
        df.to_csv('df_all.csv', mode=md, header=hd, index=False)
        md, hd = 'a', False   # append without header after the first frame
    #del dfs
    df_all = pd.read_csv('df_all.csv', index_col=None)
    os.remove('df_all.csv')
    return df_all

    

Better solutions:

def byHDF(dfs):
    store = pd.HDFStore('df_all.h5')
    for df in dfs:
        # data_columns=True indexes every column so it can later be queried
        store.append('df', df, data_columns=True)
    #del dfs
    df = store.select('df')
    store.close()
    os.remove('df_all.h5')
    return df

def bypickle(dfs):
    c = []
    with open('df_all.pkl', 'ab') as f:
        for df in dfs:
            pickle.dump(df, f)      # dump the frames one after another
            c.append(len(df))
    #del dfs
    with open('df_all.pkl', 'rb') as f:
        df_all = pickle.load(f)
        offset = len(df_all)
        # pre-allocate room for the remaining rows
        # (pd.concat replaces the now-removed DataFrame.append)
        df_all = pd.concat(
            [df_all, pd.DataFrame(np.empty((sum(c[1:]), 4)))],
            ignore_index=True)

        for size in c[1:]:
            df = pickle.load(f)
            df_all.iloc[offset:offset + size] = df.values
            offset += size
    os.remove('df_all.pkl')
    return df_all
    

For homogeneous DataFrames, we can do even better:

def byhand(dfs):
    # only works when every frame shares one homogeneous, non-object dtype
    mtot = 0
    with open('df_all.bin', 'wb') as f:
        for df in dfs:
            m, n = df.shape
            mtot += m
            f.write(df.values.tobytes())   # raw binary dump of the underlying array
            typ = df.values.dtype
    #del dfs
    with open('df_all.bin', 'rb') as f:
        buffer = f.read()
        data = np.frombuffer(buffer, dtype=typ).reshape(mtot, n)
        df_all = pd.DataFrame(data=data, columns=list(range(n)))
    os.remove('df_all.bin')
    return df_all

And some tests on small (32 MB) data to compare performance; multiply by about 128 to extrapolate to 4 GB.

In [92]: %time w=bycsv(dfs)
Wall time: 8.06 s

In [93]: %time x=byHDF(dfs)
Wall time: 547 ms

In [94]: %time v=bypickle(dfs)
Wall time: 219 ms

In [95]: %time y=byhand(dfs)
Wall time: 109 ms

A check:

In [195]: (x.values==w.values).all()
Out[195]: True

In [196]: (x.values==v.values).all()
Out[196]: True

In [197]: (x.values==y.values).all()
Out[197]: True


            

Of course all of that must be improved and tuned to fit your problem.

For example, df3 can be split into chunks of size 'total_memory_size - df_total_size' so that bypickle can run.
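
For instance, a rough sketch of splitting a DataFrame into row chunks (chunk_rows is a hypothetical size you would derive from your available memory):

def split_rows(df, chunk_rows):
    # slice the frame into pieces of at most chunk_rows rows each
    return [df.iloc[i:i + chunk_rows] for i in range(0, len(df), chunk_rows)]

# e.g. feed df1, df2 and the pieces of df3 to bypickle
chunks = [df1, df2] + split_rows(df3, chunk_rows=10**5)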

I can edit this if you give more information on your data structure and size. Beautiful question!

Jaquith answered 4/7, 2017 at 16:40 Comment(7)
I tried to use the solution byhand and received an error: cannot create an OBJECT array from memory buffer. I'm not sure that it could be fixed in Python3.Puton
Great post! What would you need to do to concatenate these by column instead?Diatribe
Awesome comparison and very informative, thank you!Dilator
None of these, except for bycsv, seem to work in the latest version of the libraries.Crepitate
@Crepitate I had missed import os and import pickle; I've added them.Jaquith
these solutions are just great! What if I don't know my columns and the dfs have different columns? Any efficient solution where we don't need to specify the columns in advance and they can be different in different dataframes (NaNs go for missing data)?Cornwallis
PS I also have problems with the last versions of the libraries (excluding the CSV solution which works).Cornwallis
O
23

I advise you to write your DataFrames into a single CSV file by appending, and then to read that CSV file back.

Execute this:

# write df1 content in file.csv
df1.to_csv('file.csv', index=False)
# append df2 content to file.csv
df2.to_csv('file.csv', mode='a', index=False)
# append df3 content to file.csv
df3.to_csv('file.csv', mode='a', index=False)

# free memory
del df1, df2, df3

# read all df1, df2, df3 contents
df = pd.read_csv('file.csv')

If this solution isn't performant enough, or if you need to concatenate even larger files, do:

df1.to_csv('file.csv', index=False)
# write the others without headers, so appending them later does not inject extra header rows
df2.to_csv('file1.csv', index=False, header=False)
df3.to_csv('file2.csv', index=False, header=False)

del df1, df2, df3

Then run the bash commands:

cat file1.csv >> file.csv
cat file2.csv >> file.csv

Or concatenate the CSV files in Python:

def concat(file1, file2):
    # append the contents of file2 onto the end of file1
    with open(file2, 'r') as f2:
        data = f2.read()
    with open(file1, 'a') as f1:
        f1.write(data)

concat('file.csv', 'file1.csv')
concat('file.csv', 'file2.csv')

Afterwards, read the combined file:

df = pd.read_csv('file.csv')
Orelle answered 23/6, 2017 at 15:7 Comment(5)
But if we want to concatenate along columns, i.e. axis=1, then your answer will not work!Omar
No, for big files pandas raises MemoryError.Orelle
Currently you should use header=False instead of columns=FalseGile
@AbhilashAwasthi the paste command may be a better option after dumping files on disk.Raff
last option does not work " AttributeError: 'str' object has no attribute 'read'"Glennisglennon
E
11

Kinda taking a guess here, but maybe:

df1 = pd.concat([df1,df2])
del df2
df1 = pd.concat([df1,df3])
del df3

Obviously, you could do that more as a loop (see the sketch below), but the key is that you delete df2, df3, etc. as you go. As you are doing it in the question, you never clear out the old DataFrames, so you are using about twice as much memory as you need to.
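
A rough sketch of that loop, assuming the frames are first collected in a list (dropping the individual names is what actually lets each frame be freed once it has been merged):

dfs = [df1, df2, df3]
del df1, df2, df3                 # the list now holds the only references

concat_df = dfs.pop(0)
while dfs:
    # each popped frame becomes garbage as soon as the concat returns
    concat_df = pd.concat([concat_df, dfs.pop(0)])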

More generally, if you are reading and concatenating, I'd do it something like this (if you had 3 CSVs: foo0, foo1, foo2):

concat_df = pd.DataFrame()
for i in range(3):
    temp_df = pd.read_csv('foo'+str(i)+'.csv')
    concat_df = pd.concat( [concat_df, temp_df] )

In other words, as you are reading in files, you only keep the small dataframes in memory temporarily, until you concatenate them into the combined df, concat_df. As you currently do it, you are keeping around all the smaller dataframes, even after concatenating them.

Ebb answered 28/6, 2017 at 23:41 Comment(0)
A
8

Similar to what @glegoux suggests, pd.DataFrame.to_csv can also write in append mode, so you can do something like:

df1.to_csv(filename, index=False)
df2.to_csv(filename, mode='a', header=False, index=False)
df3.to_csv(filename, mode='a', header=False, index=False)

del df1, df2, df3
df_concat = pd.read_csv(filename)
Adminicle answered 28/6, 2017 at 10:37 Comment(0)
F
5

Dask might be a good option to try for handling large DataFrames; go through the Dask docs.
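
For example, a minimal sketch (file names and partition counts are placeholders; dask.dataframe mirrors much of the pandas API and can work on data larger than RAM):

import dask.dataframe as dd

# read many CSV files lazily as one Dask DataFrame
ddf = dd.read_csv('foo*.csv')

# or wrap already-loaded pandas DataFrames and concatenate them lazily
# ddf = dd.concat([dd.from_pandas(df, npartitions=4) for df in [df1, df2, df3]])

# operations are lazy; .compute() materializes a pandas result,
# so only call it on something that fits in memory
summary = ddf.describe().compute()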

Fictile answered 3/7, 2017 at 9:58 Comment(0)
T
3

You can store your individual DataFrames in an HDF store, and then query the store just like one big DataFrame.

# name of the store file
fname = 'my_store'

# pd.HDFStore replaces the long-removed pd.get_store
with pd.HDFStore(fname) as store:

    # save individual dfs to the store
    for df in [df1, df2, df3, df_foo]:
        store.append('df', df, data_columns=['FOO', 'BAR', 'ETC'])  # data_columns = the columns you want to be able to query on

    # access the store as a single df
    df = store.select('df', where=['A>2'])  # change the where condition as required (see the documentation for examples)
    # Do other stuff with df #

# the with-block closes the store; remove the file when you're done with it
os.remove(fname)
Tarweed answered 1/7, 2017 at 16:35 Comment(0)
M
3

I've had similar performance issues while trying to concatenate a large number of DataFrames onto a 'growing' DataFrame.

My workaround was to append all the sub-DataFrames to a list, and then concatenate that list of DataFrames once processing of the sub-DataFrames was complete. This brought the runtime down to almost half.
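
A minimal sketch of that pattern (process_chunk and n_chunks are hypothetical stand-ins for whatever produces each sub-DataFrame):

frames = []
for i in range(n_chunks):
    frames.append(process_chunk(i))    # keep each sub-DataFrame in a list

# a single concat at the end instead of growing a DataFrame inside the loop
result = pd.concat(frames, ignore_index=True)
del frames                             # free the list once the big DataFrame exists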

Melesa answered 4/7, 2017 at 13:45 Comment(2)
This is the solution I'm looking for. I have 363 DataFrames with 100K rows each that I need to read in one at a time to concatenate into one large DataFrame for efficient processing. Currently I've coded JohnE's solution (read, concat to growing DataFrame, loop), and that starts out fast but gets slower with each concat. A code example for this solution would be nice, but I can figure it out. Thanks for letting me know it cut the time in half.Hen
Update: I switched from a DataFrame-growing concat-loop to a list-append-loop + final concat and my execution time went from 9 minutes down to 2.5 minutes.Hen
L
2

Another option:

1) Write df1 to a .csv file: df1.to_csv('Big File.csv')

2) Open .csv file, then append df2:

with open('Big File.csv','a') as f:
    df2.to_csv(f, header=False)

3) Repeat Step 2 with df3

with open('Big File.csv','a') as f:
    df3.to_csv(f, header=False)
Logan answered 29/6, 2017 at 19:58 Comment(0)
R
0

While writing to disk, df.to_csv throws an error for columns=False (use header=False instead).

The solution below works fine:

# write df1 to hard disk as file.csv
train1.to_csv('file.csv', index=False)
# append df2 to file.csv
train2.to_csv('file.csv', mode='a', header=False, index=False)
# read the appended csv as df
train = pd.read_csv('file.csv')
Reneta answered 16/11, 2020 at 12:39 Comment(0)
C
0

This function converts each column's dtype to the smallest one that can hold that column's values; in some cases this reduces memory usage by more than 10x. After that, append all of the reduced DataFrames to a list, run a single pd.concat to build one big DataFrame, and finally delete the list to free memory (see the sketch after the function).

import numpy as np
import pandas as pd

# Function to reduce memory usage
def reduce_size(df):
    # Identify non-numeric and numeric columns
    cat_cols = df.select_dtypes(exclude=[np.number, np.datetime64]).columns
    num_cols = df.select_dtypes(include=[np.number]).columns

    # Convert non-numericals to category
    df[cat_cols] = df[cat_cols].astype('category')

    # Convert numericals to the lowest dtype
    for col in num_cols:
        df[col] = pd.to_numeric(df[col], downcast='float')
        # Try to make them integer
        df[col] = pd.to_numeric(df[col], downcast='integer')
    return df
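
A minimal usage sketch of the workflow described above, assuming df1, df2 and df3 are the DataFrames you want to combine:

reduced = []
for df in [df1, df2, df3]:
    reduced.append(reduce_size(df))    # shrink each frame before concatenating

big_df = pd.concat(reduced, ignore_index=True)

# drop the intermediates so their memory can be reclaimed
del reduced, df1, df2, df3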
Canst answered 19/2 at 13:26 Comment(0)
