Split a large pandas dataframe
I have a large dataframe with 423,244 rows. I want to split it into 4 pieces. I tried the following code, which raised an error: ValueError: array split does not result in an equal division

for item in np.split(df, 4):
    print(item)

How can I split this dataframe into 4 groups?

Anne answered 26/6, 2013 at 9:1 Comment(1)
We want an np.split(df, N) function please.Pandybat
Use np.array_split:

Docstring:
Split an array into multiple sub-arrays.

Please refer to the ``split`` documentation.  The only difference
between these functions is that ``array_split`` allows
`indices_or_sections` to be an integer that does *not* equally
divide the axis.
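The docstring's distinction is exactly what the question ran into; a minimal check (sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(10)})

# np.split insists on an equal division and raises otherwise
try:
    np.split(df, 4)
    raised = False
except ValueError:
    raised = True

# np.array_split tolerates the remainder: chunk sizes differ by at most one
parts = np.array_split(df, 4)
```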
In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
   ...:                          'foo', 'bar', 'foo', 'foo'],
   ...:                    'B': ['one', 'one', 'two', 'three',
   ...:                          'two', 'two', 'one', 'three'],
   ...:                    'C': np.random.randn(8), 'D': np.random.randn(8)})

In [4]: print(df)
     A      B         C         D
0  foo    one -0.174067 -0.608579
1  bar    one -0.860386 -1.210518
2  foo    two  0.614102  1.689837
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468

In [5]: np.array_split(df, 3)
Out[5]: 
[     A    B         C         D
0  foo  one -0.174067 -0.608579
1  bar  one -0.860386 -1.210518
2  foo  two  0.614102  1.689837,
      A      B         C         D
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861,
      A      B         C         D
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468]
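Since array_split returns an ordinary Python list of DataFrames, applying a function to each group (as asked in the comments) is just a loop or a comprehension. A small sketch, with a per-chunk sum standing in for whatever function you need:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(10)})

chunks = np.array_split(df, 4)  # a plain list of DataFrames

# process the chunks one by one
totals = [chunk['x'].sum() for chunk in chunks]
```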
Chivalric answered 26/6, 2013 at 9:7 Comment(12)
Thanks very much! In addition, I want to apply some function to each group. How do I access the groups one by one?Anne
@NilaniAlgiriyage -- array_split returns a list of DataFrames, so you can just loop through the list...Chivalric
I'm splitting the dataframe since it is too large. I want to take the first group and apply the function, then the second group and apply the function, and so on. How do I access each group?Anne
@NilaniAlgiriyage -- I am not sure what do you mean. You can access each df the same way you would any list element... If you continue having problems, please ask a new question and specify the function that you are trying to use on the dfs...Chivalric
Thanks root, I have accepted your answer :). This is what I wanted: group = np.array_split(df, 3), then group[0].apply(func), etc.Anne
How do you not get an AttributeError as Dataframe has no 'size'.Precipitin
This answer is outdated: AttributeError: 'DataFrame' object has no attribute 'size'Magnificence
Answer works great! Very tight and concise (e.g. Pythonic) for looping over chunks of data frames when doing heavy computationsRetharethink
why does this work? i realize np.split works on arrays and dataframe is an np array under the hood but how come the function returns a list of dataframes instead of arrays.. seems a bit magical?Greyso
Any ideas on the limit of rows to use the Numpy array_split method? I have a dataframe with +6m rows and would like to split it in 20 or so chunks but his method just appears to be taking forever. ThanksCreepie
@Chivalric , can i split df by ratio? for example, 6:1Histogenesis
want to split a big dataframe in to a smaller dataframes of 50 records each , how to achieve this ?Annora
I wanted to do the same. I first had problems with the split function, then problems installing pandas 0.15.2, so I went back to my old version and wrote a little function that works very well. I hope this helps!

import math

# input - df: a DataFrame; chunk_size: the number of rows per chunk
# output - a list of DataFrames
# purpose - splits the DataFrame into smaller chunks
def split_dataframe(df, chunk_size=10000):
    chunks = list()
    # ceil avoids an empty trailing chunk when len(df) divides evenly
    num_chunks = math.ceil(len(df) / chunk_size)
    for i in range(num_chunks):
        chunks.append(df[i * chunk_size:(i + 1) * chunk_size])
    return chunks
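A quick self-contained check of the helper (restated here so the snippet runs on its own, with math.ceil for the chunk count as the comments suggest, so no empty trailing chunk appears when the length divides evenly):

```python
import math

import pandas as pd

def split_dataframe(df, chunk_size=10000):
    # ceil gives the exact chunk count; no empty trailing chunk
    num_chunks = math.ceil(len(df) / chunk_size)
    return [df[i * chunk_size:(i + 1) * chunk_size] for i in range(num_chunks)]

df = pd.DataFrame({'x': range(25)})
sizes = [len(c) for c in split_dataframe(df, chunk_size=10)]
```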
Plate answered 5/3, 2015 at 15:49 Comment(4)
much faster than using np.array_split()Latvia
The correct way to calculate numberChunks import math numberChunks = math.ceil(len(df) / chunkSize)Convey
@SergeyLeyko is correct. Otherwise you get an empty dataframe at the end of the chunks list when the df size is evenly divisible by the chunk size. Here's an alternative for num_chunks: num_chunks = len(df) // chunk_size + (1 if len(df) % chunk_size else 0)Hersch
If df.index is not consecutive, .iloc has to be used, like in this answer.Roadblock
Be aware that np.array_split(df, 3) splits the dataframe into 3 sub-dataframes, while the split_dataframe function defined in @elixir's answer, when called as split_dataframe(df, chunk_size=3), splits the dataframe every chunk_size rows.

Example:

With np.array_split:

df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10,11], columns=['TEST'])
df_split = np.array_split(df, 3)

...you get 3 sub-dataframes:

df_split[0] # 1, 2, 3, 4
df_split[1] # 5, 6, 7, 8
df_split[2] # 9, 10, 11

With split_dataframe:

df_split2 = split_dataframe(df, chunk_size=3)

...you get 4 sub-dataframes:

df_split2[0] # 1, 2, 3
df_split2[1] # 4, 5, 6
df_split2[2] # 7, 8, 9
df_split2[3] # 10, 11

Hope I'm right, and that this is useful.
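If the split should be random rather than in row order (as asked in the comments), one sketch is to shuffle the frame with DataFrame.sample(frac=1) before splitting; random_state is only there to make the example reproducible:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'TEST': range(1, 12)})

shuffled = df.sample(frac=1, random_state=0)  # shuffle all rows
df_split = np.array_split(shuffled, 3)        # then split as before
```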

Lundin answered 12/7, 2017 at 10:6 Comment(5)
Is there an easy way to make this process random? I can only think of adding a random column, splitting, and removing the random column, but there might be an easier wayEccrinology
do they have to be equal chunk sizes?Chloral
Thank you for your clarification. Take my upvote!Westonwestover
Thanks @Gilbertor! To which library does the module "split_dataframe" belong? I can't import it neither find a library which it belongs to...Aerometry
@Aerometry It's defined in the answer by elixir here: https://mcmap.net/q/134276/-split-a-large-pandas-dataframeHaihaida
I guess now we can use plain iloc with range for this.

chunk_size = int(df.shape[0] / 4)
for start in range(0, df.shape[0], chunk_size):
    df_subset = df.iloc[start:start + chunk_size]
    process_data(df_subset)
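A self-contained sketch of the same idea; max(..., 1) guards against a zero step on frames shorter than four rows, and collecting the pieces into a list stands in for the (hypothetical) process_data call:

```python
import pandas as pd

df = pd.DataFrame({'x': range(11)})

chunk_size = max(df.shape[0] // 4, 1)  # avoid a zero step for tiny frames
subsets = []
for start in range(0, df.shape[0], chunk_size):
    subsets.append(df.iloc[start:start + chunk_size])
```

Note that because chunk_size is rounded down, you can get more than four pieces (six here, since 11 // 4 == 2); np.array_split is the simpler choice when an exact chunk count matters.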
Graceless answered 21/3, 2018 at 20:22 Comment(1)
Also consider this related answer which does the same using the more_itertools package.Vocable
You can use a list comprehension to do this in a single line:

n = 4  # rows per chunk
chunks = [df[i:i + n] for i in range(0, df.shape[0], n)]
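For instance, on a 10-row frame (note that n is the chunk size in rows, not the number of chunks):

```python
import pandas as pd

df = pd.DataFrame({'x': range(10)})

n = 4  # rows per chunk
chunks = [df[i:i + n] for i in range(0, df.shape[0], n)]
sizes = [len(c) for c in chunks]
```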
Mastoid answered 12/10, 2020 at 14:37 Comment(1)
But also see the important note about COPY vs VIEW in my answer below: https://mcmap.net/q/134276/-split-a-large-pandas-dataframeDharma
Caution:

np.array_split doesn't work with numpy 1.9.0; I checked that it does work with 1.8.1.

Error:

Dataframe has no 'size' attribute

Bonaventura answered 17/11, 2014 at 16:15 Comment(1)
i filed a bug in the pandas github: github.com/pydata/pandas/issues/8846 seems it is already fixed for pandas 0.15.2Bonaventura
Building on @elixir's answer, I'd suggest using a generator to avoid building the full list of chunks in memory:

def chunkit(df, chunk_size=10000):
    num_chunks = len(df) // chunk_size
    if len(df) % chunk_size != 0 or len(df) == 0:
        num_chunks += 1
    for i in range(num_chunks):
        yield df[i * chunk_size:(i + 1) * chunk_size]
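A quick check of the generator (restated so the snippet runs on its own); each chunk is produced lazily, one per loop iteration:

```python
import pandas as pd

def chunkit(df, chunk_size=10000):
    num_chunks = len(df) // chunk_size
    if len(df) % chunk_size != 0 or len(df) == 0:
        num_chunks += 1
    for i in range(num_chunks):
        yield df[i * chunk_size:(i + 1) * chunk_size]

df = pd.DataFrame({'x': range(12)})
sizes = [len(chunk) for chunk in chunkit(df, chunk_size=5)]
```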
Rode answered 18/8, 2021 at 9:24 Comment(0)
I like one-liners, so @LucyDrops' answer works for me.

However, there is one important thing: add a .copy() if the chunks should be copies of the original df parts:

chunks = [df[i:i + n].copy() for i in range(0, df.shape[0], n)]

Otherwise there is a high chance of getting the following warning during further processing of the chunks (in a loop, for example):

A value is trying to be set on a copy of a slice from a DataFrame.

(see the details in the Pandas documentation)
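A minimal sketch of why the .copy() matters: writing into a copied chunk leaves the original frame untouched (and avoids the warning):

```python
import pandas as pd

df = pd.DataFrame({'x': range(8)})

n = 4
chunks = [df[i:i + n].copy() for i in range(0, df.shape[0], n)]

# writing into a copied chunk does not modify the original df
chunks[0]['x'] = -1
```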

Dharma answered 1/10, 2021 at 17:35 Comment(3)
From the link you posted "it’s very hard to predict whether it will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees)"Impression
@JoeyBaruch, the answer has been updated. ThanksDharma
In pandas, whenever I've been warned about setting a value on a copy, the problem was always that a copy was NOT being used, and the solution was always to add a .copy(). It would help if the warning were clearer and included the offending line numberSwingletree
You can use groupby, assuming you have an integer enumerated index:

import math
df = pd.DataFrame(dict(sample=np.arange(99)))
rows_per_subframe = math.ceil(len(df) / 4.)

subframes = [i[1] for i in df.groupby(np.arange(len(df))//rows_per_subframe)]

Note: iterating over a groupby yields (key, group) tuples, in which the 2nd element is the dataframe, hence the slightly complicated extraction.

>>> len(subframes), [len(i) for i in subframes]
(4, [25, 25, 25, 24])
Peppery answered 21/9, 2017 at 20:40 Comment(0)
I also ran into np.array_split not working with a pandas DataFrame. My solution was to split only the index of the DataFrame and then introduce a new column with the "group" label:

indexes = np.array_split(df.index, N, axis=0)
for i, index in enumerate(indexes):
    df.loc[index, 'group'] = i

This makes groupby operations very convenient, such as for calculating the mean value of each group:

df.groupby(by='group').mean()
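Sketched on a small frame with a string index (N = 3 and the values are illustrative):

```python
import numpy as np
import pandas as pd

N = 3
df = pd.DataFrame({'val': range(9)}, index=list('abcdefghi'))

# split only the index, then label each row with its chunk number
indexes = np.array_split(df.index, N, axis=0)
for i, index in enumerate(indexes):
    df.loc[index, 'group'] = i

# per-group aggregation becomes a plain groupby
means = df.groupby(by='group')['val'].mean()
```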
Kreg answered 20/3, 2020 at 8:46 Comment(1)
Anyone using .loc over .iloc should definitely remain aware of the difference between the two.Vocable
