If you are enlarging a dataframe in a loop using DataFrame.append
or concat
or loc
, consider rewriting your code to enlarge a Python list and construct a dataframe once. Sometimes, you may not even need pd.concat
, you may just need a DataFrame constructor on a list of dicts.
A pretty common example of appending new rows to a dataframe is scraping data from a webpage and storing them a dataframe. In that case, instead of appending to a dataframe, literally just replace dataframe with a list and call pd.DataFrame()
or pd.concat
once at the end once. An example:
So instead of:
df = pd.DataFrame() # <--- initial dataframe (doesn't have to be empty)
for url in ticker_list:
data = pd.read_csv(url)
df = df.append(data, ignore_index=True) # <--- enlarge dataframe
use:
lst = [] # <--- initial list (doesn't have to be empty;
for url in ticker_list: # could store the initial df)
data = pd.read_csv(url)
lst.append(data) # <--- enlarge list
df = pd.concat(lst) # <--- concatenate the frames
Data reading logic could be response data from an API, data scraped from a webpage, whatever, the code refactoring is really minimal. In the above example, we assumed that lst
is a list of dataframes but if it were a list of dicts/lists etc. then we could use df = pd.DataFrame(lst)
instead in the last line of code.
That said, if a single row is to be appended to a dataframe, loc
could also do the job.
df.loc[len(df)] = new_row
With the loc
call, the dataframe is enlarged with index label len(df)
, which makes sense only if the index is RangeIndex
; RangeIndex
is created by default if an explicit index is not passed to the dataframe constructor.
A working example:
df = pd.DataFrame({'A': range(3), 'B': list('abc')})
df.loc[len(df)] = [4, 'd']
df.loc[len(df)] = {'A': 5, 'B': 'e'}
df.loc[len(df)] = pd.Series({'A': 6, 'B': 'f'})
As pointed out by @mozway, enlarging a pandas dataframe has O(n^2) complexity because in each iteration, the entire dataframe has to be read and copied. The following perfplot shows the runtime difference relative to concatenation done once.1 As you can see, both ways to enlarge a dataframe are much, much slower than enlarging a list and constructing a dataframe once (e.g. for a dataframe with 10k rows, concat
in a loop is about 800 times slower and loc
in a loop is about 1600 times slower).
1 The code used to produce the perfplot:
import pandas as pd
import perfplot
def concat_loop(lst):
df = pd.DataFrame(columns=['A', 'B'])
for dic in lst:
df = pd.concat([df, pd.DataFrame([dic])], ignore_index=True)
return df.infer_objects()
def concat_once(lst):
df = pd.DataFrame(columns=['A', 'B'])
df = pd.concat([df, pd.DataFrame(lst)], ignore_index=True)
return df.infer_objects()
def loc_loop(lst):
df = pd.DataFrame(columns=['A', 'B'])
for dic in lst:
df.loc[len(df)] = dic
return df
perfplot.plot(
setup=lambda n: [{'A': i, 'B': 'a'*(i%5+1)} for i in range(n)],
kernels=[concat_loop, concat_once, loc_loop],
labels= ['concat in a loop', 'concat once', 'loc in a loop'],
n_range=[2**k for k in range(16)],
xlabel='Length of dataframe',
title='Enlarging a dataframe in a loop',
relative_to=1,
equality_check=pd.DataFrame.equals);
new_row
for clarity. You might need to usepd.DataFrame([new_row])
orpd.DataFrame(new_row)
depending on the format. – Jabalpur