I have made a local Dash app to allow students to efficiently find and study their (measurement) data. To allow for further development, I tried to transition from pandas & DuckDB to polars. After weeks of work integrating it into this extensive app, I realized I had run into a major problem.
The app was stable before, but now with polars, the RAM footprint (of the pythonw.exe process) balloons with each successive callback. The app starts out around 100 MB, and each callback adds something like 5 MB. It doesn't seem to stabilize; at 1500 MB it was still growing.
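To check whether the growth even shows up on the Python heap, I put together this standalone tracemalloc sketch (not wired into the app; 'measure' and the dummy workload are just illustrations). tracemalloc only sees Python-level allocations, so if it stays flat while the process RSS keeps growing, the memory is presumably being held on polars' native (Rust) side:

```python
import tracemalloc

def measure(fn, repeats=5):
    """Return the growth of Python-heap usage (bytes) across repeated calls."""
    tracemalloc.start()
    fn()  # warm-up call, so one-time allocations don't count as growth
    baseline, _ = tracemalloc.get_traced_memory()
    for _ in range(repeats):
        fn()
    current, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current - baseline

# Dummy workload standing in for one callback's data retrieval
growth = measure(lambda: [float(i) for i in range(10_000)])
print(growth)
```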
I'm sort of stuck and would really appreciate some pointers on how to resolve this.
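One mitigation I still intend to test (a minimal standalone sketch, not my actual callback; 'load' here just stands in for the data retrieval) is dropping the DataFrame explicitly and forcing a GC pass at the end of each callback, to rule out lingering Python references or reference cycles:

```python
import gc

def plot_data(load):
    """Extract the plot columns, then drop the source object and force a GC pass."""
    df = load()
    xs, ys = list(df['col0']), list(df['col1'])
    del df        # drop the only Python reference to the loaded data
    gc.collect()  # collect any reference cycles immediately
    return xs, ys

# Dummy loader standing in for pl_data()/pd_data()
xs, ys = plot_data(lambda: {'col0': [1.0, 2.0], 'col1': [3.0, 4.0]})
print(xs, ys)
```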
I made a minimal example to illustrate the issue. If I run it with 'polars_check = True', I start at 98 MB and after 100 iterations it has grown to 261 MB. If I run it with 'polars_check = False' (i.e. pandas), I start and end at 98 MB.
import pathlib, os, shutil
import polars as pl, pandas as pd, numpy as np, datetime as dt
from dash import Dash, dcc, html, Input, Output
import plotly.graph_objects as go
#Check-input
polars_check = True ### Whether the example returns with polars or with pandas.
if polars_check: # To accommodate the slower data retrieval with pandas.
    interval_time = 3E3
else:
    interval_time = 3E3
#Constants
folder = pathlib.Path(r'C:\PerovskiteCell example')
n_files = 100 #Number of files in folder
n_lines = 500000 #Number of total lines in folder
n_cols = 25
#Generating sample data in example folder (Only once).
if not folder.exists():
    size = int(n_lines / n_files)
    col = np.linspace(-1E3, 1E3, num=size)
    df = pl.DataFrame({f'col{n}': col for n in range(n_cols)})

    # Creating folder & files
    os.makedirs(folder)
    f_path0 = folder.joinpath('0.csv')
    df.write_csv(f_path0)
    for n in range(1, n_files):
        shutil.copy2(f_path0, folder.joinpath(f'{n}.csv'))
#Functions
def pl_data():
    """Retrieves data via the polars route"""
    lf = (pl.scan_csv(folder.joinpath(f'{n}.csv'),
                      schema={f'col{n}': pl.Float64 for n in range(n_cols)})
            .select(pl.all().get(n)) for n in range(n_files))
    lf = pl.concat(lf)
    lf = lf.select('col0', 'col1')
    return lf.collect()
def pd_data():
    """Retrieves data via the pandas route"""
    dfs = (pd.read_csv(folder.joinpath(f'{n}.csv'), usecols=['col0', 'col1']).iloc[n:n+1]
           for n in range(n_files))
    return pd.concat(dfs, ignore_index=True)
#App (initialization)
app = Dash()
app.layout = html.Div([dcc.Graph(id='graph'),
                       dcc.Interval(id='check',
                                    interval=interval_time,
                                    max_intervals=100)])
@app.callback(
    Output('graph', 'figure'),
    Input('check', 'n_intervals'))
def plot(_):
    # Data retrieval
    if polars_check:
        df = pl_data()
    else:
        df = pd_data()

    # Plotting
    fig = go.Figure()
    trace = go.Scattergl(x=list(df['col0']), y=list(df['col1']), mode='lines+markers')
    fig.add_trace(trace)
    fig.update_xaxes(title=str(dt.datetime.now()))
    return fig
if __name__ == '__main__':
    app.run(debug=False, port=8050)
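For completeness, here is a stdlib-only loader I could swap in for pl_data()/pd_data() to confirm the growth really comes from the DataFrame library and not from Dash or Plotly (this is a self-contained sketch that builds its own tiny files in a temp folder; the name 'stdlib_data' is just for illustration):

```python
import csv, pathlib, tempfile

def stdlib_data(folder, n_files):
    """Read row n of file n (col0/col1 only) using only the csv module."""
    rows = []
    for n in range(n_files):
        with open(folder / f'{n}.csv', newline='') as f:
            for i, row in enumerate(csv.DictReader(f)):
                if i == n:
                    rows.append((float(row['col0']), float(row['col1'])))
                    break
    return rows

# Self-contained demo on generated files
with tempfile.TemporaryDirectory() as d:
    tmp = pathlib.Path(d)
    for n in range(3):
        with open(tmp / f'{n}.csv', 'w', newline='') as f:
            w = csv.writer(f)
            w.writerow(['col0', 'col1'])
            for i in range(5):
                w.writerow([i, i * 2])
    data = stdlib_data(tmp, 3)
print(data)
```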