Dash & polars; RAM-use keeps increasing

G

2

10

I have made a local dash-app to allow students to efficiently find/study their (measurement) data. To allow for further development, I tried to make a transition from pandas & duckdb to polars. After weeks of work to integrate it into this extensive app, I realized that I have run into a major problem.

The app was stable before, but now with polars, the RAM-footprint (of the pythonw.exe process) balloons with each successive callback. While the app starts out around 100 MB; each callback adds something like 5MB. I doesn’t seem to stabilize; at 1500 MB it was still growing.

I’m sort of stuck and would really appreciate some pointers how to resolve this.

I made a minimum example to illustrate the issue. If I run it with “polars_check=True”, then I start with 98MB and after 100 iterations it has become 261 MB. If I do it with “polars_check”=False (i.e. pandas), then I start and end with 98MB.

import pathlib, os, shutil
import polars as pl, pandas as pd, numpy as np, datetime as dt

from dash import Dash, dcc, html, Input, Output
import plotly.graph_objects as go


#Check-input
polars_check = True ### Whether the example returns with polars or with pandas.

if polars_check: #To accomdate the slower data retrieval with pandas.
    interval_time = 3E3
else:
    interval_time = 3E3

#Constants
folder = pathlib.Path(r'C:\PerovskiteCell example')

n_files = 100 #Number of files in folder
n_lines = 500000 #Number of total lines in folder
n_cols = 25


#Generating sample data in example folder (Only once).
if not folder.exists():

    size = int(n_lines / n_files)
    col = np.linspace(-1E3, 1E3, num=size)

    df = pl.DataFrame({f'col{n}': col for n in range(n_cols)})

    # Creating folder & files
    os.makedirs(folder)

    f_path0 = folder.joinpath('0.csv')
    df.write_csv(f_path0)

    for n in range(1, n_files):
        shutil.copy2(f_path0, folder.joinpath(f'{n}.csv'))


#Functions
def pl_data():
    """Retrieves data via the polars route"""
    
    lf = (pl.scan_csv(folder.joinpath(f'{n}.csv'),
                                schema={f'col{n}': pl.Float64 for n in range(n_cols)})
                            
                            .select(pl.all().get(n)) for n in range(n_files))
    
    lf = pl.concat(lf)
    lf = lf.select('col0', 'col1')

    return lf.collect()


def pd_data():
    """Retrieves data via the pandas route"""

    dfs = (pd.read_csv(folder.joinpath(f'{n}.csv'), usecols=['col0', 'col1']).iloc[n:n+1]
                                         for n in range(n_files))
    
    return pd.concat(dfs, ignore_index=True)



#App (initialization)
app = Dash()
app.layout = html.Div([dcc.Graph(id='graph'),
                        dcc.Interval(id = 'check', 
                                        interval = interval_time,
                                        max_intervals = 100)])


@app.callback(
    Output('graph', 'figure'),
    Input('check', 'n_intervals'))

def plot(_):

    #Data retrieval
    if polars_check:
        df = pl_data()
    else:
        df = pd_data()

    #Plotting
    fig = go.Figure()
    trace = go.Scattergl(x = list(df['col0']), y=list(df['col1']), mode='lines+markers')

    fig.add_trace(trace)
    fig.update_xaxes(title = str(dt.datetime.now()))

    return fig


if __name__ == '__main__':
    app.run(debug=False, port = 8050)

Gabriellegabrielli answered 5/8, 2024 at 11:50 Comment(4)

Dash uses flask under the hood, maybe this can help #49991734 – Rossy 5/8, 2024 at 12:51

Have you tried using read_csv instead of scan_csv? – Schooner 5/8, 2024 at 13:25

@TimsibAdnap Yes. The memory-use still grows but slower. After 100 intervals, it is 163MB (instead of 261MB using scan_csv). – Gabriellegabrielli 5/8, 2024 at 13:49

@AvitanD : Sorry for the tag, I think your experience may help here? – Rivet 16/8, 2024 at 14:8

G

2

I found the answer at the polars-github. Just add os.environ['MIMALLOC_ABANDONED_PAGE_RESET'] = '1' before importing polars.

Gabriellegabrielli answered 19/8, 2024 at 8:1 Comment(0)

D

0

You can try to use python package of garbage collector to empty the unused ram. Hope it helps.

gc.enable()
gc.collect(generation=2)

Derosier answered 13/8, 2024 at 9:17 Comment(0)

Recommended topics

Hot tags