Pandas to excel using XLSX Writer - Constant Memory/In Memory
Asked Answered
M

3

5

I am using pandas and xlsxwriter to create excel reports in AWS lambda. I have been having an issue where the disk space fills up when a reasonably large excel file in the region of 60MB is created, but there is 512MB of disk space available.

I have been trying to work out what and why this is happening and I came across a promising looking property I could pass, constant_memory, so I tried this out using the following code:

with pd.ExcelWriter(output, options={"constant_memory": True}) as writer:

It seems this actually ensures my lambda runs to completion consistently and if I remove it then it no longer does, but the issue is I get the following warning:

FutureWarning: Use of **kwargs is deprecated, use engine_kwargs instead.
    with pd.ExcelWriter(output, options={"constant_memory": True}) as writer:

I then read a post on SO, which I think is by the author of xlsxwriter suggesting this property doesn't actually work when using pandas. This is sort of evident in that I only get the first column in my excel report and the last row of 360000 rows. I am confused as to how passing this is ensuring my lambda completes0? And why is it that if I do not pass it then my lambda runs out of disk space?

The second strange thing is the documentation for pandas doesn't mention any argument called options at all, and if I try and pass constant_memory using engine_kwargs:

with pd.ExcelWriter(output, engine_kwargs={'constant_memory': True}) as writer:

I get the following error:

TypeError: __init__() got an unexpected keyword argument 'constant_memory'

I hope someone who is more familiar with xlsxwriter can help me out in understanding these few points.

The documentation for xlsxwriter also mentions a property in_memory, which seems like it could resolve my issue but I am unable to pass this to pd.ExcelWriter:

TypeError: __init__() got an unexpected keyword argument 'in_memory'

**Edit: Passing in_memory does actually solve the issues I have but I get a deprecation warning passing it in using the kwargs options.

with pd.ExcelWriter(output, options={"in_memory": True}) as writer:

FutureWarning: Use of **kwargs is deprecated, use engine_kwargs instead.
    with pd.ExcelWriter(output, options={"in_memory": True}) as writer:

How can I pass this property to the xlsxwriter using pandas without getting the deprecation warning as it seems to resolve my issues?

Metapsychology answered 23/7, 2021 at 21:11 Comment(0)
P
4

The correct syntax to pass xlsxwriter options to Pandas when you get that deprecation notice is:

writer = pd.ExcelWriter('pandas_example.xlsx',
                        engine='xlsxwriter',
                        engine_kwargs={'options': {'strings_to_numbers': True}})

See this section of the XlsxWriter docs.

However, as you pointed our in your question, and from your observation, the constant_memory option won't work with Pandas since it requires data to be written in a row by row order but Pandas uses a column by column order.

But also, just to be clear, the constant_memory option will only reduce the "memory" used by your application. It won't make the size of the file smaller (in fact it generally makes the file bigger if it contains a lot of string data). It may only by "solving" your issue because it is only writing one column of data to the file, and thus the file is much smaller than intended.

Pennywise answered 18/8, 2021 at 10:49 Comment(3)
I actually found that constant_memory didnt solve my issue but in_memory does, thanks for the pointer towards the incorrect syntax, not sure how I missed that.Metapsychology
You probably missed it because I only updated the docs last week, after I started to see several questions in relation to that deprecation notice. :-)Pennywise
Also, if in_memory solved your problem then it was probably because the temp files that xlsxwriter uses was taking some of your disk quota that you couldn't see.Pennywise
D
4

I'm not an expert, anyway, have you tried

pd.ExcelWriter(output, engine_kwargs={'options': {"constant_memory": True}})

?

Drusilladrusus answered 18/8, 2021 at 9:53 Comment(0)
P
4

The correct syntax to pass xlsxwriter options to Pandas when you get that deprecation notice is:

writer = pd.ExcelWriter('pandas_example.xlsx',
                        engine='xlsxwriter',
                        engine_kwargs={'options': {'strings_to_numbers': True}})

See this section of the XlsxWriter docs.

However, as you pointed our in your question, and from your observation, the constant_memory option won't work with Pandas since it requires data to be written in a row by row order but Pandas uses a column by column order.

But also, just to be clear, the constant_memory option will only reduce the "memory" used by your application. It won't make the size of the file smaller (in fact it generally makes the file bigger if it contains a lot of string data). It may only by "solving" your issue because it is only writing one column of data to the file, and thus the file is much smaller than intended.

Pennywise answered 18/8, 2021 at 10:49 Comment(3)
I actually found that constant_memory didnt solve my issue but in_memory does, thanks for the pointer towards the incorrect syntax, not sure how I missed that.Metapsychology
You probably missed it because I only updated the docs last week, after I started to see several questions in relation to that deprecation notice. :-)Pennywise
Also, if in_memory solved your problem then it was probably because the temp files that xlsxwriter uses was taking some of your disk quota that you couldn't see.Pennywise
R
0

I'm not an expert, anyway, have you tried

using xlsxwriter directly instead of pandas

import xlsxwriter

workbook = xlsxwriter.Workbook('output.xlsx', {'strings_to_numbers': True, 'constant_memory': True, 'encoding': 'utf-8'})
worksheet = workbook.add_worksheet()

worksheet.write('A1', 'hello world')

workbook.close()
Rear answered 18/9, 2023 at 7:14 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.