How do I debug OverflowError: value too large to convert to int32_t?

What I am trying to do

I am using PyArrow to read some CSVs and convert them to Parquet. Some of the files I read have many columns and a high memory footprint (enough to crash the machine running the job), so I am reading through the files in chunks.

This is what the function I use to generate Arrow tables looks like (snippet for brevity):

from typing import Generator

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import csv as arrow_csv

def generate_arrow_tables(
        input_buffer: pa.lib.Buffer,
        arrow_schema: pa.Schema,
        batch_size: int
) -> Generator[pa.Table, None, None]:
    """
    Generates an Arrow Table from given data.
    :param batch_size: Size of batch streamed from CSV at a time
    :param input_buffer: Takes in an Arrow BufferOutputStream
    :param arrow_schema: Takes in an Arrow Schema
    :return: Yields an Arrow Table per streamed batch
    """

    # Preparing convert options
    co = arrow_csv.ConvertOptions(column_types=arrow_schema, strings_can_be_null=True)

    # Preparing read options
    ro = arrow_csv.ReadOptions(block_size=batch_size)
    # Streaming contents of CSV into batches
    with arrow_csv.open_csv(input_buffer, convert_options=co, read_options=ro) as stream_reader:
        for chunk in stream_reader:
            if chunk is None:
                break

            # Emit batches from generator. Arrow schema is inferred unless explicitly specified
            yield pa.Table.from_batches(batches=[chunk], schema=arrow_schema)

And this is how I use the function to write the batches to S3 (snippet for brevity):

GB = 1024 ** 3
# data.size here is the size of the buffer
arrow_tables: Generator[Table, None, None] = generate_arrow_tables(pg_data, arrow_schema, min(data.size, GB ** 10))
# Iterate through generated tables and write to S3
count = 0
for table in arrow_tables:
    count += 1  # Count based on batch size

    # Write keys to S3
    file_name = f'{ARGS.run_id}-{count}.parquet'
    write_to_s3(table, output_path=f"s3://{bucket}/{bucket_prefix}/{file_name}")

What is going wrong

I am getting the following error: OverflowError: value too large to convert to int32_t. Here is the stack trace (snippet for brevity):

[2021-08-04 11:26:45,479] {pod_launcher.py:156} INFO - b'    ro = arrow_csv.ReadOptions(block_size=batch_size)\n'
[2021-08-04 11:26:45,479] {pod_launcher.py:156} INFO - b'  File "pyarrow/_csv.pyx", line 87, in pyarrow._csv.ReadOptions.__init__\n'
[2021-08-04 11:26:45,479] {pod_launcher.py:156} INFO - b'  File "pyarrow/_csv.pyx", line 119, in pyarrow._csv.ReadOptions.block_size.__set__\n'
[2021-08-04 11:26:45,479] {pod_launcher.py:156} INFO - b'OverflowError: value too large to convert to int32_t\n'
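
For what it's worth, the failure can be isolated to the ReadOptions call itself. A hypothetical oversized block_size (not my actual data.size) reproduces the same error:

from pyarrow import csv as arrow_csv

GB = 1024 ** 3

# Any block_size beyond the 32-bit integer range fails the same way
ro = arrow_csv.ReadOptions(block_size=10 * GB)
# OverflowError: value too large to convert to int32_t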

How can I debug this issue and/or fix it?

I am happy to provide additional information if needed

Halliehallman answered 4/8, 2021 at 13:29 Comment(1)
GB ** 10 is a very big number, I think you meant GB * 10 – Voe

If I understand correctly, the third argument to generate_arrow_tables is batch_size, which you are passing in as the block_size to the CSV reader. I'm not sure what data.size's value is, but you are guarding it with min(data.size, GB ** 10).

A block_size of 10 GB will not work. The error you are receiving means the block size does not fit in a signed 32-bit integer (max 2,147,483,647 bytes, roughly 2 GiB).

Besides that limit, I'm not sure it is a good idea to use a block size much bigger than the default (1 MB). I wouldn't expect you'd see much performance benefit, and you will end up using far more RAM than you need to.
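
As a rough sketch (data_size below is a hypothetical stand-in for whatever your data.size is), clamping the value before building ReadOptions keeps it in a sane range:

from pyarrow import csv as arrow_csv

MB = 1024 ** 2

# Hypothetical input size; substitute your actual data.size
data_size = 5 * 1024 ** 3

# Cap the block size well below the signed 32-bit limit (~2 GiB);
# a few tens of MB per block is plenty for streaming CSV reads.
ro = arrow_csv.ReadOptions(block_size=min(data_size, 64 * MB))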

Balneal answered 4/8, 2021 at 17:53 Comment(0)
