Amazon Aurora PostgreSQL Serverless: Database returned more than the allowed response size limit

We are exploring Aurora Serverless as the database storage for one of our projects. While doing a POC to fetch events from an Aurora Serverless cluster, we are receiving a "Database returned more than the allowed response size limit" exception.

  1. Is there a built-in solution for this? Does Aurora Serverless offer a token-based pagination mechanism similar to DynamoDB's?
  2. Fetching 50,000 records quickly is one of our key requirements. Should we use Aurora or Aurora Serverless?

More details about the database: Aurora PostgreSQL Serverless.

For the POC we are using the Data API, which has these limitations.

import boto3

client = boto3.client('rds-data')

sql_statement = "select * from table_name limit 1000"

response = client.execute_statement(
    database='mydb',
    secretArn='<secret_arn>',
    resourceArn='<resource_arn>',
    sql=sql_statement
)

print(response)
Luralurch answered 13/1, 2020 at 7:31 Comment(0)
T
15

When you do a query in any Aurora Serverless instance through the Data API, you have two size limitations:

  1. Each returned row cannot be greater than 64 KB; (source)
  2. The result set cannot be greater than 1 MB. (source)

Currently, there is no built-in way in the Data API to work around these limits. The Amazon Aurora User Guide recommends a solution for the second problem (p. 164):

In this case, the size of the result set returned by the database was too large. The Data API limit is 1 MB in the result set returned by the database. To solve this issue, make sure that calls to the Data API return 1 MB of data or less. If you need to return more than 1 MB, you can use multiple ExecuteStatement calls with the LIMIT clause in your query.

Considering that, you could apply an exponential backoff, reducing the LIMIT until you find a value that keeps the result set under 1 MB, or set a fixed LIMIT that you are confident will always stay below 1 MB, even if your row sizes increase in the future.
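
A rough sketch of that backoff idea, assuming the size error surfaces as the rds-data client's BadRequestException and using a placeholder events table:

    import boto3

    client = boto3.client('rds-data')

    def fetch_with_shrinking_limit(limit=50000, floor=100):
        # Halve the LIMIT until the response fits under the Data API's 1 MB cap.
        # 'events' is a placeholder table name.
        while limit >= floor:
            try:
                response = client.execute_statement(
                    database='mydb',
                    secretArn='<secret_arn>',
                    resourceArn='<resource_arn>',
                    sql="select * from events limit {}".format(limit)
                )
                return response, limit
            except client.exceptions.BadRequestException as err:
                # Retry only on the response-size error; re-raise anything else
                if 'response size limit' not in str(err):
                    raise
                limit //= 2
        raise RuntimeError("could not find a LIMIT that fits in 1 MB")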

After defining how to set your LIMIT clause value (depending on whether your cluster runs MySQL 5.6, MySQL 5.7, or PostgreSQL 10.7), you could run a COUNT query to find out how many rows you will get, and then execute ceil(COUNT / LIMIT) statements. Another option is to keep iterating until a statement returns fewer rows than your LIMIT.
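
And a minimal sketch of the "iterate until a short page" option on PostgreSQL, again with placeholder table and column names:

    import boto3

    client = boto3.client('rds-data')

    def fetch_all_rows(limit=1000):
        # Page through the table with LIMIT/OFFSET; a short page signals the end.
        # 'events' and 'id' are placeholder table/column names; keep LIMIT small
        # enough that each page stays under 1 MB.
        rows, offset = [], 0
        while True:
            page = client.execute_statement(
                database='mydb',
                secretArn='<secret_arn>',
                resourceArn='<resource_arn>',
                sql="select * from events order by id limit {} offset {}".format(limit, offset)
            )['records']
            rows.extend(page)
            if len(page) < limit:
                return rows
            offset += limit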

Testicle answered 15/7, 2020 at 17:59 Comment(2)
Source for the row and result set limits: docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/… – Tevere
why such a low 1MB limitation here? – Conquistador

For some reason, pandas read_sql's "chunksize" argument doesn't play well with the AWS Data API. I am using MySQL, by the way, but pydataapi should support both MySQL and PostgreSQL.

The solution I have found for this issue is to query the database iteratively using pandas' read_sql with string-formatted LIMIT clauses. This uses the pydataapi Python package.

I first got the number of rows in the table using:

df_status = pd.read_sql('show table status like "yourtable";', con = sql_engine)

Then I use numpy.arange() to determine the starting row for each request. Each query asks for the number of rows set by the 'chunksize' parameter. I've chosen 20,000, but you should choose the highest value that doesn't lead to a response larger than 1 MB.

Then it's just a matter of concatenating the list of DataFrames. This is the best solution I've found so far.

    import pandas as pd
    import numpy as np
    import boto3
    from sqlalchemy import create_engine

    cluster_arn = 'yourcluster'
    secret_arn = 'yoursecret'
    database = 'mydb'
    chunksize = 20000  # rows per request; keep this low enough to stay under the 1 MB response limit

    # SQLAlchemy engine backed by the Data API through the pydataapi dialect
    rdsData = boto3.client('rds-data', region_name='yourregion')
    sql_engine = create_engine('mysql+pydataapi://',
                               connect_args={
                                   'resource_arn': cluster_arn,
                                   'secret_arn': secret_arn,
                                   'database': database,
                                   'client': rdsData}).connect()

    # Row count from SHOW TABLE STATUS (approximate for InnoDB tables)
    df_status = pd.read_sql("show table status like 'yourtable';", con=sql_engine)
    rownum = df_status.loc[0, "Rows"]

    # Starting offsets for each chunk, plus one extra offset at the end
    space = np.arange(0, rownum, chunksize).tolist()
    space.append(rownum)

    # Fetch each chunk with MySQL's "LIMIT offset, count" syntax and stitch them together
    df_list = []
    for i in space:
        df = pd.read_sql("select * from yourtable LIMIT {}, {}".format(i, chunksize), con=sql_engine)
        df_list.append(df)
    big_df = pd.concat(df_list)
    sql_engine.close()
Nett answered 15/12, 2020 at 22:45 Comment(0)

https://github.com/cloud-utils/aurora-data-api handles this problem.

A quick example of a "get all" query using the library:

import aurora_data_api

with aurora_data_api.connect(aurora_cluster_arn=cluster_arn, secret_arn=secret_arn, database="cool_db_name_here") as conn:
    with conn.cursor() as cursor:
        cursor.execute("select * from cool_table")
        data = cursor.fetchall()
Dobby answered 28/7, 2021 at 15:22 Comment(0)

I want to share another example using pandas' read_sql, as in the answer above. I've tried to make it shorter and more readable. It also allows more flexibility in the query by counting the rows instead of using the table status command. However, this method may be slower if the table has a large number of rows, because it requires a count(*) query.

# 'engine' is a SQLAlchemy engine for the Data API,
# e.g. created with create_engine('mysql+pydataapi://', ...) as in the answer above
chunksize = 10000
with engine.connect() as conn:
    # Count the rows first, then pull the table down in LIMIT/OFFSET pages
    result = conn.execute("select count(*) from your_tbl").fetchone()
    row_num = result[0]
    chunks = [
        pd.read_sql("select * from your_tbl limit {} offset {}".format(chunksize, i), con=conn)
        for i in range(0, row_num, chunksize)
    ]
    data_df = pd.concat(chunks, ignore_index=True)
Hyperbola answered 7/3, 2023 at 9:20 Comment(0)
