FastAPI is very slow in returning a large amount of JSON data

I have a FastAPI GET endpoint that is returning a large amount of JSON data (~160,000 rows and 45 columns). Unsurprisingly, it is extremely slow to return the data using json.dumps(). I am first reading the data from a file using json.loads() and filtering it per the inputted parameters. Is there a faster way to return the data to the user than using return data? It takes nearly a minute in the current state.

My code currently looks like this:

# helper function to parse parquet file (where data is stored)
def parse_parquet(file_path):
    df = pd.read_parquet(file_path)
    result = df.to_json(orient = 'records')
    parsed = json.loads(result)
    return parsed
    

@app.get('/endpoint')
# has several more parameters
async def some_function(year: int | None = None, id: str | None = None):
    if year is not None:
        data = parse_parquet(f'path/{year}_data.parquet')
    # no year
    if year is None:
        data = parse_parquet('path/all_data.parquet')
    if id is not None:
        data = [d for d in data if d['id'] == id]
    return data
Hagiolatry answered 1/9, 2022 at 5:51 Comment(3)
How much time does your parse_parquet function take? - Parterre
@Parterre negligible time. The timing issue is on returning the data as a json - Hagiolatry
Have you seen #72221772 about how to use alternative json encoders? - Tourney

One reason the response is so slow is that your parse_parquet() function first converts the file into JSON (using df.to_json()), then into Python objects, here a list of dictionaries (using json.loads()), and finally into JSON again. Behind the scenes, FastAPI automatically converts the returned value into JSON-compatible data using the jsonable_encoder, and then uses the Python standard json.dumps() to serialise the object, a process that is quite slow (see this answer for more details).
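The wasted work in that round trip can be seen with the standard library alone. The following is only a minimal sketch: the rows list is made-up data standing in for the string that df.to_json(orient="records") would produce, the function names are hypothetical, and it leaves out the extra jsonable_encoder pass that FastAPI adds on top.

```python
import json
import time

# Made-up rows standing in for the output of df.to_json(orient="records").
rows = [{"id": str(i), "value": i} for i in range(100_000)]
json_text = json.dumps(rows)


def round_trip(text: str) -> str:
    """Decode to Python objects and encode again, like the question's code path."""
    return json.dumps(json.loads(text))


def pass_through(text: str) -> str:
    """Return the already-serialised string untouched."""
    return text


start = time.perf_counter()
slow = round_trip(json_text)
round_trip_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = pass_through(json_text)
pass_through_seconds = time.perf_counter() - start

# Both strings describe the same data; only the amount of work differs.
same_payload = json.loads(slow) == json.loads(fast)
print(f"round trip: {round_trip_seconds:.4f}s, pass-through: {pass_through_seconds:.6f}s")
```

The exact timings depend on the machine, but the pass-through path does no parsing or encoding at all, which is the idea behind returning the to_json() string in a Response directly.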

As suggested by @MatsLindh in the comments section, you could use alternative JSON encoders, such as orjson or ujson (see this answer as well), which would indeed speed up the process compared to letting FastAPI use the jsonable_encoder and then the standard json.dumps() to convert the data into JSON. However, using pandas to_json() and returning a custom Response directly, as described in Option 1 (Update 2) of this answer, seems to be the best-performing solution. You can use the code given below, which uses a custom APIRoute class, to compare the response time for all available solutions.

Use your own parquet file or the below code to create a sample parquet file consisting of 160K rows and 45 columns.

create_parquet.py

import pandas as pd
import numpy as np

columns = ['C' + str(i) for i in range(1, 46)]
df = pd.DataFrame(data=np.random.randint(99999, 99999999, size=(160000,45)),columns=columns)
df.to_parquet('data.parquet')

Run the FastAPI app below and access each endpoint separately to inspect the time taken to complete the process of loading and converting the data into JSON.

app.py

from fastapi import FastAPI, APIRouter, Response, Request
from fastapi.routing import APIRoute
from typing import Callable
import pandas as pd
import json
import time
import ujson
import orjson


class TimedRoute(APIRoute):
    def get_route_handler(self) -> Callable:
        original_route_handler = super().get_route_handler()

        async def custom_route_handler(request: Request) -> Response:
            before = time.time()
            response: Response = await original_route_handler(request)
            duration = time.time() - before
            response.headers["Response-Time"] = str(duration)
            print(f"route duration: {duration}")
            return response

        return custom_route_handler

app = FastAPI()
router = APIRouter(route_class=TimedRoute)

@router.get("/defaultFastAPIencoder")
def get_data_default():
    df = pd.read_parquet('data.parquet')   
    return df.to_dict(orient="records")
    
@router.get("/orjson")
def get_data_orjson():
    df = pd.read_parquet('data.parquet')
    return Response(orjson.dumps(df.to_dict(orient='records')), media_type="application/json")

@router.get("/ujson")
def get_data_ujson():
    df = pd.read_parquet('data.parquet')   
    return Response(ujson.dumps(df.to_dict(orient='records')), media_type="application/json")

# Preferred way  
@router.get("/pandasJSON")
def get_data_pandasJSON():
    df = pd.read_parquet('data.parquet')   
    return Response(df.to_json(orient="records"), media_type="application/json")  

app.include_router(router)

Even though the response time is quite fast using /pandasJSON above (and this should be the preferred way), you may encounter some delay in displaying the data in the browser. That, however, has nothing to do with the server side, but with the client side, as the browser is trying to display a large amount of data. If you don't want to display the data, but instead let the user download it to their device (which would be much faster), you can set the Content-Disposition header in the Response using the attachment parameter and passing a filename as well, indicating to the browser that the file should be downloaded. For more details, have a look at this answer and this answer.

@router.get("/download")
def get_data():
    df = pd.read_parquet('data.parquet')
    headers = {'Content-Disposition': 'attachment; filename="data.json"'}
    return Response(df.to_json(orient="records"), headers=headers, media_type='application/json')
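The question's endpoint also filters by id. One way to keep that filtering while still serialising only once is to filter the DataFrame itself before calling to_json(). This is a sketch rather than part of the benchmark above: filtered_json is a hypothetical helper, the sample frame is made up, and the 'id' column name is taken from the question.

```python
import json
from typing import Optional

import pandas as pd


def filtered_json(df: pd.DataFrame, id_value: Optional[str] = None) -> str:
    """Filter rows on the 'id' column (if requested) and serialise once."""
    if id_value is not None:
        # Boolean mask: keep only rows whose 'id' equals the requested value.
        df = df[df["id"] == id_value]
    return df.to_json(orient="records")


# Tiny made-up frame standing in for the parquet data.
sample = pd.DataFrame({"id": ["a", "b", "a"], "value": [1, 2, 3]})
payload = filtered_json(sample, "a")
records = json.loads(payload)
```

Inside an endpoint, the returned string could be passed to Response(..., media_type="application/json") in the same way as in /pandasJSON.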

I should also mention that there is a library called Dask, which can handle large datasets, as described here, in case processing a large number of records is taking too long to complete. Similar to Pandas, you can use the .read_parquet() method to read the file. As Dask doesn't seem to provide an equivalent .to_json() method, you could convert the Dask DataFrame to a Pandas DataFrame using df.compute(), and then use Pandas df.to_json() to convert the DataFrame into a JSON string, and return it as demonstrated above.

I would also suggest you take a look at this answer, which provides details and solutions on streaming/returning a DataFrame. That matters when the data is large enough that converting it into JSON (using .to_json()) or CSV (using .to_csv()) may cause memory issues on the server side. If you don't pass a path parameter to these functions, the output string (either JSON or CSV) is held in RAM, which is the default behaviour, on top of the large amount of memory already allocated for the original DataFrame.
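As a rough illustration of the streaming idea (not the code from the linked answer), the sketch below yields one JSON array in pieces, so that only a small batch of encoded records is held at a time. iter_json_array and chunk_size are hypothetical names, and the sample records are made up.

```python
import json
from typing import Iterable, Iterator


def iter_json_array(records: Iterable[dict], chunk_size: int = 1000) -> Iterator[str]:
    """Yield pieces of one JSON array, encoding at most chunk_size records per piece."""
    yield "["
    first = True  # True until the first batch of records has been emitted
    batch = []
    for record in records:
        batch.append(json.dumps(record))
        if len(batch) == chunk_size:
            # Every batch after the first needs a leading comma separator.
            yield ("" if first else ",") + ",".join(batch)
            first = False
            batch = []
    if batch:
        yield ("" if first else ",") + ",".join(batch)
    yield "]"


# Made-up records; concatenating the pieces gives a valid JSON document.
sample_records = [{"id": str(i), "value": i} for i in range(5)]
streamed = "".join(iter_json_array(sample_records, chunk_size=2))
```

A generator like this could be handed to FastAPI's StreamingResponse (from fastapi.responses) with media_type="application/json". How the records are pulled out of the DataFrame in batches is left open here and would need its own care to avoid copying the whole frame.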

Grekin answered 2/9, 2022 at 8:55 Comment(0)

I guess json.loads(result) returns a list of dicts in your case (given orient='records'), and you are filtering that list. You can send the filtered data as JSON as follows:

from fastapi.responses import JSONResponse

@app.get('/endpoint')
# has several more parameters
async def some_function(year: int | None = None, id: str | None = None):
    if year is not None:
        data = parse_parquet(f'path/{year}_data.parquet')
    # no year
    if year is None:
        data = parse_parquet('path/all_data.parquet')
    if id is not None:
        data = [d for d in data if d['id'] == id]
    # return the filtered data directly as a JSON response
    return JSONResponse(content=data)
Parterre answered 1/9, 2022 at 6:53 Comment(0)
