How should I investigate a memory leak when using Google Cloud Datastore Python libraries?

I have a web application that uses Google's Datastore and that has been running out of memory after enough requests.

I have narrowed this down to a Datastore query. A minimal PoC is provided below; a slightly longer version, which includes the memory measurement, is on GitHub.

from google.cloud import datastore
from google.oauth2 import service_account

def test_datastore(entity_type: str) -> None:
    # A fresh credentials object and Client are created on every call
    creds = service_account.Credentials.from_service_account_file("/path/to/creds")
    client = datastore.Client(credentials=creds, project="my-project")
    query = client.query(kind=entity_type, namespace="my-namespace")
    query.keys_only()
    for result in query.fetch(1):
        print(f"[+] Got a result: {result}")

for n in range(0,100):
    test_datastore("my-entity-type")

Profiling the process RSS shows approximately 1 MiB of growth per iteration. This happens even if no results are returned. The following is the output from my GitHub gist:

[+] Iteration 0, memory usage 38.9 MiB bytes
[+] Iteration 1, memory usage 45.9 MiB bytes
[+] Iteration 2, memory usage 46.8 MiB bytes
[+] Iteration 3, memory usage 47.6 MiB bytes
..
[+] Iteration 98, memory usage 136.3 MiB bytes
[+] Iteration 99, memory usage 137.1 MiB bytes
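
The RSS figures above come from reading the process's resident set size after each iteration. A minimal sketch of that kind of measurement, assuming psutil is used (the actual gist may differ slightly), looks like:

import os
import psutil

def rss_mib() -> float:
    # Resident set size of the current process, converted to MiB
    return psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)

for n in range(100):
    test_datastore("my-entity-type")
    print(f"[+] Iteration {n}, memory usage {rss_mib():.1f} MiB")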

But at the same time, Python's mprof (from the memory_profiler package) shows a flat graph (run as mprof run python datastore_test.py):

[Image: mprof output for 100 Datastore fetches]

The question

Am I doing something wrong with how I call Datastore, or is this likely an underlying problem with a library?

The environment is Python 3.7.4 on Windows 10 (also tested with Python 3.8 on Debian in Docker), with google-cloud-datastore==1.11.0 and grpcio==1.28.1.

Edit 1

To clarify: this isn't typical Python allocator behaviour, where the interpreter requests memory from the OS but doesn't immediately return it because it keeps internal arenas/pools. Below is a graph from Kubernetes, where my affected application runs:

[Image: memory usage graph from Kubernetes, showing a linear increase in memory usage as the process runs]

This shows:

  • Linear growth of memory until around 2 GiB, at which point the application effectively crashed because it was out of memory (technically Kubernetes evicted the pod, but that is not relevant here).
  • The same web application running, but with no interaction with either GCP Storage or Datastore.
  • Interaction with only GCP Storage added (a very slight growth over time, potentially normal).
  • Interaction with only GCP Datastore added (much larger memory growth, approx. 512 MiB in an hour). The Datastore query is exactly the same as the PoC code in this post.

Edit 2

To be absolutely sure about Python's memory usage, I checked the state of the garbage collector via the gc module. Before exit, the program reports:

gc: done, 15966 unreachable, 0 uncollectable, 0.0156s elapsed

I also forced garbage collection manually using gc.collect() during each iteration of the loop, which made no difference.
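
A minimal sketch of that check, with gc debug statistics enabled and a forced collection on every iteration (illustrative only; the real loop lives in the gist):

import gc

gc.set_debug(gc.DEBUG_STATS)  # prints per-collection stats, including the "unreachable" count shown above

for n in range(100):
    test_datastore("my-entity-type")
    gc.collect()  # force a full collection; this made no difference to RSS growth
    print(f"[+] Uncollectable objects: {len(gc.garbage)}")  # remains 0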

As there are no uncollectable objects, it seems unlikely the memory leak is coming from objects allocated using Python's internal memory management. Therefore it is more likely that an external C library is leaking memory.

Potentially related

There is an open grpc issue that I can't be sure is related, but which has a number of similarities to my problem.

Tuneless answered 23/4, 2020 at 20:59

I have narrowed down the memory leak to creation of the datastore.Client object.

For the following proof-of-concept code, memory usage does not grow:

from google.cloud import datastore
from google.oauth2 import service_account

def test_datastore(client: datastore.Client, entity_type: str) -> None:
    query = client.query(kind=entity_type, namespace="my-namespace")
    query.keys_only()
    for result in query.fetch(1):
        print(f"[+] Got a result: {result}")

creds = service_account.Credentials.from_service_account_file("/path/to/creds")
client = datastore.Client(credentials=creds, project="my-project")

for n in range(0,100):
    test_datastore(client, "my-entity-type")

This makes sense for a small script where the client object can be created once and shared between requests safely.

In many other applications it's harder (or impossible) to safely pass around the client object. I'd expect the library to free memory when the client goes out of scope; otherwise this problem could arise in any long-running program.
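
Where a process-wide client is acceptable, one workaround is to hide it behind a cached accessor so that every request reuses the same instance. A minimal sketch (the accessor name and credential path are illustrative):

import functools

from google.cloud import datastore
from google.oauth2 import service_account

@functools.lru_cache(maxsize=1)
def get_datastore_client() -> datastore.Client:
    # Built on first use, then reused by every caller in the process
    creds = service_account.Credentials.from_service_account_file("/path/to/creds")
    return datastore.Client(credentials=creds, project="my-project")

def handle_request(entity_type: str) -> None:
    query = get_datastore_client().query(kind=entity_type, namespace="my-namespace")
    query.keys_only()
    for result in query.fetch(1):
        print(f"[+] Got a result: {result}")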

Edit 1

I have narrowed this down to grpc. The environment variable GOOGLE_CLOUD_DISABLE_GRPC can be set (to any value) to disable grpc.
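
A minimal sketch of setting it from Python rather than in the deployment environment (the value is arbitrary; setting it before the import is safest, since the library may pick its transport at import time):

import os

# Any value disables grpc; set it before importing the datastore client
os.environ["GOOGLE_CLOUD_DISABLE_GRPC"] = "true"

from google.cloud import datastore

client = datastore.Client(project="my-project")  # should now use the HTTP/JSON fallback instead of grpc

In Kubernetes, the equivalent is an env entry on the container spec.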

Once this has been set, my application in Kubernetes looks like:

[Image: memory usage graph from Kubernetes, showing a flat line (no memory increase) after grpc is disabled]

Further investigation with valgrind suggests the leak likely relates to OpenSSL usage in grpc, which I documented in this ticket on the bug tracker.

Tuneless answered 24/4, 2020 at 13:53

Comments:
This is an issue with the garbage collector; reusing the client object as you mention is good practice. You can call gc after the for loop to remove all out-of-scope variables; this link has an example of the GC removing all unused variables after a loop: digi.com/resources/documentation/digidocs/90001537/references/… - Bitter
@JAHDZP do you have any specific links relevant to this module (or other Google client libraries)? As you can see from edit 2 of the question, I have already tried gc.collect() and it makes no difference. I have also turned on gc status debugging to get the same data you suggest. - Tuneless
