Django multiprocessing and database connections
Background:

I'm working on a project which uses Django with a Postgres database. We're also using mod_wsgi in case that matters, since some of my web searches have made mention of it. On web form submit, the Django view kicks off a job that will take a substantial amount of time (more than the user would want to wait), so we kick off the job via a system call in the background. The job that is now running needs to be able to read and write to the database. Because this job takes so long, we use multiprocessing to run parts of it in parallel.

Problem:

The top level script has a database connection, and when it spawns off child processes, it seems that the parent's connection is available to the children. Then there's an exception about how SET TRANSACTION ISOLATION LEVEL must be called before a query. Research has indicated that this is due to trying to use the same database connection in multiple processes. One thread I found suggested calling connection.close() at the start of the child processes so that Django will automatically create a new connection when it needs one, and therefore each child process will have a unique connection - i.e. not shared. This didn't work for me, as calling connection.close() in the child process caused the parent process to complain that the connection was lost.

Other Findings:

Some stuff I read seemed to indicate you can't really do this, and that multiprocessing, mod_wsgi, and Django don't play well together. That just seems hard to believe I guess.

Some suggested using celery, which might be a long term solution, but I am unable to get celery installed at this time, pending some approval processes, so not an option right now.

Found several references on SO and elsewhere about persistent database connections, which I believe to be a different problem.

Also found references to psycopg2.pool and pgpool and something about bouncer. Admittedly, I didn't understand most of what I was reading on those, but it certainly didn't jump out at me as being what I was looking for.

Current "Work-Around":

For now, I've reverted to just running things serially, and it works, but is slower than I'd like.

Any suggestions as to how I can use multiprocessing to run in parallel? Seems like if I could have the parent and two children all have independent connections to the database, things would be ok, but I can't seem to get that behavior.

Thanks, and sorry for the length!

Palenque answered 23/11, 2011 at 13:18 Comment(1)
Also see this discussionGroyne

Multiprocessing copies connection objects between processes because it forks processes, and therefore copies all the file descriptors of the parent process. That being said, a connection to the SQL server is just a file; you can see it in Linux under /proc/<PID>/fd/... Any open file will be shared between forked processes. You can find more about forking here.
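To see the descriptor sharing for yourself, here is a minimal sketch (not from the original answer; it assumes Linux with /proc available, and the socket to example.com merely stands in for a database socket):

import os
import socket

sock = socket.create_connection(("example.com", 80))  # stand-in for a DB socket

pid = os.fork()
if pid == 0:  # child
    # Same fd numbers as the parent - the socket is inherited, not duplicated.
    print("child fds:", os.listdir("/proc/self/fd"))
    os._exit(0)
else:  # parent
    print("parent fds:", os.listdir("/proc/self/fd"))
    os.waitpid(pid, 0)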

My solution was simply to close the db connection just before launching processes; each process recreates the connection itself when it needs one (tested in Django 1.4):

from multiprocessing import Process
from django import db

db.connections.close_all()

def db_worker():
    some_parallel_code()  # the worker re-opens its own connection on the first query

Process(target=db_worker, args=()).start()

Pgbouncer/pgpool is not related to threads/processes in the multiprocessing sense. It is rather a solution for not closing the connection on each request, i.e. for speeding up connecting to Postgres under high load.
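For completeness, if you did put PgBouncer in front of Postgres, the only change on the Django side is the connection settings; a rough sketch, assuming PgBouncer listens locally on port 6432 (all names and values here are illustrative, not from the original answer):

# settings.py - hypothetical values; point Django at PgBouncer instead of Postgres directly
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'mydb',
        'USER': 'myuser',
        'PASSWORD': 'secret',
        'HOST': '127.0.0.1',
        'PORT': '6432',  # PgBouncer's listen port; Postgres itself stays on 5432
    }
}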

Update:

To completely remove problems with the database connection, simply move all logic connected with the database into db_worker. I wanted to pass a QuerySet as an argument... A better idea is to simply pass a list of ids - see values_list('id', flat=True) - and do not forget to turn it into a list (list(...)) before passing it to db_worker. Thanks to that we do not copy the model's database connection.

from multiprocessing import Process

def db_worker(model_ids):
    obj = PartModelWorkerClass(model_ids)  # here you do Model.objects.filter(id__in=model_ids)
    obj.run()


model_ids = Model.objects.all().values_list('id', flat=True)
model_ids = list(model_ids)  # cast to list so we do not pass a queryset (and its connection) around
process_count = 5
delta = (len(model_ids) // process_count) + 1  # integer chunk size

# do all the db stuff here ...

# here you can close the db connection
from django import db
db.connections.close_all()

for it in range(process_count):
    Process(target=db_worker, args=(model_ids[it * delta:(it + 1) * delta],)).start()
Reata answered 21/5, 2012 at 11:47 Comment(8)
could you explain that bit about the passing of ID's from a queryset to a self answered question?Shakiashaking
multiprocessing copies connection objects between processes because it forks processes, and therefore copies all the file descriptors of the parent process. That being said, a connection to the mysql server is just a file, you can see it in linux under /proc/<PID>/fd/.... any open file will be shared between forked processes AFAIK. #4277789Maidamaidan
Does that apply to threads as well? Eg. close db conn in main thread, then access db in each thread, will each thread get its own connection?Hyo
You should use django.db.connections.close_all() to close all the connections with one call.Drusilla
@Dejell yeah, you can also do that inside the db worker and it should work... Why I chose the other way - I do not remember exactly - probably so as not to have to remember to close the connection in each db_worker function - I have more than one in my use case.Reata
Thanks. I am still missing - if this is a file, and multiple processes are using the same file, shouldn't they share the same connection? so if one process opened a connection, django will use that one in another processDrippy
Hm... Here is quite interesting talk between folks from django: code.djangoproject.com/ticket/20562 maybe it will shed some light on this topic? Basically connections 'are not forkable'... Each process should have it own connection.Reata
Should I close connection also when using ‘multiprocessing.apply_async’?Wawro

When using multiple databases, you should close all connections.

from django import db
for connection_name in db.connections.databases:
    db.connections[connection_name].close()

EDIT

Please use the same approach as @lechup mentioned to close all connections (not sure since which Django version this method was added):

from django import db
db.connections.close_all()
Reindeer answered 26/1, 2014 at 22:2 Comment(3)
this is just calling db.close_connection multiple timesPterous
I don't see how this can work without using alias or info anywhere.Hayleyhayloft
This... can't work. @Mounir, you should modify it to use alias or info in the for loop body, if db or close_connection() supports that.Magruder

For Python 3 and Django 1.9 this is what worked for me:

import multiprocessing
import django
from django.db import connections

django.setup()  # Must call setup

def db_worker():
    for alias in connections:  # close every configured DB connection by its alias
        connections[alias].close()
    # Execute parallel code here

if __name__ == '__main__':
    multiprocessing.Process(target=db_worker).start()

Note that without the django.setup() I could not get this to work. I am guessing something needs to be initialized again for multiprocessing.

Repartition answered 13/7, 2016 at 15:57 Comment(3)
Thanks! This worked for me and probably should be the accepted answer now for newer versions of django.Kearse
The Django way is to create a management command, not a standalone wrapper script. If you do not use a management command you need to call django.setup() yourself.Reata
Your for loop isn't actually doing anything with db.connections.databases.items() - it's just closing the connection several times. db.connections.close_all() works fine as long as it's called in the worker function.Lucier

I had "closed connection" issues when running Django test cases sequentially. In addition to the tests, there is also another process intentionally modifying the database during test execution. This process is started in each test case setUp().

A simple fix was to inherit my test classes from TransactionTestCase instead of TestCase. This makes sure that the database was actually written, and the other process has an up-to-date view on the data.
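A minimal sketch of that change (the test name and the helper that starts the external process are made up for illustration):

from django.test import TransactionTestCase  # instead of TestCase

class JobProcessingTests(TransactionTestCase):
    def setUp(self):
        # Unlike TestCase, data written here is really committed, so the
        # external process started below can see it.
        start_external_db_modifier()  # hypothetical helper from the test suite

    def test_job_runs(self):
        ...  # exercise the code that races with the external process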

Exoteric answered 17/10, 2018 at 4:23 Comment(1)
Works nicely on Linux, but does not seem to work on WindowsSharp

Another way around your issue is to initialise a new connection to the database inside the forked process using:

from django.db import connection    
connection.connect()
Pink answered 19/8, 2021 at 13:49 Comment(0)

(not a great solution, but a possible workaround)

if you can't use celery, maybe you could implement your own queueing system, basically adding tasks to some task table and having a regular cron job that picks them up and processes them (via a management command)?
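A rough sketch of that idea, with made-up model and command names (nothing here comes from the original project): the view inserts a Task row, and a management command run from cron claims pending rows and processes them.

# myapp/models.py (hypothetical)
from django.db import models

class Task(models.Model):
    PENDING, DONE = 'pending', 'done'
    status = models.CharField(max_length=10, default=PENDING)
    payload = models.TextField()

# myapp/management/commands/process_tasks.py (hypothetical),
# run from cron, e.g.:  */5 * * * * /path/to/manage.py process_tasks
from django.core.management.base import BaseCommand
from myapp.models import Task

class Command(BaseCommand):
    def handle(self, *args, **options):
        for task in Task.objects.filter(status=Task.PENDING):
            do_the_long_running_work(task)  # your existing job logic
            task.status = Task.DONE
            task.save()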

Balinese answered 23/11, 2011 at 13:30 Comment(3)
possibly - was hoping to avoid that level of complexity, but if it's the only solution, then I may have to go down that road - thanks for the suggestion. Is celery the best answer? if so, I may be able to push to get it, but it will take a while. I associate celery with distributed processing as opposed to parallel processing on one machine, but maybe that's just my lack of experience with it..Palenque
celery is a good fit for any processing required outside the request-response cycleBalinese
Polling is fine if tasks are not in a hurry. But you will have to rewrite everything if requirements change just a little.Seagraves

Hey I ran into this issue and was able to resolve it by performing the following (we are implementing a limited task system)

task.py

import functools

from django.db import connection

def as_task(fn):
    """ this is a decorator that handles task duties, like setting up loggers, reporting on status...etc """
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        connection.close()  # this is where i kill the database connection VERY IMPORTANT
        # This will force django to open a new unique connection, since on linux at least
        # connections do not fare well when forked
        # ...etc
        return fn(*args, **kwargs)
    return wrapper

ScheduledJob.py

import multiprocessing

from django.core.management import call_command
from django.db import connection

def run_task(request, job_id):
    """ Just a simple view that, when hit with a specific job id, kicks off said job """
    # your logic goes here
    # ...
    processor = multiprocessing.Queue()
    multiprocessing.Process(
        target=call_command,  # all of our tasks are set up as management commands in django
        args=[
            job_info.management_command,
        ],
        kwargs=dict(vars(options), web_processor=processor)).start()

    result = processor.get(timeout=10)  # wait to get a response on a successful init
    # result is a tuple of [True|False, <error message>]
    if not result[0]:
        raise Exception(result[1])
    else:
        # THE VERY VERY IMPORTANT PART HERE: notice that up to this point we haven't
        # touched the db again, but now we absolutely have to call connection.close()
        connection.close()
        # we do some database accessing here to get the most recently updated job id in the database

Honestly, to prevent race conditions (with multiple simultaneous users) it would be best to call connection.close() as quickly as possible after you fork the process. There may still be a chance that another request somewhere down the line hits the db before you have a chance to close the connection, though.

In all honesty it would likely be safer and smarter to have your fork not call the command directly, but instead call a script on the operating system so that the spawned task runs in its own django shell!
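A hedged sketch of that last suggestion (the cwd is a placeholder): spawn the management command as a completely separate OS process, so it does its own Django setup and opens its own DB connection, instead of inheriting a forked copy of the parent's.

import subprocess
import sys

# job_info.management_command is the same attribute used above; the path is illustrative.
subprocess.Popen(
    [sys.executable, 'manage.py', job_info.management_command],
    cwd='/path/to/project',  # the directory containing manage.py
)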

Teece answered 31/10, 2013 at 17:26 Comment(1)
I used your idea of closing inside the fork instead of before, to make a decorator that I add to my worker functions.Darrickdarrill

If all you need is I/O parallelism and not processing parallelism, you can avoid this problem by switching your processes to threads. Replace

from multiprocessing import Process

with

from threading import Thread

The Thread object has the same interface as Process.
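For example, a minimal sketch of the chunked-worker pattern from above using threads (process_ids and chunks are placeholders; each thread lazily opens its own Django connection, which it closes when done):

from threading import Thread
from django.db import connections

def db_worker(model_ids):
    process_ids(model_ids)   # stand-in for your per-chunk work; the first query opens a connection
    connections.close_all()  # connections are thread-local, so this only closes this thread's

threads = [Thread(target=db_worker, args=(chunk,)) for chunk in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()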

Descombes answered 6/11, 2017 at 18:6 Comment(0)

If you're also using connection pooling, the following worked for us, forcibly closing the connections after being forked. Closing them beforehand did not seem to help.

from django.db import connections
from django.db.utils import DEFAULT_DB_ALIAS

connections[DEFAULT_DB_ALIAS].dispose()
Varicella answered 5/12, 2019 at 1:0 Comment(0)

One possibility is to use the multiprocessing 'spawn' child process creation method, which will not copy Django's DB connection details to the child processes. The child processes need to bootstrap from scratch, but are free to create/close their own Django DB connections.

In calling code:

import multiprocessing
from myworker import work_one_item # <-- Your worker method

...

# Uses connection A
list_of_items = django_db_call_one()

# 'spawn' starts new python processes
with multiprocessing.get_context('spawn').Pool() as pool:
    # work_one_item will create own DB connection
    parallel_results = pool.map(work_one_item, list_of_items)

# Continues to use connection A
another_db_call(parallel_results) 

In myworker.py:

import django   # <--
django.setup() # <-- needed if you'll make DB calls in worker

def work_one_item(item):
    try:
        # This will create a new DB connection
        return len(MyDjangoModel.objects.all())

    except Exception as ex:
        return ex

Note that if you're running the calling code inside a TestCase, mocks will not be propagated to the child processes (will need to re-apply them).

Brenn answered 19/11, 2021 at 17:28 Comment(0)

You could give more resources to Postgres; in Debian/Ubuntu you can edit:

nano /etc/postgresql/9.4/main/postgresql.conf

replacing 9.4 with your Postgres version.

Here are some useful lines that should be updated with example values to do so; the names speak for themselves:

max_connections=100
shared_buffers = 3000MB
temp_buffers = 800MB
effective_io_concurrency = 300
max_worker_processes = 80

Be careful not to boost these parameters too much, as it might lead to errors with Postgres trying to take more resources than are available. The examples above run fine on a Debian machine with 8GB RAM and 4 cores.

Bohemianism answered 29/6, 2015 at 19:55 Comment(0)

Override the Thread class and close all DB connections at the end of the thread. The code below works for me:

from threading import Thread

from django.db import connections

class MyThread(Thread):
    def run(self):
        super().run()

        connections.close_all()

def myasync(function):
    def decorator(*args, **kwargs):
        t = MyThread(target=function, args=args, kwargs=kwargs)
        t.daemon = True
        t.start()

    return decorator

When you need to call a function asynchronized:

@myasync
def async_function():
    ...
Prosperous answered 20/4, 2021 at 7:42 Comment(0)

Already answered, but just to summarize what worked for me: just before you fork processes, close the db connections. The connection object copied into each child process will then already be closed, and when any query is made on it Django will open a new connection.

from django import db

db.connections.close_all()

Geothermal answered 29/3 at 13:39 Comment(0)
