Execute multiple functions together without losing performance

I have this process that has to make a series of queries, using PL/pgSQL:

--process:
SELECT function1();
SELECT function2();
SELECT function3();
SELECT function4();

To be able to execute everything in one call, I created a process function like this:

CREATE OR REPLACE FUNCTION process()
  RETURNS text AS
$BODY$
BEGIN
    PERFORM function1();
    PERFORM function2();
    PERFORM function3();
    PERFORM function4();
    RETURN 'process ended';
END;
$BODY$
  LANGUAGE plpgsql;

The problem is, when I sum the time that each function takes by itself, the total is 200 seconds, while the time that the function process() takes is more than one hour!

Maybe it's a memory issue, but I don't know which setting in postgresql.conf I should change.

The DB is running PostgreSQL 9.4 on Debian 8.

Bicapsular answered 20/5, 2015 at 13:33 Comment(8)
Maybe perform leads to a different query plan? Try something like declare MyVar int; begin MyVar := (select function1()); ?Thanhthank
if you SELECT function1(), function2(), function3(), function4(); then PostgreSQL will be able to run all four functions in parallel. That doesn't mean it will, but at least it can.Graffito
@Thanhthank I tried it, it didn't fix it :(Bicapsular
@Graffito They have to run consecutively.Bicapsular
@cliffordheath: Postgres doesn't support parallel execution of one queryThanhthank
Could you roughly sketch what these functions are doing? If they're taking 200 seconds, they're clearly doing some nontrivial work.Alphonse
I think I know why this happens, but not how to avoid it: Postgres functions have an implicit transaction around them, so have to do extra work to maintain logs or locks across all 4 of the inner functions. (You could check if this was the cause by timing a run of all 4 functions within an explicit transaction.)Marinate
@Marinate is right. Read this post.Despoliation

You commented that the 4 functions have to run consecutively. So it's safe to assume that each function works with data from tables that have been modified by the previous function. That's my prime suspect.

Any Postgres function runs inside the transaction of the outer context. So all functions share the same transaction context if packed into another function. Each can see effects on data from previous functions, obviously. (Even though effects are still invisible to other concurrent transactions.) But statistics are not updated immediately.
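
For example, a hypothetical illustration with a made-up table standing in for what the functions do:

BEGIN;
INSERT INTO some_table VALUES (1);  -- stands in for function1() modifying data
SELECT count(*) FROM some_table;    -- stands in for function2(): already sees the new row
-- a concurrent session does not see the new row until:
COMMIT;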

Query plans are based on statistics for the involved objects. And PL/pgSQL does not plan statements until they are actually executed, which would work in your favor. Per the documentation:

As each expression and SQL command is first executed in the function, the PL/pgSQL interpreter parses and analyzes the command to create a prepared statement, using the SPI manager's SPI_prepare function.

PL/pgSQL can cache query plans, but only within the same session and (in pg 9.2+ at least) only after a couple of executions have shown the same query plan to work best repeatedly. If you suspect this is going wrong for you, you can work around it with dynamic SQL, which forces a new plan every time:

EXECUTE 'SELECT function1()';
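
Applied to the whole wrapper, that could look like the sketch below (function names taken from the question; EXECUTE without INTO simply runs the statement and discards the result):

CREATE OR REPLACE FUNCTION process()
  RETURNS text
  LANGUAGE plpgsql AS
$func$
BEGIN
   EXECUTE 'SELECT function1()';  -- parsed and planned afresh on every call
   EXECUTE 'SELECT function2()';
   EXECUTE 'SELECT function3()';
   EXECUTE 'SELECT function4()';
   RETURN 'process ended';
END
$func$;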

However, the most likely candidate I see is invalidated statistics that lead to inferior query plans. SELECT / PERFORM statements (same thing) inside the function are run in quick succession, so there is no chance for autovacuum to kick in and update statistics between one function and the next. If one function substantially alters data in a table the next function is working with, the next function might base its query plan on outdated information. Typical example: a table with a few rows is filled with many thousands of rows, but the next plan still thinks a sequential scan is fastest for the "small" table. You state:

when I sum the time that each function takes by itself, the total is 200 seconds, while the time that the function process() takes is more than one hour!

What exactly does "by itself" mean? Did you run them in a single transaction or in individual transactions? Maybe even with some time in between? That would allow autovacuum to update statistics (it's typically rather quick) and possibly lead to completely different query plans based on the changed statistics.
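
A hypothetical way to watch this effect in isolation (table t and the row count are made up for the demo; autovacuum does not analyze temporary tables, which makes them a convenient vehicle here):

CREATE TEMP TABLE t (id int);
CREATE INDEX ON t (id);
INSERT INTO t SELECT generate_series(1, 100000);  -- "small" table suddenly holds 100k rows
EXPLAIN SELECT * FROM t WHERE id = 1;  -- plan chosen with missing / stale statistics
ANALYZE t;
EXPLAIN SELECT * FROM t WHERE id = 1;  -- compare: plan based on fresh statistics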

You can inspect query plans inside PL/pgSQL functions with the auto_explain module.
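
For example, loaded into a single session (requires superuser; plans then show up in the server log):

LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;        -- log the plan of every statement
SET auto_explain.log_nested_statements = on;  -- include statements inside functions
SELECT process();                             -- inspect plans in the server log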

If you can identify such an issue, you can force ANALYZE in between statements. While you're at it, for just a couple of SELECT / PERFORM statements you might as well use a simpler SQL function and avoid plan caching altogether (but see below!):

CREATE OR REPLACE FUNCTION process()
  RETURNS text
  LANGUAGE sql AS
$func$
   SELECT function1();

   ANALYZE some_substantially_affected_table;

   SELECT function2();
   SELECT function3();

   ANALYZE some_other_table;

   SELECT function4();
   SELECT 'process ended';  -- only last result is returned
$func$;

Also, as long as we don't see the actual code of your called functions, there can be any number of other hidden effects.
Example: you could SET LOCAL ... some configuration parameter to improve the performance of function1(). If the functions are called in separate transactions, that won't affect the rest: the effect only lasts till the end of the transaction. But if they are called in a single transaction, it affects the rest, too ...
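
A hypothetical illustration (work_mem is only an example parameter; the point is the scope):

CREATE OR REPLACE FUNCTION function1()
  RETURNS void
  LANGUAGE plpgsql AS
$func$
BEGIN
   SET LOCAL work_mem = '256MB';  -- lasts until the end of the transaction,
                                  -- not just until the end of this function
   -- ... the actual work of function1() ...
END
$func$;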


Also, transactions accumulate locks, which bind an increasing amount of resources and may cause increasing friction with concurrent processes. All locks are released at the end of a transaction. It's better to run big functions in separate transactions if at all possible, not wrapped in a single function (and thus a single transaction). This is related to what @klin and IMSoP already covered.
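
You can watch those locks from another session via the pg_locks view (the pid is hypothetical):

SELECT locktype, relation::regclass, mode, granted
FROM   pg_locks
WHERE  pid = 12345;  -- backend pid of the session running process()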

Ligule answered 25/5, 2015 at 3:19 Comment(5)
In earlier Postgres versions (probably prior to 9.4), ANALYZE inside a transaction could damage tables when the transaction was rolled back.Despoliation
@klin: And when you say "damage tables" you actually mean inaccurate statistics in the system catalog describing the state that was rolled back? No actual table is harmed in the process. Ever. Also, this is not a problem since at least Postgres 9.3: postgresql.org/message-id/flat/…Ligule
To @Erwin. No, I mean this.Despoliation
@klin: I see what you mean. It was a rare corner case, but the fix was back-patched to all supported versions: 9.4.1, 9.3.6, 9.2.10, 9.1.15, and 9.0.19 (documented in the respective release notes). Everybody following the project's policy to always run the latest available minor release for each major release is fine.Ligule
To @Erwin. ANALYZE inside a function is also a rare corner case, and in my opinion it will not bring any results. The one-hour function execution is not caused by an improper plan but by a lack of RAM. Several times I have had the opportunity to heal similar cases by splitting unnecessarily complicated transactions into independent parts. What the OP is trying to do is just a crime against server efficiency.Despoliation

Warning for future readers (2015-05-30).

The technique described in the question is one of the smartest ways to effectively block the server.

In some corporations, the use of this technique may be rewarded with immediate termination of the employment contract.

Attempts to improve this method are useless. It is simple, beautiful and sufficiently effective.


In an RDBMS, transaction support is very expensive. When executing a transaction, the server must create and store information on all changes made to the database, both to make these changes visible to the environment (other concurrent processes) on successful completion and, in case of failure, to restore the state from before the transaction as quickly as possible. Therefore the natural principle affecting server performance is to include in one transaction a minimal number of database operations, i.e. only as many as necessary.

A Postgres function is executed in one transaction. Placing many operations in it that could be run independently is a serious violation of the above rule.
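
In other words, calling process() behaves like the explicit transaction below; timing this block (as suggested in the comments on the question) would show whether the single wrapping transaction is the culprit:

BEGIN;
SELECT function1();
SELECT function2();
SELECT function3();
SELECT function4();
COMMIT;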

The answer is simple: just do not do it. A function execution is not a mere execution of a script.

In the procedural languages used to write applications there are many other ways to simplify the code by using functions or scripts. There is also the possibility to run scripts with a shell, as sketched below.
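
For example, a plain SQL script run through psql executes each statement in its own transaction, because psql runs in autocommit mode by default. A minimal sketch, assuming a database named mydb:

-- process.sql -- run with:  psql -d mydb -f process.sql
-- each statement commits on its own, releasing locks between the steps
SELECT function1();
SELECT function2();
SELECT function3();
SELECT function4();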

Using a Postgres function for this purpose would make sense if there were a possibility of using transactions within the function. At present, no such possibility exists, although discussions on this issue already have a long history (you can read about it, e.g., in the Postgres mailing lists).

Despoliation answered 24/5, 2015 at 23:47 Comment(0)
