Large SQL transaction: runs out of memory on PostgreSQL, yet works on SQL Server

I have decided to move my C# daemon application (using dotConnect as ADO.NET provider) from SQL Server 2008 R2 to PostgreSQL 9.0.4 x64 (on Windows Server 2008 R2). Therefore I slightly modified all queries to match PostgreSQL syntax and... got stuck on behavior which never happened with the same queries on SQL Server (not even on lowly Express edition).

Let's say the database contains 2 very simple tables without any relation to each other. They look somewhat like this: ID, Name, Model, ScanDate, Notes. I have a transformation process which reads data over TCP/IP, processes it, starts a transaction and puts the results into the aforementioned 2 tables using vanilla INSERTs. The tables are initially empty; no BLOB columns. There are about 500,000 INSERTs on a bad day, all wrapped in a single transaction (which cannot be split into multiple transactions, by the way). No SELECTs, UPDATEs or DELETEs are ever made. An example INSERT (ID is a bigserial, auto-incremented automatically):

INSERT INTO logs."Incoming" ("Name", "Model", "ScanDate", "Notes")
VALUES('Ford', 'Focus', '2011-06-01 14:12:32', NULL)

SQL Server calmly accepts the load while maintaining a reasonable Working Set of ~200 MB. PostgreSQL, however, takes up additional 30 MB each second the transaction runs (!) and quickly exhausts system RAM.

I've done my RTFM and tried fiddling with postgresql.conf: setting "work_mem" to the minimum of 64 kB (this slightly slowed down the RAM hogging), reducing "shared_buffers" / "temp_buffers" to the minimum (no difference) - all to no avail. Reducing the transaction isolation level to Read Uncommitted didn't help. There are no indexes except the one on ID BIGSERIAL (PK). SqlCommand.Prepare() makes no difference. No concurrent connections are ever established: the daemon uses the database exclusively.

It may seem that PostgreSQL cannot cope with a mind-numbingly simple INSERT-fest, while SQL Server can. Maybe it's a difference between PostgreSQL's snapshot isolation and SQL Server's lock-based isolation? It's a fact for me: vanilla SQL Server works, while neither vanilla nor tweaked PostgreSQL does.

What can I do to make PostgreSQL's memory consumption remain flat (as it apparently is with SQL Server) while the INSERT-based transaction runs?

EDIT: I have created an artificial test case:

DDL:

CREATE TABLE sometable
(
  "ID" bigserial NOT NULL,
  "Name" character varying(255) NOT NULL,
  "Model" character varying(255) NOT NULL,
  "ScanDate" date NOT NULL,
  CONSTRAINT "PK" PRIMARY KEY ("ID")
)
WITH (
  OIDS=FALSE
);

C# (requires Devart.Data.dll & Devart.Data.PostgreSql.dll)

using System;
using System.Data;
using Devart.Data.PostgreSql;

PgSqlConnection conn = new PgSqlConnection("Host=localhost; Port=5432; Database=testdb; UserId=postgres; Password=###########");
conn.Open();
PgSqlTransaction tx = conn.BeginTransaction(IsolationLevel.ReadCommitted);

for (int ii = 0; ii < 300000; ii++)
{
    // NOTE: the command is deliberately created and Prepare()d inside the loop
    // to reproduce the memory growth
    PgSqlCommand cmd = conn.CreateCommand();
    cmd.Transaction = tx;
    cmd.CommandType = CommandType.Text;
    cmd.CommandText = "INSERT INTO public.\"sometable\" (\"Name\", \"Model\", \"ScanDate\") VALUES(@name, @model, @scanDate) RETURNING \"ID\"";
    PgSqlParameter parm = cmd.CreateParameter();
    parm.ParameterName = "@name";
    parm.Value = "SomeName";
    cmd.Parameters.Add(parm);

    parm = cmd.CreateParameter();
    parm.ParameterName = "@model";
    parm.Value = "SomeModel";
    cmd.Parameters.Add(parm);

    parm = cmd.CreateParameter();
    parm.ParameterName = "@scanDate";
    parm.PgSqlType = PgSqlType.Date;
    parm.Value = new DateTime(2011, 6, 1, 14, 12, 13);
    cmd.Parameters.Add(parm);

    cmd.Prepare();

    long newID = (long)cmd.ExecuteScalar();
}

tx.Commit();

This recreates the memory hogging. HOWEVER: if the 'cmd' variable is created and .Prepare()d outside the FOR loop, the memory does not increase! Apparently, preparing multiple PgSqlCommands with IDENTICAL SQL but different parameter values does not result in a single query plan inside PostgreSQL, like it does in SQL Server.
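
For clarity, here is the variant that keeps memory flat, sketched under the same assumptions as the test case above (dotConnect API, same table and connection string): the command is created and Prepare()d once, and only the parameter values are overwritten inside the loop.

using System;
using System.Data;
using Devart.Data.PostgreSql;

PgSqlConnection conn = new PgSqlConnection("Host=localhost; Port=5432; Database=testdb; UserId=postgres; Password=###########");
conn.Open();
PgSqlTransaction tx = conn.BeginTransaction(IsolationLevel.ReadCommitted);

// One command, one Prepare(): PostgreSQL keeps a single named plan
// for the whole loop instead of creating a new one per INSERT.
PgSqlCommand cmd = conn.CreateCommand();
cmd.Transaction = tx;
cmd.CommandType = CommandType.Text;
cmd.CommandText = "INSERT INTO public.\"sometable\" (\"Name\", \"Model\", \"ScanDate\") VALUES(@name, @model, @scanDate) RETURNING \"ID\"";

PgSqlParameter nameParm = cmd.CreateParameter();
nameParm.ParameterName = "@name";
nameParm.Value = "SomeName";
cmd.Parameters.Add(nameParm);

PgSqlParameter modelParm = cmd.CreateParameter();
modelParm.ParameterName = "@model";
modelParm.Value = "SomeModel";
cmd.Parameters.Add(modelParm);

PgSqlParameter dateParm = cmd.CreateParameter();
dateParm.ParameterName = "@scanDate";
dateParm.PgSqlType = PgSqlType.Date;
dateParm.Value = new DateTime(2011, 6, 1, 14, 12, 13);
cmd.Parameters.Add(dateParm);

cmd.Prepare();   // parsed and planned exactly once

for (int ii = 0; ii < 300000; ii++)
{
    // only the bound values change between executions
    nameParm.Value = "SomeName";
    modelParm.Value = "SomeModel";
    dateParm.Value = new DateTime(2011, 6, 1, 14, 12, 13);
    long newID = (long)cmd.ExecuteScalar();
}

tx.Commit();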

The problem remains: if one uses Fowler's Active Record design pattern to insert multiple new objects, sharing a prepared PgSqlCommand instance is not elegant.

Is there a way/option to facilitate query plan reuse with multiple queries having identical structure yet different argument values?

UPDATE

I've decided to look at the simplest possible case - a SQL batch run directly on the DBMS, without ADO.NET (as suggested by Jordani). Surprisingly, PostgreSQL does not compare incoming SQL queries and does not reuse internally compiled plans - even when the incoming queries have identical arguments! For instance, the following batch:

PostgreSQL (via pgAdmin -> Execute query) -- hogs memory

BEGIN TRANSACTION;

INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES('somename', 'somemodel', '2011-06-01 14:12:19');
INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES('somename', 'somemodel', '2011-06-01 14:12:19');
-- the same INSERT is repeated 100.000 times

COMMIT;

SQL Server (via Management Studio -> Execute) -- keeps memory usage flat

BEGIN TRANSACTION;

INSERT INTO [dbo].sometable ("Name", "Model", "ScanDate") VALUES('somename', 'somemodel', '2011-06-01 14:12:19');
INSERT INTO [dbo].sometable ("Name", "Model", "ScanDate") VALUES('somename', 'somemodel', '2011-06-01 14:12:19');
-- the same INSERT is repeated 100.000 times

COMMIT;

and the PostgreSQL log file (thanks, Sayap!) contains:

2011-06-05 16:06:29 EEST LOG:  duration: 0.000 ms  statement: set client_encoding to 'UNICODE'
2011-06-05 16:06:43 EEST LOG:  duration: 15039.000 ms  statement: BEGIN TRANSACTION;

INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES('somename', 'somemodel', '2011-06-01 14:12:19');
INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES('somename', 'somemodel', '2011-06-01 14:12:19');
-- 99998 lines of the same as above
COMMIT;

Apparently, even after transmitting the whole query to the server as-is, the server cannot optimize it.

ADO.NET driver alternative

As Jordani suggested, I've tried the Npgsql driver instead of dotConnect - with the same (lack of) results. However, the Npgsql source for the .Prepare() method contains these enlightening lines:

planName = m_Connector.NextPlanName();
String portalName = m_Connector.NextPortalName();
parse = new NpgsqlParse(planName, GetParseCommandText(), new Int32[] { });
m_Connector.Parse(parse);

The new content in the log file:

2011-06-05 15:25:26 EEST LOG:  duration: 0.000 ms  statement: BEGIN; SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
2011-06-05 15:25:26 EEST LOG:  duration: 1.000 ms  parse npgsqlplan1: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"
2011-06-05 15:25:26 EEST LOG:  duration: 0.000 ms  bind npgsqlplan1: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"
2011-06-05 15:25:26 EEST DETAIL:  parameters: $1 = 'SomeName', $2 = 'SomeModel', $3 = '2011-06-01'
2011-06-05 15:25:26 EEST LOG:  duration: 1.000 ms  execute npgsqlplan1: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"
2011-06-05 15:25:26 EEST DETAIL:  parameters: $1 = 'SomeName', $2 = 'SomeModel', $3 = '2011-06-01'
2011-06-05 15:25:26 EEST LOG:  duration: 0.000 ms  parse npgsqlplan2: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"
2011-06-05 15:25:26 EEST LOG:  duration: 0.000 ms  bind npgsqlplan2: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"
2011-06-05 15:25:26 EEST DETAIL:  parameters: $1 = 'SomeName', $2 = 'SomeModel', $3 = '2011-06-01'
2011-06-05 15:25:26 EEST LOG:  duration: 0.000 ms  execute npgsqlplan2: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"
2011-06-05 15:25:26 EEST DETAIL:  parameters: $1 = 'SomeName', $2 = 'SomeModel', $3 = '2011-06-01'
2011-06-05 15:25:26 EEST LOG:  duration: 0.000 ms  parse npgsqlplan3: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"

The inefficiency is quite obvious in this log excerpt: every Prepare() produces a brand-new named plan (npgsqlplan1, npgsqlplan2, npgsqlplan3, ...) instead of reusing the previous one.

Conclusions (such as they are)

Frank's note about WAL is another awakening: something else to configure that SQL Server hides away from a typical MS developer.

NHibernate (even in its simplest usage) reuses prepared SqlCommands properly... if only it had been used from the start...

It is obvious that an architectural difference exists between SQL Server and PostgreSQL, and code built specifically for SQL Server (and thus blissfully unaware of the 'unable-to-reuse-identical-sql' possibility) will not work efficiently on PostgreSQL without major refactoring. And refactoring 130+ legacy ActiveRecord classes to reuse prepared SqlCommand objects in a messy multithreaded middleware is not a 'just-replace-dbo-with-public'-type affair.
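
For what it's worth, the least invasive refactoring I can see is a per-connection cache that hands out one prepared PgSqlCommand per distinct SQL text, so the Active Record classes keep building their SQL as before but stop re-preparing it. The sketch below assumes the dotConnect API shown earlier; the PreparedCommandCache class and its method names are hypothetical, not anything provided by the driver.

using System;
using System.Collections.Generic;
using System.Data;
using Devart.Data.PostgreSql;

// Hypothetical helper: caches one prepared PgSqlCommand per distinct SQL text
// on a single connection. Not thread-safe; a multithreaded middleware would
// need one cache per connection/thread or explicit locking.
class PreparedCommandCache
{
    private readonly PgSqlConnection _conn;
    private readonly Dictionary<string, PgSqlCommand> _commands =
        new Dictionary<string, PgSqlCommand>();

    public PreparedCommandCache(PgSqlConnection conn)
    {
        _conn = conn;
    }

    // Returns a command that was prepared exactly once for this SQL text;
    // callers only overwrite parameter values before executing it.
    public PgSqlCommand GetOrPrepare(string sql, PgSqlTransaction tx,
                                     Action<PgSqlCommand> defineParameters)
    {
        PgSqlCommand cmd;
        if (!_commands.TryGetValue(sql, out cmd))
        {
            cmd = _conn.CreateCommand();
            cmd.Transaction = tx;
            cmd.CommandType = CommandType.Text;
            cmd.CommandText = sql;
            defineParameters(cmd);   // add named/typed parameters once
            cmd.Prepare();           // single server-side plan for this SQL
            _commands.Add(sql, cmd);
        }
        cmd.Transaction = tx;        // re-point cached commands at the current transaction
        return cmd;
    }
}

Each Active Record Insert() would then ask the cache for its INSERT command and set the parameter values, instead of creating and preparing a fresh command per row.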

Unfortunately for my overtime, Eevar's answer is correct :)

Thanks to everyone who pitched in!

Pyrethrum answered 4/6, 2011 at 17:46 Comment(9)
Why are you moving from SQL server?Sapient
You might also try your question on dba.stackexchange.comHabitual
Why can't you split it into multiple transactions if no SELECTs, UPDATEs or DELETEs are ever made?Coxa
@svick: 1) TCO 2) ability to run on Linux (security; some of the best dedicated server providers use Linux)Pyrethrum
@letitbee: naturally, the real database is far larger than 2 tables. I stripped it down to two tables with INSERTs only and found that the memory hogging still occurs - so I used this as the problem description. However, in the real DB quite a lot of stuff happens, and splitting into multiple transactions is not acceptable.Pyrethrum
@proglamer: Have you tried INSERTing multiple rows per INSERT statement (say, tens or hundreds)?Supremacist
@ypercube: no, I haven't; I was under the impression that creating (COL_COUNT * ROW_COUNT) SqlParameters for a single SqlCommand is not Good Practice when those SqlParameters climb into the thousands...?Pyrethrum
@Proglamer: how often does the application perform an explicit COMMIT?Leucine
@Proglamer: I think you should check what kind of statements are being run in SQL Server by using SQL Profiler, and then compare those with PostgreSQL by setting log_min_duration_statement to 0. You can't RTFM while blindfolded.Unholy

I suspect you figured it out yourself. You're probably creating 500k different prepared statements, query plans and all. Actually, it's worse than that; prepared statements live outside of transaction boundaries and persist until the connection is closed. Abusing them like this will drain plenty of memory.

If you want to execute a query several times but avoid the planning overhead for each execution, create a single prepared statement and reuse that with new parameters.

If your queries are unique and ad-hoc, just use postgres' normal support for bind variables; no need for the extra overhead from prepared statements.
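
To illustrate that second option, here is a minimal sketch (assuming the question's dotConnect API, with conn and tx taken from the test case above): the command is parameterized, but Prepare() is never called, which is what "just use bind variables" means here.

// conn and tx as in the question's test case.
// Parameterized but never Prepare()d: no named server-side statement is created.
PgSqlCommand cmd = conn.CreateCommand();
cmd.Transaction = tx;
cmd.CommandType = CommandType.Text;
cmd.CommandText = "INSERT INTO public.\"sometable\" (\"Name\", \"Model\", \"ScanDate\") VALUES(@name, @model, @scanDate)";

PgSqlParameter parm = cmd.CreateParameter();
parm.ParameterName = "@name";
parm.Value = "SomeName";
cmd.Parameters.Add(parm);

parm = cmd.CreateParameter();
parm.ParameterName = "@model";
parm.Value = "SomeModel";
cmd.Parameters.Add(parm);

parm = cmd.CreateParameter();
parm.ParameterName = "@scanDate";
parm.PgSqlType = PgSqlType.Date;
parm.Value = new DateTime(2011, 6, 1, 14, 12, 13);
cmd.Parameters.Add(parm);

cmd.ExecuteNonQuery();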

Proviso answered 5/6, 2011 at 0:4 Comment(1)
@Proglamer: If you are doing bulk inserts, consider using PgSql COPY command or the PgSqlLoader construct.Caren

Reducing work_mem and shared_buffers is not a good idea; databases (including PostgreSQL) love RAM.

But this might not be your biggest problem. What about the WAL settings? wal_buffers should be large enough to hold the entire transaction, all 500k INSERTs. What is the current setting? And what about checkpoint_segments?

500k INSERTs should not be a problem; PostgreSQL can handle this without memory problems.

http://www.postgresql.org/docs/current/interactive/runtime-config-wal.html

Asylum answered 4/6, 2011 at 18:59 Comment(1)
The docs say "The setting need only be large enough to hold the amount of WAL data generated by one typical transaction" (emphasis on 'typical'). A larger transaction will just write more WAL to disk, so that is not a part of this problem at all.Mackenzie

  1. I agree completely with Frank.

  2. "prepared PgSqlCommand instance sharing is not elegant."

Why? Is it not possible to have this outside the loop:

    // create the command and its parameters once, before the loop
    cmd = conn.CreateCommand();
    parm1 = cmd.CreateParameter();
    parm1.ParameterName = "@name";
    parm2 = cmd.CreateParameter();
    parm2.ParameterName = "@model";
    parm3 = cmd.CreateParameter();
    parm3.ParameterName = "@scanDate";
    // inside the loop, only assign parm1.Value, parm2.Value and parm3.Value

Also, I have found this in MSDN:

// NOTE:
// For optimal performance, make sure you always set the parameter
// type and the maximum size - this is especially important for non-fixed
// types such as NVARCHAR or NTEXT;
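
Applied to the question's code, that note would look roughly like this (a sketch; cmd is created as in the test case, and PgSqlType.VarChar is assumed to exist in dotConnect's enum - the original code only set a type on the date parameter):

// Give the provider the parameter type and maximum size up front,
// per the MSDN note above ("Name" is character varying(255) in the DDL).
PgSqlParameter parm = cmd.CreateParameter();
parm.ParameterName = "@name";
parm.PgSqlType = PgSqlType.VarChar;   // assumed enum member
parm.Size = 255;
parm.Value = "SomeName";
cmd.Parameters.Add(parm);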

If dotConnect is not working the way the SQL Server provider does, that is not good (is it the latest version / has the bug been fixed?). Can you use another provider?

You have to check what "is eating" the memory - the DB server or the provider. You can also test PostgreSQL by generating a SQL script and running it with psql.exe.

Enamor answered 5/6, 2011 at 0:1 Comment(0)
