Node is 20x faster Than .NET Core in Connecting to Postgres
Asked Answered
T

3

8

I have two servers connecting to a PostgresSQL 9.6 db hosted on Azure. The servers are doing one thing - hitting the Postgres db with a SELECT 1 query every 5 seconds.

Typical time to connect to db and get data:

  • Node: 25 MS
  • .NET Core 3.1 using Npsql 4.1.1(I have tried 4.1.2 too, no diff): 500 MS

My problem is that my .NET Core app is 20x slower than Node in getting data. I believe .NET Core is not pooling connections for some reason. This slowness occurs with both running the app locally and while running it on Azure App Services - no difference. I want to solve the .NET --> Postgres slowness.

Please only skim the relevant details and don't read the whole thing past this point - I believe only the .NET Core code is relevant.

A PsPing to the db from my machine (on which both the Node and the .NET Core apps are running:

Connecting to foobarPostGres:5432 (warmup): from someIp: 19.98ms
Connecting to foobarPostGres:5432: from someIp: 1.65ms
Connecting to foobarPostGres:5432 from someIp: 1.18ms
Connecting to foobarPostGres:5432: from someIp: 1.23ms
Connecting to foobarPostGres:5432: from someIp: 1.06ms

For sake of completeness, a sample of NODE times look like this (note that the first time it establishes a connection, it is also "slow"):

Attempting to establish a connection...
Elapsed ms:  644.1334999799728
RESP:  { '?column?': 1 }
Elapsed ms:  22.76109904050827
RESP:  { '?column?': 1 }
Elapsed ms:  21.984400033950806
RESP:  { '?column?': 1 }
Elapsed ms:  26.043799996376038
RESP:  { '?column?': 1 }
Elapsed ms:  22.538798987865448
RESP:  { '?column?': 1 }

Connection times for .NET Core look like this:

5:13:32 PM: SLOW QUERY, CONN TIME: 4153, QUERY TIME: 18 
5:13:53 PM: SLOW QUERY, CONN TIME: 707, QUERY TIME: 17 
5:14:14 PM: SLOW QUERY, CONN TIME: 589, QUERY TIME: 16
5:14:35 PM: SLOW QUERY, CONN TIME: 663, QUERY TIME: 18 
5:14:56 PM: SLOW QUERY, CONN TIME: 705, QUERY TIME: 16 

Note the super-slow initial connection time and a long time to establish a connection on subsequent requests.

Anyway, because I am desperate, I am going to dump all my code now, with explanations. The connection string looks like this:

public static string CONNECTION_STRING {
  get {
    return $"Server={HOST}; User Id={USER}; Database={DB_NAME}; Port={PORT}; Password={PWD}; SSLMode=Prefer";
  }
}

It is my understanding that I should get connection pooling out of the box if I use this connection string. Note that I have tried turning of SSL on both the db and taking that line out - it did not help.

My health check controller looks like this:

// GET api/health/getdbhealthselectone
[HttpGet]
[Route("getdbhealthselectone")]
public async Task<IActionResult> GetDbHealthSelectOne()
{
  int testData = await _healthCheckRepo.RunHealthCheckSelectOne();
  return Ok(testData);
}

My health check repo method looks like this:

 public async Task<int> RunHealthCheckSelectOne()
    {

      await using var conn = new NpgsqlConnection(AzureDbConnectionInfo.CONNECTION_STRING);

      var connTimer = System.Diagnostics.Stopwatch.StartNew(); // TODO: Remove this testing line
      await conn.OpenAsync();
      connTimer.Stop(); // TODO: Remove this testing line
      var msToConnect = connTimer.ElapsedMilliseconds; // TODO: Remove this testing line

      int testData = 999;
      var jobsQueryTimer = System.Diagnostics.Stopwatch.StartNew(); // TODO: Remove this testing line0
      await using (var cmd = new NpgsqlCommand("SELECT 1", conn))
      await using (var reader = await cmd.ExecuteReaderAsync())
      while (await reader.ReadAsync()) {
        testData = reader.GetInt32(0);
      };

      jobsQueryTimer.Stop(); // TODO: Remove this testing line
      var msToQuery = jobsQueryTimer.ElapsedMilliseconds; // TODO: Remove this testing line

      LogQueryIfSlow(msToConnect, msToQuery, _logger); // TODO: Remove this testing line

      return testData;
    }

Note the timers here - await conn.OpenAsync(); is what takes the bulk of the time by far, the queries themselves are fast. Also, for sake of saving time - I have run this code WITHOUT async before, no difference.

Finally, in case there are dependency injection concerns, the repository is in a class library, the API project references it, and:

services.AddSingleton<IHealthCheckRepository, HealthCheckRepository>();

This is how it sees it.

I believe this is all the relevant information - I have been on the phone with Azure support and they found no issues with the db config. The .NET Core app is super light, so it's not like it's overloaded and it's in testing, so no traffic besides my tests.

Extra: For the sake of completeness, here is my WHOLE node app which hits the db and gets the performance posted (conn data taken out).

const { Pool, Client } = require('pg');
const { performance } = require('perf_hooks');

const pool = new Pool({
  user: 'SECRET',
  host: 'SECRET',
  database: 'SECRET',
  password: 'SECRET',
  port: 5432,
})


function runQuery(pool) {
  var t0 = performance.now();
  pool.query('SELECT 1', (err, res) => {
    if (err) {
      console.log('ERROR: ', err.stack)
    } else {
      console.log('RESP: ', res.rows[0])
    }
    var t1 = performance.now();
    console.log('Elapsed ms: ', t1-t0);
    //pool.end()
});

}

setInterval(() => {runQuery(pool)}, 5000);

EDIT: For posterity, here are the times in .NET Core after fixing the connection pool timeout - it's faster than node, except on that initial connection, which seems to take a while, but I haven't checked some defaults:

CONN: 1710 QUERY: 18
CONN: 0 QUERY: 16
CONN: 0 QUERY: 16
CONN: 0 QUERY: 17
CONN: 0 QUERY: 16
CONN: 0 QUERY: 23
CONN: 0 QUERY: 16
CONN: 0 QUERY: 16
CONN: 0 QUERY: 23
CONN: 0 QUERY: 16
CONN: 0 QUERY: 16
Tourer answered 5/1, 2020 at 19:4 Comment(6)
1) Node is holding Pool, re-test Node code with pool in the function or modify NpgsqlConnection to have a similar implementation they are far from like for like, i dont know if NpgsqlConnection is any gd, use more of an ADO type for a base line. also ensure the constructs are like for like.Trottier
aka Node is making a connection 1's ... method RunHealthCheckSelectOne is creating a new connection on every call.Trottier
Try setting a min pool size. One call every 5 seconds is not much, and the pool might decide to close the unused connection.Cassandracassandre
@Cassandracassandre This was it: Minimum Pool Size=5. I don't understand why it closed the connections in the pool before the 300 seconds were up, but it did. This takes my connection times to 0, without intermittent issues. Please write up the answer so I can award you the bounty. If you want to explain the pooling section of the docs here npgsql.org/doc/connection-string-parameters.html that would be cool, or just make an answer saying "min pool size" =) Thank you again, minor but I wouldn't have solved it without you.Tourer
@Tourer just for kicks could you update with the new timings.... after you made the changes.... as i want to see if it now flips in favor of c#, as i would think it would.Trottier
@Trottier Yep, added to bottom of post. Looks to be about 25% faster on everything but the first request.Tourer
C
11

You need to set a min pool size. Doing so ensures that this amount of connections remains open to the DB regardless of the pool usage.

By default (at least for NPGSQL), the min size is 0, so if the connection is not used for a while, it will be closed.

In your test, you do one call every 5 seconds, which is not much, and the pool might decide to close the unused connection. According to the doc it should keep it open for 300 seconds though, not just 15

Cassandracassandre answered 23/1, 2020 at 15:40 Comment(3)
Will come back to award bounty when it's available.Tourer
after reading again... this is what my comment was leading to... that a new connection was being made each time, where as Node was reusing the Pool connection.. hense my comments. But this is good, as i would of written some "fancy" bloat code to achieve the same thing..so thanks for this share.Trottier
@Trottier Thanks for the advice - you helped a lot as well. I feel like the docs could be a bit clearer on this tbh, even after reading them in retrospect.Tourer
A
3

The first call is almost exactly 5 seconds longer than the rest. This looks like an IP address resolution issue to me. It first picks a method which is defective for the given server, then after 5 seconds it times-out and picks a different method, which works. Then it is cached for a while, and so continues to work well until the cached entry expires.

To see if this is the problem, hardcode the IP address for the database host into your "hosts" file, and see if that fixes the problem. If so, then the root cause becomes a question for your network engineers.

On the database side, you can turn on slow query logging, either log_min_duration_statement or better yet auto_explain.log_min_duration. But if my theory is correct, this won't show anything. The database doesn't know how long you spent trying to look up its IP address.

Agnesse answered 5/1, 2020 at 19:32 Comment(4)
I assume your answer is assuming the 5 second number to be concrete. It's not, sometimes it takes 2 or 3 seconds. Or 6. 5 just seems to be the average for the first call, but there is no concrete number. Will try the logs, ty.Tourer
When I've seen the DNS problem, it was always 5.00 seconds plus one iteration of a normal wait time. Since that is what your log entry shows, it made me suspicious that this is your problem as well. But it might show up with less time, if for example two queries show up in quick succession so once gets to piggyback on the other's resolution request. And if you have an intermediate server (app server or connection pooler) between your computer and the database, the resolution problem could be occurring in either of those places, so you might need to edit two hosts files to be sure.Agnesse
Cool, thanks. I have turned on logs like you suggested, and magically, the call is taking just under the log time I specified (1500ms). Will add logs as soon as I get a logged query.Tourer
Note - this is late, but since I can hit the db perfectly fast from node, I am guessing it's not a DNS issue. Ty again for the idea though.Tourer
P
2

It is possible that the first time the query needs to bring a lot of data from disk to memory, and the subsequent executions find everything already in the shared buffers. You can know this by running

EXPLAIN (ANALYZE, BUFFERS) <your query>

The amount of 'read' and 'hit' will tell you how much as been read from disk, and how much has been hit in RAM.

Pentahedron answered 6/1, 2020 at 8:6 Comment(2)
I added the ANALYZE results. Please let me know if that tells you anything in terms of a possible solution or what I should be looking at. I don't really understand what shared buffers are - memory constantly allocated to the cache?Tourer
Is it possible to also provide the first 28 lines of the output of EXPLAIN (ANALYZE, BUFFERS)? there is no difference on the lines posted. Also, if you can check the values that follows the keyword "actual" you could spot where is the query taking more time. Regarding the shared_buffers, yes, that's works as a cache.Pentahedron

© 2022 - 2024 — McMap. All rights reserved.