Akka .NET Connection Pool Timeout Issues

We are creating a new system using Akka.NET and have a basic cluster setup with sharding and persistence.

We've used the official documentation as well as some Petabridge blog posts to get sharding working correctly. However, we've hit a problem where the shards are exceeding the maximum number of connections allowed by the SQL Server connection pool, and we're now seeing the following message:

2017-04-20 14:04:31.433 +01:00 [Warning] "Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached."

We believe this is happening when the shards are updating the sharding journal.

Why doesn't the sharding module manage its SQL connections so that the pool isn't exhausted? Or is there a configuration issue on our side?

Is it possible to get it to retry when this kind of error occurs? As it stands, we lose a message every time it happens.

Here's the relevant HOCON:

cluster.sharding {
    journal-plugin-id = "akka.persistence.journal.sharding"
    snapshot-plugin-id = "akka.persistence.snapshot-store.sharding"
}
persistence {
    journal {
        plugin = "akka.persistence.journal.sql-server"
        sql-server {
            class = "Akka.Persistence.SqlServer.Journal.SqlServerJournal, Akka.Persistence.SqlServer"
            connection-string = "Server=.;Database=akkasystem;Integrated Security=true"
            schema-name = dbo
            auto-initialize = on
        }
        # a separate config used by cluster sharding only 
        sharding {
            connection-string = "Server=.;Database=akkasystem;Integrated Security=true"
            auto-initialize = on
            plugin-dispatcher = "akka.actor.default-dispatcher"
            class = "Akka.Persistence.SqlServer.Journal.SqlServerJournal, Akka.Persistence.SqlServer"
            connection-timeout = 30s
            schema-name = dbo
            table-name = ShardingJournal
            timestamp-provider = "Akka.Persistence.Sql.Common.Journal.DefaultTimestampProvider, Akka.Persistence.Sql.Common"
            metadata-table-name = ShardingMetadata
        }
    }
    snapshot-store {
        sharding {
            class = "Akka.Persistence.SqlServer.Snapshot.SqlServerSnapshotStore, Akka.Persistence.SqlServer"
            plugin-dispatcher = "akka.actor.default-dispatcher"
            connection-string = "Server=.;Database=akkasystem;Integrated Security=true"
            connection-timeout = 30s
            schema-name = dbo
            table-name = ShardingSnapshotStore
            auto-initialize = on
        }
    }
}
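
For what it's worth, the connection strings above rely on the SqlClient defaults, which cap the pool at 100 connections. Presumably we could raise that cap per connection string (the value below is illustrative only, we haven't settled on a number):

    connection-string = "Server=.;Database=akkasystem;Integrated Security=true;Max Pool Size=200"

but that feels like treating the symptom rather than the cause.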
Cryometer answered 20/4, 2017 at 13:39

It may be a sign that the SQL journal is being flooded with incoming events so heavily that the connection timeout fires while an event is still waiting for a connection from the pool to be freed up.

From your config I suspect this could happen if you start persisting events and creating shards/entities at a high rate. The existing SQL journals are going to get a significant speed boost in the near future (see the batching journals work). Hopefully this will help solve your problem.
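As a rough sketch of what that switch might look like once the batching journal ships (the class name and option names here are a guess and may differ in the released package), the sharding journal section could become something like:

sharding {
    # hypothetical batching variant of the SQL Server journal
    class = "Akka.Persistence.SqlServer.Journal.BatchingSqlServerJournal, Akka.Persistence.SqlServer"
    connection-string = "Server=.;Database=akkasystem;Integrated Security=true"
    schema-name = dbo
    table-name = ShardingJournal
    auto-initialize = on
    # upper bound on concurrent SQL connections used by this journal
    max-concurrent-operations = 64
    # how many pending writes get folded into a single SQL round-trip
    max-batch-size = 100
}

The idea is that many persist calls share a bounded set of connections and batched round-trips, instead of each one competing for its own pooled connection.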

Kakemono answered 21/4, 2017 at 7:46
The main concern we have at the moment is that we're effectively getting at-most-once delivery for these messages. Whenever a SQL timeout occurs, the action that should have been performed is lost. We were hoping that by using Akka.Persistence we would get at-least-once delivery, but that doesn't seem to be the case so far. Could you advise on how we might achieve this, so that actions are retried and messages aren't lost when a timeout occurs? – Cryometer
This snippet is more or less an example of how to build a proxy actor that works as an at-least-once delivery gateway. Remember, though, that it will constrain your actors to handling incoming messages in an idempotent fashion. Also, if you really want at-least-once delivery semantics, it's often a good idea to simply put a queue (e.g. RabbitMQ or even Kafka) in front of your communication chain and, on failure, restart processing of the unhandled message. – Kakemono
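
The snippet referenced in the comment isn't reproduced here, but the general shape of such a gateway, sketched with Akka.Persistence's AtLeastOnceDeliveryReceiveActor (the Work, WorkConfirmed and Envelope types below are invented for illustration, and the destination would typically be the shard region), is roughly:

using Akka.Actor;
using Akka.Persistence;

// Illustrative message types (not from the original post).
public sealed class Work
{
    public Work(string payload) { Payload = payload; }
    public string Payload { get; }
}

public sealed class WorkConfirmed
{
    public WorkConfirmed(long deliveryId) { DeliveryId = deliveryId; }
    public long DeliveryId { get; }
}

public sealed class Envelope
{
    public Envelope(long deliveryId, Work work) { DeliveryId = deliveryId; Work = work; }
    public long DeliveryId { get; }
    public Work Work { get; }
}

// Persists every outgoing message and keeps redelivering it to the destination
// until the destination replies with a confirmation carrying the delivery id.
public class DeliveryProxy : AtLeastOnceDeliveryReceiveActor
{
    private readonly ActorPath _destination;

    public override string PersistenceId => "delivery-proxy";

    public DeliveryProxy(ActorPath destination)
    {
        _destination = destination;

        Command<Work>(work =>
            Persist(work, w =>
                Deliver(_destination, deliveryId => new Envelope(deliveryId, w))));

        Command<WorkConfirmed>(confirmation =>
            Persist(confirmation, c => ConfirmDelivery(c.DeliveryId)));

        // Replay re-creates the unconfirmed deliveries after a crash or restart.
        Recover<Work>(w => Deliver(_destination, deliveryId => new Envelope(deliveryId, w)));
        Recover<WorkConfirmed>(c => ConfirmDelivery(c.DeliveryId));
    }
}

The receiving actor would unwrap the Envelope, perform the work idempotently, and reply with a WorkConfirmed carrying the same delivery id; only then does the proxy stop redelivering.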
