Contention problems in Google App Engine
I'm having contention problems in Google App Engine and am trying to understand what's going on.

I have a request handler annotated with:

@ndb.transactional(xg=True, retries=5) 

..and in that code I fetch some entities, update some others, etc. But sometimes an error like this one appears in the log during a request:

16:06:20.930 suspended generator _get_tasklet(context.py:329) raised TransactionFailedError(too much contention on these datastore entities. please try again. entity group key: app: "s~my-appname"
path <
  Element {
    type: "PlayerGameStates"
    name: "hannes2"
  }
>
)
16:06:20.930 suspended generator get(context.py:744) raised TransactionFailedError(too much contention on these datastore entities. please try again. entity group key: app: "s~my-appname"
  path <
    Element {
      type: "PlayerGameStates"
      name: "hannes2"
    }
  >
  )
16:06:20.930 suspended generator get(context.py:744) raised TransactionFailedError(too much contention on these datastore entities. please try again. entity group key: app: "s~my-appname"
  path <
    Element {
      type: "PlayerGameStates"
      name: "hannes2"
    }
  >
  )
16:06:20.936 suspended generator transaction(context.py:1004) raised TransactionFailedError(too much contention on these datastore entities. please try again. entity group key: app: "s~my-appname"
  path <
    Element {
      type: "PlayerGameStates"
      name: "hannes2"
    }
  >
  )

..followed by a stack trace. I can update with the whole stack trace if needed, but it's kind of long.

I don't understand why this happens. Looking at the line in my code where the exception is raised, I run get_by_id on a totally different entity (Round). The "PlayerGameStates" entity named "hannes2" that is mentioned in the error messages is the parent of another entity, GameState, which was fetched with get_by_id_async a few lines earlier:

# GameState is read asynchronously with get_by_id_async
gamestate_future = GameState.get_by_id_async(id, parent=ndb.Key('PlayerGameStates', player_key))
...
gamestate = gamestate_future.get_result()
...

The weird(?) thing is, there are no writes to the datastore occurring for that entity. My understanding is that contention errors can occur if the same entity is updated at the same time, in parallel, or maybe if too many writes occur in a short period of time.

But can it happen when reading entities as well? ("suspended generator get.."??) And is this happening after the 5 ndb.transactional retries? I can't see anything in the log that indicates that any retries have been made.

Any help is greatly appreciated.

Incipient answered 3/10, 2015 at 20:18 Comment(4)
I would look at your key structure. Contention isn't just at the entity level; you need to examine the parents as well. Look at the scope of your entity groups to understand why you are having contention. - Sproul
Thanks. I have tried to keep entity groups small and to avoid contention, but I'll examine more. In this case, contention has occurred in the entity group with parent ndb.Key("PlayerGameStates", "hannes2"), right? I still don't understand why reading from it triggers an exception/contention. Where can I read more about this? - Incipient
@TimHoffman Ok, just found this under the cross-group transactions documentation: "Note: The first read of an entity group in an XG transaction may throw a TransactionFailedError exception if there is a conflict with other transactions accessing that same entity group. This means that even an XG transaction that performs only reads can fail with a concurrency exception." - Incipient
Try to architect your app to use task queues (you can enqueue tasks transactionally) to update root entities and/or use sharding. In most cases it's possible to have a solution with independent entities and update aggregates/roots/neighbors via task queues. Don't start a transaction until you modify data. A good pattern is to first read the entity, check whether it needs to be modified, and only then start a transaction. - Giron
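The transactionally-enqueued-task pattern mentioned in the comment above can be sketched like this (a non-authoritative sketch using the legacy GAE Python APIs; the Aggregate model, counter field, and task URL are made-up illustrations):

```python
from google.appengine.api import taskqueue
from google.appengine.ext import ndb

class Aggregate(ndb.Model):  # hypothetical model
    counter = ndb.IntegerProperty(default=0)

@ndb.transactional()
def bump_and_fanout(key):
    ent = key.get()
    ent.counter += 1
    ent.put()
    # With transactional=True, the task is only enqueued if this
    # transaction commits; the task handler can then update other,
    # independent entities outside this entity group.
    taskqueue.add(url='/tasks/update-neighbors',
                  params={'key': key.urlsafe()},
                  transactional=True)
```

This keeps the transaction itself small while pushing the aggregate/neighbor updates out of the contended entity group.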

Yes, contention can happen for both read and write ops.

After a transaction starts - in your case when the handler annotated with @ndb.transactional() is invoked - any entity group accessed (by read or write ops, it doesn't matter) is immediately marked as such. At that moment it is not known whether there will be a write op by the end of the transaction - it doesn't even matter.

The too much contention error (which is different than a conflict error!) indicates that too many parallel transactions simultaneously try to access the same entity group. It can happen even if none of the transactions actually attempts to write!
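As a sketch of this (legacy Python ndb on GAE; the model and key names are taken from the question, the function itself is hypothetical), even a transaction that performs only reads can hit this error under parallel load:

```python
from google.appengine.ext import ndb
from google.appengine.api.datastore_errors import TransactionFailedError

class GameState(ndb.Model):  # minimal stand-in for the question's model
    data = ndb.JsonProperty()

@ndb.transactional(xg=True, retries=5)
def read_only_txn(game_id, player_key):
    # No writes happen here, but the entity group rooted at
    # ndb.Key('PlayerGameStates', player_key) is marked as accessed
    # as soon as this get runs - so it can still contend.
    key = ndb.Key('PlayerGameStates', player_key, 'GameState', game_id)
    return key.get()

try:
    state = read_only_txn('round-1', 'hannes2')
except TransactionFailedError:
    # Raised when too many parallel transactions touch the same
    # entity group, even though this transaction never writes.
    state = None
```

Note this only reproduces on the real datastore, not on the development server.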

Note: this contention is NOT emulated by the development server, it can only be seen when deployed on GAE, with the real datastore!

What can add to the confusion is the automatic retrying of transactions, which can happen after both actual write conflicts and plain access contention. These retries may appear to the end user as suspicious repeated execution of some code paths - the handler, in your case.

Retries can actually make matters worse (for a brief time) by throwing even more accesses at the already heavily accessed entity groups. I've seen patterns where transactions only started working after the exponential backoff delays grew big enough (given a large enough retry count) to let things cool off a bit, by allowing the transactions already in progress to complete.
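The effect of exponential backoff can be illustrated with a small, self-contained helper (this is not GAE's internal implementation; the base delay and cap are arbitrary assumptions):

```python
import random

def backoff_delay(attempt, base=0.1, cap=30.0, jitter=True):
    """Delay in seconds before retry number `attempt` (0-based).

    Doubles on each attempt, capped at `cap`, with optional full
    jitter so competing transactions don't all retry at the same
    instant and pile onto the entity group again.
    """
    delay = min(cap, base * (2 ** attempt))
    return random.uniform(0, delay) if jitter else delay

# Deterministic delays grow quickly: 0.1s, 0.8s, then capped at 30s.
print([backoff_delay(a, jitter=False) for a in (0, 3, 10)])
```

Only once these delays grow large enough do in-flight transactions get room to finish.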

My approach to this was to move most of the transactional work onto push queue tasks, disable retries at the transaction and task level, and instead re-queue the task entirely - fewer retries, but spaced further apart.
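A sketch of that pattern (assuming the legacy GAE webapp2/taskqueue APIs; the /tasks/update-game URL, the 15-second countdown, and the handler names are illustrative, not prescriptive):

```python
import webapp2
from google.appengine.api import taskqueue
from google.appengine.api.datastore_errors import TransactionFailedError
from google.appengine.ext import ndb

class UpdateGameTask(webapp2.RequestHandler):
    def post(self):
        player_key = self.request.get('player')
        try:
            self._update(player_key)
        except TransactionFailedError:
            # Instead of letting ndb or the task queue retry immediately,
            # re-enqueue the whole task with a delay of our own choosing.
            taskqueue.add(
                url='/tasks/update-game',
                params={'player': player_key},
                countdown=15,  # illustrative delay; tune for your load
                retry_options=taskqueue.TaskRetryOptions(task_retry_limit=0),
            )

    @ndb.transactional(xg=True, retries=0)  # no automatic ndb retries
    def _update(self, player_key):
        root = ndb.Key('PlayerGameStates', player_key)
        state = root.get()
        # ... modify and put() entities in this group ...
```

The transaction runs exactly once per task attempt, and the spacing between attempts is controlled by the countdown rather than by ndb's built-in backoff.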

Usually when you run into such problems you have to revisit your data structures and/or the way you're accessing them (your transactions). In addition to solutions that maintain strong consistency (which can be quite expensive), you may want to re-check whether consistency is actually a must. In some cases it's added as a blanket requirement just because it appears to simplify things. From my experience it doesn't :)

Another thing that can help (but only a bit) is using a faster (also more expensive) instance type - shorter execution times translate into a slightly lower risk of transactions overlapping. I noticed this when I needed an instance with more memory, which happened to also be faster :)

Kalikalian answered 3/8, 2017 at 23:24 Comment(0)
