How to speed up a query with PartitionKey in Azure Table Storage

How do we increase the speed of this query?

We have approximately 100 consumers executing the following query within a span of 1-2 minutes. Each of these runs represents one execution of a consumption function.

    TableQuery<T> translationsQuery = new TableQuery<T>()
        .Where(
            TableQuery.CombineFilters(
                TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, sourceDestinationPartitionKey),
                TableOperators.Or,
                TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, anySourceDestinationPartitionKey)
            )
        );

This query will yield approximately 5000 results.

Full code:

    public static async Task<IEnumerable<T>> ExecuteQueryAsync<T>(this CloudTable table, TableQuery<T> query) where T : ITableEntity, new()
    {
        var items = new List<T>();
        TableContinuationToken token = null;

        do
        {
            TableQuerySegment<T> seg = await table.ExecuteQuerySegmentedAsync(query, token);
            token = seg.ContinuationToken;
            items.AddRange(seg);
        } while (token != null);

        return items;
    }

    public static IEnumerable<Translation> Get<T>(string sourceParty, string destinationParty, string wildcardSourceParty, string tableName) where T : ITableEntity, new()
    {
        var acc = CloudStorageAccount.Parse(Environment.GetEnvironmentVariable("conn"));
        var tableClient = acc.CreateCloudTableClient();
        var table = tableClient.GetTableReference(Environment.GetEnvironmentVariable("TableCache"));
        var sourceDestinationPartitionKey = $"{sourceParty.ToLowerTrim()}-{destinationParty.ToLowerTrim()}";
        var anySourceDestinationPartitionKey = $"{wildcardSourceParty}-{destinationParty.ToLowerTrim()}";

        TableQuery<T> translationsQuery = new TableQuery<T>()
            .Where(
                TableQuery.CombineFilters(
                    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, sourceDestinationPartitionKey),
                    TableOperators.Or,
                    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, anySourceDestinationPartitionKey)
                )
            );

        var over1000Results = table.ExecuteQueryAsync(translationsQuery).Result.Cast<Translation>();
        return over1000Results.Where(x => x.expireAt > DateTime.Now)
                              .Where(x => x.effectiveAt < DateTime.Now);
    }

During these executions, when there are 100 consumers, the requests cluster and form spikes, as shown below:

[screenshot: clustered request spikes]

During these spikes, the requests often take over 1 minute:

[screenshot: request durations exceeding one minute]

How do we increase the speed of this query?

Thermoluminescent answered 4/11, 2019 at 21:54 Comment(7)
5000 results seems like you are not filtering nearly enough in the query. Just transferring 5000 results to the code will cost a ton of network time. Never mind that you are still going to do filtering afterwards. | Always do as much filtering and processing in the query as possible, ideally on rows that have an index and/or are the result of a computed view.Sarmatia
Are those "Translation" objects big? Why don't you fetch just some of the properties instead of getting the whole set of entities?Crinum
@HirasawaYui no they are smallThermoluminescent
You should do more filtering; pulling 5000 results seems meaningless. It's impossible to tell without knowing your data, but I'd say you'd need to figure out a way to partition it in a more meaningful fashion, or introduce some sort of filtering in the queryIngraham
How many different partitions are there?Hindquarter
@pe a few hundredThermoluminescent
Caching? Perhaps that could be one wayKeilakeily

There are three things you can consider:

1. First of all, get rid of the Where clauses that you apply to the query result. It's better to push as many conditions as possible into the query itself (and if you have any indexed fields, include them too). For now, you can change your query as below:

    var translationsQuery = new TableQuery<T>()
        .Where(TableQuery.CombineFilters(
            TableQuery.CombineFilters(
                TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, sourceDestinationPartitionKey),
                TableOperators.Or,
                TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, anySourceDestinationPartitionKey)),
            TableOperators.And,
            TableQuery.CombineFilters(
                TableQuery.GenerateFilterConditionForDate("effectiveAt", QueryComparisons.LessThan, DateTime.Now),
                TableOperators.And,
                TableQuery.GenerateFilterConditionForDate("expireAt", QueryComparisons.GreaterThan, DateTime.Now))
        ));

Because you have a large amount of data to retrieve, it's better to run your queries in parallel. So you should replace the do/while loop inside the ExecuteQueryAsync method with the Parallel.ForEach below, which I wrote based on Stephen Toub's Parallel.While; this reduces the query execution time. It's a good choice because you can drop .Result when you call this method, but it has a small limitation that I'll talk about after this piece of code:

    public static IEnumerable<T> ExecuteQueryAsync<T>(this CloudTable table, TableQuery<T> query) where T : ITableEntity, new()
    {
        var items = new List<T>();
        TableContinuationToken token = null;

        Parallel.ForEach(new InfinitePartitioner(), (ignored, loopState) =>
        {
            TableQuerySegment<T> seg = table.ExecuteQuerySegmented(query, token);
            token = seg.ContinuationToken;
            items.AddRange(seg);

            if (token == null) // It's better to change this constraint by looking at https://www.vivien-chevallier.com/Articles/executing-an-async-query-with-azure-table-storage-and-retrieve-all-the-results-in-a-single-operation
                loopState.Stop();
        });

        return items;
    }

And then you can call it in your Get method:

    return table.ExecuteQueryAsync(translationsQuery).Cast<Translation>();

As you can see, the method itself is not async (you should change its name), and Parallel.ForEach is not compatible with passing in an async method. This is why I've used ExecuteQuerySegmented instead. But to make it more performant and get all the benefits of asynchronous methods, you can replace the above ForEach loop with an ActionBlock from TPL Dataflow, or with the ParallelForEachAsync extension method from the AsyncEnumerator NuGet package.
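
For illustration, here is a minimal sketch of the Dataflow variant; the method name is hypothetical, and it assumes the System.Threading.Tasks.Dataflow package. The segment fetches stay sequential, because each continuation token comes from the previous segment, but entity processing overlaps with the next network call:

    // Sketch only. Requires System.Threading.Tasks.Dataflow and System.Collections.Concurrent.
    public static async Task<IReadOnlyCollection<T>> ExecuteQueryDataflowAsync<T>(
        this CloudTable table, TableQuery<T> query) where T : ITableEntity, new()
    {
        var items = new ConcurrentBag<T>();
        var processor = new ActionBlock<TableQuerySegment<T>>(
            seg => { foreach (var entity in seg) items.Add(entity); },
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount });

        TableContinuationToken token = null;
        do
        {
            var seg = await table.ExecuteQuerySegmentedAsync(query, token);
            token = seg.ContinuationToken;
            processor.Post(seg); // this segment is processed while we await the next one
        } while (token != null);

        processor.Complete();
        await processor.Completion;
        return items;
    }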

2. It's a good choice to execute independent parallel queries and then merge the results, even if the performance improvement is at most 10 percent; a rough sketch follows below. This gives you room to find the most performance-friendly query. But never forget to include all your constraints in it, and test both ways to know which one better suits your problem.
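
As a rough sketch of that merge step (not from the original answer), assuming two single-PartitionKey queries named q1 and q2 (hypothetical) and the Task-returning ExecuteQueryAsync extension from the question, inside an async method:

    // Run the independent queries concurrently and merge their results.
    var results = await Task.WhenAll(
        table.ExecuteQueryAsync(q1),
        table.ExecuteQueryAsync(q2));

    IEnumerable<T> merged = results.SelectMany(r => r);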

3. I'm not sure whether this is a good suggestion or not, but try it and see the results. As described on MSDN:

The Table service enforces server timeouts as follows:

  • Query operations: During the timeout interval, a query may execute for up to a maximum of five seconds. If the query does not complete within the five-second interval, the response includes continuation tokens for retrieving remaining items on a subsequent request. See Query Timeout and Pagination for more information.

  • Insert, update, and delete operations: The maximum timeout interval is 30 seconds. Thirty seconds is also the default interval for all insert, update, and delete operations.

If you specify a timeout that is less than the service's default timeout, your timeout interval will be used.

So you can play with the timeout and check whether there are any performance improvements.
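
For example, a minimal illustration of overriding the timeouts per request with TableRequestOptions; the values below are only examples, and the call assumes an async context:

    // Sketch: shorten the server-side timeout for each segmented request.
    var options = new TableRequestOptions
    {
        ServerTimeout = TimeSpan.FromSeconds(5),         // per-request timeout enforced by the service
        MaximumExecutionTime = TimeSpan.FromSeconds(30)  // client-side cap across retries
    };

    TableQuerySegment<T> seg =
        await table.ExecuteQuerySegmentedAsync(query, token, options, operationContext: null);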

UPDATE 06-30-2021

Thanks to @WouterVanRanst for looking closely at the snippet above, I decided to update it and use another overload of the Parallel.ForEach method with partition-local values, to keep segment handling local to each partition and prevent the race condition on the TableContinuationToken. You can find a description of partition-local variables, with an example, on MSDN. Here's the new look of the ExecuteQueryAsync<T> method:

    public static IEnumerable<T> ExecuteQueryAsync<T>(this CloudTable table, TableQuery<T> query) where T : ITableEntity, new()
    {
        TableContinuationToken token = null;
        var items = new List<T>();

        Parallel.ForEach(new InfinitePartitioner(),
            // Partition-local accumulator, so every segment fetched by this
            // partition is kept, not just the last one.
            () => new List<T>(),
            (ignored, loopState, local) =>
            {
                TableQuerySegment<T> segment = table.ExecuteQuerySegmented(query, token);
                token = segment.ContinuationToken;
                local.AddRange(segment);

                if (token == null)
                    loopState.Stop();

                return local;
            },
            // Merge each partition's accumulated results exactly once.
            local => { lock (items) items.AddRange(local); });

        return items;
    }

NOTE: Of course you can polish the code above or find a better approach to prevent the race condition, but this is a simple one written in no time. I'll be glad to hear your thoughts on it.

Spillar answered 12/11, 2019 at 1:29 Comment(2)
Does your snippet not have a race condition with the ContinuationToken? I.e. the Parallel.ForEach will race multiple times with the same ContinuationToken; the first one will update the token, only to set off a new race?Randi
@WouterVanRanst Yes, it's possibly vulnerable to a race condition. Thank you for mentioning it. Check out the updated answer.Spillar

    var over1000Results = table.ExecuteQueryAsync(translationsQuery).Result.Cast<Translation>();
    return over1000Results.Where(x => x.expireAt > DateTime.Now)
                          .Where(x => x.effectiveAt < DateTime.Now);

Here is one of the problems: you are running the query and then filtering it in memory using these Where clauses. Move the filters into the query before it runs, which should help a lot.

Second, you should provide some limit on the number of rows retrieved from the database, as sketched below.
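
For instance, a minimal sketch using the table SDK's Take to cap the result size, with one of the date filters pushed server-side; the property names come from the question, and the 1000 cap is illustrative:

    // Sketch: push a date filter server-side and cap the number of entities.
    var limitedQuery = new TableQuery<Translation>()
        .Where(TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, sourceDestinationPartitionKey),
            TableOperators.And,
            TableQuery.GenerateFilterConditionForDate("expireAt", QueryComparisons.GreaterThan, DateTime.Now)))
        .Take(1000); // sets TakeCount so at most 1000 entities are returned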

Koeppel answered 4/11, 2019 at 22:56 Comment(1)
this didn't make a differenceThermoluminescent

Unfortunately, the query below introduces a full table scan:

    TableQuery<T> translationsQuery = new TableQuery<T>()
        .Where(
            TableQuery.CombineFilters(
                TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, sourceDestinationPartitionKey),
                TableOperators.Or,
                TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, anySourceDestinationPartitionKey)
            )
        );

You should split it into two PartitionKey filters and query them separately; each becomes a partition scan and performs more efficiently. A sketch of the split is below.
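
A minimal sketch (not part of the original answer) of the split, reusing the names from the question; each query can then be fed to the asker's ExecuteQueryAsync extension, optionally in parallel:

    // Each query targets exactly one PartitionKey, so the service performs a
    // partition scan instead of scanning the whole table to satisfy the OR.
    var exactQuery = new TableQuery<Translation>().Where(
        TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, sourceDestinationPartitionKey));

    var wildcardQuery = new TableQuery<Translation>().Where(
        TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, anySourceDestinationPartitionKey));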

Impressive answered 5/11, 2019 at 0:29 Comment(1)
we saw maybe a 10% improvement with this, but it's not enoughThermoluminescent

So the secret is not only in the code but also in setting up your Azure storage tables.

a) One of the prominent options to optimize your queries in Azure is to introduce caching. This will drastically reduce your overall response times and thereby avoid the bottleneck during the peak load you mentioned; one possible shape of this is sketched below.
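
For illustration, a minimal caching sketch, assuming the Microsoft.Extensions.Caching.Memory package and the asker's ExecuteQueryAsync extension; the method name, cache key, and one-minute TTL are all illustrative:

    // Sketch: memoize the query result briefly so ~100 concurrent consumers
    // hit Table Storage roughly once per minute instead of once each.
    private static readonly IMemoryCache Cache = new MemoryCache(new MemoryCacheOptions());

    public static Task<IEnumerable<Translation>> GetCachedAsync(
        CloudTable table, TableQuery<Translation> query, string cacheKey)
    {
        return Cache.GetOrCreateAsync(cacheKey, entry =>
        {
            entry.AbsoluteExpirationRelativeToNow = TimeSpan.FromMinutes(1); // illustrative TTL
            return table.ExecuteQueryAsync(query); // the extension method from the question
        });
    }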

b) Also, when querying entities out of Azure, the fastest possible way is with both the PartitionKey and RowKey. These are the only indexed fields in Table Storage, and any query that uses both will be returned in a matter of a few milliseconds. So ensure you use both PartitionKey and RowKey; a point-query sketch follows below.
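
A minimal illustration (not from the original answer) of such a point query, assuming both keys are known; the RowKey literal is hypothetical, and the call assumes an async context:

    // Sketch: a point query on PartitionKey + RowKey is served by the index
    // and typically returns in milliseconds.
    TableOperation retrieve = TableOperation.Retrieve<Translation>(
        sourceDestinationPartitionKey, // PartitionKey from the question
        "some-row-key");               // RowKey (hypothetical value)

    TableResult result = await table.ExecuteAsync(retrieve);
    var entity = result.Result as Translation;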

See more details here : https://learn.microsoft.com/en-us/azure/storage/tables/table-storage-design-for-query

Hope this helps.

Jaymie answered 13/11, 2019 at 17:31 Comment(0)

Note: this is general DB query optimization advice.

It's possible that the ORM is doing something stupid. When doing optimizations it's OK to step down an abstraction layer. So I suggest rewriting the query in the query language (SQL?) to make it easier to see what's going on, and also easier to optimize.

The key to optimizing lookups is sorting! Keeping a table sorted is usually much cheaper than scanning the whole table on every query. So if possible, keep the table sorted by the key used in the query. In most database solutions this is achieved by creating an index.

Another strategy that works well when there are few combinations is to maintain each query as a separate (temporary, in-memory) table that is always up to date. So when something is inserted, it's also "inserted" into the "view" tables. Some database solutions call these "views".

A more brute-force strategy is to create read-only replicas to distribute the load.

Eogene answered 13/11, 2019 at 22:45 Comment(0)
