looking for ideas/alternatives to providing a page/item count/navigation of items matching a GAE datastore query
Asked Answered
I

4

9

I like the datastore simplicity, scalability and ease of use; and the enhancements found in the new ndb library are fabulous.

As I understand datastore best practices, one should not write code to provide item and/or page counts of matching query results when the number of items that match a query is large; because the only way to do this is to retrieve all the results which is resource intensive.

However, in many applications, including ours, it is a common desire to see a count of matching items and provide the user with the ability to navigate to a specific page of those results. The datastore paging issue is further complicated by the requirement to work around limitations of fetch(limit, offset=X) as outlined in the article Paging Through Large Datasets. To support the recommended approach, the data must include a uniquely valued column that can be ordered in the way the results are to be displayed. This column will define a starting value for each page of results; saving it, we can fetch the corresponding page efficiently, allowing navigation to a specific or next page as requested. Therefore, if you want to show results ordered in multiple ways, several such columns may need to be maintained.

It should be noted that as of SDK v1.3.1, Query Cursors are the recommended way to do datastore paging. They have some limitations, including lack of support for IN and != filter operators. Currently some of our important queries use IN, but we'll try writing them using OR for use with query cursors.

Following the guidelines suggested, a user could be given a (Next) and (Prev) navigation buttons, as well as specific page buttons as navigation proceeded. For example if the user pressed (Next) 3 times, the app could show the following buttons, remembering the unique starting record or cursor for each to keep the navigation efficient: (Prev) (Page-1) (Page-2) (Page-3) (Page-4) (Next).

Some have suggested keeping track of counts separately, but this approach isn't practical when users will be allowed to query on a rich set of fields that will vary the results returned.

I'm looking for insights on these issues in general and the following questions specifically:

  1. What navigational options of query results do you provide in your datastore apps to work around these limitations?

  2. If providing users with efficient result counts and page navigation of the entire query result set is a priority, should use of the datastore be abandoned in favor of the GAE MySql solution now being offered.

  3. Are there any upcoming changes in the big table architecture or datastore implementation that will provide additional capability for counting results of a query efficiently?

Many thanks in advance for your help.

Impish answered 22/2, 2012 at 7:31 Comment(0)
C
2

It all depends on how many results you typically get. E.g. by passing .count() a suitable limit you can provide an exact count if the #items is e.g. <= 100 and "many" if there are more. It sounds like you cannot pre-compute all possible counts, but at least you could cache them, thereby saving many datastore ops.

Using NDB, the most efficient approach may either be to request the first page of entities using fetch_page(), and then using the resulting cursor as a starting point for a count() call; or alternatively, you may be better off running the fetch() of the first page and the count() concurrently using its async facilities. The second option may be your only choice if your query does not support cursors. Most IN / OR queries don't currently support cursors, but they do if you order by __key__.

In terms of UI options, I think it's sufficient to offer next and previous page options; the "Gooooooogle" UI that affords skipping ahead several pages is cute but I almost never use it myself. (To implement "previous page", reverse the order of the query and use the same cursor you used for the current page. I'm pretty sure this is guaranteed to work.)

Canonicity answered 23/2, 2012 at 5:6 Comment(4)
Assuming we use the C = query.count(N) approach to show the user "1-20 of C" or "1-20 of many; how do we determine a reasonable value, cost wise, for N. In our use case 100 will be too small. Any suggestions on how to best size this to keep cost$ down? From NDB docs: "Note that count(), while faster than fetch(), still does a lot of work each time it is called". How much quota is used? Guido, Thanks for Python, NDB and your help :) IMO page counts and nav are a nice feature for some apps because users can evaluate and adjust the size of the data matching their params before drilling in.Impish
You can compute the cost using this page: code.google.com/appengine/docs/… . AFAIK a Count() is like a keys-only query. Consider caching the counts. (Depending on the problem, if you have a limited number of counts to cache, you may be able to store the counts in the datastore using the sharded-counter pattern.)Canonicity
Also an update on IN / OR queries: you can turn any query into a cursor-supporting query by adding an ordering by key at the end of the existing sort order. E.g. in NDB: Employee.query(Employee.name.IN(['Joe', 'Jane'])).order(Employee.name, Employee.key).fetch_page(N) -- without the Employee.key order it raises BadArgumentError.Canonicity
Thanks @GuidovanRossum this info is a life saver, if not already included, would be great add for the NDB docs.Impish
S
1

Maybe just aim for this style of paging:

(first)(Prev)(Page1)(Page2)(Page3)....(Last)(next)

That way the total number is not required - you only need your code to know that there is enough results for another 3+ pages. with page size of 10 items per page, you just need to know there are 30+ items.

If you have 60 items, (enough for 6 pages) when youre already on page 4, your code would look forward and realise there are only another 20 records to go, so you could then show the last page number:

(first)(Prev)(Page4)(Page5)(Page6)(next)(last)

Basically for each fetch for the current page, just fetch enough records for another 3 pages of data, count them to see how many more pages you actully have, then dispaly your pager accordingly.

Also, if you just fetch the keys, it will be more efficient than fetching extra items. hope that makes some sense!!?? :)

Selfhypnosis answered 28/2, 2012 at 14:47 Comment(1)
Hey @Joe thanks for the suggestion. You've captured what I describe above in the 5th paragraph of my original question. Guido's answer sharpens the focus on what is possible. Given the cost of the count query, for a large dataset we will definitely use something like "..." when N is smaller than the total number of results. See the discussion under Guido's answer for more context.Impish
S
0
  1. I notice that gmail is ready with some counts - it can tell you how many total emails you've received, and how many are in your inbox, etc - but on other counts, like full-text searches it says you're looking at "1-20 of many" or "1-20 of about 130". Do you really need to display counts for every query, or could you pre-calculate just the important ones?
Scourge answered 22/2, 2012 at 17:32 Comment(1)
Keeping track of a well know totals is definitely easier. In some articles I've read folks use sharded counters for this on GAE: code.google.com/appengine/articles/sharding_counters.html. Our use case is more similar to gmail full-text searches.I think showing the totals and page navigation provides users with a better understanding of their search results before they drill in.That said,it seems if one uses datastore, the only option is the "1-20 of many" approach barring some revelations here.Presumably gmail shows "1-20 of about 130" when the result set is small, maybe via look ahead.Impish
F
0

Since the question was "looking for ideas/alternatives to providing a page", maybe the very simple alternative of fetching 10 pages worth of key_only items, then handling navigation through within this set is worth considering.

I have elaborated on this in answering a similar question, you will find sample code there :

Backward pagination with cursor is working but missing an item

The sample code would be more appropriate for this question. Here is a piece of it:

def session_list():
    page = request.args.get('page', 0, type=int)

    sessions_keys = Session.query().order(-Session.time_opened).fetch(100, keys_only=True)
    sessions_keys, paging = generic_list_paging(sessions_keys, page)
    # generic_list_paging will select the proper sublist.
    sessions = [ sk.get() for sk in sessions_keys ]

    return render_template('generic_list.html', objects=sessions, paging=paging)

See the referenced question for more code.

Of course, if the result set is potentially huge, some limit to the fetch must still be given, the hard limit being 1000 items I think. Obviously, it the result is more than some 10 pages long, the user will be asked to refine by adding criteria.

Dealing with paging within a few hundreds of keys_only items is really so much simpler, that it's definitely worth considering. It makes it quite easy to provide direct page navigation as mentionned in the question. The actual entity items are only fetched for the actual current page, the rest is only keys so it's not so costly. And you may consider keeping the keys_only result set in memcache for a few minutes so that a user quickly browsing through pages will not require the same query to be performed again.

Fermi answered 10/4, 2015 at 15:4 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.