Caching sitemaps in Django
A

4

9

I implemented a simple sitemap class using Django's default sitemap application. As it was taking a long time to execute, I added manual caching:

class ShortReviewsSitemap(Sitemap):
    changefreq = "hourly"
    priority = 0.7

    def items(self):
        # Try to retrieve from cache
        result = get_cache(CACHE_SITEMAP_SHORT_REVIEWS, "sitemap_short_reviews")
        if result!=None:
            return result

        result = ShortReview.objects.all().order_by("-created_at")

        # Store in cache
        set_cache(CACHE_SITEMAP_SHORT_REVIEWS, "sitemap_short_reviews", result)

        return result

    def lastmod(self, obj):
        return obj.updated_at

The problem is that Memcached only allows objects of up to 1 MB. This one was bigger than 1 MB, so storing it in the cache failed:

>7 SERVER_ERROR object too large for cache

A second complication is that Django decides automatically when to split a sitemap into smaller files. According to the documentation:

You should create an index file if one of your sitemaps has more than 50,000 URLs. In this case, Django will automatically paginate the sitemap, and the index will reflect that.

What do you think would be the best way to enable caching sitemaps?

  • Hacking into Django's sitemaps framework to restrict a single sitemap to, let's say, 10,000 records seems like the best idea. Why was 50,000 chosen in the first place? Google's advice? A random number?
  • Or maybe there is a way to allow Memcached to store bigger objects?
  • Or perhaps, once generated, the sitemaps should be made available as static files? This would mean that instead of caching with Memcached I'd have to manually store the results in the filesystem and retrieve them from there the next time the sitemap is requested (perhaps cleaning the directory daily in a cron job).

All of these seem very low-level, and I'm wondering if an obvious solution exists...

Amplitude answered 17/1, 2010 at 2:46 Comment(4)
Don't do "result!=None", always do "result is not None" - Paralyse
Why is that? What's the difference? - Amplitude
50,000 is given in the Sitemaps protocol. - Yan
This limit is defined by Google. See the index documentation at: sitemaps.org/protocol.html#index. - Gulosity
D
16

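The 50,000 limit is not hard-coded; it is the default value of the limit attribute on Django's Sitemap class.

You can subclass django.contrib.sitemaps.GenericSitemap and override it:

from django.contrib.sitemaps import GenericSitemap

class LimitGenericSitemap(GenericSitemap):
    limit = 2000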
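
To get a feel for what a smaller limit means, here is a minimal sketch that estimates how many sitemap pages a given limit would produce. The helper name sitemap_page_count and the URL counts are hypothetical, and the sketch assumes plain ceiling division with an empty sitemap still counting as one page:

```python
import math


def sitemap_page_count(total_urls, limit):
    """Estimate how many sitemap pages result from splitting total_urls by limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Assumption: an empty sitemap still yields one (empty) page.
    return max(1, math.ceil(total_urls / limit))


# Illustrative numbers only: the same URL count at a few different limits.
for limit in (50000, 10000, 2000):
    print(limit, sitemap_page_count(200000, limit))
```

A lower limit means more, smaller pages, so each cached object is smaller but the sitemap index lists more entries.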
Dim answered 17/1, 2010 at 2:46 Comment(1)
This was phenomenally helpful. For a working version of this, see my code, here: bitbucket.org/mlissner/legal-current-awareness/src/dc66d2268bec/… - Brambling
C
3

You can also serve sitemaps in gzip format, which makes them a lot smaller; XML compresses very well. What I sometimes do is generate the gzipped sitemap file(s) in a cron job and regenerate them as often as necessary. Usually, once a day will suffice. The code might look like this. Just make sure that sitemap.xml.gz is served from your domain root:

    import gzip
    import os

    from django.conf import settings
    from django.contrib.sitemaps import GenericSitemap
    from django.contrib.sitemaps.views import sitemap

    # MyModel is your own model; `request` must be an HttpRequest for your site.
    sitemaps = {
        'page': GenericSitemap({
            'queryset': MyModel.objects.all().order_by('-created'),
            'date_field': 'created',
        }),
    }
    response = sitemap(request, sitemaps=sitemaps).render()
    # response.content is already bytes, so it can be written to the gzip file directly.
    with gzip.open(os.path.join(settings.STATIC_ROOT, 'sitemap.xml.gz'), 'wb') as f:
        f.write(response.content)

This should get you started.
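
As a rough illustration of how well sitemap XML compresses, the following sketch builds a synthetic sitemap and compares its raw and gzipped sizes. The URLs and field values are made up, and the actual ratio will depend on your own data:

```python
import gzip

# Build a synthetic sitemap; the URLs below are invented for illustration.
entries = "".join(
    "<url><loc>https://example.com/reviews/%d/</loc>"
    "<lastmod>2010-01-17</lastmod><changefreq>hourly</changefreq>"
    "<priority>0.7</priority></url>\n" % i
    for i in range(10000)
)
xml = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    + entries
    + "</urlset>\n"
).encode("utf-8")

compressed = gzip.compress(xml)
print("raw bytes:    ", len(xml))
print("gzipped bytes:", len(compressed))
print("ratio: %.1fx" % (len(xml) / len(compressed)))
```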

Corrincorrina answered 17/1, 2010 at 2:46 Comment(1)
What should request be? - Supporting
K
2

Assuming you don't need all those pages in your sitemap, reducing the limit to get the file size down will work fine, as described in the previous answer.

If you do want a very large sitemap and do want to use Memcached, you could split the content into multiple chunks, store them under individual keys, and then put them back together on output. To make this more efficient, Memcached can fetch multiple keys in a single request; Django's cache API exposes this as cache.get_many().
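
A minimal sketch of that chunking idea follows. The helper names set_chunked and get_chunked, the key scheme, and the 900 KB chunk size are all assumptions chosen for illustration; cache is assumed to be any object exposing Django's cache API (get, get_many, set_many), such as django.core.cache.cache:

```python
import pickle

# Assumption: stay safely below Memcached's default 1 MB item limit.
CHUNK_SIZE = 900 * 1024


def set_chunked(cache, key, value, timeout=3600):
    """Pickle value and store it as several chunks plus a chunk count."""
    data = pickle.dumps(value, pickle.HIGHEST_PROTOCOL)
    chunks = {
        "%s:%d" % (key, i): data[start:start + CHUNK_SIZE]
        for i, start in enumerate(range(0, len(data), CHUNK_SIZE))
    }
    chunk_count = len(chunks)
    chunks["%s:count" % key] = chunk_count
    cache.set_many(chunks, timeout)


def get_chunked(cache, key):
    """Return the stored value, or None if the count or any chunk is missing."""
    count = cache.get("%s:count" % key)
    if count is None:
        return None
    keys = ["%s:%d" % (key, i) for i in range(count)]
    found = cache.get_many(keys)  # one round trip for all chunks
    if len(found) != len(keys):
        return None  # a chunk was evicted; treat it as a cache miss
    return pickle.loads(b"".join(found[k] for k in keys))
```

With Django's configured cache, usage might look like set_chunked(cache, "sitemap_short_reviews", list(queryset)) followed later by get_chunked(cache, "sitemap_short_reviews"). Because chunks can be evicted independently, any missing chunk is treated as a full miss.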

For reference, the 1 MB limit is a consequence of how Memcached stores data: http://code.google.com/p/memcached/wiki/FAQ#What_is_the_maximum_data_size_you_can_store?_(1_megabyte)

Kolyma answered 17/1, 2010 at 2:46 Comment(1)
The link is (effectively) broken: "There was an error obtaining wiki data" - Pinfish
B
1

I have about 200,000 pages on my site, so I had to have the index no matter what. I ended up doing the hack, limiting the sitemap to 250 links, and also implementing a file-based cache.

The basic algorithm is this:

  • Try to load the sitemap from a file on disk
  • If that fails, generate the sitemap, and
  • If the sitemap contains 250 links (the number set above), save it to disk and then return it.
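
The steps above can be sketched roughly as follows. This is not the code from the linked repository; the function name load_or_build_sitemap and the build callable (assumed to return the rendered XML as bytes together with its link count) are hypothetical:

```python
import os
import tempfile


def load_or_build_sitemap(path, build, full_page_size=250):
    """Serve a sitemap page from disk, generating and saving it on a miss.

    build is a hypothetical callable returning (xml_bytes, link_count).
    """
    try:
        with open(path, "rb") as f:
            return f.read()
    except FileNotFoundError:
        pass  # not cached yet; fall through and generate it

    xml, link_count = build()
    # Only complete pages are saved; a partial page may still gain links.
    if link_count == full_page_size:
        # Write to a temp file and rename it, so readers never see a partial file.
        fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "wb") as f:
            f.write(xml)
        os.replace(tmp_path, path)
    return xml
```

Deleting the file at path forces the next request to regenerate that page.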

The end result is that the first time a sitemap is requested, if it's complete, it's generated and saved to disk. The next time it's requested, it's simply served from disk. Since my content never changes, this works very well. However, if I do want to change a sitemap, it's as simple as deleting the file(s) from disk, and waiting for the crawlers to come regenerate things.

The code for the whole thing is here, if you're interested: http://bitbucket.org/mlissner/legal-current-awareness/src/tip/alert/alertSystem/sitemap.py

Maybe this will be a good solution for you too.

Brambling answered 17/1, 2010 at 2:46 Comment(1)
The link is (effectively) broken: "That link has no power here" - Pinfish
