How to efficiently serve massive sitemaps in Django

I have a site with about 150K pages in its sitemap. I'm using the sitemap index generator to make the sitemaps, but really, I need a way of caching it, because building the 150 sitemaps of 1,000 links each is brutal on my server.[1]

I COULD cache each of these sitemap pages with memcached, which is what I'm using elsewhere on the site... however, there are so many sitemaps that they would completely fill memcached, so that doesn't work.

What I think I need is a way to use the database as the cache for these, and to only generate them when there are changes to them (which, because of how the sitemap index works, means only changing the latest couple of sitemap pages, since the rest are always the same).[2] But, as near as I can tell, I can only use one cache backend with Django.

How can I have these sitemaps ready for when Google comes-a-crawlin' without killing my database or memcached?

Any thoughts?

[1] I've limited it to 1,000 links per sitemap page because generating the max, 50,000 links, just wasn't happening.

[2] For example, if I have sitemap.xml?page=1, page=2...sitemap.xml?page=50, I only really need to change sitemap.xml?page=50 until it is full with 1,000 links; then I can cache it pretty much forever, and focus on page 51 until it's full, cache it forever, etc.
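
To make that concrete, here's a rough sketch of the idea in [2], wrapping the stock sitemap view; the cache directory and the full_page_count() helper are hypothetical stand-ins, not code I'm actually running:

# Hypothetical sketch: serve already-full sitemap pages from disk forever,
# and only regenerate the newest page through Django.
import os
from django.contrib.sitemaps import views as sitemap_views
from django.http import HttpResponse

CACHE_DIR = '/var/www/sitemap-cache'  # assumed location


def cached_sitemap(request, sitemaps):
    page = int(request.GET.get('p', 1))  # the stock view reads ?p=
    path = os.path.join(CACHE_DIR, 'sitemap-%d.xml' % page)
    # full_page_count() is a stand-in for "how many pages already hold
    # 1,000 links"; those pages never change, so serve them from disk.
    if page <= full_page_count() and os.path.exists(path):
        return HttpResponse(open(path, 'rb').read(), content_type='application/xml')
    response = sitemap_views.sitemap(request, sitemaps)
    if hasattr(response, 'render'):
        response.render()  # newer Django versions return a TemplateResponse
    if page <= full_page_count():
        open(path, 'wb').write(response.content)
    return response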

EDIT, 2012-05-12: This has continued to be a problem, and I finally ditched Django's sitemap framework after using it with a file cache for about a year. Instead I'm now using Solr to generate the links I need in a really simple view, and I'm then passing them off to the Django template. This greatly simplified my sitemaps, made them perform just fine, and I'm up to about 2,250,000 links as of now. If you want to do that, just check out the sitemap template - it's all really obvious from there. You can see the code for this here: https://bitbucket.org/mlissner/search-and-awareness-platform-courtlistener/src/tip/alert/casepage/sitemap.py
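
If it helps, the view boils down to something like this simplified sketch (not the exact code at that link; get_urls_for_page() is a placeholder for the Solr query):

# views.py - simplified sketch of the "plain view + sitemap template" idea.
from django.http import HttpResponse
from django.template import loader


def sitemap_page(request):
    page = int(request.GET.get('p', 1))
    # get_urls_for_page() is a hypothetical stand-in for whatever supplies
    # your (location, lastmod) pairs - a Solr query in my case.
    urlset = [
        {'location': loc, 'lastmod': lastmod}
        for loc, lastmod in get_urls_for_page(page)
    ]
    # Django's bundled sitemap.xml template just loops over `urlset`.
    xml = loader.render_to_string('sitemap.xml', {'urlset': urlset})
    return HttpResponse(xml, content_type='application/xml')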

Maidenhood answered 11/5, 2010 at 2:8 Comment(0)

I had a similar issue and decided to use Django to write the sitemap files to disk in the static media and have the webserver serve them. I made the call to regenerate the sitemap every couple of hours since my content wasn't changing more often than that, but how often you need to write the files will depend on your content.

I used a custom Django management command with a cron job, but curl with a cron job is easier.

Here's how I use curl, and I have Apache serve /sitemap.xml as a static file, not through Django:

curl -o /path/sitemap.xml http://example.com/generate/sitemap.xml
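
If you'd rather go the management command route instead of curl, a minimal sketch looks something like this (the URL and the output path are assumptions you'd adjust for your setup):

# management/commands/write_sitemap.py - rough sketch of rendering the
# sitemap through Django and writing it to the static media directory.
from django.core.management.base import BaseCommand
from django.test.client import Client


class Command(BaseCommand):
    help = "Render the sitemap and write it where the webserver serves static files."

    def handle(self, *args, **options):
        # Hit the normal Django sitemap URL internally.
        response = Client().get('/generate/sitemap.xml')
        # Adjust this path to a location your webserver serves directly.
        with open('/var/www/media/sitemap.xml', 'wb') as f:
            f.write(response.content)

Then cron just runs ./manage.py write_sitemap every couple of hours.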
Monasticism answered 11/5, 2010 at 2:46 Comment(4)
I'm working on something similar now. Do you have a code example?Maidenhood
mlissner - To elaborate on dar's answer: 1) Move the Django URL for sitemap.xml to /generate/sitemap.xml ; 2) /path/to/sitemap.xml should be the full system path to a location in your media directory (make sure its writable by the user who will be running the cron job) ; 3) Set up a cron job that pulls from the /generate/sitemap.xml URL and writes the output to that location in your media dir.Situla
I've continued refining this method. Couple additional things to mention. 1), the date_field that's used with Django's sitemap generator MUST be a database index, since it's used to sort the sitemaps. Didn't realize that for a long time, and surprisingly nobody mentioned it here. 2), I permanently cache all sitemaps to disk when they're full (1,000 links on the nose), and then use Django signals to invalidate the cache if an item changes.Maidenhood
And another comment. On MySQL with MyISAM, sitemaps can lock your tables, bringing the site to a crawl since they can have huge OFFSET and LIMIT clauses. The solution is to add connection.cursor().execute('SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED')Maidenhood

Okay - I have found some more info on this and on what Amazon is doing with its 6 million or so URLs.

Amazon simply makes a map for each day and adds to it:

  1. new URLs
  2. updated URLs

So this means that they end up with loads of sitemaps - but the search bot will only look at the latest ones, as the updated dates are recent. I was under the impression that one should refresh a map and not include a URL more than once, and I think this is true. But Amazon gets around this because the sitemaps are more of a log. A URL may appear in a later sitemap, as it may have been updated, but Google won't look at the older maps, as they are out of date - unless of course it does a major re-index. This approach makes a lot of sense, as all you do is simply build a new map each day - of new and updated content - and ping it at Google; thus Google only needs to index these new URLs.

This log approach is a cinch to code, as all you need is a simple data-store model that stores the XML data for each map. Your cron job can build a map - daily or weekly - and then store the raw XML page in a blob field or what have you. You can then serve the pages straight from a handler, and the index map too.
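
A rough sketch of what that data-store model and handlers might look like (all names made up for illustration):

# Rough sketch of the log-style approach; model and field names are
# illustrative, not from any particular project.
from django.db import models
from django.http import HttpResponse


class SitemapLog(models.Model):
    # One row per generated map, e.g. one per day.
    day = models.DateField(unique=True)
    xml = models.TextField()  # the raw <urlset> document for that day


def sitemap_for_day(request, day):
    # Serve the stored XML for one day straight out of the database.
    entry = SitemapLog.objects.get(day=day)
    return HttpResponse(entry.xml, content_type='application/xml')


def sitemap_index(request):
    # Build the index map that points at each daily map.
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for entry in SitemapLog.objects.order_by('day'):
        lines.append(
            '<sitemap><loc>http://example.com/sitemaps/%s.xml</loc>'
            '<lastmod>%s</lastmod></sitemap>' % (entry.day, entry.day))
    lines.append('</sitemapindex>')
    return HttpResponse('\n'.join(lines), content_type='application/xml')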

I'm not sure what others think, but this sounds like a very workable approach and a load off one's server - compared to rebuilding a huge map just because a few pages may have changed.

I have also considered that it may be possible to then crunch a week's worth of maps into a weekly map, and 4 weeks of maps into a monthly one - so you end up with monthly maps, a map for each week in the current month, and then a map for each of the last 7 days. Assuming that the dates are all maintained, this will reduce the number of maps and tidy up the process - I'm thinking in terms of reducing 365 daily maps for the year down to 12.

Here is a PDF on sitemaps and the approaches used by Amazon and CNN.

http://www.wwwconference.org/www2009/proceedings/pdf/p991.pdf

Watterson answered 13/5, 2010 at 0:28 Comment(1)
That's interesting. Thanks for sharing the document.Alphosis

I'm using the django-staticgenerator app to cache sitemap.xml to the filesystem, and I update that file when the data changes.

settings.py:

STATIC_GENERATOR_URLS = (
    r'^/sitemap',
)
WEB_ROOT = os.path.join(SITE_ROOT, 'cache')

models.py:

from staticgenerator import quick_publish, quick_delete
from django.dispatch import receiver
from django.db.models.signals import post_save, post_delete
from django.contrib.sitemaps import ping_google

@receiver(post_delete)
@receiver(post_save)
def delete_cache(sender, **kwargs):
    # Check if a Page model changed
    if sender == Page:
        quick_delete('/sitemap.xml')
        # You may republish sitemap file now
        # quick_publish('/', '/sitemap.xml')
        ping_google()

In the nginx configuration I point sitemap.xml at the cache folder, with the Django instance as a fallback:

location /sitemap.xml {
    root /var/www/django_project/cache;

    proxy_set_header  X-Real-IP  $remote_addr;
    proxy_set_header  X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Host $http_host;

    if (-f $request_filename/index.html) {
        rewrite (.*) $1/index.html break;
    }
    # If file doesn't exist redirect to django
    if (!-f $request_filename) {
        proxy_pass http://127.0.0.1:8000;
        break;
    }    
}

With this method, sitemap.xml is always up to date, and clients (like Google) always get the XML file served statically. That's cool, I think! :)

Hoax answered 25/4, 2012 at 14:43 Comment(0)

For those who (for whatever reason) would prefer to keep their sitemaps dynamically generated (e.g. for freshness, or laziness), try django-sitemaps. It's a streaming version of the standard sitemaps. Drop-in replacement. Much faster response time and uses waaaaay less memory.

Spindling answered 7/2, 2013 at 4:15 Comment(0)
