How to create an index for Django sitemaps with over 50,000 URLs

I have the following URL configuration:

url(r'^sitemap\.xml$', index, {'sitemaps': sitemaps}),
url(r'^sitemap-(?P<section>.+)\.xml', cache_page(86400)(sitemap), {'sitemaps': sitemaps}),

and sitemaps includes the following sitemap class:

class ArticlesDetailSiteMap(Sitemap):
    changefreq = "daily"
    priority = 0.9

    def items(self):
        return Article.objects.filter(is_visible=True, date_published__lte=timezone.now())

But there are more than 50,000 articles, so I get a timeout error when I request /sitemap-articles.xml, because it tries to fetch all of the articles at once.

Any ideas on how I should create an index and make the pagination work here, as described in the documentation below?

https://docs.djangoproject.com/en/dev/ref/contrib/sitemaps/#creating-a-sitemap-index

Diarist answered 14/7, 2014 at 17:4 Comment(1)
Did you figure out how to do it in the end? - Brachy

I set limit = 5000 on the sitemap class and the issue was resolved.

class ArticlesDetailSiteMap(Sitemap):
    changefreq = "daily"
    priority = 0.9
    limit = 5000

    def items(self):
        return Article.objects.filter(is_visible=True, date_published__lte=timezone.now())

Django then generated paginated sitemap URLs covering all of the articles, 5000 per page.
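
To make the effect of limit concrete, here is a rough sketch of the page arithmetic, not Django's own code. The helper name sitemap_index_urls and the figure of 60,000 articles are made up for illustration; the ?p=N query string and the default limit of 50000 reflect my understanding of how Django's sitemap views behave.

```python
import math

DJANGO_DEFAULT_LIMIT = 50000  # default value of Sitemap.limit


def sitemap_index_urls(section, total_items, limit=DJANGO_DEFAULT_LIMIT):
    """Approximate the locations a sitemap index lists for one section.

    Hypothetical helper for illustration only; Django computes this
    internally from the sitemap's paginator.
    """
    # At least one page is listed, even when the section is empty.
    num_pages = max(1, math.ceil(total_items / limit))
    base = f"/sitemap-{section}.xml"
    # The first page has no query string; later pages are addressed as ?p=N.
    return [base] + [f"{base}?p={page}" for page in range(2, num_pages + 1)]


# Assumed article count of 60,000, split into pages of 5000.
urls = sitemap_index_urls("articles", total_items=60000, limit=5000)
print(len(urls))
print(urls[:3])
```

With a smaller limit, each request to /sitemap-articles.xml?p=N should only need one slice of the queryset instead of up to 50,000 rows, which is presumably why the timeout went away.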

Diarist answered 15/7, 2014 at 14:49 Comment(2)
This is the correct answer (worked for me). See the documentation: docs.djangoproject.com/en/1.9/ref/contrib/sitemaps/… - Slumberland
An updated link for the previous comment: docs.djangoproject.com/en/dev/ref/contrib/sitemaps/… - Relay

Try this

from django.core.paginator import Paginator, PageNotAnInteger, EmptyPage

And then

article_list = Article.objects.filter(is_visible=True, date_published__lte=timezone.now())
paginator = Paginator(article_list, 10)
page = request.GET.get('page')


try:
    articles = paginator.page(page)
except PageNotAnInteger:
    articles = paginator.page(1)
except EmptyPage:
    articles = paginator.page(paginator.num_pages)

You can then access the sitemap using URLs like sitemap.xml?page=5

Veta answered 14/7, 2014 at 17:10 Comment(5)
Yeah, I know this, but the documentation says it handles the pagination itself once an index is created. I am not sure where I went wrong, or how to create an index. - Diarist
The docs say: "You should create an index file if one of your sitemaps has more than 50,000 URLs. In this case, Django will automatically paginate the sitemap, and the index will reflect that." - Diarist
In that case, try adding a database index on the is_visible field of the table. You can do that using db_index=True. IMO, this really boils down to database optimization. You might end up looking at your queries and trying a bunch of things to tune them on the DB side. - Veta
The is_visible field already has db_index=True; I am still having this problem. - Diarist
In that case, try running some DB diagnostics. Get the query that Django is trying to run on the DB server, and run an explain plan on it. - Veta
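
As a rough illustration of what "run an explain plan" can look like, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in database. The table name, column types, and index name are hypothetical and only loosely mirror the Article model from the question; on PostgreSQL or MySQL you would run EXPLAIN against the real SQL instead (in Django, str(queryset.query) gives an approximation of that SQL).

```python
import sqlite3

# In-memory SQLite database with a hypothetical schema that loosely
# mirrors the Article model from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article ("
    " id INTEGER PRIMARY KEY,"
    " is_visible INTEGER NOT NULL,"
    " date_published TEXT NOT NULL)"
)
# Rough equivalent of db_index=True on is_visible.
conn.execute("CREATE INDEX article_is_visible_idx ON article (is_visible)")

query = (
    "SELECT id FROM article"
    " WHERE is_visible = 1 AND date_published <= ?"
)

# EXPLAIN QUERY PLAN reports how SQLite intends to execute the query;
# the last column of each row is a human-readable description.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("2014-07-14",)).fetchall()
for row in plan:
    print(row[-1])

conn.close()
```

The printed description indicates whether the planner searches via an index or scans the whole table, which is the kind of information to look for when a sitemap query is slow.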
