How to generate a sitemap on a highly dynamic website?

5

62

Should a highly dynamic website that is constantly generating new pages use a sitemap? If so, how does a site like stackoverflow.com go about regenerating a sitemap? It seems like it would be a drain on precious server resources if it was constantly regenerating a sitemap every time someone adds a question. Does it generate a new sitemap at set intervals (e.g. every four hours)? I'm very curious how large, dynamic websites make this work.

Cristinecristiona answered 8/7, 2009 at 17:15 Comment(3)
Do you have a specific need for a sitemap? They're a little old-fashioned; some sites don't provide them at all.Providing
Can you specify the type of sitemap you are talking about? There are several implementations of sitemaps that serve various purposes. For example, there are the XML-based sitemaps used by search engines, and then there are the sitemaps that help users find a particular page on a site.Rafaello
Pretty sure they're talking about a sitemap.xml file - a user-accessible sitemap that listed every item in a site with 100,000+ items would be utterly and self-evidently useless.Stickpin
54

On Stack Overflow (and all Stack Exchange sites), a sitemap.xml file is created that contains a link to every question posted on the system. When a new question is posted, they simply append another entry to the end of the sitemap file. Appending to the end of the file is not resource intensive, although the file itself is quite large.

That is the only way search engines like Google can effectively crawl the site.

Jeff Atwood talks about it in a blog post: The Importance of Sitemaps

This is from Google's webmaster help page on sitemaps:

Sitemaps are particularly helpful if:

  • Your site has dynamic content.
  • Your site has pages that aren't easily discovered by Googlebot during the crawl process - for example, pages featuring rich AJAX or Flash.
  • Your site is new and has few links to it. (Googlebot crawls the web by following links from one page to another, so if your site isn't well linked, it may be hard for us to discover it.)
  • Your site has a large archive of content pages that are not well linked to each other, or are not linked at all.
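
The append-only approach described at the top of this answer could be sketched roughly as follows. This is a minimal illustration in Python, not Stack Overflow's actual implementation: the function name `append_url` and the exact file layout (a file that always ends with a `</urlset>` line) are assumptions made for the example.

```python
import os
from xml.sax.saxutils import escape

# Assumed layout: the file always ends with exactly this closing line.
CLOSING_TAG = b"</urlset>\n"
HEADER = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
)


def append_url(path: str, url: str) -> None:
    """Append one <url> entry without rewriting the rest of the file."""
    entry = f"  <url><loc>{escape(url)}</loc></url>\n".encode("utf-8")

    # Create an empty, well-formed sitemap on first use.
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(HEADER + CLOSING_TAG)

    with open(path, "r+b") as f:
        # Jump to where the closing tag should start and verify it is there.
        f.seek(-len(CLOSING_TAG), os.SEEK_END)
        if f.read() != CLOSING_TAG:
            raise ValueError("sitemap does not end with the expected closing tag")
        # Overwrite the closing tag with the new entry, then re-add the tag.
        f.seek(-len(CLOSING_TAG), os.SEEK_END)
        f.write(entry + CLOSING_TAG)
```

Only the tail of the file is touched, so the cost of adding a question does not grow with the size of the sitemap. Note that a single sitemap file is limited to 50,000 URLs, so a real append-only scheme would also need to roll over to a new file at some point.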
Librate answered 8/7, 2009 at 17:20 Comment(7)
What about when a user deletes a question? Is there ever a time that they would regenerate the entire sitemap?Cristinecristiona
No need to update for a deletion, as long as your site returns a 404 for that question. Google will see the 404 and remove the page from its index, so no harm done.Stickpin
Does this mean SO's sitemap items won't ever get the lastmod field updated? How will the search engines know when to reindex a question page?Diannediannne
Why can't I find Stack Overflow's sitemap? It is clearly listed as "sitemap.xml" in the robots.txt, but that file does not appear to exist: stackoverflow.com/sitemap.xmlCristinecristiona
lastmod is a stickier issue, which is why I'd use the technique posted in my answer - generate it out of the database on-demand. StackOverflow may do this - they block access to sitemap.xml for non-Googlebot user agents, so presumably there's a load on the server from accessing it.Stickpin
@average - If you spoof a Googlebot user agent, it shows up. They block it for normal browsers.Stickpin
It's funny, now that I think about it, I actually read that article at codinghorror a while back, but I completely forgot about it. I must've read it before my morning coffee....Cristinecristiona
16

There's no need to regenerate the Google sitemap XML each time a question is posted. It's far simpler to have the XML generated on demand directly from the database, with a little caching in front of it.
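
As a rough sketch of what "on demand, with a little caching" might look like, here is a Python example using SQLite. The table name `pages`, its columns, the one-hour TTL, and the function names are all assumptions for illustration; a real site would plug in its own schema and cache.

```python
import sqlite3
import time
from typing import Optional
from xml.sax.saxutils import escape

CACHE_TTL_SECONDS = 3600  # assumed: how long a rendered sitemap stays valid
_cache = {"xml": None, "expires": 0.0}


def render_sitemap(conn: sqlite3.Connection) -> str:
    """Build the sitemap XML straight from the (hypothetical) pages table."""
    rows = conn.execute(
        "SELECT url, updated_at FROM pages ORDER BY id LIMIT 50000"
    )
    parts = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for url, updated_at in rows:
        parts.append(
            f"  <url><loc>{escape(url)}</loc>"
            f"<lastmod>{escape(updated_at)}</lastmod></url>"
        )
    parts.append("</urlset>")
    return "\n".join(parts)


def get_sitemap(conn: sqlite3.Connection, now: Optional[float] = None) -> str:
    """Return the cached sitemap, regenerating it only after the TTL expires."""
    now = time.time() if now is None else now
    if _cache["xml"] is None or now >= _cache["expires"]:
        _cache["xml"] = render_sitemap(conn)
        _cache["expires"] = now + CACHE_TTL_SECONDS
    return _cache["xml"]
```

With this shape, posting a question costs nothing extra; the database is only queried when a crawler requests the sitemap and the cached copy has expired.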

To reduce load, the sitemap can be split into many sitemaps. Partitioning it by day/month would allow you to tell Google to retrieve today's sitemap frequently, but only fetch the sitemap from six months ago once in a while.
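
One way to express the date-based split is a sitemap index that lists one child sitemap per month, each with its own `<lastmod>`. The sketch below is in Python; the `sitemap-YYYY-MM.xml` naming scheme and the function name are hypothetical, and how often a crawler actually refetches each child sitemap is ultimately up to the crawler.

```python
from datetime import date
from typing import Dict, Tuple
from xml.sax.saxutils import escape


def monthly_sitemap_index(
    base_url: str, last_modified_by_month: Dict[Tuple[int, int], date]
) -> str:
    """Build a <sitemapindex> with one entry per (year, month) partition."""
    parts = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for (year, month), lastmod in sorted(last_modified_by_month.items()):
        # Hypothetical naming scheme: one file per calendar month.
        loc = f"{base_url}/sitemap-{year:04d}-{month:02d}.xml"
        parts.append("  <sitemap>")
        parts.append(f"    <loc>{escape(loc)}</loc>")
        parts.append(f"    <lastmod>{lastmod.isoformat()}</lastmod>")
        parts.append("  </sitemap>")
    parts.append("</sitemapindex>")
    return "\n".join(parts)
```

Old months keep an old `<lastmod>`, which signals that they rarely change, while the current month's entry is updated whenever new content is posted.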

Stickpin answered 8/7, 2009 at 17:20 Comment(6)
It's implied in the question. No "large, dynamic website" would ever add every question posted to it into a user-accessible sitemap.Stickpin
This is a good answer. I would've accepted it, but Robert's is more nicely formatted with shiny hyperlinks and a quote box!Cristinecristiona
@Stickpin (1) Is it an established practice to generate sitemap on-demand instead of having static pages? Do you recommend that always or in special scenarios? (2) If I have to generate "many" sitemaps dynamically on-demand from database, how do I decide on partitioning rule?Sorbose
@Sorbose In all honesty, the established practice is "don't have a sitemap" these days.Stickpin
@Stickpin actually my website falls in the category of following scenario - "Your site has a large archive of content pages that are not well linked to each other, or are not linked at all"Sorbose
Why no sitemap anymore? Is Google smarter, or are developers designing their sites better, or something else? What's changed?Joab
5

I'd like to share my solution here just in case it helps someone as well. It took me reading this question and many others to decide what to do.

My site structure.

Static pages

  • Home (Highly dynamic. Cached for 30 mins)
  • Artists, Albums, Songs, Playlists and Albums (Paginated List)
  • Legal (Static page with Terms etc)

...etc

Dynamic Pages

  • Artists, Albums, Songs, Playlists and Albums detail pages

My approach.

sitemap.xml: This URL generates a <sitemapindex /> whose first item is /sitemap-main.xml. The Artists, Albums, Songs, etc. are each counted, and each count is divided by 1,000 (the number of URLs I want in each sitemap; the limit is 50,000). I round this number up.

For example, 1,900 songs / 1,000 = 1.9, which rounds up to 2, so I add the URLs /sitemap-songs-0.xml and /sitemap-songs-1.xml to the index. I repeat this for all other item types. Basically, I am paginating.

The output is returned uncached. I want this to always be fresh.
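
The page-count arithmetic for the index can be sketched like this. The example is in Python rather than the PHP used on my site, and the function name is made up for illustration; only the URL pattern and the 1,000-per-page figure come from the description above.

```python
import math
from typing import List

URLS_PER_SITEMAP = 1000  # chosen page size; the protocol limit is 50,000


def sitemap_page_urls(
    item_type: str, item_count: int, per_page: int = URLS_PER_SITEMAP
) -> List[str]:
    """Return the child sitemap URLs needed to cover item_count items."""
    pages = math.ceil(item_count / per_page)  # e.g. 1900 / 1000 = 1.9, rounded up to 2
    return [f"/sitemap-{item_type}-{page}.xml" for page in range(pages)]
```

For 1,900 songs this yields /sitemap-songs-0.xml and /sitemap-songs-1.xml, matching the worked example; a type with no items contributes no entries to the index.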


sitemap-main.xml: This lists all the static pages. You can actually use a static file for this as you will only need to update it once in a while.


sitemap-songs-0.xml, sitemap-albums-0.xml, etc.: I use a single route for these in Slim 2 (the PHP framework).

$app->get('/sitemap-:type-:page.xml', function ($type, $page) use ($app) {...

I use a simple switch statement on the type to generate the relevant file. If the requested page is full, meaning it contains 1,000 items (the limit specified above), I cache the file for two weeks, since a full page is unlikely to change. Otherwise, I cache it for only a few hours.
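
The caching rule can be sketched as below, again in Python rather than PHP. The function names are hypothetical, and "a few hours" is pinned to three hours purely as an assumption for the example.

```python
from datetime import timedelta
from typing import Tuple

URLS_PER_SITEMAP = 1000


def page_offset_and_limit(page: int, per_page: int = URLS_PER_SITEMAP) -> Tuple[int, int]:
    """Translate a zero-based sitemap page number into OFFSET/LIMIT values."""
    return page * per_page, per_page


def cache_ttl(items_on_page: int, per_page: int = URLS_PER_SITEMAP) -> timedelta:
    """Full pages are cached for a long time; the last, partial page is not."""
    if items_on_page >= per_page:
        return timedelta(weeks=2)
    return timedelta(hours=3)  # assumed value for "a few hours"
```

The idea is that only the last page of each type is still growing, so it is the only one worth refreshing often.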

I guess this can help anyone else implement their own system.

Earhart answered 9/1, 2016 at 21:8 Comment(0)
3

For a highly dynamic site, I wrote a cron job on my server that runs daily. It makes a REST call to my backend, which generates a new sitemap covering all newly created content and returns it as an XML file. This new sitemap overwrites the previous one, so the sitemap stays up to date with all the changes. I don't think updating the sitemap for each newly added piece of dynamic content is a good approach.
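
A minimal sketch of the replace step that such a daily job might run is shown below, in Python. The `fetch_sitemap_xml` callable stands in for the REST call to the backend and is an assumption, as are the function name and the file permissions; the point of the example is only that the old file is swapped out atomically.

```python
import os
import tempfile
from typing import Callable


def regenerate_sitemap(fetch_sitemap_xml: Callable[[], str], target_path: str) -> None:
    """Fetch a fresh sitemap and atomically replace the previous file."""
    xml = fetch_sitemap_xml()  # hypothetical: wraps the REST call to the backend

    # Write to a temporary file in the same directory, then rename it over
    # the old sitemap so crawlers never see a half-written file.
    directory = os.path.dirname(os.path.abspath(target_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(xml)
        os.chmod(tmp_path, 0o644)  # assumed: the web server must be able to read it
        os.replace(tmp_path, target_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the fetch fails, the exception is raised before anything is written, so the previous sitemap stays in place until the next run.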

Abridgment answered 8/11, 2016 at 8:9 Comment(0)
2

Even on something like Stack Overflow, there is a certain amount of static organization: there are FAQs, tag pages, question pages, user pages, badge pages, etc. I'd say that on a very dynamic site, the best way to approach a sitemap is to build a map of the categorizations, where each node in the sitemap points to a page of the dynamically generated data (a node for a question page, a node for a user page, etc.).

Of course, a sitemap may not even be appropriate for a given site; there's a certain amount of judgment call required there.

Gun answered 8/7, 2009 at 17:20 Comment(4)
I countered your down vote as well. I guess someone disagrees with us..lolRafaello
Judging by the accepted answer, the OP disagrees with you too.Stickpin
@ceejayoz: yup, apparently, however, I think both MitMaro and I answered the question the OP asked; as it turns out, they wanted specificity, but they didn't specify the specificity they wanted, so...Gun
@McWafflestix So you're going to leave the downvotes on the answers that correctly understood and answered the original poster's question as he intended it? Way to abuse the system...Stickpin
