Do Facebook's web-crawling bots respect the Crawl-delay: directive in robots.txt files?
We don't have a crawler. We have a scraper that scrapes meta data on pages that have like buttons/are shared on FB.
The question is about crawl-delay. If you think the answer about crawl-delay isn't important because it's a "scraper", you're the kind of Facebook employee that makes dealing with these Facebook bots a nightmare. – Mammalian
No, it doesn't respect robots.txt.
Contrary to other answers here, facebookexternalhit behaves like the meanest of crawlers. Whether it got the urls it requests from crawling or from like buttons doesn't matter so much when it goes through every one of those at an insane rate.
We sometimes get several hundred hits per second as it goes through almost every url on our site. It kills our servers every time. The funny thing is that when that happens, we can see that Googlebot slows down and waits for things to settle down before slowly ramping back up. facebookexternalhit, on the other hand, just continues to pound our servers, often harder than the initial bout that killed us.
We have to run much beefier servers than we actually need for our traffic, just because of facebookexternalhit. We've done tons of searching and can't find a way to slow them down.
How is that a good user experience, Facebook?
For a similar question, I offered a technical solution that simply rate-limits load based on the user-agent.
Code repeated here for convenience:
Since one cannot appeal to their hubris, and DROP'ing their IP block is pretty draconian, here is my technical solution.
In PHP, execute the following code as quickly as possible for every request.
define( 'FACEBOOK_REQUEST_THROTTLE', 2.0 ); // Number of seconds permitted between each hit from facebookexternalhit

if( !empty( $_SERVER['HTTP_USER_AGENT'] ) && preg_match( '/^facebookexternalhit/', $_SERVER['HTTP_USER_AGENT'] ) ) {
    $fbTmpFile = sys_get_temp_dir().'/facebookexternalhit.txt';
    if( $fh = fopen( $fbTmpFile, 'c+' ) ) {
        $lastTime  = fread( $fh, 100 );
        $microTime = microtime( TRUE );
        // check current microtime against microtime of last access
        if( $microTime - $lastTime < FACEBOOK_REQUEST_THROTTLE ) {
            // bail if requests are coming too quickly, with http 503 Service Unavailable
            header( $_SERVER["SERVER_PROTOCOL"].' 503' );
            die;
        } else {
            // write out the microsecond time of last access
            rewind( $fh );
            fwrite( $fh, $microTime );
        }
        fclose( $fh );
    } else {
        header( $_SERVER["SERVER_PROTOCOL"].' 503' );
        die;
    }
}
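At several hundred requests per second, two facebookexternalhit hits can read and rewrite that temp file at the same moment. Below is a minimal sketch of the same idea with flock() added so that only one request updates the timestamp at a time; the throttle value and file path are the same assumptions as above, and PHP's auto_prepend_file directive is one convenient way to run it "as quickly as possible for every request".

define( 'FACEBOOK_REQUEST_THROTTLE', 2.0 ); // seconds permitted between facebookexternalhit hits

if( !empty( $_SERVER['HTTP_USER_AGENT'] ) && preg_match( '/^facebookexternalhit/', $_SERVER['HTTP_USER_AGENT'] ) ) {
    $fbTmpFile = sys_get_temp_dir().'/facebookexternalhit.txt';
    $fh = fopen( $fbTmpFile, 'c+' );
    // refuse the request if the file can't be opened or another FB request holds the lock
    if( $fh === FALSE || !flock( $fh, LOCK_EX | LOCK_NB ) ) {
        header( $_SERVER['SERVER_PROTOCOL'].' 503 Service Unavailable' );
        die;
    }
    $lastTime  = (float) fread( $fh, 100 );
    $microTime = microtime( TRUE );
    if( $microTime - $lastTime < FACEBOOK_REQUEST_THROTTLE ) {
        // too soon since the last accepted hit: bail with 503
        flock( $fh, LOCK_UN );
        fclose( $fh );
        header( $_SERVER['SERVER_PROTOCOL'].' 503 Service Unavailable' );
        die;
    }
    // record the time of this accepted hit
    rewind( $fh );
    ftruncate( $fh, 0 );
    fwrite( $fh, (string) $microTime );
    flock( $fh, LOCK_UN );
    fclose( $fh );
}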
Facebook actually uses a scraper rather than a crawler, and you can check what it fetches from any URL yourself here:
http://developers.facebook.com/tools/debug
Facebook's cache lifespan for this data is variable, but in my experience it's between 24 and 48 hours.
You -can- however make the cache "invalidate" by adding a portion to your URL, so that users will share the new one, or by providing bit.ly (and the like) links, which has the same effect.
Since it's not actually crawling, you can't force it to delay a scrape (and you shouldn't, as this would make for a bad user experience: users would wait a while for the scraper to finish, and they would be handed a shareable link that is not pretty). You could, however, trigger the scraping manually at set intervals to ensure a better user experience (nobody waits for the data to be cached) and to balance the server load.
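For the "trigger the scraping manually" approach, the same debugger can be driven from code. Here is a rough sketch, assuming the Graph API still accepts the scrape=true parameter that the debug tool uses; the access token and page URL below are placeholders you would replace with your own.

// Ask Facebook to re-scrape a URL on your own schedule (e.g. from a cron job),
// so the Open Graph cache is refreshed before any real user has to wait for it.
$accessToken = 'APP_ID|APP_SECRET';            // app access token (placeholder)
$pageUrl     = 'https://example.com/article';  // URL whose Open Graph data should refresh

$ch = curl_init( 'https://graph.facebook.com/' );
curl_setopt_array( $ch, array(
    CURLOPT_POST           => TRUE,
    CURLOPT_POSTFIELDS     => http_build_query( array(
        'id'           => $pageUrl,
        'scrape'       => 'true',
        'access_token' => $accessToken,
    ) ),
    CURLOPT_RETURNTRANSFER => TRUE,
) );
$response = curl_exec( $ch ); // JSON describing the freshly scraped object
curl_close( $ch );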
Wow, all these years later and this is all still relevant. Facebookexternalhit is still a bad actor. Based on our logs, they most definitely are crawling. This would all be so much easier if they used a different UA for transactional metadata fetches vs. crawling!
We are dealing with googlebot, now googleother, and facebookexternalhit all fighting for dominance over our server resources. So far, all attempts to resolve this by contacting the companies have gone into /dev/null, leaving us with no choice but to block their entire ASNs - not cool at all!
If you are running on an Ubuntu server with the ufw firewall, you may want to try
ufw limit proto tcp from 31.13.24.0/21 to any port 80
for each of these address ranges: 31.13.24.0/21 31.13.64.0/18 66.220.144.0/20 69.63.176.0/20 69.171.224.0/19 74.119.76.0/22 103.4.96.0/22 173.252.64.0/18 204.15.20.0/22
as shown here: What's the IP address range of Facebook's Open Graph crawler?
The user agent is facebookexternalhit/*, where * is a version number. See: facebook.com/externalhit_uatext.php and developers.facebook.com/docs/best-practices/… – Aculeus
Crawl-delay is related to how fast a web crawler visits URLs for a site. So if you have, say, 100 URLs on your site, Crawl-delay ensures those URLs don't all get hit simultaneously; rather, they get hit at whatever interval the crawl delay specifies. So for 100 pages at 15 seconds between each, there is a 25-minute "rest" period. The intent is to not overwhelm a site. – Aculeus
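For reference, this is what the directive under discussion looks like in robots.txt, using the 15-second figure from the comment above; as the answers explain, facebookexternalhit simply ignores it:

User-agent: *
Crawl-delay: 15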