Facebook and Crawl-delay in Robots.txt?
Do Facebook's web-crawling bots respect the Crawl-delay: directive in robots.txt files?
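
For reference, a minimal sketch of the directive in question (the facebookexternalhit user-agent and the 15-second value are just examples):

User-agent: facebookexternalhit
Crawl-delay: 15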

Aculeus answered 10/10, 2011 at 17:37 Comment(4)
I wasn't aware facebook had bots... interesting!Lumbye
facebookexternalhit/* where * is a version number. See: facebook.com/externalhit_uatext.php and developers.facebook.com/docs/best-practices/…Aculeus
Doesn't Facebook only crawl an item once when it's added? I recall several instances where you had to explicitly get Facebook to crawl an item again to get it to update its copy.Gilberto
That's not Crawl-delay. Crawl-delay is related to how fast a web crawler visits urls for a site. So if you have, say, 100 urls on your site, Crawl-delay ensures that those urls don't all get hit simultaneously. Rather, they will get hit at whatever interval the crawl delay specifies. So for 100 pages at 15 seconds between each, the full pass is spread over 25 minutes. The intent is to not overwhelm a site.Aculeus
Score: -5

We don't have a crawler. We have a scraper that scrapes meta data on pages that have like buttons/are shared on FB.

Jitney answered 17/10, 2011 at 17:41 Comment(6)
"It depends on what the meaning of the word 'is' is." Thanks for the non-answer. Pulling down 100 pages in a few seconds is a crawl, whatever you want to call it. Clearly this "scraper" goes apenuts and starts pulling down LOTS of pages that have nothing to do with the link being posted. Or FB is stealthily creating a search competitor to Google. Or maybe someone else's crawler is executing external LIKE buttons? Something crazy is going on.Lauricelaurie
I removed open graph and this facebookexternalhit/1.1 stopped messing up my database connection on two sites that have been running smoothly on two different hosts for 8 years each.Insomuch
This isn't a particularly helpful answer. We've seen a substantial amount of crawler/spider-like behaviour from facebook's servers. Requests are made at what could be considered an abusive rate, causing the database to lock up etc.Rida
According to your own documentation you do have a crawler: developers.facebook.com/docs/sharing/webmasters/crawlerFrisse
Please note this longstanding and abhorrent bug in the Facebook system, that sends both scrape traffic with no UA (GROSS) and also floods of scrape traffic that can take down your site developers.facebook.com/bugs/1654459311255613Mammalian
And also gross answer here for 1) not specifying your role in FB in the answer, since you say "we". 2) nitpicking about the language, you could have clarified without being dismissive. 3) Not answering the question at all since clearly the info needed is whether the "scraper" respects crawl-delay. If you think the answer about crawl-delay isn't important because it's a "scraper" you're the kind of Facebook employee that makes dealing with these Facebook bots a nightmare.Mammalian
Score: 17

No, it doesn't respect robots.txt

Contrary to other answers here, facebookexternalhit behaves like the meanest of crawlers. Whether it got the urls it requests from crawling or from like buttons doesn't matter so much when it goes through every one of those at an insane rate.

We sometimes get several hundred hits per second as it goes through almost every url on our site. It kills our servers every time. The funny thing is that when that happens, we can see that Googlebot slows down and waits for things to settle down before slowly ramping back up. facebookexternalhit, on the other hand, just continues to pound our servers, often harder than the initial bout that killed us.

We have to run much beefier servers than we actually need for our traffic, just because of facebookexternalhit. We've done tons of searching and can't find a way to slow them down.

How is that a good user experience, Facebook?

Dinnage answered 6/2, 2012 at 18:27 Comment(2)
One wishes to pay them back with a reverse-slow-loris when seeing such behaviour... but with their infrastructure they wouldn't even notice.Oudh
For some reason, SO won't let me comment on another answer, but Hank's answer is great and similar to what we implemented (though using a custom Django middleware).Dinnage
Score: 6

For a similar question, I offered a technical solution that simply rate-limits load based on the user-agent.

Code repeated here for convenience:

Since one cannot appeal to their hubris, and DROP'ing their IP block is pretty draconian, here is my approach.

In PHP, execute the following code as early as possible for every request.

define( 'FACEBOOK_REQUEST_THROTTLE', 2.0 ); // Minimum number of seconds permitted between hits from facebookexternalhit

if( !empty( $_SERVER['HTTP_USER_AGENT'] ) && preg_match( '/^facebookexternalhit/', $_SERVER['HTTP_USER_AGENT'] ) ) {
    $fbTmpFile = sys_get_temp_dir().'/facebookexternalhit.txt';
    if( $fh = fopen( $fbTmpFile, 'c+' ) ) {
        flock( $fh, LOCK_EX ); // serialize concurrent scraper hits so the read/check/write below is atomic
        $lastTime = (float) fread( $fh, 100 );
        $microTime = microtime( TRUE );
        // check the current microtime against the microtime of the last access
        if( $microTime - $lastTime < FACEBOOK_REQUEST_THROTTLE ) {
            // requests are coming too quickly: bail with HTTP 503 Service Unavailable
            fclose( $fh ); // releases the lock
            header( $_SERVER['SERVER_PROTOCOL'].' 503 Service Unavailable' );
            die;
        } else {
            // record the microsecond time of this access for the next check
            rewind( $fh );
            fwrite( $fh, $microTime );
        }
        fclose( $fh ); // releases the lock
    } else {
        // could not open the throttle file; fail closed rather than wave the scraper through
        header( $_SERVER['SERVER_PROTOCOL'].' 503 Service Unavailable' );
        die;
    }
}
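
To apply this site-wide without touching CMS code, one option is PHP's auto_prepend_file directive, which runs a script before every request. A sketch, assuming mod_php and a hypothetical script path:

; php.ini (or use php_value auto_prepend_file ... in .htaccess under mod_php)
auto_prepend_file = /var/www/shared/fb-throttle.php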
Lauricelaurie answered 7/11, 2012 at 19:30 Comment(6)
I am a total idiot with PHP. Is there a quick guide on where to stick this. My site is a php site using a CMS and smarty. It has an index.php file that calls the plugins. So do I make this an include and call it?Insomuch
@Insomuch if you are using a PHP CMS, it likely has a PHP configuration file (ie 'config.php') that defines database connection variables. I would put it in that file as that is not likely to get overwritten during upgrades, etc.Lauricelaurie
WARNING: When a URL is first shared on FB, it does an initial scrape of the open graph meta tags and grabs a copy of the og:image (featured image) for the URL. If you block these initial requests, your FB sharing previews will be completely broken, a state which can last days/weeks in many situations. You REALLY don't want to accidentally block these requests, which means a rate-limiting solution like the one in this answer is dangerous, as it has no way of knowing if it's a first-scrape or a re-scrape of the URL. FACEBOOK SUCKS for putting us in this situation.Mammalian
@Mammalian - it's true what you say, but what alternative is there except to continue highlighting the issues here and in the bug report here: developers.facebook.com/support/bugs/1654459311255613Pearly
Hey Sol. Yeah that’s just it, we have no option other than complaining. Anything we use to block it is likely to kill our Facebook shares. I just want to warn people that clever blocking will have surprising side effects.Mammalian
I just want to come back to this thread and say this is still working for me :-) Facebookexternalhit is still trawling my servers. If only there was a server wide method to apply this (Apache)Pearly
Score: 3

Facebook actually uses a scrape-and-cache mechanism that you can check for yourself here:

http://developers.facebook.com/tools/debug

Facebook's cache lifespan for this data is variable, but in my experience it's between 24 and 48 hours.

You can, however, invalidate the cache by adding a component to your url so that users will share the new one, or you can provide bit.ly (and the like) links that will have the same effect.

Since it's not actually crawling, you can't force it to delay a scrape (and you shouldn't, as this would create a bad user experience: users would wait a while for the scraper to finish, only to be handed a shareable preview that isn't pretty). You could, however, manually trigger the scraping at set intervals to ensure a better user experience (users wouldn't wait for data to be cached) and to balance server load.
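
If you do go the manual-refresh route, here is a sketch of one way to trigger a re-scrape via the Graph API's scrape parameter (the page URL is hypothetical, and newer API versions require an access token):

curl -X POST \
  -F "id=https://example.com/my-page" \
  -F "scrape=true" \
  -F "access_token=YOUR_APP_TOKEN" \
  https://graph.facebook.com/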

Scutari answered 20/10, 2011 at 10:22 Comment(1)
Doesn't really answer the question or give any help dealing with an overload of Facebook traffic. Facebook's documentation of their bot's behavior and the actual behavior diverge massively. See this bug report for examples developers.facebook.com/bugs/1654459311255613Mammalian
Score: 1

Wow, all these years later and this is all still relevant. facebookexternalhit is still a bad actor; based on our logs, they most definitely are crawling. This would all be so much easier if they used a different UA for transactional metadata fetches vs crawling!

We are dealing with googlebot (and now googleother) and facebookexternalhit all fighting for dominance over our server resources. So far ALL attempts to resolve this by contacting the companies have gone into /dev/null, leaving us with no choice but to block their entire ASNs. NOT cool at all!
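
For anyone else forced down the ASN-blocking road, here is a sketch of one way to enumerate the ranges (assuming the whois client is installed; AS32934 is Facebook's ASN):

# list route objects registered for Facebook's AS32934
whois -h whois.radb.net -- '-i origin AS32934' | grep route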

Sharma answered 29/6, 2024 at 19:57 Comment(0)
Score: 0

If you are running Ubuntu Server with the ufw firewall, you may want to try:

ufw limit proto tcp from 31.13.24.0/21 to any port 80

for all of these IP ranges: 31.13.24.0/21, 31.13.64.0/18, 66.220.144.0/20, 69.63.176.0/20, 69.171.224.0/19, 74.119.76.0/22, 103.4.96.0/22, 173.252.64.0/18, 204.15.20.0/22

as shown here: What's the IP address range of Facebook's Open Graph crawler?
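
To cover every range in one go, a small shell loop works (a sketch; run as root, and note these 2014-era ranges may have changed since):

for net in 31.13.24.0/21 31.13.64.0/18 66.220.144.0/20 69.63.176.0/20 \
    69.171.224.0/19 74.119.76.0/22 103.4.96.0/22 173.252.64.0/18 204.15.20.0/22
do
    ufw limit proto tcp from "$net" to any port 80
done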

Commonly answered 29/5, 2014 at 18:53 Comment(0)