Facebook crawler is hitting my server hard and ignoring directives. Accessing same resources multiple times

The Facebook Crawler is hitting my servers multiple times every second and it seems to be ignoring both the Expires header and the og:ttl property.

In some cases, it is accessing the same og:image resource multiple times over the space of 1-5 minutes. In one example - the crawler accessed the same image 12 times over the course of 3 minutes using 12 different IP addresses.

I only had to log requests for 10 minutes before I caught the following example:

List of times and crawler IP addresses for one image:

2018-03-30 15:12:58 - 66.220.156.145
2018-03-30 15:13:13 - 66.220.152.7
2018-03-30 15:12:59 - 66.220.152.100
2018-03-30 15:12:18 - 66.220.155.248
2018-03-30 15:12:59 - 173.252.124.29
2018-03-30 15:12:15 - 173.252.114.118
2018-03-30 15:12:42 - 173.252.85.205
2018-03-30 15:13:01 - 173.252.84.117
2018-03-30 15:12:40 - 66.220.148.100
2018-03-30 15:13:10 - 66.220.148.169
2018-03-30 15:15:16 - 173.252.99.50
2018-03-30 15:14:50 - 69.171.225.134

What the og:image is according to Facebook's documentation:

The URL of the image that appears when someone shares the content to Facebook. See below for more info, and check out our best practices guide to learn how to specify a high quality preview image.

The images that I use in og:image have an Expires header set to +7 days in the future. Recently, I changed that to +1 year in the future. Neither setting seems to make any difference. These are the headers that the crawler seems to be ignoring:

Cache-Control: max-age=604800
Content-Length: 31048
Content-Type: image/jpeg
Date: Fri, 30 Mar 2018 15:56:47 GMT
Expires: Sat, 30 Mar 2019 15:56:47 GMT
Pragma: public
Server: nginx/1.4.6 (Ubuntu)
Transfer-Encoding: chunked
X-Powered-By: PHP/5.5.9-1ubuntu4.23
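
For reference, the image is served by a PHP script that resizes it before output. A simplified sketch of how caching headers like the ones above can be set from such a script (the file path below is illustrative only):

<?php
// Illustrative sketch only - the real script also generates/resizes the image first.
$file = '/var/www/img/cache/123.jpg'; // hypothetical path to the resized image

header('Content-Type: image/jpeg');
header('Content-Length: ' . filesize($file));
header('Cache-Control: max-age=604800');
header('Expires: ' . gmdate('D, d M Y H:i:s', time() + 31536000) . ' GMT'); // +1 year
header('Pragma: public');

readfile($file);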

According to Facebook's Object Properties documentation, the og:ttl property is:

Seconds until this page should be re-scraped. Use this to rate limit the Facebook content crawlers. The minimum allowed value is 345600 seconds (4 days); if you set a lower value, the minimum will be used. If you do not include this tag, the ttl will be computed from the "Expires" header returned by your web server, otherwise it will default to 7 days.

I have set this og:ttl property to 2419200, which is 28 days in the future.

I have been tempted to use something like this:

header("HTTP/1.1 304 Not Modified"); 
exit;

But my fear would be that Facebook's Crawler would ignore the header and mark the image as broken - thereby removing the image preview from the shared story.

A video showing the rate at which these requests from the Crawler are coming in.

Is there a way to prevent the crawler from coming back to hit these resources so soon?

Example code showing what my open graph and meta properties look like:

<meta property="fb:app_id" content="MyAppId" />
<meta property="og:locale" content="en_GB" />
<meta property="og:type" content="website" />
<meta property="og:title" content="My title" />
<meta property="og:description" content="My description" />
<meta property="og:url" content="http://example.com/index.php?id=1234" />
<link rel="canonical" href="http://example.com/index.php?id=1234" />
<meta property="og:site_name" content="My Site Name" />
<meta property="og:image" content="http://fb.example.com/img/image.php?id=123790824792439jikfio09248384790283940829044" />
<meta property="og:image:width" content="940"/>
<meta property="og:image:height" content="491"/>
<meta property="og:ttl" content="2419200" />
Tsarevna asked 30/3, 2018 at 16:2 Comment(11)
What does the FB debug tool say - is it able to properly read those resources without any issues in the first place? (Just saying: for any TTL/caching directives to work, they would need to have been read correctly in the first place.)Perithecium
@Perithecium I've tested the shared links on their Sharing Debugger and all of the Open Graph properties are being scraped just fine.Tsarevna
@Perithecium What I'm dealing with here: scontent-lhr3-1.xx.fbcdn.net/v/t39.2087-6/…Tsarevna
X-Powered-By: PHP/... - so you are serving dynamically created images via PHP then, or is this just for the purpose of dynamically adding those headers (and no actual image manipulation/resizing is going on, and the image data is just read and piped through)? Does this respond in a reasonably quick fashion, or is it more a case of, "yeah, gimme a sec ..."? Does anything change when you switch this out for a static image, where you let the web server handle all header stuff? I see no Content-Length header above, did you just leave this out here, or is your system not sending one?Perithecium
@Perithecium The images are dynamically generated. I resize them down to cut down on the filesize. Example: I just loaded one image and it was 44.1KB. It took 400ms to download (server is in the US and I am in Ireland). I have added the Content-Length header to my images. My haproxy stats are showing me an average response time of about 50ms.Tsarevna
Using Pingdom, I can download the image in New York in 200ms.Tsarevna
Does the url "og:image" change or is it constant? How is it formed? Can you add an example of all the meta properties?Whence
The og:image URL is constant - it doesn't change unless I make a structural change to the site myself. I've added a sample of the open graph and meta properties that I use.Tsarevna
Maybe the GET parameter in the image link is messing things up - have you tried a static image?Learned
Can you try removing the "/" from the meta tags ending "/>"?Whence
Somewhat old but possibly relevant: #11522298Guillaume

After I tried almost everything else with caching, headers and what not, the only thing that saved our servers from the "overly enthusiastic" Facebook crawler (user agent facebookexternalhit) was simply denying access and sending back an HTTP/1.1 429 Too Many Requests response whenever the crawler "crawled too much".

Admittedly, we had thousands of images we wanted the crawler to crawl, but the Facebook crawler was practically DDoSing our server with tens of thousands of requests per hour (yes, the same URLs over and over). I remember it reaching 40,000 requests per hour from different Facebook IP addresses using the facebookexternalhit user agent at one point.

We did not want to block the crawler entirely, and blocking by IP address was also not an option. We only needed the FB crawler to back off (quite) a bit.

This is a piece of PHP code we used to do it:

.../images/index.php

<?php

// Number of requests permitted for the Facebook crawler per second.
const FACEBOOK_REQUEST_THROTTLE = 5;
const FACEBOOK_REQUESTS_JAR = __DIR__ . '/.fb_requests';
const FACEBOOK_REQUESTS_LOCK = __DIR__ . '/.fb_requests.lock';

// Acquire an exclusive lock and return the handle so the lock is held for the
// rest of the request (it is released automatically when the handle is freed).
function handle_lock($lockfile) {
    $handle = fopen($lockfile, 'w');
    flock($handle, LOCK_EX);
    return $handle;
}

$ua = $_SERVER['HTTP_USER_AGENT'] ?? false;
if ($ua && strpos($ua, 'facebookexternalhit') !== false) {

    $lock = handle_lock(FACEBOOK_REQUESTS_LOCK);

    // The "jar" holds two lines: the current second and the number of
    // crawler requests already seen during that second.
    $jar = @file(FACEBOOK_REQUESTS_JAR, FILE_IGNORE_NEW_LINES);
    $currentTime = time();
    $timestamp = (int)($jar[0] ?? $currentTime);
    $count = (int)($jar[1] ?? 0);

    if ($timestamp == $currentTime) {
        $count++;
    } else {
        $count = 0;
    }

    file_put_contents(FACEBOOK_REQUESTS_JAR, "$currentTime\n$count");

    // Too many crawler hits within this second: tell the crawler to back off.
    if ($count >= FACEBOOK_REQUEST_THROTTLE) {
        header("HTTP/1.1 429 Too Many Requests", true, 429);
        header("Retry-After: 60");
        die;
    }

}

// Everything under this comment happens only if the request is "legit".

$filePath = $_SERVER['DOCUMENT_ROOT'] . $_SERVER['REQUEST_URI'];
if (is_readable($filePath)) {
    header("Content-Type: image/png");
    readfile($filePath);
}

You also need to configure rewriting to pass all requests directed at your images to this PHP script:

.../images/.htaccess (if you're using Apache)

RewriteEngine On
RewriteRule .* index.php [L] 

The crawler seemed to "understand" this approach and effectively reduced the attempt rate from tens of thousands of requests per hour to hundreds/thousands of requests per hour.

Lachrymator answered 9/4, 2018 at 11:33 Comment(5)
Awarding this answer the bounty because of its solution. Unfortunately, it seems like there is no way to tell the crawler to back off other than to resort to this kind of method. I guess that I will have to test out whether this method will result in my images being removed from the preview.Tsarevna
It may be a better approach to rate limit at server level instead of app level. When rate limiting at application level you add overhead.Trainor
@Emil That is true - it would be better to deal with this at "server level"; unfortunately I've not been able to find an Apache module that would allow me to do user-agent-based rate limiting. Also, this really is a trivial piece of code and I dare say the overhead is relatively negligible.Lachrymator
@Lachrymator nginx has a rate limiting module. Serving assets via PHP is always much less efficient than serving them directly from the web server (especially if you're constantly locking a single file).Thumbsdown
The Facebook developers guide suggests to "Use the og:ttl object property to limit crawler access if our crawler is being too aggressive." - Although this does already seem to be present in the OP's code in the question. (?)Mnemosyne

I received word back from the Facebook team themselves. Hopefully, it clarifies how the crawler treats image URLs.

Here it goes:

The Crawler treats image URLs differently than other URLs.

We scrape images multiple times because we have different physical regions, each of which need to fetch the image. Since we have around 20 different regions, the developer should expect ~20 calls for each image. Once we make these requests, they stay in our cache for around a month - we need to rescrape these images frequently to prevent abuse on the platform (a malicious actor could get us to scrape a benign image and then replace it with an offensive one).

So basically, you should expect that the image specified in og:image will be hit 20 times after it has been shared. Then, a month later, it will be scraped again.

Tsarevna answered 9/4, 2018 at 21:12 Comment(1)
Quite interesting stuff. Thanks :)Lachrymator

Blindly sending a 304 Not Modified header does not make much sense and can confuse Facebook's crawler even more. If you really decide to just block some requests, you may consider a 429 Too Many Requests response - it will at least clearly indicate what the problem is.

As a gentler solution you may try:

  • Add a Last-Modified header with some static value. Facebook's crawler may be clever enough to detect that it should ignore the Expires header for constantly changing content, but not clever enough to handle a missing header properly.
  • Add an ETag header with proper 304 Not Modified support.
  • Change the Cache-Control header to max-age=315360000, public, immutable if the image is static.

You may also consider saving a cached copy of the image and serving it via the web server without involving PHP. If you change the URLs to something like http://fb.example.com/img/image/123790824792439jikfio09248384790283940829044, you can create a fallback for nonexistent files with rewrite rules:

RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^img/image/([0-9a-z]+)$ img/image.php?id=$1 [L]

Only the first request should be handled by PHP, which will save a cached file for the requested URL (for example in /img/image/123790824792439jikfio09248384790283940829044). For all further requests, the web server should take care of serving the content from the cached file, sending proper headers and handling 304 Not Modified. You may also configure rate limiting in nginx - it should be more efficient than delegating image serving to PHP.
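
A minimal sketch of what that fallback img/image.php could look like (generate_image() is a hypothetical stand-in for your existing image generation/resizing code, and the script assumes the img/image/ directory exists and is writable):

<?php
// Sketch: generate the image once, persist it under the requested URL,
// and let the web server serve the static file on all further requests.
// generate_image() is hypothetical - replace it with the real generation code.

$id = $_GET['id'] ?? '';
if (!preg_match('/^[0-9a-z]+$/', $id)) {
    http_response_code(404);
    exit;
}

$cacheFile = __DIR__ . '/image/' . $id; // same path the public URL /img/image/<id> maps to

if (!is_file($cacheFile)) {
    $jpegData = generate_image($id);                   // hypothetical helper
    file_put_contents($cacheFile, $jpegData, LOCK_EX);
}

header('Content-Type: image/jpeg');
header('Cache-Control: max-age=604800, public');
readfile($cacheFile);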

Thumbsdown answered 7/4, 2018 at 21:38 Comment(1)
I will give this a go!Tsarevna

It would appear that Facebook's crawlers aren't always that respectful. In the past we've implemented the suggestion here: excessive traffic from facebookexternalhit bot.

It's not the best solution, as it would be nicer if Facebook rate-limited its own requests, but clearly they don't do that.

Reynold answered 2/4, 2018 at 20:5 Comment(1)
It seems this is still a problem and Facebook appear to be aware of the issue, in addition to the OP's original comment about the FB crawler, it also seems FB sends a lot of traffic without the relevant crawler headers (in other words, no user agent) which is extremely annoying; Bug report here: developers.facebook.com/support/bugs/1654459311255613Whittington

I wanted to stop Facebook from crawling my site, while still allowing it to fetch a page when someone actually shared a link to Facebook.

Users can upload their own content, so my site can have hundreds of thousands of pages, all of them linked to one another - document to user to followers to their documents, and so on. Facebook was crawling like crazy.

I ended up creating a special template with no links on it. I serve this when the user agent contains "facebookexternalhit". That way the page still has the same content, but Facebook stops there.

It took a few days to slow down, presumably because Facebook had cached links to visit, but the traffic is now 5% of what it was. Shared links still correctly show the thumbnail and page title as Facebook fetches the needed data when the link is shared.

Edit: I found that Facebook doesn't just use the "facebookexternalhit" user agent. I started getting a lot of hits from Facebook IPs using the user agent "Python/3.10 aiohttp/3.9.3". I therefore serve the no-link page when either of the following holds (a rough sketch follows the list):

  • User agent contains facebookexternalhit
  • IP address contains "::face:"
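
Roughly, the check looks like this (a PHP sketch of the idea; the template path is just a placeholder, and you would adapt it to your own stack):

<?php
// Sketch of the detection; the template path is a placeholder.
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
$ip = $_SERVER['REMOTE_ADDR'] ?? '';

$isFacebook = (stripos($ua, 'facebookexternalhit') !== false)
    || (strpos($ip, '::face:') !== false);

if ($isFacebook) {
    // Same metadata and content, but no internal links for the crawler to follow.
    include __DIR__ . '/templates/no-links.php'; // placeholder path
    exit;
}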
Achromatize answered 13/6 at 17:3 Comment(0)

According to Facebook's documentation, only the Facebot crawler respects crawling directives. However, they also suggest this:

You can target one of these user agents to serve the crawler a nonpublic version of your page that has only metadata and no actual content. This helps optimize performance and is useful for keeping paywalled content secure.

Some people suggest rate limiting access for facebookexternalhit; however, I doubt that is a good idea, since it may prevent the crawler from updating the content.

Seeing multiple hits from different IPs but the same bot may be acceptable, depending on their architecture. You should check how often the same resource actually gets crawled (a quick way to check is sketched below). og:ttl is what the documentation recommends and should help.
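
For example, a rough PHP sketch that counts facebookexternalhit hits per URL from a combined-format access log - the log path and format are assumptions, so adjust them to your setup:

<?php
// Rough sketch: count facebookexternalhit hits per URL in a combined access log.
// The log path and format are assumptions - adjust them to your own setup.
$log = '/var/log/nginx/access.log';
$counts = [];

foreach (file($log, FILE_IGNORE_NEW_LINES) as $line) {
    if (strpos($line, 'facebookexternalhit') === false) {
        continue;
    }
    // Request line in a combined log looks like: "GET /img/image.php?id=... HTTP/1.1"
    if (preg_match('/"(?:GET|HEAD) (\S+)/', $line, $m)) {
        $counts[$m[1]] = ($counts[$m[1]] ?? 0) + 1;
    }
}

arsort($counts);
foreach (array_slice($counts, 0, 20, true) as $url => $hits) {
    echo "$hits  $url\n";
}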

Trainor answered 7/4, 2018 at 16:18 Comment(1)
Also it looks like Facebook team doesn't really care how their crawler affects your server.Trainor

If the FB crawlers ignore your cache headers, adding an "ETag" header could be used in this case to return correct 304 responses and reduce the load on your server.

The first time you generate an image, calculate the hash of that image (for example using md5) and send it as the "ETag" response header. If your server receives a request with an "If-None-Match" header, check whether you have already returned that hash. If you have, return a 304 response. If not, generate the image.

Checking whether you have already returned a given hash (while avoiding generating the image again) means that you'll need to store the hash somewhere... Maybe save the images in a tmp folder and use the hash as the file name?
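
A rough sketch of that flow (render_image() is a hypothetical stand-in for the image generation code, and the marker files are just one way of remembering which hashes have already been served):

<?php
// Sketch of ETag / If-None-Match handling for a dynamically generated image.
// render_image() is hypothetical; the marker files simply record hashes already served.

$id = $_GET['id'] ?? '';
$markerDir = sys_get_temp_dir() . '/og-image-etags'; // hypothetical location
@mkdir($markerDir, 0755, true);

$clientEtag = trim($_SERVER['HTTP_IF_NONE_MATCH'] ?? '', '" ');

// If the client sent a hash we have already served, answer 304 without
// regenerating the image. Only accept well-formed md5 hashes.
if (preg_match('/^[0-9a-f]{32}$/', $clientEtag) && is_file("$markerDir/$clientEtag")) {
    http_response_code(304);
    exit;
}

$jpegData = render_image($id);    // hypothetical: returns the JPEG bytes
$etag = md5($jpegData);
touch("$markerDir/$etag");        // remember that this hash has been served

header('Content-Type: image/jpeg');
header('ETag: "' . $etag . '"');
echo $jpegData;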

More info about "ETag" + "If-None-Match" headers.

Cruciferous answered 7/4, 2018 at 18:4 Comment(1)
Are you sure the Facebook crawler follows these headers?Trainor

Facebook's documentation specifically states: "Images are cached based on the URL and won't be updated unless the URL changes." This means it doesn't matter which headers or meta tags you add to your page - the bot is supposed to cache the image anyway.

This made me think:

  1. Does each user share a slightly different URL of your page? This will cause the share image to get re-cached each time.
  2. Is your share image accessed using a slightly different URL?
  3. Maybe the image is being linked differently somewhere?

I'd monitor the page logs and see exactly what happens - if the page URL or the image URL is even slightly different, the caching mechanism won't work. Luckily, this doesn't seem like a headers/tags type of issue.

Asparagine answered 8/4, 2018 at 9:12 Comment(2)
As the OP mentioned the URL doesn't change.Trainor
Hi Walter. The feedback I've received from Facebook says differently. Please see my answer here: https://mcmap.net/q/529871/-facebook-crawler-is-hitting-my-server-hard-and-ignoring-directives-accessing-same-resources-multiple-times They hit the same image over 20 times because they have 20 different regions. They then return to recrawl that image a month later. My images all have the same URL unless I make a site change that modifies them.Tsarevna

@Nico suggests

We had the same problems on our website/server. The problem was the og:url meta tag. After removing it, the problem was solved for most facebookexternalhit calls.

So you could try removing that and see if it fixes the problem.

Authority answered 9/4, 2018 at 7:18 Comment(1)
This is not a good idea, since Facebook associates all activity for a URL with the value of that tag.Trainor

I am also facing the exact same problem with the Facebook crawler.

In my case I have a WordPress site, and I found this great plugin that solved my performance issue: WordPress Plugin

I am using it with 60 seconds of permitted traffic between each hit from facebookexternalhit.

Idem answered 26/1 at 9:48 Comment(0)
