The Facebook Crawler is hitting my servers multiple times every second and it seems to be ignoring both the Expires header and the og:ttl property.
In some cases, it is accessing the same og:image resource multiple times over the space of 1-5 minutes. In one example, the crawler accessed the same image 12 times over the course of 3 minutes using 12 different IP addresses.
I only had to log requests for 10 minutes before I caught the following example (see the logging sketch after the list):
List of times and crawler IP addresses for one image:
2018-03-30 15:12:58 - 66.220.156.145
2018-03-30 15:13:13 - 66.220.152.7
2018-03-30 15:12:59 - 66.220.152.100
2018-03-30 15:12:18 - 66.220.155.248
2018-03-30 15:12:59 - 173.252.124.29
2018-03-30 15:12:15 - 173.252.114.118
2018-03-30 15:12:42 - 173.252.85.205
2018-03-30 15:13:01 - 173.252.84.117
2018-03-30 15:12:40 - 66.220.148.100
2018-03-30 15:13:10 - 66.220.148.169
2018-03-30 15:15:16 - 173.252.99.50
2018-03-30 15:14:50 - 69.171.225.134
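For reference, the requests above were caught with some simple logging in the image script. A minimal sketch of that kind of logging (the log path is just a placeholder, not my actual setup):

<?php
// Sketch: log every request the Facebook crawler makes to the image script.
// The crawler identifies itself with a "facebookexternalhit" User-Agent.
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($userAgent, 'facebookexternalhit') !== false) {
    $line = sprintf(
        "%s - %s - %s\n",
        date('Y-m-d H:i:s'),
        $_SERVER['REMOTE_ADDR'],
        $_SERVER['REQUEST_URI']
    );
    file_put_contents('/var/log/fb-crawler.log', $line, FILE_APPEND); // placeholder path
}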
What the og:image is according to Facebook's documentation:
The URL of the image that appears when someone shares the content to Facebook. See below for more info, and check out our best practices guide to learn how to specify a high quality preview image.
The images that I use in og:image have an Expires header set to +7 days in the future. Recently, I changed that to +1 year in the future. Neither setting seems to make any difference. These are the headers that the crawler appears to be ignoring:
Cache-Control: max-age=604800
Content-Length: 31048
Content-Type: image/jpeg
Date: Fri, 30 Mar 2018 15:56:47 GMT
Expires: Sat, 30 Mar 2019 15:56:47 GMT
Pragma: public
Server: nginx/1.4.6 (Ubuntu)
Transfer-Encoding: chunked
X-Powered-By: PHP/5.5.9-1ubuntu4.23
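Those headers come from the PHP script that serves the image. A simplified sketch of what that script does (the file path and ?id= lookup are placeholders, not the real code):

<?php
// Simplified sketch of image.php: read a stored JPEG and send it with the
// caching headers shown above. The path would be resolved from ?id= in reality.
$path = '/var/www/img/1234.jpg'; // placeholder

header('Content-Type: image/jpeg');
header('Cache-Control: max-age=604800');                                     // 7 days
header('Expires: ' . gmdate('D, d M Y H:i:s', time() + 31536000) . ' GMT');  // +1 year
header('Pragma: public');

readfile($path);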
According to Facebook's Object Properties documentation, the og:ttl property is:
Seconds until this page should be re-scraped. Use this to rate limit the Facebook content crawlers. The minimum allowed value is 345600 seconds (4 days); if you set a lower value, the minimum will be used. If you do not include this tag, the ttl will be computed from the "Expires" header returned by your web server, otherwise it will default to 7 days.
I have set this og:ttl property to 2419200 seconds, which is 28 days.
I have been tempted to use something like this:
header("HTTP/1.1 304 Not Modified");
exit;
But my fear would be that Facebook's crawler would ignore the header and mark the image as broken, thereby removing the image preview from the shared story.
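If I did try that, presumably it should only be sent in response to a conditional request. A rough sketch, assuming the crawler sends If-Modified-Since (which I have not verified), again with a placeholder path:

<?php
// Sketch: only short-circuit with 304 when the client sent a conditional
// request and the image has not changed since then.
$lastModified = filemtime('/var/www/img/1234.jpg'); // placeholder path

if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])
    && strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) >= $lastModified) {
    header('HTTP/1.1 304 Not Modified');
    exit;
}
// ...otherwise fall through to the normal image output with the full headers.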
I have also recorded a video showing the rate at which these requests from the crawler are coming in.
Is there a way to prevent the crawler from coming back to hit these resources so soon?
Example markup showing what my Open Graph and meta properties look like:
<meta property="fb:app_id" content="MyAppId" />
<meta property="og:locale" content="en_GB" />
<meta property="og:type" content="website" />
<meta property="og:title" content="My title" />
<meta property="og:description" content="My description" />
<meta property="og:url" content="http://example.com/index.php?id=1234" />
<link rel="canonical" href="http://example.com/index.php?id=1234" />
<meta property="og:site_name" content="My Site Name" />
<meta property="og:image" content="http://fb.example.com/img/image.php?id=123790824792439jikfio09248384790283940829044" />
<meta property="og:image:width" content="940"/>
<meta property="og:image:height" content="491"/>
<meta property="og:ttl" content="2419200" />
Comment from Perithecium:

"X-Powered-By: PHP/..." - so you are serving dynamically created images via PHP then, or is this just for the purpose of dynamically adding those headers (and no actual image manipulation/resizing is going on, and the image data is just read and piped through)? Does this respond in a reasonably quick fashion, or is it more a case of "yeah, gimme a sec ..."? Does anything change when you switch this out for a static image, where you let the web server handle all the header stuff? I see no Content-Length header above - did you just leave this out here, or is your system not sending one? – Perithecium