CloudFront Custom Origin Is Causing Duplicate Content Issues

I am using CloudFront to serve the images, CSS, and JS files for my website, using the custom origin option with subdomains CNAMEd to my account. It works pretty well.

Main site: www.mainsite.com

  1. static1.mainsite.com
  2. static2.mainsite.com

Sample page: www.mainsite.com/summary/page1.htm

This page calls an image from static1.mainsite.com/images/image1.jpg

If CloudFront has not already cached the image, it fetches it from www.mainsite.com/images/image1.jpg

This all works fine.

The problem is that Google Alerts has reported the page as being found at both:

www.mainsite.com/summary/page1.htm
static1.mainsite.com/summary/page1.htm

The page should only be accessible from the www. site. Pages should not be accessible from the CNAME domains.

I have tried putting a mod_rewrite rule in the .htaccess file, and I have also tried putting an exit() in the main script file.

But when CloudFront does not find the static1 version of the file in its cache, it requests it from the main site and then caches it.

My questions, then, are:

1. What am I missing here?
2. How do I prevent my site from serving full pages to CloudFront instead of just the static assets?
3. How do I delete the pages from CloudFront? Just let them expire?

Thanks for your help.

Joe

Kalagher answered 6/1, 2012 at 4:26

[I know this thread is old, but I'm answering it for people like me who see it months later.]

From what I've read and seen, CloudFront does not consistently identify itself in requests. But you can get around this problem by overriding robots.txt at the CloudFront distribution.

1) Create a new S3 bucket that only contains one file: robots.txt. That will be the robots.txt for your CloudFront domain.

2) Go to your distribution settings in the AWS Console and click Create Origin. Add the bucket.

3) Go to Behaviors and click Create Behavior. Set Path Pattern: robots.txt and Origin: (your new bucket).

4) Set the robots.txt behavior at a higher precedence (lower number).

5) Go to invalidations and invalidate /robots.txt.

Now abc123.cloudfront.net/robots.txt will be served from the bucket and everything else will be served from your domain. You can choose to allow/disallow crawling at either level independently.

Another domain or subdomain will also work in place of a bucket, but why go to the trouble?
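If you'd rather script steps 1 and 5 than click through the console, here is a rough boto3 sketch. It only covers uploading the robots.txt and invalidating it, not the origin/behavior setup, and the bucket name and distribution ID are placeholders you'd replace with your own:

```python
# Rough sketch using boto3: upload a disallow-all robots.txt to the bucket
# that backs the CloudFront-only behavior, then invalidate /robots.txt.
# "my-cdn-robots-bucket" and "E1234567890ABC" are placeholder names.
import time
import boto3

s3 = boto3.client("s3")
cloudfront = boto3.client("cloudfront")

# Step 1: the bucket's only file is robots.txt, blocking all crawling
# of the CloudFront domain.
s3.put_object(
    Bucket="my-cdn-robots-bucket",
    Key="robots.txt",
    Body="User-agent: *\nDisallow: /\n",
    ContentType="text/plain",
)

# Step 5: invalidate /robots.txt so CloudFront stops serving the copy
# it may have cached from the main site.
cloudfront.create_invalidation(
    DistributionId="E1234567890ABC",
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/robots.txt"]},
        "CallerReference": str(time.time()),
    },
)
```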

Pittel answered 16/5, 2013 at 5:10

You need to add a robots.txt file and tell crawlers not to index content under static1.mainsite.com.

In CloudFront you can control the hostname with which CloudFront accesses your server. I suggest giving CloudFront a specific hostname that is different from your regular website hostname. That way you can detect requests to that hostname and serve a robots.txt that disallows everything (unlike your regular website's robots.txt).
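For illustration only, here is a minimal sketch of that Host-header check in Python/Flask; your site may well be PHP/Apache, so treat this as pseudocode for the idea, and note that cdn-origin.mainsite.com is a made-up name for the hostname you'd give CloudFront:

```python
# Minimal sketch: serve a different robots.txt depending on which hostname
# the request came in on. "cdn-origin.mainsite.com" is a hypothetical
# origin hostname configured only in the CloudFront distribution.
from flask import Flask, Response, request

app = Flask(__name__)

DISALLOW_ALL = "User-agent: *\nDisallow: /\n"   # for the CDN hostname
ALLOW_ALL = "User-agent: *\nDisallow:\n"        # for the regular site

@app.route("/robots.txt")
def robots():
    # Requests fetched by CloudFront arrive with the CDN-specific Host
    # header, so hand them a robots.txt that blocks crawling entirely.
    if request.host.startswith("cdn-origin."):
        body = DISALLOW_ALL
    else:
        body = ALLOW_ALL
    return Response(body, mimetype="text/plain")
```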

Harr answered 1/2, 2012 at 19:36
