Google Storage or Amazon S3 or Google App Engine BlobStore

I am going to build a site using Google App Engine. My public site contains thousands of pictures, and I want to store them in the cloud: Google Storage, Amazon S3, or the Google App Engine Blobstore. The problem is image hotlinking.

  1. As for Google Storage, I googled and I can't find a way to prevent image hotlinking. (I like its command-line tool gsutil very much, though.)

  2. Amazon S3 has "Query String Authentication", which generates expiring image URLs. But this is very bad for SEO, isn't it? Constantly changing the URL would have quite negative effects, as it takes upwards of a year to get an image and its URL into Google Images. I am fairly sure that changing the URL would have an immediate negative effect when Googlebot comes around to say hi. (UPDATE: A better way to prevent image hotlinking in Amazon S3 by referrer is to use a bucket policy. Details here: http://www.naveen.info/2011/03/25/amazon-s3-hotlink-prevention-with-bucket-policies/)

  3. Google App Engine Blobstore? I have to upload the images manually via web interfaces, and it generates changing URLs too. (UPDATE: I made a mistake here due to my ignorance about Blobstore. With the Google App Engine Blobstore, you can serve the image under whatever URL you want.)
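The bucket-policy approach mentioned in item 2 might look roughly like the sketch below. It only builds and prints a policy document; the bucket name `my-image-bucket` and the site `www.example.com` are placeholders, and the helper name `referer_policy` is made up for illustration. It assumes the objects in the bucket are otherwise private.

```python
import json


def referer_policy(bucket, allowed_referers):
    """Build an S3 bucket policy that allows GETs only for listed referers.

    `bucket` and `allowed_referers` are placeholder inputs for illustration.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowGetFromOwnSite",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": "arn:aws:s3:::%s/*" % bucket,
                # aws:Referer is matched against the HTTP Referer header.
                "Condition": {"StringLike": {"aws:Referer": allowed_referers}},
            }
        ],
    }


policy = referer_policy(
    "my-image-bucket",
    ["http://www.example.com/*", "https://www.example.com/*"],
)
print(json.dumps(policy, indent=2))
```

Under a policy like this, a request that carries no Referer header at all would presumably not match the condition, so crawlers that omit the header would be refused along with hotlinkers.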

What I need is simple referrer protection: Only show the image when the referrer is my site.
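As a rough sketch of what such a referrer check amounts to (Python 3, standard library only): the host names in `ALLOWED_HOSTS` and the function name `is_allowed_referer` are hypothetical, and whether to accept an empty Referer is a policy choice left as a parameter.

```python
from urllib.parse import urlparse

# Hypothetical host names; replace with the site's real domains.
ALLOWED_HOSTS = {"www.example.com", "example.com"}


def is_allowed_referer(referer, allow_empty=True):
    """Return True if the Referer header points at one of our own hosts.

    An empty or missing Referer is accepted when `allow_empty` is True,
    because crawlers and some browsers omit the header.
    """
    if not referer:
        return allow_empty
    host = urlparse(referer).hostname  # None when the value is not a URL
    return host in ALLOWED_HOSTS


print(is_allowed_referer("http://www.example.com/gallery"))  # True
print(is_allowed_referer("http://other.example.net/page"))   # False
print(is_allowed_referer(""))                                # True
```

The check has to run in code you control, so it only helps with a storage option that lets your own handler sit in front of the image.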

Are there better ways to prevent image hotlinking? I don't want to file for bankruptcy because of the extremely high cost of cloud bandwidth.

UPDATE:

It is still difficult to choose among the three; each of them has pros and cons. Blobstore seems to be the ultimate choice.

Guidon answered 1/7, 2011 at 21:17 Comment(3)
I'm not sure, but I'd be surprised if you could get your images into Google Image Search if you prevent hotlinking. - Kellsie
@sharth: Good point. I just searched; Googlebot sends no referrer, only one agent: Googlebot-Image/1.0. - Guidon
Did you succeed in preventing hotlinking? Cheers. - Leakage

The easiest option would be to use the blobstore. You can provide whatever upload interface you want - it's up to you to write it - and the blobstore doesn't constrain your download URLs, only your upload ones. You can serve blobstore images under any URL simply by setting the appropriate headers, or you can use get_serving_url to take advantage of the built-in fast image serving support, which generates cryptic but consistent URLs (but doesn't let you do referer checks).
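A minimal sketch of the "any URL plus the right headers" idea, combined with a referer check, written as a plain Python 3 WSGI app so it can be read without the SDK. Assumptions: the mapping `IMAGE_BLOB_KEYS`, the host in `ALLOWED_HOSTS`, and the blob key value are made up; `X-AppEngine-BlobKey` is the response header that the SDK exposes as `blobstore.BLOB_KEY_HEADER`, and it only has an effect when the app actually runs on App Engine.

```python
from urllib.parse import urlparse

# Hypothetical mapping from public paths to blob keys; a real app would
# look these up in the datastore.
IMAGE_BLOB_KEYS = {"/images/cat.png": "example-blob-key-123"}
ALLOWED_HOSTS = {"www.example.com"}  # placeholder for the site's own host


def image_app(environ, start_response):
    """WSGI app: 403 for foreign referers, otherwise hand off to the blobstore."""
    referer = environ.get("HTTP_REFERER", "")
    if referer and urlparse(referer).hostname not in ALLOWED_HOSTS:
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Hotlinking is not allowed.\n"]

    blob_key = IMAGE_BLOB_KEYS.get(environ.get("PATH_INFO", ""))
    if blob_key is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"No such image.\n"]

    # On App Engine this header asks the blobstore to serve the blob as the body.
    start_response("200 OK", [
        ("Content-Type", "image/png"),
        ("X-AppEngine-BlobKey", blob_key),
    ])
    return [b""]
```

Requests without a Referer are let through here, which keeps image crawlers working; a stricter policy would reject them as well.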

I would suggest giving some consideration to whether this is a real, practical problem you're facing, though. The bandwidth consumed by a few hotlinked images is pretty minimal by today's standards, and it's not a particularly common practice in the first place. As @sharth points out in the comments, it's likely to impact SEO too, since image search tends to show images in their own windows in addition to linking to the page that hosted them.

Tenuis answered 4/7, 2011 at 1:46 Comment(4)
Is there any command-line tool to upload an image to blobstore? - Guidon
@Guidon No, but the blobstore APIs are available over remote_api, so you could write one fairly simply. - Tenuis
Since you are here, I want to know something about Blobstore. I know there is a 30-second-per-request limit in App Engine. Will this limit apply when I upload a video to the App Engine Blobstore? The maximum single file size for Blobstore is 2GB, and if I upload via an HTML form, it may take hours. Will the 30-second-per-request limit apply? - Guidon
@Guidon The 30 second execution time limit only applies to the time your code actually spends executing - which doesn't begin until the user has sent the entire request, and ends as soon as you send your response (before they receive it). - Tenuis

Whenever I went back to coding statistical web services, I had to generate images and charts dynamically. The generated images would depend on the request parameters, the state of the data repository, and some header info.

Therefore, if I were you, I would write a REST web service to serve the images. It's not too difficult. It's rather fun too, because if you don't like a particular IP address, you could show a cartoon sticking its tongue out (or an animated GIF of OBL samba dancing while getting bombed) instead of the requested image.

In your case you would check the referer (or referrer) in the HTTP header, right? I am doubtful, because people can and will hide, blank out, or even fake the referer field in the HTTP header.

So do not only check the referer field; also create a data field whose value changes. The check could be a simple value match.

During World War II, Roosevelt and Churchill communicated through encryption. They each had an identical stack of disks, which somehow contained the encryption mechanism. After each conversation, both would discard the disk and never reuse it, so that the next time they spoke, they would reach for the next disk on the stack.

Instead of a stack of disks, your image consumers and your image provider would carry the same stack of 32-bit tokens. 32 bits would give you ~4 billion ten-minute periods. The stack is randomly sequenced. Since it is well known that "random generators" are not truly random, and are in fact algorithmic in a way that can be predicted given a sufficiently long sequence, you should either use a "true random generator" or resequence the stack every week.

Due to latency issues, your provider would accept tokens from the current period, the last period, and the next period, where a period corresponds to a sector.

Your Ajax client (presumably GWT) in the browser would get an updated token from the server every ten minutes and use it to request images. Your image provider service would reject a stale token, and the Ajax client would then have to request a fresh token from the server.
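The accept-current-previous-next behaviour can be sketched without a hand-built stack of random values. The sketch below is not the scheme described in this answer: it swaps in a standard HMAC over the period number, which gives both sides the same token for a given ten-minute period. `SECRET`, `PERIOD_SECONDS`, and all function names are assumptions for illustration.

```python
import hashlib
import hmac
import time

SECRET = b"replace-with-a-real-secret"  # hypothetical server-side key
PERIOD_SECONDS = 600  # ten-minute periods, as in the text


def token_for_period(period):
    """Derive the token for one period as a hex HMAC-SHA256 digest."""
    return hmac.new(SECRET, str(period).encode("ascii"), hashlib.sha256).hexdigest()


def issue_token(now=None):
    """Token the server hands to the browser for the current period."""
    now = time.time() if now is None else now
    return token_for_period(int(now // PERIOD_SECONDS))


def is_token_valid(token, now=None):
    """Accept tokens from the previous, current, or next period."""
    now = time.time() if now is None else now
    current = int(now // PERIOD_SECONDS)
    return any(
        hmac.compare_digest(token, token_for_period(p))
        for p in (current - 1, current, current + 1)
    )
```

A page would embed `issue_token()` when it is rendered, and the image handler would call `is_token_valid()` on the token sent with each image request. Anyone who can load the page can also read the token, so this limits stale links rather than determined scrapers.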

It is not a fireproof method, but it is shatterproof, so it could reduce the number of spam requests (nearly to zero, I presume).

The way I generate "truly random" sequences is again quick and dirty. I take an algorithmically generated "random" sequence and spend a few minutes throwing in a few monkey wrenches by manually resequencing or deleting values. That would mess up any algorithmic predictability. Perhaps you could write a monkey wrench thrower, but an algorithmic monkey wrench thrower would simply add one predictable algorithm on top of another, which does not reduce the overall predictability at all.

You could further obsessively constrict the situation by employing cyclic redundancy matching as a quick and dirty "encrypted" token matching mechanism.

Let us say you have a circle divided into 8 equal sectors. You would need a 3-digit binary number to address any one of the 8 sectors. Imagine each sector is further subdivided into 8 subsectors, so that you can address each subsector with an additional 3 bits, making a total of six bits.

You plan to change the matching value every 10 minutes. Your image provider and all your approved consumers will have the same stack of sector addresses. Every ten minutes they throw away the sector address and use the next one. When a consumer sends your provider a matching value, it does not send the sector address but the subsector address, so as long as your provider receives a subsector address belonging to the currently accepted sector, the provider service would respond with the correct image.

But the subsector address is remapped through an obfuscation sequencing algorithm, so that subsector addresses within the same sector do not look similar at all. That way, not all browsers would receive the same token value or highly similar token values.

Let us say that you have 16-bit sector addresses and each sector has 16-bit subsector addresses, making up a 32-bit token. This means you can afford to have 65536 concurrent browser clients carrying the same token sector, with no two tokens sharing the same low-predictability value, so you could assign a token subsector value to every session id. Unless you have more than 65536 concurrent sessions to your image provider service, no two session ids would need to share the same subsector token address. In that way, unless a spammer had access to session-id-faking equipment or facilities, there would be no way your image provider could be spammed except through a denial-of-service attack.
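The 16-bit sector plus 16-bit subsector layout is plain bit packing, which can be sketched as follows. The function names are made up for illustration, and the obfuscation remapping described above is deliberately left out.

```python
FIELD_LIMIT = 1 << 16  # each field must fit in 16 bits


def pack_token(sector, subsector):
    """Combine a 16-bit sector and a 16-bit subsector into one 32-bit token."""
    if not (0 <= sector < FIELD_LIMIT and 0 <= subsector < FIELD_LIMIT):
        raise ValueError("sector and subsector must each fit in 16 bits")
    return (sector << 16) | subsector


def unpack_token(token):
    """Split a 32-bit token back into (sector, subsector)."""
    return token >> 16, token & 0xFFFF


token = pack_token(0x1234, 0x00FF)
print(hex(token))           # 0x123400ff
print(unpack_token(token))  # (4660, 255)
```

With this layout the provider only needs to compare the upper 16 bits against the currently accepted sector; the lower 16 bits can differ per session.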

Low predictability means that there is low probability for a snooper or peeper to concoct an acceptable token to spam your image provider service.

Certainly, normal bots would not be able to get through, unless you had really offended the Anonymous group and they decided to spam your server out of sheer fun. And even then, if you had thrown monkey wrenches into the sector address stack and subsector maps, it would be really difficult to predict the next token.

BTW, cyclic redundancy matching is actually an error-detection technique and not so much an encryption technique.

Besides answered 2/7, 2011 at 3:23 Comment(4)
LOL What are you talking about? FYI My English sucks - Guidon
Wow. 1) The point of hotlink prevention is to prevent users from linking directly to your resources by making it unusable by other users. The users who are sending the referer headers are not your adversaries, the people who linked to your images are, and they have no control over other users' browsers. 2) I'm pretty sure Roosevelt and Churchill didn't use disks, since they weren't invented until 30 years after the end of world war 2. 3) What you're talking about is One Time Pads, and completely irrelevant to the question at hand. 4) Don't invent your own crypto. Just don't. - Tenuis
It's been drawn to my attention that you probably meant vinyl records when you said 'discs', which is accurate. It's still pretty much irrelevant to the OP's problem, though. - Tenuis
Is this an ironic way of saying "don't worry about hot-linking"? - Polypus

A simpler version of geek's essay: build a handler in Google App Engine to fetch and serve the images. You can modify your headers to specify PNG or whatever, but you're returning the image from another location. You can then examine the request's referrer information in the handler and take appropriate action if somebody is trying to access that image "hotlinked". Of course, because you're never exposing the actual image, it would be impossible to hotlink. =)

Bidwell answered 2/7, 2011 at 16:45 Comment(5)
And fetch and return the image from a third-party service on every response? Sure, if you love high bandwidth bills, do this. - Tenuis
I implied Google App Engine Blobstore, since as far as I know, short of storing static images through app deployment, that's the only way of storing images there. I guess you have a point in that I didn't specifically say blobstore, since that was part of his question... - Bidwell
Then you're not really "returning the image from another location", are you? That was what led me to believe you were talking about fetching the image from elsewhere. - Tenuis
I meant to say that you can specify "examplewebsite.com/images/image1234.png" when the image's URL is whatever the blobstore URL is. Google's bandwidth charges are very reasonable for small to medium websites to serve images directly, imho. =) - Bidwell
Well, the blobstore lets you serve images under any URL you want - the only 'blobstore URLs' are upload URLs and get_serving_url ones. I agree that App Engine's bandwidth charges are reasonable - I was more worried about the OP paying three times that for every request. - Tenuis

You should know that the File API is still experimental; see this issue:

http://code.google.com/p/googleappengine/issues/detail?id=6888#c20

I'm working at a startup that is moving from Blobstore to Amazon S3.

Astrometry answered 20/3, 2012 at 14:59 Comment(0)
