ruby amazon-s3 http-headers noindex

Setting noindex on Amazon S3 objects


We have some publicly shared S3 files that we want to make sure won't be indexed by Google. I can't seem to find any documentation on how to do this. Is there a way to set a "noindex" x-robots-tag response header on individual S3 objects?

(We're using the Ruby AWS client)


Solution

  • There does not appear to be a way to do this.

    Only certain headers from an S3 PUT object request are documented as being returned when the object is fetched.

    http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPUT.html

    Anything else you send appears to be simply disregarded, as long as it doesn't actually invalidate the request.

    Actually, that's what I thought before researching this, and it's almost true.

    The documentation there seems incomplete, and documentation elsewhere suggests that the following request headers, if sent with the upload, will appear in the download:

    Cache-Control
    Content-Disposition
    Content-Encoding
    Content-Type
    x-amz-meta-*
    

    Other headers are listed at the latter link, but some of these like Expect wouldn't make sense on a GET request, so they logically wouldn't appear.

    So far, this is all consistent with my experience with S3.

    If you send a random but not-invalid header with your request, it's ignored. Example:

    X-Foo: bar
    

    S3 seems to accept this on upload, but discards it (presumably doesn't store it)... downloading the object does not return the X-Foo header.
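
    To make the behaviour described so far concrete, here is a small Ruby model of it. This is only an illustration of the observations above, not S3's actual logic; the names STORED_HEADERS and headers_kept_on_download are made up for this sketch, and the header list is just the one quoted earlier.

```ruby
# Illustrative model only: mirrors the observations in this answer,
# NOT S3's real implementation. All names here are hypothetical.
STORED_HEADERS = %w[
  cache-control content-disposition content-encoding content-type
].freeze
META_PREFIX = "x-amz-meta-"

# Returns the subset of upload headers this model expects to come back on GET.
def headers_kept_on_download(upload_headers)
  upload_headers.select do |name, _value|
    key = name.downcase
    STORED_HEADERS.include?(key) || key.start_with?(META_PREFIX)
  end
end

upload = {
  "Content-Type"     => "text/plain",
  "Cache-Control"    => "max-age=3600",
  "x-amz-meta-owner" => "docs-team",
  "X-Foo"            => "bar"
}

kept = headers_kept_on_download(upload)
puts kept.keys.inspect
# prints ["Content-Type", "Cache-Control", "x-amz-meta-owner"]
```

    In this model X-Foo is accepted as input but simply never makes it into the stored set, which matches what the download shows.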

    But X-Robots-Tag appears to be an undocumented exception to this.

    Uploading a file with X-Robots-Tag: noindex (for example) does indeed result in the same header and value being returned with the object when you GET it.
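
    One way to check this for yourself is to request the object and look at the response headers. A minimal sketch using Ruby's standard Net::HTTP; the helper names and the example URL are assumptions for illustration, and the object is assumed to be publicly readable:

```ruby
require "net/http"
require "uri"

# True when an X-Robots-Tag value contains a "noindex" directive
# ("none" is shorthand for "noindex, nofollow").
def noindex_directive?(header_value)
  return false if header_value.nil?

  directives = header_value.split(",").map { |d| d.strip.downcase }
  directives.include?("noindex") || directives.include?("none")
end

# Sends a HEAD request and returns the raw X-Robots-Tag value, or nil
# when the header is absent. Assumes the object is publicly readable.
def fetch_robots_tag(object_url)
  uri = URI(object_url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.head(uri.request_uri)["x-robots-tag"]
  end
end

# Example (hypothetical bucket and key):
#   noindex_directive?(fetch_robots_tag("https://example-bucket.s3.amazonaws.com/report.pdf"))
```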

    Unless somebody can cite the documentation that explains why this works, we're operating in distinctly undocumented territory.

    But, if you're interested in going there, the simple answer appears to be, you just add this header to the HTTP PUT request you send to the REST API to upload the object.
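
    As a rough sketch of what that looks like without the SDK, the request can be built with Ruby's standard Net::HTTP. Authentication is deliberately left out: the sketch assumes upload_url is already authorized somehow (for example a pre-signed URL), and whether an extra header is accepted under your particular signing setup is something to test. The function name and the example URL are hypothetical.

```ruby
require "net/http"
require "uri"

# Builds (but does not send) a PUT request carrying X-Robots-Tag: noindex.
# Assumes upload_url is already authorized; signing is not handled here.
def build_noindex_put(upload_url, body, content_type: "application/octet-stream")
  uri = URI(upload_url)
  request = Net::HTTP::Put.new(uri)
  request["Content-Type"] = content_type
  request["X-Robots-Tag"] = "noindex"
  request.body = body
  request
end

request = build_noindex_put(
  "https://example-bucket.s3.amazonaws.com/report.pdf", # hypothetical object URL
  "hello",
  content_type: "text/plain"
)

# To actually send it:
#   Net::HTTP.start(request.uri.host, request.uri.port, use_ssl: true) do |http|
#     http.request(request)
#   end
```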

    "Not so fast," you say, "I'm using the Ruby SDK." Indeed. The AWS Ruby client seems to be too "helpful" to let you get away with this, at least, not easily. The docs there show how to add "metadata" --

    :metadata (Hash) — A hash of metadata to be included with the object. These will be sent to S3 as headers prefixed with x-amz-meta. Each name, value pair must conform to US-ASCII.

    Well, that's not going to work, because you'd get x-amz-meta-x-robots-tag.
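
    To see why, here is a tiny stand-in for the prefixing the docs describe. It is not SDK code; metadata_to_headers is a hypothetical helper that only imitates the documented "prefixed with x-amz-meta" behaviour.

```ruby
# Hypothetical stand-in for the documented :metadata behaviour; not SDK code.
def metadata_to_headers(metadata)
  metadata.each_with_object({}) do |(name, value), headers|
    headers["x-amz-meta-#{name.to_s.downcase}"] = value.to_s
  end
end

headers = metadata_to_headers({ "x-robots-tag" => "noindex" })
puts headers.keys.first
# prints x-amz-meta-x-robots-tag
```

    A crawler looks for X-Robots-Tag itself, so the prefixed name is of no use for this purpose.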

    How do you set other headers in the upload? Every other header you'd normally set is an element of the options hash, like :cache_control, which turns into Cache-Control: in the upload request. Unless they're blindly applying the keys from that hash to the upload transaction (which would be terrible design combined with excellent luck), you may not have a straightforward way to get here from there. I can't be much more specific, because the only thing I really know about Ruby is the same thing I know about Java -- from what I've seen of it, I don't like it. :)

    But X-Robots-Tag does appear to be a custom header S3 supports, to some extent, without clear documentation of that fact. It's, at least, accepted by the REST API.

    Failing the above, you can manually add this header to the metadata in the S3 console after uploading the object. (Note, X-Foo: Bar doesn't work from the S3 console, either -- it's silently discarded, with no error -- but X-Robots-Tag: works fine).


    You can also, of course, put a publicly readable robots.txt file (with the appropriate directives in it) in the root of the bucket. Depending on your content mix, path hierarchy, and other factors, that may not be as simple as selectively setting headers. But if the entire bucket consists of information you don't want indexed, it should accomplish what you want: well-behaved crawlers should not fetch content that robots.txt disallows, even when a spider follows a link to it from another site, because every domain's (and subdomain's) robots.txt file stands alone. One caveat: robots.txt blocks crawling rather than indexing, so a URL that is linked from elsewhere can still show up in results without its content, and a crawler that is blocked from fetching an object never sees an X-Robots-Tag header on it.
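
    For the robots.txt route, the file itself is tiny. A minimal Ruby sketch that generates one; the robots_txt helper is hypothetical, and the result would be uploaded as a publicly readable object with the key robots.txt at the root of the bucket:

```ruby
# Builds the text of a robots.txt that disallows the given paths
# for the given user agent ("*" means every crawler).
def robots_txt(disallowed_paths, user_agent: "*")
  lines = ["User-agent: #{user_agent}"]
  lines.concat(disallowed_paths.map { |path| "Disallow: #{path}" })
  lines.join("\n") + "\n"
end

puts robots_txt(["/"])
# prints:
# User-agent: *
# Disallow: /
```

    Passing ["/"] covers the whole bucket; a list such as ["/private/", "/drafts/"] (made-up paths) would restrict only those prefixes.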