Amazon S3 vs RESTful Collections
Added by pmuellr, last edited by pmuellr on Oct 16, 2006  (view change)

Let's start with some references:

Amazon S3

This is a service that Amazon provides to store data. It has both RESTy and SOAPy interfaces. The entry point in Amazon for info on S3 is here: http://aws.amazon.com/s3. The current reference docs are here: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=123&categoryID=48.

RESTful Collections

Joe Gregorio has been blogging on this, and here's a good primer on what this is all about: http://bitworking.org/news/wsgicollection.

What are we doing here?

I'm a big believer in RESTy web services, and the collections pattern as described by Joe seems like a good stab at applying some order in the otherwise unordered world of REST.

Amazon S3 provides a lot of the same functionality as RESTful collections, and it's a real, live, commercial ($$$) implementation, backed by a pretty stable company.

I thought it would be interesting to compare/contrast these.

Is Amazon S3 already a RESTful Collection?

Here are some brief details on the S3 service, if you don't already know about it.

  • each account holder can create buckets, which are used to hold objects
  • a bucket may contain an unlimited number of objects; each object is stored as a binary blob and may be from 1 byte to 5 gigabytes in size
  • buckets and objects are named, and are combined to form a URI at which to access a resource (see table below).

Let's look at the operations you can perform against your S3 account. I'll express each one as the HTTP verb and the URI pattern it is invoked against.

Verb    URI pattern          Operation
GET     /                    get a list of all your buckets
PUT     /{bucket}            create a new bucket
GET     /{bucket}            list the contents of the bucket
DELETE  /{bucket}            delete the bucket
PUT     /{bucket}/{object}   create/update an object
GET     /{bucket}/{object}   get the contents of the object
DELETE  /{bucket}/{object}   delete the object

Compare this to the same table for RESTful collections.

Verb    URI pattern              Operation
GET     /{collection}            get a list of the objects in the collection
POST    /{collection}            create a new object in a collection
GET     /{collection}/{object}   get the contents of an object
PUT     /{collection}/{object}   update the contents of an object
DELETE  /{collection}/{object}   delete the object

S3 adds:

  • getting a list of your collections (buckets)
  • creating a new collection (bucket)
  • deleting a collection (bucket)

And the big difference is the different pattern for object creation.
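
The comparison of the two tables can be checked mechanically. The following is only a sketch: it copies the verb/URI pairs from the tables above into Python sets, treats a bucket as a collection so the patterns line up, and takes set differences. The variable names are made up for this example.

```python
# Verb/URI pairs copied from the two tables above.
S3_OPS = {
    ("GET", "/"),
    ("PUT", "/{bucket}"),
    ("GET", "/{bucket}"),
    ("DELETE", "/{bucket}"),
    ("PUT", "/{bucket}/{object}"),
    ("GET", "/{bucket}/{object}"),
    ("DELETE", "/{bucket}/{object}"),
}
COLLECTION_OPS = {
    ("GET", "/{collection}"),
    ("POST", "/{collection}"),
    ("GET", "/{collection}/{object}"),
    ("PUT", "/{collection}/{object}"),
    ("DELETE", "/{collection}/{object}"),
}

# Treat a bucket as a collection so the URI patterns are comparable.
s3_as_collections = {
    (verb, uri.replace("{bucket}", "{collection}")) for verb, uri in S3_OPS
}

only_in_s3 = sorted(s3_as_collections - COLLECTION_OPS)
only_in_collections = sorted(COLLECTION_OPS - s3_as_collections)
```

Here only_in_s3 holds the three bucket-management operations, and only_in_collections holds the POST against the collection. The PUT against an object appears in both sets, but the set comparison hides the fact that it means create/update in S3 and update only in the collections pattern.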

Different object creation pattern

Note that the S3 object creation style is different from the RESTful collection style, which does a POST against the collection instead of a PUT against the object. There's something nice about both styles. The POST-to-the-collection style is nice because it separates create from update, which I think is important. On the other hand, the PUT of the object with its name is nice because you don't need to depend on the server to tell you what the resulting URI is, via the Location: header in the response. In the POST case, how does the server determine that resulting URI, especially if you want it to be human readable?

In the end, the PUT style seems a bit more symmetric to me. The update vs. create issue is real, though. The scenario is that two people try to create the same named object at the same time. With the PUT style, the first one operates as a create, the second as an update against the same object, and the system doesn't know any better. The object sent in the first received PUT is basically thrown away. In the POST style, two different objects are created.
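
To make the scenario concrete, here is a minimal in-memory sketch of the two creation styles. The PutStore and PostStore classes and their methods are hypothetical models invented for this example; they are not S3's API or any collection implementation's API.

```python
import itertools


class PutStore:
    """PUT-by-name style: the client picks the name (hypothetical model)."""

    def __init__(self):
        self.objects = {}

    def put(self, name, body):
        # Create and update are the same operation, so the last write wins.
        created = name not in self.objects
        self.objects[name] = body
        return "created" if created else "updated"


class PostStore:
    """POST-to-collection style: the server picks the name (hypothetical model)."""

    def __init__(self):
        self.objects = {}
        self._ids = itertools.count(1)

    def post(self, body):
        # Every POST creates a fresh object; the returned name plays the
        # role of the URI sent back in the Location: header.
        name = str(next(self._ids))
        self.objects[name] = body
        return name


# Two people try to create the same named object.
put_store = PutStore()
first = put_store.put("shopping", "alice's list")
second = put_store.put("shopping", "bob's list")

post_store = PostStore()
alice_name = post_store.post("alice's list")
bob_name = post_store.post("bob's list")
```

In the PUT store only the second body survives under the shared name, while the POST store ends up with two separate objects under server-chosen names.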

In theory, this is solvable by using If-Match and If-None-Match headers on the PUT. Specifically, when you are attempting to create a new item, use the header

If-None-Match: *

This means that the PUT request should fail with a 412 Precondition Failed response if the resource already exists.

When you are attempting to update an existing item, you should first GET the item, and use the ETag returned in the response in the header

If-Match: [ETag returned from GET]

This means the PUT request should fail if the resource has been updated since you last performed the GET.
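
The two header checks can be sketched as a small in-memory model. This shows how a server honoring the headers would behave, not how S3 behaves. The ConditionalStore class is hypothetical, and the ETag is just a version counter chosen to keep the example short.

```python
class ConditionalStore:
    """In-memory model of PUT with If-Match / If-None-Match (hypothetical)."""

    def __init__(self):
        self.objects = {}  # name -> (body, etag)
        self._version = 0

    def get(self, name):
        # Returns (body, etag), like a GET response with an ETag header.
        return self.objects[name]

    def put(self, name, body, if_match=None, if_none_match=None):
        current = self.objects.get(name)
        # If-None-Match: * means "only if nothing is stored here yet".
        if if_none_match == "*" and current is not None:
            return 412
        # If-Match: <etag> means "only if it is still the version I saw".
        if if_match is not None and (current is None or current[1] != if_match):
            return 412
        self._version += 1
        self.objects[name] = (body, f'"v{self._version}"')
        return 201 if current is None else 200


store = ConditionalStore()
r1 = store.put("a.html", "first", if_none_match="*")   # create succeeds
r2 = store.put("a.html", "second", if_none_match="*")  # already exists
_, tag = store.get("a.html")
r3 = store.put("a.html", "third", if_match=tag)        # update succeeds
r4 = store.put("a.html", "fourth", if_match=tag)       # tag is now stale
```

The second create and the stale update both come back as 412, so neither writer silently overwrites the other.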

A note in the S3 doc indicates the ETag is an MD5 of the content of the resource, so 'updated' in the sentence above implies the bits of the entity have changed. If for some reason the entity was updated with the exact same bits that it previously had, the ETag will not have changed, but then, that's probably ok anyway.
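
A quick way to see the consequence of a content-derived ETag is to compute MD5 digests by hand. This sketch assumes only what the note says, that the ETag is an MD5 of the content. The md5_etag helper is a made-up name, and any quoting or formatting S3 applies to the header value is not modeled.

```python
import hashlib


def md5_etag(body: bytes) -> str:
    """Hex MD5 digest of the entity body (hypothetical helper)."""
    return hashlib.md5(body).hexdigest()


original = md5_etag(b"pick up milk")
rewritten = md5_etag(b"pick up milk")     # same bits written again
changed = md5_etag(b"pick up oat milk")   # different bits
```

Here original and rewritten are equal, so an If-Match check would still pass after a rewrite with identical bits, while changed differs.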

Curiously, the S3 doc indicates that GET requests can use the various forms of the If- headers, but those headers aren't documented for PUT. Some experimentation here is required. If in fact S3 does not support the If- headers on PUT, then other measures will need to be taken to ensure objects are not inadvertently overwritten. And those other measures may include using the Amazon Simple Queue Service (SQS).

More S3 issues

The issues don't stop here. Here are some S3 constraints:

  • S3 account holders can only create up to 100 buckets
  • bucket names must be globally unique

The first constraint means that we can't just use a bucket for every collection we might want to support, because we'll run out of buckets. Each bucket is more like a relational database than a table in a database, basically. The second constraint means that we can't use a 'nice' name for our buckets, because only one person will be able to create a bucket named 'recipes'.

Additional S3 capabilities

Given what I've said, you might think that this is a bit hopeless as a general purpose RESTful collection. You're going to have odd constraints on the names of your collections, and you can't really have very many collections to start with. However, there are some additional capabilities in the 'list the contents of the bucket' functionality (GET /{bucket}) which point to a different approach.

The additional capabilities are:

  • the ability to only list objects whose names begin with a specific prefix
  • the ability to 'roll up' a list of prefixes that are shared by a set of objects

To describe these, let's say I create a bucket which contains a set of ToDo lists (like Ta-Da Lists). Each ToDo list is named with a 'simple' name (constrained to a path element, i.e., a directory name), and each ToDo item also has a 'simple' name, which is prefixed by the name of the ToDo list it's part of. We'll use directory-style naming, where the ToDo list and ToDo item are separated by a /. Here's an example of what 'list all the objects in my bucket' might return.

home/pick-up-milk
home/trim-nose-hair
work/submit-tps-report
work/beg-for-a-raise

In this example, I have two ToDo lists, 'home' and 'work', each having two ToDo items.

The first additional capability, only listing objects with a specific prefix, lets me issue a request like 'list all the objects starting with "home/"'. And I'd get

home/pick-up-milk
home/trim-nose-hair

Nice. Except, before I even do that, how do I know that 'home' exists? This is what the 'rollup' capability does, which is called 'delimiter' access in the S3 docs. I can issue a request like 'list all the objects with delimiter "/"'. And I'd get

home
work

Prefix and delimiter can work together for arbitrary depth 'trees'. Perhaps I'd like to be more elaborate and have ToDo lists that are 'current' and some for the 'future'. For example, a complete list of my bucket might be

current/home/pick-up-milk
current/home/trim-nose-hair
current/work/submit-tps-report
current/work/beg-for-a-raise
future/retirement/buy-a-jet-car
future/retirement/take-a-nap

I can issue a request like 'list all objects prefixed with "current/" using delimiter "/"' and get back

current/home
current/work

(aside: The names returned may not be what you are actually getting from S3, but you get the gist; I need to actually get off my duff and write some code to get the exact details.)
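
The prefix and delimiter behavior can be approximated locally without touching S3. The list_keys function below is a hypothetical in-memory stand-in, not the S3 API. It makes two assumptions: rolled-up names keep their trailing delimiter, and the prefix is written with its trailing slash. Both are choices of this sketch rather than confirmed details of what S3 returns.

```python
def list_keys(keys, prefix="", delimiter=None):
    """Return (matching keys, rolled-up common prefixes) for a key list."""
    contents, common = [], []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            # Roll everything up to the next delimiter into one entry.
            rolled = prefix + rest.split(delimiter, 1)[0] + delimiter
            if rolled not in common:
                common.append(rolled)
        else:
            contents.append(key)
    return contents, common


BUCKET = [
    "current/home/pick-up-milk",
    "current/home/trim-nose-hair",
    "current/work/submit-tps-report",
    "current/work/beg-for-a-raise",
    "future/retirement/buy-a-jet-car",
    "future/retirement/take-a-nap",
]

top_level = list_keys(BUCKET, delimiter="/")
current_lists = list_keys(BUCKET, prefix="current/", delimiter="/")
home_items = list_keys(BUCKET, prefix="current/home/")
```

Walking the tree is then a matter of feeding each rolled-up prefix back in as the next request's prefix.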

In general, if you name your objects appropriately, you can use prefix and delimiter to do tree-based resource listing. Can you say GET NEXT IN PARENT? I thought you could.

(aside: I've been having chats with some folks recently, pleasantly reminiscing the old IMS DL/I hierarchical database days. It's coming back, I tell ya, it's coming back!)

Additionally, the 'list the contents of the bucket' functionality includes some pretty nice pagination control, which is important when you are dealing with a lot of objects.
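
As a rough illustration of how this kind of pagination works, the sketch below models a marker ("list keys after this one") and a max-keys limit, with a flag saying whether more results remain. These ideas are taken to match the S3 list operation's marker and max-keys parameters, but the functions themselves are hypothetical in-memory stand-ins.

```python
def list_page(keys, marker="", max_keys=1000):
    """Return one page of keys sorting after `marker`, plus a truncated flag."""
    remaining = sorted(k for k in keys if k > marker)
    page = remaining[:max_keys]
    return page, len(remaining) > max_keys


def list_all(keys, max_keys):
    """Collect every key by following markers; assumes max_keys >= 1."""
    marker, result = "", []
    while True:
        page, truncated = list_page(keys, marker, max_keys)
        result.extend(page)
        if not truncated:
            return result
        # The last key of this page becomes the marker for the next request.
        marker = page[-1]


KEYS = ["d", "a", "c", "b"]
first_page = list_page(KEYS, max_keys=3)
second_page = list_page(KEYS, marker="c", max_keys=3)
everything = list_all(KEYS, max_keys=3)
```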

Conclusion

Amazon S3 provides a lot of the same functionality as described by RESTful collections, with some slightly different usage patterns (object creation), some practical limitations (the number of buckets, bucket names), and some additional capabilities (name-based searching in the collection). It's something I want to get started playing with, especially since it's pretty inexpensive to play, as long as you're not storing a lot of data or accessing it a lot.

Two things that I didn't talk about, but are quite interesting, are the authentication / authorization story with S3, and the exception story. Go read up. Though the auth story seems a bit overkill, the exception story is something we should start talking about w/r/t RESTful collections.

The If- header support on the PUT method used for object creation is something that I need to look into via some experiments. If it's not supported, I'll propose (forum?) that they add it, and figure out a work-around in the meantime.

One interesting exercise would be to implement the RESTful collections pattern on top of S3, which would presumably be a fairly thin veneer. One thing in particular that would need to happen is to convert the XML returned by something like the 'list objects' functionality, potentially, into some other content-negotiated content; JSON and HTML being the obvious examples, because I don't like XML, and I love JSON and HTML.
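
The XML-to-JSON part of such a veneer could look something like the sketch below. The sample XML is a simplified, hand-written stand-in for a 'list objects' response, not captured S3 output. The listing_to_json function and the shape of the JSON it emits are likewise invented for this example.

```python
import json
import xml.etree.ElementTree as ET

# Simplified, hand-written stand-in for a 'list objects' response.
SAMPLE_XML = """<ListBucketResult>
  <Name>example-bucket</Name>
  <Contents><Key>home/pick-up-milk</Key><Size>12</Size></Contents>
  <Contents><Key>work/submit-tps-report</Key><Size>34</Size></Contents>
</ListBucketResult>"""


def local(tag):
    """Drop any '{namespace}' prefix ElementTree puts on a tag name."""
    return tag.rsplit("}", 1)[-1]


def listing_to_json(xml_text):
    """Convert the object entries of a bucket listing to a JSON string."""
    root = ET.fromstring(xml_text)
    objects = []
    for child in root:
        if local(child.tag) == "Contents":
            entry = {local(field.tag): field.text for field in child}
            objects.append({"name": entry["Key"], "size": int(entry["Size"])})
    return json.dumps({"objects": objects})


listing_json = listing_to_json(SAMPLE_XML)
```

An HTML rendering would follow the same pattern, with the Accept header of the incoming request deciding which converter to run.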

Updates

 

2006-10-16

I've uploaded my little command-line S3 utility, here: http://www.muellerware.org/projects/s3u/index.html.

As an update to my claim that If- doesn't work with S3, this thread claims that it does work, for HEAD. But not for PUT. Yet.

 

2006-10-15

So, my fears were right; S3 does not currently support the use of If-Match and If-None-Match for the 'put object' functionality. Here's a message flow using If-None-Match to try to enforce the PUT of an object which doesn't already exist.

-----------------------------------------------------
Request
-----------------------------------------------------
PUT https://s3.amazonaws.com/org.muellerware.testing/a/a.html
Server: s3.amazonaws.com
Date: Mon, 16 Oct 2006 05:09:17 GMT
Authorization: AWS xxxx:yyyy
x-amz-acl: private
If-None-Match: *
Content-Length: 19

-----------------------------------------------------
Response
-----------------------------------------------------
HTTP/1.1 501 Not Implemented
x-amz-request-id: xyz
x-amz-id-2: xyz
Content-Type: application/xml
Transfer-Encoding: chunked
Date: Mon, 16 Oct 2006 05:09:25 GMT
Connection: close
Server: AmazonS3

<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>NotImplemented</Code><Message>A header you provided implies 
functionality that is not implemented</Message><RequestId>xyz
</RequestId><Header>If-None-Match</Header><HostId>xyz</HostId></Error>
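
A client can pull the useful fields out of an error body like this one with a few lines of Python. This is a sketch only: the parse_error helper is a made-up name, and the XML literal is the response above with the line wrapping removed.

```python
import xml.etree.ElementTree as ET

# The error body from the response above, with line wrapping removed.
ERROR_XML = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b"<Error><Code>NotImplemented</Code>"
    b"<Message>A header you provided implies functionality "
    b"that is not implemented</Message>"
    b"<RequestId>xyz</RequestId><Header>If-None-Match</Header>"
    b"<HostId>xyz</HostId></Error>"
)


def parse_error(xml_bytes):
    """Return the child elements of an <Error> document as a dict."""
    root = ET.fromstring(xml_bytes)
    return {child.tag: (child.text or "").strip() for child in root}


error = parse_error(ERROR_XML)
```

With the Code and Header fields in hand, a caller can tell 'this header is unsupported' apart from other failures and fall back to an unconditional PUT or some other strategy.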