Ideal place to store binary data that can be rendered by calling a URL

I am looking for an ideal (performant and maintainable) place to store binary data. In my case these are images. I have to do some image processing, scale the images, and store them in a suitable place that can be accessed via a RESTful service.

From my research so far I have a few options, like:

  1. A NoSQL solution like MongoDB's GridFS
  2. Storing the files in a file-system directory hierarchy and using a web server to access the images by URL
  3. The Apache Jackrabbit document repository
  4. Storing in a cache, something like Memcached or a Squid proxy

Any thoughts on which one you would pick and why would be useful. Or is there a better way to do it?

Klimt answered 2/12, 2011 at 14:44 Comment(0)

Just started using GridFS to do exactly what you described.

From my experience thus far, the main advantage of GridFS is that it obviates the need for a separate file storage system. Our entire persistence layer is already in Mongo, so the next logical step was to store our filesystem there as well. The flat namespacing just rocks and gives you a rich query language to fetch your files based off whatever metadata you want to attach to them. In our app we used an 'appdata' object that embedded all the ownership information.

Another thing to consider with NoSQL file storage, and especially GridFS, is that it will shard and expand along with your other data. If your entire DB lives inside the mongo server, then if you ever have to expand your server cluster with more machines, your filesystem will grow along with it.

It can feel a little 'black box' since the binary data itself is split into chunks, a prospect that frightens those used to a classic directory-based filesystem. This is alleviated with the help of admin programs like RockMongo.

All in all, storing images in GridFS is as easy as inserting the docs themselves, and most of the drivers for the major languages handle everything for you. In our environment we took image uploads at one endpoint and used PIL to perform the resizing. The images were then fetched from mongo at another endpoint that just output the data with a JPEG MIME type.
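
For illustration, here's a minimal sketch of that flow using Flask, Pillow (the maintained fork of PIL), and PyMongo. The route names, field names, and size limit are assumptions for the sketch, not our exact code:

from io import BytesIO

import gridfs
from bson import ObjectId
from flask import Flask, Response, request
from PIL import Image
from pymongo import MongoClient

app = Flask(__name__)
fs = gridfs.GridFS(MongoClient().test_db)

@app.route('/images', methods=['POST'])
def upload_image():
    # Resize the uploaded image before storing it in GridFS.
    img = Image.open(request.files['image'])
    img.thumbnail((800, 800))  # assumed maximum dimensions
    buf = BytesIO()
    img.convert('RGB').save(buf, format='JPEG')
    file_id = fs.put(buf.getvalue(), filename=request.files['image'].filename)
    return str(file_id), 201

@app.route('/images/<file_id>')
def fetch_image(file_id):
    # Stream the stored bytes back out with a JPEG MIME type.
    data = fs.get(ObjectId(file_id)).read()
    return Response(data, mimetype='image/jpeg')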

Best of luck!

EDIT:

To give you an example of a trivial file upload with GridFS, here's the simplest approach in PyMongo, the Python driver.

from pymongo import MongoClient  # Connection was removed in modern PyMongo; MongoClient replaces it
import gridfs

binary_data = b'Hello, world!'  # GridFS stores raw bytes

db = MongoClient().test_db
fs = gridfs.GridFS(db)
# The filename kwarg sets the filename in the mongo doc, but you can pass anything in
# and make custom key-values too.
file_id = fs.put(binary_data, filename='helloworld.txt', anykey='foo')
output = fs.get(file_id).read()
print(output)
# >>> b'Hello, world!'

You can also query against your custom values if you like, which can be REALLY useful if you want your queries to be based off custom information relative to your application.

try:
    file = fs.get_last_version(anykey='foo')  # keyword args become query filters
    return file.read()
except gridfs.errors.NoFile:
    return None

These are just some simple examples, and the drivers for a lot of the other languages (PHP, Ruby, etc.) all have cognates.

Malamut answered 2/12, 2011 at 15:6 Comment(2)
Thanks for sharing, really appreciate it. Do you think reading from disk I/O is more expensive, or was having all the data in one place the reason for keeping it in mongo? And how is it performing so far?Klimt
File IO time didn't really factor into our decision, though for reference the fetch time is comparable to a standard indexed query in SQL. Since the volume of files is extremely high, the attraction of having one big namespace that could be sharded horizontally was the main reason. Using GridFS makes it so that directory structure is no longer an issue, and your files can be fetched and inserted using the API drivers. It worked great in a RESTful app where the URL requests determined the response.Malamut

I would go for Jackrabbit in combination with its REST framework Sling: http://sling.apache.org

Sling allows you to upload/download files via REST calls or WebDAV, while the underlying Jackrabbit repository gives you performant storage with the possibility to store your files in a tree structure (or flat, if you like).

Both Jackrabbit and Sling support an event mechanism where you can asynchronously process the image after upload, e.g. to create thumbnails.

The manual at http://sling.apache.org/site/manipulating-content-the-slingpostservlet-servletspost.html describes how to manipulate data using the REST interface provided by Sling.
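
As a rough sketch of what an upload to the SlingPostServlet looks like (assuming a local Sling instance on port 8080 with the default admin credentials; the path and file names are illustrative):

import requests

# POSTing multipart form data to a path creates/updates content there;
# a file parameter becomes a file node named after the parameter.
with open('portrait.jpg', 'rb') as f:
    resp = requests.post(
        'http://localhost:8080/content/images',
        auth=('admin', 'admin'),
        files={'portrait.jpg': f},
    )
resp.raise_for_status()
# The image should then be downloadable at
# http://localhost:8080/content/images/portrait.jpg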

Lucknow answered 2/12, 2011 at 19:31 Comment(0)

Storing the images as BLOBs in an RDBMS is another option: you immediately get some guarantees about integrity, security, etc. (if the database is set up properly), and you can store extra metadata and manage the collection with SQL.
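
A minimal sketch of this approach with SQLite (any RDBMS with a BLOB type works similarly; the table and file names are made up):

import sqlite3

conn = sqlite3.connect('images.db')
conn.execute('CREATE TABLE IF NOT EXISTS images (id INTEGER PRIMARY KEY, name TEXT, data BLOB)')

# Store the raw image bytes alongside whatever metadata you need.
with open('photo.jpg', 'rb') as f:
    cur = conn.execute('INSERT INTO images (name, data) VALUES (?, ?)',
                       ('photo.jpg', f.read()))
conn.commit()

# Fetch it back by primary key (or any other indexed column).
name, data = conn.execute('SELECT name, data FROM images WHERE id = ?',
                          (cur.lastrowid,)).fetchone()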

Carbineer answered 2/12, 2011 at 15:2 Comment(1)
It should be noted that in applications where the volume of files being put into the system is very high, this isn't always an option. The blobs are stored as full files and not chunked, so the row values can get really large and make DB backups dramatically larger. One should always weigh replication and input volume before going with this option.Malamut
