What is Object Storage

Submitted by Sarath Pillai on Mon, 05/05/2014 - 03:12

Recently I happened to read a Top 10 predictions report from IDC. Apart from the predictions related to IT spending, cloud, and mobile users, what amazed me was the growth of data and its enormous predicted size. IDC predicts that the total volume of data will grow at a rate of 50% each year.

IDC also predicts that by 2020 the volume of data will reach around 40 zettabytes (1 zettabyte is equal to a billion terabytes). Another important fact about this gigantic amount of data is that 90% of it will be unstructured.

Well, you must be thinking, “what is unstructured data?” I happened to read a simple definition that pretty much explains what it is all about.

“Anything other than a database can be called unstructured data.” Well, that’s too broad a statement. More precisely, unstructured data usually does not reside in rows and named columns. The problem with unstructured data is that it is quite difficult to analyze, which makes it hard to base any meaningful decision on it.

Some examples of unstructured data include audio files, video files, image files, and emails. These are the types of data we use most on a daily basis, so you can imagine why unstructured data accounts for 90 percent of the predicted size of all data.

Growth of Unstructured Data per Year

To understand this enormous explosion of unstructured data, simply imagine the number of times you share an image, a video, or an interesting document with your friends online using social media platforms like Twitter, Facebook, or Google+. Or imagine the number of times you open and save your critical data in SkyDrive, Google Drive, Dropbox, or any other cloud-based storage service. The architectures supporting those platforms have to deal with this gigantic amount of user-generated information on a daily basis, and they have to scale their storage as the number of users and the amount of their data increase.

My daily work revolves around the AWS cloud. AWS is one of the fastest growing public clouds out there, and very large Fortune 500 companies run their entire architecture inside it. I always wondered how AWS manages this huge amount of storage and computing power with exceptional uptime. They have a storage-as-a-service platform called Simple Storage Service, more popularly known as S3.

As early as 2011, S3 (which is mainly used for unstructured data storage, with a high level of redundancy and uptime) surpassed 500 billion objects (we will discuss objects shortly).

Rackspace Cloud Files, AWS S3, Google Cloud Storage, and Windows Azure Storage all deal with this gigantic volume of user-generated unstructured data.

In 2012, Microsoft Azure also crossed the mark of 4 trillion objects in its public cloud storage.

One thing is clear from these numbers: ordinary file-system-based storage is not going to scale to these amounts.

Most of us store unstructured data on our computers, organized in a meaningful manner in the form of named directories. Named directories (or folders) help organize data in a way that is friendly to both humans and programs. The underlying storage devices store the user's data in blocks. Thanks go to the programmers who developed file systems to do the rigorous task of allocating free blocks to store files. A file system knows which blocks on the underlying device are free to store new incoming data, and it marks blocks as free again when a user deletes data.

So, as I just mentioned, we store files inside different directories, and we can have directories inside other directories. Basically, this is a hierarchical storage system.

In technical terms, it is a hierarchical name space. If you come from a Unix background, you must have already heard about inodes. Every file and directory has an inode number, and a directory contains a table that maps the name of each file inside it to that file's inode number. If you are new to inodes, I recommend my articles below.

Read: Understanding Inode and its structure in Linux

Read: What happens when a file is deleted in Linux

So if I have a directory with a million files inside, that directory has to hold the file name to inode number mappings for all of those million files. This becomes a tough task for a file system, and most of them suffer performance issues in such cases. And if the file system gets corrupted, a program like fsck can take days to complete. Hence, storing a very large number of files inside a file system (like the amount of data we discussed in the beginning) is not at all feasible.
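
To make the name to inode mapping concrete, here is a minimal Python sketch that reads the mapping out of a directory. The helper name_to_inode and the three sample file names are my own illustration; only os.scandir and its inode() method come from the standard library.

```python
import os
import tempfile


def name_to_inode(directory):
    """Return the {file name: inode number} mapping held by a directory."""
    with os.scandir(directory) as entries:
        return {entry.name: entry.inode() for entry in entries}


# Create a throwaway directory with a few files, then read its mapping.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.txt", "c.txt"):
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write("sample")
    mapping = name_to_inode(tmp)

for name, inode in sorted(mapping.items()):
    print(name, inode)
```

With three files the table is tiny. With a million files, the file system has to maintain and search a million of these entries for a single directory.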

Most of the time, we store files in a hierarchical manner for human understanding. For example, you might have a folder called Photos, inside which you have multiple folders like wedding, graduation, birthday, and so on. The file system also follows this hierarchy to uniquely identify a file.

Searching for and locating files inside a large hierarchical name space is difficult and time consuming. Hence we need a flat name space (where all files reside inside a single container). Although we say all files live in a single name space, humans still need a folder-based structure to classify their files (but to the system, all files in all folders of that container should be in one flat name space). We need storage that does not struggle when the number of files grows large. In fact, given today's storage requirements, we need a storage method that can scale practically without limit.

Another major problem with unstructured data, as we discussed in the beginning, is that it is difficult to extract meaningful information from it. From a database, for example, I can search for older records that are completed and no longer used, and delete them. Or I can fetch details such as how many users purchased a particular item (provided these things are stored in a database).

Most of the time, unstructured data is stored and never used again. The user may have needed that data only for a certain period of time, yet it keeps occupying storage space unnecessarily. We also cannot get many details from the file itself (for example, what the content is about or until when it will be needed; if the content is application generated, the file carries no such information at all). We can only get details like when the file was created, when it was modified, who the owner is, and where on the disk it resides (this is basically the metadata of a file).

Usually this kind of metadata is stored inside the inode of a file. Although it is useful from an operating system perspective, it is not very useful to a user or an application (because it says nothing about the content of the file).
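
As a rough illustration of how little the file system knows about a file, the sketch below prints the metadata that os.stat returns. The helper system_metadata and the sample file content are assumptions of mine. Note that on Unix the portable fields cover modification time, owner, and size, but nothing about what the content means.

```python
import os
import tempfile
import time


def system_metadata(path):
    """Collect the file-system level metadata kept for a file."""
    info = os.stat(path)
    return {
        "inode": info.st_ino,
        "size_bytes": info.st_size,
        "owner_uid": info.st_uid,
        "modified": time.ctime(info.st_mtime),
    }


# Write a small sample file, inspect it, then clean up.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("episode notes")
    path = handle.name

meta = system_metadata(path)
os.remove(path)
print(meta)
```

Nothing in this dictionary tells an application which series the notes belong to or how long they must be kept.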

Another problem with traditional methods of storing unstructured data inside a file system is that they do not provide per-file security. You can achieve per-folder security to a certain extent with the help of NAS (Network Attached Storage), but not foolproof security for each file. We also have no method to mark a certain file as more important than another (so that it can be stored in a more redundant manner to protect it from disasters).

Related: Difference between NAS and SAN

What is an Object Storage?

The solution to these problems of storing large amounts of unstructured data inside traditional file systems is a technology called Object Storage. In the beginning we discussed a couple of large public cloud storage solutions like Amazon S3, Rackspace Cloud Files, and Windows Azure Storage. All of these are object storage. The working principle of all these large public cloud storage services is the same, even though they are named differently. Outlined below are some noteworthy points about them.

Unstructured data (what we call files) is stored as objects in object storage (we will discuss objects shortly).
Almost all cloud object storage solutions require a user account to store objects. A user can have containers (a container is nothing but a unique flat name space identified by a name). In Amazon S3 terminology, containers are called buckets.
A container cannot hold another container.
Objects are stored inside the container. Storing, deleting, retrieving, modifying, and all other such operations are carried out through a REST API based interface.
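
The points above can be sketched as plain HTTP requests. The snippet below only builds the requests and does not send them. The endpoint objects.example.com, the container name photos, and the helper object_request are hypothetical, and a real service would also require authentication headers that I have left out.

```python
from urllib.request import Request

BASE_URL = "https://objects.example.com"  # hypothetical endpoint


def object_request(method, container, name, data=None):
    """Build (but do not send) an HTTP request for one object."""
    url = f"{BASE_URL}/{container}/{name}"
    return Request(url, data=data, method=method)


# Store, retrieve, and delete the same object in the "photos" container.
upload = object_request("PUT", "photos", "wedding/img001.jpg", data=b"...jpeg bytes...")
download = object_request("GET", "photos", "wedding/img001.jpg")
delete = object_request("DELETE", "photos", "wedding/img001.jpg")

for request in (upload, download, delete):
    print(request.get_method(), request.full_url)
```

The URL path is simply container name plus object name. There is no directory to open or traverse first.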

Object storage is mainly aimed at your application's storage needs. Applications can access object storage directly through its API, without the operating system in between. It is aimed at addressing your application's enormous unstructured data storage needs. Object storage is best suited for storing digital media content, archives, backups, log files, and so on. It is not suited for relational database storage and cannot be used to boot an operating system. Although there are FUSE drivers that mount an object storage container on your operating system, similar to mounting NFS storage, this is not well suited for your applications (FUSE adds another layer of overhead, which hurts both reliability and performance).

Related: How to performance tune NFS

What are Objects in an Object Storage?

The main difference lies in how we access files in a file system and how we access objects in object storage. In a file system we read files, whereas in object storage we use the HTTP GET method to get an object. In a file system we write files, whereas in object storage we use the HTTP PUT (or POST) method to write an object. This is the basic difference in access, but there are other important differences.

In a regular file system, as we discussed earlier, we store files in an organized hierarchical folder structure, while in object storage we store objects in a single flat name space for faster retrieval (please don't forget that most object stores provide a folder-like feel, which allows you to create folders inside your container, but the files in all folders inside a container are in a single flat name space).
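
Here is a small sketch of how a folder-like view can be derived from a flat name space. The container is modelled as an ordinary Python dictionary keyed by the full object name, and the function list_folder, the "/" delimiter, and the sample keys are assumptions for illustration, not the API of any particular service.

```python
def list_folder(keys, prefix="", delimiter="/"):
    """Emulate a folder listing over a flat collection of object keys."""
    folders, files = set(), set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Something follows the delimiter, so show it as a sub-folder.
            folders.add(prefix + head + delimiter)
        else:
            files.add(key)
    return sorted(folders), sorted(files)


# One flat container: every object sits at the same level.
container = {
    "photos/wedding/img001.jpg": b"...",
    "photos/wedding/img002.jpg": b"...",
    "photos/graduation/img001.jpg": b"...",
    "readme.txt": b"...",
}

print(list_folder(container))
print(list_folder(container, prefix="photos/"))
print(list_folder(container, prefix="photos/wedding/"))
```

The folders exist only in the listing. Fetching an object is still a single lookup by its full key, no matter how deep the name looks.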

Object Storage Flat Name Space and virtual View to user

From a data point of view, there is not much difference between an object and a file. The main advantage is that object storage overcomes the limits a file system imposes on storing large amounts of data. So if I have a file in my file system, uploading it to an object store like S3 turns it into an object with a unique identifier that is later used to retrieve it quickly. While I was reading about object storage, most sources compared it to a parking ticket.

When you park your car in a parking lot, the attendant there parks your car and hands you a ticket. While you are away, your car can be moved to another location to accommodate other cars, but the attendant will return your car to you when you hand the parking ticket back.

Similar to the parking ticket, we use globally unique identifiers to uniquely identify an object in object storage. This globally unique identifier can be a hash value of your data (the hash changes as the data changes). But we, as humans, cannot remember those hash values to retrieve the object at a later point in time. Hence most public cloud object storage solutions give the user the ability to define their own unique identifier while storing an object. Most of the time this is the object name itself (the underlying object storage can maintain a database that maps the user-defined object name to the hash value of the object, so the object can still be retrieved uniquely).
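
The idea of a content hash plus a name lookup table can be sketched in a few lines of Python. The class TinyObjectStore is purely hypothetical and keeps everything in memory. I picked SHA-256 as the hash only for the example, since I do not know which identifier scheme any given public service uses internally.

```python
import hashlib


class TinyObjectStore:
    """In-memory sketch: objects keyed by content hash, found by name."""

    def __init__(self):
        self._blobs = {}  # content hash -> object data
        self._names = {}  # user-defined name -> content hash

    def put(self, name, data):
        object_id = hashlib.sha256(data).hexdigest()
        self._blobs[object_id] = data
        self._names[name] = object_id  # the "parking ticket" for this name
        return object_id

    def get(self, name):
        return self._blobs[self._names[name]]


store = TinyObjectStore()
first_id = store.put("report.txt", b"version 1")
second_id = store.put("report.txt", b"version 2")

print(first_id != second_id)  # the hash changes when the data changes
print(store.get("report.txt"))
```

The user only ever deals with the name report.txt, while the store is free to locate the data by its hash.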

Apart from the unique and fast identification of your object (in a flat name space), object storage provides another big advantage to developers. While designing an application, developers usually need a database to store details about a particular file. Take, for example, an application for storing media files for a television series, with multiple users uploading content. I must have a way to store things like the title of the series, the part number if it is a multi-part series, the author of the series, and much more. The application might need this metadata to display to an end user.

Object storage provides a mechanism to include user-defined metadata (called custom metadata) while storing an object. All major public cloud object storage solutions (S3, Google Cloud Storage, Azure, OpenStack Swift, etc.) provide custom metadata fields. This metadata is assigned as user-defined name-value pairs while storing the object using PUT/POST.
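
In practice, custom metadata usually travels as HTTP headers on the PUT request. OpenStack Swift uses the header prefix X-Object-Meta- and Amazon S3 uses x-amz-meta-. The sketch below only builds the header dictionaries. The helper metadata_headers and the sample series details are made up for illustration.

```python
def metadata_headers(metadata, prefix):
    """Turn name-value pairs into prefixed custom metadata headers."""
    return {f"{prefix}{name}": str(value) for name, value in metadata.items()}


# Hypothetical details for one uploaded episode of a television series.
episode = {"series-title": "Example Show", "part": 3, "author": "Jane Doe"}

swift_headers = metadata_headers(episode, "X-Object-Meta-")
s3_headers = metadata_headers(episode, "x-amz-meta-")

for name, value in swift_headers.items():
    print(f"{name}: {value}")
for name, value in s3_headers.items():
    print(f"{name}: {value}")
```

The application can later read these values back together with the object, without keeping a separate database just for the file details.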

Contents of an Object in Object Storage

Below mentioned are some points about objects and object storage in general.

An object can be viewed as a file with all its metadata, all in a single bundle.
Each object stored in object storage has a unique ID (which mostly depends on the content of the object), but for the convenience of users and applications, the user can assign their own object ID.
An object is retrieved from object storage using its unique ID.
Objects are stored in a flat name space container.
An object is not limited in the type or amount of metadata it carries.
An object in a container can be local or remote.
The level of data protection, the replication level, moving objects to a different storage tier, and so on can also be controlled through the respective metadata while storing an object.
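
The points above can be pictured as one small data structure. This is only a sketch: the class StoredObject, its field names, and the metadata keys replicas and storage-tier are hypothetical, and real services define their own names for such settings.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """A file and all of its metadata, bundled together."""

    name: str                                            # user-defined identifier
    data: bytes                                          # the content itself
    system_metadata: dict = field(default_factory=dict)
    custom_metadata: dict = field(default_factory=dict)

    @property
    def content_id(self):
        # Identifier derived from the content, as described above.
        return hashlib.sha256(self.data).hexdigest()


backup = StoredObject(
    name="backups/2014-05-05.tar.gz",
    data=b"...archive bytes...",
    system_metadata={"content-type": "application/gzip"},
    custom_metadata={"replicas": "3", "storage-tier": "archive"},
)

print(backup.name, backup.content_id[:12], backup.custom_metadata)
```

Because protection level and tier are just metadata on the object, two objects in the same container can be treated differently.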

What about the hardware, and storage methods used with Object Storage?

I do not have any internal architecture details about most of the public cloud storage services discussed previously. But an open source object storage solution like OpenStack Swift can give you an in-depth idea about building similar large architectures. OpenStack Swift provides almost all the features provided by any enterprise public cloud storage solution.

Any large storage architecture that needs to scale without bounds needs to be based on a distributed cluster (as scaling requires both storage and compute power).

One such file-based storage system that can scale up to the petabyte level is Gluster. It, too, is distributed cluster-based storage, where adding more storage nodes increases the storage capacity as well as the processing capacity. If you are new to Gluster, I recommend my article below.

Read: What is Gluster file system and how it works

As far as the hardware of object storage is concerned, it is mainly based on normal commodity servers with their own directly attached storage devices. Similar to Gluster, adding more commodity storage nodes increases both the storage space and the performance of your object storage (which is why it is sometimes called a linearly scaling solution).

Yes, you can use RAID arrays on those commodity servers to add an extra level of protection to your storage nodes, but this adds further cost. Commodity hardware with regular directly attached storage devices is the best fit, as the object storage architecture itself provides multiple levels of redundancy and stores objects on multiple storage nodes.
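
To give a feel for how software alone can spread copies of an object over commodity nodes, here is a simplified sketch. It uses rendezvous (highest random weight) hashing, which I chose for brevity. It is not the placement algorithm of OpenStack Swift or of any public cloud, and the node names and the function place_replicas are hypothetical.

```python
import hashlib


def place_replicas(object_name, nodes, replicas=3):
    """Pick `replicas` distinct nodes for an object by ranking hash scores."""

    def score(node):
        digest = hashlib.sha256(f"{node}:{object_name}".encode()).hexdigest()
        return int(digest, 16)

    # The highest-scoring nodes hold the copies; the choice is deterministic.
    return sorted(nodes, key=score, reverse=True)[:replicas]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
placement = place_replicas("photos/wedding/img001.jpg", nodes)
print(placement)

# If one of the chosen nodes fails, the other copies are still reachable.
survivors = [node for node in placement if node != placement[0]]
print(survivors)
```

Every object gets its copies on different servers, so losing a single disk or node does not lose data, even without RAID underneath.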

OpenStack Swift even allows you to protect your objects at the data center level. In other words, multiple data centers hold copies of the same data, so the system can withstand the outage of an entire data center that contains your storage nodes.

So, as a footnote to the hardware requirements, object storage does not require any special hardware (only commodity servers with their internal storage are required). Most implementations of object storage are software only (the software combines multiple individual hardware nodes, for both compute and storage, into one or more object containers).

Compatibility with different devices

Any application running on any operating system can access data objects through a RESTful API with ease. In short, object storage does away with the operating-system-dependent file system storage method. The level of security provided by object storage also allows you to define security per object.

With an object storage system, you can define an object as publicly accessible, so that it can be downloaded or accessed using a simple web browser on any device. As it is based on HTTP methods, there are no compatibility issues while accessing it.

Use cases of Object Storage

The main use case of object storage is archival (where data is stored as backup).
It is not well suited for database storage, and is not designed for databases.
You cannot boot an operating system from object storage.
If your requirement is to grow without bounds, nothing can beat object-based storage.
It is storage for your application, not for your operating system.
Node level and data center level replication with multiple copies (all done through metadata).
A single name space that can span different geographical locations.

In the coming days, I will be writing an article series on the OpenStack cloud, with complete tutorials to build your own public/private cloud. In that series we will discuss OpenStack Swift object storage with all its components, and the methods to build an entire object storage architecture. I hope this article was helpful in giving you an introductory idea of object storage.


Comments

thanks

Permalink Submitted by Dennis on Sun, 07/10/2016 - 16:01

I'm new to the linux world, while looking at a big implementation. The sheer info on storage systems was quite overwhelming, so your posts really helped! next up is the openstack/gluster series! ;-)
