Designing Big Data Storage Infrastructures
Big Data turns the traditional storage problem upside down. Until now, growth in storage consumption has been viewed as a negative, something to be dealt with. Storage growth and the need to retrieve information quickly are a continual challenge, typically handled by begrudgingly adding more capacity and by policies that push users to delete unnecessary information or move it to less expensive storage tiers. Big Data takes a different perspective: growth in data is no longer viewed as more bytes to store but as more information to mine for value. It assumes that if information is a weapon, then the more of it that's available, the better.
What is Big Data?
Big Data is essentially a combination of processes and infrastructure that allows organizations to capture, analyze and manipulate large sets of structured (database) and unstructured (file) data, with the eventual goal of extracting value from that data. The particular value will differ from organization to organization: it may be assembling the right information to more accurately forecast the weather in the near term, or using similar data to predict when climate shifts may occur over the long term. Large data sets may also be used to identify changes in trading patterns that an investment firm can use to guide investment strategies, or to spot business trends that a manufacturer or web services provider can exploit to bring a new 'got-to-have' product to market sooner.
Data Growth is the Common Denominator
While the above examples cross industries and data types, the common thread of Big Data is the need to capture the complete set of raw data up front. Often there is no advance warning of which data will need to be accessed at any given point in time, and in many cases much of the data set needs to be online so that all the stored data points can be queried. As the rationale goes, the more data there is, the more accurate the predictions drawn from it can be. This 'most of the data, most of the time' requirement puts new demands on the storage infrastructure that supports Big Data.
Requirements for Big Data Storage Infrastructure
Legacy storage infrastructures are not well suited to the 'no limits' storage strategy that Big Data imposes. While large capacities can be purchased upfront, doing so leads to tremendous unused capacity in the initial stages of deployment; it is more cost effective to add capacity as needed. A Big Data infrastructure needs to scale capacity while maintaining performance so that it can provide rapid answers when queried. It also needs to be simple to manage. Old paradigms that assign a number of storage administrators per terabyte owned are no longer economically feasible; a single storage administrator will need to manage petabytes of information to keep the cost equation in line. Given these requirements, the ideal infrastructure for Big Data may be scale-out storage.
The Need For Scale
The ability to scale is potentially the most critical aspect of a Big Data storage infrastructure. Big Data capacities start in the hundreds of terabytes (TBs) and grow to tens of petabytes (PBs). The challenge that Big Data places on storage infrastructures is not necessarily the starting or ending capacity, but the rate of growth within that range. In a legacy scale-up storage system, where the ability to support a given capacity is purchased upfront, the start-up cost of a system capable of scaling into the PBs is often too great to justify. It is simply unreasonable to expect a standard scale-up infrastructure to cover this range of capacity demands.
Finally, the fundamental architectures at the foundation of traditional scale-up storage were designed decades ago with much smaller capacities in mind. As a result, the typical 16TB volume size limit is minuscule compared to the needs of Big Data; even increased by a factor of five, it falls short of what typical Big Data environments will need. Scale-out NAS systems like Isilon's are much better suited to this type of Big Data environment. Scale-out NAS is a file-based storage system composed of a group of servers clustered together as a single system. Each file server, called a 'node', can be added to the cluster in real time without taking the system down. Each node added to the system increases storage capacity, processing performance and network I/O bandwidth, allowing the system to grow all three parameters in unison.
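The scale-out principle described above can be sketched in a few lines of code. This is a minimal illustrative model, not a description of any real product: the per-node figures for capacity, processing power and network bandwidth are hypothetical, chosen only to show how all three grow in unison as nodes join the cluster.

```python
# Minimal sketch of scale-out NAS growth: each node added to the cluster
# contributes capacity, processing power and network I/O at once, so the
# three parameters scale together. All figures are hypothetical.

class Node:
    def __init__(self, capacity_tb=36, cpu_ghz=8.0, net_gbps=2.0):
        self.capacity_tb = capacity_tb
        self.cpu_ghz = cpu_ghz
        self.net_gbps = net_gbps

class ScaleOutCluster:
    def __init__(self):
        self.nodes = []

    def add_node(self, node):
        # In a real system a node joins the running cluster with no downtime;
        # here we simply append it.
        self.nodes.append(node)

    @property
    def capacity_tb(self):
        return sum(n.capacity_tb for n in self.nodes)

    @property
    def cpu_ghz(self):
        return sum(n.cpu_ghz for n in self.nodes)

    @property
    def net_gbps(self):
        return sum(n.net_gbps for n in self.nodes)

cluster = ScaleOutCluster()
for _ in range(4):
    cluster.add_node(Node())

print(cluster.capacity_tb, cluster.cpu_ghz, cluster.net_gbps)  # 144 32.0 8.0
```

The point of the sketch is the shape of the growth: doubling the node count doubles capacity, compute and bandwidth together, rather than adding disks behind a fixed controller.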
The Need For Performance
In addition to scaling capacity, a Big Data storage infrastructure needs to scale performance, measured in storage processing power, disk I/O and network I/O. Queries on Big Data can span billions of discrete files or billions of rows within databases, and a fast response time can mean meeting a production deadline, warning thousands of people of an impending disaster or cutting months off a medical breakthrough. The challenge is that in legacy storage architectures, the right amount of storage processing power, disk I/O and network bandwidth has to be planned for and purchased upfront. With Big Data, accurately predicting long-term storage growth, and more importantly affording the upfront cost of the required performance, is impractical. Once again, scale-out NAS provides an answer: it can be purchased in a relatively small capacity upfront, 50TBs for example, and then scaled into the PBs as more data comes in and as more data mining is performed. Each node brings the additional processing power and I/O required to support that growth; in other words, the scale-out NAS becomes faster as its capacity increases. That additional power can also drive data protection processes and storage optimization techniques like cloning, snapshots and automated tiering. The ability to scale network I/O is equally critical, since most Big Data initiatives support big compute infrastructures consisting of dozens, hundreds or thousands of servers, all potentially issuing thousands of requests.
The Need For “One-ness”
Big Data also works best when all the data assets appear to reside on a single volume, which is a challenge for traditional file systems and NAS solutions. While some of these file systems have only recently broken the 16TB limit, many are still bound to it.
Very few can support volume sizes beyond 100TB, and most have unclear roadmaps for how they will expand past that point. Volume size limitations like these will quickly become a problem in a Big Data infrastructure, if they have not already, forcing the storage administrator to manually chain multiple volumes of information together or write complex queries that do. Randomness of queries is one of the essential components of Big Data, and the storage infrastructure chosen for it should support a single volume of virtually limitless capacity, allowing all the data to be accessed in a single location. There is also the reality of administration. Multiple volumes lead not only to complex queries but also to complexity in traditional IT processes such as capacity allocation and data protection. Without a single volume, storage administrators are forced to inspect each volume manually to determine the best candidate for a new load of data. And since new data sets can arrive in large chunks, no single volume may have enough free space to handle the inbound data, leading to data being migrated back and forth between volumes to clear out enough space. The data protection process becomes more complicated in similar fashion.
As new storage is added, because current volumes cannot support inbound data sets, that storage has to be assigned to new volumes. It is very easy for the backup administrator to miss the addition of these new volumes, and data sets can go unprotected for weeks until the oversight is discovered. While many categories of storage can easily have additional capacity plugged into them, very few can have that capacity automatically assigned and made available to the applications that need it. With a multi-volume approach, capacity allocation has to stop until a storage administrator decides which new capacity should be added to which volume. This is often not a very scientific process, and accuracy of assignment is sacrificed for speed of capacity delivery. With the single-volume approach, all capacity is instantly added to one volume and instantly available to the application; no decisions have to be made and accuracy is not sacrificed. In fact, most storage managers will find that capacity utilization goes up dramatically with the one-volume approach, since no capacity is held captive on less active volumes.
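The allocation problem described above can be made concrete with a short sketch. The volume sizes and data set size below are hypothetical: the example shows how an inbound data set can fail to fit on any single volume even though the aggregate free space would easily hold it, which is exactly the fragmentation a single pooled volume avoids.

```python
# Illustrative sketch of multi-volume fragmentation versus a single volume.
# All sizes are hypothetical.

def place_on_volumes(free_tb_per_volume, dataset_tb):
    """Return the index of the first volume that can hold the data set, or None."""
    for i, free in enumerate(free_tb_per_volume):
        if free >= dataset_tb:
            return i
    return None

volumes = [12, 9, 14]   # free TB on three separate volumes (35 TB in total)
dataset = 20            # inbound data set, TB

assert sum(volumes) >= dataset              # aggregate free space is sufficient...
print(place_on_volumes(volumes, dataset))   # None -> but no single volume fits it

single_volume_free = sum(volumes)           # one pooled volume: 35 TB free
print(single_volume_free >= dataset)        # True -> allocation simply proceeds
```

In the multi-volume case the administrator must now migrate data between volumes to clear 20 TB of contiguous space; in the single-volume case the same physical capacity absorbs the load with no decision to make.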
The Need For Economies
Finally, the entire Big Data storage infrastructure has to be affordable. Once again the scale-out architecture has an advantage here, because capacity and performance can be added in a near-linear fashion. 'Pay as you grow', which has been the hallmark of scale-out NAS, turns into 'earn as you grow' in the Big Data context, since typically the more data that is available to the application and the business line owners, the better and more informed the decisions can be. One way to maximize these investments is to leverage storage tiering like that available within Isilon's OneFS operating environment. This allows sections of the Big Data asset to be stored on less expensive SATA storage and then promoted to higher-performance SAS, Fibre Channel or even solid state storage as query activity justifies it.
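The 'pay as you grow' economics can be sketched with back-of-the-envelope arithmetic. The cost per TB, target capacity and growth rate below are hypothetical, chosen only to show the shape of the comparison: the scale-up buyer spends the full amount on day one, while the scale-out buyer spreads the same spend across the actual growth curve.

```python
# Back-of-the-envelope comparison: buy full target capacity upfront (scale-up)
# versus buying capacity as growth requires it (scale-out). All figures are
# hypothetical and chosen only for illustration.

COST_PER_TB = 300        # hypothetical acquisition cost, $/TB
TARGET_TB = 2000         # eventual capacity (2 PB)
START_TB = 50            # initial scale-out purchase
ANNUAL_GROWTH_TB = 400   # assumed yearly data growth

upfront_cost = TARGET_TB * COST_PER_TB   # scale-up: the whole bill on day one

# Scale-out: buy only what each year's growth requires.
deployed = START_TB
yearly_spend = [START_TB * COST_PER_TB]
while deployed < TARGET_TB:
    step = min(ANNUAL_GROWTH_TB, TARGET_TB - deployed)
    deployed += step
    yearly_spend.append(step * COST_PER_TB)

print(upfront_cost)      # 600000, committed immediately
print(yearly_spend)      # the same total, spread over the growth curve
assert sum(yearly_spend) == upfront_cost
```

The totals are identical by construction; the argument in the text is about deferral, i.e. capital is committed only as the data (and the value mined from it) actually arrives.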
Big Data is more than the domain of scientists looking for cures or internet companies mining search logs to identify new services to monetize. It is also becoming a key resource for businesses looking to optimize business processes, improve production cycles and increase profitability. These organizations understand that information is indeed power, and it is becoming the responsibility of the data storage infrastructure to house and deliver that power. A Big Data storage infrastructure is a unique storage project compared to most other environments. While the requirements of Big Data, such as the ability to scale capacity and performance while maintaining reliability and cost effectiveness, are similar to those of other data center storage projects, what makes Big Data unique is the extreme to which it takes these requirements. Speed to scale, flexibility to scale and a watchful eye on cost are all critical, and it is these extreme requirements for which scale-out NAS is best suited.
By George Crump, Senior Analyst