Corporate Information Factory (CIF) Resources by Bill Inmon, Inmon Data Systems



THE FUTURE OF DATA WAREHOUSING: ALTERNATE STORAGE

Ask any data warehouse developer what medium the data will reside on, and the automatic answer is “high performance disk storage”.  Most data warehouse developers have never built a system on anything but high performance disk storage in their entire careers.  Indeed, many data warehouse developers are not even aware that there are alternatives to high performance disk storage.

There are many reasons why the volume of data in the warehouse is exploding:

  • data warehouses carry historical data,

  • data warehouses carry detailed data,

  • data warehouses carry data for which there is no known need,

  • data warehouses carry eCommerce data, and so forth.

In short, the volumes of data found in the data warehouse surpass anything ever seen before.

But when you look into the future to see what is in store for data warehouse storage, the answer is surprising - the future of data warehousing is NOT high performance disk storage, despite the strong track record of disk storage over the past twenty years and the protestations of the storage vendors.  Instead, high performance disk storage plays only a secondary role in the future of data warehousing. The real future of data warehousing lies in the storage media collectively known as "alternative storage".

ALTERNATIVE STORAGE

Alternative storage consists of two forms of storage - near line storage and secondary storage. Near line storage is siloed tape storage, in which cartridges of tape are managed robotically. The technology for siloed tape storage has been around for a long time and is certainly proven and mature.

Secondary storage is a form of disk storage in which the disk is slower, significantly less expensive, and less heavily cached than high performance disk storage.

There are many reasons why alternative storage fits well with the data warehouse environment. Perhaps the most fundamental is that data warehouse data is very stable. Data is put into the warehouse as a time stamped snapshot. If there is a change in the data that the warehouse needs to be aware of, a new snapshot is made. The old snapshot of data remains undisturbed. Because of this mode of storing data, no updates are made in the data warehouse.  Ultimately this style of storage and processing results in very stable data.  The stability of the data fits very nicely with the "write once" data found in near line storage.
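The snapshot mode described above can be illustrated with a minimal sketch. The names used here (`Snapshot`, `SnapshotStore`, the `balance` field and the account key) are hypothetical and chosen only for illustration; they do not describe any particular product. The point is simply that a change produces a new time stamped record while earlier records are never modified.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Snapshot:
    """One immutable, time stamped record of an entity's state (hypothetical layout)."""
    key: str
    as_of: date
    balance: float


class SnapshotStore:
    """Append-only store: a change adds a new snapshot and never updates an old one."""

    def __init__(self):
        self._rows = []

    def record(self, key, as_of, balance):
        """Append a new snapshot; existing snapshots are left undisturbed."""
        snap = Snapshot(key, as_of, balance)
        self._rows.append(snap)
        return snap

    def history(self, key):
        """All snapshots for a key, oldest first."""
        return sorted((r for r in self._rows if r.key == key), key=lambda r: r.as_of)

    def as_of(self, key, when):
        """Latest snapshot taken on or before `when`, or None if there is none."""
        candidates = [r for r in self.history(key) if r.as_of <= when]
        return candidates[-1] if candidates else None


store = SnapshotStore()
store.record("acct-1", date(2000, 1, 31), 100.0)
store.record("acct-1", date(2000, 2, 29), 250.0)  # a change is a new snapshot, not an update
```

Because nothing is ever rewritten in place, data organized this way can sit on a "write once" medium without loss of function.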

But there are some other reasons why data warehouse data fits nicely on alternative storage. The next reason is that the queries that operate on warehouse data need long streams of data, and oftentimes that data is stored sequentially. Online processing places a constant demand for different units of data from different parts of the disk device; data warehouse processing is fundamentally different, reading long sequential runs of data. Both near line storage and secondary storage fit this kind of job stream very nicely.

Another very important reason for alternative storage is the need to store many, many records in the data warehouse. Because data warehouses store detailed and historical data, they contain far more data than their OLTP brethren. The ability to store far more data on near line and/or secondary storage is a very important reason why high performance disk storage is not the future of data warehousing.

Not only can much greater volumes of data be stored in alternative storage, but those massive volumes can be stored much less expensively than on high performance disk storage. How much less expensively? About an order of magnitude.

One can hear the high performance disk vendor proclaim - "but hardware is getting cheaper all the time". The hardware vendors who wish to maintain the status quo have been saying this for as long as there has been a computer industry. Indeed, secondary storage and near line storage are getting cheaper at a faster rate than high performance storage.

There is yet another powerful reason why high performance disk storage is not the future of data warehousing, and that reason is that - IRONICALLY, AND MUCH TO THE CHAGRIN OF THE HIGH PERFORMANCE VENDORS - performance gets BETTER, not WORSE, when you move your data to near line storage or secondary storage. Performance gets better because of the phenomenon in data warehousing called "dormant data". Dormant data is data that is seldom or never used. In the early days of data warehousing, when the warehouse is new and small, there is little or no dormant data. But as the warehouse matures, the volumes of data rise and the patterns of usage of the data stabilize. Soon only a fraction of the data warehouse is being used. At this point, the dormant data is moved to alternative storage. Performance for the remaining actively used data picks up dramatically. If dormant data is left on high performance disk storage, the dormant data "gets in the way" of query processing. Data that is needed for the query is hidden by the masses of data that are not regularly needed. But by moving dormant data to alternative storage, performance is greatly enhanced.

But the greatest advantage of selecting alternative storage as the basis for the data in the data warehouse environment is that the designer can choose the lowest level of granularity desired for the data warehouse. When high performance disk storage is used as the only medium on which data is stored, the designer ends up being restricted as to how much detailed data can be placed in the data warehouse. The telecommunications designer must aggregate or summarize call level detail. The bank designer must add together checking and ATM activity into a monthly aggregate record. The retailing executive must summarize POS data to the store level and/or to the daily level. In short, placing the data warehouse on disk storage forces a compromise to occur. But when the bulk of the data in the warehouse is stored on alternative storage, the designer can afford to store data at the lowest level of detail that exists. In doing so the data warehouse ends up with a great deal more functionality than if the warehouse were stored on high performance disk storage.

There are then some very powerful reasons why the medium of storage for the data warehouse should be alternative storage. Admittedly some of the data warehouse data - the actively used component of the warehouse - will be stored on high performance disk storage. But the vast majority of the data stored in the warehouse will reside on slower, less expensive alternative storage.

The notion that data should be stored on different media based on the volume and usage characteristics of the data is not a new idea. Years ago there was a technology called HSM - hierarchical storage management. HSM was the intellectual predecessor of alternative storage. The primary difference between HSM and alternative storage is that alternative storage operates at the row or record level while HSM operates at the table or data set level. Management of storage at the table or data set level is simply unthinkable for the volumes of data and the kind of processing that occurs in the data warehouse.

In order to make the alternative storage architecture perform at the optimal level, two types of software are needed. The first type is the activity monitor. The activity monitor sits between the data warehouse DBMS server and the users and collects information about the activity that is occurring inside the data warehouse. Once that information is collected, the data warehouse administrator is in a position to know what data is and is not being used in the actively used portion of the warehouse. With that knowledge the data warehouse administrator is able to precisely determine what data belongs in actively used storage and what data belongs in alternative storage.
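As a rough sketch of the idea, and not a description of any actual activity monitor product, the following shows how observed query activity could be turned into a row level split between actively used and dormant data. The class name `ActivityMonitor`, the row keys, and the `min_hits` threshold are all assumptions made for illustration.

```python
from collections import Counter


class ActivityMonitor:
    """Counts how often each row key is touched by the queries it observes."""

    def __init__(self):
        self._hits = Counter()

    def observe(self, row_keys):
        """Record the row keys touched by one query."""
        self._hits.update(row_keys)

    def classify(self, all_keys, min_hits=1):
        """Split keys into (active, dormant) using a hypothetical hit threshold."""
        active = [k for k in all_keys if self._hits[k] >= min_hits]
        dormant = [k for k in all_keys if self._hits[k] < min_hits]
        return active, dormant


monitor = ActivityMonitor()
monitor.observe(["r1", "r2"])  # first query touches rows r1 and r2
monitor.observe(["r2"])        # second query touches row r2 only
active, dormant = monitor.classify(["r1", "r2", "r3", "r4"], min_hits=1)
```

In this sketch rows `r3` and `r4` were never touched, so they are the candidates for alternative storage. A real monitor would presumably also weigh how recently the data was used, which this example leaves out.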

The second type of software needed for a data warehouse environment that operates on alternative storage can be called a cross media storage manager. The job of the cross media storage manager is to manage the traffic between the actively used storage and alternative storage. It can manage that traffic by actually moving data from one component to the other, or it can satisfy a query directly, whether the data resides in actively used storage or in alternative storage.
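The two jobs of the cross media storage manager, moving data between components and answering queries wherever the data lives, can be sketched as follows. This is an illustrative toy under stated assumptions: two in-memory dictionaries stand in for the two kinds of storage, and the names `CrossMediaStorageManager`, `demote`, `promote`, and `query` are hypothetical.

```python
class CrossMediaStorageManager:
    """Toy model of a manager spanning active and alternative storage."""

    def __init__(self):
        self.active = {}       # stands in for high performance disk storage
        self.alternative = {}  # stands in for near line / secondary storage

    def load(self, key, row):
        """New rows land in actively used storage."""
        self.active[key] = row

    def demote(self, keys):
        """Move the given rows from active storage to alternative storage."""
        for k in keys:
            if k in self.active:
                self.alternative[k] = self.active.pop(k)

    def promote(self, keys):
        """Move the given rows back from alternative storage to active storage."""
        for k in keys:
            if k in self.alternative:
                self.active[k] = self.alternative.pop(k)

    def query(self, predicate):
        """Return matching rows regardless of which component holds them."""
        rows = list(self.active.values()) + list(self.alternative.values())
        return [r for r in rows if predicate(r)]


manager = CrossMediaStorageManager()
manager.load("r1", {"id": "r1", "year": 1998})
manager.load("r2", {"id": "r2", "year": 1999})
manager.load("r3", {"id": "r3", "year": 2000})
manager.demote(["r1", "r2"])  # e.g. rows reported as dormant by an activity monitor
old_rows = manager.query(lambda r: r["year"] < 2000)
```

The query still finds the rows from 1998 and 1999 even though they no longer sit in active storage, which is the behavior the paragraph above describes.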

Both types of software are needed in order for alternative storage to operate effectively. As a rule the activity monitor is used first, to determine how much data needs to be placed in alternative storage. After the decision is made to place data in alternative storage, the cross media storage manager and the alternative storage itself are purchased and installed.

The alternative storage solution for data warehousing is a compelling story. For warehouses that will grow to any size at all, alternative storage is not an option - it is plainly mandatory. What then are the obstacles to the success and adoption of alternative storage? The primary obstacle is a familiar one to those who have been around the information processing community a while. The attitude of - "well, we didn't used to do it that way before...." is the primary reason why people do not immediately adopt alternative storage. And the vendors... the vendors have made so much money for so long selling disk storage as if OLTP were the only kind of processing there is. The very success of the high performance disk vendors traps them into thinking that their world will remain static forever. Such has been their success that the high performance disk storage vendors want to stick their heads in the sand and pretend that the world is not changing.

But the genie is out of the bottle and won't be going back in.