Corporate Information Factory (CIF) Resources by Bill Inmon, Inmon Data Systems

The DSS Environment - Data Warehouse, Data Marts, And Data Mining
A Glimpse At The Past, A Peek At The Future

In a short half decade or so, data warehousing and its mutant forms have gone from theory derided by academics to conventional wisdom. By all accounts data warehousing is (or is being) embraced by practically every company that operates in a competitive environment. Big companies and small ones, IBM-dominated, H-P-dominated, and DEC-dominated companies, banks, insurance companies, manufacturers, retailers, airlines, health providers, telecommunications, and government agencies all have discovered that data warehousing delivers on its promise.

One of the interesting facets of data warehousing is that it has, in fact, more than delivered on its promises. In years past we were told (and we believed!) that if we didn't use GOTO statements our problems would be over. Then we were told that secretaries would be doing COBOL programming. The next hype was that 4GL technology would give us a 1000% increase in productivity. And if we just went database then our difficulties would be over. But database wasn't enough - what we really needed was relational database. Soon we were told that the problems with information would be solved if everyone had a PC on their desk. But PCs were inadequate unless we could interconnect them. The promise of technology solving the problems of "information" continues on ad nauseam. In every case, the reality fell short of the hype.

But data warehouse - while never hyped excessively by the press - has stood the test of reality and in fact has more than delivered on the claims that have been made on its behalf.

This article discusses some of the reasons why data warehousing has delivered on the promises that have been made, briefly reviews some of the notable past milestones associated with data warehousing, and takes a look into the future.

 

WHY DATA WAREHOUSE HAS SUCCEEDED

Data warehouse has succeeded because it fulfills a very basic set of needs for information that every corporation has. Data warehouse provides a very fundamental foundation for information because data warehouse:

  • provides integrated data across all applications so that a truly corporate view of information can be made,

  • contains a robust amount of history, so that both current information needs and historical needs for information can be met, and

  • holds both detailed and summary data so that management perspectives can be created.

Prior to the data warehouse, when there were only legacy, operational applications, integration of data and information was only a dream.  Each application had its own unique view of who a customer was, of what a product was, and what an order was.  No two applications agreed on anything, and a corporate perspective of information was fiction. In addition legacy applications looked at and contained only very current data.  Historical data was not kept in any organized manner. Yet another reason why legacy operational systems did not meet the informational needs of the organization is that legacy operational systems concentrated on only detailed data.  Summary data was never anything but a very small part of the operational environment.

Data warehouses squarely address these inadequacies of information of the operational environment.

 

INTEGRATED DATA

Data in a warehouse is integrated. Because data is integrated in the data warehouse environment, it is able to support a corporate perspective of data. With a warehouse, an executive can immediately look at corporate information.

The integration required to build a data warehouse is not an easy process. John Ladley of the META Group states that integration takes up to 75% of the development dollars for the building of the data warehouse. There is no mistaking that integration is complex, painful, and requires much thinking. But integration in the data warehouse, once achieved, pays off handsomely.
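
To make the integration step concrete, here is a minimal Python sketch. It assumes two hypothetical legacy applications (an order system and a billing system) that describe the same customer with different keys, field names, and encodings. The record layouts, the target "corporate" layout, and the helper names are all illustrative assumptions, not taken from any particular product.

```python
# Hypothetical records: two legacy applications describe the same customer
# with different keys, field names, and encodings.
ORDER_APP = [{"cust_no": "0042", "name": "ACME CORP ", "status": "A"}]
BILLING_APP = [{"customer_id": 42, "cust_name": "Acme Corp", "active": 1}]


def from_order_app(rec):
    """Map an order-system record onto the assumed corporate layout."""
    return {
        "customer_id": int(rec["cust_no"]),
        "name": rec["name"].strip().title(),
        "is_active": rec["status"] == "A",
    }


def from_billing_app(rec):
    """Map a billing-system record onto the same corporate layout."""
    return {
        "customer_id": int(rec["customer_id"]),
        "name": rec["cust_name"].strip().title(),
        "is_active": rec["active"] == 1,
    }


def integrate(order_recs, billing_recs):
    """Merge both sources into one record per customer, keyed by customer_id."""
    warehouse = {}
    for rec in map(from_order_app, order_recs):
        warehouse[rec["customer_id"]] = rec
    for rec in map(from_billing_app, billing_recs):
        # Customers already loaded are kept; only unseen customers are added.
        warehouse.setdefault(rec["customer_id"], rec)
    return warehouse


customers = integrate(ORDER_APP, BILLING_APP)
```

Even in this toy case most of the code is mapping and reconciliation rather than data movement, which is consistent with the cost of integration described above.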

 

HISTORICAL DATA

A second fundamental characteristic of data warehousing is that the data warehouse contains historical data. Typically data warehouses contain a robust amount of historical data - from five to ten years' worth. There are many kinds of information processing that can be done with historical data. Corporations can start to understand the seasonality of their business across multiple years. By looking across multiple quarters, corporations can start to see the forest for the trees. In understanding their seasonality, corporations can tell whether they are truly making progress or merely marking time.

Another important use of historical information is in the understanding of habits of consumption of the consumer. People are creatures of habit. The patterns of consumption early in life usually stick with us for a long time. Knowing what has happened in the past is one of the primary keys to being able to predict what will happen in the future. With a data warehouse, organizations can start to get to know the history of their customers in a manner that was previously impossible, and in doing so, are able to start to anticipate the future.

 

DETAILED AND SUMMARY DATA

Summary data is important to management because management needs to see the larger picture before management can concentrate on the details that are of interest. In many regards detailed data merely hides information that is of interest to management. For example, when management asks for a report for departmental expenses, management does not want to see a line item listing of every expense for the month. Instead management wants to see what the total expenditures were for the department. If management is interested, they will ask, at a later point in time, for the details of a given type of expenditure for the department. But to have the details of all expenditures reported defeats the purpose of giving management the information they requested.  Data warehousing carries with it both summary and detailed data and as such is ideal for management's information needs.
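
The relationship between summary and detailed data can be sketched in a few lines of Python. The expense rows, the department and category names, and the `summarize` and `drill_down` helpers below are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical detailed data: one row per expense line item.
EXPENSES = [
    {"dept": "Marketing", "category": "Travel", "amount": 1200.00},
    {"dept": "Marketing", "category": "Travel", "amount": 300.00},
    {"dept": "Marketing", "category": "Printing", "amount": 450.00},
    {"dept": "Finance", "category": "Software", "amount": 900.00},
]


def summarize(rows, *keys):
    """Total the amounts, grouped by the given keys (the summary view)."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row["amount"]
    return dict(totals)


def drill_down(rows, **criteria):
    """Return the detailed line items behind a summary figure."""
    return [r for r in rows if all(r[k] == v for k, v in criteria.items())]


# Management first sees one total per department ...
by_dept = summarize(EXPENSES, "dept")
# ... and only later asks for the detail behind one type of expenditure.
marketing_travel = drill_down(EXPENSES, dept="Marketing", category="Travel")
```

Because both levels are held together, the summary can always be reconciled against the detail it was built from.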

 

WHO HAS ADOPTED DATA WAREHOUSE WITHIN THE ORGANIZATION

Data warehousing has found uses in many places. But the typical organizational entities that have first adopted data warehouses have been:

  • the finance department,

  • the marketing department, and

  • the sales department.

There have been notable successes elsewhere, but these organizational entities are the natural home of the data warehouse. Other successes in data warehousing have occurred - to a limited extent - in accounting, human resources, actuarial, and engineering departments.

Data warehousing has been accepted wherever there is a sophisticated need for information. Only the smallest, least sophisticated shops have seen fit to try to do business without a data warehouse.

 

MUTANT FORMS OF DATA WAREHOUSES

There are several mutant forms of a data warehouse that are of note. One form is the operational data store, or "ODS". An ODS is a data warehouse in the operational environment. An ODS shares many of the characteristics of a data warehouse, while having some unique characteristics of its own. An ODS can be updated and can provide fast transaction response time, something a classical data warehouse cannot.

Another mutant form of a data warehouse is the data mart. A data mart is a departmentalized form of a data warehouse. A data mart is similar to a data warehouse except for a few important differences, such as:

  • a data mart is customized for the needs of a single department,

  • a data mart contains less historical data than a data warehouse,

  • a data mart operates on technologies that are suitable for a much lower volume of data than a data warehouse,

  • a data mart has many more indexes than a data warehouse,

  • the optimal data structure for a data mart is the star join, whereas for the much more voluminous data warehouse a much more normalized structure is appropriate,

  • the pattern of usage for a data mart is fairly predictable,

  • data marts contain much more summary data than a data warehouse,

  • a data mart for one department looks very different than the data mart for another department,

  • a data mart contains very little, if any, detailed data, and so forth.

There are then some important and distinctive architectural differences between a data warehouse and a data mart.
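
For readers who have not met the star join mentioned in the list above, the following is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample rows are invented for illustration; a real data mart would have more dimensions and far more rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: small and descriptive, one row per product or month.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE dim_month   (month_id INTEGER PRIMARY KEY, month_name TEXT);
    -- Fact table: one row per product per month, pointing at each dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        month_id   INTEGER REFERENCES dim_month(month_id),
        revenue    REAL
    );
    INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO dim_month   VALUES (1, 'January'), (2, 'February');
    INSERT INTO fact_sales  VALUES (1, 1, 100.0), (1, 2, 150.0), (2, 1, 80.0);
""")

# The star join: the fact table sits in the middle and is joined out to
# each dimension table, then filtered and grouped on dimension attributes.
rows = conn.execute("""
    SELECT p.product_name, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_month   AS m ON m.month_id   = f.month_id
    WHERE m.month_name = 'January'
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()
conn.close()
```

The shape suits a department whose questions are predictable, since the dimensions are chosen in advance around those questions.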

 

A BRIEF HISTORY

The brief history of data warehouse is shown in the time line.

 

TIME LINE

Data warehousing grew out of the notion that there should be a split between different types of databases. In the beginning of database, it was thought that there should be a single database for all types of processing. But reality showed - for a variety of reasons - that more than one type of database was needed. The split between database types was made between operational and data warehouse (or DSS) databases. The splitting of processing between multiple databases occurred for many reasons:

  • operational databases require split second response time; data warehouse (DSS) processing does not,

  • the clerical community uses transaction oriented databases; the managerial community uses data warehouses,

  • up to the second decisions are made from operational systems; long term decisions are made from data warehouses,

  • operational databases contain very current information; data warehouses contain historical information,

  • operational databases are very unintegrated and are application specific; data warehouses contain integrated data,

  • operational databases are designed for detailed data; data warehouses are designed for detailed and summary data,

  • the requirements for processing are known before the system is built in an operational environment; the requirements for processing are discovered as part of the development process in a data warehouse environment,

  • requirements for processing in an operational environment are static; requirements for processing in a data warehouse environment are heuristic and are discovered through iterative development, and so forth.

From the notion that there should be a single database to serve the corporation's needs came the notion that separate and distinctly different database types were needed to serve the needs of the corporation.

 

CREATING THE DATA WAREHOUSE

While the notion of data warehousing was enormously appealing, the first issue the corporation faced was that of creating the data warehouse from the data found in the legacy, operational environment. At first it was thought that the problem was as simple as moving data from an operational platform to a data warehouse platform. There was the notion that replication of data was all that was needed. But very quickly it was discovered that merely moving data from one platform to another was not the basis for building a data warehouse. While data certainly needed to be moved, the data required integration and transformation during the moving process. Very quickly integration and transformation technology appeared and the ability to automatically generate code came on the scene. Sophisticated shops discovered that there was no need for many programmers to manually create the code needed to integrate the data as it passed from the legacy, operational environment to the data warehouse.

Less sophisticated shops approached the integration and transformation process with manual programming and soon had a small army of technicians producing code by hand for the integration and movement of data into the data warehouse environment.

The next phase in the evolution of the data warehouse environment was the advent of the data mart. Data marts have always been a part of the DSS architecture.

The earliest manifestation of the data warehouse/data mart architecture was in a form that can be called the "independent" data mart.  In the independent data mart, the data mart is created directly from the legacy operational applications. There is no data warehouse where there are independent data marts. Independent data marts are popular because they can be:

  • built cheaply,

  • built fast, and

  • built simply.

For a while independent data marts were very popular. But a point was reached where several major architectural flaws with independent data marts appeared. When a corporation built more than one independent data mart, it was noticed that:

  • there was massive redundancy of data (primarily detailed data) from one independent data mart to another,

  • the number of interface programs from the independent data marts back to the legacy, operational application environment grew exponentially,

  • there was no single corporate "source of truth" and as a consequence, different departments were saying something quite different about the same data based on analysis obtained from their independent data mart, and

  • the machine resources required for extracting legacy, operational data from the same application by each independent data mart became intolerable.

In a word, organizations that built a series of independent data marts simply did not get their money's worth from data warehousing. Soon it was recognized that independent data marts were not the solution to the corporate information problem. After a short amount of time, data architects perceived that dependent data marts were the proper architecture.

In a dependent data mart architecture there is a central corporate data warehouse that feeds the dependent data marts. This architecture is sometimes called the "hub and spoke" architecture, where the data marts are the spokes and the data warehouse is the hub.  The hub and spoke architecture has much to commend it:

  • there is integration of data and reconcilability of data at the hub,

  • there is autonomy of processing at the spoke,

  • there is no unnecessary redundancy of data at the spokes,

  • there is a rich amount of history at the hub, and so forth.
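
A back-of-the-envelope way to compare the two architectures is to count the interface (extract) programs each one needs. The Python sketch below assumes the worst case, in which every independent data mart reads from every legacy application; the function names and the figure of 20 source applications are hypothetical.

```python
def independent_interfaces(num_sources, num_marts):
    """Worst case: every independent mart extracts from every source."""
    return num_sources * num_marts


def hub_and_spoke_interfaces(num_sources, num_marts):
    """Each source feeds the warehouse once; the warehouse feeds each mart once."""
    return num_sources + num_marts


# Compare the two counts as the number of data marts grows.
for marts in (1, 3, 5, 10):
    print(marts, independent_interfaces(20, marts), hub_and_spoke_interfaces(20, marts))
```

Under these assumptions, 20 source applications and 5 data marts need 100 interfaces when the marts are independent and 25 with a hub, and the gap widens with every mart added.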

 

DATABASE DESIGN

The general patterns of database design have mimicked the evolution and sophistication of the data warehouse, DSS environment over time.  In the early days, when there was a nascent data warehouse, classical data normalization was the basis for design. As independent data marts emerged, star joins and snowflake structures became the norm for design. And as the hub and spoke architecture evolved, a normalized, data model based design for the hub and a star join or snowflake design for the spokes became the norm.

 

DATA MINING

But building a data warehouse/data mart hub and spoke architecture - while it is a most important step - does not guarantee success with DSS.  Once the warehouse and its architectural components are built, it remains to use and exploit the warehouse environment.  Data mining is the next logical step in completing the circle of effective DSS.  With data mining, important and previously unknown business patterns can be discovered, relationships between obscure and otherwise unnoticed variables can be examined, and long term trends can be measured. In short, data mining fulfills many of the expectations of data warehousing.

An interesting question that almost immediately arises is - can data mining be done without building a data warehouse? Does a corporation really have to go to the effort and investment of building a warehouse in order to start to use data mining technology successfully? The answer is that data mining can be done with no data warehouse or data marts at all. But just because data mining can be done does not mean that data mining can be done effectively. The real issue is - can data mining be done EFFECTIVELY in the face of no data warehouse, DSS infrastructure?  When effectiveness is considered, the answer is that data warehousing is absolutely essential for effective data mining.

Why is a data warehouse, DSS infrastructure essential for corporations that are serious about data mining? Simply stated, data warehouses prepare the raw data of the corporation for data mining analysis in an optimal manner. This preparation before analysis shows up very beneficially in many ways.

One of the essences of a data warehouse is that data is integrated as it is placed in the data warehouse.  This means that a lot of care is taken to bring uniformity and continuity to the understanding of common corporate objects, such as who is a customer, what is a transaction, and so forth. By building the data warehouse first, the data miner can dive into the analysis immediately and can start to achieve results immediately. But if the data miner does not have a data warehouse to operate from, then the miner must spend precious time (lots of precious time!) gathering the data, cleansing and scrubbing the data, integrating the data and so forth. It will be a long time until the data miner is set to even start the analysis portion of data mining if there is no warehouse infrastructure.

A second reason why the warehouse sets the stage for success in data mining is that the data warehouse pays close attention to historical data, collecting and organizing it. The data miner needs a wealth of historical data in order to find the patterns and relationships that are of interest to the corporation. If there is no central collection of historical data like the one that exists in the warehouse, then the data miner must go out and find the historical data to operate on. In some cases the data miner can find the historical data. But in other cases the historical data simply does not exist. When there is a data warehouse, the data miner can sit down and immediately start to work on the historical data inside the warehouse. The data miner is a long way from any meaningful analysis when the miner has to first gather and assimilate the historical data on which mining is to be done.

The third reason why data warehousing opens the door to effective data mining is that the warehouse contains both summary data and detailed data. Unquestionably, the miner needs the detailed data in order to do analysis. But the summary data is most useful in another way. Summary data is most useful at the outset of analysis when the data miner is planning an approach and needs to quickly look over the entire collection of detailed data.  When there is a representative sample of different types of summary data, the miner can quickly survey what is and is not in the warehouse. The summary data can save the miner massive fruitless iterations of analysis.

 

THE EMERGING TRENDS

What is on the horizon for the data warehouse, data mart, data mining world?

 

DATA MANAGEMENT

One of the obvious trends is that of the need for the management of the warehouse environment. Data warehouses and data marts tend to grow at an amazing rate. As they grow, the volumes of data that find their way into the data warehouse become an obstacle to success. With the growth in the volume of data comes a slow down in performance and an increase in budget. Soon the organization comes to the realization that the data warehouse infrastructure needs to be managed.

One of the first discoveries the manager makes is that managing the DSS data warehouse environment is nothing like managing the classical operational, transaction oriented environment. The DSS data warehouse infrastructure has its own unique set of needs and peculiarities.

One of those peculiarities is that of dormant data that creeps into the data warehouse. Dormant data is data that lands in the data warehouse that is never used. In the early stages of a data warehouse there is little dormant data. But as time passes the amount of dormant data increases to the point that there is much more dormant data in the warehouse than data that is actively being used. At this point, the dormant data needs to be archived in order to keep processing streamlined.
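
One simple way to spot dormant data is to record when each table was last queried and flag anything that has sat idle past a threshold. The sketch below is only an illustration: the table names, the access log, and the one-year threshold are assumptions, and a real warehouse would gather access dates from its own query monitoring.

```python
from datetime import date, timedelta

# Hypothetical access log: table name -> date it was last queried
# (None means it has never been queried since it was loaded).
LAST_ACCESSED = {
    "sales_1996_detail": date(1998, 6, 1),
    "sales_1991_detail": date(1996, 1, 15),
    "promo_codes_old": None,
}


def find_dormant(last_accessed, today, max_idle_days=365):
    """Return the tables not queried within max_idle_days, sorted by name."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(
        name
        for name, accessed in last_accessed.items()
        if accessed is None or accessed < cutoff
    )


# Tables returned here are candidates for archiving, not automatic deletions.
archive_candidates = find_dormant(LAST_ACCESSED, today=date(1998, 7, 1))
```

The same idea can be applied at a finer grain, such as partitions or individual columns, where access statistics are available.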

Other management issues include the need for the constant monitoring and cleansing of data as it enters the warehouse and as it resides in the warehouse.

 

METADATA

As corporations mature in their understanding of the warehouse environment, one obvious technology that emerges as being very important is that of metadata. In its simplest form metadata is data about data. But in a data warehouse environment a much more sophisticated view of metadata is needed. There are many types of metadata that are beneficial to the DSS data warehouse infrastructure.

But metadata was not part of the first generation of data warehouses for a variety of reasons. The first reason why metadata was not an immediate part of the DSS data warehouse infrastructure is that people were so anxious to get their first generation data warehouses started that they went after only the most obvious parts of the warehouse.  These first generation data warehouses concentrated merely on getting the data into their data warehouse. But we are starting to see much more sophisticated second generation data warehouses being built now that do include metadata as an integral part of the infrastructure. The second reason was that there really wasn't any appropriate technology for the capturing and management of metadata in the early days of data warehousing.  And third, it took imagination in the early days to see why metadata was so important. There is only so much imagination to go around.  Today, after real experience with first generation data warehouses, people base decisions on experience rather than imagination. And experience is always a much more powerful motivator than imagination. An experienced data warehouse administrator knows just how important metadata is.

In order to illustrate the importance of metadata to the DSS data warehouse environment, consider a metaphor. Metadata is like a sign post on the street.  When you drive back and forth to work each day, you don't pay much attention to sign posts because you already know they are there and you are very familiar with the road you are driving on.  But when you drive from Chicago to Phoenix on your vacation and you find yourself in Gallup, New Mexico, you pay rapt attention to sign posts because if you don't you may end up in Albuquerque or El Paso.

The same goes with metadata.  When operational systems were being built and the same activity was done repeatedly, there was little need for metadata. But when DSS systems are built where people peruse data, create hypotheses, and heuristically analyze data, there is a real need to know what is in the data in the first place. Where people are dealing with the unfamiliar, the sign posts become invaluable. Metadata quickly becomes the key to effective use of the data warehouse.

One of the major issues of the DSS data warehouse infrastructure is that every technology known to man seems to be found in the data warehouse environment.  There is H-P, IBM, NCR, DEC, SUN, Sequent, et al. There is Oracle, Sybase, Teradata, DB2, Informix, and Red Brick. There is Information Advantage, Business Objects, Cognos, DSS Agent, and Brio. There is Lotus 1-2-3, Excel, and a host of other spreadsheets. There is Prism, ETI, and SQL Junction. Trying to gain a consensus of opinion among these vendors is impossible. Yet in order to be successful, there is a need to have sharability and manageability of metadata across these vendors and products.  One of the biggest challenges facing the metadata manager in the data warehouse environment is crossing the technological barriers found in the environment.

Look at the problem conversely. If you don't share metadata across these different products, platforms, and vendors, you end up with isolated islands of automation. The result is a chorus singing, but one person is singing rock and roll, another is singing soul, and another is singing light opera. The music is a cacophony. There is a need to share metadata across the many technologies found in the data warehouse DSS environment. Metadata becomes the glue that holds the different technologies and the different parts of the DSS data warehouse environment together.

Of course metadata in one form or another has been around for a long time and there have been some limited successes. But there are some severe limitations with the repository approach to the management of metadata as it applies to data warehouse.  The primary limitation is that the repository approach to the management of metadata does not account for the need for autonomy of processing by the end user. When the end user is working away on a Saturday afternoon on Lotus 1-2-3, the end user is not about to let an administrator interfere with the flow of work and analysis. The very essence of much of end user processing is the freedom of the end user from IT control.  End users simply are not going to stand for an administrator - any administrator - telling them what can and can't be done.  And what are Lotus and the spreadsheets anyway but metadata?  So one of the reasons why metadata in a repository was a failure was that it did not account for the need for autonomy of processing by the end user.

But there are other considerations to the architecture needed for metadata management across the DSS environment. If you have no central control of metadata, you will never have any uniformity of definition of data. There will never be any consistency of processing across the organization.

In order to be effective, there needs to be a balance between the need for sharability of metadata and autonomy of metadata. The balance can only be achieved by a distributed metadata architecture where the different nodes of the architecture have their own metadata. A distributed approach to metadata management is the only viable approach in the DSS data warehouse environment.
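
As a rough illustration of that balance, the Python sketch below gives each node its own local metadata while letting it read a set of shared definitions. The `MetadataNode` class, the shared dictionary, and the rule that local entries are consulted first are all assumptions made for the example; the article does not prescribe a particular mechanism.

```python
from collections import ChainMap

# Shared definitions that every node agrees on (assumed to be published centrally).
SHARED = {"customer": "A party that has placed at least one order."}


class MetadataNode:
    """A node (warehouse, data mart, or desktop tool) with its own metadata."""

    def __init__(self, name, shared):
        self.name = name
        self.local = {}
        # Local entries are consulted first, then the shared definitions.
        self.view = ChainMap(self.local, shared)

    def define(self, term, definition):
        """Record a definition that belongs to this node alone."""
        self.local[term] = definition

    def lookup(self, term):
        """Return the definition visible to this node, or None if unknown."""
        return self.view.get(term)


finance_mart = MetadataNode("finance_mart", SHARED)
finance_mart.define("net_margin", "Revenue minus cost, per fiscal quarter.")
```

The node can define its own terms without asking anyone's permission, which preserves autonomy, while terms it has not defined locally resolve to the shared definition.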

 

TOOLS

Another arena where growth is bound to occur is the sophistication of end user tools. Different kinds of tools are already starting to appear. There are cube technologies, standard relational technologies, powerful spreadsheet technologies and so forth. As more data becomes truly available to the end user, the tools that cater to the particular needs of the end user will have even greater variations and capabilities than they have today.

 

STORAGE TECHNOLOGIES/ARCHIVAL TECHNOLOGIES

Another important advance is in the proliferation of the different kinds of storage technologies that are becoming commercially viable. There is so much data pouring into the data warehouse environment that standard disk storage cannot possibly - economically and technologically - hold all of the data in a data warehouse. Furthermore, with data warehousing there is no need to hold all data in an online mode. Near line software and hardware will emerge that will allow data to reside on a hierarchy of storage.

On a little bit longer horizon, archival technology will surely emerge. There is much to the subject of archival technology, and if there is one Achilles heel of data warehousing, this is it. Today's archival technology is crude compared to the technology that will be coming tomorrow.

 

META PROCESS

An interesting technology that is just now in its infancy is that of meta process technology. Metadata technology - in one form or the other - has been around for quite a while. But meta process technology has an important place and will start to appear commercially in the next few years.