By Vivek R.
Gupta, Senior Consultant, vgupta@system-services.com
This white paper introduces data warehousing
concepts. This paper is derieved from the first two chapters of our
forthcoming book Data Warehousing with MS SQL Server.
You may download this white paper in MS Word
6.0/70 format: An
Introduction to Data Warehousing (MS Word Format) (right-click
and save file).
Do you need to get feedback from your Customers, Distributors, or Vendors? Try our Feedback Managment system for surveys, profiles, follow-up, and much more!
Click here http://www.sserve.com/fm.
Data
warehousing has quickly evolved into a unique and popular business
application class. Early builders of data warehouses already
consider their systems to be key components of their IT strategy and
architecture. Numerous examples can be cited of highly successful
data warehouses developed and deployed for businesses of all sizes
and all types. Hardware and software vendors have quickly developed
products and services that specifically target the data warehousing
market. This paper will introduce key concepts surrounding the data
warehousing systems.
What is a
data warehouse? A simple answer could be that a data warehouse is
managed data situated after and outside the operational systems. A
complete definition requires discussion of many key attributes of a
data warehouse system. Later in Section 2, we will identify these
key attributes and discuss the definition they provide for a data
warehouse. Section 3 briefly reviews the activity against a data
warehouse system. Initially in Section 1, however, we will take a
brief tour of the traditions of managing data after it passes
through the operational systems and the types of analysis generated
from this historical data.
Evolution of an application
class
This
section reviews the historical management of the analysis data and
the factors that have led to the evolution of the data warehousing
application class.
Traditional approaches
to historical data
In
reviewing the development of data warehousing, we need to begin with
a review of what had been done with the data before of evolution of
data warehouses. Let us first look at how the kind of data that ends
up in today’s data warehouses had been managed
historically.
Throughout
the history of systems development, the primary emphasis had been
given to the operational systems and the data they process. It is
not practical to keep data in the operational systems indefinitely;
and only as an afterthought was a structure designed for archiving
the data that the operational system has processed. The fundamental
requirements of the operational and analysis systems are different:
the operational systems need performance, whereas the analysis
systems need flexibility and broad scope. It has rarely been
acceptable to have business analysis interfere with and degrade
performance of the operational systems.
Data from legacy systems
In the
1970’s virtually all business system development was done on
the IBM mainframe computers using tools such as Cobol, CICS, IMS,
DB2, etc. The 1980’s brought in the new mini-computer
platforms such as AS/400 and VAX/VMS. The late eighties and early
nineties made UNIX a popular server platform with the introduction
of client/server architecture.
Despite
all the changes in the platforms, architectures, tools, and
technologies, a remarkably large number of business applications
continue to run in the mainframe environment of the 1970’s. By
some estimates, more than 70 percent of business data for large
corporations still resides in the mainframe environment. There are
many reasons for this. The most important reason, and one that is
particularly relevant to our topic, is that over the years these
systems have grown to capture the business knowledge and rules that
are incredibly difficult to carry to a new platform or
application.
These
systems, generically called legacy systems, continue to be the
largest source of data for analysis systems. The data that is stored
in DB2, IMS, VSAM, etc. for the transaction systems ends up in large
tape libraries in remote data centers. An institution will generate
countless reports and extracts over the years, each designed to
extract requisite information out of the legacy systems. In most
instances, IS/IT groups assume responsibility for designing and
developing programs for these reports and extracts. The time
required to generate and deploy these programs frequently turns out
to be longer than the end users think they can afford.
Extracted information on the
Desktop
During the
past decade, the sharply increasing popularity of the personal
computer on business desktops has introduced many new options and
compelling opportunities for business analysis. The gap between the
programmer and end user has started to close as Business Analysts
now have at their fingertips many of the tools required to gain
proficiency in the use of spreadsheets for analysis and graphic
representation. Advanced users will frequently use desktop database
programs that allow them to store and work with the information
extracted from the legacy sources. Many desktop reporting and
analysis tools are increasingly targeted towards end users and have
gained considerable popularity on the desktop.
The
downside of this model for business analysis is that it leaves the
data fragmented and oriented towards very specific needs. Each
individual user has obtained only the information that he or she
requires. Not being standardized, the extracts are unable to address
the requirements of multiple users and uses. The time and cost
involved in addressing the requirements of only one user prove
prohibitive. This approach to data management assumes the end user
has the time to expend on managing the data in the spreadsheets,
files, and databases. While many of these users may be proficient at
data management, most undertake these tasks as a necessity. And
given the choice, most users would find it more efficient to focus
on the actual analysis and the tools available to them.
Decision-Support and Executive Information
Systems
Another
category of popular analysis systems has been decision support
systems and executive information systems. Decision support systems
tend to focus more on detail and are targeted towards lower to
mid-level managers. Executive information systems have generally
provided a higher level of consolidation and a multi-dimensional
view of the data, as high level executives need more the ability to
slice and dice the same data than to drill down to review the data
detail.
These two
similar and overlapping categories are perhaps the closest
precursors to the data warehousing systems. Yet the high price of
their development and the coordination required for their production
made them an elite product that never entered the mainstream. The
following are some characteristics generally associated with
decision support or executive information systems:
- These
systems have data in descriptive standard business terms, rather
than in cryptic computer fields names. Data names and data
structures in these systems are designed for use by
non-technical users.
- The
data is generally preprocessed with the application of standard
business rules such as how to allocate revenue to products,
business units, and markets.
- Consolidated views of the data such as product, customer,
and market are available. Although these systems will at times
have the ability to drill down to the detail data, rarely are
they able to access all the detail data at the same time.
Today’s data warehousing systems provide the analytical
tools afforded by their precursors. But their design is no longer
derived from the specific requirements of analysts or executives;
and, as we will see later, data warehousing systems are most
successful when their design aligns with the overall business
structure rather than specific requirements.
Emergence of key
enabling technologies
Many
factors have influenced the quick evolution of the data warehousing
discipline. The most significant set of factors has been the
enormous forward movement in the hardware and software technologies.
Sharply decreasing prices and the increasing power of computer
hardware, coupled with ease of use of today’s software, has
made possible quick analysis of hundreds of gigabytes of information
and business knowledge.
Hardware prices plummeting according to the
Moore’s law
The most
important factor in the evolution of data warehousing has been the
sharply increasing power of computer hardware. Along with the
increase in this power, their prices have fallen just as sharply.
Gordon Moore, co-founder of Intel, predicted that the capacity of a
microprocessor will double every 18 months. This has not only held
true for the processor but also for other components of the
computer. While desktop computers today are more powerful than the
mainframes of yesterday, an inexpensive server possesses power that
was difficult to imagine just a decade ago.
The
Pentium II and Alpha processors have brought incredible power to the
commodity computer market. Sophisticated processor hardware
architectures such as symmetric multi-processing have come to the
mainstream computing with inexpensive machines. Higher capacity
memory chips, a key component influencing the performance of a data
warehouse system, are now available at very low prices. Now it is
possible to have a moderately priced machine with 1 or 2 gigabytes
of memory. Computer Bus such as PCI and controller interfaces such
as Ultra SCSI have made I/O incredibly fast. Last but not the least,
the disk drive has shrunk to hold amazing amounts of information.
Just two decades ago, it would have taken a roomful of disk drives
to store information that can now be easily stored on a single
one-inch high disk drive.
Desktop power increasing
Entering
the market as a novelty computer in early eighties, the personal
computer has become the hotbed in innovation during the past decade.
The personal computer was initially used for word processing and
other minor tasks with no links to primary analytical functions.
With the help of innovations such as powerful personal productivity
software, easy–to-use graphical interface, and responsive
business applications, the personal computer has become the focal
point of all computing today. The powerful desktop hardware and
software has allowed for development of the client/server or
multi-tier computing architecture. Almost all data warehouses are
accessed by personal computer based tools. These tools vary from
very simple query capabilities available with most productivity
packages to incredibly powerful graphical multi-dimensional analysis
tools. Without the wide array of choices available for a data
warehouse access, data warehousing would not have evolved so
quickly.
Ever increasing power of server
software
Server
operating systems such as Windows NT and Unix have brought
mission-critical stability and powerful features to the distributed
computing environment. The operating system software has become very
feature-rich and powerful as the cost has been going down steadily.
With this combination, sophisticated operating system concepts such
virtual memory, multi-tasking, and symmetric multi-processing are
now available on inexpensive operating platforms. Operating systems
such as Windows NT have made these powerful systems very easy to set
up and operate reducing the total cost of ownership of these
powerful servers.
Explosion of Intranets and Web based
applications
The most
important development in computing since the advent of the personal
computer is the explosion of Internet and Web based applications.
Somewhat after the fact, the business community has quickly jumped
onto the Internet bandwagon.
One of the
most exciting fields in computing industry today is the development
of Intranet applications. Intranets are private business networks
that are based on the Internet standards, although they are designed
to be used internally. The Internet/Intranet trend has very
important implications for data warehousing applications. First,
data warehouses can be available world wide on public/private
network at much lower cost. This availability minimizes the need to
replicate data across diverse geographical locations. Second, this
standard has allowed the web server to provide a middle tier where
all the heavy-duty analysis takes place before it is presented to
the web-browsing client to use.

The
skyrocketing power of hardware and software, along with the
availability of affordable and easy-to-use reporting and analysis
tools have played the most important role in evolution of data
warehouses. Figure 1 highlights the technological revolution that
has greatly impacted data warehousing.
Change in the nature of
the business
Another
very significant influence on evolution of data warehousing science
is the fundamental changes in the business organization and
structure during late eighties and early nineties. The emergence of
a vibrant global economy has profoundly changed the information
demands made by corporations in the United States and worldwide.
Corporations have found markets for their products globally while
competing with other companies in vastly different cultures and
economic environments. The mergers and acquisition of businesses
have crossed the country boundaries.
Economic factors of the recent
years
The
economic downturn of the late eighties led many global corporations
through a remarkable period of consolidation. Phenomena such as
"business process reengineering" and
"downsizing" forced businesses to reevaluate their
business practices. Many industries went through prolonged periods
of consolidation and reinvention. During this period, simple
economics forced the businesses to identify their core competency
areas and shed businesses that were not profitable.
These
economic factors have played an important role in the evolution of
data warehousing. For example, when a banking unit that used
different operational systems changed hands, the top management
still needed to view the consolidated business and manage the
associated risks accordingly. The banking industry has been the
leader in the use of data warehouses. Today’s data warehousing
systems are extensively used for profitability and customer behavior
analysis.
Global corporation
The fall
of communism and liberalization of Asian and South American
economies has changed the business climate worldwide forever.
Competition from emerging economies has forced large corporations to
become lean and efficient. The emergence of this global economy has
led to the migration of manufacturing to less expensive and less
restrictive countries. Former communist and South American countries
present very exciting and challenging business opportunities. Along
with these opportunities they present a very volatile business
climate and economies that are nearly impossible to
predict.
Businesses
have not only focused on building products worldwide, but they have
also changed their organization to sell products around the globe.
Trade agreements such as NAFTA and EEC greatly impact the decisions
to enter markets or build factories. This globalization of business
has increased the need not merely for more continuous analysis, but
also to manage data in a centralized location. The process of
rolling up manufacturing and sales data from far-flung business
units has now started to impact much larger number of corporations.
Businesses now need to continuously make the "build or
buy" decisions. Globalization of business has made the
consolidation of data in a central data warehouse more complicated.
Factors such as currency fluctuations and product customization for
different markets have added complexity to data warehousing, making
the analysis much more complicated. Imagine trying to assess
profitability of products built and sold in multiple countries with
volatile currencies. Or, attempting to hedge the risks of downturn
in economies that have been expanding rapidly for extended
periods.
Emergence of standard business
applications
Another
factor that is fast becoming an important variable in data
warehousing equations is the emergence of vendors with popular
business application suites. Led by wildly popular German software
vendor SAP AG, flexible business software suites adapted to the
particulars of a business have become a very popular way to move to
a sophisticated multi-tier architecture. Other vendors such as Baan,
PeopleSoft, and Oracle have likewise come out with suites of
software that provide different strengths but have comparable
functionality.
The
emergence of these application suites has a direct bearing on the
increased use of data warehousing in that they are increasingly able
to provide standard applications that are replacing existing custom
developed legacy applications. In the near future, almost every data
warehouse is likely to derive data from one of these application
sources rather than the customized extraction from legacy systems.
Further, there are significant initiatives at these vendors to make
transaction data easily available to data warehousing systems. To
the extent that these standard applications have extensive
customization features, data acquisition from these applications can
be much simpler than from the mainframe systems.
End-user more technology savvy
One of the
most important results of the massive investment in technology and
movement towards the powerful personal computer has been the
evolution of a technology-savvy business analyst. Even though the
technology-savvy end users are not always beneficial to all
projects, this trend certainly has produced a crop of
technology-leading business analysts that are becoming essential to
today’s business. These technology-savvy end users have
frequently played an important role in the development and
deployment of data warehouses. They have become the core users that
are first to demonstrate the initial benefits of data warehouses.
These end users are also critical to the development of the data
warehouse model: as they become experts with the data warehousing
system, they play a very important role of mentoring other
users.
Word
processing and spreadsheets were the first applications to be
effectively used on the personal computers. In fact, the spreadsheet
is said to be the killer application that led to widespread
deployment of personal computers. The charting functions from a
spreadsheet represent one of the most extensively used business
analysis and presentation functions. The new pivot tables available
in popular spreadsheets have allowed for simple multi-dimensional
analysis. The aggressive use of inexpensive personal productivity
software has led to use of more robust reporting and analysis tools
along with more powerful desktop database engines. These powerful
tools are now more targeted towards the end user and often require
very little training for simple applications.
Management more information
conscience
Many
factors affect the heightened awareness of trends in information
technology among mid and upper management levels. Unlike a decade
ago, the information technology now is nearly universally accepted
as a key strategic business asset. Many mid and upper level managers
that have risen through the ranks over the last decade have
invariably made their mark with successful technology investments.
As a result, they tend not to shy away from risking resources on new
and emerging technologies. The explosive use of Internet has greatly
aided in the managers’ awareness of technology trends. The
Internet is now being used to conduct business transactions; but its
greatest asset to this date has been dissemination of information.
Today, executives can not only review various sources of industry
trends, they can also readily find case studies and vendor
information.
The use of
technology by mid and upper level managers has increased
significantly. They have decisively moved beyond using the personal
computer for email. This hands-on use of information and technology
by upper management has facilitated the sponsorship of larger
projects such as data warehousing.

Alongside
the availability of key enabling technologies, these fundamental
changes in the nature of business over the past decade have played a
central role in the evolution of data warehouse. Some might even
argue that these changes in business have led the technology to its
current state.
Data warehousing attributes and
concepts
Having
looked at the historical use of the analysis data and explored some
of the factors influencing the evolution of data warehouses, we will
now turn to identifying the key attributes of a data warehouse.
It is
important to recognize that data warehousing is still an evolving
science. As with any evolving technology, particular care must be
taken to discount some marketing claims driven by vendors attempting
to differentiate themselves from the competitors. For example, the
size of the data warehouse should not determine if a data warehouse
is really a data warehouse. Some vendor may say that a data
warehouse that is only 50 gigabytes is not a full-fledged data
warehouse, and they may refer to it instead as a data mart. For a
smaller company, 50 gigabytes or even much less can represent every
relevant piece of information covering last 10 years and can well
represent a powerful data warehouse.
This
section explores the data warehousing concepts and attributes. These
concepts are grouped into four sub-sections. The first sub-section
discusses the reasons for separating the data for business analysis
from the operational data. The logical transformation of the data,
including data warehouse modeling and de-normalization of the data,
are introduced in the second sub-section. Sub-section three reviews
the issues associated with physical transformation of the data.
Sub-section four discusses the generation of summary views. A very
simple and broad definition of a data warehouse follows the
discussion of the data warehousing concepts and attributes.
"Warehousing"
data outside the operational systems
The
primary concept of data warehousing is that the data stored for
business analysis can most effectively be accessed by separating it
from the data in the operational systems. Many of the reasons for
this separation have evolved over the years. In the past, legacy
systems archived data onto tapes as it became inactive and many
analysis reports ran from these tapes or mirror data sources to
minimize the performance impact on the operational
systems.
These
reasons to separate the operational data from analysis data have not
significantly changed with the evolution of the data warehousing
systems, except that now they are considered more formally during
the data warehouse building process. Advances in technology and
changes in the nature of business have made many of the business
analysis processes much more complex and sophisticated. In addition
to producing standard reports, today’s data warehousing
systems support very sophisticated online analysis including
multi-dimensional analysis.
Integrating data from more than one operational
system
Data
warehousing systems are most successful when data can be combined
from more than one operational system. When the data needs to be
brought together from more than one source application, it is
natural that this integration be done at a place independent of the
source applications. Before the evolution of structured data
warehouses, analysts in many instances would combine data extracted
from more than one operational system into a single spreadsheet or a
database. The data warehouse may very effectively combine data from
multiple source applications such as sales, marketing, finance, and
production. Many large data warehouse architectures allow for the
source applications to be integrated into the data warehouse
incrementally.
The
primary reason for combining data from multiple source applications
is the ability to cross-reference data from these applications.
Nearly all data in a typical data warehouse is built around the time
dimension. Time is the primary filtering criterion for a very large
percentage of all activity against the data warehouse. An analyst
may generate queries for a given week, month, quarter, or a year.
Another popular query in many data warehousing applications is the
review of year-on-year activity. For example, one may compare sales
for the first quarter of this year with the sales for first quarter
of the prior years. The time dimension in the data warehouse also
serves as a fundamental cross-referencing attribute. For example, an
analyst may attempt to access the impact of a new marketing campaign
run during selected months by reviewing the sales during the same
periods. The ability to establish and understand the correlation
between activities of different organizational groups within a
company is often cited as the single biggest advanced feature of the
data warehousing systems.
The data
warehouse system can serve not only as an effective platform to
merge data from multiple current applications; it can also integrate
multiple versions of the same application. For example, an
organization may have migrated to a new standard business
application that replaces an old mainframe-based, custom-developed
legacy application. The data warehouse system can serve as a very
powerful and much needed platform to combine the data from the old
and the new applications. Designed properly, the data warehouse can
allow for year-on-year analysis even though the base operational
application has changed.
Differences between transaction and analysis
processes
The most
important reason for separating data for business analysis from the
operational data has always been the potential performance
degradation on the operational system that can result from the
analysis processes. High performance and quick response time is
almost universally critical for operational systems. The loss of
efficiency and the costs incurred with slower responses on the
predefined transactions are usually easy to calculate and measure.
For example, a loss of five seconds of processing time is perhaps
negligible in and of itself; but it compounds out to considerably
more time and high costs once all the other operations it impacts
are brought into the picture. On the other hand, business analysis
processes in a data warehouse are difficult to predefine and they
rarely need to have rigid response time requirements.
Operational systems are designed for acceptable performance
for pre-defined transactions. For an operational system, it is
typically possible to identify the mix of business transaction types
in a given time frame including the peak loads. It also relatively
easy to specify the maximum acceptable response time given a
specific load on the system. The cost of a long response time can
then be computed by considering factors such as the cost of
operators, telecommunication costs, and the cost of any lost
business. For example, an order processing system might specify the
number of active order takers and the average number of orders for
each operational hour. Even the query and reporting transactions
against the operational system are most likely to be predefined with
predictable volume.
Even
though many of the queries and reports that are run against a data
warehouse are predefined, it is nearly impossible to accurately
predict the activity against a data warehouse. The process of data
exploration in a data warehouse takes a business analyst through
previously undefined paths. It is also common to have runaway
queries in a data warehouse that are triggered by unexpected results
or by users’ lack of understanding of the data model. Further,
many of the analysis processes tend to be all encompassing whereas
the operational processes are well segmented. A user may decide to
explore detail data while reviewing the results of a report from the
summary tables. After finding some interesting sales activity in a
particular month, the user may join the activity for this month with
the marketing programs that were run during that particular month to
further understand the sales. Of course, there would be instances
where a user attempts to run a query that will try to build a
temporary table that is a Cartesian product of two tables containing
a million rows each! While an activity like this would unacceptably
degrade an operational system’s performance, it is expected
and planned for in a data warehousing system.
Data is mostly non-volatile
Another
key attribute of the data in a data warehouse system is that the
data is brought to the warehouse after it has become mostly
non-volatile. This means that after the data is in the data
warehouse, there are no modifications to be made to this
information. For example, the order status does not change, the
inventory snapshot does not change, and the marketing promotion
details do not change. This attribute of the data warehouse has many
very important implications for the kind of data that is brought to
the data warehouse and the timing of the data transfer.
Let us
further review what it means for the data to be non-volatile. In an
operational system the data entities go through many attribute
changes. For example, an order may go through many statuses before
it is completed. Or, a product moving through the assembly line has
many processes applied to it. Generally speaking, the data from an
operational system is triggered to go to the data warehouse when
most of the activity on these business entity data has been
completed. This may mean completion of an order or final assembly of
an accepted product. Once an order is completed and shipped, it is
unlikely to go back to backorder status. Or, once a product is built
and accepted, it is unlikely to go back to the first assembly
station. Another important example can be the constantly changing
data that is transferred to the data warehouse one snapshot at a
time. The inventory module in an operational system may change with
nearly every transaction; it is impossible to carry all of these
changes to the data warehouse. You may determine that a snapshot of
inventory carried once every week to the data warehouse is adequate
for all analysis. Such snapshot data naturally is
non-volatile.
It is
important to realize that once data is brought to the data
warehouse, it should be modified only on rare occasions. It is very
difficult, if not impossible, to maintain dynamic data in the data
warehouse. Many data warehousing projects have failed miserably when
they attempted to synchronize volatile data between the operational
and data warehousing systems.
Data saved for longer periods than in
transaction systems
Data from
most operational systems is archived after the data becomes
inactive. For example, an order may become inactive after a set
period from the fulfillment of the order; or a bank account may
become inactive after it has been closed for a period of time. The
primary reason for archiving the inactive data has been the
performance of the operational system. Large amounts of inactive
data mixed with operational live data can significantly degrade the
performance of a transaction that is only processing the active
data. Since the data warehouses are designed to be the archives for
the operational data, the data here is saved for a very long
period.
In fact, a
data warehouse project may start without any specific plan to
archive the data off the warehouse. The cost of maintaining the data
once it is loaded in the data warehouse is minimal. Most of the
significant costs are incurred in data transfer and data scrubbing.
Storing data for more than five years is very common for data
warehousing systems. There are industry examples were the success of
a data warehousing project has encouraged the managers to expand the
time horizon of the data stored in the data warehouse. They may
start with storing the data for two or three years and then expand
to five or more years once the wealth of business knowledge in the
data warehouse is discovered. The falling prices of hardware have
also encouraged the expansion of successful data warehousing
projects.

In short,
the separation of operational data from the analysis data is the
most fundamental data warehousing concept. Not only is the data
stored in a structured manner outside the operational system,
businesses today are allocating considerable resources to build data
warehouses at the same time that the operational applications are
deployed. Rather than archiving data to a tape as an afterthought of
implementing an operational system, data warehousing systems have
become the primary interface for operational systems. Figure 3
highlights the reasons for separation discussed in this
section.
Logical transformation
of operational data
This
sub-section explores the concepts associated with the data warehouse
logical model. The data is logically transformed when it is brought
to the data warehouse from the operational systems. The issues
associated with the logical transformation of data brought from the
operational systems to the data warehouse may require considerable
analysis and design effort. The architecture of the data warehouse
and the data warehouse model greatly impact the success of the
project. This section reviews some of the most fundamental concepts
of relational database theory that do not fully apply to data
warehousing systems. Even though most data warehouses are deployed
on relational database platforms, some basic relational principles
are knowingly modified when developing the logical and physical
model of the data warehouses.
Structured extensible data
model
The data
warehouse model outlines the logical and physical structure of the
data warehouse. Unlike the archived data of the legacy systems,
considerable effort needs to be devoted to the data warehouse
modeling. This data modeling effort in the early phases of the data
warehousing project can yield significant benefits in the form of an
efficient data warehouse that is expandable to accommodate all of
the business data from multiple operational applications.
The data
modeling process needs to structure the data in the data warehouse
independent of the relational data model that may exist in any of
the operational systems. As discussed later in this paper, the data
warehouse model is likely to be less normalized than an operational
system model. Further, the operational systems are likely to have
large amounts of overlapping business reference data. Information
about current products is likely to be used in varying forms in many
of the operational systems. The data warehouse system needs to
consolidate all of the reference data. For example, the operational
order processing system may maintain the pricing and physical
attributes of products whereas the manufacturing floor application
may maintain design and formula attributes for the same product. The
data warehouse reference table for products would consolidate and
maintain all attributes associated with products that are relevant
for the analysis processes. Some attributes that are essential to
the operational system are likely to be deemed unnecessary for the
data warehouse and may not be loaded and maintained in the data
warehouse.

The data
warehouse model needs to be extensible and structured such that the
data from different applications can be added as a business case can
be made for the data. A data warehouse project in most cases cannot
include data from all possible applications right from the start.
Many of the successful data warehousing projects have taken an
incremental approach to adding data from the operational systems and
aligning it with the existing data. They start with the objective of
eventually adding most if not all business data to the data
warehouse. Keeping this long-term objective in mind, they may begin
with one or two operational applications that provide the most
fertile data for business analysis. Figure 4 illustrates the
extensible architecture of the data warehouse.
Data warehouse model aligns with the business
structure
A data
warehouse logical model aligns with the business structure rather
than the data model of any particular application. The entities
defined and maintained in the data warehouse parallel the actual
business entities such as customers, products, orders, and
distributors. Different parts of an organization may have a very
narrow view of a business entity such as a customer. For example, a
loan service group in a bank may only know about a customer in the
context of one or more loans outstanding. Another group in the same
bank may know about the same customer in context of a deposit
account. The data warehouse view of the customer would transcend the
view from a particular part of the business. A customer in the data
warehouse would represent a bank customer that has any kind of
business with the bank.
The data
warehouse would most likely build attributes of a business entity by
collecting data from multiple source applications. Consider, for
example, the demographic data associated with a bank customer. The
retail operational system may provide some attributes such as social
security number, address, and phone number. A mortgage system or
some purchased database may provide with employment, income, and net
worth information.
The
structure of the data in any single source application is likely to
be inadequate for the data warehouse. The structure in a single
application may be influenced by many factors, including:
- Purchased Applications: The application data structure may
be dictated by an application that was purchased from a software
vendor and integrated into the business. The user of the
application may have very little or no control over the data
model. Some vendor applications have a very generic data model
that is designed to accommodate a large number and types of
businesses.
- Legacy Application: The source application may be a very
old mostly homegrown application where the data model has
evolved over the years. The database engine in this application
may have been changed more than once without anyone taking the
time to fully exploit the features of the new engine. There are
many legacy applications in existence today where the data model
is neither well documented nor understood by anyone currently
supporting the application.
- Platform Limitations: The source application data model
may be restricted by the limitations of the hardware/software
platform or development tools and technologies. A database
platform may not support certain logical relationship or there
may be physical limitations on the data attributes.

Figure 5
illustrates the alignment of data warehouse entities with the
business structure. The data warehouse model breaks away from the
limitations of the source application data models and builds a
flexible model that parallels the business structure. This
extensible data model is easy to understand by the business analysts
as well as the managers.
Transformation of the operational state
information
It is
essential to understand the implications of not being able to
maintain the state information of the operational system when the
data is moved to the data warehouse. Many of the attributes of
entities in the operational system are very dynamic and constantly
modified. Many of these dynamic operational system attributes are
not carried over to the data warehouse; others are static by the
time they are moved to the data warehouse. A data warehouse
generally does not contain information about entities that are
dynamic and constantly going through state changes.
To
understand what it means to lose the operational state information,
let us consider the example of an order fulfillment system that
tracks the inventory to fill orders. First let us look at the order
entity in this operational system. An order may go through many
different statuses or states before it is fulfilled or goes to the
"closed" status. Other order statuses may indicate that
the order is ready to be filled, it is being filled, back ordered,
ready to be shipped, etc. This order entity may go through many
states that capture the status of the order and the business
processes that have been applied to it. It is nearly impossible to
carry forward all of attributes associated with these order states
to the data warehousing system. The data warehousing system is most
likely to have just one final snapshot of this order. Or, as the
order is ready to be moved into the data warehouse, the information
may be gathered from multiple operational entities such as order and
shipping to build the final data warehouse order entity.
Now let us
consider the more complicated example of inventory data within this
system. The inventory may change with every single transaction. The
quantity of a product in the inventory may be reduced by an order
fulfillment transaction or this quantity may be increased with
receipt of a new shipment of the product. If this order processing
system executes ten thousand transactions in a given day, it is
likely that the actual inventory in the database will go through
just as many states or snapshots during this day. It is impossible
to capture this constant change in the database and carry it forward
to the data warehouse. This is still one of the most perplexing
problems with the data warehousing systems. There are many
approaches to solving this problem. You will most likely choose to
carry periodical snapshots of the inventory data to the data
warehouse. This scenario can apply to a very large portion of the
data in the operational systems. The issues associated with this get
much more complicated as extended time periods are
considered.

Figure 6
illustrates how most of the operational state information cannot be
carried over the data warehouse system.
De-normalization of data
Before we
consider data model de-normalization in the context of data
warehousing, let us quickly review relational database concepts and
the normalization process. E. F. Codd developed relational database
theory in the late 1960s while he was a researcher at IBM. Many
prominent researchers have made significant contributions to this
model since its introduction. Today, most of the popular database
platforms follow this model closely. A relational database model is
a collection of two-dimensional tables consisting of rows and
columns. In the relational modeling terminology, the tables, rows,
and columns are respectively called relations, attributes, and
tuples. The name for relational database model is derived from the
term relation for a table. The model further identifies unique keys
for all tables and describes the relationship between
tables.
Normalization is a relational database modeling process where
the relations or tables are progressively decomposed into smaller
relations to a point where all attributes in a relation are very
tightly coupled with the primary key of the relation. Most data
modelers try to achieve the "Third Normal Form" with all
of the relations before they de-normalize for performance or other
reasons. The three levels of normalization are briefly described
below:
- First
Normal Form: A relation is said to be in First Normal Form if it
describes a single entity and it contains no arrays or repeating
attributes. For example, an order table or relation with
multiple line items would not be in First Normal Form because it
would have repeating sets of attributes for each line item. The
relational theory would call for separate tables for order and
line items.
- Second Normal Form: A relation is said to be in Second
Normal Form if in addition to the First Normal Form properties,
all attributes are fully dependent on the primary key for the
relation.
- Third
Normal Form: A relation is in Third Normal Form if in addition
to Second Normal Form, all non-key attributes are completely
independent of each other.
The
process of normalization generally breaks a table into many
independent tables. While a fully normalized database can yield
fantastically flexible model, it generally makes the data model more
complex and difficult to follow. Further, a fully normalized data
model can perform very inefficiently. A data modeler in an
operational system would take normalized logical data model and
convert it into a physical data model that is significantly
de-normalized. De-normalization reduces the need for database table
joins in the queries.
Some of
the reasons for de-normalizing the data warehouse model are the same
as they would be for an operational system, namely, performance and
simplicity. The data normalization in relational databases provides
considerable flexibility at the cost of the performance. This
performance cost is sharply increased in a data warehousing system
because the amount of data involved may be much larger. A three-way
join with relatively small tables of an operational system may be
acceptable in terms of performance cost, but the join may take
unacceptably long time with large tables in the data warehouse
system.
Static relationships in historical
data
Another
reason that de-normalization is an important process in data
warehousing modeling is that the relationship between many
attributes does not change in this historical data. For example, in
an operational system, a product may be part of the product group
"A" this month and product group "B" starting
next month. In a properly normalized data model, it would be
inappropriate to include the product group attribute with an order
entity that records an order for this product; only the product ID
would be included. The relational theory would call for a join on
the order table and product table to determine the product group and
any other attributes of this product. This relational theory concept
does not apply to a data warehousing system because in a data
warehousing system you may be capturing the group that this product
belonged to when the order was filled. Even though the product moves
to different groups over time, the relationship between the product
and the group in context of this particular order is
static.
Another
important example can be the price of a product. The prices in an
operational system may change constantly. Some of these price
changes may be carried to the data warehouse with a periodic
snapshot of the product price table. In a data warehousing system
you would carry the list price of the product when the order is
placed with each order regardless of the selling price for this
order. The list price of the product may change many times in one
year and your product price database snapshot may even manage to
capture all these prices. But, it is nearly impossible to determine
the historical list price of the product at the time each order is
generated if it is not carried to the data warehouse with the order.
The relational database theory makes it easy to maintain dynamic
relationships between business entities, whereas a data warehouse
system captures relationships between business entities at a given
time.

Logical
transformation concepts of source application data described here
require considerable effort and they are a very important early
investment towards development of a successful data warehouse.
Figure 7 highlights the logical transformation concepts discussed in
this section.
Physical transformation
of operational data
Physical
transformation of data homogenizes and purifies the data. These data
warehousing processes are typically known as "data
scrubbing" or "data staging" processes. The
"data scrubbing" processes are some of the most
labor-intensive and tedious processes in a data warehousing project.
Yet, without proper scrubbing, the analytical value of even the
clean data can be greatly diminished. Physical transformation
includes the use of easy-to-understand standard business terms, and
standard values for the data. A complete dictionary associated with
the data warehouse can be a very useful tool. During these physical
transformation processes the data is sometimes "staged"
before it is entered into the data warehouse. The data may be
combined from multiple applications during this "staging"
step or the integrity of the data may be checked during this
process.
The
concepts associated with the physical transformation of the data are
introduced in this sub-section. Historical data and the current
operational application data is likely to have some missing or
invalid values. It is important to note that it is essential to
manage missing values or incomplete transformations while moving the
data to the data warehousing system. The end user of the data
warehouse must have a way to learn about any missing data and the
default values used by the transformation processes.
Operational terms transformed into uniform
business terms
The terms
and names used in the operational systems are transformed into
uniform standard business terms by the data warehouse transformation
processes. The operational application may use cryptic or difficult
to understand terms for a variety of different reasons. The platform
software may impose length and format restriction on a term, or
purchased application may be using a term that is too generic for
your business. The data warehouse needs to consistently use standard
business terms that are self-explanatory.
A customer
identifier in the operational systems may be called cust, cust_id,
or cust_no. Further, different operational applications may use
different terms to refer to the same attribute. For example, a
customer in the loan organization in a bank may be referred to as a
Borrower. You may choose a simple standard business term such as
Customer Id in the data warehouse. This term would require little or
no explanation even to the novice user of the data
warehouse.
Single physical definition of an
attribute
Different
systems may evolve to use different lengths and data types for the
same data element. One system may have the product ID to be either
12 or 14 numeric characters, whereas another system may accommodate
product IDs of up to 18 alphanumeric characters. The software of an
operational application may support very limited data types and it
may impose severe limitations on the names. Software of another
application may support a very rich set of data types, and it may be
very flexible with the naming conventions.
As an
attribute is defined physically for the data warehouse, it is
essential to use meaningful data types and lengths. Use the standard
data length and data type for each attribute everywhere it is used.
A functional data dictionary can facilitate this consistent use of
physical attributes.
Consistent use of entity attribute
values
All
attributes in the data warehouse need to be consistent in the use of
predefined values. Different source applications invariably use
different attribute values to represent the same meaning. These
different values need to be converted into a single, most sensible
value as the data is loaded into the data warehouse.
A simple
example for the consistent use of entity attributes is the use of a
gender flag for an individual. One source application may use flags
such as "M" and "F" to store gender for an
individual whereas another application may use the detail
"Male" and "Female" to store gender. Other
applications may use yet other values to store the same piece of
information. The data warehouse may choose to consistently use
"M" and "F" for gender for all individuals
throughout the system.
A more
complex example can be the case of dealing with complex data values
in the source application. Many older applications use single data
value to represent multiple attributes. An account number, for
example, may not only represent a unique account but it may also
represent the account type. All accounts starting with 1 or 2 may
represent one type of account whereas all other accounts may
represent something else to the business. The data warehouse would
consistently use the account ID to only represent a unique account.
The account type may be computed and saved as a separate
attribute.
Issues associated with default and missing
values
The data
brought into the data warehouse is sometimes incomplete or contains
values that cannot be transformed properly. It is very important for
the data warehouse transformation process to use intelligent default
values for the missing or corrupt data. It is also important to
devise a mechanism for users of the data warehouse to be aware of
these default values.
Some data
attributes can easily be defaulted to a reasonable value when the
original is missing or corrupt. Other values can be obtained by
referencing other current data. For example, a missing product
attribute such as unit-of-measure on an order entity can be obtained
by accessing the current product database. Some attributes cannot be
filled by defaults for missing values. In fact, it may be dangerous
to attempt to assign default for certain types of missing values. A
poor default may corrupt the data and lead to invalid analysis at a
later stage. In these cases, it is safest to leave the missing
values as blank. In some cases, it may make sense to pick a specific
value or symbol that indicates a missing value.
The timing
of the start of the period for which data is loaded into the data
warehouse can be important. It is safest to load data in the data
warehouse for complete years. This would prevent any
misinterpretation of analysis run on this data. Imagine a data
warehouse that started loading data from the month of March for the
first year in the data warehouse. It is very likely that a user is
going to run a query for a range of whole year without realizing
that the data for January and February is not stored in the data
warehouse. Also, missing data for part of the year prevents any
meaningful year-on-year analysis.
It is
important to design a good system to log and identify data that is
missing from the data warehouse. When a user runs a query against
the data warehouse, it is essential to understand the population
against which the query is run.

Figure 8
highlights the physical transformation concepts for data warehousing
systems. Physical transformation of source application data requires
considerable effort and it can be difficult at times, but a
well-considered set of physical data transformations can make a data
warehouse more user-friendly. Further, accurate and complete
transformations help maintain the integrity of the data
warehouse.
Business view
summarization of data
Many
queries and reports against most data warehouse systems are simple
aggregations based on predefined parameters. Another key attribute
of today’s data warehouses is the predefined and automatically
generated summary views.
For
example, many people in an organization may need to see product
sales figures. They may have a need to summarize these sales figures
for a week, a month, or a quarter. It may not be practical to
summarize the needed data every time an analyst requires it. A data
warehouse that contains summary views of the detail data around the
most common queries can sharply reduce the amount of processing
needed at the time of analysis. Summary views are typically created
around business entities such as customers, products, and
channels.
The
summary views also hide the complexities of the detail data. Of
course, performance gain is the most significant tangible aspect of
the summary views in the data warehouse. Most relational databases
provide the ability to build views for users that hide the
underlying tables. In most SQL server packages, including MS SQL
Server, the view exists only as a definition and it is created at
the time it is actually used. While the concept of summary views in
data warehousing systems is similar, it important to not confuse
data warehousing summary views with the term "views" as it
is used in a database system. A summary view in a data warehouse
refers to an actual table that is created and maintained independent
of when it used by a user. The key concepts around the summary views
are introduced in this section.
Initial analysis in summary
views
Summary
views often are generated not only by summarizing the detail data
but also by applying business rules to the detail data. For example,
the summary views may contain a filter that applies the exact
business rules for considering an order a sale or a filter that
applies the business rules for allocating a sale to a channel
entity. The summary views can hide the complexities of the detail
data from the end user for many, if not most, analysis
tasks.
The
business rules that are applied in generating summary views can be
complex. These business rules may determine exactly what constitutes
a sale or they may determine how a sale is allocated to a sales or
channel entity. Large organizations often have complex rules to
charge sales to different ledger accounts. Some sales may be
allocated to warranty replacement and thus not be counted as sales.
Or, some sales may be further discounted based on a master contract
with the customer and thus need to be reduced when calculating
product sales for a period. Often, a data warehouse will have more
than one view based on business entities such as customers and
products. There may be multiple physical tables or the same table
may contain additional attributes that allow for easy
queries.
In
addition to applying the business rules while generating summary
views, the data warehousing system may perform complex database
operations such as multi-table joins. Product sales may be computed
by joining the Sales, Invoice, and Product tables. The criteria to
join these tables may be complex. While individuals mining data in
the warehouse detail records need to understand all the complexities
of business rules, most users can retrieve effective summary
business information without fully understanding the detail
data.
Significant performance gains
The single
most important reason for building the summary views is the
significant performance gains they facilitate. Not only are all the
complexities of detail data interpreted for an end user; the summary
views also perform the most time-consuming data analysis before it
is needed.
Summary
views allow you to run a product sales query by merely setting up a
filter based on indexed fields such as date, product codes, and
other relevant criteria. Further, this query will most likely run on
a sharply smaller table as the summary views would have reduced the
data from multiple tables containing millions of rows to tens of
thousands of rows. A query against this smaller table would be
significantly faster than a query that runs against detail tables
joined for the query. In some instances a summary view table can be
as large as the detail tables. This may be caused by summarization
in very small units or combining multiple summary views into one
data table. For example, you may not be able to summarize the
product sales by week. Instead daily product sales figures may be
required for some queries. Even in these large summary views, the
performance is generally better because many of the table joins are
eliminated and queries can generally use the indexes.
Many views into the same detail
The
summary views in a data warehouse provide multiple views into the
same detail data. These views are predefined dimensions into the
detail data. These views provide an efficient method for the analyst
to link with the detail data when necessary.
For
example, for the sales order data, four different product sales
summary views could be generated summarizing weekly sales data.
These views summarizing by product, customer, channel, and region
all include the same detail data and they would need to be updated
or regenerated as new data is brought into the data
warehouse.
Even
though most of the analysis is likely to be done using the summary
views, there needs to be a simple and robust way for an analyst to
drill down into the detail data. Many business problems require
review of the detail data to fully understand a pattern or anomaly
exhibited in the summarized reports or queries. Drill down from many
different summary views can lead to the same detail data. A single
anomaly in detail data may manifest itself differently in different
summary views.

Summarization and predefined analysis of data in a data
warehouse system is an important task. It is essential to maintain
the integrity of the summary views because a very large part of the
data warehouse activity is against the summary views. Figure 9
highlights the key concepts around summary views. The summary views
need to be not only designed and built, they need to be maintained
as new data comes into the data warehouse.
Definition
After
considering the various attributes and concepts of data warehousing
systems, a broad definition of a data warehouse can be the
following:
A data warehouse is a structured extensible
environment designed for the analysis of non-volatile data,
logically and physically transformed from multiple source
applications to align with business structure, updated and
maintained for a long time period, expressed in simple business
terms, and summarized for quick analysis.
Business use of a data
warehouse
No
discussion of the data warehousing systems is complete without
review of the type of activity supported by a data warehouse. Some
of the activity against today’s data warehouses is predefined
and not much different from traditional analysis activity. Other
processes such as multi-dimensional analysis and information
visualization were not available with traditional analysis tools and
methods.
There is a
very interesting phenomenon that is observed with many data
warehousing projects. The users of a new data warehouse only wish to
get the information that they were able to get using the old tools
and methods. They wish to replicate their queries and reports with
the data warehouse and make sure that all the numbers match. Often
there is as much apprehension of the new tools and the data
warehouse as there is excitement. It is only after using the new
data warehouse for a period of time that they start to explore and
discover the new capabilities that are available to them. Soon
after, they start to have significant input into the data warehouse
enhancement process and they happily become the mentors for the new
users.
Tools to be used
against the data warehouse
One of the
objectives of the data warehouse is to make it as flexible and as
open as possible. It is not desirable to set a steep entry price in
terms of software and training for using the data warehouse. The
data warehouse should be accessible by as many end-user tools and
platforms as possible. Yet, it is not possible to make every feature
of the data warehouse available from every end user tool.
Low-end
tools such as simple query capability built into most spreadsheets
may be adequate for a user that only needs to quickly reference the
data warehouse. Other users may require the use of the most powerful
multi-dimensional analysis tools. The data warehouse administrators
need to identify the tools that are supported for access to the data
warehouse and the capabilities that are available using these
different tools. There can be a progression path to the higher level
tools for the data warehouse users. A user can start with a
low-level tool that is already familiar to him or her. After
becoming familiar with the data warehouse he or she may be able to
justify the cost and effort involved with using a more complex
tool.
In most
data warehousing projects, there is a need to select a preferred
data warehouse access tool for the most active users. A small number
of users generate most of the analysis activity against the data
warehouse. The data warehouse performance can be tuned to the
requirements of the tool appropriate for these active users. This
tool can be used for training and demonstration of the data
warehouse.
Standard reports and
queries
Many users
of the data warehouse need to access a set of standard reports and
queries. It is desirable to periodically automatically produce a set
of standard reports that are required by many different users. When
these users need a particular report, they can just view the report
that has already been run by the data warehouse system rather than
running it themselves. This facility can be particularly useful for
reports that take a long time to run.
Such a
facility would require report server software. It is likely that
these reports can be accessed only using the client program for that
system. This facility would need to work with or be part of the
preferred data warehouse access tool previously mentioned. Many end
user query and analysis tools now include server software that can
be run with the data warehouse to serve reports and query results.
These tools are now providing a web interface to the reports. In
many data warehouse systems, this report and query server becomes an
essential facility. The data warehouse users and administrators
constantly need to consider any reports that are candidates to
become standard reports for the data warehouse. Frequently,
individual users may develop reports that can be used by other
users.
In
addition to standard reports and queries, sometimes it is useful to
share some of the advanced work done by other users. A user may
produce advanced analysis that can be parameterized or otherwise
adapted by other users in different parts of the same organization
or even in organizations.
Queries against summary
tables
As
introduced earlier, the summary views in the data warehouse can be
the object of a large majority of analysis in a data warehouse.
Simple filtering and summation from the summary views accounts for
most of the analysis activity against many data warehouses. These
summary views contain predefined standard business
analysis.
For
example, in a typical data warehouse, the product summary view may
account for a very large number of queries where different users
select different products and the time periods for product sales and
profit margin queries. These queries provide quick response and they
are very simple to build. Advanced users typically attach a pivot
table in their analysis tool to data warehouse summary tables for
simple multi-dimensional analysis.
Data mining in the
detail data
Even
though data mining in the detail data may account for a very small
percentage of the data warehouse activity, the most useful data
analysis might be done on the detail data. The reports and queries
off the summary tables are adequate to answer many "what"
questions in the business. The drill down into the detail data
provides answers to "why" and "how"
questions.
Data
mining is an evolving science. A data-mining user starts with
summary data and drills down into the detail data looking for
arguments to prove or disprove a hypothesis. The tools for data
mining are evolving rapidly to satisfy the need to understand the
behavior of business units such as customers and
products.
Interface with other
data warehouses
The data
warehouse system is likely to be interfaced with other applications
that use it as the source of operational system data. A data
warehouse may feed data to other data warehouses or smaller data
warehouses called data marts.
The
operational system interfaces with the data warehouse often become
increasingly stable and powerful. As the data warehouse becomes a
reliable source of data that has been consistently moved from the
operational systems, many downstream applications find that a single
interface with the data warehouse is much easier and more functional
than multiple interfaces with the operational applications. The data
warehouse can be a better single and consistent source for many
kinds of data than the operational systems. It is however, important
to remember that the much of the operational state information is
not carried over to the data warehouse. Thus, data warehouse cannot
be source of all operation system interfaces.

Figure 10
illustrates the analysis processes that run against a data
warehouse. Although a majority of the activity against today’s
data warehouses is simple reporting and analysis, the sophistication
of analysis at the high end continues to increase rapidly. Of
course, all analysis run at data warehouse is simpler and cheaper to
run than through the old methods. This simplicity continues to be a
main attraction of data warehousing systems.
Summary
This paper
introduced the fundamental concepts of data warehousing. It is
important to note that data warehousing is a science that continues
to evolve. Many of the design and development concepts introduced
here greatly influence the quality of the analysis that is possible
with data in the data warehouse. If invalid or corrupt data is
allowed to get into the data warehouse, the analysis done with this
data is likely to be invalid.
After the
rapid acceptance of data warehousing systems during past three
years, there will continue to be many more enhancements and
adjustments to the data warehousing system model. Further evolution
of the hardware and software technology will also continue to
greatly influence the capabilities that are built into data
warehouses.
Data
warehousing systems have become a key component of information
technology architecture. A flexible enterprise data warehouse
strategy can yield significant benefits for a long
period. |