A detailed assessment and evaluation of data warehouse system functionality and how it applies to the dimensional data model using tools that the architect works with.
Table of contents
3.1 The Entity Relationship Model Form
3.2 An Organized Performance Architecture Response
6.1 Function Limiting Characteristics of the Dimensional Form
6.1.1 The Dimensional Form Does Not Extend Well
6.1.2 The Dimensional Form Is Not Flexible
6.1.3 The Form Does Not Describe the Business
7.1 Client A
7.2 Client B
8 System Architecture Form to Fulfill Multiple Functions
8.2 Integrating Model Form with Technology Form
1 Introduction
"It is the pervading law of all things organic and inorganic, of all things physical and
metaphysical, of all things human and all things superhuman, of all true manifestations of the head, of the heart, of the soul, that the life is recognizable in its expression, that form ever follows function."
Louis Sullivan
“Form follows function - that has been misunderstood. Form and function should be one, joined in a spiritual union.”
Frank Lloyd Wright
To be an architect of information solutions is to understand the concept of form following function intuitively, as a matter of nature, because design (the creation of form) is about enabling informational function. Taking the title "architect" affirms one's conscious, design-based decision process for aligning form with functional needs.
As one examines form's relationship to function within the dimensional model, the evaluation of the model form must be based not solely on Sullivan's statement but also on Wright's: form not only follows function, but function follows form.
The concept of form and function unity highlights that form is not only based on function but also limits it, often strictly. Form and function are bound together in a cause-and-effect relationship: function is the cause of the form, while form both facilitates function and limits it. When considering the data warehouse function, one considers the overall goal of delivering information, allowing the business to measure its activity and understand the impacts of its actions in the marketplace. This high-level statement of function, though, is far too general for the evaluation of model form. As will be demonstrated, a more detailed understanding of system functionality is needed before determining how a model form should be applied.
The function-limiting impact of form is often overlooked in design, particularly data model design. By implementing a specific design form, are the broader limits on function considered? What system design steps are needed to mitigate those limitations?
Too often data practitioners apply the form they know best, the latest form they've come to appreciate, or a form that is deemed a "best practice" in their circles.
True architects are not practitioners of "best practices." They practice the application of forms to function based on principles derived from cause-and-effect analysis.
The architect studies the relationship of form and function, of cause and effect, and then applies forms specific to the required functions. The architect deals with the complexity of the
2 Model Characteristics
One generally thinks about a model form in terms of certain characteristics. Through the evaluation of these characteristics and examination of model form, it becomes evident how they align with, support and limit function in relation to data and information delivery.
o The model's ability to extend
    to extend a data model for new content or capability without disruption and redesign of processes
o The model's ability to be flexible
    to support multiple purposes or functions
o The model's ability to describe the business and subjects within the corporate structure
    to document the business using data
o The model's ability to support any valid business question
    to answer business questions without specific design structuring
    not a matter of ease or performance but a matter of ability
o The model's ability to efficiently and quickly answer business questions (report query performance)
    to provide acceptable query performance for corporate decision support and analysis
o The model's ability to demonstrate business performance
    to measure business performance
The critical examination of the limiting aspects of the dimensional model gives the architect the foundational principles necessary to understand the application of the dimensional form in Information Architecture solutions.
3 Dimensional Model Architectural Origins
The dimensional model form is designed to greatly simplify database optimization for queries that would otherwise be applied against an Entity Relationship (ER) model. Because the dimensional model is a design response used to overcome ER form limits, the ER form and its characteristics must first be examined as a basis for comparison.
3.1 The Entity Relationship Model Form
1 To free the collection of relations from undesirable insertion, update and deletion dependencies;
2 To reduce the need for restructuring the collection of relations, as new types of data are introduced, and thus increase the life span of application programs;
3 To make the relational model more informative to users;
4 To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.
- E. F. Codd, "Further Normalization of the Data Base Relational Model"
Each of Codd's goals not only provides insight into ER model function but is also instructive as to the reasons for the dimensional model form.
The Data Architect produces an ER model that describes the business through "Entities" representing each of the objects, actors, organizational fictions, contracts, business activities and other things in the business landscape. If it can be named as a subject, it must be represented as an entity within the model. Each entity is given an identifier known as the primary key. Additional attributes are added, each describing only what the primary key identifies.
Foreign key relationships document each business relationship existing between entities. These relationships are instilled in the model logically rather than by direct data association. This distinction is fundamental to the examination of the characteristics of the ER and dimensional model forms and their ability to deliver specific functionality.
This examination won't delve into the application of normalization rules, except to state that many modelers deal with normalization intuitively, as a matter of entity definition and evaluation of attributes, when creating the ER model. Normalization rules represent a method of thinking about the evaluation of data content in model development. Normalization ensures all entities are defined purely and that all business relationships within the model are defined logically rather than by physical association.
As one examines Codd's goals, it is obvious that they align with some of the model characteristics previously discussed. Those characteristics are:
o extensibility
o flexibility
o ability to describe the subject
o ability to support any valid business question
Codd's fourth goal may appear somewhat cryptic, but it is central to an architect's understanding of both model forms and to the support of Codd's preceding goals.
In a fully normalized model there is no statistical data relationship bias that emphasizes one relationship or eliminates another, because relationships are implemented logically. Data that is not normalized is associated physically on the same row, creating a bias. When data is organized this way, certain questions can be answered, while others cannot.
Applying rules of normalization ensures no bias exists for one type of business question or another.
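The idea of relationship bias can be illustrated with a minimal sketch. It assumes SQLite through Python's standard sqlite3 module, and every table and column name (customer, product, sale, product_sales_summary) is hypothetical, chosen only for illustration. The normalized tables answer both a product question and a customer question, while the physically associated summary table can answer only the question it was shaped for.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized form: each entity is described once; relationships are foreign keys.
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE sale (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    product_id  INTEGER NOT NULL REFERENCES product(product_id),
    amount      REAL NOT NULL
);
INSERT INTO customer VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO product  VALUES (10, 'Widget'), (20, 'Gadget');
INSERT INTO sale VALUES (100, 1, 10, 50.0), (101, 1, 20, 30.0), (102, 2, 10, 20.0);

-- Biased form: product and amount are physically associated on one row,
-- and the customer relationship has been eliminated.
CREATE TABLE product_sales_summary AS
SELECT p.name AS product_name, SUM(s.amount) AS total_amount
FROM sale s JOIN product p ON p.product_id = s.product_id
GROUP BY p.name;
""")

# Both forms can answer "sales by product".
by_product_normalized = conn.execute("""
    SELECT p.name, SUM(s.amount)
    FROM sale s JOIN product p ON p.product_id = s.product_id
    GROUP BY p.name ORDER BY p.name
""").fetchall()
by_product_biased = conn.execute("""
    SELECT product_name, total_amount
    FROM product_sales_summary ORDER BY product_name
""").fetchall()

# Only the normalized form can answer "sales by customer".
by_customer_normalized = conn.execute("""
    SELECT c.name, SUM(s.amount)
    FROM sale s JOIN customer c ON c.customer_id = s.customer_id
    GROUP BY c.name ORDER BY c.name
""").fetchall()

# The summary table carries no customer column, so that question cannot be asked of it.
summary_columns = [row[1] for row in conn.execute("PRAGMA table_info(product_sales_summary)")]

print(by_product_normalized, by_product_biased, by_customer_normalized, summary_columns)
```

The summary table is not wrong; it is simply biased toward one class of question, which is the trade-off the rest of this paper examines.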
One can ask any valid business question of a normalized model. Based on the model's logically implemented (foreign key) relationships, one will always get the answer. There is no need to know future questions. It will always work if each entity germane to the question is represented within the model and each relationship between the entities is documented logically. As long as one is willing to write the necessary queries and wait, the model will answer.
Therefore, the normalized entity relationship model form is designed for flexibility, to answer any business question. It eliminates relationship bias by describing each entity purely and documenting all business relationships logically, providing data relationship neutrality.
Extensibility is another outcome of eliminating relational bias, as will be seen later.
The normalized form that gives us this functionality also limits function. To answer more than simple business questions, complex queries need to be written with many joins that follow relational paths and with correlated sub-queries that identify specific content within data sets. The query may need to do mixed aggregation to common GROUP BY levels as well as use outer joins, complicating query optimization. Temp tables and multiple query steps may need to be used in some cases. In data warehousing, all of this complex query optimization results in issues of access and join serialization, along with heavy I/O from reading, buffering and sorting large volumes of data.
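As a rough illustration of the kind of query described above, the following sketch asks a normalized model for revenue grouped by each customer's origination sales channel. It assumes SQLite through Python's sqlite3 module; the tables (sales_channel, customer, sales_order), the sample rows, and the rule that a customer's origination channel is the channel of the earliest order are all assumptions made for the example.

```python
import sqlite3

# Hypothetical normalized schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_channel (channel_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE customer      (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE sales_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    channel_id  INTEGER NOT NULL REFERENCES sales_channel(channel_id),
    order_date  TEXT NOT NULL,   -- ISO dates, so text comparison orders correctly
    amount      REAL NOT NULL
);
INSERT INTO sales_channel VALUES (1, 'Web'), (2, 'Store');
INSERT INTO customer VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO sales_order VALUES
    (1, 1, 1, '2024-01-05', 100.0),
    (2, 1, 2, '2024-02-10',  50.0),
    (3, 2, 2, '2024-01-20',  70.0),
    (4, 2, 1, '2024-03-01',  30.0);
""")

# The origination channel is not stored anywhere, so the query must follow the
# relational path order -> customer -> first order -> channel and use a
# correlated sub-query to find each customer's earliest order.
# Assumes one earliest order per customer; ties would double count revenue.
revenue_by_origination = conn.execute("""
    SELECT ch.name AS origination_channel, SUM(o.amount) AS revenue
    FROM sales_order o
    JOIN customer c
      ON c.customer_id = o.customer_id
    JOIN sales_order first_o
      ON first_o.customer_id = c.customer_id
     AND first_o.order_date = (SELECT MIN(o2.order_date)
                               FROM sales_order o2
                               WHERE o2.customer_id = c.customer_id)
    JOIN sales_channel ch
      ON ch.channel_id = first_o.channel_id
    GROUP BY ch.name
    ORDER BY ch.name
""").fetchall()

print(revenue_by_origination)
```

Even at this toy scale the query needs a self-join and a correlated sub-query; at warehouse volumes the same shape drives the access, join and I/O costs discussed above.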
No one wants to wait hours for BI report results. In the early days of data warehousing, on at least one RDBMS, the longer the query ran, the more likely it would end in error due to the database's concurrency architecture.
3.2 An Organized Performance Architecture Response
At the time of the first edition of Ralph Kimball's The Data Warehouse Toolkit, most data warehouses were hosted on SMP database servers. These types of servers do not scale parallel processing linearly as MPP clusters do, which often led to a variety of very limiting data forms that were intended to improve query performance.
The introduction of the dimensional model provided an organized, systematic design basis for a performance architecture form leading to predictable query optimization.
It also addressed another issue at the time: the dimensional model is much simpler to write queries against. Hand coding queries against an ER model for any sort of complicated reporting requires a good deal of skill, experience and time. While users still need to write manual queries, Business Intelligence software has diminished that need by supporting metadata-driven abstraction that interprets the physical data model for the user.
When dimensional models are designed properly for reporting, queries require only selection of the attributes and measures required, direct joins to the dimensions needed, application of WHERE or JOIN filters, appropriate aggregate functions and GROUP BY clauses (and perhaps a HAVING clause).
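The query shape just described can be sketched as follows. This is a minimal example assuming SQLite through Python's sqlite3 module; the star schema (dim_date, dim_product, fact_sales), its columns and its rows are hypothetical.

```python
import sqlite3

# Hypothetical star schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (
    date_key       INTEGER PRIMARY KEY,
    calendar_month TEXT NOT NULL,
    calendar_year  INTEGER NOT NULL
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL,
    category     TEXT NOT NULL
);
CREATE TABLE fact_sales (
    date_key     INTEGER NOT NULL REFERENCES dim_date(date_key),
    product_key  INTEGER NOT NULL REFERENCES dim_product(product_key),
    sales_amount REAL NOT NULL
);
INSERT INTO dim_date VALUES
    (20230105, '2023-01', 2023),
    (20240115, '2024-01', 2024),
    (20240210, '2024-02', 2024);
INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Manual', 'Books');
INSERT INTO fact_sales VALUES
    (20230105, 1, 500.0),
    (20240115, 1, 120.0),
    (20240115, 2,  40.0),
    (20240210, 1,  80.0),
    (20240210, 1,  60.0);
""")

# Select attributes and a measure, join directly to each dimension,
# filter, aggregate, group, and optionally filter the groups.
monthly_category_sales = conn.execute("""
    SELECT d.calendar_month, p.category, SUM(f.sales_amount) AS total_sales
    FROM fact_sales f
    JOIN dim_date d    ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    WHERE d.calendar_year = ?
    GROUP BY d.calendar_month, p.category
    HAVING SUM(f.sales_amount) > ?
    ORDER BY d.calendar_month, p.category
""", (2024, 100.0)).fetchall()

print(monthly_category_sales)
```

Every join is a single hop from the fact table to a dimension, which is what keeps the query both easy to write and predictable to optimize.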
4 The Dimension Model Form
Dimensional modeling achieves its performance advantage by designing denormalizations into data organizations specific to answering a limited range of business questions. These denormalizations take the form of placing data in physical relationships and eliminating the logical business-based relationships that follow an entity-to-entity-to-entity form, in favor of more direct relationships between report grouping references and business metrics.
In other words, the dimensional model form creates explicit relationship biases to simplify queries, reduce I/O and eliminate query optimization complexity, which delivers answers to business questions efficiently and quickly.
The pattern of denormalization follows the form of a central table, called a fact table, containing one or more business measurements called facts. The facts may be sourced from a variety of transactional and reference sources, all of which may be used in combination to answer certain classes of business questions.
The fact table row always has the context of a time period, either a date alone or a date and time together. The time period may be either a date or a higher-level time period, such as week, month, quarter or year. Facts may be transactional, a point-in-time snapshot of the state of metrics, or a period-based aggregate.
The fact table also has foreign key attributes relating the fact rows to reference tables called dimensions. Dimensions may represent a single entity, but typically contain attributes from, or derived from, multiple entities describing a subject.
Typically at least one dimension associated with the fact table is based on an entity with a natural business-based relationship to the business activity represented in the facts of the fact table. There are usually other dimension relationships that are one or two entities removed from the business activity documented in the fact table. There may also be additional dimensions related to the facts that must be derived by processing other business activity.
Keep in mind that if a source does not actually document all of the data relationships, for example the customer's origination sales channels, then these relationships must be derived from processing business activity records, such as sales or service orders.
One must also build into the process and structure of the star schema all of the complex processing that would be needed against the entity relationship model to bring data up to a common simplified form, fit for answering functionally similar business questions.
The philosophy of the dimensional model is to do all of the processing once to form a common basis for a class of business questions or analysis, storing the results of that process in the star schema so that BI queries avoid that complex process at report runtime. It is a 'process once, use it many times' approach.
The end result should be a star schema capable of delivering measurements based on simple SELECT, JOIN, WHERE and GROUP BY statements.
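The 'process once, use it many times' approach can be sketched roughly as below. The example assumes SQLite through Python's sqlite3 module. All names (sales_channel, sales_order, dim_channel, fact_order, load_star, report) are hypothetical, as is the rule that the origination channel is the channel of a customer's earliest order. For brevity the source channel_id is reused as the dimension key, where a real design would more likely assign a surrogate key.

```python
import sqlite3

# Hypothetical source and star tables, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized source tables
CREATE TABLE sales_channel (channel_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE sales_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    channel_id  INTEGER NOT NULL REFERENCES sales_channel(channel_id),
    order_date  TEXT NOT NULL,
    amount      REAL NOT NULL
);
INSERT INTO sales_channel VALUES (1, 'Web'), (2, 'Store');
INSERT INTO sales_order VALUES
    (1, 1, 1, '2024-01-05', 100.0),
    (2, 1, 2, '2024-02-10',  50.0),
    (3, 2, 2, '2024-01-20',  70.0),
    (4, 2, 1, '2024-03-01',  30.0);

-- Star schema targets
CREATE TABLE dim_channel (channel_key INTEGER PRIMARY KEY, channel_name TEXT NOT NULL);
CREATE TABLE fact_order (
    order_date              TEXT NOT NULL,
    origination_channel_key INTEGER NOT NULL REFERENCES dim_channel(channel_key),
    order_amount            REAL NOT NULL
);
""")


def load_star(conn):
    """Run the complex derivation once and store its result in the star schema."""
    conn.execute("INSERT INTO dim_channel SELECT channel_id, name FROM sales_channel")
    # Derive each order's origination channel from the customer's earliest order.
    # Assumes one earliest order per customer.
    conn.execute("""
        INSERT INTO fact_order (order_date, origination_channel_key, order_amount)
        SELECT o.order_date, first_o.channel_id, o.amount
        FROM sales_order o
        JOIN sales_order first_o
          ON first_o.customer_id = o.customer_id
         AND first_o.order_date = (SELECT MIN(o2.order_date)
                                   FROM sales_order o2
                                   WHERE o2.customer_id = o.customer_id)
    """)


def report(conn, since):
    """A simple SELECT, JOIN, WHERE, GROUP BY report against the star schema."""
    return conn.execute("""
        SELECT d.channel_name, SUM(f.order_amount)
        FROM fact_order f
        JOIN dim_channel d ON d.channel_key = f.origination_channel_key
        WHERE f.order_date >= ?
        GROUP BY d.channel_name
        ORDER BY d.channel_name
    """, (since,)).fetchall()


load_star(conn)                           # process once
all_time = report(conn, "0000-01-01")     # use many times
since_feb = report(conn, "2024-02-01")
print(all_time, since_feb)
```

The correlated sub-query runs only inside load_star; each report afterwards pays for a single join and an aggregate.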
5 The Dimensional Model Function
One concludes that the dimensional form is a performance architecture intended to improve report query performance. So far, however, a full understanding of why dimensional models perform so well and what limits them has yet to be exposed.
The star schema design is created to measure the business. It is created with a business function orientation, as opposed to the subject area orientation of the ER model.
The form is a centralized series of measures (facts) surrounded by attributes that give business context to those measurements.
While some consumers may refer to the content as subjects, the real orientation is business reporting and analysis. It may be Sales Analysis or Risk Analysis, but these are organized to support specific business functions, not to provide general data as a subject. Instead of being presented as it exists in an ER model or in the source, data is organized for making decisions.
Some of Webster's definitions of the word "Information" are:
2 "INTELLIGENCE, NEWS"
3 "FACTS, DATA"
Architects do not design dimensional models that deliver measurements (facts) randomly as data. The purpose is to deliver organized information to business clients that supports the client's business decision-making function.
To be "information," measures have to be organized and presented with functional context; without that, they are simply data. Providing data is what an ER model does. It delivers it without bias. It's up to the consumer to discern how to make it provide information. In a dimensional model, much of the work of organizing data as information is performed in advance of report execution.
Therefore, a primary function for which the dimensional form is employed is that of a performance architecture built upon the direct structuring of information for a specific business function.
It is important to make this distinction because there are other means of implementing performance architectures for delivering information that do not rely on data denormalizations in a database.
This is not to say that dimensional model content is the final state of the information organization. In systems that employ the dimensional form, it represents the foundational state of information that is further organized into reporting to deliver KPIs, comparisons, trends, graphics and other business-oriented presentations of information.
6 The Limits of Single Form Design
All that has been examined to this point represents the foundation for the remaining examination.
Architects realize that there are limits to form. An automobile maker creates a variety of forms for different functional needs. Each of those forms has recognizable limits. A Freightliner semi-truck with a raised roof sleeper, Hendrickson AIRTEK axles and front suspensions is designed for long-distance freight hauling in comfort, but it is not functional for the morning commute. One might drive it downtown, but the fuel consumption empties the wallet and, guaranteed, it won't fit in the parking garage.
Clearly, design form has limits. The architect's role is to understand those design form limits and produce system designs using integrated design forms to fulfill functional requirements.
By form is meant not only model forms; a wide variety of technology-based design forms are available for examination as well.
6.1 Function Limiting Characteristics of the Dimensional Form
The dimensional model is a powerful performance architecture form for the delivery of information to businesses when properly applied. Like the ER form, the dimensional form has limitations in its recognized function.
6.1.1 The Dimensional Form Does Not Extend Well
Ability to extend is a relative evaluation comparing one form to another. The evaluation is really about how much disruption to process and existing data, and how much retesting, is involved in extending existing implementations.
Purveyors of the dimensional model sometimes state that extending the dimensional form is as easy as adding new attributes to dimensions, or adding new dimensions and dimensional keys to an existing fact table from a specific point in time forward, and backfilling attributes and foreign keys with the standard defaults for NULL or Not Applicable values.
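The extension approach described above might look like the following sketch. It assumes SQLite through Python's sqlite3 module; the tables (fact_sales, dim_promotion), the key value 0 for the Not Applicable row and the sample data are hypothetical. It shows only the structural change and the backfill by default, not the processing changes that accompany it.

```python
import sqlite3

# Hypothetical existing star schema content, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (
    date_key     INTEGER NOT NULL,
    product_key  INTEGER NOT NULL,
    sales_amount REAL NOT NULL
);
INSERT INTO fact_sales VALUES (20240115, 1, 100.0), (20240116, 1, 50.0);
""")

# Add the new dimension, including a default row for facts that predate it.
conn.executescript("""
CREATE TABLE dim_promotion (
    promotion_key  INTEGER PRIMARY KEY,
    promotion_name TEXT NOT NULL
);
INSERT INTO dim_promotion VALUES (0, 'Not Applicable'), (1, 'Spring Sale');

-- Existing fact rows are backfilled with the default key 0.
ALTER TABLE fact_sales ADD COLUMN promotion_key INTEGER NOT NULL DEFAULT 0;
""")

# From this point forward, new facts carry a real promotion key.
conn.execute("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", (20240201, 1, 70.0, 1))

sales_by_promotion = conn.execute("""
    SELECT p.promotion_name, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_promotion p ON p.promotion_key = f.promotion_key
    GROUP BY p.promotion_name
    ORDER BY p.promotion_name
""").fetchall()

print(sales_by_promotion)
```

The structural change itself is small. All history before the change lands under Not Applicable, and the load process that populates fact_sales still has to be changed to supply the new key.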
The reality of dimensional model extension is rather different.
1 Changes in Processing
Even when this approach can be taken, the addition of new content means there is a change in existing processing. Aside from additional sourcing, the processing typically involves integration with content sourced from multiple entity sources. If the target is an existing fact