Beginning Database Design - P11


The demands of the modern global economy and the Internet dictate that end-user operational applications be active 24/7, 365 days a year. There is no window for any type of batch activity, because when people are asleep in Europe, others are awake down under in Australia. The global economy requires instant and acceptable servicing of the needs of a global user population.

In reality, the most significant difference between OLTP databases and data warehouses extends all the way down to the hardware layer. OLTP databases need highly efficient sharing of critical resources such as onboard memory (RAM), and have very small I/O requirements. Data warehouses are completely opposite: they can consume large portions of RAM transferring data between disk and memory, to the detriment of an OLTP database running on the same machine. Where OLTP databases need resource sharing, data warehouses need to hog those resources for extended periods of time. So, a data warehouse hogs machine resources, while an OLTP database attempts to share those same resources. Run both on the same machine and the likely result is unacceptable response times for both database types, because of a lack of basic I/O resources. The result, therefore, is a requirement for complete separation between operational (OLTP) and decision-support (data warehouse) activity. This is why data warehouses exist!

The Relational Database Model and Data Warehouses

The traditional OLTP (transactional) type of relational database model does not cater for data warehouse requirements. The relational database model is too granular; "granular" implies too many little pieces. Processing large transactions across all those little pieces, joining them all back together, is too time consuming. Much like the object database model, the relational database model removes duplication and creates granularity. This type of database model is efficient for front-end application performance, involving small amounts of data accessed frequently and concurrently by many users at once. This is what an OLTP database does.

Data warehouses, on the other hand, need throughput of huge amounts of data by relatively few users. Data warehouses process large quantities of data at once, mainly for reporting and analytical processing. Also, data warehouses are regularly updated, but usually in large batch operations. OLTP databases need lightning-quick responses to many individual users. Data warehouses perform enormous amounts of I/O activity over copious quantities of data; therefore, the needs of OLTP and data warehouse databases are completely contrary to each other, down to the lowest layer of hardware resource usage. Hardware resource usage is the most critical consideration. Software rests quite squarely on the shoulders of your hardware. Proper use of memory (RAM), disk storage, and CPU time is the critical layer underpinning all activity. OLTP and data warehouse database differences extend all the way down to this most critical of layers. OLTP databases require intensely sharable hardware structures (commonly known as concurrency), needing highly efficient use of memory and processor time allocations. Data warehouses need huge amounts of disk space, and processing power as well, but all dedicated to long-running programs (commonly known as batch operations, or throughput).

A data warehouse simply cannot cope with a standard OLTP relational database model. Something else is needed for a data warehouse.


Surrogate Keys in a Data Warehouse

Surrogate keys, as you already know, are replacement key values. A surrogate key usually makes database access more efficient. In data warehouse databases, surrogate keys are possibly even more important, in terms of gluing together related data from different sources, even from different databases, perhaps running on different database engines. Sometimes different databases are keyed on different values, or even contain different key values, which in the non-computerized world are actually identical.

For example, a customer in one department of a company could be uniquely identified by the customer's name. In a second department within the same company, the same customer could be identified by the name of a contact, or perhaps by the phone number of that customer. A third department could identify the same customer by a fixed-length character coding system. All three definitions identify exactly the same customer. If this single company is to have meaningful data across all departments, it must recognize the three separate formats, all representing the same customer, as being the same customer in the data warehouse. A surrogate key is the perfect solution, using the same surrogate key value for each repetition of the same customer, across all departments. Surrogate key use is prominent in data warehouse database modeling.
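As a minimal sketch of the idea (the table and column names here are illustrative, not from the book), a single conformed customer table could carry one surrogate key per real-world customer, with each department's native identifier alongside it:

-- Hypothetical conformed customer dimension: one surrogate key per
-- real-world customer, with each department's native identifier kept
CREATE TABLE customer_dim (
    customer_key  INTEGER PRIMARY KEY,  -- surrogate key referenced by all fact tables
    dept1_name    VARCHAR(64),          -- department 1 identifies customers by name
    dept2_contact VARCHAR(64),          -- department 2 identifies customers by contact or phone
    dept3_code    CHAR(8)               -- department 3 uses a fixed-length character code
);

-- The same customer, identified three different ways, maps to one surrogate key
INSERT INTO customer_dim VALUES (1001, 'Acme Ltd', '555-0199', 'ACME0001');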

Referential Integrity in a Data Warehouse

Data warehouse data modeling is essentially a form of relational database modeling, albeit a simplistic form. Referential integrity still applies to data warehouse databases; however, even though referential integrity applies, it is not essential to create primary keys, foreign keys, and their inter-table referential links. It is important to understand that a data warehouse database generally has two distinct activities. The first activity is updating, with large numbers of records added at once, and sometimes large numbers of records changed. It is always best to only add or remove data in a data warehouse; changing existing data warehouse table records can be extremely inefficient, simply because of the sheer size of data warehouses.

Referential integrity is best implemented and enforced when updating tables. The second activity of a data warehouse is the reading of data. When data is read, referential integrity does not need to be verified, because no changes are occurring to records in tables. Even so, because referential integrity implies the creation of primary and foreign keys, and because the best database model designs make extensive use of primary and foreign key fields in SQL code, it is best to leave referential integrity intact for a data warehouse.
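In practice, this often means declaring the keys but relaxing enforcement while a batch is loaded. A sketch using Oracle-style syntax (the table and constraint names are illustrative assumptions, not from the book):

-- Suspend checking on the fact table's foreign key during the bulk load
ALTER TABLE sale DISABLE CONSTRAINT fk_sale_customer;

-- ... bulk-load the batch of new fact records here ...

-- Re-enable (and re-validate) the constraint once the batch load completes
ALTER TABLE sale ENABLE CONSTRAINT fk_sale_customer;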

So, now we know the origin of data warehouses, and why they were devised. What is the data warehouse dimensional database model?


The Dimensional Database Model

A standard, normalized relational database model is completely inappropriate to the requirements of a data warehouse. Even a denormalized relational database model doesn't make the cut. An entirely different modeling technique, called a dimensional database model, is needed for data warehouses. A dimensional model contains what are called facts and dimensions. A fact table contains historical transactions, such as all invoices issued to all customers for the last five years. That could be a lot of records.

Dimensions describe facts.

The easiest way to describe the dimensional model is to demonstrate by example. Figure 7-1 shows a relational table structure for both static book data and dynamic (transactional) book data. The grayed-out tables in Figure 7-1 are static data tables; the others contain data in a constant state of change. Static tables are the equivalent of dimensions, describing facts (the equivalent of transactions). So, in Figure 7-1, the dimensions are grayed out and the facts are not.

Figure 7-1: The OLTP relational database model for books

[Figure 7-1 diagram: the Customer, Shipper, Sale, Edition, Publisher, Publication, Author, Review, Subject, CoAuthor, and Rank tables, with their columns and primary/foreign keys.]


What Is a Star Schema?

The most effective approach for a data warehouse database model (using dimensions and facts) is called a star schema. Figure 7-2 shows a simple star schema for the REVIEW fact table shown in Figure 7-1.

Figure 7-2: The REVIEW table fact-dimensional structure

A more simplistic equivalent diagram to that of Figure 7-2 is shown by the star schema structure in Figure 7-3.

[Figure 7-2 diagram: the Review fact table, keyed on review_id with one foreign key per dimension, connected directly to the Customer, Author, Publisher, and Publication dimension tables.]


Figure 7-3: The REVIEW fact-dimensional structure is a star schema.

A star schema contains a single fact table plus a number of small dimensional tables. If there is more than one fact table, effectively there is more than one star schema. Fact tables contain transactional records, which over a period of time can come to contain very large numbers of records. Dimension tables, on the other hand, remain relatively constant in record numbers. The objective is to enhance SQL query join performance, where joins are executed between a single fact table and multiple dimensions, all on a single hierarchical level. So, a star schema is a single, very large, very changeable fact table, connected directly to a single layer of multiple, static-sized dimensional tables (a cut-down sketch in SQL follows Figure 7-3).

[Figure 7-3 diagram: the Review fact table at the center of the star, with one-to-many relationships from dimensions including Publisher and Customer.]
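A minimal sketch of that shape in SQL, assuming a cut-down version of the REVIEW star schema from Figure 7-2 (column lists and sizes are abbreviated):

-- Dimension tables: small, relatively static, descriptive data
CREATE TABLE author (
    author_id INTEGER PRIMARY KEY,
    author    VARCHAR(64)
);

CREATE TABLE publication (
    publication_id INTEGER PRIMARY KEY,
    title          VARCHAR(128)
);

-- Fact table: large and fast-growing, with one foreign key per dimension,
-- so every join is a single hop from the fact table to a dimension
CREATE TABLE review (
    review_id      INTEGER PRIMARY KEY,
    author_id      INTEGER REFERENCES author (author_id),
    publication_id INTEGER REFERENCES publication (publication_id),
    review_date    DATE,
    review_text    VARCHAR(4000)
);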


What Is a Snowflake Schema?

A snowflake schema is shown in Figure 7-4. A snowflake schema is a normalized star schema, such that dimension entities are normalized (dimensions are separated into multiple tables). Normalized dimensions have all duplication removed from each dimension, such that the result is a single fact table connected directly to some of the dimensions. Not all of the dimensions are directly connected to the fact table. In Figure 7-4, the dimensions are grayed out in two shades of gray. The lighter shade of gray represents dimensions connected directly to the fact table (BOOK, AUTHOR, SUBJECT, SHIPPER, and CUSTOMER). The darker-shaded dimensional tables are normalized subset dimensional tables, not connected to the fact table directly (PUBLISHER, PUBLICATION, and CATEGORY). A sketch of one such normalized dimension follows.
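In the SUBJECT dimension from Figure 7-4, for instance, the CATEGORY table is split out so that category names are stored only once; a query from the fact table to a category now takes two hops instead of one. A minimal sketch, with abbreviated column sizes:

-- Second-level dimension, not connected to the fact table directly
CREATE TABLE category (
    category_id INTEGER PRIMARY KEY,
    category    VARCHAR(32)
);

-- First-level dimension, connected to the fact table; duplicated category
-- names are removed by referencing CATEGORY instead of repeating the text
CREATE TABLE subject (
    subject_id  INTEGER PRIMARY KEY,
    category_id INTEGER REFERENCES category (category_id),
    subject     VARCHAR(32)
);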

Figure 7-4: The SALE table fact-dimensional structure

[Figure 7-4 diagram: the Sale fact table connected directly to the Book, Author, Subject, Shipper, and Customer dimensions, with Publisher and Publication behind Book, and Category behind Subject, as normalized second-level dimensions.]


A more simplistic equivalent diagram to that of Figure 7-4 is shown by the snowflake schema in Figure 7-5.

Figure 7-5: The SALE fact-dimensional structure is a snowflake schema.

The problem with snowflake schemas isn't too many tables, but too many layers. Data warehouse fact tables can become incredibly large, growing to millions, billions, even trillions of records. The critical factor in creating star and snowflake schemas, instead of using standard "nth" Normal Form layers, is decreasing the number of tables in SQL query joins. The more tables in a join, the more complex the query, and the slower it will execute. When fact tables contain enormous record counts, reports can take hours or days, not minutes. Adding just one more table to a fact-dimensional query join at that level of database size could make the query run for weeks. That's no good!

[Figure 7-5 diagram: the Sale fact table at the center, with one-to-many relationships from the Book, Customer, Shipper, Author, and Subject dimensions, and with Publication and Publisher chained behind Book, and Category behind Subject.]


The solution is an obvious one: convert (denormalize) a normalized snowflake schema into a star schema, as shown in Figure 7-6. In Figure 7-6, the PUBLISHER and PUBLICATION tables have been denormalized into the BOOK table, and the CATEGORY table has been denormalized into the SUBJECT table (a sketch of this conversion follows).
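One way to perform that conversion, assuming the snowflake tables from Figure 7-4 already exist (the new table name is illustrative, and the "#" in edition# follows the Oracle-style naming in the figures; treat this as a sketch rather than the book's prescribed method):

-- Collapse PUBLISHER and PUBLICATION into a single denormalized BOOK
-- dimension by joining them once, up front, instead of in every query
CREATE TABLE book_denormalized AS
SELECT b.ISBN,
       pbl.title,        -- pulled in from PUBLICATION
       pbs.publisher,    -- pulled in from PUBLISHER
       b.edition#, b.print_date, b.pages, b.list_price, b.format,
       b.rank, b.ingram_units
FROM book b
JOIN publication pbl ON pbl.publication_id = b.publication_id
JOIN publisher pbs ON pbs.publisher_id = b.publisher_id;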

Figure 7-6: A denormalized SALE table fact-dimensional structure

A more simplistic equivalent diagram to that of Figure 7-6 is shown by the star schema in Figure 7-7.

[Figure 7-6 diagram: the Sale fact table connected directly to the Customer, Shipper, Author, Book (now carrying publisher and title), and Subject (now carrying category) dimensions.]


Figure 7-7: The SALE fact-dimensional structure denormalized into a star schema.

What does all this prove? Not much, you might say. On the contrary, two things are achieved by using fact-dimensional structures and star schemas:

❑ Figure 7-1 shows a highly normalized table structure, useful for high-concurrency, precision record-searching databases (an OLTP database). Replacing this structure with a fact-dimensional structure (as shown in Figure 7-2, Figure 7-4, and Figure 7-6) reduces the number of tables. As you already know, reducing the number of tables is critical to SQL query performance. Data warehouses consist of large quantities of data, batch updates, and incredibly complex queries. The fewer tables, the better; it just makes things so much easier, especially because there is so much data. The following code is a SQL join query for the snowflake schema, joining all nine tables shown in Figure 7-5 (the join conditions follow the keys in Figure 7-4; the WHERE, GROUP BY, and ORDER BY clauses are left unspecified):

SELECT * FROM SALE SAL
JOIN AUTHOR AUT ON AUT.author_id = SAL.author_id
JOIN CUSTOMER CUS ON CUS.customer_id = SAL.customer_id
JOIN SHIPPER SHP ON SHP.shipper_id = SAL.shipper_id
JOIN SUBJECT SUB ON SUB.subject_id = SAL.subject_id
JOIN CATEGORY CAT ON CAT.category_id = SUB.category_id
JOIN BOOK BOO ON BOO.ISBN = SAL.ISBN
JOIN PUBLISHER PBS ON PBS.publisher_id = BOO.publisher_id
JOIN PUBLICATION PBL ON PBL.publication_id = BOO.publication_id
WHERE ...
GROUP BY ...
ORDER BY ...;

❑ If the SALE fact table has 1 million (10^6) records, and all dimensions contain 10 records each, a Cartesian product would return 10^6 multiplied by 10^9 records (nine dimensions of 10 records each yield 10^9 combinations). That makes for 10^15 records. That is a lot of records for any CPU to process.

[Figure 7-7 diagram: the Sale fact table at the center of the denormalized star, with one-to-many relationships from the Book, Customer, Shipper, Author, and Subject dimensions.]


A Cartesian product is a worst-case scenario.

❑ Now look at the next query.

SELECT * FROM SALE SAL
JOIN AUTHOR AUT ON AUT.author_id = SAL.author_id
JOIN CUSTOMER CUS ON CUS.customer_id = SAL.customer_id
JOIN SHIPPER SHP ON SHP.shipper_id = SAL.shipper_id
JOIN SUBJECT SUB ON SUB.subject_id = SAL.subject_id
JOIN BOOK BOO ON BOO.ISBN = SAL.ISBN
WHERE ...
GROUP BY ...
ORDER BY ...;

❑ Using the star schema from Figure 7-7, and assuming the same number of records, a join occurs between one fact table and six dimensional tables. That is a Cartesian product of 10^6 multiplied by 10^6, resulting in 10^12 records returned. The difference between 10^12 and 10^15 is three orders of magnitude, and that is not just three zeroes amounting to 1,000 records. The difference is actually 1,000,000,000,000,000 – 1,000,000,000,000 = 999,000,000,000,000, which is itself only a little less than 10^15. From the perspective of counting all those zeros, the gap between six dimensions and nine is enormous. Fewer dimensions make for faster queries. That's why it is so essential to denormalize snowflake schemas into star schemas.

❑ Take another quick glance at the snowflake schema in Figure 7-4 and Figure 7-5. Then examine the equivalent denormalized star schema in Figure 7-6 and Figure 7-7. Now put yourself into the shoes of a hustled, harried, and very busy executive trying to get a quick report. Think as an end-user, one interested only in results. Which diagram is easier to decipher as to content and meaning? The diagram in Figure 7-7 is less complex than the diagram in Figure 7-5. After all, being an end-user, you are probably not too interested in understanding the complexities of how to build SQL join queries. You have bigger fish to fry. The point is this: the less complex the table structure, the easier it will be to use. This is because a star schema is more representative of the real world than a snowflake schema. Look at it this way: a snowflake schema is more deeply normalized than a star schema and, therefore, by definition more mathematical. Something more mathematical is generally of more use to a mathematician than to an executive manager. The executive is trying to get a quick overall impression of whether his company will sell more cans of lima beans, or more cans of string beans, over the course of the next ten years. If you are a computer programmer, you will quite probably not agree with this analogy.

That tells us the very basics of data warehouse database modeling. How can a data warehouse database model be constructed?

How to Build a Data Warehouse Database Model

Now you know how to build star schemas for data warehouse database models. As you can see, a star schema is quite different from a standard relational database model (Figure 7-1). The next step is to examine the process, or the steps, by which a data warehouse database model can be built.
