Hash Keys and ISAM Keys
There are other, less commonly used indexes, such as hash keys and Indexed Sequential Access Method (ISAM) keys. Both are somewhat out of date in the larger-scale relational database engines; however, Microsoft Access does use a mixture of ISAM and BTree indexing techniques in its JET database. Neither ISAM nor hash indexes are good for heavily changing data because their structures overflow with newly introduced records. Similar to bitmap indexes, hash and ISAM keys must be rebuilt regularly to maintain their processing speed advantage. Frequent rebuilds minimize performance-killing overflow.
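The overflow problem can be illustrated with a toy model. The following is only a sketch, assuming a static hash index with a fixed number of fixed-capacity buckets; the class name StaticHashIndex and its parameters are hypothetical and do not correspond to any particular database engine.

```python
# Toy static hash index: a fixed number of buckets, each with a fixed capacity.
# All names here are illustrative assumptions, not a real engine's API.

class StaticHashIndex:
    def __init__(self, bucket_count, bucket_capacity):
        self.bucket_count = bucket_count
        self.bucket_capacity = bucket_capacity
        self.buckets = [[] for _ in range(bucket_count)]
        self.overflow = [[] for _ in range(bucket_count)]  # one overflow chain per bucket

    def _slot(self, key):
        # Integer keys with a modulo hash keep the demonstration deterministic.
        return key % self.bucket_count

    def insert(self, key, row):
        slot = self._slot(key)
        if len(self.buckets[slot]) < self.bucket_capacity:
            self.buckets[slot].append((key, row))
        else:
            # The bucket is full, so the entry spills into the overflow chain.
            self.overflow[slot].append((key, row))

    def lookup(self, key):
        """Return (row, entries_examined) for the given key."""
        slot = self._slot(key)
        examined = 0
        for candidate, row in self.buckets[slot] + self.overflow[slot]:
            examined += 1
            if candidate == key:
                return row, examined
        return None, examined

    def overflow_entries(self):
        return sum(len(chain) for chain in self.overflow)

    def rebuild(self, bucket_count):
        """Build a fresh index with more buckets and re-insert every entry."""
        fresh = StaticHashIndex(bucket_count, self.bucket_capacity)
        for slot in range(self.bucket_count):
            for key, row in self.buckets[slot] + self.overflow[slot]:
                fresh.insert(key, row)
        return fresh


index = StaticHashIndex(bucket_count=4, bucket_capacity=2)
for key in range(40):
    index.insert(key, f"row-{key}")

rebuilt = index.rebuild(bucket_count=32)
```

In this model, most of the 40 entries end up in overflow chains, so a lookup has to walk a long chain; after the rebuild, the same lookup examines only a couple of entries.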
Clusters, Index Organized Tables, and Clustered Indexes
Clusters are used to contain fields from tables, usually a join, where the cluster contains a physical copy of a small portion of the fields in a table, perhaps the most commonly accessed fields. Essentially, clusters have been somewhat superseded by materialized views. A clustered index (index organized table, or IOT) is a more complex type of cluster where all the fields in a single table are reconstructed, not in the usual heap structure, but in the form of a BTree index. In other words, for an IOT, the leaf blocks in the diagram shown in Figure 13-3 would contain not only the indexed field value, but also all the rest of the fields in the table (not just the primary key values).
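As a rough illustration of the idea, SQLite offers WITHOUT ROWID tables, which store each complete row inside the primary key BTree. The sketch below uses that feature as a stand-in for an IOT; the table name band_iot, the sample rows, and the helper find_band are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# WITHOUT ROWID asks SQLite to keep the whole row in the primary key BTree,
# which is broadly comparable to an index organized table.
conn.execute(
    """
    CREATE TABLE band_iot (
        band_id       INTEGER PRIMARY KEY,
        band          TEXT NOT NULL,
        founding_date TEXT
    ) WITHOUT ROWID
    """
)

# Hypothetical sample rows, deliberately inserted out of key order.
conn.executemany(
    "INSERT INTO band_iot (band_id, band, founding_date) VALUES (?, ?, ?)",
    [
        (3, "Band C", "2005-06-25"),
        (1, "Band A", "1999-01-01"),
        (2, "Band B", "2001-03-15"),
    ],
)


def find_band(band_id):
    """Fetch every field for one key with a single lookup on the key BTree."""
    return conn.execute(
        "SELECT band_id, band, founding_date FROM band_iot WHERE band_id = ?",
        (band_id,),
    ).fetchone()
```

A lookup by BAND_ID finds all the other fields in the same structure, with no second step into a separate heap table.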
Understanding Auto Counters
Sequences are commonly used to create internally generated (transparent) counters for surrogate primary keys. Auto counters are called sequences in some database engines. This command would create a sequence object:
CREATE SEQUENCE BAND_ID_SEQUENCE START=1 INCREMENT=1 MAX=INFINITY;
Then you could use the previous sequence to generate primary keys for the BAND table (see Figure 13-1), as in the following INSERT command, creating a new band called “The Big Noisy Rocking Band.”
INSERT INTO BAND (BAND_ID, GENRE_ID, BAND, FOUNDING_DATE)
VALUES ( BAND_ID_SEQUENCE.NEXT, (SELECT GENRE_ID FROM GENRE WHERE GENRE='Rock'),
'The Big Noisy Rocking Band', '25-JUN-2005' );
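For a version that actually runs, the following sketch uses SQLite's AUTOINCREMENT auto counter in place of a sequence object. The simplified table definitions, the ISO date strings, the helper add_band, and the second band name are assumptions for the example, not part of the chapter's model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE genre (
        genre_id INTEGER PRIMARY KEY AUTOINCREMENT,
        genre    TEXT NOT NULL
    );
    CREATE TABLE band (
        band_id       INTEGER PRIMARY KEY AUTOINCREMENT,  -- the auto counter
        genre_id      INTEGER REFERENCES genre (genre_id),
        band          TEXT NOT NULL,
        founding_date TEXT
    );
    INSERT INTO genre (genre) VALUES ('Rock');
    """
)


def add_band(name, genre, founding_date):
    """Insert a band; the engine generates the surrogate key transparently."""
    cursor = conn.execute(
        "INSERT INTO band (genre_id, band, founding_date) "
        "VALUES ((SELECT genre_id FROM genre WHERE genre = ?), ?, ?)",
        (genre, name, founding_date),
    )
    return cursor.lastrowid


first_id = add_band("The Big Noisy Rocking Band", "Rock", "2005-06-25")
second_id = add_band("A Second Band", "Rock", "2006-01-01")  # hypothetical band
```

The INSERT never mentions BAND_ID; the counter supplies the next value on each insert.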
Understanding Partitioning and Parallel Processing
Partitioning is just that: it partitions. It separates tables into separate physical partitions. The idea is that processing can be executed against individual partitions, and even in parallel against multiple partitions at the same time. Imagine a table with 1 million records. Reading those 1 million records can take an inordinately long time; however, dividing that 1 million record table into 100 separate physical partitions can allow queries to read far fewer records. This, of course, assumes that records are read within the structure of partition separation. As in previous sections of this chapter, the easiest way to explain partitioning, what it is, and how it works, is to just demonstrate it. The diagram in Figure 13-5 shows the splitting of a data warehouse fact table into separate partitions.
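Both ideas, reading a single partition and processing several partitions at once, can be sketched in a few lines. This is only an illustration with in-memory lists standing in for physical partitions; the year partition key and the function names are assumptions, and Python threads merely show the structure of parallel execution rather than real parallel I/O.

```python
from concurrent.futures import ThreadPoolExecutor


def build_partitions(rows):
    """Group rows into partitions keyed by year (the assumed partition key)."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row["year"], []).append(row)
    return partitions


def total_for_year(partitions, year):
    """Read only the one partition that can contain the requested year."""
    rows_read = partitions.get(year, [])
    return sum(r["amount"] for r in rows_read), len(rows_read)


def total_all_parallel(partitions, workers=4):
    """Sum each partition independently, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(
            lambda part: sum(r["amount"] for r in part), partitions.values()
        )
        return sum(partials)


rows = [{"year": 2002 + (i % 4), "amount": i} for i in range(1000)]
partitions = build_partitions(rows)
```

A query for one year touches 250 of the 1,000 rows here, while a full total is computed as four independent partial sums.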
Figure 13-5: Splitting a data warehouse table into separate physical partitions.
[Figure: the FACT table, with foreign keys to the GENRE, BAND, ADVERTISEMENT, DISCOGRAPHY, SHOW_VENUE, MERCHANDISE, INSTRUMENT, and MUSICIAN dimensions, is split into multiple physical partitions, each holding the same fact fields.]
Chapter 13
In some database engines, you can even split materialized views into partitions, in the same way that tables can be partitioned. The fact table shown in Figure 13-5 (as fact tables should) references only surrogate primary keys, as foreign keys to dimensions. It is easier to explain some of the basics of partitioning using the materialized view created earlier in this chapter, because the materialized view contains the descriptive dimensions as well as the surrogate key integer values. In other words, even though not technically correct, it is easier to demonstrate partitioning on dimensional descriptions, such as a region of the world (North America, South America, and so on), than on an inscrutable LOCATION_ID foreign key value. This is the materialized view created earlier:
CREATE MATERIALIZED VIEW MV_MUSIC
ENABLE REFRESH ENABLE QUERY REWRITE
AS SELECT F.*, I.*, MU.*, G.*, B.*, A.*, D.*, SV.*, ME.*, T.*, L.*
FROM FACT F JOIN INSTRUMENT I ON (I.INSTRUMENT_ID = F.INSTRUMENT_ID)
JOIN MUSICIAN MU ON (MU.MUSICIAN_ID = F.MUSICIAN_ID)
JOIN GENRE G ON (G.GENRE_ID = F.GENRE_ID)
JOIN BAND B ON (B.BAND_ID = F.BAND_ID)
JOIN ADVERTISEMENT A ON (A.ADVERTISEMENT_ID = F.ADVERTISEMENT_ID)
JOIN DISCOGRAPHY D ON (D.DISCOGRAPHY_ID = F.DISCOGRAPHY_ID)
JOIN SHOW_VENUE SV ON (SV.SHOW_ID = F.SHOW_ID)
JOIN MERCHANDISE ME ON (ME.MERCHANDISE_ID = F.MERCHANDISE_ID)
JOIN TIME T ON (T.TIME_ID = F.TIME_ID)
JOIN LOCATION L ON (L.LOCATION_ID = F.LOCATION_ID);
Now, partition the materialized view based on regions of the world — this one is called a list partition:
CREATE TABLE PART_MV_REGIONAL PARTITION BY LIST (REGION) (
PARTITION PART_AMERICAS VALUES ('North America','South America'),
PARTITION PART_ASIA VALUES ('Middle East','Far East','Near East'),
PARTITION PART_EUROPE VALUES ('Europe','Russian Federation'),
PARTITION PART_OTHER VALUES (DEFAULT)
) AS SELECT * FROM MV_MUSIC;
The DEFAULT option covers all regions not listed in the other partitions.
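The routing rule behind a list partition is simple enough to sketch directly. The partition names and region values below come from the statement above; the function name list_partition_for is an assumption, and a real engine would of course apply this rule internally rather than in application code.

```python
# Each partition lists the exact REGION values it accepts.
LIST_PARTITIONS = {
    "PART_AMERICAS": {"North America", "South America"},
    "PART_ASIA": {"Middle East", "Far East", "Near East"},
    "PART_EUROPE": {"Europe", "Russian Federation"},
}
DEFAULT_PARTITION = "PART_OTHER"  # catches every region not listed above


def list_partition_for(region):
    """Return the name of the partition that should store a row for this region."""
    for name, regions in LIST_PARTITIONS.items():
        if region in regions:
            return name
    return DEFAULT_PARTITION
```

Any region that is not named explicitly falls through to the default partition.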
Another type of partition is a range partition, where each partition is limited to a range of values. This partition uses the release date of CDs stored in the field called DISCOGRAPHY.RELEASE_DATE:
CREATE TABLE PART_CD_RELEASE PARTITION BY RANGE (RELEASE_DATE) (
PARTITION PART_2002 VALUES LESS THAN ('1-JAN-2003'),
PARTITION PART_2003 VALUES LESS THAN ('1-JAN-2004'),
PARTITION PART_2004 VALUES LESS THAN ('1-JAN-2005'),
PARTITION PART_2005 VALUES LESS THAN (MAXIMUM)
) AS SELECT * FROM MV_MUSIC;
The MAXIMUM option covers all dates from January 1, 2005 onward, including dates beyond the year 2005.
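Range partition routing amounts to finding the first upper bound that a value falls below. The sketch below mirrors the boundaries in the statement above, using a binary search over the sorted bounds; the function name range_partition_for is an assumption for the example.

```python
from bisect import bisect_right
from datetime import date

# Exclusive upper bounds of the first three partitions ("VALUES LESS THAN").
UPPER_BOUNDS = [date(2003, 1, 1), date(2004, 1, 1), date(2005, 1, 1)]
# One more name than bounds: the last partition has no upper limit (MAXIMUM).
PARTITION_NAMES = ["PART_2002", "PART_2003", "PART_2004", "PART_2005"]


def range_partition_for(release_date):
    """Pick the partition whose range contains the release date."""
    # bisect_right counts the bounds that are <= release_date, which is
    # exactly the number of partitions the date is too late for.
    return PARTITION_NAMES[bisect_right(UPPER_BOUNDS, release_date)]
```

Note that a date exactly on a boundary, such as January 1, 2003, belongs to the later partition, because each bound is exclusive.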
You can also create indexes on partitions. Those indexes can be created locally to each partition, or globally across all partitions created for a table or materialized view. That is partitioning. There are other more complex methods of partitioning, but they are too detailed for this book.
That’s all you need to know about advanced database structures. Take a quick peek at the physical side of things in the guise of hardware resources.
Understanding Hardware Resources
This section briefly examines some facts about hardware, including some specialized database server architectural structures, such as RAID arrays and Grid computing.
How Much Hardware Can You Afford?
Windows computers are cheap, but they have a habit of breaking. UNIX boxes (computers are often called “boxes”) are expensive and have excellent reliability. I have heard of cases of UNIX servers running for years with no problems whatsoever. Typically, a computer system is likely to remain stable as long as it is not tampered with. The simple fact is that Windows boxes are much more easily tampered with than UNIX boxes, so perhaps Windows machines have an undeserved poor reputation as far as reliability is concerned.
How Much Memory Do You Need?
OLTP databases are memory- and processor-intensive. Data warehouse databases are I/O-intensive and, other than needing heavy processing power, couldn’t care less how much RAM is allocated. Heavy memory usage in a relational database usually has a lot to do with concurrency and managing the load of a large number of users, all accessing your database at the same time. That is much more applicable to OLTP databases than to data warehouse databases. For an OLTP database, quite often the more RAM you have, the better. Note, however, that sizing buffer cache values up to the maximum amount of RAM available is pointless, even for an OLTP database. The more RAM allocated for use by a database, the more complex those buffers become for the database to manage.
In short, data warehouses do not need a lot of memory to temporarily hold the most heavily used tables in RAM. There is no point, as data warehouses tend to read lots of data from lots of tables, occasionally. RAM is not as important in a data warehouse as it is in an OLTP database.
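The difference can be made concrete with a toy buffer cache. The sketch below assumes a simple least-recently-used (LRU) replacement policy and made-up workload sizes; real database buffer caches are considerably more sophisticated, so treat the numbers as illustrative only.

```python
from collections import OrderedDict


class BufferCache:
    """A tiny LRU cache of block identifiers, counting hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)    # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1                     # would be a physical disk read
            self.blocks[block_id] = True
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict the least recently used


def oltp_workload(cache, requests=10_000, hot_blocks=50):
    """Many small requests that keep returning to the same few blocks."""
    for i in range(requests):
        cache.read(i % hot_blocks)


def warehouse_scan(cache, table_blocks=10_000, passes=2):
    """Full scans of a table far larger than the cache."""
    for _ in range(passes):
        for block_id in range(table_blocks):
            cache.read(block_id)


oltp_cache = BufferCache(capacity=100)
oltp_workload(oltp_cache)

dw_cache = BufferCache(capacity=100)
warehouse_scan(dw_cache)
```

With the same cache size, the OLTP-style workload is served almost entirely from memory, while the scan-style workload never finds a block still cached.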
Now, briefly examine some specialized aspects of hardware usage, more from an architectural perspective.
Understanding Specialized Hardware Architectures
This section examines the following:
❑ RAID arrays
❑ Standby databases
❑ Replication
❑ Grids and computer clustering
RAID Arrays
The acronym RAID stands for Redundant Array of Inexpensive Disks. That means a bunch of small, cheap disks. Some RAID array hardware setups are cheap. Some are astronomically expensive. You get what you pay for, and you can purchase what suits your requirements. RAID arrays can give huge performance benefits for both OLTP and data warehouse databases.
Some of the beneficial factors of using RAID arrays are recoverability (mirroring), fast random access (striping and multiple disks with multiple bus connections, giving higher throughput capacity), and parallel I/O activity where more than one disk can be accessed at the same time (concurrently). There are numerous types of RAID array architectures, with the following being the most common:
❑ RAID 0: RAID 0 is striping. Striping splits files into pieces, spreading them over multiple disks. RAID 0 gives fast random read and write access, and is thus appropriate for OLTP databases. Rapid recoverability and redundancy are not catered for, so RAID 0 is a little risky. Data warehouses that need to be highly contiguous (data on disk is all in one place) are not catered for by random access; however, RAID 0 can sometimes be appropriate for data warehouses, where large I/O executions utilize parallel processing, accessing many disks simultaneously.
❑ RAID 1: RAID 1 is mirroring. Mirroring makes multiple copies of files, duplicating database changes at the I/O level on disk. Mirroring allows for excellent recoverability. RAID 1 can sometimes cause I/O bottleneck problems because of all the constant I/O activity associated with mirroring, especially with respect to frequently written tables, creating mirrored hot blocks. A hot block is a block in a file that is accessed more heavily than the hardware can cope with. Everything is trying to read and write that hot block at the same time. RAID 1 can provide recoverability for OLTP databases, but can hurt performance. RAID 1 is best used in data warehouses, where mirroring allows parallel read execution of more than one mirror at the same time.
❑ RAID 0+1: RAID 0+1 combines RAID 0 and RAID 1, using both striping and mirroring. Both OLTP and data warehouse I/O performance will be slowed somewhat, but RAID 0+1 can provide good all-around recoverability and performance, perhaps offering the best of both worlds for both OLTP and data warehouse databases.
❑ RAID 5: RAID 5 is essentially a minimized form of mirroring. Data is striped across the disks, and only parity information (from which lost data can be rebuilt) is stored in addition, not a second copy of the real data. RAID 5 is effective with expensive RAID architectures that contain large chunks of purpose-built, onboard buffering RAM.
Those are some of the more commonly implemented RAID array architectures. It is not necessary for you to understand the details; it is more important that you know these architectures exist.
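For readers who do want a feel for the mechanics, the three underlying techniques (striping, mirroring, and parity) can be sketched with byte strings standing in for disks. This is a simplified illustration under stated assumptions: the function names and the 4-byte chunk size are made up, and real RAID 5 additionally rotates parity across the disks, which is not modeled here.

```python
from functools import reduce


def stripe(data, disk_count, chunk_size=4):
    """RAID 0 idea: deal fixed-size chunks round-robin across the disks."""
    disks = [[] for _ in range(disk_count)]
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    for n, chunk in enumerate(chunks):
        disks[n % disk_count].append(chunk)
    return disks


def unstripe(disks):
    """Reassemble striped data by reading chunks back in round-robin order."""
    out = []
    depth = max(len(disk) for disk in disks)
    for row in range(depth):
        for disk in disks:
            if row < len(disk):
                out.append(disk[row])
    return b"".join(out)


def mirror(data, copies=2):
    """RAID 1 idea: every disk holds a complete copy of the data."""
    return [bytes(data) for _ in range(copies)]


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_for(blocks):
    """RAID 5 idea: store one parity block instead of a full second copy."""
    return xor_blocks(blocks)


def rebuild_missing(surviving_blocks, parity):
    """Recover a single lost block from the surviving blocks plus parity."""
    return xor_blocks(surviving_blocks + [parity])


data = b"ABCDEFGHIJKLMNOPQRST"
disks = stripe(data, disk_count=3)
blocks = [b"ABCD", b"EFGH", b"IJKL"]
parity = parity_for(blocks)
```

Striping spreads the chunks so that several disks can be read at once, mirroring doubles the storage for a full copy, and parity can rebuild any one lost block at the cost of only one extra block.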
Standby Databases
A standby database is a failover database. A standby database has minimal activity, usually only adding new records, changing existing records, and deleting existing records. Some database engines do allow for more sophisticated standby database architectures, but once again, the intention in this chapter is to inform you of the existence of standby databases.
Figure 13-6 shows a picture of how standby databases work. A primary database in Silicon Valley (San Jose) is used to service applications, catering to all changes to a database. In Figure 13-6, two standby databases are used, one in New York and one in Orlando. The simplest form of change tracking is used to transfer changes from primary to standby databases. The simplest form of transfer is log entries. Most larger database engines have log files, containing a complete history of all transactions.
Figure 13-6: Standby database architecture allows for instant switchover (failover) recoverability
Log files allow for recoverability of a database. Log files store all changes to a database. If you had to recover a database from backup files that are a week old, the database could be recovered by applying all changes stored in log files (for the last week). The result of week-old cold backups, plus log entries for the last week, would be an up-to-date database.
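The recovery arithmetic (old backup plus log entries equals current database) can be sketched as a replay loop. The log record layout of (operation, key, value) and the sample data below are assumptions for the illustration; real log files record changes at a much lower level.

```python
def apply_log(backup, log_entries):
    """Replay change records, in order, on top of a copy of the backup."""
    restored = dict(backup)  # leave the backup itself untouched
    for operation, key, value in log_entries:
        if operation in ("insert", "update"):
            restored[key] = value
        elif operation == "delete":
            restored.pop(key, None)
        else:
            raise ValueError(f"unknown log operation: {operation}")
    return restored


# Hypothetical state of a table as of the week-old cold backup.
week_old_backup = {1: "Band A", 2: "Band B"}

# Hypothetical changes logged during the week since the backup.
log_since_backup = [
    ("insert", 3, "The Big Noisy Rocking Band"),
    ("update", 1, "Band A (renamed)"),
    ("delete", 2, None),
]

current = apply_log(week_old_backup, log_since_backup)
```

Order matters: the entries must be applied in the sequence in which the changes originally happened.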
The most important use of standby database architecture is failover. In other words, if the primary database fails (such as when someone pulls the plug, or San Jose is struck by a monstrous earthquake), the standby database automatically takes over. In the case of Figure 13-6, if the big one struck near San Jose, the standby database in New York or Orlando would automatically fail over, assuming all responsibilities, and become the new primary database. What is implied by failover is that a standby database takes over the responsibilities of servicing applications immediately, perhaps even within a few seconds. The purest form of standby database architecture is as a more or less instant response backup, generally intended to maintain full service to end-users.
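A minimal sketch of that behavior follows, assuming changes are transferred to every standby as they are made and that the first live standby is the one promoted. The class names, the alive flag, and the promotion rule are all assumptions for the illustration; real engines detect failure and choose a standby in engine-specific ways.

```python
class Database:
    def __init__(self, site):
        self.site = site
        self.rows = {}
        self.alive = True

    def apply(self, key, value):
        self.rows[key] = value


class StandbyCluster:
    def __init__(self, primary, standbys):
        self.primary = primary
        self.standbys = list(standbys)

    def write(self, key, value):
        """Apply a change on the primary, then transfer it to every standby."""
        if not self.primary.alive:
            self.fail_over()
        self.primary.apply(key, value)
        for standby in self.standbys:
            standby.apply(key, value)

    def fail_over(self):
        """Promote the first live standby to become the new primary."""
        for standby in self.standbys:
            if standby.alive:
                self.standbys.remove(standby)
                self.primary = standby
                return
        raise RuntimeError("no standby database is available")


cluster = StandbyCluster(
    Database("San Jose"), [Database("New York"), Database("Orlando")]
)
cluster.write(1, "first change")
cluster.primary.alive = False        # simulate the loss of San Jose
cluster.write(2, "second change")    # serviced by the promoted standby
```

Because each standby already holds every transferred change, the promoted database can service the next request without any restore step.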
Some relational database engines allow standby databases to be used for more than just failover. Standby databases can sometimes be used as read-only, slightly behind, reporting databases. Some database engines even allow standby databases to be changeable, as long as structure and content from the primary database are not disturbed. In other words, a standby database could contain additional tables and data, on top of what is being sent from the primary database.
Typically, this scenario is used for more sophisticated reporting techniques, and possibly standby databases can even be utilized as a basis for a data warehouse database.
Replication
Database replication is a method used to duplicate (replicate) data from a primary or master database, out to a number of other copies of the master database. As you can see in Figure 13-7, the master database replicates (duplicates) changes made on the master, out to two slave databases in New York and Orlando. This is similar in nature to standby database architecture, except that replication is much more powerful and, unfortunately, more complicated to manage than standby database architecture. Typically, replication is used to distribute data across a wide area network (WAN) for a large organization.
Figure 13-7: Replication is often used for distributing large quantities of data
Tables and data can’t be altered at slave databases, except by changes passed from the master database. In the case of Figure 13-8, a master-to-master, rather than master-to-slave, configuration is adopted.
A master-to-slave relationship implies that changes can only be passed in one direction, from the master to the slave database; therefore, database changes are distributed from master to slave databases. Of course, being replication, slave databases might need to have changes made to them. However, changes made at slave databases can’t be replicated back to the master database.
Figure 13-8 shows just the opposite, where all relationships between all replicated (distributed) databases are master-to-master. A master-to-master replication environment implies that changes made to any database are distributed to all other databases in the replicated environment across the WAN. Master-to-master replication is much more complicated than master-to-slave replication.
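The difference in direction between the two configurations can be sketched as follows. This is a deliberately simplified model: the Node class and the two wiring functions are assumptions for the example, and the sketch ignores conflict resolution, which is a large part of what makes master-to-master replication complicated in practice.

```python
class Node:
    def __init__(self, site):
        self.site = site
        self.rows = {}
        self.peers = []  # the nodes this node passes its own changes to

    def local_change(self, key, value):
        """Make a change at this node, then pass it on to its peers."""
        self.rows[key] = value
        for peer in self.peers:
            peer.rows[key] = value


def master_to_slave(master, slaves):
    """Changes flow in one direction only, from the master outward."""
    master.peers = list(slaves)
    for slave in slaves:
        slave.peers = []


def master_to_master(nodes):
    """Every node passes its changes to every other node."""
    for node in nodes:
        node.peers = [other for other in nodes if other is not node]


san_jose, new_york, orlando = Node("San Jose"), Node("New York"), Node("Orlando")

master_to_slave(san_jose, [new_york, orlando])
san_jose.local_change(1, "made at the master")
new_york.local_change(2, "made at a slave")     # stays local to New York

master_to_master([san_jose, new_york, orlando])
orlando.local_change(3, "made at any master")   # reaches every database
```

In the master-to-slave wiring, the change made in New York never reaches San Jose; in the master-to-master wiring, a change made anywhere reaches every site.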
Figure 13-8: Replication can be both master-to-slave and master-to-master.
Grids and Computer Clustering
Computer grids are clusters of cheap computers, perhaps distributed on a global basis, connected using even something as loosely connected as the Internet. The Search for Extraterrestrial Intelligence (SETI) program, where processing is distributed to people’s personal home computers (processing when a screensaver is on the screen), is a perfect example of grid computing. Where RAID arrays cluster inexpensive disks, grids can be made of clusters of relatively inexpensive computers. Each computer acts as a portion of the processing and storage power of a large, grid-connected computer, appearing to end users as a single computational processing unit.
Clustering is a term used to describe a similar architecture to that of computer grids, but the computers are generally very expensive, and located within a single data center, for a single organization. The difference between grid computing and clustered computing is purely one of scale, one being massive and the other localized.
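The notion of many machines appearing as a single computational unit can be sketched in miniature. Here, worker threads stand in for separate computers, which is purely an assumption for illustration; the class name ComputePool and its total method are hypothetical, and a real grid or cluster would distribute work across a network.

```python
from concurrent.futures import ThreadPoolExecutor


class ComputePool:
    """Presents several workers to the caller as one computational unit."""

    def __init__(self, worker_count):
        self.worker_count = worker_count

    def total(self, numbers):
        """Sum the numbers; the caller never sees how the work is divided."""
        numbers = list(numbers)
        # Ceiling division gives each worker a roughly equal share.
        size = max(1, -(-len(numbers) // self.worker_count))
        chunks = [numbers[i:i + size] for i in range(0, len(numbers), size)]
        with ThreadPoolExecutor(max_workers=self.worker_count) as pool:
            return sum(pool.map(sum, chunks))
```

A caller writes ComputePool(4).total(values) in exactly the same way as ComputePool(1).total(values); the number of workers behind the call is invisible.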
Common to both grids and clusters is that computing resources (CPU and storage) are shared transparently. In other words, a developer writing programs to access a database does not even need to know that the computer for which code is being written is in reality a group of computers, built as either a grid or a cluster. Grid Internet-connected computers could be as much as five years old, which is geriatric for a computer, especially a personal computer. They might have all been purchased in a yard sale. If there are enough senior computers, and they are connected properly, the grid itself could contain enormous computing power.
Clustered architectures are used by companies to enhance the power of their databases. Grids, on the other hand, are often used to help processing for extremely large and complex problems that perhaps even a supercomputer might take too long to solve.
Replication is all about distribution of data to multiple sites, typically across a WAN. Standby is intentionally created as failover; however, in some database engines, standby database technology is now so sophisticated that it is very close in capability to that of even master-to-master replicated databases.
Summary
In this chapter, you learned about:
❑ Views and how to create them
❑ Sensible and completely inappropriate uses of views
❑ Materialized views and how to create them
❑ Nested materialized views and QUERY REWRITE
❑ Different types of indexes (including BTree indexes, bitmap indexes, and clustering)
❑ Auto counters and sequences
❑ Partitioning and parallel processing
❑ Creating list and range partitions
❑ Partitioning materialized views
❑ Hardware factors (including memory usage as applied to OLTP or data warehouse databases)
❑ RAID arrays for mirroring (recoverability) and striping (performance)
❑ Standby databases for recoverability and failover
❑ Replication of databases to cater to distribution of data
❑ Grid computing and clustering to harness as much computing power as possible
This chapter has moved somewhat beyond the realm of database modeling, examining specialized database objects, some brief facts about hardware resources, and finally some specialized database architectures.