Isolation actually becomes more complicated in practice, because one transaction may or may not actually see the data inserted, updated, or deleted by another transaction. This will be dealt with in detail in the section on isolation levels.
32.2.4 Durability
The database is stored on durable media, so that if the database program is destroyed, the database itself persists. Furthermore, the database can be restored to a consistent state when the database system is restored. Log files and backup procedures figure into this property, as well as disk writes done during processing.
This is all well and good if you have just one user accessing the database at a time. But one of the reasons you have a database system is that you also have multiple users who want to access it at the same time in their own sessions. This leads us to concurrency control.
32.3 Concurrency Control
Concurrency control is the part of transaction handling that deals with the way multiple users access the shared database without running into each other, like a traffic light system. One way to avoid any problems is to allow only one user in the database at a time. The only problem with that solution is that the other users are going to get lousy response time. Can you seriously imagine doing that with a bank teller machine system or an airline reservation system, where tens of thousands of users are waiting to get into the system at the same time?
32.3.1 The Five Phenomena
If all you do is execute queries against the database, then the ACID properties hold. The trouble occurs when two or more transactions want to change the database at the same time. In the SQL model, there are five ways that one transaction can affect another:

P0 (Dirty Write): Transaction T1 modifies a data item. Another transaction, T2, then further modifies that data item before T1 performs a COMMIT or ROLLBACK. If T1 or T2 then performs a ROLLBACK, it is unclear what the correct data value should be. The reason dirty writes are bad is that they can violate database consistency. Assume there is a constraint between x and y (e.g., x = y), and T1 and T2 each maintain the consistency of the constraint if run alone. However, the constraint can easily be violated if the two transactions write x and y in different orders, which can only happen if there are dirty writes.

P1 (Dirty Read): Transaction T1 modifies a row. Transaction T2 then reads that row before T1 performs a COMMIT WORK. If T1 then performs a ROLLBACK WORK, T2 will have read a row that was never committed, and that may thus be considered to have never existed.

P2 (Nonrepeatable Read): Transaction T1 reads a row. Transaction T2 then modifies or deletes that row and performs a COMMIT WORK. If T1 then attempts to reread the row, it may receive the modified value or discover that the row has been deleted.

P3 (Phantom): Transaction T1 reads the set of rows that satisfy some <search condition>. Transaction T2 then executes statements that generate one or more rows that satisfy the <search condition> used by transaction T1. If transaction T1 then repeats the initial read with the same <search condition>, it obtains a different collection of rows.

P4 (Lost Update): The lost update anomaly occurs when transaction T1 reads a data item, then T2 updates the data item (possibly based on a previous read), and then T1 (based on its earlier read value) updates the data item and COMMITs. T2's update is lost.
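To make P4 concrete, here is a minimal sketch of a lost update; the Accounts table, its columns, and the starting balance are hypothetical, and the interleaving of the two sessions is shown in the comments:

 -- Both sessions read the same starting balance of 100.
 -- T1: SELECT balance FROM Accounts WHERE acct_nbr = 123;  -- sees 100
 -- T2: SELECT balance FROM Accounts WHERE acct_nbr = 123;  -- sees 100
 -- T2: UPDATE Accounts SET balance = 100 + 50 WHERE acct_nbr = 123;
 -- T2: COMMIT WORK;                                    -- balance = 150
 -- T1: UPDATE Accounts SET balance = 100 - 30 WHERE acct_nbr = 123;
 -- T1: COMMIT WORK;                                    -- balance = 70
 -- T2's deposit of 50 has vanished; the correct final balance is 120.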
These phenomena are not always bad things. If the database is being used only for queries, without any changes being made during the workday, then none of these problems will occur. The database system will run much faster if you do not have to try to protect yourself from them. They are also acceptable when changes are being made under certain circumstances.

Imagine that I have a table of all the cars in the world. I want to execute a query to find the average age of drivers of red sport cars. This query will take some time to run, and during that time, cars will be crashed, bought and sold, new cars will be built, and so forth. But I accept a situation with the five phenomena listed above, because the average age of the information will not change that much from the time I start the query to the time it finishes. Changes after the second decimal place really don't matter.
However, you don't want any of these phenomena to occur in a database where the husband makes a deposit to a joint account and his wife makes a withdrawal. This leads us to the transaction isolation levels. The original ANSI model included only P1, P2, and P3. The other definitions first appeared in Microsoft Research Technical Report MSR-TR-95-51, "A Critique of ANSI SQL Isolation Levels," by Hal Berenson, Phil Bernstein, Jim Gray, Jim Melton, Elizabeth O'Neil, and Patrick O'Neil (1995).
32.3.2 The Isolation Levels
In standard SQL, the user gets to set the isolation level of the transactions in his session. The isolation level avoids some of the phenomena we just talked about and gives other information to the database. The syntax for the <set transaction statement> is as follows:
SET TRANSACTION <transaction mode list>
<transaction mode> ::=
<isolation level>
| <transaction access mode>
| <diagnostics size>
<diagnostics size> ::= DIAGNOSTICS SIZE <number of conditions>
<transaction access mode> ::= READ ONLY | READ WRITE
<isolation level> ::= ISOLATION LEVEL <level of isolation>
<level of isolation> ::=
READ UNCOMMITTED | READ COMMITTED | REPEATABLE READ | SERIALIZABLE
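
As a usage sketch, a session that only runs reports might issue something like this (the DIAGNOSTICS SIZE value is arbitrary):

 SET TRANSACTION READ ONLY,
     ISOLATION LEVEL READ COMMITTED,
     DIAGNOSTICS SIZE 5;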
The optional <diagnostics size> clause tells the database to set up a list for error messages of a given size. This is a Standard SQL feature, so you might not have it in your particular product. The reason is that a single statement can have several errors in it, and the engine is supposed to find them all and report them in the diagnostics area via a GET DIAGNOSTICS statement in the host program.
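For example, a host program might inspect the diagnostics area like this; the host variable names are hypothetical:

 GET DIAGNOSTICS :err_count = NUMBER;       -- how many conditions were raised
 GET DIAGNOSTICS EXCEPTION 1                -- details of the first condition
     :sql_state = RETURNED_SQLSTATE,
     :msg_text  = MESSAGE_TEXT;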
The <transaction access mode> clause explains itself. The READ ONLY option means that this is a query and lets the SQL engine know
that it can relax a bit. The READ WRITE option lets the SQL engine know that rows might be changed, and that it has to watch out for the five phenomena.
The important clause, which is implemented in most current SQL products, is the <isolation level> clause. The isolation level of a transaction defines the degree to which the operations of one transaction are affected by concurrent transactions. The isolation level of a transaction is SERIALIZABLE by default, but the user can explicitly set it in the <set transaction statement>.

The isolation levels each guarantee that each transaction will be executed completely or not at all, and that no updates will be lost. When the SQL engine detects the inability to guarantee the serializability of two or more concurrent transactions, or detects unrecoverable errors, it may initiate a ROLLBACK WORK statement on its own.
Let's take a look at a table (Table 32.1) of the isolation levels and the initial three phenomena (P1, P2, and P3). A "Yes" means that the phenomenon is possible under that isolation level:
Table 32.1 Isolation Levels and the Initial Three Phenomena

 Isolation Level     P1    P2    P3
 ===================================
 SERIALIZABLE        No    No    No
 REPEATABLE READ     No    No    Yes
 READ COMMITTED      No    Yes   Yes
 READ UNCOMMITTED    Yes   Yes   Yes
The SERIALIZABLE isolation level is guaranteed to produce the same results that the concurrent transactions would have had if they had been done in some serial order. A serial execution is one in which each transaction executes to completion before the next transaction begins. The users act as if they are standing in a line waiting to get complete access to the database.

The REPEATABLE READ isolation level is guaranteed to maintain the same image of the database to the user during his session.

The READ COMMITTED isolation level will let transactions in this session see rows that other transactions commit while this session is running.

The READ UNCOMMITTED isolation level will let transactions in this session see rows that other transactions create without necessarily committing while this session is running.
Regardless of the isolation level of the transaction, phenomena P1, P2, and P3 shall not occur during the implied reading of schema definitions performed on behalf of executing a statement, the checking of integrity constraints, and the execution of referential actions associated with referential constraints. We do not want the schema itself changing on users.
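As a sketch of how a weaker level shows up in practice, here is phenomenon P3 (the phantom) slipping through REPEATABLE READ; the Orders table and its columns are hypothetical, and whether the phantom actually appears depends on how your product implements the level:

 SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
 SELECT COUNT(*) FROM Orders
  WHERE order_amt > 100.00;    -- T1 sees n rows
 -- Meanwhile another session inserts a qualifying row and COMMITs:
 -- INSERT INTO Orders VALUES (..., 500.00); COMMIT WORK;
 SELECT COUNT(*) FROM Orders
  WHERE order_amt > 100.00;    -- T1 may now see (n + 1) rows: a phantom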
32.3.3 CURSOR STABILITY Isolation Level
The CURSOR STABILITY isolation level extends READ COMMITTED locking behavior for SQL cursors by adding a new read action for FETCH from a cursor and requiring that a lock be held on the current item of the cursor. The lock is held until the cursor moves or is closed, possibly by a commit. Naturally, the fetching transaction can update the row, and in that case a write lock will be held on the row until the transaction COMMITs, even after the cursor moves on with a subsequent FETCH. This makes CURSOR STABILITY stronger than READ COMMITTED and weaker than REPEATABLE READ.

CURSOR STABILITY is widely implemented by SQL products to prevent lost updates for rows read via a cursor. READ COMMITTED, in some systems, is actually the stronger CURSOR STABILITY. The ANSI standard allows this.
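A minimal sketch of the pattern CURSOR STABILITY protects, in embedded SQL style; the Personnel table, its columns, and the host variables are hypothetical:

 DECLARE PayRaise CURSOR FOR
  SELECT emp_nbr, salary
    FROM Personnel
     FOR UPDATE OF salary;

 OPEN PayRaise;
 FETCH PayRaise INTO :emp_nbr, :salary;  -- read lock held on this row

 UPDATE Personnel                        -- upgrades to a write lock,
    SET salary = salary * 1.05           -- held until the COMMIT
  WHERE CURRENT OF PayRaise;

 CLOSE PayRaise;
 COMMIT WORK;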
The SQL standards do not say how you are to achieve these results. However, there are two basic classes of concurrency control methods: optimistic and pessimistic. Within those two classes, each vendor will have his own implementation.
32.4 Pessimistic Concurrency Control
Pessimistic concurrency control is based on the idea that transactions are expected to conflict with each other, so we need to design a system to avoid the problems before they start.

All pessimistic concurrency control schemes use locks. A lock is a flag placed in the database that gives exclusive access to a schema object to one user. Imagine an airplane toilet door, with its "occupied" sign.

The differences among the schemes are the level of locking they use; setting those flags on and off costs time and resources. If you lock the whole database, then you will have, in effect, a serial batch processing system, since only one transaction at a time is active. In practice, you would do this only for
system maintenance work on the whole database. If you lock at the table level, performance can suffer because users must wait for the most common tables to become available. However, there are transactions that do involve the whole table, and this lock level will use only one flag.

If you lock the table at the row level, then other users can get to the rest of the table and you will have the best possible shared access. You will also have a huge number of flags to process, and performance will suffer. This approach is generally not practical.
Page locking is in between table and row locking. This approach puts a lock on subsets of rows within the table that include the desired values. The name comes from the fact that this lock level is usually implemented with pages of physical disk storage. Performance depends on the statistical distribution of data in physical storage, but it is generally a good compromise.
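Explicit locking is not part of standard SQL, but many products expose it with a statement along these lines; the exact syntax and lock mode names vary by vendor, and the Personnel table is hypothetical:

 -- Take the whole table pessimistically before a batch update
 -- (product-specific; shown in a common dialect form):
 LOCK TABLE Personnel IN EXCLUSIVE MODE;
 UPDATE Personnel SET salary = salary * 1.05;
 COMMIT WORK;   -- the lock is released at the end of the transaction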
32.5 SNAPSHOT Isolation: Optimistic Concurrency
Optimistic concurrency control is based on the idea that transactions are not very likely to conflict with each other, so we need to design a system to handle the problems as exceptions after they actually occur.
In Snapshot Isolation, each transaction reads data from a snapshot of the (committed) data as of the time the transaction started, called its start_timestamp. This time may be any time before the transaction's first read. A transaction running in Snapshot Isolation is never blocked attempting a read, because it is working on its private copy of the data. But this means that at any time, each data item might have multiple versions, created by active and committed transactions.
When the transaction T1 is ready to commit, it gets a commit_timestamp, which is later than any existing start_timestamp or commit_timestamp. The transaction successfully COMMITs only if no other transaction T2 with a commit_timestamp in T1's execution interval [start_timestamp, commit_timestamp] wrote data that T1 also wrote. Otherwise, T1 will ROLLBACK. This "first committer wins" strategy prevents lost updates (phenomenon P4). When T1 COMMITs, its changes become visible to all transactions whose start_timestamps are larger than T1's commit_timestamp.
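A small worked trace of the rule, with made-up timestamp values:

 -- T1: start_timestamp = 100, reads x from its snapshot
 -- T2: start_timestamp = 101, reads x from its snapshot
 -- T2: writes x, gets commit_timestamp = 102, COMMITs first
 -- T1: writes x, asks for commit_timestamp = 103
 --     T2 committed inside T1's interval [100, 103] and wrote x,
 --     which T1 also wrote, so T1 must ROLLBACK: first committer wins.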
Snapshot Isolation is nonserializable because a transaction's reads come at one instant and the writes at another. We assume we have several transactions working on the same data and a constraint that (x + y) should be positive. Each transaction that writes a new value for x and y is expected to maintain the constraint. While T1 and T2 both act properly in isolation, the constraint fails to hold when you put them together. The possible problems are:

A5 (Data Item Constraint Violation): Suppose C() is a database constraint between two data items, x and y, in the database. Here are two anomalies arising from constraint violation:

A5A (Read Skew): Suppose transaction T1 reads x, and then a second transaction T2 updates x and y to new values and COMMITs. Now, if T1 reads y, it may see an inconsistent state, and therefore produce an inconsistent state as output. A nonrepeatable read (P2) is a degenerate form of Read Skew, where x = y. More typically, a transaction reads two different but related items (e.g., referential integrity).

A5B (Write Skew): Suppose T1 reads x and y, which are consistent with constraint C, and then a T2 reads x and y, writes x, and COMMITs. Then T1 writes y. If there were a constraint between x and y, it might be violated. As an example, consider a constraint at a bank, where account balances are allowed to go negative as long as the sum of commonly held balances remains nonnegative, with an anomaly arising as in history H5 of the Berenson et al. paper.
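Here is a minimal sketch of A5B under Snapshot Isolation, using a hypothetical Accounts table and the bank rule that the two balances must sum to a nonnegative value:

 -- Constraint: x + y >= 0. Initially x = 70 and y = 80.
 -- T1: reads x = 70, y = 80 from its snapshot
 -- T2: reads x = 70, y = 80 from its snapshot
 -- T1: UPDATE Accounts SET balance = balance - 100
 --      WHERE acct_nbr = 'x';   -- x becomes -30; sum looks fine to T1
 -- T2: UPDATE Accounts SET balance = balance - 100
 --      WHERE acct_nbr = 'y';   -- y becomes -20; sum looks fine to T2
 -- T1: COMMIT WORK;  T2: COMMIT WORK;
 -- The write sets are disjoint, so "first committer wins" stops neither,
 -- and the final state has x + y = -50, violating the constraint.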
Clearly, neither A5A nor A5B could arise in histories where P2 is precluded, since both A5A and A5B have T2 write a data item that has been previously read by an uncommitted T1. Thus, phenomena A5A and A5B are only useful for distinguishing isolation levels below REPEATABLE READ in strength.

The ANSI SQL definition of REPEATABLE READ, in its strictest interpretation, captures a degenerate form of row constraints, but misses the general concept. To be specific, locking REPEATABLE READ on Table 2 provides protection from row constraint violations, but the ANSI SQL definition of Table 1, forbidding anomalies A1 and A2, does not. Snapshot Isolation, however, is surprisingly strong, even stronger than READ COMMITTED.
This approach predates databases by decades. It was implemented manually in the central records department of companies when they started storing data on microfilm. You do not get the actual microfilm; instead, they make a timestamped photocopy for you. You take the copy to your desk, mark it up, and return it to the central records department. The Central Records clerk timestamps your updated document, photographs it, and adds it to the end of the roll of microfilm.
But what if user number two also went to the central records department and got a timestamped photocopy of the same document? The Central Records clerk has to look at both timestamps and make a decision. If the first user attempts to put his updates into the database while the second user is still working on his copy, then the clerk has to either hold the first copy and wait for the second copy to show up, or return the copy to the first user. When both copies are in hand, the clerk stacks the copies on top of each other, holds them up to the light, and looks to see if there are any conflicts. If both updates can be made to the database, he does so. If there are conflicts, he must either have rules for resolving the problems or he has to reject both transactions. This is a kind of row-level locking, done after the fact.
32.6 Logical Concurrency Control
Logical concurrency control is based on the idea that the machine can analyze the predicates in the queue of waiting queries and processes on a purely logical level, and then determine which of the statements can be allowed to operate on the database at the same time.
Clearly, all SELECT statements can operate at the same time, since they do not change the data. After that, it is tricky to determine which statements conflict with each other. For example, one pair of UPDATE statements on two separate tables might be allowed only in a certain order because of PRIMARY KEY and FOREIGN KEY constraints. Another pair of UPDATE statements on the same tables might be disallowed because they modify the same rows and leave different final states in them. However, a third pair of UPDATE statements on the same tables might be allowed because they modify different rows and have no conflicts with each other, as sketched below.
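A sketch of the last two cases, with a hypothetical Personnel table:

 -- These two UPDATEs touch disjoint rows, so a logical analyzer
 -- could let them run at the same time:
 UPDATE Personnel SET salary = salary * 1.10 WHERE dept_name = 'sales';
 UPDATE Personnel SET salary = salary * 1.05 WHERE dept_name = 'admin';

 -- These two modify the same rows, and the final salary depends on
 -- which one runs first, so they must be serialized:
 UPDATE Personnel SET salary = salary + 100.00 WHERE dept_name = 'sales';
 UPDATE Personnel SET salary = salary * 2.00 WHERE dept_name = 'sales';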
There is also the problem of having statements waiting too long in the queue to be executed. This is a version of livelock, which we discuss in the next section. The usual solution is to assign a priority number to each waiting transaction and then decrement that priority number when it has been waiting for a certain length of time. Eventually, every transaction will arrive at priority one and be able to go ahead of any other transaction.

This approach also allows you to enter transactions at a higher priority than the transactions in the queue. While it is possible to create a livelock this way, it is not a problem, and it lets you bump less important jobs in favor of more important jobs, such as payroll checks.
32.7 Deadlock and Livelocks
It is possible for a user to fail to complete a transaction for reasons other than the hardware failing. A deadlock is a situation where two or more users hold resources that the others need, and neither party will surrender the objects to which they have locks. To make this more concrete, imagine that both user A and user B need tables X and Y. User A gets a lock on table X, and user B gets a lock on table Y. They both sit and wait for their missing resource to become available; it never happens. The common solution for a deadlock is for the DBA to kill one or more of the sessions involved and roll back their work.
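A minimal sketch of the classic interleaving; TableX, TableY, and their columns are hypothetical:

 -- Session A: UPDATE TableX SET col_x = 1;   -- A locks TableX
 -- Session B: UPDATE TableY SET col_y = 1;   -- B locks TableY
 -- Session A: UPDATE TableY SET col_y = 2;   -- A waits on B's lock
 -- Session B: UPDATE TableX SET col_x = 2;   -- B waits on A's lock:
 --                                           -- deadlock; neither can proceed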
In a livelock, a user is waiting for a resource, but never gets it because other users keep grabbing it before he gets a chance. None of the other users holds onto the resource permanently, as in a deadlock, but as a group they never free it. To make this more concrete, imagine that user A needs all of table X. But a hundred other users are always updating table X, so that user A cannot find a page in the table without a lock on it. He sits and waits for all the pages to become available; it never happens in time.
The DBA can, again, kill one or more of the sessions involved and roll back their work. In some systems, he can raise the priority of the livelocked session so that it can seize the resources as they become available.

None of this is trivial, and each database system will have its own version of transaction processing and concurrency control. This should not be of great concern to the applications programmer, but should be the responsibility of the DBA. But it is nice to know what happens under the covers.
CHAPTER 33
Optimizing SQL
THERE IS NO SET of rules for writing code that will take the best advantage of every query optimizer on every SQL product. The query optimizers depend on the underlying architecture and are simply too different for universal rules; however, we can make some general statements. Just remember that you have to test code. What would improve performance in one SQL implementation might have no effect in another, or make the performance worse.
There are two kinds of optimizers: cost-based and rule-based. A rule-based optimizer (such as Oracle before version 7.0) looks at the syntax of the query and plans how to execute the query without considering the size of the tables or the statistical distribution of the data. It will parse a query and execute it in the order in which it was written, perhaps doing some reorganization of the query into an equivalent form using some syntax rules. Basically, it is no optimizer at all.
A cost-based optimizer looks at both the query and the statistical data about the database itself before deciding the best way to execute the query. These decisions involve whether to use indexes, whether to use hashing, which tables to bring into main storage, what sorting technique to use, and so forth. Most of the time (but not all!), it will make better decisions than a human programmer would have, simply because it has more information.