SQL Server MVP Deep Dives - P3

Second step: finding two-attribute dependencies

After following the preceding steps, I can now be sure that I've found all the cases where an attribute depends on one of the other attributes. But there can also be attributes that depend on two, or even more, attributes. In fact, I hope there are, because I'm still left with a few attributes that don't depend on any other attribute. If you ever run into this, it's a sure sign of one or more missing attributes on your shortlist—one of the hardest problems to overcome in data modeling.

The method for finding multiattribute dependencies is the same as that for single-attribute dependencies—for every possible combination, create a sample with two rows that duplicate the columns to test and don't duplicate any other column. If at this point I hadn't found any dependency yet, I'd be facing an awful lot of combinations to test. Fortunately, I've already found some dependencies (which you'll find is almost always the case if you start using this method for your modeling), so I can rule out most of these combinations.

Table 6 Testing functional dependencies for Qty
Table 7 Testing functional dependencies for TotalPrice
Table 8 Testing functional dependencies for OrderTotal
(column headings in each table: OrderNo, CustomerID, Product, Qty, TotalPrice, OrderTotal)
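The two-row construction described above can be sketched in outline, here in Python for illustration. The column names mirror the example table; the value generator is purely hypothetical:

```python
from itertools import count

# Build a two-row test sample for a dependency test: the columns under
# test get identical values in both rows, while every other column gets
# a distinct value per row. If the domain expert accepts both rows as
# valid together, the other columns do NOT depend on the tested set.
def make_test_sample(columns, tested, seed=1):
    gen = count(seed)
    row1, row2 = {}, {}
    for col in columns:
        if col in tested:
            row1[col] = row2[col] = next(gen)   # duplicated across rows
        else:
            row1[col] = next(gen)               # unique per row
            row2[col] = next(gen)
    return row1, row2

columns = ["OrderNo", "CustomerID", "Product", "Qty", "TotalPrice", "OrderTotal"]
r1, r2 = make_test_sample(columns, tested={"OrderNo", "Product"})
assert r1["OrderNo"] == r2["OrderNo"] and r1["Product"] == r2["Product"]
assert r1["Qty"] != r2["Qty"] and r1["TotalPrice"] != r2["TotalPrice"]
```

In practice the generated placeholder values would of course be replaced by realistic sample data (and made to observe any business rules already discovered) before showing them to the domain expert.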


Modeling the sales order

At this point, if you haven't already done so, you should remove attributes that don't depend on the candidate key or that transitively depend on the primary key. You'll have noticed that I already did so. Not moving these attributes to their own tables now will make this step unnecessarily complex.

The key to reducing the number of possible combinations is to observe that at this point, you can only have three kinds of attributes in the table: a single-attribute candidate key (or more in the case of a mutual dependency), one or more attributes that depend on the candidate key, and one or more attributes that don't depend on the candidate key, or on any other attribute (as we tested all single-attribute dependencies). Because we already moved attributes that depend on an attribute other than the candidate key, these are the only three kinds of attributes we have to deal with. And that means that there are six possible kinds of combinations to consider: a candidate key and a dependent attribute; a candidate key and an independent attribute; a dependent attribute and an independent attribute; two independent attributes; two dependent attributes; or two candidate keys. Because alternate keys always have a mutual dependency, the last category is a special case of the one before it, so I won't cover it explicitly. Each of the remaining five possibilities is covered below.
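Counting those combination kinds for the example table makes the bookkeeping concrete. The sketch below (Python, for illustration) classifies every two-attribute pair by the kinds of its members; the kind assigned to each attribute mirrors the state of the example Orders table at this stage:

```python
from collections import Counter
from itertools import combinations

# Classify every two-attribute combination by the kinds of its members:
# 'key' (candidate key), 'dep' (depends on the key), 'ind' (independent).
kinds = {
    "OrderNo": "key",
    "CustomerID": "dep",
    "OrderTotal": "dep",
    "Product": "ind",
    "Qty": "ind",
    "TotalPrice": "ind",
}

def combo_kind(a, b):
    """Order-insensitive pair of attribute kinds."""
    return tuple(sorted((kinds[a], kinds[b])))

tally = Counter(combo_kind(a, b) for a, b in combinations(kinds, 2))

# The key+dependent pairs need no testing at all; the other kinds do.
assert tally[("dep", "key")] == 2
assert sum(tally.values()) == 15
```

Of the fifteen two-attribute pairs, only the two key+dependent pairs can be skipped outright; the remaining kinds are handled as described in the following subsections.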

CANDIDATE KEY AND DEPENDENT ATTRIBUTE

This combination (as well as the combination of two candidate keys, as I already mentioned) can be omitted completely. I won't bother you with the mathematical proof, but instead will try to explain in language intended for mere mortals.

Given three attributes (A, B, and C), if there's a dependency from the combination of A and B to C, that would imply that for each possible combination of values for A and B, there can be at most one value of C. But if there's also a dependency of A to B, this means that for every value of A, there can be at most one value of B—in other words, there can be only one combination of A and B for every value of A; hence there can be only one value of C for every value of A. So it naturally follows that if B depends on A, then every attribute that depends on A will also depend on the combination of A and B, and every attribute that doesn't depend on A can't depend on the combination of A and B.

CANDIDATE KEY AND INDEPENDENT ATTRIBUTE

For this combination, some testing is required. In fact, I'll test this combination first, because it's the most common—and the sooner I find extra dependencies, the sooner I can start removing attributes from the table, cutting down on the number of other combinations to test.

But, as before, it's not required to test all other attributes for dependency on a given combination of a candidate key and an independent attribute. Every attribute that depends on the candidate key will also appear to depend on any combination of the candidate key with any other attribute. This isn't a real dependency, so there's no need to test for it, or to conclude the existence of such a dependency.

This means that in my example, I need to test the combinations of OrderNo and Product, OrderNo and Qty, and OrderNo and TotalPrice. And when testing the first combination (OrderNo and Product), I can omit the attributes CustomerID and OrderTotal, but I do need to test whether Qty or TotalPrice depend on the combination of OrderNo and Product, as shown in table 9. (Also note how in this case I was able to observe the previously discovered business rule that TotalPrice = Qty x Price—even though Price is no longer included in the table, it is still part of the total collection of data, and still included in the domain expert's familiar notation.)
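Generating such a sample while still observing the TotalPrice = Qty x Price rule can be sketched as follows (Python, for illustration; the product name, order number, and unit price are hypothetical, not the chapter's actual sample values):

```python
# A sketch of a table-9-style test sample: the tested columns (OrderNo,
# Product) are duplicated across both rows, Qty differs, and TotalPrice
# is derived so the sample still observes the business rule
# TotalPrice = Qty x Price, even though Price itself is not a column.
unit_price = {"Widget": 12.50}   # hypothetical unit price

rows = [
    {"OrderNo": 7001, "Product": "Widget", "Qty": 10},
    {"OrderNo": 7001, "Product": "Widget", "Qty": 12},
]
for row in rows:
    # derive TotalPrice from the rule rather than picking it freely
    row["TotalPrice"] = row["Qty"] * unit_price[row["Product"]]

# Both rows agree on the tested combination but disagree on Qty and
# TotalPrice; rejection by the domain expert proves the dependency.
assert all(r["OrderNo"] == 7001 and r["Product"] == "Widget" for r in rows)
assert rows[0]["TotalPrice"] != rows[1]["TotalPrice"]
```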

The domain expert rejected the sample order confirmation I based on this data. As the reason for this rejection, she told me that obviously, the orders for 10 and 12 units of Gizmo should've been combined on a single line, as an order for 22 units of Gizmo, at a total price of $375.00. This proves that Qty and TotalPrice both depend on the combination of OrderNo and Product. Second normal form requires me to create a new table with the attributes OrderNo and Product as key attributes, and Qty and TotalPrice as dependent attributes. I'll have to continue testing in this new table for two-attribute dependencies for all remaining combinations of two attributes, but I don't have to repeat the single-attribute dependencies, because they've already been tested before the attributes were moved to their own table. For the orders table, I now have only the OrderNo, CustomerID, and OrderTotal as remaining attributes.

TWO DEPENDENT ATTRIBUTES

This is another combination that should be included in the tests. Just as with a single dependent attribute, you'll have to test the key attribute (which will be dependent on the combination in case of a mutual dependency, in which case the combination is an alternate key) and the other dependent attributes (which will be dependent on the combination in case of a transitive dependency).

In the case of my sample Orders table, I only have two dependent attributes left (CustomerID and OrderTotal), so there's only one combination to test. And the only other attribute is OrderNo, the key. So I create the test population of table 10 to check for a possible alternate key.

The domain expert saw no reason to reject this example (after I populated the related tables with data that observes all rules discovered so far), so there's obviously no dependency from CustomerID and OrderTotal to OrderNo.

Table 9 Testing functional dependencies for the combination of OrderNo and Product
(column headings: OrderNo, CustomerID, Product, Qty, TotalPrice, OrderTotal)


TWO INDEPENDENT ATTRIBUTES

Because the Orders table used in my example has no independent columns anymore, I can obviously skip this combination. But if there still were two or more independent columns left, then I'd have to test each combination for a possible dependency of a candidate key or any other independent attribute upon this combination.

DEPENDENT AND INDEPENDENT ATTRIBUTES

This last possible combination is probably the least common—but there are cases where an attribute turns out to depend on a combination of a dependent and an independent attribute. Attributes that depend on the key attribute can't also depend on a combination of a dependent and an independent column (see the sidebar a few pages back for an explanation), so only candidate keys and other independent attributes need to be tested.

Further steps: three-and-more-attribute dependencies

It won’t come as a surprise that you’ll also have to test for dependencies on three ormore attributes But these are increasingly rare as the number of attributes increases,

so you should make a trade-off between the amount of work involved in testing all sible combinations on one hand, and the risk of missing a dependency on the other.The amount of work involved is often fairly limited, because in the previous stepsyou’ll often already have changed the model from a single many-attribute relation to acollection of relations with only a limited number of attributes each, and hence with alimited number of possible three-or-more-attribute combinations
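The effect of that earlier splitting on the test burden is easy to quantify: the number of k-attribute combinations in an n-attribute relation is C(n, k). A quick sketch (Python, with illustrative attribute counts) shows how much the search space shrinks:

```python
from math import comb

# The number of 3-attribute combinations in an n-attribute relation is
# C(n, 3). Splitting one wide relation into several narrow ones (as the
# earlier normalization steps tend to do) shrinks the search space.
wide = comb(6, 3)                  # one 6-attribute relation: 20 triples
narrow = comb(4, 3) + comb(3, 3)   # split into 4- and 3-attribute tables: 5
print(wide, narrow)  # 20 5
```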

For space reasons, I can't cover all possible combinations of three or more attributes here. But the same logic applies as for the two-attribute dependencies, so if you decide to go ahead and test all combinations you should be able to figure out for yourself which combinations to test and which to skip.

What if I have some independent attributes left?

At the end of the procedure, you shouldn't have any independent attributes left—except when the original collection of attributes was incomplete. Let's for instance consider the order confirmation form used earlier—but this time, there may be multiple products with the same product name but a different product ID. In this case, unless we add the product ID to the table before starting the procedure, we'll end up with the attributes Product, Qty, and Price as completely independent columns in the final result (go ahead, try it for yourself—it's a great exercise!).

So if you ever happen to finish the procedure with one or more independent columns left, you'll know that either you or the domain expert made a mistake when producing and assessing the collections of test sample data, or you've failed to identify at least one of the candidate key attributes.


I’ve shown you a method to find all functional dependencies between attributes Ifyou’ve just read this chapter, or if you’ve already tried the method once or twice, itmay seem like a lot of work for little gain But once you get used to it, you’ll find thatthis is very useful, and that the amount of work is less than it appears at first sight For starters, in a real situation, many dependencies will be immediately obvious ifyou know a bit about the subject matter, and it’ll be equally obvious that there are nodependencies between many attributes There’s no need to verify those with thedomain expert (Though you should keep in mind that some companies may have aspecific situation that deviates from the ordinary.)

Second, you’ll find that if you start by testing the dependencies you suspect to bethere, you’ll quickly be able to divide the data over multiple relations with relativelyfew attributes each, thereby limiting the number of combinations to be tested And finally, by cleverly combining multiple tests into a single example, you canlimit the number of examples you have to run by the domain expert This may notreduce the amount of work you have to do, but it does reduce the number of exam-ples your domain expert has to assess—and she’ll love you for it!

As a bonus, this method can be used to develop sample data for unit testing, which can improve the quality of the database schema and stored procedures.

A final note of warning—there are some situations where, depending on the order you choose to do your tests, you might miss a dependency. You can find them too, but they're beyond the scope of this chapter. Fortunately this will only happen in cases where rare combinations of dependencies between attributes exist, so it's probably best not to worry too much about it.

About the author

Hugo is cofounder and R&D lead of perFact BV, a Dutch company that strives to improve analysis methods and to develop computer-aided tools that will generate completely functional applications from the analysis deliverable. The chosen platform for this development is SQL Server.

In his spare time, Hugo likes to share and enhance his knowledge of SQL Server by frequenting newsgroups and forums, reading and writing books and blogs, and attending and speaking at conferences.


PART 2

Database Development

Edited by Adam Machanic

It can be argued that database development, as an engineering discipline, was born along with the relational model in 1970. It has been almost 40 years (as I write these words), yet the field continues to grow and evolve—seemingly at a faster rate every year. This tremendous growth can easily be seen in the many facets of the Microsoft database platform. SQL Server is no longer just a simple SQL database system; it has become an application platform, a vehicle for the creation of complex and multifaceted data solutions.

Today’s database developer is expected to understand not only the

Transact-SQL dialect spoken by SQL Server, but also the intricacies of the many nents that must be controlled in order to make the database system do their bid-ding This variety can be seen in the many topics discussed in the pages ahead:indexing, full-text search, SQLCLR integration, XML, external interfaces such as

compo-ADO.NET, and even mobile device development are all subjects within the realm

of database development

The sheer volume of knowledge both required and available for consumption can seem daunting, and giving up is not an option. The most important thing we can do is understand that while no one can know everything, we can strive to continually learn and enhance our skill sets, and that is where this book comes in. The chapters in this section—as well as those in the rest of the book—were written by some of the top minds in the SQL Server world, and whether you're just beginning your journey into the world of database development or have several years of experience, you will undoubtedly learn something new from these experts.

It has been a pleasure to work with this amazing group of writers, and I sincerely hope that you will thoroughly enjoy the results of our labor. I wish you the best of luck in all of your database development endeavors. Here's to the next 40 years.

About the editor

Adam Machanic is a Boston-based independent database consultant, writer, and speaker. He has written for numerous websites and magazines, including SQLblog, Simple Talk, SearchSQLServer, SQL Server Professional, CODE, and VSJ. He has also contributed to several books on SQL Server, including SQL Server 2008 Internals (Microsoft Press, 2009) and Expert SQL Server 2005 Development (Apress, 2007). Adam regularly speaks at user groups, community events, and conferences on a variety of SQL Server and .NET-related topics. He is a Microsoft Most Valuable Professional (MVP) for SQL Server, a Microsoft Certified IT Professional (MCITP), and a member of the INETA North American Speakers Bureau.


Those impressions are both wrong.

Iterative code isn’t always bad (though, in all honesty, it usually is) And there’s

more to SQL Server than declarative or iterative—there are ways to combine them,adding their strengths and avoiding their weaknesses This article is about one such

method: set-based iteration.

The technique of set-based iteration can lead to efficient solutions for problems that don't lend themselves to declarative solutions, because those would result in an amount of work that grows exponentially with the amount of data. In those cases, the trick is to find a declarative query that solves a part of the problem (as much as feasible), and that doesn't have the exponential performance problem—then repeat that query until all work has been done. So instead of attempting a single set-based leap, or taking millions of single-row-sized miniature steps in a cursor, set-based iteration arrives at the destination by taking a few seven-mile leaps.

In this chapter, I’ll first explain the need for an extra alternative by discussingthe weaknesses and limitations of purely iterative and purely declarative coding I’llthen explain the technique of set-based iteration by presenting two examples: first

a fairly simple one, and then a more advanced case

The common methods and their shortcomings

Developing SQL Server code can be challenging. You have so many ways to achieve the same result that the challenge isn't coming up with working code, but picking the "best" working code from a bunch of alternatives. So what's the point of adding yet another technique, other than making an already tough choice even harder?


The answer is that there are cases (admittedly, not many) where none of the existing options yield acceptable performance, and set-based iteration does.

Declarative (set-based) code

Declarative coding is, without any doubt, the most-used way to manipulate data in SQL Server. And for good reason, because in most cases it's the fastest possible code.

The basic principle of declarative code is that you don't tell the computer how to process the data in order to create the required results, but instead declare the results you want and leave it to the DBMS to figure out how to get those results. Declarative code is also called set-based code because the declared required results aren't based on individual rows of data, but on the entire set of data.

For example, if you need to find out which employees earn more than their manager, the declarative answer would involve one single query, specifying all the tables that hold the source data in its FROM clause, all the required output columns in its SELECT clause, and using a WHERE clause to filter out only those employees that meet the salary requirement.

BENEFITS

The main benefit of declarative coding is its raw performance. For one thing, SQL Server has been heavily optimized toward processing declarative code. But also, the query optimizer—the SQL Server component that selects how to process each query—can use all the elements in your database (including indexes, constraints, and statistics on data distribution) to find the most efficient way to process your request, and even adapt the execution plan when indexes are added or statistics indicate a major change in data distribution.

Another benefit is that declarative code is often much shorter and (once you get the hang of it) easier to read and maintain than iterative code. Shorter, easier-to-read code directly translates into a reduction of development cost, and an even larger reduction of future maintenance cost.

DRAWBACKS

Aside from the learning curve for people with a background in iterative coding, there's only one problem with the set-based approach. Because you have to declare the results in terms of the original input, you can't take shortcuts by specifying end results in terms of intermediate results. In some cases, this results in queries that are awkward to write and hard to read. In other cases, it may result in queries that force SQL Server to do more work than would otherwise be required.

Running totals is an example of this. There's no way to tell SQL Server to calculate the running total of each row as the total of the previous row plus the value of the current row, because the running total of the previous row isn't available in the input, and partial query results (even though SQL Server does know them) can't be specified in the language.

The only way to calculate running totals in a set-based fashion is to specify each running total as the sum of the values in all preceding rows. That implies that a lot more summation is done than would be required if intermediate results were available. This results in performance that degrades exponentially with the amount of data, so even if you have no problems in your test environment, you will have problems in your 100-million-row production database!
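The difference between the two definitions is easy to see outside of SQL. In the sketch below (Python, for illustration), both produce identical results, but the set-based definition re-sums the whole prefix for every row, while the iterative one carries the previous total forward:

```python
# Two ways to compute running totals over an ordered list of values:
# the set-based definition (each total is the sum of all values up to
# and including the current one, O(n^2) summation) and the iterative
# accumulation (previous total plus current value, O(n) summation).
values = [10.0, 20.0, 5.0, 15.0]

# set-based definition: re-sum the prefix for every row
set_based = [sum(values[: i + 1]) for i in range(len(values))]

# iterative definition: carry the previous total forward
iterative, total = [], 0.0
for v in values:
    total += v
    iterative.append(total)

assert set_based == iterative == [10.0, 30.0, 35.0, 50.0]
```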

Iterative (cursor-based) code

The base principle of iterative coding is to write T-SQL as if it were just another third-generation programming language, like C#, VB.NET, Cobol, and Pascal. In those languages, the only way to process a set of data (such as a sequentially organized file) is to iterate over the data, reading one record at a time, processing that record, and then moving to the next record until the end of the file has been reached. SQL Server has cursors as a built-in mechanism for this iteration, hence the term cursor-based code as an alternative to the more generic iterative code.

Most iterative code encountered "in the wild" is written for one of two reasons: either because the developer was used to this way of coding and didn't know how (or why!) to write set-based code instead; or because the developer was unable to find a good-performing set-based approach and had to fall back to iterative code to get acceptable performance.

BENEFITS

A perceived benefit of iterative code might be that developers with a background in third-generation languages can start coding right away, instead of having to learn a radically different way to do their work. But that argument would be like someone from the last century suggesting that we hitch horses to our cars so that drivers don't have to learn how to start the engine and operate the steering wheel.

Iterative code also has a real benefit—but only in a few cases. Because the coder has to specify each step SQL Server has to take to get to the end result, it's easy to store an intermediate result and reuse it later. In some cases (such as the running totals already mentioned), this can result in faster-running code.

DRAWBACKS

By writing iterative code, you're crippling SQL Server's performance in two ways at the same time. You not only work around all the optimizations SQL Server has for fast set-based processing, you also effectively prevent the query optimizer from coming up with a faster way to achieve the same results. Tell SQL Server to read employees, and for each employee read the details of his or her department, and that's exactly what'll happen. But tell SQL Server that you want results of employees and departments combined, and that's only one of the options for the query optimizer to consider.

Running totals in the OVER clause

The full ANSI standard specification of the OVER clause includes windowing extensions that allow for simple specification of running totals. This would result in short queries with probably very good performance—if SQL Server had implemented them. Unfortunately, these extensions aren't available in any current version of SQL Server, so we still have to code the running totals ourselves.

Set-based iteration

An aspect that’s often overlooked in the “set-based or cursor” discussion is that theyrepresent two extremes, and there’s plenty of room for alternate solutions inbetween Iterative algorithms typically use one iteration for each row in the table orquery that the iteration is based on, so the number of iterations is always equal to thenumber of rows, and the amount of work done by a single execution of the body ofthe iteration equates to processing a single row Set-based code goes to the otherextreme: processing all rows at once, in a single execution of the code Why limit our-selves to choosing either one execution that processes N rows, or N executions thatprocess one row each?

The most basic form

The most basic form of set-based iteration isn't used to prevent exponential performance scaling, but to keep lock durations short and to prevent the transaction log from overflowing. This technique is often recommended in newsgroups when UPDATE or DELETE statements that affect a large number of rows have to be run. To prevent long-lasting locks, lock escalation, and transaction log overflow, the TOP clause is used (or SET ROWCOUNT on versions older than SQL Server 2005) to limit the number of rows processed in a single iteration, and the statement is repeated until no more rows are affected. An example is shown in listing 1, where transaction history predating the year 2005 is removed in chunks of 10,000 rows. (Note that this example, like all other examples in this chapter, should run on all versions from SQL Server 2005 upward.)

Listing 1 Removing rows in batches

SET NOCOUNT ON;
DECLARE @BatchSize int, @RowCnt int;
SET @BatchSize = 10000;
SET @RowCnt = @BatchSize;

WHILE @RowCnt = @BatchSize
BEGIN;
  DELETE TOP (@BatchSize)
  FROM   TransactionHistory
  WHERE  TranDate < '20050101';
  SET @RowCnt = @@ROWCOUNT;
END;


In this example, I’ll use the AdventureWorks sample database to report all sales,arranged by customer, ordered by date, and with a running total of all order amountsfor a customer up to and including that date Note that the Microsoft-supplied sampledatabase is populated with more than 31,000 orders for over 19,000 customers, andthat the highest number of orders for a single customer is 28.

DECLARATIVE CODE

In current versions of SQL Server, the only way to calculate running totals in declarative code is to join the table to itself, matching each row to all preceding rows for the same customer, and adding all those joined rows together to calculate the running total. The code for this is shown in listing 2.

Listing 2 Declarative code for calculating running totals

USE AdventureWorks;
SET NOCOUNT ON;

SELECT   s.CustomerID, s.OrderDate, s.SalesOrderID, s.TotalDue,
         SUM(s2.TotalDue) AS RunningTotal
FROM     Sales.SalesOrderHeader AS s
INNER JOIN Sales.SalesOrderHeader AS s2
      ON  s2.CustomerID = s.CustomerID
      AND (   s2.OrderDate < s.OrderDate
           OR (    s2.OrderDate = s.OrderDate
               AND s2.SalesOrderID <= s.SalesOrderID))  -- SalesOrderID used as tie breaker
GROUP BY s.CustomerID, s.OrderDate, s.SalesOrderID, s.TotalDue
ORDER BY s.CustomerID, s.OrderDate, s.SalesOrderID;

The performance of this query depends on the average number of rows in the self-join. In this case, the average is less than 2, resulting in great performance: approximately 0.2 seconds on my laptop. But if you adapt the code to produce running totals per sales territory instead of per customer (by replacing all occurrences of the column name CustomerID with TerritoryID), you're in for a nasty surprise: with only 10 different territories in the database, the average number of rows in the self-join is much higher. And because performance in this case degrades exponentially, not linearly, the running time on my laptop went up to over 10 minutes (638 seconds, to be exact)!
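The size of that self-join is what drives the slowdown: a group of n rows contributes n * (n + 1) / 2 join rows. A quick sketch (Python, with illustrative group sizes) makes the customer-versus-territory difference concrete:

```python
# The self-join pairs every row with all of its predecessors in the same
# group, so a group of n rows contributes n * (n + 1) / 2 join rows.
def join_rows(group_sizes):
    return sum(n * (n + 1) // 2 for n in group_sizes)

# 31,000 orders spread thinly over many customers (illustrative sizes)...
per_customer = join_rows([2] * 15500)    # 46,500 join rows
# ...versus the same orders collapsed into just 10 territories
per_territory = join_rows([3100] * 10)   # 48,065,500 join rows

assert per_territory > 1000 * per_customer
```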


Listing 3 Iterative code for calculating running totals

USE AdventureWorks;
SET NOCOUNT ON;

DECLARE @Results TABLE                        -- (B)
   (CustomerID int NOT NULL,
    OrderDate datetime NOT NULL,
    SalesOrderID int NOT NULL,
    TotalDue money NOT NULL,
    RunningTotal money NULL,
    PRIMARY KEY (CustomerID, OrderDate, SalesOrderID));

INSERT INTO @Results (CustomerID, OrderDate, SalesOrderID, TotalDue)
SELECT CustomerID, OrderDate, SalesOrderID, TotalDue
FROM   Sales.SalesOrderHeader;

DECLARE @CustomerID int, @OrderDate datetime, @SalesOrderID int,
        @TotalDue money, @CurrCustomerID int, @RunningTotal money;
SET @CurrCustomerID = 0;
SET @RunningTotal = 0;

DECLARE SalesCursor CURSOR STATIC READ_ONLY
FOR SELECT CustomerID, OrderDate, SalesOrderID, TotalDue
    FROM   @Results
    ORDER BY CustomerID, OrderDate, SalesOrderID;  -- (C)

OPEN SalesCursor;

FETCH NEXT FROM SalesCursor
INTO  @CustomerID, @OrderDate, @SalesOrderID, @TotalDue;

WHILE @@FETCH_STATUS = 0
BEGIN;
  IF @CustomerID <> @CurrCustomerID
  BEGIN;                                      -- (D)
    SET @CurrCustomerID = @CustomerID;
    SET @RunningTotal = 0;
  END;
  SET @RunningTotal = @RunningTotal + @TotalDue;

  UPDATE @Results                             -- (E)
  SET    RunningTotal = @RunningTotal
  WHERE  CustomerID = @CustomerID
  AND    OrderDate = @OrderDate
  AND    SalesOrderID = @SalesOrderID;

  FETCH NEXT FROM SalesCursor
  INTO  @CustomerID, @OrderDate, @SalesOrderID, @TotalDue;
END;

CLOSE SalesCursor;
DEALLOCATE SalesCursor;

SELECT CustomerID, OrderDate, SalesOrderID, TotalDue, RunningTotal
FROM   @Results
ORDER BY CustomerID, OrderDate, SalesOrderID;


The code is pretty straightforward. In order to get all results as one result set, a table variable (B) is used to store the base data and the calculated running totals. The primary key on the table variable is there primarily to create a good clustered index for the iteration, which explains why it includes more columns than the key (which is on SalesOrderID only). The only way to index a table variable is to add PRIMARY KEY or UNIQUE constraints to it.

A T-SQL cursor is then used to iterate over the rows. For each row, the variable holding the running total is incremented with the total of that order and then stored in the results table (E), after resetting the running total to 0 when the customer changes (D). The ORDER BY of the cursor (C) ensures that the data is processed in the proper order, so that the calculated running totals will be correct.

On my laptop, this code takes 1.9 seconds. That's slower than the declarative version presented earlier. But if I change the code to calculate running totals per territory, the running time remains stable at 1.9 seconds. This shows that, even though the declarative solution is faster when the average number of rows in the self-join is low, the iterative solution is faster at all other times, with the added benefit of stable and predictable performance. Almost all processing time is for fetching the order rows, so the performance will grow linearly with the amount of data.

SET-BASED ITERATION

For each customer, the running total of her first order is equal to the order total. The running total of the second order is then equal to the order total plus the first running total, and so on. This is the key to a solution that uses set-based iteration: determine the running total for the first orders of all customers, then calculate all second running totals, and so forth.

This algorithm, for which the code is shown in listing 4, needs as many iterations as the highest number of orders for a single customer—28 in this case. Each individual iteration will probably be slower than a single iteration of the iterative solution, but because the number of iterations is reduced from more than 30,000 to 28, the total execution time is faster.

Listing 4 Set-based iteration for calculating running totals

USE AdventureWorks;
SET NOCOUNT ON;

DECLARE @Results TABLE                        -- (B)
   (CustomerID int NOT NULL,
    OrderDate datetime NOT NULL,
    SalesOrderID int NOT NULL,
    TotalDue money NOT NULL,
    RunningTotal money NULL,
    Rnk int NOT NULL,
    PRIMARY KEY (Rnk, CustomerID));

INSERT INTO @Results                          -- (C)
   (CustomerID, OrderDate, SalesOrderID, TotalDue, RunningTotal, Rnk)
SELECT CustomerID, OrderDate, SalesOrderID, TotalDue, TotalDue,
       RANK() OVER (PARTITION BY CustomerID
                    ORDER BY OrderDate, SalesOrderID)
FROM   Sales.SalesOrderHeader;

DECLARE @Rank int, @RowCount int;
SET @Rank = 1;
SET @RowCount = 1;

WHILE @RowCount > 0
BEGIN;
  SET @Rank = @Rank + 1;
  UPDATE nxt                                  -- (D)
  SET    RunningTotal = prv.RunningTotal + nxt.TotalDue
  FROM   @Results AS nxt
  INNER JOIN @Results AS prv
        ON  prv.CustomerID = nxt.CustomerID
        AND prv.Rnk = @Rank - 1
  WHERE  nxt.Rnk = @Rank;
  SET @RowCount = @@ROWCOUNT;
END;

SELECT CustomerID, OrderDate, SalesOrderID, TotalDue, RunningTotal
FROM   @Results
ORDER BY CustomerID, OrderDate, SalesOrderID;

Just as in the iterative code, a table variable (B) is used to store the base data and the calculated running totals. In this case, that's not only to enable all results to be returned at once and in the expected order, but also because we need to store intermediate results and reuse them later.

During the initial population (C) of the results table, I calculate and store the rank of each order. This is more efficient than calculating it in each iteration, because this also allows me to base the clustered index on this rank. It's possible to code this algorithm without materializing the rank in this table, but that makes the rest of the code more complex, and (most important) hurts performance in a big way!

While populating the table variable, I also set the running total for each orderequal to its order total This is, of course, incorrect for all except the first orders, but itsaves the need for a separate UPDATE statement for the first orders, and the runningtotals for all other orders will eventually be replaced later in the code

The core of this algorithm is the UPDATE statement that joins a selection of all orders with the next rank to those of the previous rank, so that the next running total can be set to the sum of the previous running total and the next order total.
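The rank-at-a-time update is easy to prototype outside the database. Here's an illustrative Python sketch (my own, not from the book; the sample orders are made up) that mirrors the algorithm: seed each running total with the order's own total, then sweep rank by rank, adding the previous rank's running total to the next rank's rows:

```python
from collections import defaultdict

def running_totals(orders):
    """orders: list of (customer_id, order_total), already in date order.
    Returns one running total per input row, computed rank-by-rank the
    way the set-based UPDATE does."""
    # Assign each order its rank within its customer (1 = first order).
    ranks, seen = [], defaultdict(int)
    for cust, _ in orders:
        seen[cust] += 1
        ranks.append(seen[cust])
    # Initialize every running total to the order's own total.
    running = [total for _, total in orders]
    # One pass per rank: add the previous rank's running total.
    for rank in range(2, max(ranks, default=0) + 1):
        prev = {orders[i][0]: running[i]
                for i in range(len(orders)) if ranks[i] == rank - 1}
        for i in range(len(orders)):
            if ranks[i] == rank:
                running[i] = prev[orders[i][0]] + orders[i][1]
    return running

orders = [('A', 10), ('B', 5), ('A', 20), ('A', 30), ('B', 7)]
print(running_totals(orders))  # [10, 5, 30, 60, 12]
```

The number of passes equals the highest order count for a single customer, which is exactly why the T-SQL version needs only 28 iterations for the AdventureWorks data.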

On my laptop, this code runs in 0.4 seconds. This speed depends not only on the amount of data, but also on the required number of iterations. If I change the code to calculate running totals per territory rather than per customer, the number of iterations goes up to almost 7,000, causing the execution time to rise to approximately 0.9 seconds. And if I change the code to calculate overall running totals (forcing the number of iterations to be equal to the number of rows), the clock stops at 2 seconds.

The bottom line is that, even though declarative code runs slightly faster in cases with a very low iteration count and iterative code is slightly better for very high iteration counts, set-based iteration presents a good algorithm that's the fastest in many situations and only slightly slower in the other cases.

Bin packing

The bin-packing problem describes a category of related problems. In its shortest form, it can be expressed as "given an unlimited supply of bins, all having the same capacity, and a collection of packages, find a way to combine all packages in the least number of bins."

The bin-packing problem is sometimes thought to be mainly academic, of interest for mathematicians only. That's a misconception, as there are many business situations that are a variation on the bin-packing problem:

• Transport — You have to transport five packages from Amsterdam to Paris. The packages weigh two, three, four, five, and six tons. The maximum capacity of a single truck is 10 tons. You can, of course, place the first three packages in a single truck without exceeding the maximum weight, but then you'd need two extra trucks for the last two packages. With this small amount of data, it's obvious that you can get them transported in two trucks if you place the packages of four and six tons in one truck and the other three in the second. But if there are 400 packages, it becomes too hard for a human to see how to spare one or two trucks, and computerized assistance becomes crucial.

• Seating groups — Imagine a theatre with 40 rows of 30 seats each. If a group makes a reservation, they'll expect to get adjacent seats on a single row. But if you randomly assign groups to rows, you have a high chance that you'll end up with two or three empty seats in each row and a group of eight people who can't get adjacent seats anymore. If you can find a more efficient way to assign seats to groups, you might free up eight adjacent seats on one row and sell an extra eight tickets.

• Minimizing cut loss — Materials such as cable and fabric are usually produced on rolls of a given length. If a builder needs to use various lengths of cable, or a store gets orders for various lengths of fabric, they don't want to be left with one or two meters from each roll and still have to use a new roll for the last required length of six meters.
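The transport example can be verified with a tiny greedy sketch. The following Python (illustrative only; first-fit decreasing is a standard heuristic, not the chapter's algorithm) finds the two-truck split mentioned above:

```python
def first_fit_decreasing(packages, capacity):
    """Greedy heuristic: sort packages largest-first and put each one
    into the first bin with enough room, opening a new bin if none fits."""
    bins = []  # each bin is a list of package sizes
    for size in sorted(packages, reverse=True):
        for b in bins:
            if sum(b) + size <= capacity:
                b.append(size)
                break
        else:
            bins.append([size])
    return bins

# The five packages from the transport example: two trucks suffice.
print(first_fit_decreasing([2, 3, 4, 5, 6], 10))  # [[6, 4], [5, 3, 2]]
```

This reproduces exactly the distribution described in the text: four and six tons in one truck, the other three packages in the second.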

According to mathematicians, you can only be 100 percent sure that you get the absolute minimum number of bins by trying every possible permutation. It's obvious that, however you implement this, it'll never scale, as the number of possible permutations grows exponentially with the number of packages. Most businesses will prefer an algorithm that produces a "very good" distribution in a few seconds over one that might save two or three bins by finding the "perfect" solution after running for a few days.
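To see how quickly exhaustive search becomes hopeless, consider that just the orderings of n packages already number n!. A quick Python check (illustrative) shows that this passes a trillion before n even reaches 16:

```python
import math

# Factorial growth: the count of orderings of n packages.
for n in (5, 10, 15, 20):
    print(n, math.factorial(n))
# 5  120
# 10 3628800
# 15 1307674368000
# 20 2432902008176640000
```

And orderings are only part of the story; the number of ways to partition packages over bins grows even faster.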

DECLARATIVE CODE

I've never found a set-based approach to finding a "good enough" solution for the bin-packing problem. But I've found set-based code that finds the "perfect" solution. This code was originally posted by John Gilson; I've corrected and optimized this code and then published it to my blog (http://sqlblog.com/blogs/hugo_kornelis/archive/2008/10/27/bin-packing-part-4-the-set-based-disaster.aspx), but it's too large to reproduce here. There's no reason to, either, because this code can never be used in practice—not only because it couples already-bad performance with the ugliest exponential growth curve I've ever seen, but also because it requires extra columns in intermediate result sets and many extra lines of code as the bin size and the number of packages increases, such that for real-world problems, you'd need millions of lines of code (and a version of SQL Server that allows more than 4,096 columns per SELECT statement). And then you'll still get execution times measured in days, if not years.

ITERATIVE CODE

Because a set-based solution for the bin-packing problem is way too slow, even in cases that are limited enough that such a solution is even possible, we need to investigate other options. And the most obvious alternative is an iterative solution. Of all the possible strategies I investigated (see my blog for the details), I found that the best combination of speed and packing efficiency is attained by an algorithm that stays close to how I'd pack a bunch of physical packages into physical bins: take a bin, keep adding packages to it until it overflows, then start with a new bin unless the overflowing package fits into one of the other already filled bins. Listing 5 shows the code to set up the tables and fill them with some randomly generated data, and listing 6 shows the T-SQL version of this algorithm.

Listing 5  Code to set up the tables and fill them with randomly generated data

SET NOCOUNT ON;

IF OBJECT_ID('dbo.Packages', 'U') IS NOT NULL
BEGIN;
   DROP TABLE dbo.Packages;
END;
CREATE TABLE dbo.Packages
   (PackageNo int NOT NULL IDENTITY PRIMARY KEY,
    Size      smallint NOT NULL,   -- random size between 2 and 60
    BinNo     int DEFAULT NULL);

DECLARE @NumPackages int, @Loop int;

Listing 6  Iterative code for bin packing

SET NOCOUNT ON;

DECLARE @BinSize smallint,
        @PackageNo int,            -- Variables for cursor data
        @Size smallint,
        @CurBinNo int,             -- Variables for current bin
        @CurSpaceLeft smallint,
        @BinNo int;

CREATE INDEX ix_Bins ON dbo.Bins(SpaceLeft);

-- Start with empty current bin
SET @CurBinNo = 1;
SET @CurSpaceLeft = @BinSize;
INSERT INTO dbo.Bins (BinNo, SpaceLeft)
VALUES (@CurBinNo, @CurSpaceLeft);

DECLARE PackageCursor CURSOR STATIC  -- Stored for extra performance
FOR SELECT PackageNo, Size
    FROM   dbo.Packages;

OPEN PackageCursor;
FETCH NEXT FROM PackageCursor
INTO  @PackageNo, @Size;
WHILE @@FETCH_STATUS = 0
BEGIN;
   IF @CurSpaceLeft >= @Size
   BEGIN;
      SET @BinNo = @CurBinNo;
   END;
   ELSE
   BEGIN;
      SET @BinNo = (SELECT TOP (1) BinNo
                    FROM   dbo.Bins
                    WHERE  SpaceLeft >= @Size
                    AND    BinNo <> @CurBinNo
                    ORDER BY SpaceLeft);
      IF @BinNo IS NULL
      BEGIN;
         UPDATE dbo.Bins
         SET    SpaceLeft = @CurSpaceLeft
         WHERE  BinNo = @CurBinNo;
         SET @CurBinNo = @CurBinNo + 1;
         SET @CurSpaceLeft = @BinSize;
         INSERT INTO dbo.Bins (BinNo, SpaceLeft)
         VALUES (@CurBinNo, @CurSpaceLeft);
         SET @BinNo = @CurBinNo;
      END;
   END;
   UPDATE dbo.Packages
   SET    BinNo = @BinNo
   WHERE  PackageNo = @PackageNo;
   IF @BinNo = @CurBinNo
   BEGIN;
      SET @CurSpaceLeft = @CurSpaceLeft - @Size;
   END;
   ELSE
   BEGIN;
      UPDATE dbo.Bins
      SET    SpaceLeft = SpaceLeft - @Size
      WHERE  BinNo = @BinNo;
   END;
   FETCH NEXT FROM PackageCursor
   INTO  @PackageNo, @Size;
END;

IF @CurBinNo IS NOT NULL
BEGIN;
   UPDATE dbo.Bins
   SET    SpaceLeft = @CurSpaceLeft
   WHERE  BinNo = @CurBinNo;
END;

CLOSE PackageCursor;
DEALLOCATE PackageCursor;

The main logic is coded in the WHILE loop. For every package, I first check whether the current bin has enough room left. If not, I check whether the package would fit one of the other already partly filled bins before creating a new bin for it. To save time, I don't write the data for the current bin to disc after each package, but I pay for this by having to write it at two slightly less logical locations in the code—when a new bin is started, or (for the last bin) after the last package has been assigned.

This algorithm is fast, because adding several packages to the same bin right after each other saves on the overhead of switching between bins. It's also efficient because, even if a large package forces me to start a new bin when the previous one is still half empty, that half-empty bin will still be reconsidered every time a package would overflow the current bin, so it should eventually fill up. There's no ORDER BY specified in the cursor definition. Adding ORDER BY Size DESC will improve the packing efficiency by about 4 percent, but at the cost of a performance hit that starts at 5–10 percent for small amounts of test data (10,000–50,000 packages), but grows to more than 20 percent for 500,000 packages.
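As a cross-check of the logic in listing 6, here's the same packing strategy sketched in Python (my own illustration, not the book's code): keep a current bin, and when a package doesn't fit, try the tightest-fitting of the other partly filled bins before opening a new one:

```python
def pack(packages, bin_size):
    """Pack sizes into bins using the current-bin-first strategy:
    returns a list of bins, each a list of package sizes."""
    bins, space = [[]], [bin_size]  # parallel lists: contents, space left
    cur = 0                         # index of the current bin
    for size in packages:
        if space[cur] >= size:
            target = cur
        else:
            # Other partly filled bins with room, tightest fit first
            # (mirrors ORDER BY SpaceLeft in the T-SQL).
            candidates = [i for i in range(len(bins))
                          if i != cur and space[i] >= size]
            if candidates:
                target = min(candidates, key=lambda i: space[i])
            else:
                bins.append([])
                space.append(bin_size)
                cur = target = len(bins) - 1
        bins[target].append(size)
        space[target] -= size
    return bins

print(pack([4, 6, 5, 3, 2], 10))  # [[4, 6], [5, 3, 2]]
```

Note that, as in the cursor version, the result depends on the order in which the packages arrive; no sorting is applied.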

When I tested this code on my laptop, it was able to pack 100,000 packages in bins in approximately 143 seconds. The running time went up to 311 seconds for 200,000 packages, and to 769 seconds for 500,000 packages. The growth is much better than exponential, but worse than linear, probably due to the increasing cost of checking an ever-increasing number of partly filled bins when a package would overflow the current bin.

Some extrapolation of my test results indicates that a run with a million packages will probably take half an hour, and maybe 6 or 7 hours are needed to pack ten million packages. This sure beats packing the bins by hand, but it might not be fast enough in all situations.

SET-BASED ITERATION

In those situations where the iterative solution isn't fast enough, we need to find something faster. The key here is that it's easy to calculate an absolute minimum number of bins—if, for example, the combined size of all packages is 21,317 and the bin size is 100, then we can be sure that there will never be a solution with less than 214 bins—so why not start off with packing 214 bins at once?
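That lower bound is simply the combined package size divided by the bin size, rounded up. A quick Python check (illustrative), using integer arithmetic to avoid floating-point rounding:

```python
def min_bins(total_size, bin_size):
    # Absolute lower bound on the number of bins: ceil(total / bin_size),
    # computed with integers so large totals stay exact.
    return (total_size + bin_size - 1) // bin_size

print(min_bins(21317, 100))  # 214
```

No packing, however clever, can do better than this bound; the set-based iteration uses it as the batch size for each pass.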

I start by finding the 214 largest packages and putting one in each of the 214 available bins. After that, I rank the bins by space remaining, rank the packages (excluding those that are already too large for any bin) by size, match bins and packages by rank, and add packages that will still fit into their matching bins. I then repeat this step until there are no packages left that fit in the remaining space of an available bin (either because all packages are packed, or they're all larger than the largest free space).

Ideally, all packages have now been catered for. In reality, there will often be cases where not all packages can be handled in a single pass—so I then repeat this process, by summing the total size of the remaining packages, dividing by the bin size, adding that number of bins, and repeatedly putting packages into bins until no more packages that fit in a bin are left. This second pass is often the last. Sometimes a third pass can be required.
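Here's an illustrative Python sketch of these ranked passes (my own approximation of the approach; the chapter's real implementation is the T-SQL of listing 8):

```python
def pack_by_rank(packages, bin_size):
    """Multi-pass ranked packing sketch: each pass opens the minimum
    conceivable number of new bins for what's left, then repeatedly
    matches bins ranked by free space to packages ranked by size."""
    # Packages larger than a bin can never be packed; exclude them.
    remaining = sorted((p for p in packages if p <= bin_size), reverse=True)
    space, contents = [], []
    while remaining:
        n = (sum(remaining) + bin_size - 1) // bin_size  # lower bound
        start = len(space)
        space += [bin_size] * n
        contents += [[] for _ in range(n)]
        while True:
            # Rank this pass's bins by space left, largest first.
            order = sorted(range(start, len(space)),
                           key=lambda i: space[i], reverse=True)
            placed = []
            for rank, i in enumerate(order):
                if rank < len(remaining) and remaining[rank] <= space[i]:
                    space[i] -= remaining[rank]
                    contents[i].append(remaining[rank])
                    placed.append(rank)
            if not placed:
                break  # nothing fits anymore; start a new pass
            remaining = [p for r, p in enumerate(remaining)
                         if r not in placed]
    return contents
```

Because whole ranked batches of packages are placed in each matching step, the number of passes stays tiny even for large inputs, which is exactly the appeal of set-based iteration here.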

The code in listing 8 shows a SQL Server 2005–compatible implementation of this algorithm. (On SQL Server 2008, the UPDATE FROM statement can be replaced with MERGE for better ANSI compatibility, though at the cost of slightly slower performance.) This code uses a numbers table—a table I believe should exist in every database, as it can be used in many situations. Listing 7 shows how to make such a table and fill it with numbers 1 through 1,000,000. Note that creating and filling the numbers table is a one-time operation!
