CHECK (CASE WHEN code_type = 'DDC'
             AND code_value SIMILAR TO '[0-9][0-9][0-9].[0-9][0-9][0-9]'
            THEN 1
            WHEN code_type = 'ICD'
             AND code_value SIMILAR TO '[0-9][0-9][0-9].[0-9][0-9][0-9]'
            THEN 1
            WHEN code_type = 'ISO3166'
             AND code_value SIMILAR TO '[A-Z][A-Z]'
            THEN 1
            ELSE 0 END = 1),
 code_description VARCHAR(255) NOT NULL,
 PRIMARY KEY (code_value, code_type));
Since the typical application database can have dozens and dozens of codes in it, you just keep extending this pattern for as long as required. Not very pretty, is it? That is why most OTLT programmers do not bother with it, and thus destroy data integrity.
The next thing you notice about this table is that the columns are pretty wide VARCHAR(n) or, even worse, NVARCHAR(n) columns. The value of (n) is most often the largest one allowed in that particular SQL product.
Since you have no idea what is going to be shoved into the table, there is no way to predict and design with a safe, reasonable maximum size. The size constraint has to be put into the WHEN clauses of that second CHECK() constraint, alongside the code_type and code_value tests.
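For example, a length test can be folded into each branch of that CASE expression. A minimal sketch, with illustrative length limits that are not part of the original schema:

CHECK (CASE WHEN code_type = 'ISO3166'
             AND CHAR_LENGTH(code_value) = 2 -- size limit for this scheme
             AND code_value SIMILAR TO '[A-Z][A-Z]'
            THEN 1
            WHEN code_type = 'DDC'
             AND CHAR_LENGTH(code_value) = 7 -- 'nnn.nnn' is seven characters
             AND code_value SIMILAR TO '[0-9][0-9][0-9].[0-9][0-9][0-9]'
            THEN 1
            ELSE 0 END = 1)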
These large sizes tend to invite bad data. You give someone a VARCHAR(n) column, and you eventually get a string with a lot of white space and a small odd character sitting at the end of it. You give someone an NVARCHAR(255) column and eventually it will get a Buddhist sutra in Chinese Unicode. I am sure of this, because I load the Diamond or Heart Sutra when I get called to evaluate a database.
If you make an error in the code_type or code_description among codes with the same structure, it might not be detected. You can turn 500.000 from “Natural Sciences and Mathematics” in the Dewey Decimal codes into “Coal Workers’ Pneumoconiosis” in ICD, and vice versa. This can be really difficult to find when one of the similarly structured schemes has unused codes in it.
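To see why the schema cannot catch the mix-up, note that the same code_value satisfies the identical pattern under either code_type. A minimal sketch, using the two descriptions from the example above:

INSERT INTO Lookups (code_type, code_value, code_description)
VALUES ('DDC', '500.000', 'Natural Sciences and Mathematics'),
       ('ICD', '500.000', 'Coal Workers'' Pneumoconiosis');
-- swapping the two code_type values would still pass the CHECK() constraint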
Now let’s consider the problems with actually using the OTLT in the DML. It is always necessary to add the code_type as well as the value that you are trying to look up.
SELECT P1.ssn, P1.lastname, ..., L1.code_description
FROM Lookups AS L1, Personnel AS P1
WHERE L1.code_type = 'ICD'
AND L1.code_value = P1.sickness
AND ...;
In this sample query, I need to know the code_type of the Personnel table’s sickness column and of every other encoded column in the table. If you get a code_type wrong, you can still get a result.
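To illustrate, a minimal sketch: the same join with the wrong code_type is still valid SQL and, because the DDC and ICD codes share the same format, it can still return rows that silently decode sickness codes as Dewey Decimal entries.

SELECT P1.ssn, P1.lastname, L1.code_description
  FROM Lookups AS L1, Personnel AS P1
 WHERE L1.code_type = 'DDC' -- wrong scheme; no error is raised
   AND L1.code_value = P1.sickness;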
I also need to allow for some overhead for type conversions. It would be much more natural to use DECIMAL(6,3) for Dewey Decimal codes instead of VARCHAR(n), so that is probably how it appears in the Personnel table. But why not use CHAR(7) for the code? If I had a separate table for each encoding scheme, then I would have used a FOREIGN KEY and matched the data types in the referenced and referencing tables. There is no definitive guide for data type choices in the OTLT approach.
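For contrast, a minimal sketch of the separate-table approach; the table name DeweyCodes and the column name dewey_code are hypothetical, but the point is that the referenced and referencing columns share one declared data type:

CREATE TABLE DeweyCodes
(dewey_code DECIMAL(6,3) NOT NULL PRIMARY KEY,
 dewey_description VARCHAR(255) NOT NULL);

ALTER TABLE Personnel
  ADD COLUMN dewey_code DECIMAL(6,3)
      REFERENCES DeweyCodes (dewey_code);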
When I go to execute a query, I have to pull in the entire lookup table, even if I only use one code. If one code is at the start of the physical storage and another is at the end of the physical storage, I can do a lot of paging. When I update the lookup table, I have to lock out everyone until I am finished. It is like having to carry an encyclopedia set with you when all you needed was a magazine article.
I am going to venture a guess that this idea came from OO programmers who think of it as some kind of polymorphism done in SQL. They say to themselves that a table is a class, which it is not, and therefore it ought to have polymorphic behaviors, which it does not. Maybe there are good reasons for the data-modeling principle that a well-designed table is a set of things of the same kind, instead of a pile of unrelated items.
22.3 Auxiliary Function Tables
SQL is not a computational language like FORTRAN or the specialized math packages. It typically does not have the numerical analysis routines to compensate for floating-point rounding errors, or algebraic reductions in the optimizer. But it is good at joins.
Most auxiliary lookup tables are for simple decoding, but they can be used for more complex functions. Let’s consider two financial calculations that you cannot do easily: the Net Present Value (NPV) and its related Internal Rate of Return (IRR). Let me stop and ask: how would you program the NPV and IRR in SQL? The answer posted in most newsgroup replies was to write a procedure directly from the equation in the vendor-specific 4GL language and then call it.
As a quick review, let’s start with the net present value (NPV) calculation. Imagine that you just got an e-mail from some poor Nigerian civil servant who wants you to send him a small investment now on the promise that he will send you a series of payments over time from money he is stealing from a government bank account. Obviously, you would want the total of the cash flow to be at least equal to the initial investment, or the money is not worth lending. We can assume that you are making a little profit at the end of the investment. But is this a good investment? That is, if I took that cash flow and invested it at a given interest rate, what would the result be? That is called the net present value (NPV), and you will want to do at least as well as this value on your investment.
To make this more concrete, let’s show a little code and data for your two investment options:
CREATE TABLE CashFlows
(project_id CHAR(15) NOT NULL,
 time_period INTEGER NOT NULL
   CHECK (time_period >= 0),
 amount DECIMAL(12,4) NOT NULL,
 PRIMARY KEY (project_id, time_period));
INSERT INTO CashFlows
VALUES ('Acme', 0, -1000.0000), ('Acme', 1, 500.0000),
       ('Acme', 2, 400.0000), ('Acme', 3, 200.0000),
       ('Acme', 4, 200.0000),
       ('Beta', 0, -1000.0000), ('Beta', 1, 100.0000),
       ('Beta', 2, 200.0000), ('Beta', 3, 200.0000),
       ('Beta', 4, 700.0000);
I invest $1,000 at the start of each project; the time period is zero and the amount is always negative. Every year I get a different amount back on my investment, so that at the end of the fourth year, I’ve received a total of $1,300 on the Acme project, less my initial $1,000, for a profit of $300. Likewise, the Beta project returns $1,200 in total, most of it in a single $700 payment at the end. Which project is the better investment? Let’s assume we can get a 10% return on an investment and that we put our cash flows into that investment. The Net Present Value function in pseudocode is:
FOR t FROM 0 TO n
DO SUM(a[t] / (1.00 + r)^t)
END FOR;
In this case, a[t] is the cash flow for time period (t), time period (t = 0) is the initial investment (which is always negative), and r is the interest rate.
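Written directly against the CashFlows table at the assumed 10% rate, the formula is a one-line aggregate. A minimal sketch, assuming your SQL product has a POWER() function:

SELECT project_id,
       SUM(amount / POWER(1.10, time_period)) AS npv_at_10_percent
  FROM CashFlows
 GROUP BY project_id;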
When we run them through the equation, we find that Acme has an NPV of $71.9896 and Beta is worth −$115.4293, so Acme is really the better project. We can get more out of the Acme cash flow than the Beta cash flow.
22.3.1 Inverse Functions with Auxiliary Tables
The IRR depends on the NPV. It finds the interest rate at which your investment would break even if you invested back into the same project. Thus, if the project’s rate is better than what you could get elsewhere, it is a good investment.
Let’s build another table:
CREATE TABLE Rates
(rate DECIMAL(6,4) NOT NULL PRIMARY KEY);
Now let’s populate it with some values. One trick to fill the Rates table is to use a CROSS JOIN and keep the values inside a reasonable range:
CREATE TABLE Digits(digit DECIMAL (6,4) PRIMARY KEY);
INSERT INTO Digits
VALUES (0.0000), (0.0001), (0.0002), (0.0003), (0.0004),
(0.0005), (0.0006), (0.0007), (0.0008), (0.0009);
INSERT INTO Rates (rate)
SELECT DISTINCT (D1.digit * 1000) + (D2.digit * 100)
              + (D3.digit * 10) + D4.digit
  FROM Digits AS D1, Digits AS D2, Digits AS D3, Digits AS D4
 WHERE ((D1.digit * 1000) + (D2.digit * 100) + (D3.digit * 10) + D4.digit)
       BETWEEN {{lower limit}} AND {{upper limit}}; -- pseudocode

DROP TABLE Digits;
We now have two choices. We can build a VIEW that uses the cash flow table, thus:
CREATE VIEW NPV_by_Rate (project_id, rate, npv)
AS
SELECT CF.project_id, R1.rate,
       SUM(amount / POWER((1.00 + R1.rate), time_period))
  FROM CashFlows AS CF, Rates AS R1
 GROUP BY R1.rate, CF.project_id;
Alternately, we can set the amount in the formula to 1 and store the multiplier for the (rate, time_period) pair in another table:
INSERT INTO NPV_Multipliers (time_period, rate, npv_multiplier)
SELECT S.seq, R1.rate,
       SUM(1.00 / (POWER((1.00 + R1.rate), S.seq)))
  FROM Sequence AS S, Rates AS R1
 WHERE S.seq <= {{upper limit}} -- pseudocode
 GROUP BY S.seq, R1.rate;
The Sequence table contains the integers 1 to (n); it is a standard auxiliary table, used to avoid iteration.
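If you do not already have a Sequence table, here is a minimal sketch of building a small one; the column name seq matches the query above, and the range of 1 to 100 is only illustrative:

CREATE TABLE Sequence (seq INTEGER NOT NULL PRIMARY KEY);

INSERT INTO Sequence (seq)
SELECT (D1.d * 10) + D2.d + 1 -- 1 to 100
  FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9)) AS D1(d),
       (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9)) AS D2(d);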
Assuming we use the VIEW, the IRR is now the single query:
SELECT 'Acme', rate AS irr, npv
  FROM NPV_by_Rate
 WHERE project_id = 'Acme'
   AND ABS(npv) = (SELECT MIN(ABS(npv))
                     FROM NPV_by_Rate
                    WHERE project_id = 'Acme');
In my sample data, I get an IRR of 13.99% at an NPV of −0.04965 for the Acme project. Assume you have hundreds of projects to consider; would you rather write one query or hundreds of procedure calls? This Web site has a set of slides that deal with the use of interpolation to find the IRR: www.yorku.ca/adms3530/Interpolation.pdf. Using the method described on the Web site, we can write the interpolation for the Acme example as:
SELECT R1.rate + (R1.rate * (R1.npv / (R1.npv - R2.npv))) AS irr
  FROM NPV_by_Rate AS R1, NPV_by_Rate AS R2
WHERE R1.project_id = 'Acme'
AND R2.project_id = 'Acme'
AND R1.rate = 0.1000
AND R2.rate = 0.2100
AND R1.npv > 0
AND R2.npv < 0;
The important point is that the NPVs from R1 and R2 have to be on opposite sides of the zero point, so that you can do a linear interpolation between the two rates with which they are associated.

The trade-off is speed for accuracy. The IRR function is slightly concave, not linear; that means that if you graph it, the curve buckles toward the origin. Picking good (R1.rate, R2.rate) pairs is important, but if you want to round off to the nearest whole percentage, you have a larger range than you might think. The answer from the original table lookup method, 0.1399, rounds to 14%, as do all of the following interpolations:
R1       R2       IRR
======================
0.1000 0.2100 0.140135
0.1000 0.2000 0.143537
0.0999 0.2000 0.143457
0.0999 0.1999 0.143492
0.0800 0.1700 0.135658
The advantages of using an auxiliary function table are:

1. All host programs will be using the same calculations.

2. The formula can be applied to hundreds or thousands of projects at one time, instead of just doing one project as you would with a spreadsheet or financial calculator (see the sketch below).
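For example, dropping the single-project filter gives an IRR estimate for every project in one pass. A minimal sketch against the NPV_by_Rate VIEW defined above:

SELECT N1.project_id, N1.rate AS irr, N1.npv
  FROM NPV_by_Rate AS N1
 WHERE ABS(N1.npv) = (SELECT MIN(ABS(N2.npv))
                        FROM NPV_by_Rate AS N2
                       WHERE N2.project_id = N1.project_id);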
Robert J. Hamilton (bobha@seanet.com) posted proprietary T-SQL functions for the NPV and IRR functions. The NPV function was straightforward, but he pointed out several problems with finding the IRR.

By definition, IRR is the rate at which the NPV of the cash flows equals zero. When IRR is well behaved, the graph of NPV as a function of rate is a curve that crosses the x-axis once and only once. When IRR is not well behaved, the graph crosses the x-axis many times, which means the IRR is either multivalued or undefined.
At this point, we need to ask what the appropriate domain is for IRR. As it turns out, NPV is defined for all possible rates, both positive and negative, except where NPV approaches an asymptote at a rate of −100% and the power function blows up. What does a negative rate mean when calculating NPV? What does it mean to have a negative IRR? Well, it depends on how you look at it.

If you take a mathematical approach, a negative IRR is just another solution to the equation. If you take an economic approach, a negative IRR means you are losing money on the project. Perhaps if you live in a deflationary economy, then a negative cash flow might be profitable in terms of real money, but that is a very unusual situation, and we can dismiss negative IRRs as unreasonable.
This means that a table lookup approach to the IRR must have a very fine granularity and enough scope to cover a lot of situations for the general case. It also means that the table lookup is probably not the way to go. Expressing rates to 5 or 6 decimal places is common in home mortgage finance (i.e., APR 5.6725%), and this degree of precision using the set-based approach does not scale well. Moreover, this is exacerbated by the requirements of using IRR in hyperinflationary economies, where solutions of 200%, 300%, and higher are meaningful.

Here are Mr. Hamilton’s functions written in SQL/PSM; one uses a straight-line algorithm, such as you find in Excel and other spreadsheets, and the other uses a bounding-box algorithm. The bounding-box algorithm has better domain integrity, but can inadvertently “skip over” a solution when widening its search.
CREATE TABLE CashFlows
(t INTEGER NOT NULL
   CHECK (t >= 0),
 amount DECIMAL(12,4) NOT NULL);

CREATE TABLE Rates
(rate DECIMAL(7,5) NOT NULL);

CREATE TABLE Digits
(digit DECIMAL(6,4));
INSERT INTO Digits
VALUES (0.0000), (0.0001), (0.0002), (0.0003), (0.0004),
(0.0005), (0.0006), (0.0007), (0.0008), (0.0009);
INSERT INTO Rates
SELECT (D1.digit * 1000) + (D2.digit * 100) + (D3.digit * 10) + D4.digit
  FROM Digits AS D1, Digits AS D2, Digits AS D3, Digits AS D4;
INSERT INTO Rates
SELECT rate-1 FROM Rates WHERE rate >= 0;
INSERT INTO Rates
SELECT rate-2 FROM Rates WHERE rate >= 0;
DROP TABLE Digits;
CREATE FUNCTION NPV (IN my_rate FLOAT)
RETURNS FLOAT
DETERMINISTIC
CONTAINS SQL
RETURN (CASE WHEN ABS(1.0 + my_rate) >= 1.0e-5 -- prevent divide by zero at rate = -100%
             THEN (SELECT SUM(amount * POWER((1.0 + my_rate), -t))
                     FROM CashFlows)
             ELSE NULL END);
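A quick usage sketch: running every candidate rate through the function shows where the NPV changes sign, which brackets the IRR. This assumes the CashFlows table above has been loaded with a single project’s cash flow:

SELECT rate, NPV(rate) AS npv
  FROM Rates
 ORDER BY rate;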
CREATE FUNCTION irr_bb (IN guess FLOAT)
RETURNS FLOAT
DETERMINISTIC
CONTAINS SQL
BEGIN
DECLARE maxtry INTEGER;
DECLARE x1 FLOAT;
DECLARE x2 FLOAT;
DECLARE f1 FLOAT;
DECLARE f2 FLOAT;
DECLARE x FLOAT;
DECLARE dx FLOAT;
DECLARE x_mid FLOAT;
DECLARE f_mid FLOAT;
-- initial bounding box around guess
SET x1 = guess - 0.005;
SET f1 = NPV (x1);
IF f1 IS NULL THEN RETURN (f1); END IF;
SET x2 = guess + 0.005;
SET f2 = NPV (x2);
IF f2 IS NULL THEN RETURN (f2); END IF;
-- expand bounding box to include a solution
SET maxtry = 50;
WHILE maxtry > 0 -- try until solution is bounded
  AND (SIGN(f1) * SIGN(f2)) <> -1
DO IF ABS(f1) < ABS(f2)
   THEN -- move lower bound
        SET x1 = x1 + 1.6 * (x1 - x2);
SET f1 = NPV (x1);
IF f1 IS NULL -- no irr
THEN RETURN (f1);
END IF;
ELSE -- move upper bound
     SET x2 = x2 + 1.6 * (x2 - x1);
SET f2 = NPV (x2);
IF f2 IS NULL -- no irr
THEN RETURN (f2);
END IF;
END IF;
SET maxtry = maxtry - 1;
END WHILE;
IF (SIGN(f1) * SIGN(f2)) <> -1
THEN RETURN (CAST (NULL AS FLOAT));
END IF;
-- now find solution with binary search
SET x = CASE WHEN f1 < 0
THEN x1 ELSE x2 END;
SET dx = CASE WHEN f1 < 0
THEN (x2 - x1)
ELSE (x1 - x2) END;
SET maxtry = 50;
WHILE maxtry > 0
DO SET dx = dx / 2.0; -- reduce step by half
SET x_mid = x + dx;
SET f_mid = NPV (x_mid);
IF f_mid IS NULL -- no irr
THEN RETURN (f_mid);
ELSE IF ABS(f_mid) < 1.0e-5 -- epsilon for problem
     THEN RETURN (x_mid); -- irr found
END IF;
END IF;
IF f_mid < 0
THEN SET x = x_mid;
END IF;
SET maxtry = maxtry - 1;
END WHILE;
RETURN (CAST (NULL AS FLOAT));
END;
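A quick usage sketch; the exact syntax for invoking a scalar function varies by SQL product, and 0.10 is just a plausible starting guess:

VALUES (irr_bb(0.10)); -- returns NULL when no solution can be bracketed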
If you prefer to compute the IRR as a straight line, you can use this function:
CREATE FUNCTION irr_sl (IN guess FLOAT)
RETURNS FLOAT
DETERMINISTIC
CONTAINS SQL
BEGIN
DECLARE maxtry INTEGER;
DECLARE x1 FLOAT; DECLARE x2 FLOAT;
DECLARE f1 FLOAT; DECLARE f2 FLOAT;
SET maxtry = 50; -- iterations
WHILE maxtry > 0
DO SET x1 = guess;
SET f1 = NPV (x1);
IF f1 IS NULL -- no irr
THEN RETURN (f1);
ELSE IF ABS(f1) < 1.0e-5 -- irr within epsilon range