Checking data consistency in the context of from-to-dimensions
Date: Sun, 28 Jun 2009 08:51:38 -0700 (PDT)
Message-ID: <098e3bf0-1fdd-4f19-9bf4-3a1693312cf4_at_z9g2000yqi.googlegroups.com>
Hello,
I was analyzing the problem of logic in connection with ranges like a
price for a certain article which is valid from a certain date till a
certain date. Especially in cases where there is not only one from-to-
dimension but several I found some results quite interesting to me. I
would like to summarize them here to share my results and to see if I
get any feedback from experts, any easier solution than the ones I
found.
The following examples are based on MySQL 5.
- The simple case If there is only one from-to-dimension like for the price of an article you do not have to give any Valid_To dates at all.
CREATE TABLE `pricelist`
(INSERT INTO `pricelist` VALUES (123, 16, '2008-01-01'); INSERT INTO `pricelist` VALUES (123, 17, '2008-07-01');
`Article_ID` INT(11) NOT NULL DEFAULT
'0' ,
`Price` DOUBLE NOT NULL DEFAULT
'0' ,
`Valid_From` DATE NOT NULL DEFAULT
'0000-00-00', PRIMARY KEY (`Article_ID`,`Valid_From`) ) TYPE=MyISAM; INSERT INTO `pricelist` VALUES (123, 15, '2007-10-01');
To find the valid price for an article at a specific date you will have to use some logic:
SELECT Price
FROM pricelist
WHERE Article_ID = 123
AND Valid_From <= '2008-02-01'
ORDER BY Valid_From DESC LIMIT 0,1;
If you add a Valid_To column this will be much easier:
CREATE TABLE `pricelist2`
(INSERT INTO `pricelist2` VALUES (123, 16, '2008-01-01', '2008-06-31'); INSERT INTO `pricelist2` VALUES (123, 17, '2008-07-01', '9999-12-31');
`Article_ID` INT(11) NOT NULL DEFAULT
'0' ,
`Price` DOUBLE NOT NULL DEFAULT
'0' ,
`Valid_From` DATE NOT NULL DEFAULT
'0000-00-00',
`Valid_To` DATE NOT NULL DEFAULT
'0000-00-00', PRIMARY KEY (`Article_ID`,`Valid_From`) ) TYPE=MyISAM; INSERT INTO `pricelist2` VALUES (123, 15, '2007-10-01', '2007-12-31');
SELECT Price
FROM pricelist2
WHERE '2008-02-01' BETWEEN Valid_From AND Valid_To;
But the disadvantage of this concept is, that you will have to make sure that the valid ranges do not overlap and, if you want to have a price for all times, that there is no date which is not within at least one Valid_From Valid_To period. As validation of this is not different to the more complex case below, I will not treat it here.
B) Multidimensional Rule Sets
Imagine that a sales agent gets a bonus payment depending on two or more rules. As more than two rules are not more complex than two rules I will assume two rules e.g. volume and customer satisfaction.
CREATE TABLE `bonusrules`
(INSERT INTO `bonusrules` VALUES (123, 2000, 20000, 0.60); INSERT INTO `bonusrules` VALUES (123, 3000, 15000, 0.75);
`Agent_ID` INT(11) NOT NULL DEFAULT
'0' ,
`Bonus` DOUBLE NOT NULL DEFAULT
'0' ,
`Min_Volume` DOUBLE NOT NULL DEFAULT
'0' ,
`Min_Satisfaction` DOUBLE NOT NULL DEFAULT
'0', PRIMARY KEY (`Agent_ID`,`Min_Volume`,`Min_Satisfaction`) ) TYPE=MyISAM; INSERT INTO `bonusrules` VALUES (123, 1000, 10000, 0.50);
Which bonus has to be paid for a volume of 30000 and a satisfaction rate of 0.80? This is not obvious, technically speaking, as the 2000 bonus payment is more demanding in terms of volume and the 3000 bonus payment is more difficult to reach in terms of customer satisfaction. Therefore we need a new logic to select the bonus like
SELECT MAX(Bonus)
FROM bonusrules
WHERE Agent_ID = 123 AND Min_Volume <= 30000 AND Min_Satisfaction <= 0.80;
In other words there is no need for upper bounds Max_Volume and Max_Satisfaction if and only if we have additional logic like the maximum principle aplied here. You should however check that your rules are consistent, that more challenging conditions always lead to a higher bonus:
SELECT a_rules.Agent_ID , a_rules.Min_Volume , a_rules.Min_Satisfaction, a_rules.Bonus , b_rules.Min_Volume , b_rules.Min_Satisfaction, b_rules.Bonus FROM bonusrules a_rules, bonusrules b_rules WHERE a_rules.Agent_ID = b_rules.Agent_ID /*Rules are not identical*/ AND NOT ( a_rules.Min_Volume = b_rules.Min_Volume AND a_rules.Min_Satisfaction = b_rules.Min_Satisfaction ) /*A is easier to reach as B*/ AND a_rules.Min_Volume <= b_rules.Min_Volume AND a_rules.Min_Satisfaction <= b_rules.Min_Satisfaction /*Bonus A is higher than Bonus B*/AND a_rules.Bonus >= b_rules.Bonus;
So let's look at an example where such an maximum principle can not be applied. Such examples are not easy to find but they exist. Imagine a bank lending money. The two parameters determining the interest rate are credit duration and credit amount. Neither increasing duration nor increasing amount will necessarily lead to higher interest rates (compare market interest rates during autumn 2008).
CREATE TABLE `interestrates`
(
`Rate` DOUBLE NOT NULL DEFAULT '0' ,
`Min_Amount` DOUBLE NOT NULL DEFAULT '0' ,
`Max_Amount` DOUBLE NOT NULL DEFAULT '0' ,
`Min_Duration` DOUBLE NOT NULL DEFAULT '0',
`Max_Duration` DOUBLE NOT NULL DEFAULT '0',
PRIMARY KEY (`Min_Amount`,`Min_Duration`) ) TYPE=MyISAM;
Through a trigger or (if available) a check constraint you have to make sure that min_amount < max_amount and min_duration < max_maxduration for all entries.
INSERT INTO `interestrates` VALUES (0.05, 0, 10000, 1, 12); INSERT INTO `interestrates` VALUES (0.06, 0, 10000, 12, 24); INSERT INTO `interestrates` VALUES (0.055, 10000, 20000, 1, 24); INSERT INTO `interestrates` VALUES (0.053, 0, 20000, 24, 36);
For convenience we assume that the lower boundaries are not part of the interval and that the bank does not lend more than 20000 nor for longer durations than 36 months.
Now we are facing two problems:
- Are there overlapping (two dimensional) intervals? Is there more than one price for a given amount / duration?
- Did we forget to define a price for a certain duration and amount?
Finding overlaps is easy if one understands what is going on. Imagine our amount-duration ranges as rectangles in the plane. Two rectangles do not overlap if rectangle A is either left, right, up or down from rectangle B. To check for overlaps run the following query:
SELECT a_rates. * ,
b_rates. *
FROM interestrates a_rates,
interestrates b_rates
WHERE NOT (
/*a right of b*/ a_rates.Min_Amount >= b_rates.Max_Amount OR /*a left of b*/ a_rates.Max_Amount <= b_rates.Min_Amount OR /*a over b*/ a_rates.Min_Duration >= b_rates.Max_Duration OR /*a under b*/ a_rates.Max_Duration <= b_rates.Min_Duration ) AND /*a <> b*/ ( a_rates.Min_Amount, a_rates.Min_Duration ) <> (b_rates.Min_Amount, b_rates.Min_Duration );
Now we have to check if we forgot to define a rate for a certain amount / duration, which turns out to be much more difficult.
If we already know that there are no overlaps, we can perform a very easy check:
SELECT SUM((Max_Amount - Min_Amount)*(Max_Duration-Min_Duration))-
(20000-0)*(36-1)
FROM interestrates;
We compare the sum of the area of the rectangles to the total rectangle of possible amounts and durations. If the result is 0 we are covered.
However in a complex scenario with many rectangles or even many more dimensions a result <> 0 will not help us finding the hole in our pricing rules. So let's look for better solution:
It is obvious that a hole in our definition set must have at least one corner of our definition rectangles on its boundaries or probably that each corner of a hole is also the corner of at least one of the definition rectangles. So it is enough to test, if there is a hole next to one of the corners of the definition rectangles.
If the corner A1 of rectangle A lies on the boundaries of another rectangle B it can either lie on a corner B1 of B or on the edge between the boundaries B1 and B2 of B. If we look at a 3-dimensional example it could also lie on a 2 dimensional area between B1, B2, B3 and B4 and so on for n-dimensional examples. To check if A1 is not neighbouring a hole in the definition set we must make sure that all four (or 2^n in the n-dimensional case) directions around A1 are covered by definition rectangles. Rectangle A will cover 1 direction. If A1 = B1, another corner, than rectangle B will cover one more direction. If A1 lies on the edge between B1 and B2, rectangle B will cover two directions. Generally speaking if A1 lies on a 2^m dimensional boundary of B then B will cover m+1 directions.
As in our example we want to cover all durations from 1 to 36 months and amounts from 0 to 20000 only we have to treat those limits separately.
Let's do so:
SELECT
test_corners.pk_min_amount
,
test_corners.pk_min_duration
,
test_corners.corner_amount
,
test_corners.corner_duration
, COUNT(DISTINCT identical_corners.corner_amount,identical_corners.corner_duration)
AS nbr_identical_corners , COUNT(DISTINCT identical_edges.edge_start_amount,identical_edges.edge_end_amount,
identical_edges.edge_start_duration,
identical_edges.edge_end_duration) AS nbr_identical_edges , COUNT(DISTINCT definition_corners.boundary_amount,definition_corners.boundary_duration)
AS nbr_definition_corners, COUNT(DISTINCT definition_edges.boundary_start_amount,definition_edges.boundary_end_amount,
definition_edges.boundary_start_duration, definition_edges.boundary_end_duration) AS nbr_definition_edges FROM
( SELECT min_amount AS pk_min_amount , min_duration AS pk_min_duration, min_amount AS corner_amount , min_duration AS corner_duration FROM interestrates UNION SELECT min_amount AS pk_min_amount , min_duration AS pk_min_duration, max_amount AS corner_amount , min_duration AS corner_duration FROM interestrates UNION SELECT min_amount AS pk_min_amount , min_duration AS pk_min_duration, min_amount AS corner_amount , max_duration AS corner_duration FROM interestrates UNION SELECT min_amount AS pk_min_amount , min_duration AS pk_min_duration, max_amount AS corner_amount , max_duration AS corner_duration FROM interestrates ) test_corners LEFT JOIN /*Corners*/ ( SELECT min_amount AS pk_min_amount , min_duration AS pk_min_duration, min_amount AS corner_amount , min_duration AS corner_duration FROM interestrates UNION SELECT min_amount AS pk_min_amount , min_duration AS pk_min_duration, max_amount AS corner_amount , min_duration AS corner_duration FROM interestrates UNION SELECT min_amount AS pk_min_amount , min_duration AS pk_min_duration, min_amount AS corner_amount , max_duration AS corner_duration FROM interestrates UNION SELECT min_amount AS pk_min_amount , min_duration AS pk_min_duration, max_amount AS corner_amount , max_duration AS corner_duration FROM interestrates ) identical_corners ON ( test_corners.pk_min_amount, test_corners.pk_min_duration ) <> (identical_corners.pk_min_amount, identical_corners.pk_min_duration) AND ( test_corners.corner_amount, test_corners.corner_duration ) = (identical_corners.corner_amount, identical_corners.corner_duration) LEFT JOIN /*Edges*/ ( /*"Left"*/ SELECT min_amount AS pk_min_amount , min_duration AS pk_min_duration , min_amount AS edge_start_amount , min_duration AS edge_start_duration, min_amount AS edge_end_amount , max_duration AS edge_end_duration FROM interestrates UNION /*"Right"*/ SELECT min_amount AS pk_min_amount , min_duration AS pk_min_duration , max_amount AS edge_start_amount , min_duration AS edge_start_duration, max_amount AS edge_end_amount , max_duration AS edge_end_duration FROM interestrates UNION /*"Top"*/ SELECT min_amount AS pk_min_amount , min_duration AS pk_min_duration , min_amount AS edge_start_amount , min_duration AS edge_start_duration, max_amount AS edge_end_amount , min_duration AS edge_end_duration FROM interestrates UNION /*"Bottom"*/ SELECT min_amount AS pk_min_amount , min_duration AS pk_min_duration , min_amount AS edge_start_amount , max_duration AS edge_start_duration, max_amount AS edge_end_amount , max_duration AS edge_end_duration FROM interestrates ) identical_edges ON ( test_corners.pk_min_amount, test_corners.pk_min_duration ) <> (identical_edges.pk_min_amount, identical_edges.pk_min_duration) AND ( /*Test Corner on "vertical" edge*/ ( identical_edges.edge_start_amount = identical_edges.edge_start_amount AND test_corners.corner_amount = identical_edges.edge_start_amount AND test_corners.corner_duration > identical_edges.edge_start_duration AND test_corners.corner_duration < identical_edges.edge_end_duration ) OR /*Test Corner on "horizontal" edge*/ (
identical_edges.edge_start_duration =
identical_edges.edge_start_duration
AND test_corners.corner_duration = identical_edges.edge_start_duration AND test_corners.corner_amount > identical_edges.edge_start_amount AND test_corners.corner_amount < identical_edges.edge_end_amount ) ) LEFT JOIN /*Definition Corners*/ ( SELECT 0 AS boundary_amount, 1 AS boundary_duration FROM dual UNION SELECT 0 AS boundary_amount, 36 AS boundary_duration FROM dual UNION SELECT 20000 AS boundary_amount, 1 AS boundary_duration FROM dual UNION SELECT 20000 AS boundary_amount, 36 AS boundary_duration FROM dual ) definition_corners ON test_corners.corner_amount = definition_corners.boundary_amount AND test_corners.corner_duration= definition_corners.boundary_duration LEFT JOIN /*Definition Edges*/ ( /*Left*/ SELECT 0 AS boundary_start_amount , 1 AS boundary_start_duration, 0 AS boundary_end_amount , 36 AS boundary_end_duration FROM dual UNION /*Right*/ SELECT 20000 AS boundary_start_amount , 1 AS boundary_start_duration, 20000 AS boundary_end_amount , 36 AS boundary_end_duration FROM dual UNION /*Top*/ SELECT 0 AS boundary_start_amount , 36 AS boundary_start_duration, 20000 AS boundary_end_amount , 36 AS boundary_start_duration FROM dual UNION /*Bottom*/ SELECT 0 AS boundary_start_amount , 1 AS boundary_start_duration, 20000 AS boundary_end_amount , 1 AS boundary_end_duration FROM dual ) definition_edges ON /*Test corner on "vertical" edge*/ ( definition_edges.boundary_start_amount = definition_edges.boundary_end_amount AND test_corners.corner_amount = definition_edges.boundary_start_amount AND test_corners.corner_duration > definition_edges.boundary_start_duration AND test_corners.corner_duration < definition_edges.boundary_end_duration ) OR /*Test corner on "horizontal" edge*/ ( definition_edges.boundary_start_duration = definition_edges.boundary_end_duration AND test_corners.corner_duration = definition_edges.boundary_start_duration AND test_corners.corner_amount > definition_edges.boundary_start_amount AND test_corners.corner_amount < definition_edges.boundary_end_amount ) GROUP BY test_corners.pk_min_amount , test_corners.pk_min_duration, test_corners.corner_amount , test_corners.corner_duration ORDER BY test_corners.pk_min_amount , test_corners.pk_min_duration, test_corners.corner_amount , test_corners.corner_duration;
Almost there! Just another wrapper and adding the final check if there is a hole in the definitions or not:
SELECT pk_min_amount , pk_min_duration , nbr_identical_corners , nbr_identical_edges , nbr_definition_corners, nbr_definition_edges , 1 + nbr_identical_corners + 2*nbr_identical_edges+3*nbr_definition_corners+2*nbr_definition_edges
check_sum
FROM
( SELECT
test_corners.pk_min_amount
,
test_corners.pk_min_duration
,
test_corners.corner_amount
,
test_corners.corner_duration
, COUNT(DISTINCT identical_corners.corner_amount,identical_corners.corner_duration)
AS nbr_identical_corners ,
COUNT(DISTINCT identical_edges.edge_start_amount,identical_edges.edge_end_amount,
identical_edges.edge_start_duration,
identical_edges.edge_end_duration) AS nbr_identical_edges , COUNT(DISTINCT definition_corners.boundary_amount,definition_corners.boundary_duration)
AS nbr_definition_corners,
COUNT(DISTINCT definition_edges.boundary_start_amount,definition_edges.boundary_end_amount,
definition_edges.boundary_start_duration, definition_edges.boundary_end_duration) AS nbr_definition_edges
FROM ( SELECT min_amount AS pk_min_amount , min_duration AS pk_min_duration, min_amount AS corner_amount , min_duration AS corner_duration FROM interestrates UNION SELECT min_amount AS pk_min_amount , min_duration AS pk_min_duration, max_amount AS corner_amount , min_duration AS corner_duration FROM interestrates UNION SELECT min_amount AS pk_min_amount , min_duration AS pk_min_duration, min_amount AS corner_amount , max_duration AS corner_duration FROM interestrates UNION SELECT min_amount AS pk_min_amount , min_duration AS pk_min_duration, max_amount AS corner_amount , max_duration AS corner_duration FROM interestrates ) test_corners LEFT JOIN /*Corners*/ ( SELECT min_amount AS pk_min_amount , min_duration AS pk_min_duration, min_amount AS corner_amount , min_duration AS corner_duration FROM interestrates UNION SELECT min_amount AS pk_min_amount , min_duration AS pk_min_duration, max_amount AS corner_amount , min_duration AS corner_duration FROM interestrates UNION SELECT min_amount AS pk_min_amount , min_duration AS pk_min_duration, min_amount AS corner_amount , max_duration AS corner_duration FROM interestrates UNION SELECT min_amount AS pk_min_amount , min_duration AS pk_min_duration, max_amount AS corner_amount , max_duration AS corner_duration FROM interestrates ) identical_corners ON ( test_corners.pk_min_amount, test_corners.pk_min_duration ) <> (identical_corners.pk_min_amount, identical_corners.pk_min_duration) AND ( test_corners.corner_amount, test_corners.corner_duration ) = (identical_corners.corner_amount, identical_corners.corner_duration) LEFT JOIN /*Edges*/ ( /*"Left"*/ SELECT min_amount AS pk_min_amount , min_duration AS pk_min_duration , min_amount AS edge_start_amount , min_duration AS edge_start_duration, min_amount AS edge_end_amount , max_duration AS edge_end_duration FROM interestrates UNION /*"Right"*/ SELECT min_amount AS pk_min_amount , min_duration AS pk_min_duration , max_amount AS edge_start_amount , min_duration AS edge_start_duration, max_amount AS edge_end_amount , max_duration AS edge_end_duration FROM interestrates UNION /*"Top"*/ SELECT min_amount AS pk_min_amount , min_duration AS pk_min_duration , min_amount AS edge_start_amount , min_duration AS edge_start_duration, max_amount AS edge_end_amount , min_duration AS edge_end_duration FROM interestrates UNION /*"Bottom"*/ SELECT min_amount AS pk_min_amount , min_duration AS pk_min_duration , min_amount AS edge_start_amount , max_duration AS edge_start_duration, max_amount AS edge_end_amount , max_duration AS edge_end_duration FROM interestrates ) identical_edges ON ( test_corners.pk_min_amount, test_corners.pk_min_duration ) <> (identical_edges.pk_min_amount, identical_edges.pk_min_duration) AND ( /*Test Corner on "vertical" edge*/ ( identical_edges.edge_start_amount = identical_edges.edge_start_amount AND test_corners.corner_amount = identical_edges.edge_start_amount AND test_corners.corner_duration > identical_edges.edge_start_duration AND test_corners.corner_duration < identical_edges.edge_end_duration ) OR /*Test Corner on "horizontal" edge*/ (
identical_edges.edge_start_duration =
identical_edges.edge_start_duration
AND test_corners.corner_duration = identical_edges.edge_start_duration AND test_corners.corner_amount > identical_edges.edge_start_amount AND test_corners.corner_amount < identical_edges.edge_end_amount ) ) LEFT JOIN /*Definition Corners*/ ( SELECT 0 AS boundary_amount, 1 AS boundary_duration FROM dual UNION SELECT 0 AS boundary_amount, 36 AS boundary_duration FROM dual UNION SELECT 20000 AS boundary_amount, 1 AS boundary_duration FROM dual UNION SELECT 20000 AS boundary_amount, 36 AS boundary_duration FROM dual ) definition_corners ON test_corners.corner_amount = definition_corners.boundary_amount AND test_corners.corner_duration= definition_corners.boundary_duration LEFT JOIN /*Definition Edges*/ ( /*Left*/ SELECT 0 AS boundary_start_amount , 1 AS boundary_start_duration, 0 AS boundary_end_amount , 36 AS boundary_end_duration FROM dual UNION /*Right*/ SELECT 20000 AS boundary_start_amount , 1 AS boundary_start_duration, 20000 AS boundary_end_amount , 36 AS boundary_end_duration FROM dual UNION /*Top*/ SELECT 0 AS boundary_start_amount , 36 AS boundary_start_duration, 20000 AS boundary_end_amount , 36 AS boundary_start_duration FROM dual UNION /*Bottom*/ SELECT 0 AS boundary_start_amount , 1 AS boundary_start_duration, 20000 AS boundary_end_amount , 1 AS boundary_end_duration FROM dual ) definition_edges ON /*Test corner on "vertical" edge*/ (
definition_edges.boundary_start_amount = definition_edges.boundary_end_amount
AND test_corners.corner_amount = definition_edges.boundary_start_amount AND test_corners.corner_duration > definition_edges.boundary_start_duration AND test_corners.corner_duration < definition_edges.boundary_end_duration ) OR /*Test corner on "horizontal" edge*/ (
definition_edges.boundary_start_duration =
definition_edges.boundary_end_duration
AND test_corners.corner_duration = definition_edges.boundary_start_duration AND test_corners.corner_amount > definition_edges.boundary_start_amount AND test_corners.corner_amount < definition_edges.boundary_end_amount ) GROUP BY test_corners.pk_min_amount , test_corners.pk_min_duration, test_corners.corner_amount , test_corners.corner_duration ORDER BY test_corners.pk_min_amount , test_corners.pk_min_duration, test_corners.corner_amount , test_corners.corner_duration ) test_result
WHERE 1 + nbr_identical_corners + 2*nbr_identical_edges +3*nbr_definition_corners+2*nbr_definition_edges <> 4 ;
Done!
My only problem is: This seems pretty complicated for something I
would consider a standard task of checking data consistency. So does
anybody have any comments or know an easier way to perform such a
test?
I am looking forward to your answers.
Best,
Hans