Kenneth Downs
Advanced Algorithm: Sequencing Dependencies
Some database applications require you to perform a series of actions where you know only that some actions must be performed before others. Before you can perform the actions, you must work out a safe sequence that takes into account all of the dependencies. This week in The Database Programmer we will see an algorithm for doing this.
ExamplesThere are many examples where a programmer must work out dependencies before doing something.
A manufacturing package may track many steps in the manufacture of an item. Some steps cannot be performed until others are complete. A simple system would require the end-user to work out the entire process, but a better system would let the user enter only the dependencies: which processes require others to be complete. In this kind of system the computer can be used to schedule manufacturing tasks.
All popular Linux distributions have a package installation system in which each package lists its required dependencies. If you want to install a large number of packages in one shot, producing a tangled bunch of related dependencies, today's algorithm can be used to work them all out.
If you are using a data dictionary to build tables, every foreign key represents a dependency, where the child table requires the parent table to exist before it can be built. Today's algorithm can be used to sequence the tables and build them in order.
Another database example is generating code to perform calculations. Some calculations will depend on previous calculations, so your code generator must be able to sequence them all so that the calculations are performed in the proper order.
Big Words: Directed Acyclic GraphThe examples abvoe are all cases of what mathematicians call a Directed Acyclic Graph. If you do not want to read the entire Wikipedia article, the main points are these:
- We have a set of items. These can be anything you are keeping track of in your database.
- Any item may be connected to zero or more other items.
- The connection is one-way only. So if we say A requires B, we are not saying that B also requires A (in fact it is forbidden).
- There can be no loops (cycles). If A requires B, B may not require A. Further, if A requires B, and B requires C, C may not require A.
Whenever I can, I like to point out that it is very useful to read up on the mathematical foundations of certain programming techniques. We can often pick up very useful insights from those who think of these things at the most abstract level. It is also much easier to get advice from the more abstract-minded database people if you are at least marginally familiar with the mathematical terms.
The TablesSo now let us proceed to the tables and the code. The tables below show a data dictionary that will be used to generate DDL to build a database:
Table: TABLES TABLE | DESCRIPTION | SEQUENCE ------------+------------------------+--------- ORDERS | Sales Orders Headers | ? ORDER_LINES | Sales order lines | ? CUSTOMERS | Customers | ? ITEMS | Items | ? Table: DEPENDENCIES CHILD_TABLE | PARENT_TABLE -------------+--------------- ORDERS | CUSTOMERS ORDER_LINES | ORDERS ORDER_LINES | ITEMS
The problem here is knowing the safe order in which to build the tables. If I try to build ORDER_LINES before I have built ITEMS, then I cannot put a foreign key onto ORDER_LINES, because ITEMS is not there. In short, I need to know the value of the SEQUENCE column in the example above.
The Expected AnswerThe example above is simple enough that we can work it out by hand. This is actually a good idea, because we want to get an idea of what the answer will look like:
TABLE | DESCRIPTION | SEQUENCE ------------+------------------------+--------- ORDERS | Sales Orders Headers | 1 ORDER_LINES | Sales order lines | 2 CUSTOMERS | Customers | 0 ITEMS | Items | 0
This answer should be self-explanatory, except maybe for the fact that both CUSTOMERS and ITEMS have the same value. We need to look at that before we can see the code that produces it. Is it OK that two entries have the same value, and how would our program handle that?
The short answer is that it is perfectly OK and natural for two or more entries to have the same value. All this means is that they can be done in any order relative to each other, so long as they are done before the other entries.
In terms of the example, where we want to build these tables in a database, it means that:
- We would query the list of tables and sort by SEQUENCE
- We would loop through and build each table
- We don't care about ITEMS and CUSTOMERS having the same value, they get built in whatever which-way the server gives us the list.
The same concept applies to the other potential examples: manufacturing, software packages, and generating calculations. So long as you follow the sequence, we don't care about items that have the same value.
Stating the Solution in Plain EnglishWe are now ready to work out a program that will generate the SEQUENCE column. The basic steps the program must perform are:
- Initialize the column to -1. A value of -1 means "Not sequenced."
- Update the column to zero for all items that have no dependencies.
- Repeat the following action until the affected rows are zero: Update the SEQUENCE column to 1 (then 2, then 3) for all rows that have all of their dependencies sequenced already.
- Once the command in step 3 is no longer affecting any rows, check for any rows that have -1, these are involved in circular dependencies and we cannot proceed until the user straightens them out.
The first step is very easy, we initialize the table with this command:
UPDATE TABLES SET SEQUENCE = -1;
The next step is also very easy, we mark with a '0' all of the tables that have no dependencies. The basic idea is to find all of the entries that have no entries in DEPENDENCIES.
UPDATE TABLES SET SEQUENCE = 0
WHERE NOT EXISTS (SELECT child FROM DEPENDENCIES
WHERE child = TABLES.TABLE)
Now for the hard part. We now have to execute a loop. On each pass of the loop we are looking for all items whose dependencies have all been sequenced. We will do this over and over until the command is not affecting any rows. It is important that we cannot exit the loop by testing if all rows are sequenced, because a circular dependency will prevent this from happening and we will have an infinite loop.
You can control this loop from client code, but I wrote mine as a Postgres stored procedure. This algorithm turns out to be surprisingly complicated. The UPDATE command below may not be all that self-explanatory. What it works out is:
- Get a list of child tables from the DEPENDENCIES table
- JOIN through to TABLES to look at the SEQUENCE value of their parents.
- Group and check that the minimum value is greater than zero, if it is it means all parents are sequenced and the table can be sequenced.
- Update the SEQUENCE value for the tables we found
CREATE OR REPLACE FUNCTION zdd.Table_Sequencer() RETURNS void AS
$BODY$
DECLARE
-- Note that rowcount is initialized to be > 0, this makes
-- the loop work properly
rowcount integer := 1;
-- This tracks the value we are assigning to SEQUENCE. We
-- initialize it to 1 because we already took care of the
-- the rows that have value 0
lnSeq integer := 1;
BEGIN
while rowcount > 0 LOOP
UPDATE tables set SEQUENCE = lnSeq
FROM (SELECT t1.CHILD
FROM DEPENDENCIES t1
JOIN TABLES t2 ON t1.PARENT = t2.TABLE
GROUP BY t1.CHILD
HAVING MIN(t2.SEQUENCE) >= 0
) fins
WHERE TABLES.TABLE = fins.CHILD
AND TABLES.SEQUENCE = -1;
lnSeq := lnSeq + 1;
GET DIAGNOSTICS rowcount = ROW_COUNT;
END LOOP;
RETURN;
END;
$BODY$
LANGUAGE plpgsql;
The stored procedure above will stop executing once the UPDATE command is no longer having any effect. Once that happens, your final step is to make sure that all rows have a valid SEQUENCE value, which is to say that no entry has SEQUENCE of -1. If any of the rows have that value then you have a circular dependency. You must report those rows to the user, and you can also report the dependencies that are causing the loop.
ConclusionSequencing dependencies is a fundamental algorithm that has a lot of use cases in database applications. It is easy enough to accomplish, but the innermost UPDATE command can be a little puzzling when you first look at it. Once you have mastered this algorithm you are on the way to the "big leagues" of database applications such as ERP, MRP and others.
Javascript As a Foreign Language
So you know 37 different programming languages, you've programmed moon landers, missiles and toasters, and how could Javascript be any problem? Then you start trying to code up some Javascript and find that it just does not feel right, nothing seems to flow naturally or easily. Your instincts do not seem to guide you. You are not alone, here is your cheatsheet...
Welcome to the Database Programmer blog. If you are trying to write database applications in 2008 then you most likely bump into Javascript. My hope in this week's essay is to provide a "soft landing" into this beautiful and powerful but somewhat strange language.
To see the other essays in this series, consult our Complete Table of Contents.
ContentsToday's essay is rather long. It covers extremely basic ideas but proceeds directly to very powerful techniques. I have provided a summary here so that you can skip over the material that may already be familiar to you.
- Firefox and Firebug
- Execution
- Variable Scope
- Adding Methods to Core Javascript
- Functions as First Class Citizens
- Objects And Classes
- Creating An Object Without A Class
- Accessing Object Properties and Methods
- Iteration
- JSON and Ajax
- Synchronous Ajax: S-JSON
- JQuery and Friends
In case you have been living under a rock for the past few years, let me tell you to do your development in a real web browser, that is, Firefox, and to immediately download the Firebug extension. Firebug more or less does everything you need to debug Javascript, and it has many features you may not even know you need. Do not try to develop Javascript without Firebug.
In particular, firebug has a "console" object that you can send messages to, such as this:
console.log("this is so much better than alert()!");
for(var x = 1; x<10; x++) {
console.log("We are on "+x);
}
Execution
Javascript executes while your page is being loaded, and can be placed anywhere on the page. While I make no claims that the example below is good or bad practice, it does illustrate how Javascript executes.
<html>
<head>
<script>
// Script is executing as it is encountered,
// so this variable comes into existence
// immediately
var x = 5;
function square(x) {
return x * x;
}
// Now that the square function is defined,
// we can call it
var y = square(x);
</script>
</head>
<body>
<h1 id='h1'>Here is a Javascript Example!</h2>
<script>
// Script can be embedded directly in the
// body of your HTML (for better or worse!)
var h1 = document.getElementById('h1');
h1.innerHTML = 'I changed the H1 content!';
// This function can be used anywhere downstream
function changeH1(newText) {
var h1 = document.getElementById('h1');
h1.innerHTML = newText;
}
</script>
</body>
<div>Here is a div of text</div>
<script>
changeH1("Changing H1 yet again!");
</script>
Variable Scope
Scoping in Javascript is pretty straightforward. If you assign a value to a variable outside of any function it becomes a global. If you explicitly define a variable as "window.x = 5" it becomes a global. If you put the keyword var in front of it before using it it becomes local (and can mask a global variable of the same name). You can use the "var" keyword inside of loops, and many javascript programmers use "var" everywhere. Here is an example.
<html>
<head>
<script>
// We are executing outside of a function, so
// both of these are globals:
var x = 5;
y = 10;
function example() {
// Since a global named 'x' exists, and we do
// not use the "var" keyword, we are re-assigning
// the global variable
x = 7;
// Using the "var" keyword makes a local variable,
// we cannot "see" the global x anymore
var x = 2;
alert(x);
// I can still access the global variable to
// set its value back:
window.x = 5;
alert(x);
alert(window.x);
}
</script>
</head>
Adding Methods to Core Javascript
Javascript lacks certain functions that are very useful to have, such as trimming spaces from strings. One very cool thing about Javascript is that you can directly add these methods to the core language, by adding functions to the "prototype" object of the core classes. Here is how you add a "trim" function to core Javascript.
String.prototype.trim = function() {
return this.replace(/^\s+|\s+$/g,"");
}
x = " abc ";
alert('-' + x + '-'); // the dashes let you see the spaces
alert('-' + x.trim() + '-'); // spaces removed!
When I first saw this trick I dutifully copy-n-pasted it in and it worked, but the syntax looked very perplexing, I could not figure out how to make use of it myself. My brain had not yet wrapped itself around the Javascript mentality. This leads directly to our next concept, that functions are "first class cizitens".
Functions as First Class CitizensYou may have heard that Javascript treats functions as "first class citizens" and wondered, "what does that mean?" The best way to explain it in terms of other languages is that you can create functions on the fly and pass them around like variables. This may be a little hard to grasp, so we will go directly to examples.
// Most languages support this type of function definition
function square(x) {
return x * x;
}
// Javascript gives you a slightly different syntax if
// you like, which can be extremely powerful
var square = function(x) {
return x * x;
}
// The books usually go on to an example like this,
// which frankly did not seem to me to have any purpose:
y = x;
alert( y(5) );
The basic idea to get here is that you can do anything with a function that you can do with a variable. There are multiple uses for this, but we have already seen one, namely, the ability to add a method to a previously created class. This is what we did above when we added the "trim()" method to the base "String" class. This means that our approach to building class hierarchies is very different than in other Object-oriented languages like PHP, Foxpro, Delphi, VB and so forth.
// This example shows two different ways to add methods
// to HTML elements and make them act more object-oriented.
// Method 1, make a function that makes an INPUT read-only
// by changing its style and setting a property. Notice
// the code refers to "this" as if it were part of an
// object, see below to see why that works.
function makeReadOnly() {
this.className = 'readOnly';
this.readOnly = true;
}
// Now attach that function to a DOM element (an HTML INPUT)
var input = document.getElementById('myInput');
input.makeReadOnly = makeReadOnly;
// Some other code can now tell the input to go into
// read only mode:
function changeModes() {
var input = document.getElementById('myInput);
// When this executes, the "this" variable in
// the function will refer to "input"
input.makeReadOnly();
}
There is another way to do this as well, that really illustrates how to make use of Javascript's native abilities:
// Method 2 is to defne the function while adding it
// to the INPUT element.
var input = document.getElementById('myInput');
input.makeReadOnly = function() {
this.className = 'readOnly';
this.readOnly = true;
}
// This code works exactly as it did above
function changeModes() {
var input = document.getElementById('myInput);
input.makeReadOnly();
}
Now that we have introduced this idea, it will come up all over the place in later examples.
Objects And ClassesWhen I first tried to use Javascript I kept looking for the "class" keyword, but it's not there! Believe it or not you use the "function" keyword to create what we would call a class in other languages. Here is an example of how to create and instantiate an object in Javascript:
// Here is a simple PHP class for an
// object that handles a row from a database
class dbRow {
var tableName = '';
var rowId = 0;
function dbRow(tableName,id) {
this.tableId = table;
this.fetchRow(id);
}
function fetchRow(id) {
# ...more code here
}
}
var x = new dbRow('customers',23);
In Javascript we make a function instead of a class:
function dbRow(tableName,id) {
// When the object is instantiated, this
// code runs immediately
this.tableName = tableName;
// We must define a fetchRow function before
// we can actually call it....
this.fetchRow = function(id) {
// some kind of ajax stuff going on here
}
// ...and now we can invoke the function
this.fetchRow(id);
}
// When this command returns we have a new "dbRow"
// object.
var x = new dbRow('customers',23);
Creating An Object Without a Class
We can say Javascript is "purely dynamic", by which we mean you can define anything on the fly, including ojects, even if you have no class definition (er, I mean no "function()" definition...). You can explicitly create an object by enclosing the definition in curly braces. Properties and their values are assigned with "name: value" syntax, separated by commas. Since you can do anything with a function that you can do with a variable, the following is a nifty way to create an object:
var x = {
propertyName: 'value',
otherProperty: 'otherValue',
square: function(x) {
return x * x;
}
// Don't put a comma after the last property!
// It will work in firefox but not in IE!
}
alert(x.square(5));
This syntax is called "JSON" by the way, for "Javascript Object Notation". If you can get comfortable with JSON you can start to code up some really elegant Javascript.
Accessing Object Properties and MethodsYou can hardcode references to an object's properties by using the ".property" syntax, but you can also use variables that hold the name of the property.
// Create an object
var x = {
first: 'Sax',
last: 'Russel',
combine: function() {
return this.first + ' ' + this.last;
}
}
// You can now explicitly access properties
alert (x.first);
alert (x.last);
// But you can also have a variable hold the name
// of the property you want:
var propName = 'first';
alert (x[propName]);
// Objects can be nested to any depth, and you can
// mix hardcoded and variable names. If we had a
// complex data dictionary stored on the browser,
// we might get the caption for a column like this:
var tableId = 'customers';
var columnId = 'total_sales';
var caption = dd[tableId].columns[columnId].caption;
This works also for functions. Assuming the same object as the above, we can invoke functions that are named by other variables:
var x = { .... repeated from above example };
var methodName = 'combine';
alert( x[methodName]() );
Iteration
As a database programmer I write a lot of code that iterates arrays and associative arrays. Iteration tends to be very important to database programmers, as it is the most natural way to loop through rows retrieved from a database, or to loop through the values in a row. Basic iteration of an array looks like this:
// make an array
var myList = [ 'sax', 'anne', 'nirgal', 'frank' ];
for(var idx in myList) {
// console.log() requires firebug
console.log("Index and value: "+idx+", "+myList[idx])
}
All of the action is in the line "for(var idx in myList)", this structure will loop through the array. On each pass the variable "idx" will contain the array's index number. To actually get the value you need you have to go looking for myList[idx].
Associate Arrays are a very natural data structure for a database programmer, as they are an easy way to represent a single row retrieved from the database. There is no explicit support for associative arrays in Javascript, but this does not matter because you can use an object and get the same results.
// Here is an object
var myObject = {
first: 'Sax',
last: 'Russel',
occupation: 'Physicist'
}
// Now we loop through it like an associative array
for(var key in myObject) {
console.log("The array key is: " + key);
console.log("The value is: " + myObject[key]);
}
JSON and Ajax
Nowadays everybody is jumping into AJAX with both feet. AJAX can be particularly useful to a database programmer, because you can make AJAX calls that return objects (including code), arrays, and database data.
I should note that the term "AJAX" itself means something very precise, being "Asynchronous Javascript and XML", while the example I am about to show contains no XML, so my use of the term is not strictly correct. Nevertheless, many people routinely use the term AJAX to mean any round-trip to the browser that fetches some fragment of information without doing an entire page refresh. While this is regrettable, I'm not going to try to buck that trend here.
That being said, here is a nifty way to use PHP to send a data structure back on an AJAX request:
# THE PHP CODE:
function someAjaxHandler() {
$table = myFrameworkGetPostRetrievalFunction('table');
$id = myFrameworkGetPostRetrievalFunction('id');
$row = myFrameworkRowRetrievalFunction("customers",23);
# This nifty PHP function encodes arrays and objects
# into JSON, very cool
echo json_encode($row);
}
This would be handled in the browser like so:
function someAjaxResponseHandler() {
if (this.readyState != 4) return;
try {
eval( 'window.requestData ='+this.responseText);
}
catch(e) {
alert("The server response was not parsable JSON!");
return;
}
}
Synchronous AJAX and JSON: S-JSON
It is pretty safe to say that the asynchronous nature of AJAX is a powerful part of its appeal. The request is sent and the browser remains responsive to the user until the request comes back. This is especially powerful for fetching things in the background while the user works.
However, in database applications sometimes it is the Right Thing for the browser to stop while fetching data. If a user clicks on [SAVE] on a CRUD screen to save some changes, we actually want the browser to wait until the server tells them that it all went ok (or not). You can do this by setting a flag on your call. I have found this a very powerful approach to writing desktop-in-browser applications:
function JSON(url) {
// Create an object
var browser = navigator.appName;
if(browser == "Microsoft Internet Explorer"){
var http = new ActiveXObject("Microsoft.XMLHTTP");
}
else {
var http = new XMLHttpRequest();
}
// The trick is to pass "false" as the third parameter,
// which says to not go asynchronously.
http.open('POST' , url, false);
http.send(null);
// Execution now halts, waiting for the complete
// return value
// Once execution resumes, we can capture the
// JSON string sent back by the server and do anything
// we want with it
try {
eval( 'window.requestData ='+http.responseText);
}
catch(e) {
alert("The server response was not parsable JSON!");
return;
}
// Processing of the result occurs here...
}
jQuery and Friends
Nowadays we have a growing list of very powerful Javascript libraries available. Some of them are very high level and some of them are low-level.
One library I will mention by name as being very useful is jQuery. This library provides a wealth of extremely simple and powerful abilities for finding and manipulating the HTML that is in your document. I highly recommend it.
Closing ThoughtsAny database programmer working in 2008 is either already required to use Javascript or may find himself facing it soon. Javascript is very flexible and powerful, but is different from the languages we are used to for writing applications, like PHP, Foxpro, Delphi, VB and others.
Nevertheless, Javascript can do everything you need it to do, you just have to grasp the "javascript mentality" as it were. I have attempted this week in this essay to put into one place all of the facts and tricks that were not so obvious to me from reading books or simply taking a snippet of code and trying to modify it. I hope that you find it useful!
Different Foreign Keys for Different Tables
A foreign key can be used to implement table design patterns that span multiple tables. By choosing how a foreign key handles a DELETE attempt on the parent table, you can structure your table designs to follow two standard patterns.
Welcome to the Database Programmer blog. This series of essays is for anybody who wants to learn about databases on their own terms. There is a complete Table of Contents, as well as a summary of Table Design Patterns. There is a new essay in this spot each Monday morning.
A Simple Example of Two Foreign KeysPicture a basic shopping cart, with its two basic tables of CART and CART_LINES (or ORDERS and ORDER_LINES if you are more old-fashioned). The table CUSTOMERS is also in there as a parent to CARTS. Our three tables look something like this:
CUSTOMERS
|
|
/|\
CART Cart is child of customers
|
|
/|\
CART_LINES Lines is child of Cart
There are two foreign keys here. CART has a foreign key to CUSTOMERS, and CART_LINES has a foreign key to CART, but the two foreign keys should behave very differently.
Table Types and Table Design PatternsIn A Sane Approach To Choosing Primary Keys we saw that table design begins with identifying the basic kinds of tables: Reference and Small Master Tables, Large Master Tables, Transactions, and Cross-References. Just as we picked different kinds of primary keys for the different tables, so will we pick different kinds of foreign keys between these tables.
Deleting a CustomerImagine you have a customer who has made 10 orders in 2 years. A system administrator, who is allowed to basically do anything, goes into your admin screens, looks up the customer, and clicks [DELETE]. What should happen?
The near-universal answer is that the user should be denied the action. An error should come back that says "That customer has orders, cannot delete." We want it this way because we never want to delete any parent row and "orphan" the child rows. Database programmers know from long experience that if you allow the DELETE, your queries will give incorrect results, or you will work extremely hard with lots of weird LEFT JOINS and UNIONS trying to get them to come back correctly.
This is not an issue of "flexibility", where a more robust system would allow the deletion. This is a basic question of record-keeping. If the customer has orders on file then the customer must be kept on file. Enforcing this rule keeps code clean and simple, and trying to avoid this rule in the name of "flexibility" just makes heaps of work for everybody.
Going further, the administrator in question, who supposedly can do anything, may not violate the rule. An administrator is simply somebody who can do anything that would not produce bad data. Administrators should not be given the ability to violate the basic structure of the data, they simply have full rights to do anything within the structure of the data.
The DELETE RESTRICT Foreign KeyThe behavior we want here is called DELETE RESTRICT. On most database servers this is the default behavior for a foreign key. It means that you cannot delete a parent table row if there are matching rows in the child table.
The DELETE RESTRICT pattern is almost universally used when the child table is a transaction table and the parent table is a master table or reference table.
The syntax looks something like this:
-- Most database servers implement DELETE RESTRICT
-- by default, so this syntax:
Create table CART (
customer integer REFERENCES customers
,order integer.....
)
-- ...is the same as this explicit syntax:
Create table CART (
customer integer REFERENCES customers
ON DELETE RESTRICT
,order integer.....
)
Deleting An Order and DELETE CASCADE
Now let us say a staff member is on the phone with a customer, enters an order, enters five lines, and then the customers says "forget it" and the user needs to delete the entire order from the CART.
In this case the user wants to go delete the order, and he expects the computer to also delete the lines. This makes perfect sense, why keep the lines if we don't want the order?
It may seem strange that in the case of deleting a customer it makes perfect sense to stop the user, but when deleting an order it makes perfect sense to delete the lines as well.
The difference is that an entry in the CART table is a transaction entry. When a user deletes a transaction they almost always want to automatically delete all of the relevant rows from all child tables as well. The two rules basically are:
- The user cannot delete a master entry that has transactions.
- Deleting a transaction means deleting the entire transaction.
NOTE: By "transaction" here I mean financial transaction or other interaction between master elements. I do not mean a database transaction.
The syntax for DELETE CASCADE looks something like this:
-- if the user deletes a row from CART,
-- do them the favor of deleting all of the
-- lines as well
Create table CART_LINES (
order integer REFERENCES CART
ON DELETE CASCADE
,order_line integer....
)
Conclusion: Different Tables Types, Different Foreign Key Types
I have said many times in these essays that the foreign key is the only meaningful way to connect data in different tables. This week we have seen that the kind of foreign key you choose depends on what kind of tables you are connecting together. Children of master tables generally get DELETE RESTRICT, and children of transaction tables generally get DELETE CASCADE.
History Tables
A history table allows you to use one table to track changes in another table. While the basic idea is simple, a naive implementation will lead to bloat and will be difficult to query. A more sophisticated approach allows easier queries and can produce not just information about single rows, but can also support aggregrate company-wide queries.
This week in the Database Programmer Blog we return to table design patterns with an essay on history tables. The basic premise of this blog is that good coding skills do not lead magically to good database skills -- you can only make optimal use of a database by understanding it on its own terms. There is a new essay each Monday, and there is a Complete Table of Contents and a List of Table Design Patterns. What to Put Into A History Table
Naive approaches to history tables usually involve making a complete copy of the original (or new) row when something changes in the source table. This turns out to be of little use, for reasons I will explain below. A much more useful approach is to track only a few columns and to store any combination of old values, new values, and differences. A history table designed this way can be tremendously useful.
We will start with the example of a sales order table, called ORDERS. The columns we are interested in might look like this:
ORDER | CUSTOMER | DATE | LINES | TAX | TOTAL | PAID | BALANCE ------+----------+----------+--------+-------+--------+--------+--------- 1234 | 9876 | 5/1/08 | 48.00 | 5.00 | 53.00 | 0 | 53.00 2345 | 9876 | 5/3/08 | 150.00 | 0 | 150.00 | 150.00 | 0 3456 | 5544 | 6/8/08 | 25.00 | 2.60 | 27.60 | 15.00 | 12.60 4567 | 3377 | 7/3/08 | 125.00 | 7.00 | 132.00 | 50.00 | 82.00
We first have to ask which columns must be copied into history so that we can link the history table back to the ORDERS table. The only column we need for tracking is ORDER (the order number), so the history table will always have an ORDER column.
We should also assume that the history table will contain at least a timestamp and a column to track the user who made the change, which brings us to a minimum of three columns.
Finally, it tends to be very useful to track what action caused the history entry, be it an INSERT, UPDATE, or DELETE. This brings us up to four minimum columns.
Next we ask which columns we will definitely not need. There are two groups of columns we will not need, which are 1) the columns that never change and 2) the columns we do not care about. Columns that do not change are likely to be the CUSTOMER and the DATE column. There is no need to bloat the history table with these valus because we can just get them out of the ORDERS table. The second group, columns we do not care about, are are usually things like ship-to address, maybe an email, and other information. Naturally there is no hard-and-fast rule here, it depends entirely upon the needs of the application.
So now we know what we definitely need and what we definitely do not need, and we are ready to begin work considering the columns that will change. Not surprisingly, these are usually all about the numbers. Next we will see how to track the numbers.
Tracking Changes to NumbersWhile it is certainly useful to store one or both of the old and new values for a number, it far more useful to store the change in the value, or the delta. Having this number in the history table makes for some really nifty abilities. If you store all three of the old, new, and delta, then you can more or less find out anything about the ORDER's history with very simple queries.
So we are now ready to consider what the history table might look like. We will take the case of an order that was entered by user 'sax', updated twice by two other users, and in the end it was deleted by user 'anne'. Our first stab at the history table might look like this:
ORDER | USER_ID | ACTION | DATE | LINES_OLD | LINES_NEW | LINES_DELTA ------+----------+--------+---------+-----------+-----------+------------- 1234 | sax | UPD | 5/1/08 | 0.00 | 48.00 | 48.00 1234 | arkady | UPD | 5/7/08 | 48.00 | 58.00 | 10.00 1234 | ralph | UPD | 6/1/08 | 58.00 | 25.00 | -33.00 1234 | anne | DEL | 6/4/08 | 25.00 | 0.00 | -25.00
I should note that if you keep LINES_OLD and LINES_NEW, then strictly speaking you do not need the LINES_DELTA columns. Whether or not you put it in depends on your approach to table design. If you framework allows you to guarantee that it will be correct, then your queries will be that much simpler with the LINES_DELTA column present.
You may wonder why there is no entry for the original INSERT. This is because you must enter an order before you can enter the lines, so the original value will always be zero. Only when lines start going in does the ORDER get any numbers. This is true for header tables, but it would not be true for detail tables like ORDER_LINES_HISTORY.
Some of the Obvious QueriesThere are few obvious queries that we can pull from the history table right away. These include the following:
-- Find the value of of the line items of an -- order as of June 1st SELECT LINES_NEW FROM ORDERS_HISTORY WHERE ORDER = 1234 AND DATE <= '2008-06-01' ORDER BY DATE DESC LIMIT 1; -- Find the original value of the line items, -- and the user who entered it. SELECT LINES_NEW, USER_ID FROM ORDERS_HISTORY WHERE ORDER = 1234 ORDER BY date LIMIT 1; -- Find the users who have worked on an order SELECT DISTINCT USER_ID FROM ORDERS_HISTORY WHERE ORDER = 1234;
Most of queries should be pretty obvious, and there are plenty more that will suggest themselves once you start working with the history tables.
Queries Involving the DeltaThe real power of the DELTA column comes into play when you are trying to compute back-dated values such as the company's total open balance on June 1, 2008. If you have a naive history table that stores only the old value or only the new value, this is truly a tortuous query to write, but if you have both then it is really quite easy.
-- Query to calculate the total open balance of all -- orders as of a given date SELECT SUM(BALANCE_DELTA) FROM ORDERS_HISTORY WHERE DATE <= '2008-06-01';
This magical little query works because paid orders will "wash out" of the total. Consider an order that is entered on May 20 for $200.00, and is then paid on May 23rd. It will have +200 entry in the BALANCE_DELTA column, and then it will have a -200.00 entry 3 days later. It will contribute the grand sum of zero to the total.
But an order entered on May 25th that has not been paid by June 1st will have only a +200 entry in the BALANCE_DELTA column, so it will contribute the correct amount of $200.00 to the balance as of June 1st.
If the company owner wants a report of his total open balances on each of the past 30 days, you can retrieve two queries and build his report on the client:
-- Get begin balance at the beginning of the period SELECT SUM(BALANCE_DELTA) as BEGIN_BALANCE FROM ORDERS_HISTORY WHERE DATE < '2008-06-01'; -- Get the total changes for each day. When you -- build the report on the client, add each day's -- change amount to the prior day's balance SELECT SUM(BALANCE_DELTA) AS BALANCE_DELTA FROM ORDERS_HISTORY WHERE DATE BETWEEN '2008-06-01' AND '2008-06-30' GROUP BY DATE;Keeping History Tables Clean
A clean history table is one that contains no unnecessary information. You normally do not want entries going into the history table if nothing relevant changed. So your history table mechanism should examine the columns it is tracking, and only make an entry to the history table if one of the columns of interest actually changed.
Problems With The Naive History TableA very basic history table will usually copy the entire original row from the source table into the history table whenever an INSERT, UPDATE or DELETE occurs. One simple problem is that you end up with bloated history tables. Because they are cluttered with unnecessary repititions, they are difficult to work with by inspection.
A much more serious technical problem with the naive approach is that it is horribly difficult to produce the queries demonstrated above. You must reproduce the concept of a delta by either running through all of the rows on the client, or you must make a difficult (and often impossible) JOIN of the history table to itself in which you connect each row to the row that came just before it. All I can say is, no thanks, I'll go with the delta.
History Table SecurityHistory tables always involve some concept of auditing, that is, keeping track of user actions. This means we need to protect against deliberate falsification of the history tables, which leads to two rules. First, a user must have no ability to directly DELETE rows from the history table, or they could erase the record of changes. Second, the user must have no ability to directly INSERT or UPDATE existing rows, because if they could they can falsify the history. These rules apply to both regular users and system administrators, the administrator must have no privelege to subvert or manipulate the history.
Since history tables have a tendency to become seriously bloated, there must be some priveleged group that can DELETE from the history tables, which they would do as a periodic purge operation. This group should have no ability to UPDATE the tables, because such priveleges would open a potential hole for subverting the history. Regular system administrators should not be in this group, this should be a special group whose only purpose is to DELETE out of the history tables.
If you are making use of DELTA columns, then stricly speaking you do not want to purge, but compress history tables. If you want to purge out all entries in 2005, you must replace them with a single entry that contains a SUM of the DELTA columns for all of 2005.
So to sum up, we have the following security rules for a history table:
- No system user should be able to DELETE from the history table.
- No system user should be able to UPDATE the history table.
- No system user should be able to directly control the INSERT into the history table.
- A special group must be defined whose only ability is to DELETE from the history table, so that the tables can be purged (or compressed) from time to time.
As always, you have your choice of implementing the history mechanism in the client code or in the database itself.
The best performing and most secure method is to implement history tables with triggers on the source table. This is the best way to implement both security and the actual business rules in one encapsulated object (the table). However, if you have no current practices for coding server-side routines, or you do not have a data dictionary system that will generate the code for you, then it may not be practical to go server-side for a single feature.
Implementing history tables in code has the usual benefit of keeping you in the language and habits you are most familiar with, but it means that you cannot allow access to your database except through your application. I cannot of course make a general rule here, this decision is best made by the design team based on the situation at hand and anticipated future needs.
ConclusionHistory tables have many uses. Beyond the obvious first use of finding indidivual values at some point in the past, well crafted tables can produce company-wide aggregations like total open balances on a given day, changes in booked orders on a day or in a range of days, and many other queries along those lines. Security is very important to prevent history tables from being subverted.
The Wonderful Awful Browser
When a desktop programmer tries to write database applications for the browser, he faces a great many challenges, both technical and cultural. Both sets of challenges appear because the browser and the web were invented for purposes different than our own. On the technical side we must reinvent huge amounts of functionality that we got "for free" with the old desktop systems of Foxpro, Delphi, VB Classic and so on, and on the cultural side we must wade through mountains of irrelevant or downright damaging advice that is aimed at people working on the next version of Facebook or eBay. In this essay we look at as many of these challenges as I can muster.
Why A Desktop Developer Would Move to The WebWhen the browser first appeared, it totally lacked the technical powers required to replace desktop applications. Nevertheless, some programmers immediately began to ponder how to move into the world of the browser. The reasons were simple then, are simple now, and have not changed:
- Far easier deployment -- nothing to install.
- Worldwide access -- businesses with multiple locations are suddenly much easier to take care of.
- You could now create a public website and give customers and vendors limited access to certain information.
- Operating System independence. This is far more of a reality now than we dreamed it might be in the darkest days of Microsoft's Total World Domination, but there were visionaries early on who saw the possibilities.
So there are many programmers, and I am one of them, who continue to work on the same kinds of applications we did before the web existed, but who now deploy these applications in the browser, for the reasons listed above. Here now is our tale of woe and sorrow!
The Cultural DivideWhile desktop programmers were scratching their heads and trying to figure out how to fit into this new world, a new generation of programmers was growing up who were perfecting this new platform and developing applications that were undreamed of before. Unfortunately, some of the good advice they dreamed up is either irrelevant or counter-productive to the database programmer who is deploying to the web.
The driving reality for the database application programmer is that her users are not surfing. They are using a dedicated program written for the purposes of the business they work for. Most of what the browser can do is either not necessary or positively in their way, and the browser lacks productivity tools that they took for granted in "the old system." This fact is central to the cultural divide between application programmers and web programmers.
The Infamous Back Button ProblemIf I surf over to www.osnews.com and click on an article, when I am finished I click "BACK" once or twice until I'm back to osnews, and then pick another article. But to the application user, who is not surfing but is using a dedicated program, who has clicked "New Patient", typed in the info, and clicked "Save", the back button is a positive menace. It is misleading and dangerous. This has led to who-knows-how-many rants from web programmers telling application programmers, "You don't understand the web, you shouldn't write it that way," in which the desparate application programmer replies only, "but you don't understand, I must have it work this way." The simple fact is that when a user is modifying data in a browser there is no concept of BACK. There may "UNDO" or "REVERT", but once the data is saved it is saved. This is why application programmers resort to trying to hide or disable the button, or why they think they should be able to modify the history (which of course they cannot do because that would be a huge security hole for public sites).
Ajax only Makes The Back Button WorsePicture a user on the customers screen, who then goes to the menu and picks the vendors screen. They work for five minutes on the vendors screen, and their wonderful snappy AJAX application is fetching search results and navigating from row to row and saving changes. Then they decide they made a mistake and hit [BACK] and wham! they're on the customer screen! It seems that the better the applications become, the worse the [BACK] button becomes. In my own shop we have finally decided to have the login program pop up a new window which does not have the [BACK] button or the address bar. This is considered heresy by web programmers (you don't understand the web! they cry) but of course what is true for them is not true for us, and vice-versa.
This also leads to much work. We must provide for such features as UNDO with no native support in the browser, and worse, with whatever native support the browser does have been intended for something totally different.
Your Application Must Work Without JavascriptThis is dead wrong for the application programmer. Application programmers have a power that is totally outside the experience of a pure web programmer: we can dictate system requirements to the customer. This led to many unhappy problems before the web, but with Firefox (and firebug!) we now have a platform that is free and robust. We simply install Firefox (or instruct the IT department to do so) and we have a platform that we know will support our application.
Keyboard ShortcutsNothing illustrates the divide between the web and the desktop like keyboard shortcuts.
When Windows 95 swept the office world (but before the web really came into its own), programmers developed a new term for applications that required constant use of the mouse: we called them "mousetraps". The worst kind of mousetrap program requires the user to constantly lift their hand from the keyboard and go to the mouse, then back again. This is fatiguing, confusing, and terribly counter-productive for the end user.
But the real problem is that the browser was born a mousetrap. From the perspective of the desktop programmer, keyboard shortcuts are clearly an afterthought, a "red-headed stepchild" as they say. Native HTML supports only the ACCESSKEY attribute, and recently Firefox was changed so that the default key combination is CTRL+ALT instead of ALT. This small change led me to finally realize that these folks, to put it mildly, have never lived in my world and haven't the slightest clue what my users need. I could expect no help from them on this front.
The solution for the web programmer is to remember my users are not surfing, they are using a dedicated program. Therefore it is the Right Thing for the application programmer to hijack the CTRL-N key and have it mean "New Patient" (or New Customer, New Vendor, etc) instead of opening a new browser window. Moroever, he must kill the CTRL-N so that it does nothing on a page where there is no [New] button. If he does not, then sometimes CTRL-N will create a new patient, and sometimes it will pop up a new window with my.yahoo.com! So the application programmer confidently rewires the "standard" browser keys and has happier customers for his effort.
Technical Problems In the Browser No Default FocusHave you ever gone to a website where the first thing you must do is log in, but the user id input does not have focus? That is a sure sign that the page was written by a web programmer with no desktop experience. When you put a database program into the browser, you expect the user to be typing constantly, looking things up, adding information, and so forth. So the application programmer must ensure that his first control always receives focus. Call it petty if you like, but without it your program becomes a mousetrap. Perfection comes by concentrating on these small things that either annoy or please users.
Tabbing Off to MarsDefault browser behavior is to allow the user to TAB through controls in the order they were created. This can be modified by explicitly assigning TABINDEX attributes to the control. However, when you get to the last control, the browser then Tabs you up to the menu, or the address bar, or anywhere else.
In a business application, where the user is not surfing the web, this is wrong. Tabbing out of the content area is equilivant to exiting the application, it throws the user into a context that they do not need and (sad to say) do not understand.
When I first began deploying business apps in the browser I would get calls saying, "it's frozen" or "i'm typing and nothings happening" and other such mysterious claims. Once I observed the users I realized they were "tabbing to Mars", the focus was up on the menu or in the address bar or somewhere else equally irrelevant. So we created the idea of the "Tab Loop", so that when the user hits TAB on the last control it loops back to the first. This completely ended those calls.
The Tower Of BabelDesktop programmers have a luxury undreamed of by web programmers: they can do all or nearly all of their programming in a single language, like classic Visual Basic, Foxpro, or Delphi (or heck, COBOL or 4GL!). Most of these programmers also know SQL, but it is not seen as a burden to learn it, it is just part of the job.
But when the application programmer moves to the web, he is confronted with at least four systems he must grasp if he is to perform as a master crafstman: (X)HTML, CSS, Javascript, and one or more server languages like Ruby, PHP, Java, etc. These different technologies all have syntaxes and philosophies that are different from each other and from past experience. All I can say is I'm glad I made the effort but I sure as heck hope I never have to make a change that dramatic again.
Let's Not Talk About StateWhen an application programmer moves to the web, he is confronted with the totally alien concept that he cannot maintain state. This idea has been discussed much in past years, and I suspect it may not be the problem it once was, as most of us have long since gotten past it. I did not want to leave it out completely, but it is way too large an issue to discuss in a paragraph and I doubt anyway that I could add to the wisdom that is already out there.
Lousy WidgetsThe HTML SELECT element stinks. Every serious application programmer either downloads a replacement or writes his own replacement. In the old world of the desktop we did not have to do such things.
The final piece of the puzzle in my shop was jQuery. The irony of jQuery is that it seems to me its core function is DOM traversal and manipulation, but its elegant simplicity has drawn creative minds to do things like create really nice widgets for entering time. In my shop we finished off our desktop-in-browser framework by using jQuery extensively, this solved lots of our lousy widgets problem.
Salvation In the New Javascript Frameworks?I should mention the new frameworks that are emerging for desktop-in-browser, such as extJS and Dojo, not to mention of course the Yahoo! User! Interface! Library!.
In my case I set out on the task of doing browser-based applications four years ago, when none of these technologies existed. So I developed my own simple browser-side framework. jQuery let me round off the rough edges, and now I have no need of a third-party framework. So I am afraid I cannot offer any experience in the use of these others.
But even so, if I could use somebody else's dedicated work and put my efforts elsewhere, that would only be wise, so I continue to watch them closely.
When Do Application Programmers Accept Advice From Web Programmers?Does all of this mean that we application programmers can learn nothing from web programmers? Of course not, that we be arrogant in the extreme. The answer is that when we enter the world of the public website, we must seek the advice and guidance of the experts in that world. So when I write the public portion of the application, the part that is visible to customers, vendors and other trading partners, I have to follow the standards of common practice: the back button has to work, Javascript should be optional, it should work on IE (arg), and so on and so forth.
Conclusion: When Worlds CollideThe desktop application programmer who decides to deploy business applications in the browser will face two broad challenges: techical and culturual, both of which stem from the origins of the web, which are so different from the origins of the desktop. The technical challenges tend to center around features of the browser that are either lacking or downright counter-productive, and this is made worse because the advice a programmer will receive comes from a culture whose goals are very different from his own. The database programmer who wants to deploy applications in the browser must be prepared to reproduce a lot of features we took for granted in the desktop, and he must also be prepared to filter through the received wisdom and throw out anything that does not meet the needs of his end-users.
Addendum: After reading a few comments on ycombinator.com I should probably stress that the job of getting the application into the browser is completely doable (in fact I've done it myself). The browser can now easily handle the job of desktop applications. The OP lists the hazards I and many other have faced and overcome in getting there. A big part of the conclusion is that it is a lot easier to get done if you recognize why it seems so hard: which is that you may be getting advice from people whose goals are very different form yours.
Database Performance: Pay Me Now or Pay Me Later
Many database performance decisions come down to "pay me now or pay me later." Some decisions will produce faster inserts and updates at the cost of slower and more complex reads, while other decisions will slow down inserts and updates but provide faster and easier reads.
Welcome to the 30th regular post in the Database Programmer blog, there is a new post each Monday morning. This blog is for anybody who wants to see practical examples of how databases work and how to create lean and efficient database applications. There is a that is updated each week, and a that is updated whenever a new design pattern is presented.
Paying With IndexesAn index speeds up SELECT operations. We are not going to go into any detail about how indexes work, this week we will stick to how indexes affect performance.
Imagine a table of 10,000 sales orders. You wish to pull out a handful of fields for orders placed on 5/1/08, so you issue this SELECT:
SELECT customer,order_total FROM orders WHERE date = '2008-05-01'
If you do not have an index on the table, then the database server will have to scan every single row in the table to find the rows that match the WHERE clause. On the other hand, if you had an index on the date column, the server would first read the index to find pointers to the correct rows, and then read only the rows you needed. The index itself is optimized by various methods so that only a very few reads are necessary to find the correct values. Most databases support the following syntax:
CREATE INDEX some_unique_name ON orders (date)
As far as performance goes, an index will slow down write operations (INSERT, UPDATE, and DELETE) because the index must be updated when the write operation occurs. This cost is on top of the write operation itself. (Addendum added July 7: Jochen points out correctly in his comment that this statement oversimplifies things. While it is true that the index must be updated for writes, the index can also dramatically speed up UPDATE and DELETE operations if those operations use a WHERE clause that can benefit from the index.)
In terms of "pay me now or pay me later", when you regularly add a lot of indexes you are opting to "pay me now." You pay the price of slower writes to get faster reads. If you regularly avoid adding any indexes you are opting to "pay me later." You defer the costs of access to read time to get faster writes.
I should note that it is not possible to completely avoid indexes, nor is there any value in trying to. For instance, a primary key requires an index because otherwise you have to scan the entire table every time you do an INSERT, which is just plain crazy. Foreign keys benefit from indexes as well for similar reasons.
Paying With ViewsA "view" is a stored SQL statement that you can SELECT from as if were a table. Imagine we have a table of TEACHERS and a table of COURSES that they are teaching in a particular year. We often need to display a list of courses with the names of the teachers. We can do this with a JOIN, but a view gives us an easier pre-defined way to do this:
CREATE VIEW courses_teachers AS
SELECT courses.room,courses.period,courses.teacher
,courses.year
,teacher.first_name,teacher.last_name
FROM courses
JOIN teachers ON courses.teacher = teachers.teacher
...which now lets you do the easier SELECT:
SELECT * FROM courses_teachers WHERE year='2008';
In terms of "pay me now or pay me later" a view is always a "pay me later" decision. It makes for easier coding but the server must go out on every SELECT and gather together the data required.
The "pay me later" nature of a VIEW meets its greatest extreme when the view contains aggregations. Consider the following view which gives you easy access to customers and their lifetime history of orders and payments:
CREATE VIEW customers_extended AS
SELECT customers.*
,SUM(orders.order_total) as orders_total
,SUM(invoices.balance) as balance
FROM customers
JOIN orders ON customers.customer = orders.customer
JOIN invoices ON customers.customer = invoices.customer
-- Pulling from the VIEW requires a complete read
-- of relevant ORDERS and INVOICES tables
SELECT * FROM customers_extended
WHERE customer = X;
This view is a "pay me later" proposition because every time you issue a SELECT from the view, it will have to scan many rows from the ORDERS and INVOICES tables. The contrasting method is to denormalize which is a "pay me now" approach.
Paying With DenormalizationDenormalizing means taking a normalized database and deliberately inserting redundant values. I have an essay on the three Denormalization Patterns that I use myself, which follow these three forms:
- FETCH operations, where a value such as an item's price is copied from the ITEMS table into the ORDER_LINES table.
- EXTEND operations, where you take the QTY and the PRICE columns in the ORDER_LINES table and write the EXTENDED_PRICE.
- AGGREGATE operations, such as writing the total of ORDER_LINES onto the ORDERS table.
All of these operations fall into the "pay me now" category. When these denormalized columns are put into tables, they add to the the size of the table and increase the cost of write operations. However, when it comes time to SELECT out of the tables the values are all there ready to go, usually with fewer JOINs and lower overall disk activity.
Extreme Pay Me NowIn my line of work I deal with line-of-business programs that are commissioned or purchased by businesses to do their daily work. User counts are low and resources are high, because often I will have 10 users on a single server, with access via internet limited to only a few thousand potential customers of which very few are ever on at the same time.
In this context, I prefer to take the "pay me now" approach to its fullest realization. This means I tend to design my systems so that:
- Any column a user is likely to filter on has an index.
- Tables are fully denormalized, containing a wealth of derived values.
This means that all write operations on my systems are slower than they might otherwise be. However, this is more than acceptable within this context because the server is largely untaxed, and users do not notice the difference between 100ms and 200ms to save a row. So I can pay when the user does not notice and as reward I have very rich reporting and lookup abilities.
The extreme pay-me-now approach has one more advantage. The wealth of derived values in the database lets end-users find what they are looking for without calling a programmer and asking for a special page or report. Generally the more derived values there are the truer this becomes.
Extreme Pay Me LaterThe extreme form of pay-me-later is a fully normalized database with no derived values and a minimum of indexes. Calculated values are available either in views, client-side code or both. This type of database is tuned for lots of fast writes because the cost of an INSERT or UPDATE has been kept to an absolute minimum. The database will be slower to perform ad-hoc or one-off queries because the server will have to do table scans whenever a user filters on anything except primary keys and foreign keys.
The lack of derived values in fully normalized databases also leads to more phone calls and emails asking the programmer to create a report or page that will work out derived values that are not present in the database.
Conclusion: Know Your ContextThis week we have taken common database technologies such as indexes and views and seen how they affect performance. All of these technologies can be judged in terms of the "pay me now or pay me later" decision.
Database programmers normally choose to "pay me later" when they must support a large number of simultaneous write operations with a minimum of contention. These situations call for fewer indexes and strict normalization. The trade-off is that ad-hoc or one-off queries will involve more JOINs, more table scans and an increased likelihood the programmer will be called in for special cases.
When read operations are more common than writes, or when inquiries and reports are likely to be unpredictable, database programmers will choose to "pay me now" by doing more work on the write operation. There will be more indexes and more denormalized values, so that the user is more likely to quickly locate whatever they want without programmer intervention.
Database Performance: The Web Layer
A database application is a like a convoy of ships, it is only as fast as the the slowest ship. The three "ships" in a web-based database application are the database itself, the web layer, and the browser. Today we will continue our series on performance by examining how the web layer can efficiently retrieve data from the database.
Welcome to the Database Programmer blog. This blog is for anybody who wants to see practical examples of how databases work and how to create lean and efficient database applications. There is a that is updated each week, and a that is updated whenever a new design pattern is presented.
Cost 1: Round TripsThe first basic cost of retrieving data is the "round trip". Database programmers speak of a "round trip" as occurring whenever you send a request the server and retrieve some results. Each round trip to the server carries some overhead, as the server must do some basic work to allocate and release resources at the start and end of the request. This overhead is added to the base cost the server must pay to actually go out to disk to find and retrieve your data.
If your application makes more round trips than are necessary, then the program will be slower than it could be.
Cost 2: Retrieval SizeEvery byte that the application retrieves from the server carries a cost at several points. The server must go to disk and read it, the wire must carry the load from the db server to the web server, and the web server must hold the result in memory. If your web code regularly retrieves more information than it needs, then the program will be slower than it could be.
This is why you will see advice that you should never use "SELECT *..." in direct queries, because it is near certain that you are retrieving data you will not use. The more desirable query names the exact columns you need so that you have maximum efficiency. This is especially important if your table contains text (aka clob) fields, if you use "SELECT *..." on one of those tables you risk pulling all kinds of data over the wire that is just going to be thrown away.
Example 1: One Round Trip By Using JOINConsider the basic case where you are retrieving and displaying the line items from an order (or shopping cart as people call it these days). Let us assume that you have an ORDER_LINES table that contains SKU, QTY, and PRICE among others. The item's description is in the ITEMS table. To display the lines, you must retrieve each line and also retrieve the item's description.
To do this most efficiently, we can make a single round trip to the server that retrieves all of the columns we need in one shot, then do a loop to render them like so (the example is in PHP):
# Assume some function that gives you the order number,
# sanitized for safe substition
$order = GetOrderNumber();
# Form the SQL
$sq="SELECT ol.sku,ol.price,ol.qty,ol.extended_price
,i.description
FROM ORDER_LINES ol
JOIN ITEMS i ON ol.sku = i.sku
WHERE ol.oder = $order";
# Most frameworks should have some command to retrieve
# all rows for a query, something like this:
$lines = SQL_AllRows($sq);
# Finally, render the HTML
foreach($lines as $line) {
#
# HTML rendering code here
#
}
I should stress that this example carries a reasonable expectation that the order is small enough that you don't start hitting the inefficiencies of your particular language. Rendering large results sets in a Web Application is severely problematic compared to the old desktop systems, and doing so requires separate techniques that will have to wait for a future essay.
Example 2: Caching Small TablesSometimes you will need to generate a printed report that involves many tables, including several description lookups. For instance, I have a medical application that generates statements for all patients who have a balance. A typical run will produce 100 or 200 statements, and each statement requires information from no less than 8 tables.
In cases like this you can simplify your queries by retrieving the small lookup tables in their entirety before going to the main query and loop. For the example of the medical program there are two tables that qualify for this treatment. These are the tables of "ICD9" codes and "CPT" codes. Both of these usually have only about 100 rows, and there are only 2 relevant columns in one and 3 in the other. Therefore there is a big gain to be had by simply loading them into RAM ahead of time and simplifying the resulting code.
This bare-bones example shows simply that the tables are loaded first, and then main execution begins.
# The function SQL_Allrows() gives me the complete result from
# a query, the 2nd argument says to return an associative
# array with key values made out of the named column.
# NOTE: an "icd9" code is a medical diagnosis code
$icd9codes = SQL_AllRows(
"Select icd9code,description from icd9codes"
,"icd9code"
);
# NOTE: a "CPT" code is a medical procedure code
$cptcodes = SQL_AllRows(
"select cptcode,description from cptcodes"
,"cptcodes"
);
# ...now continue by pre-fetching the list of patients
# we will be dealing with, and then we can finally
# go into the main loop and refer to the $icd9codes
# and $cptcodes array as needed.
#
$patients = SQL_AllRows(
"Select patient from patients where balance > 0
order by last_name,first_name"
);
foreach($patients as $patient) {
#
# retrieve the statement information, use
# arrays $cptcodes and $icd9codes to display
# descriptions for those codes
#
}
Knowing Your Context
There is one more piece of the puzzle that a programmer must have if he is to make wise decisions when trying to balance round trips and retrieval size. This is a thorough knowledge of your context. Knowing your context can dramatically help in making decisions.
Some examples of context are:
- Huge social networking site or portal with hundreds of hits per second.
- eCommerce site.
- Line of business program used by the staff of a company to do their daily work.
My own context is the third item, line of business applications. In this context the following realities hold:
- A huge user base might be a few hundred, with never more than five or six simultaneous transactions going on.
- A much more common user base is 10-20 users (or even 3 or 4!), with one transaction every 5-20 seconds.
- The public website accessed by customers is limited to a few thousand potential users, of which you rarely if ever have two or more users on at the same time.
In this context I have a wealth of server resources, because my customer can spend as little as $1500.00 and get a server with more RAM than 10-20 users will ever use at the same time. Therefore, my own coding habits often tend toward caching lookup tables and pulling 300 rows into memory at one shot so that I can get them to the screen (or PDF, or CSV...) as fast as possible. But these decisions are guided by the context of my applications, if your context is different, you may be led to different conclusions.
ConclusionIt is not difficult to create database applications that perform well. The basic rules of thumb are to make a minimum number of round trips to the server and to retrieve precisely the values that you need and no more. These ideas work well because they minimize your most expensive operation, which is disk access.
It is also perfectly acceptable to denormalize your tables (following Denormalization Patterns) which simplifies your queries and reduces JOIN operations. Finally, you must know your context well, so that you can evaluate techniques such as caching lookup tables.
These ideas form the cornerstone of most performance optimization and you will find that applying them over and over rigorously will give you most of what you need to keep performance strong in the web layer.
Database Performance 1: Huge Inserts
The modern database server provides a wealth of features that provide robust and reliable storage. Understanding how these features work is vital if you want fast performance for your databases. This week we begin a series on performance by looking at "ACID" compliance and how it affects our handling of large operations.
Welcome to the Database Programmer blog. This blog is for anybody who wants to see practical examples of how databases work and how to create lean and efficient database applications. There is a that is updated each week, and a that is updated whenever a new design pattern is presented.
What is ACID ComplianceThe modern database provides a set of features known as ACID compliance which make the database very robust. To paraphrase the Wikipedia article, ACID compliance means that:
- Each transaction is Atomic. It is completed in its entirety or not at all.
- The database is always Consistent. No user ever sees the intermediate and possibly invalid state of the database while your transaction is in progress.
- Each transaction is isolated. Your changes do not get mixed up with other people's changes, even though they are executing at the same time (see more in the Wikipedia article on Serializability.)
- The transaction is durable. Once the database says the job is complete without errors, you are assured the database has checked all constraints and keys and the transaction is completely valid. In most cases we also take this to mean the data is safely on disk and pulling the plug will not corrupt it.
Consider the case where you are creating a new system that must load 1 million rows into a single table from an older system. You go to your server's manual and find the command to do so (For PostgreSQL it is COPY..., for SQL Server it is BULK INSERT...). You work out the painful details of the command with a test file of 10 rows and get the syntax down. Then you issue the command with the real file of 1 million rows. After a minute or two you realize, well, 1 million rows is a lot, time to get some coffee. Returning with the coffee you see that it is still not finished, so time to check some email. An hour later it is still not done, and when you leave it running overnight and come back in the morning your machine is frozen.
The problem here is simply one of numbers. One million rows of average length of 100 characters (in ASCII or UTF-8) will be about 100 megabytes. The server must do no less than maintain two completely separate states for database -- one state without your rows and one with your rows. The cost of this is several times the actual size of the input data, so the 1 million rows in this example will take several hundred megabytes of resources, at least! The server will be managing this process on both disk and in RAM. This will simply die on a laptop or development workstation, even one with a gig or two of RAM.
You can maybe go out and buy some RAM, but the purpose of this essay is to explain how to deal with those inevitable cases where the operation you are performing requires more resources than you have. This situation will always come up, so it is good to know how to deal with it.
Step 1: Drop Indexes and KeysACID compliance extends to indexes as well. When you INSERT many thousands or millions of rows to a single table in one shot, the server must maintain two separate versions of each index. This burden is laid on top of the burden of calculating the index keys for every single row one-by-one. If we began with a burden of several hundred megabytes of resources, just a few indexes on your table could end up more than doubling that.
This is why you will see advice on mailing lists and forums to drop indexes before doing large insert operations.
Your table will have one index automatically for the primary key, so you must drop the primary key. You will also want to drop foreign keys so that you do not waste time checking them (they also have indexes). Unique constraints also end up creating indexes automatically behind the scenes, so you must drop those. Other constraints must also be dropped to prevent having them checked for every single one of your million rows. Finally, you must drop all indexes you created yourself. After the load is complete, you must recreate these keys, indexes, and constraints.
In some cases, where your load is small enough, this may be enough to get predictable load times, and you can stop here. But the larger the operation, the more likely that this will not be enough. In those cases, there is another step to take.
Step 2: ChunksThe basic problem described above is that the database performance has gone non-linear. When you double the number of rows, it does not take twice as long, but four times (or 3 or 10 or whatever). When you multiply the rows by 10, it may not take 10 times as long, you might see it take 100 times as long, or more! (Or maybe you just killed it after you came back in the morning and your workstation was frozen).
We can restore linear performance if we break the input into chunks and load them one at a time in succession. You break the input into files that are small enough so that no individual file will send the server into non-linear hell. If we find that a chunk of 5000 rows loads in 4 seconds, and we have 2000 of these files, we now have a predictable load time. We have restored linear performance because we know that twice as many rows will take twice as long.
I am currently working on a system where I must occassionally load 3 tables of about 3 million rows each periodically from an old desktop Visual Foxpro system into a Postgres database. The chunking code on the output looks something like this:
* This is FOXPRO code, which is vaguely like BASIC...
mCount = 0
mIncrement = 5000 * Hardcoded chunk size
mRN = Reccount(p_tableName) * Fox's command to get row count
* these three commands get the record pointer to the top
* of the table and turn off all indexes
SELECT (p_tableName)
locate
set order to 0
for rnStart = 1 TO mRN step mIncrement
mCount = mCount + 1
* each loop outputs the next 5000 rows and leaves the
* record pointer ready for the next loop.
* Foxpro uses a semi-colon to mean continue onto next line
COPY column1,column2,column3 ;
TO (m_dir+p_tableName+"_"+PADL(mCount,6,'0')+".asc") DELIMITED ;
WHILE recno() < (rnStart + mIncrement)
* This is foxpro's 'echo' command, a question mark
? p_tableName + " copied "+str(_TALLY)+" records"
endfor
Then on the receiving end I need a program that reads the chunks and loads them in. The relevant portion of the code is here (the example is in PHP and loads to PostgreSQL):
# Assume variable $fcnt holds the number of chunks to load
# and that variables like $tabname, $ddir etc hold file names,
# directory locations, column lists and so forth.
for($c = 1; $c<= $fcnt; $c++) {
$insert = "_".str_pad($c,6,'0',STR_PAD_LEFT);
LogEntry(" loading file # "
.str_pad($c,6,' ',STR_PAD_LEFT)
.' of '.str_pad($fcnt,6,' ',STR_PAD_LEFT)
);
$cmd="COPY $tabname ($collist) "
." FROM '$ddir$afile$insert.asc' DELIMITERS ',' CSV QUOTE '\"'";
# This is my frameworks super-simple direct SQL command
SQL($cmd);
}
Conclusion: Chunks Restore Linear Performance
Database programmers depend heavily on "ACID" features to provide robust data storage. We depend upon these features so much that we will not consider using systems that cannot provide them (MySQL's MyISAM engine for instance). The cost of these features for performance is considered part of the bargain when robustness is required, but when you are doing a huge insert, the ACID features cause performance to go "non-linear", to become unpredictably long. As a first step you can drop indexes, keys, and constraints on a table to improve load times, but if that is not enough, you can restore linear performance by breaking the large operation into many "chunks", each of which is small enough to stay linear.
Why I Do Not Use ORM
An impedance mismatch occurs when two devices are connected so that neither is operating at peak efficiency. This lack of efficiency is not due to any intrinsic incompatibilities between the devices, it only exists once they are connected together the wrong way. Object Relational Mapping (ORM) does not cure a pre-existing impedance mismatch, it creates one, because it connects databases to applications in a way that hobbles both.
Welcome to the Database Programmer blog. This blog is for anybody who wants to see practical examples of how databases work and how to create lean and efficient database applications. There is a Complete Table Of Contents that is updated each week, and a Master list of table design patterns that is updated whenever a new design pattern is presented.
Good Marketing, Terrible AnalogyNormally I like to reserve this space for a positive presentation of things that I have found that work, I don't like to waste time ranting against things I don't like. However, ORM is so pervasive in some circles that it is important to establish why you will not see it used here, so as to avoid a lot of unnecessary chatter about its absence.
The use of the term "impedance mismatch" is great marketing, worthy of a certain software company in the Northwest of the US, but the analogy is utterly wrong. The analogy is used incorrectly to imply an intrinsic incompatibility, but the real irony is that there is no such incompability, and if we want to use the analogy we are forced to say that ORM is the impedance mismatch, because it creates the inefficient connection.
It always comes back to the fact that modern databases were designed to provide highly reliable permanent storage, and they possess a slew of features that promote that end. Programming languages on the other hand are meant to process data in a stepwise fashion. When the two of them meet it is very important to establish a strategy that uses the strengths of both, instead of trying to morph one into the other, which never yields efficient results.
The Very BasicsThe language SQL is the most widely supported, implemented, and used way to connect to databases. But since most of us have long lists of complaints about the language, we end up writing abstraction layers that make it easier for us to avoid coding SQL directly. For many of us, the following diagram is a fair (if not simplified) representation of our systems:
This diagram is accurate for ORM systems, and also for non-ORM systems. What they all have in common is that they seek to avoid manually coding SQL in favor of generating SQL. All systems seek to give the programmer a set of classes or functions that makes it easy and natural to work with data without coding SQL manually.
This brings us to a very simple conclusion: the largest part of working out an efficient database strategy is working out a good SQL generation strategy. ORM is one such strategy, but it is far from the simplest or the best. To find the simplest and the best, we have to start looking at examples.
First Example: Single Row OperationsConsider the case of a generic CRUD interface to a database having a few dozen tables. These screens will support a few single-row operations, such as fetching a row, saving a new row, saving updates, or deleting a row.
We will assume a web form that has inputs with names that begin with "inp_", like "inp_name", "inp_add1", "inp_city" and so forth. The user has hit [NEW] on their AJAX form, filled in the values, and hit [SAVE]. What is the simplest possible way to handle this on the server? If we strip away all pre-conceived ideas about the "proper" thing to do, what is left? There are only these steps:
- Assemble the relevant POST variables into some kind of data structure
- Perform sanity checks and type-validations
- Generate and execute an insert statement
- Report success or failure to the user
The simplest possible code to do this looks something like this (the example is in PHP):
# This is a great routine to have. If you don't have
# one that does this, write it today! It should return
# an associative array equivalent to:
# $row = array(
# 'name'=>'....'
# ,'add1'=>'....'
# )
# This routine does NOT sanitize or escape special chars
$row = getPostStartingWith("inp_");
# get the table name.
$table_id = myGetPostVarFunction('table_id');
# Call the insert generation program. It should have
# a simple loop that sanitizes, does basic type-checking,
# and generates the INSERT. After it executes the insert
# it must caches database errors for reporting to the user.
#
if (!SQLX_Insert($table_id,$row)) {
myFrameworkErrorReporting();
}
Without all of my comments the code is 5 lines! The Insert generation program is trivial to write if you are Using a Data Dictionary, and it is even more trivial if you are using Server-side security and Triggers.
This is the simplest possible way to achieve the insert, and updates and deletes are just as easy. Given how simple this is (and how well it performs), any more complicated method must justify itself considerably in order to be considered.
ORM cannot be justified in this case because it is slower (objects are slower than procedural code), more complicated (anything more than 5 lines loses), and therefore more error-prone, and worst of all, it cannot accomplish any more for our efforts than we have already.
Objection! What About Business Logic?The example above does not appear to allow for implementing business logic, but in fact it does. The SQLX_Insert() routine can call out to functions (fast) or objects (much slower) that massage data before and after the insert operation. I will be demonstrating some of these techniques in future essays, but of course the best permforming and safest method is to use triggers.
Example 2: Processes, Or, There Will Be SQLMany programmers use the term "process" to describe a series of data operations that are performed together, usually on many rows in multiple tables. While processes are not common on a typical website, they are plenty common in line-of-business applications such as accounting, ERP, medical programs, and many many others.
Consider a time entry system, where the employees in a programming shop record their time, and once per week the bookkeeper generates invoices out of the time slips. When this is performed in SQL, we might first insert an entry into a table of BATCHES, obtain the batch number, and then enter a few SQL statements like this:
-- Step 1, mark the timeslips we will be working with UPDATE timeslips SET batch = $batch WHERE batch IS NULL; -- Step 2, generate invoices from unprocessed timeslips INSERT INTO Invoices (customer,batch,billing,date) SELECT CUSTOMER,$batch,SUM(billing) as billing,NOW() FROM timeslips WHERE batch = $batch GROUP BY customer; -- Step 2, mark the timeslips with their invoices UPDATE timeslips SET invoice = invoices.invoice FROM invoices WHERE timeslips.customer = invoices.customer AND timeslips.batch = $batch;
While this example vastly simplifies the process, it ought to get across the basic idea of how to code things in SQL that end up being simple and straightforward.
Counter Example: The Disaster ScenarioThe biggest enemy of any software project is success. Code that works wonderfully on the developer's laptop is suddenly thrown into a situation with datasets that are hundreds of times larger than the test data. That is when performance really matters. Processes that took 3 minutes on the laptop suddenly take 10 hours, and the customer is screaming. How do these things happen?
Mostly they happen because programmers ignore the realities of how databases work and try to deal with them in terms they understand, such as objects or even simple loops. Most often what happens is that the programmer writes code that ends up doing something like this:
foreach( $list_outer as $item_outer) {
foreach( $list_inner as $item_inner) {
...some database operation
}
}
The above example will perform terribly because it is executing round trips to the database server instead of working with sets. While nobody (hopefully) would knowingly write such code, ORM encourages you do to this all over the place, by hiding logic in objects that themselves are instantiating other objects. Any code that encourages you to go row-by-row, fetching each row as you need it, and saving them one-by-one, is going to perform terribly in a process. If the act of saving a row causes the object to load more objects to obtain subsidiary logic, the situation rapidly detiorates into exactly the code snippet above - or worse!
On a personal note, I have to confess that I am continually amazed and flabbergasted when I see blog posts or hear conversations in user groups about popular CMS systems and web frameworks that will make dozens of database calls to refresh a single page. A seasoned database programmer simply cannot write such a system, because they have habits and practices that instinctively guard against such disasters. The only possible explanation for these systems is the overall newnewss of the web and the extreme ignorance of database basics on the part of the CMS and framework authors. One can only hope the situation improves.
Sidebar: Object IDs are Still GoodThere are some people who, like myself, examine how ORM systems work and say, "no way, not in my code." Sometimes they also go to the point of refusing to use a unique numeric key on a table, which is called by some people an "Object ID" or OID for short.
But these columns are very useful for single-row operations, which tend to dominate in CRUD screens (but not in processes). It is a bad idea to use them as primary keys (see A Sane Approach To Choosing Primary Keys), but they work wonderfully in any and all single-row operations. They make it much easier to code updates and deletes.
ConclusionsThe recurring theme of these essays is that you can write clean and efficient code if you know how databases work on their own terms. Huge amounts of application code can be swept away when you understand primary keys and foreign keys and begin to normalize your tables. The next step from there is knowing how to code queries, but sooner or later you have to grapple with the overall architecture. (Well supposedly you would do that first, but many of us seem to learn about architectural concerns only after we have coded long enough to recognize them).
A thorough knowledge of database behavior tends to lead a person away from ORM. First off, the two basic premises of ORM are factually incorrect: One, that there is some native incompatibility between databases and code, and two, that all the world must be handled in objects. These two misconceptions themselves might be excusable if they turned out to be harmless, but they are far from harmless. They promote a willful ignorance of actual wise database use, and in so doing are bound to generate methods that are inefficient at best and horrible at worst.
Overall, there are always simpler and better performing ways to do anything that ORM sets out to achieve.
Addendum June 19, 2008After reading the comments on the blog over the last few days I have decided to put in this addendum rather than attempt to answer each comment independently. I have attempted to answer the objections in descending order of relevance.
The Example is Trivial or "Cheats"This is a very compelling challenge to the article offered by bewhite and Michael Schuerig and it deserves a meaningful response. What I want to do is flesh out my approach and why I find it better than using ORM. While I do not expect this to lead to agreement, I hope that it answers their challenges.
- My sphere of activity is business applications, where two dozen tables is trivial and the norm is for dozens or hundreds of tables.
- When table count beyond the trivial, many concerns come into play that do not appear at lower table counts.
- I have found that a single unified description of the database works best for these situations, provided it can specify at very least schema, automations, constraints, and security. This is what I refer to as the data dictionary.
- The first use of the data dictionary is to run a "builder" program that builds the database. This builder updates schemas, creates keys and indexes, and generates trigger code. The same builder is used for clean installs and upgrades.
- The generated trigger code answers directly the challenges as to how non-trivial inserts are handled. Downstream effects are handled by the triggers, which were themselves generated out of the dictionary, and which implement security, automations, and constraints. No manual coding of SQL routines thank you very much.
- All framework programs such as SQLX_Insert() read the dictionary and craft the trivial insert. The code does what you would expect, which is check for type validity, truncate overlong values (or throw errors). But it does need to know anything more than is required to generate an INSERT, all downstream activity occurs on the server.
- The dictionary is further used to generate CRUD screens, using the definitions to do such things as gray out read-only fields, generate lookup widgets for foreign keys, and so forth. This generated code does not enforce these rules, the db server does that, it simply provides a convenient interface to the data.
- A big consequence here is that there is no need for one-class-per-table, as most tables can be accessed by these 'free' CRUD screens.
- That leaves special-purpose programs where 'free' CRUD screens don't reflect the work flow of the users. In a business app these usually come down to order entry, special inquiry screens and the lot. These can be programmed as purely UI elements that call the same simple SQLX_Insert() routines that the framework does, because the logic is handled on the server.
- This approach is not so much about code reuse as code elimination. In particular, the philosophical goal is to put developer assets into data instead of code.
- When this approach is taken to its full realization, you simply end up not needing ORM, it is an unnecessary layer of abstraction that contributes nothing to quality at any stage.
These ideas are implemented in my Andromeda framework. It is not the purpose of this blog to promote that framework, but it has been successfully used to produce the types of applications I describe on this blog. I make mention of it here for completeness.
So to conclude, both of these gentlemen are correct that the example says nothing about how the crucial SQLX_Insert() routine is coded, and I hope at least that this addendum fleshes this out and makes clear where it is different from ORM.
The Model Should Be Based On Classesbewhite asks "Do you propose us to organize our applications in terms of tables and records instead of objects and classes?"
Yes. Well, maybe not you, but that's how I do it. I do not expect to reach agreement on this point, but here at least is why I do it this way:
- My sphere of activity is business applications, things like accounting, ERP, medical management, job control, inventory, magazine distribution and so forth.
- I have been doing business application programming for 15 years, but every program I have ever written (with a single recent exception) has replaced an existing application.
- On every job I have been paid to migrate data, but the old program goes in the trash. Every program I have written will someday die, and every program written by every reader of this blog will someday die, but the data will be migrated again and again. (...and you may even be paid to re-deploy your own app on a new platform).
- The data is so much more important than the code that it only makes sense to me to cast requirements in terms of data.
- Once the data model is established, it is the job of the application and interface to give users convenient, accurate and safe access to their data.
- While none of this precludes ORM per se, the dictionary-based approach described above allows me to write both procedural and OOP code and stay focused on what the customer is paying for: convenient, accurate and safe access.
- The danger in casting needs in any other terms is that it places an architectural element above the highest customer need, which is suspect at best and just plain bad customer service at worst. We all love to write abstractions, but I much prefer the one that gets the job done correctly in the least time, rather than the one that, to me, appears to most in fashion.
More than one comment said simply that triggers and other server-side technologies "went out". Since I was there and watched it happen I would contend that when the web exploded a new generation came along with different needs. In particular the need for content and document management caused people to question all of the conventional uses of the SQL databases, and like all programmers they are convinced their world is the only world and all of the world, ergo, triggers are history because I don't use them. Nevertheless, those of us who continue to write business applications continue to use the technologies that worked well then and only work better now.
Ken Does Not Like OOPI love OOP, especially for user interfaces. I just don't think it should own the domain model, and I don't think that "trapping" business logic inside of classes gives nearly the same independence as a data dictionary does. I've tried it both ways and I'll stick with the dictionary.
Any Use of OO Code Implies ORMA few comments said outright that if you are using OOP code then you are by definition mapping. Technically this is untrue if you understand the use of the term "map" as opposed to "interface". Mapping is the process of creating a one-to-one correspondence between items in one group (the code) to items in the other (the database). A non-ORM interface is one in which any code, procedural or OOP, passes SQL and handles data without requiring a one-to-one mapping of tables or rows to classes or functions. My apps are not ORM because I have no such requirement that there be a class for every table, and no such requirement that there be any specific code to handle a table.
Don't Talk about Procedural Being FasterAt least three comments blasted this contention. To put things in context, performance in a database application goes in two stages. First and absolutely most critical is to be extremely smart about reducing database reads, since they are 1000's of times slower than in-memory operations. However, once that is done, there is no reason to ignore speed improvements that can be gained by optimizing the code itself. The commenters are correct that this gain is of a very low order, but I would stand by the statement after making this contextual addendum.
Thank You AllThis was the most widely read piece in this series, definitely the most controversial. There will not likely be any other articles this controversial, as the main purpose of this essay was to provide regular readers with some background as to why they will not see ORM-based examples in future essays. Thanks for the comments!
Using a Data Dictionary
Database applications can be made much simpler if you maintain a body of data that describes your tables. Every single programming task in a database application needs to know something about the tables it is working with, so every program in a framework and many programs in an application can benefit from a central store of information about the database.
Introducing the Data DictionaryThe term "data dictionary" is used by many, including myself, to denote a separate set of tables that describes the application tables. The Data Dictionary contains such information as column names, types, and sizes, but also descriptive information such as titles, captions, primary keys, foreign keys, and hints to the user interface about how to display the field.
A super-simple beginning Data Dictionary might look like this:
TABLE | COLUMN | TYPE | PREC* | SCALE | PK | AUTO** |DESCRIPTION ---------+---------+------+-------+-------+----+--------+------------------- students | student | int | - | - | Y | IDENT | Student ID students | firstnm | char | 20 | - | | | First Name students | lastnm | char | 20 | - | | | Last Name students | city | char | 20 | - | | | City students | gpa | num | 2 | 1 | | | Grade Point Avg * Precision: Needed for chars, varchars, and numerics ** automation: to be described more belowThe First Question: Where to Put It
It might seem that the first question about a data dictionary would be "What do we put in it?" We will get to this question in a moment, but the most important question is usually, "Where do we put it?" By this I mean do you put into an XML file, or do you encode it into classes in program code? Or is it somehow directly entered into database tables? There are of course many opinions on this.
The first opinion is that you do not really need a data dictionary per se because you can get this information from the database server (if this is news to you, google "information_schema" and your favorite database). There are two major drawbacks when you depend on the server. First, you cannot build a database out of your data dictionary, because of course you don't have one until the database is built. The second drawback is much worse, which is that you cannot put any extended properties into the dictionary, and without those the dictionary is of little use.
Another method is to put the dictionary into classes, using the typical one-class-per-table approach and storing the data as properties of the class. There are multiple drawbacks to this approach:
- Data that is "trapped" in code is very difficult to deal with efficiently, so many operations with the dictionary will be much harder to code and will be slower.
- The dictionary is spread out into many files.
- Ironically, a good data dictionary allows you to generate most CRUD forms, and so you don't need those one-class-per-table files filling up your directory. It seems silly to make a class that contains data that makes the class unnecessary.
The option I prefer is a plaintext file, which can be generated by a GUI or typed in by hand. The only requirement for the file is that it be easy to type and read, and easy to parse by a program. These requirements are well met by two formats: YAML and JSON (XML is a bletcherous horror to work with manually, so it is disqualified before the race starts). Both YAML and JSON enjoy parsers and writers in nearly all popular languages, so if you create your data dictionary in one of those you have a file that is human readable and writable, machine readable and writable, useful in nearly every language on every platform, easily placed in source control, and very flexible in what it can represent.
First Basic Use: Building and Upgrading DatabasesA data dictionary is a wonderful way to build and upgrade databases. It is a real thrill to write your first SQL generator that scans a table of column definitions, writes out a CREATE TABLE statement, and executes it. The next step in the evolution of this process is to have the program query the INFORMATION_SCHEMA of the database, then work out which columns have been added to your data dictionary that are not in the table, and then upgrade the table with an ALTER TABLE statement.
This basic approach can easily be extended to include indexes and keys.
Sidebar: Data Dictionary Versus Upgrade ScriptsMany programmers use upgrade scripts to alter their schemas. The idea is that programmer Sax Russell adds a feature. Along with his code he writes a script that makes the necessary alterations to the database. Then comes Ann Clayborne who does the same thing, followed by Hiroko Ai, and then Sax again. When a customer upgrades their system, they run all four scripts in order. This can lead to horrible upgrade experiences in cases where multiple scripts are upgrading a large table several times. A data dictionary is far s



