剔除多余的SQL代码(Getting rid of SQL Code)

来源:互联网 发布:简单的html源码下载 编辑:程序博客网 时间:2024/05/16 13:51
Getting rid of SQL Code
20 August 2009
by Joe Celko
from: http://www.simple-talk.com/sql/t-sql-programming/getting-rid-of-sql-code/

 

Joe becomes intrigued by the way that experts make errors in any area of technology, and suggests that the problem is more that of mindsets than lack of knowledge. He illustrates the point with SQL Development by means of the "Britney Spears, Automobiles and Squids" table, and  the tangled Stored procedure, and shows ways of getting rid of both procedural and non-procedural code by adopting a different programming mindset..

Errors in complex technology

The book "The Logic of Failure: Recognizing and Avoiding Error in Complex Situations" by Dietrich Dörner (ISBN-13: 978-0201479485; 1997) explains the mindsets of people who were trying to solve a problem and got it wrong. He goes into details of the action of the operators of the Chernobyl reactor, who were all trained experts. They weren’t incompetent.  They ignored safety standards because they knew what they were doing. Dörner identifies the mindsets that cause failures, and we can apply this insight to programming.

First mindset problem:

Once you have started in one direction, you stick to it.  But the Turkish proverb is right: “No matter how far you have travelled down the wrong road, turn around”. 

Second mindset problem:

Most programmers started with procedural languages and still carry that mindset into declarative SQL.  This is another proverb: “To a child with a hammer, everything looks like a nail.”

In SQL, I usually find this error in attempts to save bad DDL with TRIGGERs, CURSORs, stored procedures and application code.  Those are the ways to get procedural code into SQL. 

Third mindset problem:

People prefer static mental models, rather than a dynamic model.  This was also the model for traditional programming languages – when you compile a program, you get the same executable code every time.  That does not work in SQL; this is why we have query optimizers rather than procedural code in RDBMS. 

An example of that mindset was a small company that did medical lab reports over the internet.  They had to deliver 24 hour or less turn-around of accurate data.  The clumsy schema required convoluted queries and lots of procedural code to validate the data.  Most of the data validation could have been done declaratively much faster. Hey, it works, so why change it?   My personal favourite was keeping numeric data in strings, then casting both blanks and NULLs to zero. 

The business was successful and the data volume started growing.  But the queries were growing slower at a faster rate than the company, so you could see the train wreck coming.  Management’s solution was to buy more hardware.  This is like adding more rails to the end of the track; if the bridge is wrecked this does not solve the fundamental problem. 

Fourth mindset problem: 

People have a strong tendency to focus on immediately problems and not ask what will happen next.  This will then lead to problems caused by the solutions.  This is also known as the Law of Unintended Consequences.  There is a wonderful book "The Systems Bible: The Beginner's Guide to Systems Large and Small (Third Edition of SYSTEMANTICS)" by John Gall (ISBN 0-9618251-7-0; 2003) that details this phenomena. 

In IT, there is a proverb: “Nothing is more permanent than a ‘temporary patch’ in software”; I will let the reader fill in his favourite horror story here – we all have one.

We’ll illustrate a few of these mindset problems in some code written by a competent expert.  

Getting Rid of Procedural Code

This is from a real email I got.  I was asked if I could see any problems with the stored procedure below and if I had any tips regarding naming conventions or a better approach.   

ALTER PROCEDURE [dbo].[SaveEmployee]
 @Employee_Id INT OUTPUT,
 
@Employee_Username VARCHAR(25),
 
@Employee_FirstName VARCHAR(25),
 
@Employee_LastName VARCHAR(25),
 
@Employee_Status INT,
 
@Employee_Created DATETIME OUTPUT,
 
@Employee_Modified DATETIME OUTPUT,
 
@UsernameExists INT OUTPUT,
 
@CommandStatus INT
AS
BEGIN
 SET NOCOUNT ON;
 
IF @Employee_Id > 0
  
BEGIN
-- If employee exists, check if the employee's username already belongs to another employee
  
IF NOT EXISTS
      (
SELECT TOP 1 Employee_Id
        
FROM dbo.Employees AS A
        
WHERE NOT A.Employee_Id = @Employee_Id
          
AND Employee_Username = @Employee_Username)
  
BEGIN
  
SET @Employee_Modified = GETDATE();
  
UPDATE dbo.Employees
      
SET Employee_Username = @Employee_Username,
          
Employee_FirstName = @Employee_FirstName,
          
Employee_LastName = @Employee_LastName,
         
Employee_Status = @Employee_Status,
          
Employee_Modified = @Employee_Modified
    
WHERE Employee_Id = @Employee_Id;    
    
-- If employee is deleted, delete the employee orders
    
IF @Employee_Status = 3 -- Employee is deleted
      
BEGIN
       UPDATE
Employee_Orders
      
SET Employees_Orders_Status = 3
    
WHERE Employee_Id = @Employee_Id;
      
END
    END
   ELSE
    BEGIN
     SET
@UsernameExists = 1;
    
SET @CommandStatus = 3; -- Indicate failure
    
END
  END
 ELSE
  BEGIN
-- If new employee, check if employee's username already exists
  
IF NOT EXISTS
      (
SELECT TOP 1 Employee_Id
        
FROM dbo.Employees AS A
      
 WHERE Employee_Username = @Employee_Username)
    
BEGIN
    
INSERT INTO Employees (Employee_Username, Employee_FirstName, Employee_LastName, Employee_Status, Employee_Created)
   
VALUES (@Employee_Username, @Employee_FirstName, @Employee_LastName, @Employee_Status, @Employee_Created);
    
    
SET @Employee_Id = SCOPE_IDENTITY();
    
END
   ELSE
    BEGIN
     SET
@UsernameExists = 1;
    
SET @CommandStatus = 3; -- Indicate failure
    
END
   END
END
;

There is a lot of stuff that we don’t do in SQL here.  It looks more like a COBOL or BASIC program written in T-SQL.

  1. The “Employees” table ought to be named “Personnel” because it is a set.  But if you are still writing sequential file processing, your mindset only lets you see the trees, one at a time, instead of the forest. 
    Can you figure out why he would use “A” as an alias for the Employees table?  Alphabetical order and sequential thinking again!  In later code he used “B” for the second alias, etc.

  2. I'd suggest that Pascal and camelCase is bad for maintaining code. 

  3. What do the OUTPUT values do?  Well, @Employee_Id returns an IDENTITY value, so we have no verification or validation possible for this data element.  You need better ids than a count of physical insertion attempts to the hardware.
    @Employee_Created and @Employee_Modified are audit meta-data and should not be here at all.  You never put audit data in the table being audited.  But even here, he used getdate()  instead CURRENT_TIMESTAMP just to be proprietary. 
    @UsernameExists is a local flag, just like you would write in Assembly language.  Remember those days?  Make a test and look at a register!  He does not understand that SQL would use a predicate. 

  4. Look at more assembly language programming with @CommandStatus

  5. Why write “NOT EXISTS (SELECT TOP 1 ..)”?  This is silly, proprietary and it doesn’t even have an ORDER BY clause.  Now look at “WHERE NOT A.employee_id = @employee_id” and think about trying to maintain the code. 

  6. @CommandStatus is an example of “flag coupling”, one of the worst kinds of coupling in Software Engineering.  This procedure depends on that flag, this module will do completely different things. Look at what happens when you add a new guy with (@CommandStatus = 3); insert him into the Employees table, flag his orders for deletion and finally report an error with invented codes.  ARRGH! 

When you write older procedural code, the application and the data are coupled in the same module.  You have to do everything in one place.  There were no log files for audit data.  Files had no constraints.  The closest thing to limiting user access was a write-protect ring on a reel of magnetic tape (Google it if you are too young to remember). 

SQL uses DDL, DML and DCL sublanguages to assure data integrity and DDL is the most important line of defense. Let’s put all of those IF-THEN-ELSE control flows from the original procedure into constraints:

CREATE TABLE Personnel
(emp_email VARCHAR(255) NOT NULL
    
CONSTRAINT no_dup_emails PRIMARY KEY
    CONSTRAINT CHECK
(<< company VALIDATION RULE >>),
 
emp_username VARCHAR (25) NOT NULL
  
CONSTRAINT no_dup_usernames UNIQUE,
 
emp_firstname VARCHAR (25) NOT NULL,
 
emp_lastname VARCHAR (25) NOT NULL,
 
..);

Now back to a coherent module that does one job in itself.

CREATE PROCEDURE InsertNewEmployee
(@in_emp_email VARCHAR(255), -- company assigned
 @in_emp_username VARCHAR(25), -- employee choice
 @in_emp_firstname VARCHAR(25),
 
@in_emp_lastname VARCHAR(25))
AS
BEGIN
INSERT INTO
Personnel (emp_email, emp_username, emp_firstname, emp_lastname)
VALUES (@in_emp_email, @in_emp_username, @in_emp_firstname, @in_emp_lastname);
 
<< error handling here >>
END;

Pick your favorite error handling method here.  There is the traditional @@ERROR and RAISERROR or the new TRY-CATCH method. You will get the name of the constraint for display in the error message.

There is no need to invent a new @EmployeeStatus code to pass along to calling programs.  That was COBOL, 40 years ago.  The only serious problem is that SQL Server does not support the Standard SQLSTATE yet. 

The audit data is being done with a third party tool outside the table; deleting a row will not destroy the audit trail anymore.  

For this problem, I would put the update of an existing employee in a separate “CREATE PROCEDURE UpdateEmployee()” so that I can add some DCL on it so nobody can change other people’s data.  But if you want to have this in one module, you can use the new MERGE statement:

MERGE INTO Personnel -targettable

USING (SELECT emp_email, emp_username, emp_firstname, emp_lastname

        FROM (VALUES (@in_emp_email, @in_emp_username,

                      @in_emp_firstname, @in_emp_lastname)) AS X)

ON Personnel.emp_email = X.emp_email

WHEN MATCHED

THEN UPDATE SET emp_username = X.emp_username,

            emp_firstname = X.emp_firstname,

            emp_lastname = X.emp_lastname

WHEN NOT MATCHED

THEN INSERT (emp_email, emp_username, emp_firstname, emp_lastname)

     VALUES (X.emp_email,X.emp_username, X.emp_firstname, X.emp_lastname);

Remember to catch errors, of course. The advantages of the MERGE over all that procedural code are huge because

  1.  It is shorter and easier to read than line after line of procedural code.  

  2. It is Standard SQL!  It will port.  Other SQL programmers can read it and maintain it. 

  3. It is one statement, so there is no worry about isolation levels and other sessions.  Your error messages are for just one SQL statement rather than several different ones in a multi-statement block of code. 

  4. It is one statement, so it can be optimized.  Procedural language compilers do their optimizations in multiple passes and have elaborate symbol tables and other structures to get a good executable module. T-SQL is a simple one-pass compiler.  That is why you have to use the “@” prefix.  The prefix tells the compiler about the nature of the variable so it does not have to use a symbol table. 

Getting Rid of Non-Procedural Code

One common problem in SQL Server (but not other SQLs) is that it overly sensitive to referential cycles.  If Table A references Table B twice or more, you get an error message.  This makes it hard to have a data element play two or more roles in a table. 

But most of the time, the problem is a design flaw.  Here is an actual skeleton schema from a Newsgroup posting.  The poster asked what the design flaw(s) were.

CREATE TABLE Employees
(emp_ssn CHAR(9) NOT NULL,
 
emp_name VARCHAR(50) NOT NULL,
 
works_for CHAR(4) NOT NULL,
 
CONSTRAINT pk_Emp PRIMARY KEY(emp_ssn));
 
CREATE TABLE Departments
(dept_code CHAR(4) NOT NULL,
 
dept_name VARCHAR(30) NOT NULL,
 
dept_boss CHAR(9) NOT NULL,
  
CONSTRAINT pk_Dept PRIMARY KEY (dept_code),  
  
CONSTRAINT uq_Dept UNIQUE (dept_name));
ALTER TABLE Employees ADD CONSTRAINT fk_Emp_Dept
 FOREIGN KEY (works_for)
 
REFERENCES Departments (dept_code)
  
ON DELETE CASCADE
  ON UPDATE CASCADE
;
ALTER TABLE Departments ADD CONSTRAINT fk_Dept_Emp
 FOREIGN KEY (dept_boss)
 
REFERENCES Employees (emp_ssn)
  
ON DELETE CASCADE
  ON UPDATE CASCADE
;

SQL Server will return an error when you run this code:

Msg 1785, Level 16, State 0, Line 1 Introducing FOREIGN KEY constraint 'fk_Dept_Emp' on table 'Departments' may cause cycles or multiple cascade paths. Specify ON DELETE NO ACTION or ON UPDATE NO ACTION, or modify other FOREIGN KEY constraints.

Msg 1750, Level 16, State 0, Line 1 Could not creates constraint. See previous errors..

The immediate flaws are improper data element names, which is easy to fix with a global text replacement, but let’s get this out of the way. 

  1. The Employees table should be named Personnel.  It models a set and not one record at a time.  Even Microsoft gave up on camelCase, so the constraint names should be changed, too. 

  2. The "works_for" column is a bad data element name.  It is a verb that describes a relationship and not the name of an attribute.  We will get to that problem shortly.  It is serious and fundamental.

  3. "dept_code" is not an identifier any more that "shipping_code" would be.  A code is a different kind of attribute than a tag number or name. It would be something like “unfunded”, “over budget”, etc. 

  4. Being the department boss is not a property of a department, but of an employee's job assignment.  Is the driver a part of the automobile? 

  5. Likewise, "dept_boss" is a role played by some emp_ssn.  It is not some special attribute of a department. 

In short, you have built "Britney Spears, Automobiles and Squids" tables.  You have used improper data element names.  Let's fix the data element names first and consolidate the table declaration into one declarative statement per table. 

CREATE TABLE Personnel_and_OrgChart -- mixed table
(emp_ssn CHAR(9) NOT NULL PRIMARY KEY,
 
emp_name VARCHAR(50) NOT NULL,
 
works_for_dept_nbr CHAR(4) NOT NULL
  
REFERENCES Departments (dept_nbr)
    
ON DELETE CASCADE
    ON UPDATE CASCADE
);
CREATE TABLE Departments_and_Job_Assignments --mixed table
(dept_nbr CHAR(4) NOT NULL PRIMARY KEY,
 
dept_name VARCHAR(30) NOT NULL UNIQUE,
 
dept_boss_emp_ssn CHAR(9) NOT NULL
  
REFERENCES Personnel (emp_ssn)
    
ON DELETE CASCADE
    ON UPDATE CASCADE
);

While it is a good idea to name constraints, remember that you do it so that you can see those names in errors messages or use them to alter the table.  Naming a PRIMARY KEY is a bit redundant, isn’t it?  The error message will tell you about such violations.

Let’s next get Britney Spears and the Squids into their own tables by splitting out entities and relationships. 

-- tables that model entities
CREATE TABLE Personnel
(emp_ssn CHAR(9) NOT NULL PRIMARY KEY,
 
emp_name VARCHAR(50) NOT NULL,
 
..);
CREATE TABLE Departments
(dept_nbr CHAR(4) NOT NULL PRIMARY KEY,
 
dept_name VARCHAR(30) NOT NULL UNIQUE
 ..);
-- tables that model relationships
CREATE TABLE OrgChart
(job_title CHAR(9) NOT NULL,
 
dept_nbr CHAR(4) NOT NULL
  
REFERENCES Departments (dept_nbr)
  
ON DELETE CASCADE
  ON UPDATE CASCADE
,
 
PRIMARY KEY (job_title, dept_nbr),
  <<
nested set model>>);

For this example, let’s not bother with the organizational hierarchy and just assume that we have a nested sets model in the rest of the table.  I also made a two-column the key just to make the next table more interesting. 

CREATE TABLE Job_Assignments
(dept_nbr CHAR(4) NOT NULL
 
dept_name VARCHAR(30) NOT NULL,
 
FOREIGN KEY (job_title, dept_nbr)
  
REFERENCES OrgChart (job_title, dept_nbr)
    
ON DELETE CASCADE
     ON UPDATE CASCADE
,
 
emp_ssn CHAR(9) NOT NULL
  
REFERENCES Personnel (emp_ssn)
    
ON DELETE CASCADE
     ON UPDATE CASCADE
,
 
PRIMARY KEY (job_title, dept_nbr, emp_ssn),
 
..);

The DRI cycle problem is now gone. 



This article has been viewed 10497 times.

Joe Celko

Author profile: Joe Celko

Joe Celko is one of the most widely read of all writers about SQL, and was the winner of the DBMS Magazine Reader's Choice Award four consecutive years. He is an independent consultant living in Austin, TX. He has taught SQL in the US, UK, the Nordic countries, South America and Africa.
He served 10 years on ANSI/ISO SQL Standards Committee and contributed to the SQL-89 and SQL-92 Standards.
He has written over 800 columns in the computer trade and academic press, mostly dealing with data and databases. He is the author of eight books on SQL for Morgan-Kaufmann, including the best selling SQL FOR SMARTIES.
Joe is a well-known figure on Newsgroups and Forums, and he is famous for his his dry wit. He is also interested in Science Fiction.

Search for other articles by Joe Celko