How Relational Integrity

can simplify your database project,

shorten development time,

and improve the quality of your life.

Unless you are a seasoned database pro,

"relational integrity" might be just

another fancy marketing phrase that you don't really understand

but are vaguely suspicious of.

If the following explanation helps you, I will be very,

very happy -- you don't even have to tell me; I'll know.

And if you'd like more information,

please contact me; I do database work for a living,

answer email for free, and

would love to arrange some onsite training or consulting

at your company.

To illustrate relational integrity concepts,

I use the following sample database:

This database could be for an ezine (an electronic magazine web site),

a famous quotes search engine, an online library,

or any application which has to keep track of content (articles)

by owner (author).

If you already know what RI is, please feel free to

skip over the following

basic explanations.

What is a Relational Integrity?

A relational database

contains tables of data which are related to each other.

For example, articles are related to the author who wrote them,

and conversely, authors are related to the articles they wrote.

(Sounds obvious, and it is.)

That's the relational part.

When we work with a relational database,

we naturally expect that data in related tables stay related.

For example, articles written by MartinB should always

be related to MartinB and never be confused with

articles written by any other author.

That's the integrity part.

Relational Integrity is also called

Referential Integrity, perhaps because the

mechanism for ensuring integrity is implemented

by the way the tables reference each other.

Both terms are okay with me. I usually

just say RI anyway.

How RI Works

Tables are related using data values.

Usually, these are the values of numeric keys instead

of text fields. (Database designers love assigning

numeric keys, mainly because it's more

efficient than relating tables through

text fields, but also because most databases

can generate autonumbers or sequence numbers

for this purpose.)

So let's say MartinB is an author in the

database and has author number 32.

Further, let's say that for some reason

(which we won't go into here, as it's not important)

you need to change MartinB's author number from 32 to

45678. The way it works is that you

change his author number on his author

record in the author table.

And then you had better also change

any references to author 32 in all related tables
--

like the articles table --

because if you don't, they'll still

refer to author number 32, which no longer exists.

Mercifully, you don't have to do this yourself.

This is where RI comes in. RI can do it for you.

The underlying structure and enforcement of RI

is usually handled by the relational database system.

I say usually because in all database

systems I've ever worked with, RI is optional,

absurd as that seems.

Database developers have to turn RI on,

or else ask the developers to provide external enforcement

procedures outside the database system.

So if you don't have RI turned on for your particular

database, then when the day comes -- and you know it will --

when you have to make a change

and you forget the external procedure(s) to change all the

necessary related data values, the database will be said to have

lost integrity. You will have "broken links" in your tables.

You will be up the creek. (Next April 1, mention to a DBA that

there are broken links in his database, and watch his face...)

You will, I am certain, by this point have guessed

that RI can be a very useful feature to employ in your database.

Let us take one more brief

moment to cover some basic definitions

before getting down to business.

Again, please feel free to

skip over these definitions.

Parent and Child

Every database relation is a one-to-many

relation involving just two tables, a parent table and a child table.

Yes, you may hear other terms like one-to-one, many-to-many, and so on, but

these are all just special cases of the one-to-many relation.

Don't let them fool you. Understanding databases is easy if you keep it simple.

So each relation is one-to-many, involving a parent table

and a child table.

This terminology is not standard, merely convenient, if somewhat unfortunate.

Some people say master and detail; sometimes, I'll

say owner and member, a flashback

from pre-relational CODASYL databases

as out of date as bell-bottom pants (except that unlike

pre-relational databases, bell-bottoms may one day be back in style).

As you might guess, the parent is the "one" and

the child is the "many."

Each parent can have many children.

The reverse, I am afraid, is not true, not in a relational database.

A child can have only one parent.

Please don't

dwell too long on the obvious fact that in real life a child

has two parents -- you will only get confused. In a relational database,

in a one-to-many relation, each child has at most

one parent
. This is important. This is also why the

parent-child terminology is unfortunate.

Let's look at our sample database.

Each author is related one or more articles, the articles written by that author.

The author table is the parent table, and

the article table is the child table.

Note that an author can write many articles, but an article has only one author.

This is a natural consequence of it being a one-to-many relation.

The scenario where an article

is co-authored by more than one author

involves a many-to-many relation,

the contemplation of which, like the reality that

in real life a child has more than one parent,

is beyond the scope of this discussion; it will only confuse you

now.

Primary and Foreign Keys

Relational databases implement RI

using primary keys and foreign keys.

Other than their special role in RI,

these "keys" are like any other database fields -- they

contain data values.

The parent table contains the primary key and

the child table contains the foreign key which

references (or "points" to, if you prefer) the parent table.

Because each child relates to only one parent,

the child table needs only one foreign key.

So where are the "many" child records in a

one-to-many relation? They're

in the multiple rows of the child table

that all have the same data value in

their foreign key
, all referencing the same parent.

For example, consider the following articles:

AuthorArticle
32Your clients need a CMS


The Tao of Testing
88640 x 480 Isn't Dead Yet


Inside the evolt.org Rebuild
20Liquid Tables

Above, you can see the one-to-many relation clearly.

Now look at it the way you normally see

it actually stored in the article table:

ArticleAuthor
Your clients need a CMS32
640 x 480 Isn't Dead Yet88
The Tao of Testing32
Inside the evolt.org Rebuild  88
Liquid Tables20

Now you can see more clearly that each article is

related to only one author. This is the "many-to-one"

reversal of the one-to-many relationship.

It amounts to the same thing. You can always say something

about a one-to-many relation in two different ways.

 

RI Actions

Finally, we get down to business.

In a relational database,

there are four basic actions you can perform on a table:

These actions, known as SIUD, are carried out

by the -- wait for it -- Select, Insert, Update, and Delete SQL statements.

(These actions are sometimes, as in business process modelling, also

called CRUD, using the initials for Create, Reference, Update, and Delete.

CRUD is easier to remember, but not as helpful as SIUD, because

you may end up getting confused. For example, there's no

Reference SQL statement, but there is a Create SQL statement,

although it doesn't create any data. I'm sorry I even

mentioned CRUD. I just wanted to

set the record straight, point out again that simplicity rules,

and make sure you were awake.)

Three of these actions -- Insert, Update, and Delete --

change the contents of the database table and therefore have RI implications.

Remember, you want to stay on top of changes, preferably letting

the database itself handle them, to ensure you

never have "broken links" in your data.

Now it gets a wee bit complicated.

For every relation, the Insert, Update, and Delete actions

can be performed on either the primary key (in the parent table), or the foreign

key (in the child table).

There are thus a total of six possible RI actions for

every one-to-many relation:

Parent Insert Child Insert
Parent Update Child Update
Parent Delete Child Delete

Each of these six possible RI actions

needs to be examined when you

design your database, because what you want to do

is turn over as much of the enforcement

of RI as possible to the database itself.

RI Rules

Now it gets a little more complicated.

For each of the six RI actions mentioned above,

one of five RI rules can be implemented:

Cascade  perform same action on related keys
Restrict  disallow action
Set Null  set related keys to null
Set Default set related keys to their default value
None  allow action, do nothing to related keys

Does this mean that implementing RI

requires choosing one out of a possible thirty different

combinations of actions for each one-to-many relation??

The answer is yes -- but I hasten to add that it isn't really as bad or as much

work as it may seem!!

In fact, most database systems do not let you

define all of the above rules. Where you cannot

define the rule, it typically defaults either to Restrict or None.

Sample RI Action Rules

Let's say you want to insert a new article, i.e. a child insert action.

In most databases, the Restrict rule would be

in effect (assuming you have defined the foreign key properly,

i.e. RI has been turned on).

What this means is that you cannot insert an article for an

author that doesn't exist; you are restricted from doing so.

This is typical behaviour

for most one-to-many relations. You have to add the parent to

the database first, and then you can add child records

that reference that parent.

In fact, if you wanted some other rule for Child Insert,

you would find it extremely tricky to implement. Trust

me, it's messy. Luckily, this requirement hardly ever

comes up. I shouldn't even have mentioned it.

Of the five possible actions for Child Insert,

only Restrict really makes sense.

Okay, I lied. Actually, there is one other action that

makes sense, in a macabre sort of way, only because you see it all the time,

not that you would really want to define it

this way on purpose (assuming you have been paying attention).

If you do not turn on RI for the author-article

one-to-many relation, the Child Insert action defaults to None.

In other words, you can add an article that references any author number

you like, the database will not check it for you,

and who knows, it might result in a broken link.

So leaving RI undefined for a given relation

results in Child Insert: None.

Maintaining the author-article relation involves six different actions.

Below, I have chosen the RI rules that "make the most sense" to me.

See if you agree or disagree with my comments:

  1. Author Insert: None


    insert new author, do nothing to related articles


    This is fairly trivial, since new authors wouldn't have any articles yet

    (assuming other sensible RI rules are in effect).

  2. Author Update: Cascade


    update the author's primary key and also all matching article foreign keys


    This is the crucial action we've been talking about that you

    want the database to do for you, and while you might not need

    to change a record's primary key often, when you need to do it,

    you want Cascade.

  3. Author Delete: Restrict


    delete author only if no articles have been written by this author


    If you allowed the author delete to happen and left the

    articles in there, they would all have broken links; Restrict prevents this.

    Alternatively, if you specified the Author Delete rule as Set Null,

    the author would be gone and the articles would be left behind but you

    wouldn't know who wrote them; this is the only other choice that

    makes sense to me, but maybe not as much as Restrict.

    (And I'm sorry, I really cannot get into

    a discussion of nulls at this point.)

  4. Article Insert: Restrict


    insert new article only if the author already exists


    This is the example discussed in detail earlier.

  5. Article Update: Restrict


    update article to a different author but only if the new author

    also already exists


    Sometimes this is required to correct an input error.

  6. Article Delete: None


    delete article and do nothing to author


    This makes sense, right? Note that if the article being deleted

    is the author's last or only one, this may result in an author

    with no articles -- but that's not the

    same thing as a broken link.

Options for Implementing RI

If you made it this far, congratulations --

things are about to get a whole lot simpler.

Standard SQL-92 syntax provides a mechanism

for setting RI rules only for the parent delete and parent update actions.

All others take a default. The rules are defined when

declaring the foreign key on the child table:

CREATE TABLE child ( columndefinitions,


  [constraintname] FOREIGN KEY (fkcolumns)


   REFERENCES parent [(pkcolumns)]


   ON DELETE


    CASCADE SET NULL SET DEFAULT NO ACTION


   ON UPDATE


    CASCADE SET NULL SET DEFAULT NO ACTION


  )

As mentioned, all other rules take defaults, and

if you do not define the parent delete and parent update rules,

they take defaults too:

ActionDefault Rule
Parent InsertNone
Parent UpdateRestrict
Parent DeleteRestrict
Child InsertRestrict
Child UpdateRestrict
Child DeleteNone

None of these defaults is very surprising, when you think about it.

In fact, the only action for the author-article relation

that we thought made sense to be different from the default

is Author Update, which we decided we wanted as Cascade.

Luckily, that is one of the rules that can

be easily set by the SQL syntax.

So what's the big deal, then? If all the defaults are good,

and we can change the ones we want, how will RI

simplify your database project,

shorten development time,

and improve the quality of your life?

The answer comes when you consider how some

applications go about maintaining relational integrity

when RI is not turned on in the database.

Suppose the application wanted to insert an

article, and also wanted to make sure the author

already exists in the database. It is

not difficult to imagine a developer

coding up something like this --

There is nothing inherently wrong with this approach,

other than that

  1. it requires an additional database call, which

    is often much less efficient, and
  2. it takes extra time to do this programming

With RI turned on and Article Insert set to Restrict (the default),

the developer simply codes

It is not even always necessary to

test the return code and issue a friendly error message.

Assuming the foreign key constraint is named properly,

the database error message will basically say something

like "ERROR: integrity constraint ARTICLEAUTHOR violated - parent key not

found" and this is often enough to indicate what's going on.

Conclusions

When designing a database, after all the

entities, tables, columns and data types have been identified,

you will have a number of one-to-many relations.

Examine each relation

in light of the six possible actions,

and the five possible rules for each.

Start by considering all the default rules.

Pay special attention to the parent update

and parent delete rules, because those can

be changed easily, and often are.

Then prepare a rule table for each

relation, similar to the samples we described earlier

for the author-article relation:

ActionRule
Author InsertNone
Author UpdateCascade
Author DeleteRestrict
Article InsertRestrict
Article UpdateRestrict
Article DeleteNone

Highlight the rules that are different

from the default, as above.

Then pass these rule tables as specifications to your developer.

You'll save a lot of hassles, development time,

and unnecessary coding. And you might just see

an increase in the quality of your life.

Footnotes

I am indebted to ERwin, a popular data modelling tool,

for first introducing me to the concept

of RI actions outside the scope of what is

normally defined by SQL syntax. You would

not believe what you have to do, and what ERwin

can generate in the way of triggers and stored procedures,

to achieve non-default RI rules for actions

other than parent update and parent delete.

If your project requires these, give me a shout.

"Improving the quality of life" is a

requirement that should be part of every database

specification. I was taught this

in my very first database design course

for IDMS in 1977 by William Casey of Cullinane Corporation,

and it's as true today as it was then.