Abstract:
So you own a relational database and now want to do something with it. The
first step is to come up with a problem you need to solve, like keeping a
home inventory for example. But where do you go from there?
This session will give you an introduction to thinking about your problem
"relationally" -- how to best capture and express your data as not just any
database, but as a relational one. We'll cover both the theoretical and
practical aspects (and the trade-offs between them) so that you can take
better advantage of whatever relational database product you're using.
So you own a relational database product (like DB2, or Approach, or
Sundial's own DBExpert) and now want to do something with it.
Of course, the first step is to come up with a problem you need to solve...
but where do you go from there?
Typically, when you want to use a database to solve a problem, the first
step of the process to think about the "things" that you want to keep track
of. That's because, when you really get down to it, a database is nothing
more than a collection of details about a bunch of individual things...
For this presentation, we're going to concentrate on using a database to
keep track of a "home inventory" (such as you might want to have for
insurance purposes) but the principles involved are essentially the same no
matter what you want to use a database for.
So, what "things" do we want to keep track of?
In this particular example, most of the things are pretty concrete "real
world" objects like TVs, VCRs, computers, furniture, china, jewelry, tools,
etc. But "things" in a database aren't necessarily concrete -- they can be
more conceptual objects like insurance policies or real estate ownership
records or the like.
Regardless of the particular database, however, the key design idea is to
figure out what information to keep track of about these things and how
to organize that information so it can be used effectively.
But how do you go about doing that (especially if you haven't done it very
much before)? Well there really is no one best way... and certainly there
is no step-by-step recipe that works for every database in every situation.
And the formal methods that have been defined by several experts in the
field are just plain overkill when it comes to the needs of most database
users.
Thus, our goal in this presentation is to simply give you a series of
"guidelines" that will help you think about things "relationally" and that,
when applied with care and understanding, will give you quality relational
database designs.
So let's get started...
Guideline: "It's the types of things that are important, not the things
themselves"
Somewhat surprisingly (to database newcomers), the actual information
(the details about your particular VCR for instance) are some of the last
things you worry about when coming up with a database.
What's far more important is figuring out the "kinds" or "types" of things
you are going to keep information about -- a process sometimes called
"abstraction". What you need to do is figure out the structure of the
database -- that's the hard part. (Later, after you use that structure to
define your database using some particular database management product, you
then "populate" the database with the actual information -- that's the easy
part, though it can be quite tedious if you have lots and lots of data.)
For many database design problems, this first step of the organization
process is very much the same as you might go through if you just wanted to
keep track of the same information on paper -- it's simply coming up with
what "lists" of information you would want to have.
For our home inventory, if we were doing it on paper, we might just end up
with one long list of everything we have. (Or we might use several
different lists like "things in the garage", "things in the house", etc.,
but we'll return to that issue later.)
It's quite common, in fact, for you to only end up with one such "type" (or
perhaps a few) as a result of this first stage of the design process.
Guideline: "The world is flat"
Just like when you try to put a home inventory "on paper" and that paper
model of the world imposes certain constraints on what you can do,
different database systems have rules about how you can "structure" your
information.
When it comes to relational databases (which are, by far, the most common
form of database in use today), the most important structural constraint is
that the database consists of nothing but tables. There may be logical
(not physical) "connections" between tables (as we will see later) but it's
a flat, two dimensional world of nothing but rows and columns of
information.
Thus, there is no concept in a relational database of hierarchical
relationships... "trees" of information (such as the tree view of the
directories on your hard drive) can't be represented directly.
Further, there is no concept in a relational database of "containment"...
tables cannot contain other tables and all values in the table must be
"simple" (whatever that means... we'll return to that later).
So, basically, all a table is... is a list of information about a bunch
of similar things... where each row represents one such "thing" you want
to keep track of.
(As far as the "structure" of the database is concerned, you really don't
care very much about the individual rows of data. Each row represents, for
instance, one item in our home inventory -- a VCR, a TV, an antique dining
table, or whatever -- but those details are only important later after
you've created and database and start use it by filling in the data.)
Thus, a table typically represents a "type" of thing and each row
represents the specific information about one such particular thing.
And, thus, as part of your initial design process, you are going to end up
with a table for each of the "types" of thing you need to keep track of for
your particular application.
Guideline: "The world is unordered"
Unlike some other forms of databases, relational databases make no
guarantees about the order of entries in a table.
Tables are unlike many lists we use in the real world in that the entries
(rows) aren't in any particular order. In fact, the database system
reserves the right to change the order at any time -- even from one time
you look at a table to the next.
This may sound like potential chaos, but it isn't. Relational databases
do allow you to specify that you what to see (or use) the rows in a
particular order... but this only happens if you ask and it doesn't say
anything about the order in which the rows of the table are actually stored
away someplace on disk.
Thus, in our home inventory, we don't need to worry about whether we want
our item list to be in alphabetical order... or in the order the items
were acquired... or in some other order. In general, we can ask to see
the data in any order at any time whenever the need for order arises.
But it also means that we can't depend on having one particular row after
another (for instance)... or do things like continuing information from
one row into the next. Each row must be able to "stand alone" as an
independent entity.
Guideline: "The differences are in the details"
If each row of each table represents a "thing" you want to keep track of,
what do the columns of the table represent? Well, if all the rows in the
table represent different things, then something makes these different
things "different". And, it's the columns that record those detailed
differences.
For instance, things listed in our home inventory might have a description,
a price, a color, a date they were acquired.
These are called attributes (or characteristics or properties) and are
represented by the columns in the table; each row contains the values of
those attributes for a particular item.
In contrast to the rows of a table, the columns are part of the structure
of the database and are something you need to worry about "up front".
For one thing, relational databases typically require that all the values
in a column must have the same "data type". In other words, they must all
be numbers, or all be dates, or all be strings of text, or whatever. The
specific types you can choose from will vary from one database product to
another but virtually always include such basic types and perhaps many
others.
Further, as a consequence of the "flat world" requirement, the values in a
column must be "simple". They can't be lists, they can't be other tables
-- they must be simple singular values consistent with the data type of the
column. Later, we'll see what you need to do if you have a characteristic
that isn't "simple" and how you need to change the structure of your
database to accommodate that situation.
(Now don't confuse "simple" with "short." The data-type could be a text
string of significant size -- or perhaps virtually unlimited length -- or
it could be a bitmap or other binary data. The particular data-types that
are available depend on the database product being used but the key concept
is that the values have no "internal structure" as far as the database is
concerned -- they are "simple" values.)
Also, in keeping with the "unordered" nature of tables, you can't make any
assumptions about "which column comes first". While a database system will
usually always display the columns of a table in the same order left to
right, there is no guarantee of this and you can never reference a column
by its position -- you can only refer to the characteristic of an item by
that characteristic's name, not by a "column number".
Guideline: "Beware of the unknown"
Ideally, the things represented in a table all share the same
characteristics... but just because there is a "slot" in the table for
each property of each thing, that doesn't mean you need to keep track of
that piece of information.
Most relational databases support the idea of a NULL value. A null value
for a particular row and column means simply that the particular
characteristic "doesn't apply" (or is "unknown") for the particular item
represented by that row.
So, in a home inventory, for example, you might have a color column as
noted above. But some items, like an oil painting, might not have an
identifiable "color"... or, like a camera, might have a color but the
color isn't anything you care about... or, perhaps like that old trunk
stored in the attic, might have a color but you don't know (or don't
remember) what the color is.
The important thing to know about NULL values is that they really mean that
you don't know (or don't care) and that the database system doesn't know
either. If it's a numeric field, a NULL value is NOT the same as having a
zero (or some other number) in that field -- for instance, if you add NULL
and 3 together, you don't get 3, you get NULL.
Or, if you ask the database system "what do I have that is red", the result
will not include that trunk in the attic. And, importantly, if you ask it
"what do I have that is NOT red", the result will still not include that
trunk in the attic -- that's because the database cannot tell if the trunk
is red or is not red. (Most database systems also have a way for you to
ask something like "what do I have that does not a have a color" or "what
do I have with NULL for the color" -- and then that trunk in the attic,
with its NULL color value, would be included.)
The bottom line here is that NULL values are very useful but can cause
unexpected results if you don't plan to use them wisely.
Guideline: "It's a database, not an encyclopedia"
Once you start thinking about the characteristics of the things in your
database -- in other words, the columns for your tables -- it can be hard
to know when and how to stop the process of coming up with more details.
And where you stop can greatly affect the design of your database.
Another way of saying this is that you should only try to keep track of the
characteristics that are important and/or relevant to your particular
application.
In other words, how you keep track of all the items in your house for a
household inventory (for insurance or whatever) is probably different than
how you might keep track of all the items you have stored in your garage
(or a rented warehouse) for some other purpose (such as for you to sell on
ebay).
Thus, in your household inventory database, it's probably not relevant to
keep information about who in your family bought the item or what size box
it was delivered in or what the weight of the item is.
Sometimes, however, it's not entirely clear if you should include a piece
of information or not. For instance, should your home inventory include
which credit card was used to purchase the item? Superficially, this would
seem to be irrelevant. But if you have some credit cards that offer
automatic extended warranties or insurance against stolen items, then the
information is probably relevant to include.
Fortunately, you don't have to be 100% correct about such things "up front"
and you generally are not "locked in" by the decisions you make in your
initial structuring of the database.
Most relational databases let you add additional columns to a table...
so adding more characteristics later isn't a big thing. (Other than adding
all the data!) And just as a point of reference, it's usually the case
that when you do add a column, the value of that column for all your
existing rows is set to NULL (meaning "unknown" in this case).
But, as we will see later, some additions could change the way you need
to structure your database. So it's best to at least consider all the
possible characteristics you can think of "up front" to see if they will
impact your design (even if you don't choose to include them).
Guideline: "Group like things together..."
Another way to think of a table is that it represents a "group" of
"similar" things... with each row containing information about one of the
things in the group.
They are "similar" in the sense that they have the same characteristics (as
represented by the different columns).
As with most "groups" in the real world, the things in a table typically
share some common characteristics but not necessarily all. We've seen
that NULL values can be used to fill in the gaps when a characteristic
doesn't apply.
So, the question is, what makes things "similar" enough that they should be
in the same table... and the answer is that it's up to you.
But, for most purposes, it's better to keep things similar than it is to
make them different. So, when you have a choice, assume things are
similar... and enhance your table definition to accommodate this if you
need to.
So, for instance, when it comes to your home inventory, you might
originally come up with a list like:
- things in the house
- things in the garage
- things in the yard
You could use a different table for each of these, but you're better off
thinking of them as:
and adding a new characteristic called "location" and using that to keep
track of whether they are in the house, the garage, or the yard.
One other advantage of this is that it's much easier to expand your
database if you need to add a "new" category of things like:
- things in our vacation house
since you don't need to create a new table. Instead, you just start using
a new location value ("vacation house" in addition to "home", "garage",
"yard") in the appropriate rows of the table. Similarly, if you decide to
go in to more detail at some point in the future, you could update your
general "home" entries with more detailed location values like "den",
"living room", "guest bedroom", etc.
Guideline: "... and don't group unlike things together"
The analog of grouping like together is to keep unlike things apart.
There is, for instance, nothing to prevent you from putting all your data
in one table... even if some of the items share nothing in common. For
instance, if your database not only includes home inventory information,
but also ball game schedules, you could have a "home inventory and ball
game table" with a column that you use to distinguish between the two types
of rows... but that's not a good idea.
If the purposes for which you will use the data are radically different
or there are so few characteristics in common between the different uses
(so that you will end up with lots and lots of NULL values in the table
because they don't apply), then it's best to separate the "unlike" things
into two (or more) groups and thus two or more tables.
"Grouping" is the first and foremost reason why you want to divide your
database into multiple tables.
Another way to help check to see if you are following this guideline is to
look at how you might name a table. If you cannot come up with a
relatively simple (yet descriptive) name, then it may be that you are
trying (intentionally or unintentionally) to group unlike things together
in that table.
Guideline: "Be able to distinguish this one from that one"
If each row of a table represents a different item, then you generally need
to be able to decide which row describes the characteristics of which item.
Another way to put this is that something in each row should distinguish
that row from all other rows in the same table. And, similarly, one (or
more) characteristics in each row should be usable to "identify" that
particular row.
(Most, but not all, relational database systems don't let you have two
entirely identical rows in the same table.)
As we will see later, its useful (though not required) to be able to
identify a particular row by something "simple" (like a number) rather than
by a text string (like a name) or by some complex combination of values in
several columns.
If you are coming from a file-oriented background (or from the approaches
used in some non-relational databases), you might think of using the
"record number" or "row number" to do this.
That won't work in a relational database, however. Remember that there is
no guarantee that rows will even be kept in any particular order and thus
that there really is no concept of a "row number" (except perhaps on a
transitory basis as you are looking at something).
If your items don't have such a simple, singular characteristic to identify
them, it often pays to create an artificial characteristic of your items
just for this purpose -- an inventory number or asset number in our home
inventory database for instance. (We'll see an example of where this can
be useful later.)
Guideline: "Don't duplicate information"
Having information in one characteristic that directly depends on
information in another characteristic is a bad thing that can lead to all
sorts of problems. Thus, it's very important to do some "dependency
analysis" to make sure you aren't duplicating information.
For instance, one thing you might want to keep track of in your home
inventory is where you purchased the item... perhaps so you can know how
much "stuff" you are buying from particular stores.
If all you want to keep track of is the store name (for instance), then
this makes sense as a simple characteristic of the individual rows.
But what if you want to keep track of more information -- like the store's
customer service phone number and/or their address?
Each of these pieces of information is dependent on the other -- they are
always the same for any given store.
You could make all these individual characteristics and repeat this
information for each item that you bought from that store... but if you
buy a lot of things from that same store, that's a lot of information to
enter over and over. And if the store's phone number changes (for
instance), you've got a lot of places to update it and you might make a
mistake and forget some.
What makes more sense is to record this information once for each store
-- a process sometimes called "factoring" common information. This would
result in you having a list of stores in addition to your list of inventory
items -- i.e., two tables rather than one.
In your inventory table you would only identify the store... and then use
the other table to record the characteristics (name, address, phone number,
etc.) of each store you buy things at.
If you buy multiple things at the same store, then the "store identifier"
will appear in multiple rows of the inventory table... but there will be
only one row for that store (and it's characteristics) in the store
table. Thus the store characteristics will not be duplicated -- there
will be less data to enter, less chance for error, and less data to take up
space in your database.
This is also one of the reasons why being able to identify individual rows
in a table (as discussed earlier) can be quite important. How you
identify an individual store (in the store table) is how you reference
that store in the inventory table.
(The identifier will, of course, be duplicated -- that can't be avoided.
It is what ties the items in one table to the items in the other table.
This is, however, an argument for using a "simple" identifier -- such as a
number -- rather than something like the store name -- as the identifier.
Most database management systems have special tools and techniques for
helping you manage such identifiers.)
Guideline: "Don't store what you can compute"
Computers are very good at computing things... they do it much more
reliably than people do. In fact they do it much more reliably than people
even enter data into computers.
In terms of database design, that means you should let the computer compute
things whenever possible.
Thus, you may have things that can be computed based on other things in
your database, for example:
- the price of a quantity of items as the single price times the quantity
or
- the total value of a collection as the sum of all the individual values
If you store the original data and those computed results in your database,
sooner or later you're going to change one, forget to change the other, and
end up with inconsistent (and thus incorrect) data!
So it's generally best to store only the original data and then compute
the resulting value whenever you need it (to print on a report, to display
on the screen, or to use in some other computation).
(Essentially, this guideline is just another form of the one about not
storing duplicated information -- information you can "compute" is a
"computed duplicate" of other information you have stored in the database.)
Guideline: "Does this have more than one of that"
This is, absolutely, the 100% most important question you need to ask
yourself about each table you might need to create and is a direct
consequence of the "flat world" constraint of relational databases.
In essence, you need to analyze each characteristic of a table and decide
if you might have more than one value of that characteristic for a given
row.
For some things, it will be absolutely clear that you have only one such
value... for instance, the serial number of an item in the home inventory.
(Having no such value is OK as well -- you would have a NULL value for
the serial number for instance.)
For others, it will be very clear that you have more than one value...
while for still others it will be questionable at best. That last category
is the hardest to deal with -- and we'll see why later...
Our home inventory doesn't have a clear cut example of something where we
would always have multiple values of a characteristic (at least as far as
the properties most people would want to keep track of), but let's create
one (somewhat artificially) so we can see what the problem is.
Let's say that you want to keep track of the date when the item was
serviced or repaired... not just the most recent such date, but the dates
that the item has ever been serviced or repaired.
You can't do this with just one such characteristic (since you can't list
more than one value in a "slot") unless you have multiple rows in the table
for the same item -- each with a different date value.
In other words, say you have a VCR that's never been serviced. It's
represented by one row in the table. Now you get it serviced and fill in
the service date. But as is often the case with VCRs, something goes wrong
and you get it serviced again. Where do you put the date? So far, you
have no choice but to duplicate that VCR row in a new row and fill in the
new date. And, if you get it serviced again, then there's still another
row to add...
There are a whole bunch of reasons why you don't want to do this. The most
obvious is that it violates the "don't duplicate information" guideline --
things like the serial number, the store name, etc., will all be duplicated
in each one of the rows about that particular VCR.
It also violates the guiding idea that a row in the table corresponds to an
item in the (real) world that you want to keep track of -- now you have two
or more rows corresponding to that item. That also means that you have
probably also violated the guideline about uniquely being able to
identify each row -- certainly something like the serial number, or even an
artificial "asset number", which would have worked great before isn't going
to work now because you have that number in more than one row!
So, how do you solve the problem. (We'll ignore the obvious possibility of
just throwing out the VCR and getting a new one.)
It you know you may have a large number of possible values for a
characteristic, you really have only one reasonable way of solving the
problem. You need to create another table and "link" each row of "this"
table to zero or more rows of "that" table.
For our somewhat contrived home inventory example, this would mean creating
another table of "service dates". And while you might want to expand the
details later, the new table might initially have only two columns: one to
contain the actual date and another to contain the "identifier" (inventory
item number or whatever) of the corresponding item in the original home
inventory table.
We haven't taken the time in this presentation to describe much about how
you use multiple tables like this, but let's just say it's fairly easy
with most database products to "join" the tables together and thus find out
which service dates go with which inventory items (and have these listed on
the screen or printed in a report or whatever). The common "identifier"
fields (between the two tables) are used to match things up.
(Now if you are paying close attention, you are saying, hey, wait a minute,
the service record table does not have a nice, clean, simple value that
can be used to identify each record... Well, that's OK -- since the whole
purpose of this table is to represent multiple values from records in the
other table, the items in this table don't really need to be uniquely
identified. They are, however, still unique when you take the combination
of the inventory identifier and the service date together.)
As noted above, using multiple tables like this is the only practical
solution when you have (or might have) a large number of multiple values
for a particular attribute of a particular record. But, when the number of
possible multiple values is small, say two or three, you need to weigh the
trade-offs
Some designers would tell you that you should always still separate this
information out into another table. But, to be frank, having multiple
tables always increases the overhead in processing your database and also
can make it more complicated to do some things that are relatively easy
when only one table is involved.
So, for instance, if you knew you only wanted to record the last two
service dates for an item (rather than the entire service history), you may
well be better off just having two or three characteristics (say "most
recent service date" and "earlier service date" for instance) to record
that information in the original table.
When the issue of multiple values is questionable, then, likewise, you need
to consider the trade-offs. The key thing you need to worry about "up
front" in your design is whether you think you will ever need to separate
the information out into that other table.
If you have a requirement that you keep track of multiple values (like
service dates in this case) and you must keep track of them all, then
it's probably best to make the decision to use another table. The reason
for this is that, unlike just adding another column to a table, adding a
new table and connecting up all the corresponding rows is not an easy thing
to do once you have populated the database with information.
On the other hand, if you know you won't ever need more than a couple of
the values for a possibly multi-valued attribute, it's reasonable to at
least consider just having multiple attributes in the original table. This
will make your database less complicated... but you must realize that
you have traded off long term flexibility in the process.
And the complications associated with just that type of problem bring us to
our last, very important guideline...
Guideline: "Guidelines are meant to be broken"
All guidelines, no matter what their source, are just that -- they are
"guidelines" not rules.
It's 100% OK to "break" them... but they shouldn't be ignored. The key
thing is to know when you are breaking them and take the time to make a
conscious decision that that is what you want to do.
|