Tag Archives: relational theory

Relational Database Basics: What is a relation?

The biggest misunderstanding people tend to have about the relational model concerns the term “relation” itself. Since people tend to learn relational theory as an add-on to learning SQL, they naturally learn that the things you put the data in are called “tables” and that tables are related to each other. The natural (but incorrect) assumption, then, is that “relational” refers to the relationships that exist between tables, and this couldn’t be more wrong.

The simplest explanation

Put simply, a “relation” is what SQL calls a table. If you learn nothing else about relational theory, at least understand this. This is an oversimplification of course, but it’s close enough to being true that if you don’t want to learn any theory, it will at least make the discussions of theorists more comprehensible.

The mathematical explanation

This isn’t the best way to understand what a relation is, but if you intend to have meaningful discussions with other practitioners, you will need to have a common understanding based on a definition that is unambiguous. I’ll therefore get the mathematical explanation out of the way here; if it doesn’t make much sense, return to it after the more intuitive description below. Though this is a relatively formal description, it doesn’t come close to being totally precise, and anyone who wants to know more is encouraged to investigate a book on the subject.

An attribute is a combination of a name and a type identifier, where we can for the moment treat a type as being a (possibly infinite) set of values with some operators defined on it. Think of an attribute as being like a column definition.

A tuple is a set of distinct attributes, where each attribute is associated with one value that is an instance of the type for that attribute. The members of a tuple do not possess an inherent order, and tuples ordered in different ways for display purposes nevertheless represent the same tuple.

A relation consists of a heading and a body. The heading is a (possibly empty) set of attributes with distinct names. The body is a (possibly empty) set of tuples, each of which has the same set of attributes as the heading of the relation.

To put this in terms familiar to an SQL user: an attribute is analogous to a column definition, a tuple is analogous to a row and a relation is analogous to a table. Note that this is an oversimplification, mostly because we think of the rows and columns of a table as possessing an inherent order, and mathematical relations have no such order.
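The definitions above can be sketched in a few lines of Python. This is only an illustration, not a faithful implementation of the relational model: the attribute names, the sample rows and the helper functions `make_tuple` and `conforms` are all my own inventions for the purpose of the example.

```python
# Heading: an unordered set of (attribute name, type) pairs.
# The attribute names are illustrative, not taken from any real schema.
heading = frozenset({("name", str), ("salary", int)})


def make_tuple(**values):
    """Build a tuple as an unordered set of (attribute name, value) pairs."""
    return frozenset(values.items())


def conforms(tup, heading):
    """True if the tuple has exactly the heading's attributes, each holding
    a value of the declared type."""
    types = dict(heading)
    values = dict(tup)
    return values.keys() == types.keys() and all(
        isinstance(values[name], types[name]) for name in types
    )


# Body: an unordered set of tuples. The rows are made-up sample data.
body = {
    make_tuple(name="Alice", salary=30000),
    make_tuple(salary=30000, name="Alice"),  # same tuple, written in another order
    make_tuple(name="Bob", salary=25000),
}

assert len(body) == 2  # the reordered duplicate collapses into one tuple
assert all(conforms(t, heading) for t in body)
```

Because both the tuple and the body are sets, attribute order and row order simply have nowhere to live, which is the point the mathematical definition is making.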

One other thing that bears stating at this point is that a relation is technically an immutable value, and is held in a mutable variable called a relvar. This is analogous to common programming practice where an integer like 5 is immutable, but held in a mutable integer variable. If you “insert a row into a table”, then actually you change the contents of that relvar from one relation to another. This distinction is rarely of relevance in discussing theoretical issues.
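A small sketch of the relation/relvar distinction, assuming we model a relation as a Python `frozenset` and a relvar as an ordinary variable. The variable names and the rows are made up for the example.

```python
# A relation value: an immutable set of tuples (sample data is made up).
r1 = frozenset({("Alice",), ("Bob",)})

# The relvar: an ordinary mutable variable that currently holds r1.
employees = r1

# "Inserting a row" builds a new relation value and assigns it to the relvar.
employees = employees | {("Charles",)}

assert r1 == frozenset({("Alice",), ("Bob",)})  # the old value is untouched
assert employees is not r1                      # the relvar now holds a different value
assert len(employees) == 3
```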

The intuitive explanation

Unless you’re already familiar with relational theory, that was probably all rather unclear, in which case the only vital things to take away are: columns are unordered, and rows are unordered. If you are familiar with relational theory, you’re probably angry at me for making so many mistakes, in which case please point them out in the comments.

So what does this mean in intuitive terms? A common intuitive feeling about tables is that they represent a list of entities, and indeed this understanding works nicely for simple cases. Take a table of salaried employees in FictoCorp:

Table showing list of employees in a fictional company

The head of HR for this company might look at the table and say, “yep, those are my employees—I’d recognise ’em anywhere.” As far as they’re concerned, each row in this table represents one of the employees they have to deal with. Furthermore, no row represents more than one employee, and there’s no employee of the company who doesn’t have a row.

It just so happens that FictoCorp (who have a lot of important customers in the netball industry) has a policy that all employees must play for one of the company’s netball teams. In order to keep track of this, the team captain keeps the following table in the company database:

A table showing the netball teams and positions of fictional players

We’ll simplify things by only displaying four employees, though obviously there would be more.

As an aside, netball has the nice property that the positions are named and unique; no player can be on the same team playing in the same position as another player. Therefore the combination of Netball Team and Position uniquely identifies a single employee. Obviously this constraint makes it impossible for FictoCorp to hire or fire people other than in unisex groups of 7 (in order that they can add or remove an entire netball team at once), but hey, it’s worth it for all the lucrative netball-industry contacts.
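The uniqueness rule can be checked mechanically. The sketch below assumes each row is a Python dict; the attribute names and the helper `is_key` are hypothetical, and only the rows for Alice and Charles use values that appear in this post.

```python
def is_key(rows, key_attrs):
    """True if no two rows share the same values for key_attrs."""
    seen = set()
    for row in rows:
        key = tuple(row[attr] for attr in key_attrs)
        if key in seen:
            return False
        seen.add(key)
    return True


players = [
    {"name": "Alice", "team": "W1", "position": "GA"},
    {"name": "Charles", "team": "M1", "position": "WA"},
]

assert is_key(players, ("team", "position"))

# A made-up second Wing Attack on M1 would break the constraint.
clash = players + [{"name": "Zed", "team": "M1", "position": "WA"}]
assert not is_key(clash, ("team", "position"))
```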

As far as the netball club captain is concerned, the entries in this table are the employees. Any employee will be in this table, and anyone in this table is an employee. So who is right, the HR manager or the netball club captain? Which table “holds” the employees? And if one table “is” the set of employees, what does that mean about the other table?

A digression

FictoCorp’s netball teams are so successful that the major league teams start to send talent scouts to their games. One day, the manager of a professional team rings up to enquire about hiring one of FictoCorp’s players.

“He was brilliant, we just have to have him … Any price, any price at all … His name? I don’t remember that, but he was definitely playing Wing Attack for your Men’s First team”

Luckily, this information is all that is needed to identify that the player in question is Charles. The table of netball players worked equally well as a way of finding a player from their netball team and position as vice versa.
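To make the symmetry concrete, here is a sketch that looks things up in both directions over the same set of tuples. The function names are my own, and the set holds only the two rows quoted in this post rather than the full table.

```python
# (name, team, position) tuples; only the two rows quoted in this post.
players = {
    ("Alice", "W1", "GA"),
    ("Charles", "M1", "WA"),
}


def position_of(name):
    """The captain's question: where does this person play?"""
    return {(team, pos) for (n, team, pos) in players if n == name}


def player_at(team, position):
    """The talent scout's question: who plays here?"""
    return {n for (n, t, p) in players if (t, p) == (team, position)}


assert player_at("M1", "WA") == {"Charles"}
assert position_of("Charles") == {("M1", "WA")}
```

Neither function is privileged: both are just filters over the same unordered set.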

From the point of view of an outsider to FictoCorp, the table is a list of teams and playing positions, with the useful effect that the player’s name can be looked up. The talent scout’s view and the club captain’s view of the meaning of the table are different, but both are using the same table.

Resolving the ambiguity

The netball players table is neither a container of people, nor a container of playing positions. Both of these are extrinsic to the table: they will continue to exist if the table is deleted, though FictoCorp may no longer have the information it needs to get the necessary work done.

One way to think of the relation is in terms of the corresponding predicate: a function that takes a group of objects and produces a true or false value. An informal definition of the predicate for the netball players table might be:

There exists a player called X, who plays on team Y in position Z

If we substitute values from the table into this, we get true values from the function:

There exists a player called Alice, who plays on team W1 in position GA (true)

There exists a player called Charles, who plays on team M1 in position WA (true)

If we substitute in other values, we get false values from the function:

There exists a player called Charles, who plays on team W1 in position WA (false)

You can think of this as a function on a 3-dimensional space, where one dimension is the list of every person in the world, one dimension is every netball team FictoCorp has and the final dimension is every possible position in a netball team:

Diagram of a relation on a 3-dimensional space

The predicate is a function over this entire 3-dimensional space. The tuples (rows) in the relation represent points in this space for which the function evaluates to true. Tuples that could be in the table, but aren’t, represent points in this space for which the predicate evaluates to false.

Things to note:

  • The predicate evaluates to true or false on every point in this space; nowhere in the space is the predicate undefined
  • The predicate can’t be evaluated anywhere but points in this space; it would be meaningless to do so

In a sense, the predicate gives the meaning of the table, and this meaning won’t change as we add and remove players from various teams. The tuples in the relation (the rows in the table) show us what is currently true in the real world. It is a goal of a well-maintained database that the facts implied by the table always remain a true representation of what is true in the real world, for drawing conclusions about the real world is the reason databases exist.
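The predicate view can be sketched as a function over the whole space. In the sketch below the set `people` is a tiny stand-in for "every person in the world", the team list holds only the two teams named in this post, and the function name `plays` is my own; the seven position codes are the standard netball ones.

```python
from itertools import product

people = {"Alice", "Charles", "Edgar"}  # stand-in for every person in the world
teams = {"W1", "M1"}                    # only the teams named in this post
positions = {"GS", "GA", "WA", "C", "WD", "GD", "GK"}

# The relation: the points at which the predicate is currently true.
relation = {("Alice", "W1", "GA"), ("Charles", "M1", "WA")}


def plays(name, team, position):
    """There exists a player called `name`, who plays on `team` in `position`."""
    if name not in people or team not in teams or position not in positions:
        raise ValueError("point lies outside the space; the predicate is meaningless here")
    return (name, team, position) in relation


space = set(product(people, teams, positions))

# Every point in the space gets a definite answer, and the true points
# are exactly the tuples of the relation.
assert {point for point in space if plays(*point)} == relation
assert plays("Charles", "W1", "WA") is False
```

Adding or removing a player changes `relation` but leaves `plays` itself alone, which is the sense in which the predicate carries the meaning of the table.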

Objections to this model

One obvious objection to this model is that if people, salaries, netball team positions etc. are all extrinsic to the tables, how do we keep track of an entity that happens not to appear in any of the tables? If FictoCorp has a contractor called Edgar working for the company who isn’t in the employees table, and is excused from being in any of the netball teams, how do we keep track of this person?

The answer is that the database contains all the information we want to store, and nothing else. If the database system needs to be able to be used to answer questions about contractors, it will have a contractors table in which Edgar will appear. If for some reason FictoCorp doesn’t care to know what contractors it has relationships with, then Edgar will be a non-entity as far as the database is concerned.

Book Review: SQL and Relational Theory

Front cover of the book "SQL and Relational Theory"

The first thing to know about SQL and Relational Theory is that it’s largely a retread of Chris Date’s previous excellent book Database in Depth. The latter is a favourite of mine: extremely readable, yet with enough theoretical clout to change the way I looked at databases forever. The new volume carries over large chunks of the text from the older one, with some minor tweaks. As the name suggests, it brings in substantial additional material to link relational theory in with SQL, the only practical implementation of the model in current use.

The front cover of "Database in Depth"

In the preface, Date explains that the motivation for the new book was the realisation that practitioners weren’t able to figure out for themselves how to apply his theoretical ideas within SQL. Clearing up this difficulty is an admirable goal, and illustrates well that Date’s approach is practical and not meant as ivory-tower theory, but I can’t help but wonder if one of the reasons he didn’t state was that books sell better with ‘SQL’ in the title.

The additional material has resulted in a book that is roughly twice as long. This isn’t a problem in itself, though it does spoil one of the things I loved about “In Depth”: that it could be read in a couple of evenings’ work by a sufficiently motivated person. The importance of making a book light enough that you can sit and read it on the sofa without looking like a database nerd should not be understated.

The prose remains clear and readable, and strikes a nice balance that makes it approachable to relative beginners while avoiding ever sounding patronising. Date’s style is precise to a fault, and some people will find it needlessly pedantic; nevertheless, there isn’t any pointless pedantry here, and if you stick with it you’ll learn why subtle distinctions need to be made.

So how useful are the new insertions on SQL? I find it difficult to tell. On the one hand, it makes it much easier to relate the ideas in this book to discussions of theory that actually occur in the real world, since SQL is the lingua franca. In the old book, it was certainly annoying to have all the examples written in Tutorial D, without a real specification of how the language works. On the other hand, Date’s examples in this book are still in a mythical beast called “Standard SQL”, of which no practical implementation exists. What is good practice in standard SQL might be impossible in your chosen implementation, or there might be a better way to achieve the same thing.

It’s certainly worth buying one of the two books here, but the choice of which is not as obvious. If you already own “In Depth”, the updated version probably isn’t worth buying. If you don’t, then “SQL and Relational Theory” is the thing to buy, unless you’re after a lighter and more portable read.

Relational Database Basics: What is Atomicity?

Atomicity is an important concept in databases; indeed, it’s a key part of the definition of first normal form. But it’s a surprisingly slippery concept, and our intuitive ideas don’t seem to serve us well enough.

Codd gave the definition that atomic data is data that “cannot be decomposed into smaller pieces by the DBMS (excluding certain special functions)”. Taken literally and not allowing for ad hoc exclusions, this definition would require that every field be a single boolean value: a string can be decomposed into characters and even an integer can be decomposed into prime factors, if we care to do so. Clearly we can choose a set of allowable operators that give a sensible definition of atomicity, but we risk begging the question.

The observation above leads fairly naturally to the idea that the concept of atomicity is a product of the operators we intend to use on the data. When you start to look at things this way, the intuitive grasp of which relations are in first normal form turns out to be more complicated than you might think. Take the following relation for example, which I’m going to assume everybody will agree is in first normal form:

Database atomicity uncontroversial example

Let’s assume that Alice, Bob and Charles all work on the market selling fruit and vegetables, and that in their part of town the only products that customers have any interest in are Apples, Bananas, Cherries and Durians.

Database atomicity controversial example

Many people would claim that this is not in first normal form, since the “products sold” field is non-atomic. However, there is a fairly simple isomorphism between the two cases.

For a start, we can map our unordered set of products sold into an ordered tuple of booleans (one per product) quite easily, because only a finite number of elements are allowed to be in the set: greengrocers in this part of town can sell only the four products.
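As a sketch of that mapping, assuming we fix the products in alphabetical order (the order is my choice, and the function names are hypothetical):

```python
# The fixed ordering of products is an assumption made for this example.
PRODUCTS = ("Apples", "Bananas", "Cherries", "Durians")


def to_flags(sold):
    """Map an unordered set of products to an ordered tuple of booleans."""
    return tuple(product in sold for product in PRODUCTS)


def from_flags(flags):
    """Inverse mapping: recover the set of products from the tuple."""
    return frozenset(p for p, flag in zip(PRODUCTS, flags) if flag)


sold = {"Cherries", "Apples"}
assert to_flags(sold) == (True, False, True, False)
assert from_flags(to_flags(sold)) == sold
```

Since `from_flags` undoes `to_flags` for every one of the 16 possible sets, no information is gained or lost in either direction.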

Database atomicity isomorphism

However, there’s also a trivial isomorphism between ordered tuples of booleans and integers in an appropriate range, given by the binary encoding of the integer. It so happens that if we assume karate ranks run from 10th Kyu to 6th Dan (essentially -10 to +6, with no zero) we can biject these with the numbers 0 to 15. If you turn the sets of products into tuples this way, and then turn them into numbers, then map these numbers to karate grades, you’ll find that the output data is exactly the same as the first relation, which is in first normal form.
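The whole chain can be sketched as follows. The product order, the choice of Apples as the most significant binary digit, and the ordering of ranks from 10th Kyu (0) up to 6th Dan (15) are all assumptions on my part, so the specific pairings may not match the tables shown above; the function names are hypothetical too.

```python
PRODUCTS = ("Apples", "Bananas", "Cherries", "Durians")  # assumed order

SUFFIX = {1: "st", 2: "nd", 3: "rd"}


def ordinal(n):
    """1 -> '1st', 2 -> '2nd', ... (only needs to work for 1 to 10 here)."""
    return str(n) + SUFFIX.get(n, "th")


# 16 ranks, lowest first: 10th Kyu ... 1st Kyu, then 1st Dan ... 6th Dan.
RANKS = [ordinal(k) + " Kyu" for k in range(10, 0, -1)] + [
    ordinal(d) + " Dan" for d in range(1, 7)
]


def encode(sold):
    """Set of products -> binary number -> karate rank."""
    n = 0
    for product in PRODUCTS:  # Apples is the most significant digit
        n = n * 2 + (product in sold)
    return RANKS[n]


def decode(rank):
    """Karate rank -> binary number -> set of products."""
    n = RANKS.index(rank)
    last = len(PRODUCTS) - 1
    return frozenset(p for i, p in enumerate(PRODUCTS) if (n >> (last - i)) & 1)


assert len(RANKS) == 16
assert encode(set()) == "10th Kyu"
assert encode(set(PRODUCTS)) == "6th Dan"
assert decode(encode({"Bananas", "Durians"})) == {"Bananas", "Durians"}
```

With this digit order, a greengrocer sells Durians exactly when the number behind their rank is odd.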

How to make sense of this? Normal forms eliminate (some) redundancy, but they don’t enforce good design. The second table may be in first normal form, but it isn’t good design. The reason that it isn’t good design has nothing to do with relational theory and everything to do with the way in which we intend to use the data. “Does Alice sell Durians?” is a reasonable question to ask, but “Is Alice’s karate rank isomorphic to an odd number?” is a directly equivalent but unreasonable question to ask. As a database designer, it is your job to anticipate as many valid questions as possible, without over-complicating the model to support invalid questions.