
You were never meant to do that with SQL

There seems to be a lot of hatred for SQL in the world at the moment: I can’t think of any other reason why the term NoSQL would catch on in the way that it has, when the key technological distinction is actually the lack of ACID guarantees (which are entirely orthogonal to whether or not SQL is used, as evidenced by non-ACID MySQL and HiveQL, which offers a pretty familiar SQL-like interface on an entirely non-traditional backend).

I wonder whether one of the unspoken reasons for this hatred is that at one point or another almost everyone has ended up doing this sort of thing:

   builder.Add("SELECT foo FROM bar WHERE id = ");
   builder.Add(id.ToString());

   if (additionalConstraint)
   {
      builder.Add(" AND frobbable = 1 ");
   }

   /* ... ad nauseam ... */
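For comparison, here is a rough sketch of the same pattern in Python using the standard-library sqlite3 module. The table layout and sample rows are invented so that the snippet can run, and `build_query` is a hypothetical helper rather than anything from a real codebase. Binding the value as a parameter removes the `ToString()` concatenation, but the conditional clause still has to be glued on as text, which is the part that tends to grow ad nauseam.

```python
import sqlite3

def build_query(item_id, additional_constraint):
    """Assemble the SQL text and its bound parameters separately."""
    sql = "SELECT foo FROM bar WHERE id = ?"
    params = [item_id]
    if additional_constraint:
        # The shape of the query is still built by string concatenation.
        sql += " AND frobbable = 1"
    return sql, params

# Hypothetical schema and data, invented purely so the sketch can run.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bar (id INTEGER, foo TEXT, frobbable INTEGER)")
conn.executemany("INSERT INTO bar VALUES (?, ?, ?)", [(1, "a", 1), (2, "b", 0)])

sql, params = build_query(2, additional_constraint=True)
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)  # row 2 is not frobbable, so nothing matches
```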

SQL is a hard language to like: it’s never been properly standardised (or rather, it has, but the standard has never been implemented) meaning that you spend too much time worrying about compatibility. Its theoretical underpinning is poor, leading to constructions that are hard for the engine to optimise (meaning more manual work).

However, SQL is a language in its own right, and was never intended to be generated programmatically by another programming language. This shouldn’t come as a surprise, as I struggle to think of any programming language that has been designed to work in this way.

Using SQL from a decent command-line environment is a powerful tool and often a pleasure to use. By comparison, generating SQL programmatically is an abomination that would be worthy of The Daily WTF were it not for the fact that nobody’s ever invented an API that offers the same flexibility.

Personally, I blame the vendors. Until RDBMSs can offer the same quality of optimisation that modern compilers can (that is, write in a high-level language and never even think about micro-optimisation) high-performance relational database access will remain a sea of vendor-specific optimiser hacks. Maybe there’s a theoretical reason why optimisers will never be this good, in which case perhaps we do need to abandon the relational model in practice. But let’s not pretend it has anything to do with SQL.

Relational Database Basics: What is a relation?

The biggest misunderstanding people tend to have about the relational model concerns the term “relation” itself. Since people tend to learn relational theory as an add-on to learning about SQL, they naturally learn that the things you put the data in are called “tables” and that tables are related to each other. The natural (but incorrect) assumption, then, is that “relational” refers to the relationships that exist between tables, and this couldn’t be more wrong.

The simplest explanation

Put simply, a “relation” is what SQL calls a table. If you learn nothing else about relational theory, at least understand this. This is an oversimplification of course, but it’s close enough to being true that if you don’t want to learn any theory, it will at least make the discussions of theorists more comprehensible.

The mathematical explanation

This isn’t the best way to understand what a relation is, but if you intend to have meaningful discussions with other practitioners, you will need to have a common understanding based on a definition that is unambiguous. I’ll therefore get the mathematical explanation out of the way here; if it doesn’t make much sense, return to it after the more intuitive description below. Though this is a relatively formal description, it doesn’t come close to being totally precise, and anyone who wants to know more is encouraged to investigate a book on the subject.

An attribute is a combination of a name and a type identifier, where we can for the moment treat a type as being a (possibly infinite) set of values with some operators defined on it. Think of an attribute as being like a column definition.

A tuple is a set of distinct attributes, where each attribute is associated with one value that is an instance of the type for that attribute. The members of a tuple do not possess an inherent order, and tuples ordered in different ways for display purposes nevertheless represent the same tuple.

A relation consists of a heading and a body. The heading is a (possibly empty) set of attributes with distinct names. The body is a (possibly empty) set of tuples, each of which has the same set of attributes as the heading of the relation.

To put this in terms familiar to an SQL user: an attribute is analogous to a column definition, a tuple is analogous to a row and a relation is analogous to a table. Note that this is an over-simplification, mostly because we think of the rows and columns of a table as possessing an inherent order, and mathematical relations have no such order.
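These definitions can be modelled quite directly with Python's built-in sets. The following is only a minimal sketch: it treats a "type" as a Python class, the helper names `make_tuple` and `conforms` are my own inventions, and the sample values are borrowed from the netball example later in this post.

```python
# An attribute is a (name, type) pair; a heading is a set of attributes.
heading = frozenset({("name", str), ("team", str), ("position", str)})

def make_tuple(**values):
    """A tuple is a set of (attribute name, value) pairs, so it has no order."""
    return frozenset(values.items())

def conforms(tup, heading):
    """True if the tuple has exactly the heading's attributes, each with a value of the right type."""
    types = dict(heading)
    values = dict(tup)
    return values.keys() == types.keys() and all(
        isinstance(values[name], types[name]) for name in types
    )

t1 = make_tuple(name="Alice", team="W1", position="GA")
t2 = make_tuple(position="GA", team="W1", name="Alice")  # same tuple, written in another order

# The body is a set of tuples, so the duplicate collapses and no row order exists.
body = frozenset({t1, t2, make_tuple(name="Charles", team="M1", position="WA")})
relation = (heading, body)

print(t1 == t2)                                  # True
print(len(body))                                 # 2
print(all(conforms(t, heading) for t in body))   # True
```

Because both the heading and the body are sets, there is simply no way to ask for "the first column" or "the third row" in this model.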

One other thing that bears stating at this point is that a relation is technically an immutable value, and is held in a mutable variable called a relvar. This is analogous to common programming practice where an integer like 5 is immutable, but held in a mutable integer variable. If you “insert a row into a table”, then actually you change the contents of that relvar from one relation to another. This distinction is rarely of relevance in discussing theoretical issues.
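The value/variable distinction can be sketched in Python with an immutable `frozenset` held in an ordinary variable. This is an illustration of the idea only; the rows are written as positional `(name, team, position)` triples purely for brevity, which is itself a departure from the unordered tuples described above.

```python
# A relation value: an immutable set of immutable rows.
players = frozenset({("Alice", "W1", "GA")})   # 'players' plays the role of the relvar

before = players                               # keep a handle on the old relation value

# "Inserting a row" builds a new relation and rebinds the variable to it;
# the old value is untouched, just as 5 is untouched by x = x + 1.
players = players | {("Charles", "M1", "WA")}

print(len(before), len(players))   # 1 2
print(before is players)           # False
```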

The intuitive explanation

Unless you’re already familiar with relational theory, that was probably all rather unclear, in which case the only vital things to take away are: columns are unordered, and rows are unordered. If you are familiar with relational theory, you’re probably angry at me for making so many mistakes, in which case please point them out in the comments.

So what does this mean in intuitive terms? A common intuitive feeling about tables is that they represent a list of entities, and indeed this understanding works nicely for simple cases. Take a table of salaried employees in FictoCorp:

Table showing list of employees in a fictional company

The head of HR for this company might look at the table and say, “yep, those are my employees—I’d recognise ’em anywhere.” As far as they’re concerned, each row in this table represents one of the employees they have to deal with. Furthermore, no row represents more than one employee, and there’s no employee of the company who doesn’t have a row.

It just so happens that FictoCorp (who have a lot of important customers in the netball industry) has a policy that all employees must play for one of the company’s netball teams. In order to keep track of this, the team captain keeps the following table in the company database:

A table showing the netball teams and positions of fictional players

We’ll simplify things by only displaying four employees, though obviously there would be more.

As an aside, netball has the nice property that the positions are named and unique; no player can be on the same team playing in the same position as another player. Therefore the combination of Netball Team and Position uniquely identifies a single employee. Obviously this constraint makes it impossible for FictoCorp to hire or fire people other than in unisex groups of 7 (in order that they can add or remove an entire netball team at once), but hey, it’s worth it for all the lucrative netball-industry contacts.
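The uniqueness rule amounts to saying that team and position together form a key. A small sketch of checking that property follows; `is_key` is a hypothetical helper, and the clashing row (a player called "Zed") is invented solely to show the check failing.

```python
def is_key(rows, key_attrs):
    """True if no two rows agree on every attribute named in key_attrs."""
    seen = set()
    for row in rows:
        key = tuple(row[attr] for attr in key_attrs)
        if key in seen:
            return False
        seen.add(key)
    return True

players = [
    {"name": "Alice", "team": "W1", "position": "GA"},
    {"name": "Charles", "team": "M1", "position": "WA"},
]

# Hypothetical rule-breaking row: a second Goal Attack on team W1.
clash = {"name": "Zed", "team": "W1", "position": "GA"}

print(is_key(players, ("team", "position")))            # True
print(is_key(players + [clash], ("team", "position")))  # False
```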

As far as the netball club captain is concerned, the entries in this table are the employees. Any employee will be in this table, and anyone in this table is an employee. So who is right, the HR manager or the netball club captain? Which table “holds” the employees? And if one table “is” the set of employees, what does that mean about the other table?

A digression

FictoCorp’s netball teams are so successful that the major league teams start to send talent scouts to their games. One day, the manager of a professional team rings up to enquire about hiring one of FictoCorp’s players.

“He was brilliant, we just have to have him … Any price, any price at all … His name? I don’t remember that, but he was definitely playing Wing Attack for your Men’s First team”

Luckily, this information is all that is needed to identify the player in question as Charles. The table of netball players works equally well as a way of finding a player from their netball team and position as vice versa.

From the point of view of an outsider to FictoCorp, the table is a list of teams and playing positions, with the useful effect that the player’s name can be looked up. The talent scout’s view and the club captain’s view of the meaning of the table are different, but both are using the same table.
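Both readings of the table can be expressed as queries over the same set of rows. This is a sketch under simplifying assumptions: rows are positional `(name, team, position)` triples, only the two players named in this post are included, and the function names are made up for the illustration.

```python
players = {
    ("Alice", "W1", "GA"),
    ("Charles", "M1", "WA"),
}

def who_plays(team, position):
    """The scout's question: which player holds this position on this team?"""
    return {name for (name, t, p) in players if t == team and p == position}

def where_plays(name):
    """The captain's question: which team and position does this player have?"""
    return {(t, p) for (n, t, p) in players if n == name}

print(who_plays("M1", "WA"))     # {'Charles'}
print(where_plays("Charles"))    # {('M1', 'WA')}
```

Neither function is privileged: the data has no built-in direction, so either set of attributes can serve as the thing you look up by.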

Resolving the ambiguity

The netball players table is neither a container of people, nor a container of playing positions. Both of these are extrinsic to the table: they will continue to exist if the table is deleted, though FictoCorp may no longer have the information it needs to get the necessary work done.

One way to think of the relation is in terms of the corresponding predicate: a function that takes a group of objects and produces a true or false value. An informal definition of the predicate for the netball players table might be:

There exists a player called X, who plays on team Y in position Z

If we substitute into this values from the table, we get true values from the function:

There exists a player called Alice, who plays on team W1 in position GA (true)

There exists a player called Charles, who plays on team M1 in position WA (true)

If we substitute in other values, we get false values from the function:

There exists a player called Charles, who plays on team W1 in position WA (false)

You can think of this as a function on a 3-dimensional space, where one dimension is the list of every person in the world, one dimension is every netball team FictoCorp has and the final dimension is every possible position in a netball team:

Diagram of a relation on a 3-dimensional space

The predicate is a function over this entire 3-dimensional space. The tuples (rows) in the relation represent points in this space for which the function evaluates to true. Tuples that could be in the table, but aren’t, represent points in this space for which the predicate evaluates to false.

Things to note:

  • The predicate evaluates to true or false on every point in this space; nowhere in the space is the predicate undefined
  • The predicate can’t be evaluated anywhere but points in this space; it would be meaningless to do so
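To make the picture concrete, the predicate and the space it ranges over can be sketched in a few lines of Python. The three domains below are deliberately tiny stand-ins (the real "people" dimension is everyone in the world, and a netball team has more than two positions), and the function name `plays` is my own.

```python
from itertools import product

# Tiny stand-in domains for the three dimensions.
people = {"Alice", "Charles"}
teams = {"W1", "M1"}
positions = {"GA", "WA"}

# The body of the relation: the points where the predicate holds.
body = {("Alice", "W1", "GA"), ("Charles", "M1", "WA")}

def plays(name, team, position):
    """There exists a player called `name`, who plays on `team` in `position`."""
    return (name, team, position) in body

# Every point in the space gets a definite True or False.
space = set(product(people, teams, positions))
true_points = {point for point in space if plays(*point)}

print(len(space))                      # 8
print(true_points == body)             # True
print(plays("Charles", "W1", "WA"))    # False
```

Note that the function is defined at every one of the eight points, and storing only the true ones is enough to recover the false ones.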

In a sense, the predicate gives the meaning of the table, and this meaning won’t change as we add and remove players from various teams. The tuples in the relation (the rows in the table) show us what is currently true in the real world. It is a goal of a well-maintained database that the facts implied by the table always remain a true representation of what is true in the real world, for drawing conclusions about the real world is the reason databases exist.

Objections to this model

One obvious objection to this model is that if people, salaries, netball team positions etc. are all extrinsic to the tables, how do we keep track of an entity that happens not to appear in any of the tables? If FictoCorp has a contractor called Edgar working for the company who isn’t in the employees table, and is excused from being in any of the netball teams, how do we keep track of this person?

The answer is that the database contains all the information we want to store, and nothing else. If the database system needs to be able to be used to answer questions about contractors, it will have a contractors table in which Edgar will appear. If for some reason FictoCorp doesn’t care to know what contractors it has relationships with, then Edgar will be a non-entity as far as the database is concerned.

Book Review: SQL and Relational Theory

Front cover of the book "SQL and Relational Theory"

The first thing to know about SQL and Relational Theory is that it’s largely a retread of Chris Date’s previous excellent book Database in Depth. The latter is a favourite of mine: extremely readable, yet with enough theoretical clout to change the way I looked at databases forever. The new volume carries over large chunks of the text from the older one, with some minor tweaks. As the name suggests, it brings in substantial additional material to link relational theory in with SQL, the only practical implementation of the model in current use.

The front cover of "Database in Depth"

In the preface, Date explains that the motivation for the new book was the realisation that practitioners weren’t able to figure out for themselves how to apply his theoretical ideas within SQL. Clearing up this difficulty is an admirable goal, and illustrates well that Date’s approach is practical and not meant as ivory-tower theory, but I can’t help but wonder if one of the reasons he didn’t state was that books sell better with ‘SQL’ in the title.

The additional material has resulted in a book that is roughly twice as long. This isn’t a problem in itself, though it does spoil one of the things I loved about “In Depth”: that it could be read in a couple of evenings’ work by a sufficiently motivated person. The importance of making a book light enough that you can sit and read it on the sofa without looking like a database nerd should not be understated.

The prose remains clear and readable, and strikes a nice balance that makes it approachable to relative beginners while avoiding ever sounding patronising. Date’s style is precise to a fault, and some people will find it needlessly pedantic; nevertheless, there isn’t any pointless pedantry here, and if you stick with it you’ll learn why subtle distinctions need to be made.

So how useful are the new insertions on SQL? I find it difficult to tell. On the one hand, it makes it much easier to relate the ideas in this book to discussions of theory that actually occur in the real world, since SQL is the lingua franca. In the old book, it was certainly annoying to have all the examples written in Tutorial D, without a real specification of how the language works. On the other hand, Date’s examples in this book are still in a mythical beast called “Standard SQL”, of which no practical implementation exists. What is good practice in standard SQL might be impossible in your chosen implementation, or there might be a better way to achieve the same thing.

It’s certainly worth buying one of the two books here, but the choice of which is not as obvious. If you already own “In Depth”, the updated version probably isn’t worth buying. If you don’t, then “SQL and Relational Theory” is the thing to buy, unless you’re after a lighter and more portable read.