Some while ago, I've been reading through the book SQL and Relational Theory by C. J. Date. The author is well-known for criticising SQL's three-valued logic (3VL).1)
The author makes some strong points about why 3VL should be avoided in SQL, however he doesn't outline how a database model would look like if nullable columns weren't allowed. I've thought on this for a bit and have come up with the following solutions. If I missed other design options, I would like to hear about them!
1) Date's critique of SQL's 3VL has in turn been criticized too: see this paper by Claude Rubinson (includes the original critique by C. J. Date).
Example table:
As an example, take the following table where we have one nullable column (DateOfBirth
):
# +-------------------------------------------+
# | People |
# +------------+--------------+---------------+
# | PersonID | Name | DateOfBirth |
# +============+--------------+---------------+
# | 1 | Banana Man | NULL |
# +------------+--------------+---------------+
Option 1: Emulating NULL
through a flag and a default value:
Instead of making the column nullable, any default value is specified (e.g. 1900-01-01
). An additional BOOLEAN
column will specify whether the value in DateOfBirth
should simply be ignored or whether it actually contains data.
# +------------------------------------------------------------------+
# | People' |
# +------------+--------------+----------------------+---------------+
# | PersonID | Name | IsDateOfBirthKnown | DateOfBirth |
# +============+--------------+----------------------+---------------+
# | 1 | Banana Man | FALSE | 1900-01-01 |
# +------------+--------------+----------------------+---------------+
Option 2: Turning a nullable column into a separate table:
The nullable column is replaced by a new table (DatesOfBirth
). If a record doesn't have data for that column, there won't be a record in the new table:
# +---------------------------+ 1 0..1 +----------------------------+
# | People' | <-------> | DatesOfBirth |
# +------------+--------------+ +------------+---------------+
# | PersonID | Name | | PersonID | DateOfBirth |
# +============+--------------+ +============+---------------+
# | 1 | Banana Man |
# +------------+--------------+
While this seems like the better solution, this would possibly result in many tables that need to be joined for a single query. Since OUTER JOIN
s won't be allowed (because they would introduce NULL
into the result set), all the necessary data could possibly no longer be fetched with just a single query as before.
Question:
Are there any other options for eliminating NULL
(and if so, what are they)?
I saw Date's colleague Hugh Darwen discuss this issue in an excellent presentation "How To Handle Missing Information Without Using NULL", which is available on the Third Manifesto website.
His solution is a variant on your second approach. It's sixth normal form, with tables to hold both Date of Birth and identifiers where it is unknown:
# +-----------------------------+ 1 0..1 +----------------------------+
# | People' | <-------> | DatesOfBirth |
# +------------+----------------+ +------------+---------------+
# | PersonID | Name | | PersonID | DateOfBirth |
# +============+----------------+ +============+---------------+
# | 1 | Banana Man | ! 2 | 20-MAY-1991 |
# | 2 | Satsuma Girl | +------------+---------------+
# +------------+----------------+
# 1 0..1 +------------+
# <-------> | DobUnknown |
# +------------+
# | PersonID |
# +============+
# | 1 |
# +------------+
Selecting from People then requires joining all three tables, including boilerplate to indicate the unknown Dates Of Birth.
Of course, this is somewhat theoretical. The state of SQL these days is still not sufficiently advanced to handle all this. Hugh's presentation covers these shortcomings. One thing he mentions is not entirely correct: some flavours of SQL do support multiple assignment - for instance Oracle's INSERT ALL syntax.